PCA_FEATURES
Applies to PredictiveInsight only.
Syntax
PCA_FEATURES(num_features, data [, PCA(base_data)])
Parameters
num_features
The number of features to extract from the specified data range using principal component analysis (PCA). This value must be a positive integer between one and the number of columns in the data range specified by data.
data
The numerical values to extract features from. This can be a column, a cell range, or an expression evaluating to either of the above. For the format definition of data, see the "Macro Function Parameters" section in the chapter in this guide for your IBM® product.
PCA(base_data)
If this optional parameter is provided, PCA is performed on this base_data data range and the resulting eigenvectors are used to extract features from the data data range. For the format definition of base_data (same as data), see the "Macro Function Parameters" section in the chapter in this guide for your IBM® product. The number of columns in base_data must be the same as the number of columns in data.
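The constraints on num_features and base_data described above can be illustrated with a small Python sketch. This is not part of the product; check_pca_features_args is a hypothetical helper, and it assumes each data range is represented as a list of rows of equal length.

```python
def check_pca_features_args(num_features, data, base_data=None):
    """Raise ValueError if the arguments violate the documented constraints.

    Hypothetical helper: data and base_data are assumed to be lists of rows.
    """
    n_cols = len(data[0])
    # num_features must be a positive integer no larger than the column count.
    if isinstance(num_features, bool) or not isinstance(num_features, int):
        raise ValueError("num_features must be an integer")
    if not 1 <= num_features <= n_cols:
        raise ValueError(f"num_features must be between 1 and {n_cols}")
    # base_data, when given, must have as many columns as data.
    if base_data is not None and len(base_data[0]) != n_cols:
        raise ValueError("base_data and data must have the same number of columns")
    return True


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ok = check_pca_features_args(2, data, base_data=[[0.0, 0.0, 0.0]])
```

A call such as check_pca_features_args(4, data) would raise ValueError here, because data has only three columns.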
Description
PCA_FEATURES extracts the top num_features features from the specified data range and returns them as num_features columns. If base_data is provided, the features are computed using the eigenvectors generated by principal component analysis on base_data. If base_data is not provided, the eigenvectors are generated from data itself; in this case, data is automatically normalized using the zero-mean unit-variance method before the principal component analysis.
The features are computed as follows:
*
If base_data is not provided, the data range data is automatically normalized using the zero-mean unit-variance method. In other words,
PCA_FEATURES(num_features, data)
is equivalent to
PCA_FEATURES(num_features, data, PCA(data, COL))
If base_data is provided, no normalization of data is performed automatically. To normalize data using NORM_ZSCORE, you can specify the following:
PCA_FEATURES(num_features, data, PCA(NORM_ZSCORE(data, COL)))
*
Principal component analysis is performed on the normalized data range to generate its eigenvectors (see details described for the PCA macro function). This occurs automatically for data if base_data is not provided. It is performed by the explicit call to the PCA macro function if base_data is provided.
*
Each row (x) of the data range (data) is transformed into a new coordinate system (y) based on the top num_features (m) ranked eigenvectors, which compose the columns of the matrix W:
y = x W
*
The k rows of the transformed data (y_1 to y_k) are returned (m columns).
If the base_data data range is provided, it must have the same number of columns as the data data range; otherwise, an error is returned.
*
Because calculating PCA on a data range can be computationally intensive, wrapping the PCA calculation in the BUFFER macro function is much more efficient. For example:
PCA_FEATURES(num_features, data, BUFFER(PCA(base_data)))
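The steps above can be sketched in Python to make the computation concrete. This is an illustrative sketch under stated assumptions, not the product's implementation: it assumes that PCA uses the covariance matrix of the (mean-centered) base data, that normalization uses the population variance, and that data is projected as-is when eigenvectors are supplied. All function names are hypothetical, and the Jacobi rotation method is simply one convenient way to obtain eigenvectors of a symmetric matrix.

```python
import math


def zscore_columns(rows):
    """Scale each column to zero mean and unit variance (population variance)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    # A constant column has zero variance; map it to zeros instead of dividing by zero.
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of the columns of rows (mean-centered internally)."""
    k, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / k for j in range(n)]
    c = [[v - m for v, m in zip(r, means)] for r in rows]
    return [[sum(r[i] * r[j] for r in c) / k for j in range(n)] for i in range(n)]


def jacobi_eigen(a, sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        # Stop once the off-diagonal part is numerically zero.
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for i in range(n):  # apply the rotation to columns p and q
                    a[i][p], a[i][q] = c * a[i][p] - s * a[i][q], s * a[i][p] + c * a[i][q]
                    v[i][p], v[i][q] = c * v[i][p] - s * v[i][q], s * v[i][p] + c * v[i][q]
                for j in range(n):  # apply the rotation to rows p and q
                    a[p][j], a[q][j] = c * a[p][j] - s * a[q][j], s * a[p][j] + c * a[q][j]
    return [a[i][i] for i in range(n)], v


def pca_eigenvectors(base_rows):
    """Eigenvectors of the covariance of base_rows, ranked by descending eigenvalue."""
    vals, v = jacobi_eigen(covariance(base_rows))
    order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
    return [[v[r][i] for r in range(len(vals))] for i in order]


def pca_features(num_features, rows, eigvecs=None):
    """Project each row onto the top num_features eigenvectors.

    Without eigvecs, rows are z-scored first and supply their own eigenvectors.
    With eigvecs, rows are projected without any normalization (an assumption).
    """
    if eigvecs is None:
        rows = zscore_columns(rows)
        eigvecs = pca_eigenvectors(rows)
    top = eigvecs[:num_features]
    return [[sum(x * w for x, w in zip(row, vec)) for vec in top] for row in rows]


# Two perfectly correlated columns: one feature carries all of the variance.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
features = pca_features(1, data)
both = pca_features(2, data)
```

In this sketch, features has one column per requested feature and one row per input row. Passing eigvecs=pca_eigenvectors(zscore_columns(base)) corresponds to supplying PCA(NORM_ZSCORE(base_data, COL)) as the third argument.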
Examples
TEMP = PCA_FEATURES(5, V1:V7)
Creates five new columns named TEMP, VW, VX, VY, and VZ, containing the top five features of the data range V1:V7. The data range V1:V7 is used as the basis for the transformation.
TEMP = PCA_FEATURES(3, V1:V4, PCA(V10:V13))
Creates three new columns named TEMP, VX, and VY, containing the top three features of the data range V1:V4. The data range V10:V13 is used as the basis for the transformation.
TEMP = PCA_FEATURES(3, V1:V4, BUFFER(PCA(V10:V13)))
Creates three new columns named TEMP, VX, and VY, containing the top three features of the data range V1:V4. The data range V10:V13 is used as the basis for the transformation. Once the principal components of the data range V10:V13 are calculated, those values are stored as constants. If the data values in columns V10 through V13 change later, they do not affect this function definition.
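The "stored as constants" behavior in the last example can be pictured as taking a snapshot of a computed result. The following Python sketch is only an analogy, not a description of how BUFFER is implemented: the Buffer class is hypothetical, and column_means is a cheap stand-in for an expensive calculation such as PCA.

```python
import copy


class Buffer:
    """Evaluate compute(source) once and keep the result as a constant snapshot."""

    def __init__(self, compute, source):
        # The result is computed immediately and copied, so later changes
        # to source cannot reach it.
        self._value = copy.deepcopy(compute(source))

    def value(self):
        return self._value


def column_means(rows):
    """Stand-in for an expensive computation over a data range."""
    return [sum(c) / len(c) for c in zip(*rows)]


base = [[1.0, 10.0], [3.0, 30.0]]
buffered = Buffer(column_means, base)

base[0][0] = 101.0  # the base data changes after the snapshot was taken

snapshot = buffered.value()    # still reflects the original base data
recomputed = column_means(base)  # reflects the changed base data
```

The snapshot keeps the values computed from the original base data, while a fresh computation picks up the change. This mirrors the trade-off in the example: the buffered form avoids recomputation but does not track later changes to the base columns.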
Related Functions