Questions from Our Users – Pirouette Algorithms
Scaling of scores and loadings
Q. Is the scaling on all PC factors (1, 2, 3, 4) (scores and loadings) the same?
A. The loadings are on a scale bounded by 1 and -1. In fact, they are orthonormal: orthogonal (multivariate perpendicular, if you will) and normalized. The normalization is such that if you square each loading value, then sum the squares down each factor, the column sum will = one.
Recall that the scores are actually the product of the two matrices. In fact, go back to the decomposition of the data matrix X, as:
X = U*S*V^T
where the * are matrix multiplies and ^T implies transpose. The Vs are the loadings and the product U*S are the scores. The matrix U is sometimes called the left eigenvectors; important is that they have similar properties to the right eigenvectors, the Vs (i.e., the loadings). That is the Us are also orthonormal. Thus, the scores are really the left eigenvectors scaled by the S values. The S matrix is a diagonal matrix, that is, all values off of the diagonal are zero. And the diagonal values, individually called the singular values, are the square root of the eigenvalues. The eigenvalue for a factor is the variance represented by that factor. So, again, the scores are the left eigenvectors (orthonormal) scaled, per factor, by the square root of the corresponding variance magnitude.
Since the decomposition of X produces factors with continually decreasing amount of variance, the magnitude of information in the scores is correspondingly decreasing. Thus, doing the same math as for the loadings, compute the sum of squares for each column of the scores; these will be equal to the corresponding eigenvalues. The amount of information in each score correlates exactly to the variance (eigenvalue). Now, it MAY happen that the RANGE of values in score 1 is larger than that in score 2. However, the magnitude of score 1, as a vector, will always be larger than that for score 2.
Note that the units on the scores are only indirectly related to data units. You can visualize it this way: if you take the sum of squares of all data values in your data matrix, this will equal the sum of squares in all values in the scores matrix, when you have calculated all possible factors. So, since the scores are ordered in decreasing amount of variance, the magnitude of values in score 1 are larger than the average values of the original data values. If the original data values are very different in ranges among the different variables, however, it could happen that their values are larger than the score 1 values.
The magnitudes of the scores and loadings values will only be similar by coincidence, i.e., if data values are between 0 and 1. You can see this happen if you normalize your data before running PCA (or, Divide by 2-norm). Thus, it is unlikely to find scores and loadings on the same scale.
However, interpretations of differences along the 2nd and 3rd factors (and, occasionally, higher) are legitimate. Only when the variance for a factor drops down to a level you consider to be noise would you no longer consider the information important.
PC1 vs. PC2
Q. We were reviewing PAH concentration data where PC1 typically ran in the 80% range for explanatory power while PC2 typically ran in the 10% range. The originator of the plot then went on to place great significance in the capability of PC2 to differentiate PAH sources. My comment was that the visual separation of the samples along the PC2-axis may not be as significant as the PCA originator may have hoped; i.e., to get a better sense of the relative magnitude of the separation along both plots, one might stretch the PC1 axis out 8 times the length of PC2 and better visualize the compositional differences among samples. In this way, the distance on the PC1 and PC2 axes would be more directly comparable when scaled for the explanatory power of the analysis. Is my comment correct or does the math behind the PCA equations offer other insight into evaluating the separation on the PC1 v PC2 axes?
A. You are both right. Yes, PC1, thus the highly loaded associated variables, does explain more about apparent trends in the data. You can easily visualize this in Pirouette when looking at the Scores plot: choose Display > Plot Scaling > Data Scaling which does exactly as you described when stretching PC1 by 8 times.
On the other hand, if there is separation and it is clear that PC2 is responsible for this separation, do we care that PC1 has explained more variance? Remember that we are really doing two things at once. First, PCA helps isolate information in the data set in as few factors as possible, with most information in PC1, second most in PC2, etc. Second, by isolating information in a small number of factors, we will be more likely to observe trends, clusters, etc., that are naturally in the data than if we were to look at the original data, one variable at a time.
It may be that, although PC1 describes the majority of the variation, PC1 has nothing to do with category separation, rather it may be explaining baseline differences, scatter, etc. It may be that PC2, or PC3, are key to category separation. Each data set is different and interpretation must look at several objects at once.
Normalization to a different region
Q. I would like to normalize spectra where the area used for normalization is different from the analytical region used. Is there a function/method that I can use in Pirouette for this function? When reading the manual I found area normalization within the same analytical region and vector normalization, but not area normalization when using a different analytical region. Did I miss something?
A. I think we have what you want. Note that the transform called Divide by > Sample 2-norm is the same as the original transform called Normalize (both are vector norms; they are two separate entries for historical reasons). The Divide by options offer an additional parameter of a Mask row #. As you may be aware, we use a Mask as a vector of do/don’t toggles–values in a mask row must be only ones and zeros. If a Mask value is a 1, then that variable will be used in computing the divide-by factor. Otherwise, it is ignored.
Thus, make an excluded row containing ones and zeros and indicate that row # in the transform option.
- Highlight row 1 and do Edit > Insert. Note that the row is inserted excluded (but, you will lose results; sorry).
- Switch to a line plot view.
- With the Range tool select the region you want considered in your normalization calculation. If you want more than one region, be sure to hold down the Ctrl key while selecting the extra regions.
- Switch back to the table view. Note that the row is still highlighted but now columns are also highlighted.
- Choose Edit > Fill > As Mask (this Fill option requires that both rows and columns are highlighted simultaneously). This automatically fills the row with ones where the row and column highlights intersect and with zeros where there are no column highlights.
- Transform with the Divide by > Sample 2-norm and indicate Row #1 as the Mask.
Normalize to a specific peak
Q. I’d like to normalize the data. They are in different concentrations. The peak at 23.476 min. may be a indicator of the concentration. So we want to get that peak to be same on all samples. How could we do that?
A. In order to do a normalization to a given peak, you need to use the Divide by, Sample max option. However, if the peak you want to normalize on is not the tallest peak in the chromatogram, you need to tell Pirouette to look for the largest value in a restricted region. We do this by creating a mask (dummy sample) of ones and zeros to indicate the region in which the normalization factor will be determined. See above for instructions on creating a mask.
After creating the mask, do Process > Run, and from the algorithm list, select only XFORM. In the Transforms, select Divide By, Sample Max, with Use as Mask checked, and indicate for Row # the row you inserted for the mask. View the results to verify that the normalization has done what you expected. You can then use this Transform in your algorithm.
PCA and normalisation
Q. We use Pirouette for processing of spectroscopy data, mainly for ATR, FTIR and photothermal microspectroscopy.
1. After talking to a stats professor here last week, I came to realise that the PCA analysis uses its own normalisation (or that the normalisation is “built into” the PCA algorithm). In our case, we all use mean-centre preprocessing in PCA. It seems from your online help-file that this might be a way of normalising/scaling all the spectra prior to comparison. Is this correct?
2. We also use the XFORM function to normalise the spectra by the “divide by” option to a prominent peak, either amide 1 or amide 2 (at 1660ish or 1550ish wavenumbers). The data loaded into pirouette have only been corrected for atmospheric background, but have not been baseline corrected or normalised before doing PCA. We have been under the impression that this (i.e. baseline correction and normalisation) can be done by Pirouette PCA, but now I am not so sure that the PCA function does what we think it has been doing thus far. Would it be correct to think that the PCA and XFORM operations are carried out independently of each other?
3. My main query is therefore this: presuming that the normalisation (using “divide by”) in XFORM is independent from the preprocessing/scaling of spectra in PCA, would the value chosen for normalisation in XFORM at all affect the PCA results?
A. Your second question is perhaps the most relevant. Yes, spectral transformation and preprocessing are independent. The transforms are applied to the spectra and, for all but one algorithm, each spectrum is handled independently of all other spectra. After the spectra are transformed (and more than one transform can be applied, sequentially), the data are preprocessed and then are finally passed to the algorithm (e.g., PCA). The PCA algorithm does not itself apply any normalization except during the matrix decomposition step
The differences between transforms and preprocessing, then, are 3:
– order of computation: transforms first, preprocessing second
– direction of application: transforms are applied to rows (spectra) while preprocessing applies to columns (variables)
– data dependence: the result of a transform will be the same for a spectrum whether it is alone or in a group of spectra, but the result of preprocessing is dependent on which spectra are included in the set to be analyzed.
A word about normalization. In Pirouette, and in much of the literature, normalization is the group of techniques that scale the spectra, usually to remove a source of unrelated variation. For example, in chromatography it is common to apply an area % normalization (we call this the 1-norm) which tries to minimize the effect of variation of sample size that actually hits the column. However, other uses of the term normalization, often in the statistics literature, refer to the steps of centering and scaling that most in the chemometrics field call autoscaling. In this terminology, it is the variables that are being ‘normalized’ to reduce the effects of varying intensity among variables, to put them all on a ‘level playing field’ as it were.
Finally, to answer your last question, yes, it can make a significant difference which transform(s) is(are) applied. Thus, we allow you to view the results of the transform (XFORM) as a separate computed object so you can view the transformed spectra before deciding to use said transform in your final processing. We recommend doing so, in fact, to be sure that you have not inadvertently contorted the data in some way. For spectroscopic data, the transforms are unlikely to cause any unusual behavior, though, so the approach you have described (transform=divide by a prominent peak; preprocessing=mean center) is probably reasonable.
Misclassification object not computed
Q. Is there anything that would cause a SIMCA prediction to not compute the misclassification object? We have a 2 class model that computes it for one test set and not for another. For both test sets, the class variable assigned is not used in the model. The other prediction objects are computed normally. We had one real outlier in the test set in question, and removing it did not change the result.
A. The SIMCA Misclassification Object is computed during prediction ONLY if there is a class variable in the prediction set of the same name (case sensitive) as the class variable that was active during model building.
This also holds for regression algorithms (PLS, PCR, CLS): if you want to get an error report (the Error Analysis object), the Y variable name(s) must be the same as those used when building the model.
Mean Centering and PLS Models
Q. Without mean centering, and using transforms such as second derivative, multiplicative scattering correction (MSC), and standard normal variate (SNV), PLS models account for 99.9% or more of the variation with only 3 factors. If we use mean centering then the calibration model requires 2 – 3 more factors, and the variation accounted for is less.
A. Often, when no preprocessing is applied, the information represented by the first factor will be dominated by the mean, with percent explained close to 100%. It would appear, then, that a small number of factors would be suitable.
However, if mean centering is applied, then the percent variance explained in the first few factors is not nearly so great, and more factors appear to be necessary for the regression.
If you place a table view of the two scenarios side-by-side, so that you can see the actual variance values, it is likely that the first factor of the mean-centered case is of a magnitude similar to that of the second factor without mean centering.
This can be even more striking if you do a leave-one-out cross validation, then compare a tabulation of the PRESS values for the cases of no vs. mean centering.
The goal of regression is to find those factors that best describe the correlation of the spectral data to the Y variable. If mean centering is not done, the first factor can greatly mask the information in later factors even though they do contain information as well.
One way to observe the result of the regression is to look at a line plot of each loading. Without mean centering, it is likely the first loading exhibits a shape very similar to the mean spectrum. Other loadings will have shapes in which the peaks correspond to important regions of the spectrum for the correlation.
Problem with a SIMCA classification
Q. We are using Pirouette to analyze near-infrared spectra collected with a Control Development near-infrared diode array spectrometer operating in the 900 – 1700 nm spectral range. To test the SIMCA analysis method we put together a set of 3 samples of polymer sheet material of different polymer types (about 1 mm thick) and measured both a set of calibration absorbance spectra (in Grams .spc format) and a set of validation spectra. We created a SIMCA calibration model from the calibration spectra (6 spectra for each sample taken at different positions) and then applied it to the validation spectra (same samples but different position) to try to identify the polymer type. We used 2nd derivative pre-processing. The Pirouette SIMCA analysis was not able to correctly identify all of the 3 polymer types. The signal/noise in the spectra appears to be good and one can clearly see by eye distinct differences between the spectral absorption patterns for the 3 polymers.
Could you help us figure out if there is anything that could be corrected or optimized in this calibration model so that it will be able to properly identify the polymer type for the validation spectra?
A. You are not doing anything explicitly wrong, but a small problem in your approach has caused another small problem that resulted in your observation.
Your calibration set has only 6 samples per category. This is really pushing the limits, particularly when there is notable variation among the samples in one category. When you developed your SIMCA model, you set the PCA model for the first category to 5 factors. Since you had selected to mean-center the data, which reduces the number of degrees of freedom by one (occurs when data are sample-limited), you thus set a model to use all remaining degrees of freedom. When calculation of the decision threshold is made (look at the Class Distances object, subplot 1, then zoom in around the three samples on the left), with all degrees of freedom used, the threshold line is essentially at 0 because with all factors used, there are no residuals. This explains why samples of this category were not classified.
If you change the #factors for class one to 4, then predict, you will see that all 3 of the samples now correctly classify (and, with 4 factors, over 97% of variance is explained). Moral: the training set should have a sufficient number of samples in each category to adequately capture the variability; and, don’t set the #factors to the maximum possible.
When I predict, my datafile icon is greyed out
Q. I’m working with my old PCA model. I saved it under a useful name. This was based on a divide-by row (line 51) to normalize NIR and eNOSE data. You may recognize that the divide-by routine is so noted when I subsequently open the model and view its contents in a new sheet where I’ve opened my prediction set.
I’m now trying to predict an new set of 19 samples, which too includes a divide-by row in row #20 which I excluded. When I then try to predict using the model, my datafile icon is greyed out? Why is that? Do I need to do something since my calibration set using row #51 and now my prediction set uses row #20? Or, would the (calibration) model KNOW what row to divide through by. Thanks.
A. If the exclusion set name is grayed out when you do Process > Predict, that indicates that the number of variables in your prediction set is different than the number of variables in the model. Note that I said variables. The fact that you did a divide by row when making the model but also have a row excluded in your test set will not influence the availability of the exclusion set.
So, check the number of variables in the model:
- Do Process > Predict, then click once on the model name so as to show its biography
- Look at X variable count > Total
Then check the number of variables in your test set:
- Cancel the Predict setup dialog
- In the Object Manager, do a right click on Full Data (or, the data subset you are predicting on)
- Inspect # Variables … # Independent
This number should be the same as the X variable count above. Also, be sure there are NO column exclusions in the subset which you intend to predict–Pirouette already knows which variables to exclude when it does the prediction.
Pirouette won’t let me select the model name for prediction
Q. I made a PLS model with my training data, but when I try to predict on a different data set, I can’t select the model. What is wrong?
A. The problem you encountered is one of two related problems that are caused by a mismatch in the number of variables in the model and in the data. [see above for more information]
When Pirouette makes a model, it stores, among many other things, the number of variables that were in the data set that you used to make the model. Since the number of variables must be the same in order to do a prediction, when you load a new data set, then open the model, Pirouette compares the numbers of variables in each. If they do not match, then you are prevented from selecting the model name.
When you open the model, then do Process > Predict, if you click on the model name, even if disabled, you will see in the model information that is shown an item called X variable count. Look at the value for Total. This is the number of variables in the model. You should then check to be sure that the number of X variables in your prediction data is the same number. Since you created your prediction set, as you described to me, by merging data, it is possible that you ended up with a different number of variables that were in the model. Check also that, if you merged in Y data, that you need to indicate that these are Y variables (Edit > Column Type).
Signs of PCA scores are inverted in 2 nearly identical sets
Q. Would you or one of your colleagues be willing to provide us with some assistance to help us explain an unexpected result when using Pirouette 3.11 for PCA and understand its cause?
The goal of this study was to compare 2 methods.
The experiment consists of 11 samples each analyzed for 255 variables using 2 methods (the first method and the second method).
The absolute intensities of the data collected for each sample differ between methods, however the relative intensities do not (i.e. “scaling differences” are observed between methods)
These “scaling differences” between the 2 methods may vary from sample to sample
A sample-by sample visual inspection of the data from 2 methods revealed nearly no differences between the two methods once scaled (i.e. once scaled there is virtually no difference between the results from two methods)
A PCA for 9 factors was performed on the data collected with the first method in Pirouette 3.11 using mean-center preprocessing and a divide by (Sample 1-norm) transform
A PCA for 9 factors was performed on the data collected with the second method in Pirouette 3.11 using mean-center preprocessing and a divide by (Sample 1-norm) transform
The PC1/PC2/PC3 scores plot from the first method PCA was nearly the “mirror image” of the PC1/PC2/PC3 scores plot from the second method PCA
The PC1/PC2/PC3 loadings plot from the first method PCA was nearly the “mirror image” of the PC1/PC2/PC3 loadings plot from the second method PCA
Comparison of the PC1/PC2/PC3 scores from the first method PCA to those of the second method PCA revealed:
For all 11 samples, the signs of the Factor 1 scores in the first method PCA were inverted when compared to those of the second method PCA, although the magnitudes were nearly identical.
For all 11 samples, Factor 2 scores in the first method PCA were nearly identical to those of the second method PCA (identical in sign, nearly identical in magnitude).
For all 11 samples, the signs of the Factor 3 scores in the first method PCA were inverted when compared to those of the second method PCA, although the magnitudes were nearly identical.
Comparison of the PC1/PC2/PC3 loadings from the first method PCA to those of the second method PCA revealed the same sign inversion in Factors 1 & 3.
We do not understand why the signs of the scores and loadings of Factors 1 & 3 are inverted when comparing the first method PCA to that of the second.
Could the “scaling differences” of the initial data be responsible for this result?
What conventions does Pirouette use to determine the sign of the scores & loadings (the orientation of the PCs)?
A. Did you not get the smoke and mirrors add-on to Pirouette? You have stumbled upon a case where the latter is used.
Seriously, this is not an unexpected result. Without getting into too much detail, let me explain how PCA works. The core algorithm behind the data decomposition step in PCA is usually called something like NIPALS (based on the Power Method, if you needed to know). This algorithm does the decomposition one eigenvector at a time. To derive the first eigenvector, the algorithm forms a working vector that is then optimized through a series of iterations. That initial working vector can be a vector of ones, or ones and zeros, or random numbers, or even a vector extracted from the data set. You will see all of these in the literature.
Once the iterations have converged and the first eigenvector produced, the information represented in this first factor is “removed” from the data set, and this new data matrix is then processed to extract the second eigenvector, etc. If this process is repeated until as many eigenvectors as the rank of the matrix are computed, you could then reconstruct the original data matrix exactly by multiplying the scores (complete rank) by these eigenvectors (or, loadings, also complete rank). If you flip the sign on all values in any pair of eigenvectors, the reconstructed matrix will be the same. Thus, the sign is not critical, but the magnitudes are.
The data sets you are working with are similar but not identical. Therefore, it is not unexpected that the sign on the converged 1st eigenvector from the 2nd data set could be opposite that derived from the 1st data set. The reconstructed data will be appropriate, however, if an additional eigenvector also has its sign inverted. This is what you observed in the sign on the 3rd eigenvector.
In your case, as you described, the scores and loadings plot look like mirror images, but, and this is key, the relationships among samples in the score plot or among variables in the loadings plot will be unchanged with sign inversion.
But, how to compare the results from the two data sets? If you only want to make plots, then you could rotate the scores of one data set to be 180 degrees from that of the other. Remember, the sign is not critical, just the magnitudes.
Another way would be to project all data (i.e., from both sets combined) into the PCA space of just one of the two sets (i.e., do a PCA prediction). The resulting PCA prediction scores should show just how similar are the two sets, but in one plot rather than two, and without sign inversion between the two sets.
Predicting with PLS regression vector
Q. I have a question regarding the implementation of PLS in Pirouette. I was under the impression that if I take the scalar product of the regression vector that is produced by the PLS calibration with the X block of one of the calibration samples, then I should get the predicted value of the Y variable for that sample. However, when I try to do this, I find that the predicted Y value which I compute can be quite different to that given in the Y Fit object that is produced by Pirouette for the same sample.
Am I doing something dumb – is there another processing step that is needed here in order to produce the correct Y prediction? The model that I produced used autoscale preprocessing but no other transforms.
A. Your assumption about the multiplication is correct. But, you need to keep in mind the treatments of the raw data, as you suggest.
- Transform the X vector first. If no transforms (your case), this step is ignored.
- Preprocess the X vector. For example, if you selected mean centering, then you must subtract the training set mean (do this for each variable independently of the other variables). Thus, if you just have the X vector in hand, you would need to first compute the mean vector from all training set X vectors. If you selected autoscale, you first subtract the mean then divide by the standard deviation of the variable. If no preprocessing is selected, this step can also be ignored (you might try this with no preprocessing first to demonstrate to yourself that your math is working).
- Do the prediction step (treated X vector multiplied element by element with the regression vector, then summed).
- Reverse the preprocessing step for the predicted Y value. This requires knowing the appropriate values for the training set Y values. Thus, if mean centering, you will need to add back the training set Y mean to the predicted Y; if autoscaling, you would first multiply the predicted Y by the standard deviation of the training set Ys, then add back the training set Y mean.
Q. In our company, we make use of several packages of Pirouette for different applications. One of the problems we still experience is although we have a lot of data we cannot get a good system to do proper sample selection from this huge set of data. Is there an algorithm in Pirouette that can do that or separate software that we can purchase?
A. You’re in luck: there are several methods of performing sample selection in Pirouette. The different strategies for each of these methods are described in the User Guide. Selection can be done on a percentage basis or by requesting explicit numbers of samples to retain. The selections can be gathered from the full data set or any other exclusion set, and you can ask to make the selections on a per category basis.