Choosing Number of Factors
When creating a regression model, it is critical to choose the number of factors that best represents the correlation of information in the data block to the property of interest. Many methods suggest looking at the shape of, or the values in, a plot of eigenvalues: finding where the curve bends, as in a hockey stick; stopping when the eigenvalues drop below 1; or simply ensuring that 95% (or another subjective cutoff) of the variance is explained. But these methods look only at variance and not necessarily at correlation to the Y value. They also pertain only to modeling and not to prediction.
We could look at errors following a prediction on the calibration data, either as the PRESS (the prediction residual error sum of squares) or as the RMSEC (the root mean square error of calibration), which corrects PRESS for the number of samples and the number of factors in the model. However, this approach is biased and optimistic because the error diagnostic is derived from the same data used to make the model in the first place. Ideally, we want to evaluate the model size based on how well it predicts a set of unknowns. But which unknowns?
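As a sketch of the two calibration diagnostics just described, with n calibration samples and k factors, one common way to write them is shown below. The symbols and the exact degrees-of-freedom correction are assumptions made for illustration: some texts subtract one more degree of freedom when the data are mean-centered, and the form used by any particular program may differ.

```latex
\[
\mathrm{PRESS} = \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 ,
\qquad
\mathrm{RMSEC} = \sqrt{ \frac{\mathrm{PRESS}}{n - k} }
\]
```

Here \hat{y}_i is the value predicted for calibration sample i and y_i is its known value.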
One way to resolve this question is to use cross-validation. In this procedure, one or more samples are excluded from the data set, a model is made on the remaining data, and this model is then used to predict the excluded samples. Error diagnostics from this prediction are saved, then different samples are excluded and the process is repeated. After all samples have been excluded once, a summary of the prediction errors can be shown as the root mean square error of cross-validation, or RMSECV. This diagnostic is less biased than the RMSEC.
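The cross-validation loop described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not Pirouette's implementation: the function names (pcr_fit, pcr_predict, rmsecv), the choice of principal component regression as the model, and the use of contiguous blocks of left-out samples are all assumptions made for the example.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Fit a principal component regression with k factors (mean-centered)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    # Regression vector built from the first k singular triplets.
    b = Vt[:k].T @ ((U[:, :k].T @ (y - y_mean)) / s[:k])
    return x_mean, y_mean, b

def pcr_predict(model, X):
    """Predict Y values for new samples from a fitted model."""
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

def rmsecv(X, y, max_factors, n_splits=5):
    """Return the RMSECV for 1..max_factors factors.

    Each sample is left out exactly once; the squared prediction errors
    are accumulated per model size and summarized at the end.
    """
    n = len(y)
    press = np.zeros(max_factors)
    for held_out in np.array_split(np.arange(n), n_splits):
        train = np.setdiff1d(np.arange(n), held_out)
        for k in range(1, max_factors + 1):
            model = pcr_fit(X[train], y[train], k)
            resid = pcr_predict(model, X[held_out]) - y[held_out]
            press[k - 1] += np.sum(resid ** 2)
    return np.sqrt(press / n)
```

Calling rmsecv(X, y, max_factors=10) on a calibration set returns one value per model size, which can then be plotted against the number of factors.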
Ideally, the shape of the RMSECV curve, as a function of number of factors, will rise when too many factors are chosen because such a model has incorporated noise which cannot be predicted. Such a model is overfitting the data. Thus, choosing the minimum in this curve is a good choice for number of factors.
It is often the case that the RMSECV for a model with fewer factors than that at the minimum of the curve is not statistically different. In general we recommend using the more parsimonious model, which further reduces the risk of overfitting. In addition to looking at the RMSECV, you should also look at other computed objects, such as the loadings, the regression vector and the spectral residuals.
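The preference for a more parsimonious model can be illustrated with a small helper. This is only a sketch: the function name and the fixed relative tolerance are hypothetical stand-ins for a proper statistical comparison, and this is not presented as the test that Pirouette applies.

```python
def parsimonious_factors(rmsecv_values, tolerance=0.05):
    """Return the smallest number of factors (1-based) whose RMSECV lies
    within a relative `tolerance` of the minimum of the curve.

    The tolerance is an illustrative substitute for a significance test.
    """
    best = min(rmsecv_values)
    for k, value in enumerate(rmsecv_values, start=1):
        if value <= best * (1 + tolerance):
            return k
```

With a tolerance of zero the function simply returns the position of the minimum; a larger tolerance favors smaller models.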
For spectroscopic data, neighboring values of the X data are highly correlated, so the spectral curve is quite smooth. Similarly, the regression vector should be smooth. When too many factors are retained in the model, the regression vector will begin to show a jagged appearance, indicating that noise is being added to the model. Another diagnostic, simply called the jaggedness, attempts to characterize this tendency of the regression vector to become noisy when the model is overfit; it will usually have a U shape whose minimum is the suggested model optimum. This metric seems to suggest the optimal number of factors more consistently, and we now routinely use jaggedness as our preferred metric. Jaggedness can also be computed from the regression vector during calibration, so it does not require cross-validation to suggest a reasonable number of factors to use in the model.
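To make the idea of a jagged regression vector concrete, here is one plausible roughness measure: the sum of squared differences between neighboring coefficients, scaled by the sum of squared coefficients. The formula and the function name are assumptions for illustration; the exact definition of jaggedness used in Pirouette is not given here and may differ.

```python
def jaggedness(regression_vector):
    """Roughness of a regression vector (illustrative definition).

    Larger values mean that neighboring coefficients differ more,
    relative to the overall size of the vector.
    """
    b = list(regression_vector)
    rough = sum((b[i] - b[i - 1]) ** 2 for i in range(1, len(b)))
    size = sum(v * v for v in b)
    return rough / size if size > 0 else 0.0
```

Computing this value for the regression vector at each model size gives a curve that can be inspected alongside the RMSECV.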
What reference books on chemometrics can you suggest?
- Beebe, Pell and Seasholtz: As a practical guide to applications in the chemical industry, this book steps through the chemometrics approach and spans many of the approaches seen in Pirouette.
- The Pirouette manual: Although not formally a book, this PDF is free and covers a lot of the basics for chemometrics. There are walkthroughs for both regression and classification analysis and a mathematical summary of all of the common techniques.
- Massart, Vandeginste, et al.: For the more adventurous, the Massart book is a complete and well-written tome that provides references to more cutting edge applications.
- Martens and Næs: For those seeking detail on the mathematics of multivariate regression, this book is the most complete reference.
- Brown: not really a book, but a good introductory resource, A Short Primer on Chemometrics for Spectroscopists.
A more extensive list of texts and journals is on the References page.
How do I choose among the Pirouette algorithms?
Our philosophy is to provide more than one algorithm to do any given task. For example, it is advisable to run both PCA and HCA when conducting an exploratory analysis of your data.
- Because PCA has the same factor basis as the other modeling algorithms (such as SIMCA and PLS), it should always be run as part of exploratory data analysis.
- Choosing between SIMCA and KNN is a function of the number of data points and their distribution; if clusters are reasonably separated and well-populated, SIMCA is the better choice in that it supplies outlier diagnostics.
- In regression, PLS usually models data with fewer factors than PCR because it assumes that there is some measurement error in the Y-block, which is usually the case.
- In mixture analysis, ALS is our preferred route in that the alternative (MCR) is restricted to 2 components, and ALS covers the most common closure constraints. ALS is weaker when you do not have samples that are relatively pure representatives of the end-members, a case for which MCR is better suited.
How do I get data into Pirouette from a source that Pirouette does not directly support?
Although Pirouette attempts to support the more common analytical file formats, we do not cover them all. It is best to check whether your instrument system offers a means of saving to a standardized file format. In spectroscopy, this usually means the Galactic .SPC format or JCAMP, a less consistently implemented standard. In chromatography, the AIA format is preferred and is fairly ubiquitous. If none of these suits your needs, ASCII and Excel formats are also accepted by Pirouette. Check the format options in the Pirouette manual (Chapter 14), and be aware that the Excel 256 column limitation often means that the exported data will contain samples in column orientation; you will need to transpose them once they are in Pirouette.
How can I capture a spinning 3D graphic for use in a PowerPoint presentation?
Several users have asked about capturing a spinning 3-D scatter plot, in motion. We use a product called SnagIt (http://www.techsmith.com/products/snagit/default.asp) for many purposes: for publications, for our user guide, for reports and for movies. You can set it to record the screen at a specified rate, which lets you capture a Pirouette spinning plot in motion. You can then save the recording to a file (in one of several formats) and drop the file into a PowerPoint presentation. When running the presentation, you can start the graphic spinning, much as it does in Pirouette.
Printing a Long Dendrogram
Printing a dendrogram of a large number of samples may produce several pieces of paper that you need to paste together to make one continuous graphic. Take advantage of the continuous roll printers available in service bureaus with this procedure that makes a custom PDF file of your dendrogram.
Loading Transposed Data
Some data sources, for example, spectroscopy data systems, write text files in which the sample values are stored in columns. If there are more than 255 values, you cannot use a spreadsheet to transpose the data before reading into Pirouette because of the limit on number of spreadsheet columns. You can, however, configure the text file so that on reading into Pirouette, it will read columnar data into row format.
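If you prefer to transpose such a file before loading it, a short script avoids the spreadsheet column limit altogether. The sketch below is an alternative done outside Pirouette, not the text-file configuration mentioned above; the function name is hypothetical, and it assumes a simple delimited file in which every line has the same number of fields.

```python
import csv
import io

def transpose_delimited(text, delimiter=","):
    """Swap rows and columns of delimited text and return the result.

    Blank lines are ignored; ragged input raises ValueError.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all lines must have the same number of fields")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(zip(*rows))  # zip(*rows) turns columns into rows
    return out.getvalue()
```

To use it on a file, read the file contents into a string, pass the string to transpose_delimited, and write the returned text to a new file.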
Run Pirouette on a Mac
Although the newest Macintosh computers are now running on Intel processors, the Pirouette code base has not been optimized for this platform. However, you can still run Pirouette, and other Infometrix products, on a Mac by using virtual desktop software. Here is a brief report on how to run Pirouette on a Mac.
Technical Tip: Cloaking
Visualization is a process we emphasize for understanding your data. In Pirouette, two complementary features facilitate visualization: dynamic linking and cloaking. You are probably already familiar with dynamic linking: if you highlight a sample or group of samples in one view, say a 2D scatter plot, those samples will appear highlighted in any other sample-oriented plot, including table views and line plots.
Sometimes, the plots are too busy to see where the highlighted samples are located. Click on the Cloak button in the ribbon (just to the left of the label button with the upper case A). The cloaking button is a 3-way toggle: the first time you click on it, it will show only the highlighted samples; the second time, only those samples not highlighted; the third time, all samples will be shown, back in the initial state.
By toggling the cloak state, you will be able to see which samples are highlighted in the other view state. This may help reveal more nuance about the differences between the selected and unselected samples. Clicking to change the state to show only the unhighlighted samples might be just as informative. Try this with the data in both scatter plot views and in a line plot as well.
Because the Pirouette software has unique display characteristics for multivariate data, we have enabled a series of features in the demonstration version of the product that are available license-free. The Pirouette Demo has always been a useful tool for learning how to approach problems in that the manual contains data walkthroughs that teach the basics and much of the advanced concepts. All data used in the examples is supplied with the program and all analyses can be duplicated on your own computer as described.
Another very useful feature of the Pirouette Demo is the ability to load any data of supported data types. That includes .PIR files. Thus, you can develop a set of analyses, using exclusion subsets, save to the PIR format, then send the file to a colleague who has only the demo, and they can view and manipulate your data, license-free. In addition, any file that you can load can then be saved in any of the file saving formats that Pirouette supports. Thus, the Pirouette Demo becomes a file converter as well. All license-free.
When performing predictions in Pirouette, a set of computed objects is generated for each sample; the objects vary with the algorithm. Examples include the Outlier Diagnostics and X Residuals. In addition, other objects are created for the prediction data set as a whole, such as an Error Analysis in PLS or a Misclassification Matrix in SIMCA. However, these objects require that known property values be present in the prediction data before the calculations can be accomplished. Pirouette looks for this information before performing the calculations.
Specifically, if the data set you intend to use in prediction is for validating a model, then, by definition, it is expected that values for the properties of interest have already been defined for these data: either Y values for regression algorithms or category values for classification algorithms.
What, then, does Pirouette actually look for when asked to do a prediction? In a regression algorithm, Pirouette first looks for a Y variable whose name matches one of the Y variables in the model. Note that the name is case sensitive. Next, Pirouette verifies that the matching Y variable is included rather than excluded. Finally, Pirouette checks whether any values for the Y variable are missing. If the matching Y variable is present, is included, and has no missing values, then Pirouette can compare predicted values to known values, allowing computation of the Error Analysis items as well as the plot of Measured vs Predicted for the Y Fit object. If any of these checks fails, then the underlying comparison of known to predicted cannot be performed, and neither the Error Analysis nor the Measured vs Predicted comparison in Y Fit can be computed. Note that the Y Fit object will still be created, because this object also contains prediction limits around the predicted values, which are not affected by these conditions.
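The three regression checks can be summarized as a small decision function. This is only a sketch of the logic described above: the function name, the dictionary layout, and the use of None for a missing value are assumptions for the example and do not reflect Pirouette's internal code.

```python
def error_analysis_possible(model_y_name, prediction_y, excluded_names=frozenset()):
    """Return True when known and predicted Y values can be compared.

    prediction_y maps Y variable names to lists of values, with None
    marking a missing value (an illustrative convention).
    """
    # 1. A Y variable with exactly the same name must exist (case sensitive).
    if model_y_name not in prediction_y:
        return False
    # 2. The matching Y variable must be included, not excluded.
    if model_y_name in excluded_names:
        return False
    # 3. None of its values may be missing.
    return all(value is not None for value in prediction_y[model_y_name])
```

For example, a prediction set holding a column named "octane" would not satisfy a model whose Y variable is named "Octane".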
In classification algorithms, the checks are a little more lenient. The primary condition is that a category variable of the same name must be present. Again, this name is case sensitive. If this condition is met, the Misclassification Matrix will be produced because it will be possible to compare the predicted category with a known value. Note that a category variable with a matching name can be excluded; the Misclassification Matrix will still be generated. The category may also have missing values and the classification will still proceed; the only difference is that there will be fewer values in the Misclassification Matrix. Of course, if the category column is present but all values are missing, then no Misclassification Matrix can be formed.
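The more lenient behavior for classification can be sketched as follows. The function name and the use of None for a missing category are assumptions for illustration; the sketch only mirrors the rules stated above and is not Pirouette's implementation.

```python
from collections import Counter

def misclassification_matrix(known, predicted):
    """Count (known, predicted) category pairs.

    Samples whose known category is None (missing) are skipped, so they
    simply contribute nothing to the counts. Returns None when every
    known value is missing, since no matrix can be formed.
    """
    pairs = [(k, p) for k, p in zip(known, predicted) if k is not None]
    if not pairs:
        return None
    return Counter(pairs)
```

Each key of the returned counter is a (known, predicted) pair, so the off-diagonal entries, where the two differ, are the misclassified samples.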
Validation is an important part of any modeling workflow, and Pirouette has been designed to anticipate this objective by computing specific objects that help with model evaluation. It is up to the analyst to ensure that the appropriate validation data, in the form of known Y or category values, are present in the prediction data set.