There are many instances where a single chemometric model may not provide optimal results for a particular application. In these situations, it may be possible to develop a set of models, each more focused on a particular subset of data. How might these models be deployed?
- In near infrared analysis, if we attempt to implement a regression model (PLS) for a non-linear system, predictions may be unreliable. We can adapt to the instrument response by using a local model. One way to go about this task would be to recompute the PLS model based on a set of samples nearest to the prediction sample. This could be done as part of the prediction process (locally-weighted regression). Alternatively, a series of local models can be constructed ahead of time and bundled into a single prediction method.
- In classification, some groups may be easily distinguished while other samples may be very similar and resist separation. Selecting a different range in variables or a different data pretreatment might allow separation of the more similar samples. Thus, it might be practical to create separate models, one for the easily distinguished group and another for the more difficult group. Using both models appropriately could produce more robust classifications.
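The first idea above, refitting a model on the calibration samples nearest to the prediction sample, can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single predictor and ordinary least squares stand in for NIR spectra and PLS, and the names `fit_line` and `predict_local`, the value of `k`, and the toy data are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_local(x_new, xs, ys, k=4):
    """Refit on the k calibration samples nearest to x_new, then predict."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    slope, intercept = fit_line([p[0] for p in nearest],
                                [p[1] for p in nearest])
    return slope * x_new + intercept


# Toy calibration set with a curved response (y = x^2), purely illustrative.
xs = [0.5 * i for i in range(21)]
ys = [x ** 2 for x in xs]

x_new = 7.25
slope, intercept = fit_line(xs, ys)          # one global linear model
global_pred = slope * x_new + intercept
local_pred = predict_local(x_new, xs, ys)    # model refit near x_new
print(f"true={x_new ** 2:.2f} global={global_pred:.2f} local={local_pred:.2f}")
```

On this curved toy response the locally refit line lands closer to the true value than the single global line, which is the behavior that motivates local models.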
Situations such as the above motivated Infometrix to develop InStep, a product which allows the combination of any number of qualitative and/or quantitative models into a single prediction system. The product was first introduced in 1995 as a combined effort with the FDA, the CDC and Pfizer, for food, clinical, and pharmaceutical applications. Since then, applications using InStep have expanded into environmental, refining and chemical areas. I also note that a similar approach to combining models into a single system has been added to the GRAMS product (Thermo) and to Opus-7 (Bruker).
Mycobacteria
Identification methods for Mycobacteria, the genus which contains M. tuberculosis, the bug responsible for the disease of the same name, have evolved considerably over the last few decades. From the early days when labs had to employ a battery of biochemical tests, to developments in thin layer, gas and liquid chromatography, and now with tools from DNA analysis being brought into practice, the analytical problem has always been accompanied by a second one: data interpretation.
Liquid chromatography is still very much in use in many labs because it is simple, is amenable to automation, and produces reasonably reproducible results. Fortunately, there is species specificity in the collection of mycolic acids found in the cell walls of these bugs, and LC is capable of separating most of them. Several years ago, the CDC prepared a data set of a couple dozen species of Mycobacteria to evaluate the method of LC combined with pattern recognition.
We found it was possible to use KNN to do the first pass at an identification, but because KNN does not have statistics-based diagnostics, an outcome based on KNN alone was not acceptable in the clinical setting. SIMCA, on the other hand, does base its outcomes on statistical evaluations. By combining these two methods in a two-level InStep method (KNN to do an identification and SIMCA to do a confirmation), the labs have a tool that provides a reliable ID, with confidence levels, in the InStep-generated output.
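The identify-then-confirm flow can be illustrated with a small sketch. It is not the InStep implementation: a plain KNN vote does the identification, and a simple distance-to-centroid check stands in for SIMCA's statistical confirmation (real SIMCA builds a principal component model per class). The function names, the `factor` threshold, and the two-dimensional toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_identify(x, train, k=3):
    """Level 1: majority vote among the k nearest training samples."""
    nearest = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def confirm(x, label, train, factor=3.0):
    """Level 2 stand-in for SIMCA: is x close enough to the class centroid?

    The limit is `factor` times the largest member-to-centroid distance,
    an arbitrary illustrative rule rather than a statistical test.
    """
    members = [v for v, lab in train if lab == label]
    centroid = [sum(col) / len(members) for col in zip(*members)]
    spread = max(math.dist(v, centroid) for v in members)
    return math.dist(x, centroid) <= factor * spread


def identify(x, train):
    """Return (label, confirmed) for one unknown sample."""
    label = knn_identify(x, train)
    return label, confirm(x, label, train)


# Hypothetical two-class training set (labels "A" and "B" are placeholders).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.2), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.8, 5.2), "B"),
]

print(identify((1.1, 0.9), train))    # near class A
print(identify((20.0, 20.0), train))  # far from every class
```

The second sample shows why the confirmation level matters: KNN always returns the nearest class, even for a sample that resembles none of the training groups, so a second check is needed to flag it.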
OilMOD
Characterizing the source of petroleum helps an oil refinery plan subsequent processing. To demonstrate a system capable of discriminating oil types, our collaborator GeoMark assembled a collection of oils from various locations, then created a set of Pirouette models, each for distinguishing between two or more source rocks. These models were arranged in an InStep method that uses 3 levels of decision-making:
The first model that is applied to an unknown sample distinguishes between terrigenous (land-based) and aqueous source rocks. If the model suggests the oil has come from an aqueous source, then a second model is applied; this model distinguishes between marine and lacustrine (freshwater) sources. Finally, the sample is passed to the final step: if the oil is from a marine source, the model distinguishes among 4 types of marine source rocks; the lacustrine model separates the oil into 2 types; and the terrigenous model also suggests one of 2 types.
In this demonstration, all of the models were derived from KNN but could have easily been done with the SIMCA algorithm. In the final analysis, the InStep method makes it easy to traverse a multi-level, multivariate decision tree for sample classification.
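The three-level traversal described above maps naturally onto a nested structure in which each node holds a model and the branches to follow for each outcome. The sketch below is a hypothetical illustration of that routing only: the models are placeholder threshold rules, and the feature names (`f1`, `f2`, `f3`), cut values, and type labels are invented rather than taken from the GeoMark models.

```python
from bisect import bisect_right


def stub_model(feature, cuts, labels):
    """Placeholder classifier: bins one feature value into len(cuts)+1 labels."""
    return lambda sample: labels[bisect_right(cuts, sample[feature])]


def leaf(prefix, n_types):
    """Final-level node that picks one of n_types source-rock types."""
    cuts = [i / n_types for i in range(1, n_types)]
    labels = [f"{prefix} type {i}" for i in range(1, n_types + 1)]
    return {"model": stub_model("f3", cuts, labels), "children": {}}


# Level 1: terrigenous vs aqueous; level 2: marine vs lacustrine;
# final level: 4 marine, 2 lacustrine, or 2 terrigenous types.
TREE = {
    "model": stub_model("f1", [0.5], ["terrigenous", "aqueous"]),
    "children": {
        "terrigenous": leaf("terrigenous", 2),
        "aqueous": {
            "model": stub_model("f2", [0.5], ["lacustrine", "marine"]),
            "children": {
                "marine": leaf("marine", 4),
                "lacustrine": leaf("lacustrine", 2),
            },
        },
    },
}


def classify(sample, node=TREE):
    """Apply each model in turn, following the branch named by its outcome."""
    path = []
    while node is not None:
        label = node["model"](sample)
        path.append(label)
        node = node["children"].get(label)  # None once a final type is reached
    return path


print(classify({"f1": 0.9, "f2": 0.9, "f3": 0.6}))
print(classify({"f1": 0.1, "f2": 0.0, "f3": 0.2}))
```

Keeping the routing separate from the models means any node could be swapped from one algorithm to another without touching the traversal.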
Octane Number
Not long ago, researchers learned that NIR could be used to determine octane number in gasoline products, replacing the costly and labor-intensive knock engine traditionally used for that purpose at most refineries. The analysis requires the use of chemometrics, an advance that also enables automation. Of course, a reliable regression model (using, for example, PLS) requires homogeneous calibration data sets, so a single model spanning every gasoline type may not suffice. It also means that a tool like InStep can simplify automatic octane number predictions. An example of such an approach is included as a demo with the InStep product.
In this scenario, the NIR spectrum of the gasoline of interest is processed using a model that classifies it as containing oxygenated compounds or not. If it does, a second classification suggests which type of oxygenate was added to the gasoline. If, however, the product has been determined not to contain added oxygenates, it is instead processed with a global PLS octane number model.
This is a sort of regression-based approach to classification because the outcome of this PLS prediction is used to determine which final, and more specific, PLS model to use for the octane number estimate that will be reported. The accompanying figure shows this decision logic graphically.
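The decision logic can also be written out as a short routing function. This is a sketch under stated assumptions, not the demo shipped with InStep: every model is passed in as a hypothetical callable, the octane range boundary and the labels "low", "high", "type A", and "type B" are invented, and the step that applies an oxygenate-specific regression model after the type is identified is an assumption, since the text does not say what follows that classification.

```python
def predict_octane(spectrum, models):
    """Route one spectrum through the hierarchy; returns (branch, octane)."""
    if models["is_oxygenated"](spectrum):
        oxy_type = models["oxygenate_type"](spectrum)
        # Assumption: a type-specific regression model is applied next.
        return oxy_type, models["by_oxygenate"][oxy_type](spectrum)

    # No oxygenates: a global model gives a rough estimate, which is used
    # only to choose the narrower model that produces the reported value.
    rough = models["global_pls"](spectrum)
    for upper, name in models["ranges"]:  # sorted by upper bound
        if rough < upper:
            return name, models["local_pls"][name](spectrum)
    raise ValueError("rough estimate fell outside every range")


# Placeholder models: simple arithmetic on a two-element "spectrum".
models = {
    "is_oxygenated": lambda s: s[0] > 1.0,
    "oxygenate_type": lambda s: "type A" if s[0] < 3.0 else "type B",
    "by_oxygenate": {
        "type A": lambda s: 82.0 + 10.0 * s[1],
        "type B": lambda s: 83.0 + 10.0 * s[1],
    },
    "global_pls": lambda s: 80.0 + 10.0 * s[1],
    "ranges": [(90.0, "low"), (float("inf"), "high")],
    "local_pls": {
        "low": lambda s: 80.0 + 9.0 * s[1],
        "high": lambda s: 81.0 + 10.0 * s[1],
    },
}

print(predict_octane([0.0, 0.5], models))  # non-oxygenated, lower range
print(predict_octane([2.0, 0.5], models))  # oxygenated branch
```

Note that the global model's prediction is never reported in the non-oxygenated branch; it serves purely as a selector, which is what makes this a regression-based classification step.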
Summary
In our experience, particularly with large, complex data sets, a step-wise approach to either classification or quantitation is a means for improving the overall multivariate assessment of routine samples. As always, the first step is to visualize your data; look for clustering in regression problems or a lack of clustering in classification tasks where several classes are to be in the model. Check in with us if you want additional detail on the use of hierarchical models.