Questions from Our Users – Pirouette Interface
Cluster Colors in PCA
Q. My data points in the PCA Scores plot remain multicolored even though they are clustered into different colors in the HCA dendrogram. I have been trying to fix this, but to no avail. Any guidance is appreciated.
A. The colors in these two plots will NOT be the same unless you make them so, and there is one direct way to do that.
The colors of the data points in a scatter plot, such as PCA scores, are assigned by one of two schemes:
– based on the category values in a class variable, IF that variable has been activated (look in the middle of the status bar for the name of the currently activated class; or None). If so, then samples assigned to class one get the first color in the Color Sequence, those in class two get the second color in the sequence, etc.
– based on the row number if NO class variable is active; sample index 1 gets the first color, and so on.
The colors in the dendrogram are assigned by cluster order, top to bottom (which cluster is on the ‘top’ of the dendrogram is arbitrary): first cluster gets the first color in the color sequence, etc.
Now, here is how you make the colors the same for samples in both plots. With the dendrogram window active, choose Edit > Activate Class. That action does two things: it makes a NEW class variable, with each sample within a cluster assigned the same category value, AND it makes that variable the active class. The scores plot will now take on the same colors as those in the dendrogram.
Note that you can change the colors in the color sequence: Windows > Preferences > Chart > Color Sequence.
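The coloring rules described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Pirouette code: the color names, the function names, and the wrap-around when the sequence runs out are all assumptions made for the example.

```python
# Hypothetical color sequence; the actual Pirouette defaults may differ.
COLOR_SEQUENCE = ["red", "green", "blue", "orange"]

def pick(index):
    """Color for a 1-based index. Wrapping past the end is an assumption."""
    return COLOR_SEQUENCE[(index - 1) % len(COLOR_SEQUENCE)]

def scatter_colors(n_samples, active_class=None):
    """Scatter plot colors: by class value if a class is active, else by row number."""
    if active_class is None:
        return [pick(row) for row in range(1, n_samples + 1)]
    return [pick(value) for value in active_class]

def dendrogram_colors(cluster_of_sample):
    """Dendrogram colors: clusters numbered 1, 2, ... from top to bottom."""
    return [pick(cluster) for cluster in cluster_of_sample]

# Made-up cluster membership for five samples, listed in row order.
clusters = [2, 1, 1, 3, 2]

print(scatter_colors(5))            # no class active: colored by row number
print(dendrogram_colors(clusters))  # colored by cluster
# Activating a class built from the clusters makes the two plots agree.
print(scatter_colors(5, active_class=clusters))
```

With no class active the two plots disagree because one is keyed on row number and the other on cluster number; once the cluster membership becomes the active class, both use the same key.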
Scatter plots with labels and no points
Q. Is there a way to show a 2D or 3D plot with labels and no points? I know I can change the color of the points to white on a white background, but I was wondering if there is a setting that would enable me to simply shut them off.
A. We have not provided a mechanism to turn off plotted points completely. Instead, make the points as small as possible (Preferences > View > 3D > Point Size > 3), then make the points the same color as the background. There are two ways to do this:
- Make your Color Sequence have only one color and make that color the same as the plot interior.
- Leave your color sequence with the colors as they were, but modify one color, or add a new color to the list, that is the same as the background. Then, make a class variable in which all samples have the same value and that value corresponds to the color index in the color sequence that you set for the background color (then, you must be sure that you Activate the class variable you just modified).
Q. My question has to do with the position of the origin. The clearest exposition relevant to my question is in “Chemometrics: A Practical Guide” by Beebe, Pell and Seasholtz on p.89, fig. 4.27. Basically, it says that the first principal component in PCR should align with the most dominant trend in the data in factor space, so long as the data is mean-centered. If the data is not mean-centered, then PC1 must point from the origin to the centroid of the data. My understanding is that the factors in PLS behave much like the principal components in PCR. What confuses me in Pirouette plots is that I always make sure to choose mean-centering as my preprocessing option, but I always seem to get plots where the origin (intersection of factor 1, 2 and 3 axes) is nowhere close to where I would eyeball the centroid of my data. This is especially obvious when I have a collection of mostly massed-together data and a spur of a few outliers – the origin is located along the spur, not in the middle of the mass! It seems as though the algorithm chooses the midpoint of the most extreme values (e.g. [XMAX+XMIN]/2 ) for the origin, rather than the centroid. Does the origin of the factor space plot coincide with the origin of the program’s internal representation of the data, or is the origin simply translated (or something else) for the sake of plotting? If the origin of the plot is really the same as the origin of factor space, can you please tell me how to tell the program to put the origin at the “right” place, and not where the graphics programmers have chosen to put it?
A. You are correct that Pirouette places its axes at the midpoint of the range of the plotted points. As you may have also noticed, the default presentation of scatter plots is to fill the plot region, that is, to use what is called range scaling. (If the range of values of points on their y-axis is smaller than that for the x-axis, then the scale along the two axes will be different.) What we are trying to do with the default view is to make a pleasing presentation that utilizes the real estate optimally. In a 3D scatter plot, you could imagine then that range scaled data would fill a cube (well, not exactly a cube; this depends on the shape of the window you are viewing). Some plotting packages show the walls of the cube, with the points inside. We have chosen to show the ‘positive’ halves of the axis rays emanating from the midpoint of the cube. When rotating the 3D plot, this seems to be a satisfactory way to present a point of reference for the viewer.
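A small numeric sketch may help show why the drawn axes can sit far from the data origin. The values below are made up, and the only assumption about the plot is the one stated above: the axes are drawn at the midpoint of the plotted range, (max + min) / 2.

```python
from statistics import mean

# Made-up values on one factor: a tight mass of points plus a spur of two outliers.
raw = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0, 6.0, 8.0]

# Mean-centering subtracts the centroid, so the centered values average to zero.
centroid = mean(raw)
scores = [x - centroid for x in raw]

data_origin = mean(scores)                       # zero, up to rounding error
axis_midpoint = (max(scores) + min(scores)) / 2  # where the axes would be drawn

print(f"mean of centered values: {data_origin:.3f}")
print(f"midpoint of plotted range: {axis_midpoint:.3f}")
```

The centered values still average to zero, but the midpoint of their range is pulled toward the outliers, which matches the observation that the axes appear along the spur.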
Regarding PCR and PLS scores, in particular, the description given by Beebe, et al., is mostly accurate in that the direction of PC1 will align with the maximum information in the data, the direction of PC2 will align with the maximum spread of information not already described by PC1, with the restriction that this direction is perpendicular to PC1, etc. If the data are not mean-centered, then the direction of PC1 MAY point from the origin to the centroid, but not necessarily. If the variability among samples is quite small compared to the variability in a ‘feature’ in the data (a significant peak, for example), the information in this feature may be extracted as the first PC. Nevertheless, the general trend is as they describe, and the ordering of PCR PCs is in decreasing amount of variance described.
In PLS, on the other hand, the order of the PCs is in decreasing amount of correlation to the Y variable. If you look at the variance information for the PLS factors, the amount of variance may actually be larger in factor 2 than in factor 1 if it happens that the variables contributing to factor 1 are more correlated to the Y variable. In many data sets, the order of PLS factors does coincide with the order of decreasing variance, but in many other data sets, that does not happen.
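The effect of mean-centering on the direction of PC1, discussed above for PCR, can be illustrated with a toy two-variable data set. This is a minimal sketch, not Pirouette's algorithm: it takes PC1 as the leading eigenvector of X^T X and finds it by power iteration, and the data, the function name first_pc, and the starting vector are all hypothetical. It shows the case in which the uncentered PC1 does point toward the centroid; as noted above, that is not guaranteed for every data set.

```python
import math

def first_pc(points):
    """Leading eigenvector of X^T X for 2-D points, found by power iteration."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    v = (1.0, 0.5)  # arbitrary starting direction
    for _ in range(200):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Made-up samples: the trend runs along (1, -1), far from the origin near (10, 10).
raw = [(10 + t, 10 - t) for t in (-2, -1, 0, 1, 2)]

# Mean-center by subtracting the centroid from every sample.
cx = sum(x for x, _ in raw) / len(raw)
cy = sum(y for _, y in raw) / len(raw)
centered = [(x - cx, y - cy) for x, y in raw]

pc1_raw = first_pc(raw)            # points from the origin toward the centroid
pc1_centered = first_pc(centered)  # follows the trend within the data

print("PC1 without centering:", pc1_raw)
print("PC1 with mean-centering:", pc1_centered)
```

Without centering, the two components of PC1 are equal, so it points along (1, 1) toward the centroid; after centering, they are equal and opposite, so it follows the (1, -1) trend.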
Q. I’m new, so please excuse this stupid question. How can I select the column to sort on?
A. The column that will be the sort key is the one in which the ‘active cell’ is located. That is, with default colors, the selected columns will all be in blue but the active cell within that range will have a white background. You can use the Tab key to move the active cell to another location. Note that when you select columns, the active cell stays in the last column you clicked, in the first visible row.
Q. I have about 100 spectra. Is there a way to import them into Pirouette 3.11 to create a large set other than one-at-a-time?
A. Load the first file with the normal File > Open command. Then, do a File > Merge Samples and select all the rest.
Alternatively, if the files are all in the same directory, highlight them all, then drag into Pirouette.
Q. You mention highlighting a group of files and then “drag into Pirouette”. What do I drag into–a particular Window?
A. With Pirouette running, in Windows Explorer highlight a collection of appropriate files. Click and drag them to the Toolbar, hovering over the Pirouette icon until the Pirouette window reappears. Drag the files over any part of the Pirouette interface and release. You will be shown a dialog that asks the orientation of the data. Select Samples. Hit OK. You’re done.
Precision when copying and pasting from Excel
Q. We copied reference values (Y values) for a data set from EXCEL to Pirouette and the format for EXCEL was set to 1 decimal place. The data that ended up in Pirouette was then of the same precision as it was formatted in EXCEL. We assumed we would get the full precision of the numbers from EXCEL but we did not. I tried copying between EXCEL sheets and obtain the full precision but I see if I copy and paste from EXCEL to WORD I get the formatted precision. I guess Pirouette behaves like WORD rather than EXCEL in this regard?
A. What you observe is actually a feature of Excel, not of Word or Pirouette. When you set the number of decimals to show in the Excel sheet, those cells actually obtain that precision as a descriptor of the cell. When those data are placed in the Clipboard, the data are rounded according to the specified precision. Thus, any program that pastes data from the Clipboard will see the effect of rounding.
Saving the Excel file to a .txt format will show similar behavior. I have actually used this behavior to do automatic rounding of data in files that had way more ‘precision’ than was warranted, such as comes from some analytical instruments.
To avoid this problem, be sure to set the display precision to at least 6 decimals in Excel before copying. Alternatively, save the data in .xls format. Pirouette will then read the data in full precision, even if there is display formatting in the Excel sheet.
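The rounding effect described above can be mimicked in Python by treating the clipboard contents as the displayed text of the cell. This is only a model of the behavior, not Excel's internals; the value and the helper name copy_as_displayed are made up for the example.

```python
def copy_as_displayed(value, decimals):
    """Model a copy that carries the cell's displayed text, then a paste that parses it."""
    clipboard_text = f"{value:.{decimals}f}"  # text as formatted for display
    return float(clipboard_text)

stored = 12.3456789  # hypothetical full-precision value held in the cell

pasted_1 = copy_as_displayed(stored, 1)  # cell formatted to 1 decimal place
pasted_6 = copy_as_displayed(stored, 6)  # cell formatted to 6 decimal places

print(pasted_1)  # 12.3
print(pasted_6)  # 12.345679
```

Raising the display precision before copying keeps more digits in the pasted value, which is why the 6-decimal setting suggested above reduces the loss.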
Saving Exclusion sets
Q. I have a large data set comprising many variables and sample types. I have created 3 class variables. I work primarily with the subsets (exclusion/inclusion sets) and compare the subsets to each other. How can I save these subsets? They don’t get saved when I save the data file.
A. If you save in the native format (*.PIR), all subsets and corresponding computed results are saved.