Categorical Principal Components Analysis (CATPCA) with Optimal Scaling
Categorical principal components analysis (CATPCA) is appropriate for data
reduction when variables are categorical (e.g., ordinal) and the researcher is
concerned with identifying the underlying components of a set of variables (or
items) while maximizing the amount of variance accounted for in those items (by
the principal components). The primary benefit of using CATPCA rather than
traditional PCA is that CATPCA carries far fewer assumptions: it does not assume
linear relationships among numeric data, nor does it require multivariate
normality. Furthermore, SPSS applies optimal scaling during the CATPCA analysis,
which allows the researcher to specify the level of measurement he or she wants
to maintain (e.g., nominal, ordinal, interval/ratio, spline-nominal, or
spline-ordinal) in the optimally scaled variables.
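For reference, these measurement levels correspond to LEVEL keywords on the
/ANALYSIS subcommand of the CATPCA syntax. The fragment below is only an
illustrative sketch with hypothetical variable names:

  /ANALYSIS=var1(LEVEL=NOMI) var2(LEVEL=ORDI) var3(LEVEL=NUME)
     var4(LEVEL=SPNOM) var5(LEVEL=SPORD)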
For the duration of this tutorial we will be using the Items001.sav file,
which is fictitious and contains 797 participants' responses to 25 items. The
first 10 items each have a 7-point Likert response format and compose one scale.
The next 15 items have a 5-point Likert response format and compose a second
scale. Clearly, these data lend themselves to a solution with two dimensions or
components, but typically the solution would not be so apparent.
CATPCA should be approached in a similar manner as one would approach a
traditional PCA. Both are data reduction techniques and often require multiple
runs of the analysis with different numbers of variables (referred to as items
from this point forward) and different numbers of dimensions retained in order
to arrive at a meaningful solution.
1.) The first example will include all 25 items. Begin by clicking on
Analyze, Dimension Reduction, Optimal Scaling...
Next, click the circle next to "Some variable(s) are not multiple nominal"
and then click the Define button.
One of the things you may want to explore here is the Missing... button; by
default, missing values on nominal and ordinal variables are imputed with the
mode during the analysis.
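In syntax, this treatment appears on the /MISSING subcommand. A sketch of the
default, assuming the 25 items are named item1 through item25 (hypothetical
names; keyword spellings can vary by version, so verify against your own pasted
syntax):

  /MISSING=item1 TO item25(PASSIVE,MODEIMPU)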
Next, highlight / select all the items and use the top arrow to move them to
the Analysis Variables: box. Then, click on the "Define Scale and Weight..."
button. Select Ordinal for all items, then click the Continue button.
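The syntax equivalent of this step is the /ANALYSIS subcommand; a minimal
sketch, using the same hypothetical item names:

  /ANALYSIS=item1 TO item25(WEIGHT=1,LEVEL=ORDI)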
Next, click on the Output button. By default, Object scores and Component
loadings should be selected. Select the other four choices: Iteration history,
Correlations of original variables, Correlations of transformed variables, and
Variance accounted for. Then, highlight / select all the items and use the top
arrow to move them to the Category Quantifications box. Then, highlight / select
all the items again (in the Quantified Variables: box) and use the second arrow
to move them to the Descriptive Statistics: box. Then, click the Continue
button.
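These output selections correspond to keywords on the /PRINT subcommand; a
sketch with the same hypothetical item names:

  /PRINT=DESCRIP(item1 TO item25) OBJECT LOADING HISTORY CORR OCORR VAF
     QUANT(item1 TO item25)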
Next, under Plots, click on the Object... button. By default, Object points
should be selected; go ahead and also select Objects and variables (biplot) with
Loadings specified as the Variable coordinates. Then, click the Continue button.
Next, under Plots, click on the Loading... button. By default, Display
component loadings should be selected; go ahead and also select Include
centroids, then click the Continue button.
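The plot selections correspond to the /PLOT subcommand; the sketch below is
our best approximation, and the exact keywords should be checked against the
pasted syntax:

  /PLOT=OBJECT BIPLOT(LOADING) LOADING(CENTR)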
Next, notice the Dimensions in solution: is listed as 2, but it could be
changed. Our example here clearly contains two dimensions, but if you did not
know the number of dimensions, you could specify as many as there are items in
the analysis. Remember, we generally approach CATPCA in a similar fashion as we
would a traditional exploratory PCA.
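In syntax, the number of retained dimensions is set by the /DIMENSION
subcommand (2 is the default):

  /DIMENSION=2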
Finally, you should click the Paste button, highlighted by the red ellipse in
the picture below. The reason we paste instead of simply clicking the OK button
is that a bug (or fault) has been noticed periodically with the CATPCA function.
This is discussed in greater detail below; in short, a space that should be
present in the syntax is missing, and its absence causes SPSS to omit a
desirable (and specified) table from the output under certain conditions.
Next, review the newly created syntax in the newly opened syntax editor
window. First, you will likely notice there is a substantial amount of syntax
associated with this analysis, most of which is attributable to the number of
items. Also notice that because we specified several optional tables, we have a
particularly long "/PRINT" statement. Attention should be paid to this line (or
lines), because the fault mentioned above occurs within the "/PRINT" statement.
See if you can find the fault (which is present in the picture below).
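In our runs, the flawed statement looked something like the sketch below (the
item names are hypothetical and the keyword order may differ); note the fused
keyword:

  /PRINT=DESCRIP(item1 TO item25) VAF OBJECTOCORR CORR HISTORY LOADING
     QUANT(item1 TO item25)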
Most of you will likely notice that "OBJECTOCORR" makes no sense and is one
of only a handful of things listed in black font. Here is where the notorious
missing space should be: between "OBJECT" and "OCORR" (the keyword that
requests the Original variable Correlation matrix). To correct the fault, simply
type a space between "OBJECT" and "OCORR", as can be seen below. Regardless of
the options specified in the point-and-click menus, you should always review the
syntax associated with a CATPCA, because other missing-space errors can occur in
the /PRINT statement.
Notice that when the missing space is inserted in the /PRINT statement, the
'smart' syntax editor recognizes the correct keywords "OBJECT" and "OCORR" by
listing them in red.
Next, we can highlight / select the entire syntax and then click the
run selection button to
complete the analysis.
The (rather substantial) output should be similar to what is presented below.
A text description of each output element appears below each picture.
The top of the output begins with a log of the syntax used to produce the
output. Then, there are the Title, Notes (hidden by default), Credit (citation),
and then the Case Processing summary -- which displays the number of cases and
number of cases with missing values.
Then, there are the Descriptive Statistics tables associated with each item
(variable) included in the analysis. Each of these frequency tables displays the
number of cases for each response choice in the original variables. Reviewing
these tables allows one to see how cases are distributed among the response
choices of each variable. After reviewing them, it is recommended that you use
the minus sign (-) in the left panel of the output window to hide those tables
by collapsing the output. The minus sign is marked by a red ellipse in the
figure above. Collapsing the output by hiding these tables allows us to navigate
between tables more easily.
The next table, Iteration History, displays the eigenvalues for each
iteration of the analysis. If we had not specified the iteration history in the
options, only the zeroth iteration and the last (11th) iteration would be
displayed. Recall that in standard PCA, we use the eigenvalues to determine how
many principal components should be retained; generally, components with
eigenvalues greater than one are retained. Here, we see that the standard PCA
solution (iteration 0 -- with all variables/items treated as numeric) results in
an eigenvalue of 8.44, while the CATPCA begins with an eigenvalue of 8.77 that
increases with each iteration. Eigenvalues are used to determine the percentage
of variance accounted for (a type of effect size), and therefore larger
eigenvalues are preferred over smaller ones. The point here is that because we
take into account the ordinal nature of the items (rather than simply running a
traditional PCA), we get a better solution (a higher eigenvalue).
Next is the Model Summary table, which displays the internal consistency
coefficient (Cronbach's alpha) for each dimension we specified (2 dimensions)
and for the combination of both dimensions (Total). Note that according to page
143 of the Categories user manual (for SPSS version 18, which was used here),
there should be a third column in this table containing the percentage of
variance accounted for by each dimension and by both dimensions (total).
Although that column is absent here, we can use the eigenvalues to calculate the
percentage of variance accounted for by each dimension and by both dimensions.
To calculate the variance accounted for, simply divide the eigenvalue by the
number of items included in the analysis. For instance, the first dimension
accounts for 19.988% of the variance in the optimally scaled matrix of 25
items.
Dimension 1: 4.997 / 25 = .19988 = 19.988%
Dimension 2: 3.917 / 25 = .15668 = 15.668%
Total: 8.914 / 25 = .35656 = 35.656%
So, our total model (both dimensions) accounts for 35.656% of the variance
in the optimally scaled items. Notice that the total eigenvalue is also
displayed in the Iteration History table (above).
The Quantifications tables display the frequency, the quantification value
assigned, the centroid coordinates, and the vector coordinates of each response
category for each item. The centroid coordinates are the average of all cases'
object scores for a particular category on each dimension. The vector
coordinates refer to the coordinates for each response category when the
categories are represented by a straight line between dimension 1 (x-axis) and
dimension 2 (y-axis) in a scatter plot. We could have generated these scatter
plots in the output, but their usefulness here is limited; instead, the items
(rather than each item's categorical responses) are the focus. So, like the
descriptive statistics tables, we can hide the quantification tables using the
minus sign (-) in the left panel of the output window.

The next table is the Variance Accounted For table, which is not intuitively
named, as it does not display the variance accounted for. It does, however,
display the coordinates of each item on each dimension in relation to the
centroid (0, 0) and when all the items are represented by a straight line
between dimension 1 (x-axis) and dimension 2 (y-axis). One thing to look for
here is items that display a very small mean coordinate, which indicates that
these items are not contributing substantially to the principal components.
Notice that items 3, 4, 8, 11, 16, and 22 are all very close to or below 0.100.
These items may not be contributing suitably to the principal components.
The Correlations Original Variables table displays the correlations among the
original variables, after missing values have been imputed with the mode of the
variables on which they were missing.
The Correlations Transformed Variables table displays the correlations among
the optimally scaled (transformed) variables. Recall, this is the correlation
matrix after optimal scaling has taken place, and this is the matrix used for
the PCA. Notice too that the eigenvalues for each dimension are displayed. We
specified only 2 dimensions / principal components, but you can see here what
the eigenvalues are for each subsequent dimension / component.
The next table displays the Object Scores for each case, although PASW / SPSS
truncates tables to 100 rows by default. You could double-click on the table to
open the pivot table editor and increase the number of rows displayed. These
object scores are really the coordinates associated with each case on each of
the two dimensions, which are plotted in the next element of the output, the
scatter plot shown above-right. We can see here that most cases are located near
the centroid (0, 0), with the majority of cases located between -2 and 2 on
dimension 1 and between -2 and 2 on dimension 2. We can also clearly see one
extreme outlying case (case 703).
The next table, Component Loadings, shows the coordinates for each item on
each dimension, which are plotted in the next element of the output, the scatter
plot displayed above-right. Here, we can see how the items relate to one another
and to the two dimensions. We can see that the first ten items tend to coalesce
in the upper range of both dimension 1 and dimension 2, whereas the other 15
items tend to coalesce at the lower range of dimension 1 while varying
substantially along dimension 2. Recall from above, items 3, 4, 8, 11, 16, and
22 were suspect, based on their average centroid coordinates and total vector
coordinates from the ineptly named Variance Accounted For table. Here, we see
that those items are closest to the centroid and noticeably distant from what we
can see are the two principal components (the cluster of items 1 - 10 and the
cluster of items 11 - 25).
Incidentally, for those unfamiliar with eigenvectors and eigenvalues, one can
loosely say that the lines going from the centroid to each item are
'eigenvectors' and that each item sits at a distance along its vector
corresponding to an 'eigenvalue'. So, an eigenvalue can be thought of as a
distance along an eigenvector. In traditional PCA, we often use a rotation
strategy to ease interpretation. So, imagine rotating both dimensions 45 degrees
counter-clockwise (or anti-clockwise); then each dimension axis would
essentially pass through a cloud of points / items.
Finally, we get a scatter plot with each item (black) and each case (blue)
plotted along dimension 1 and dimension 2. Here, we can see that dimension 1
captures more of the variance among the items and cases than dimension 2, on
which items and cases are more condensed (less variance) and overlap one
another.
2.) The second example will include only the 19 retained items after
dropping items 3, 4, 8, 11, 16, and 22.
Now, rather than go back through each step and each element of the output, we
have below selected output from a second CATPCA in which we removed items 3, 4,
8, 11, 16, and 22. It is important to note that with an iterative analysis,
results may vary slightly.
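For those following along in syntax, the second run simply removes the six
dropped items from the variable lists. The sketch below again assumes the
hypothetical item names item1 through item25; check the keywords against your
own pasted syntax:

  CATPCA VARIABLES=item1 item2 item5 item6 item7 item9 item10 item12
      item13 item14 item15 item17 item18 item19 item20 item21 item23
      item24 item25
    /ANALYSIS=item1 item2 item5 item6 item7 item9 item10 item12
      item13 item14 item15 item17 item18 item19 item20 item21 item23
      item24 item25 (WEIGHT=1,LEVEL=ORDI)
    /DIMENSION=2
    /PRINT=DESCRIP VAF OBJECT OCORR CORR HISTORY LOADING
    /PLOT=OBJECT BIPLOT(LOADING) LOADING(CENTR).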
We can see in the Model Summary table that our internal consistency
coefficient increased from 0.925 with all 25 items to 0.929 with only 19 items.
If we calculate the variance accounted for, we come up with 24.047% of the
variance accounted for by dimension 1, 19.900% by dimension 2, and 43.947% of
the variance in our 19 items accounted for by the total model (both dimensions).
This compares favorably with the 35.656% total variance accounted for when all
25 items were included. So, we have fewer items, but we are accounting for more
of the variance in those 19 items than was accounted for in the 25 items when
all 25 were included.
Here we see the 'clean' or tight grouping of items on each of the two
principal components. Notice too that without the six poor items, our items have
'moved' in relation to the dimensions, essentially switching orientation.
As with most of the tutorials / pages within this site, this page should not
be considered an exhaustive review of the topic covered and it should not be
considered a substitute for a good textbook.