Correspondence Analysis
Correspondence analysis is appropriate when attempting to determine the proximal
relationships among two or more categorical variables. Using correspondence
analysis with categorical variables is analogous to using correlation analysis
and principal components analysis for continuous or nearly continuous variables.
Both give the researcher insight into the relationships among variables
and the dimensions or eigenvectors underlying them. A key part of correspondence
analysis is the multi-dimensional map produced as part of the output. The
correspondence map allows researchers to visualize the relationships among
categories spatially on dimensional axes; in other words, which categories are
close to other categories on empirically derived dimensions.
Unlike correlation, correspondence analysis is nonparametric: it is not based on
a distribution (or distributional assumption), so it does not offer statistical
significance tests for comparing models. Comparison of different models (e.g.
different variables entered/removed) should be done with categorical or logistic
regression. Again, correspondence analysis requires categorical variables only.
Correspondence analysis accepts nominal variables, ordinal variables, and/or
discretized interval/ratio variables (e.g. quartiles), although creating
discrete categories from a continuous variable is generally discouraged.
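As a rough illustration of where the empirically derived dimensions come from,
the sketch below runs the textbook computation (a singular value decomposition
of the standardized residuals from independence) on a small table. The 3 x 3
counts are hypothetical and are not the tutorial data, and this is not
necessarily the exact routine SPSS uses internally. Note that a table yields at
most min(rows, columns) - 1 dimensions, which is why a 4 x 4 table can produce
no more than 3.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (illustration only, not the tutorial data).
counts = np.array([
    [20.0, 30.0, 10.0],
    [25.0, 25.0, 30.0],
    [15.0, 20.0, 45.0],
])

n = counts.sum()
P = counts / n              # correspondence matrix (cell proportions)
r = P.sum(axis=1)           # row masses
c = P.sum(axis=0)           # column masses

# Standardized residuals from the independence model r * c.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Only min(rows, columns) - 1 dimensions carry inertia; the last
# singular value is zero up to rounding error.
k = min(counts.shape) - 1
sv = sv[:k]
inertia = sv ** 2                   # inertia per dimension
total_inertia = inertia.sum()

# Principal coordinates of the row and column categories.
row_coords = (U[:, :k] * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :k] * sv) / np.sqrt(c)[:, None]

# Pearson chi-square computed directly, to compare with n * total inertia.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
chi_square = ((counts - expected) ** 2 / expected).sum()

print("singular values:", sv)
print("proportion of inertia:", inertia / total_inertia)
print("n * total inertia:", n * total_inertia, "chi-square:", chi_square)
```

The coordinates here are principal coordinates; SPSS rescales its printed
scores according to the chosen normalization, so its scores need not match
these values even for the same table.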
For the duration of this tutorial we will be using the
IntroPsych_Fall2009.sav file, which is fictitious and contains 1500
participants' responses on the following variables: code (sequential numbers
which identify each participant); sex; age; family_income (four income
brackets); HS_GPA (high school grade point average brackets); IQ (intelligence
as measured by the Wechsler Adult Intelligence Scale, Fourth Edition); class_standing
(freshman, sophomore, junior, senior); drinks_week (number of alcoholic drinks
consumed in a typical week); confidence (self rating of how much confidence the
student has in their ability to achieve desired grades in college courses
[possible range: 0-20]); hardworker (self rating of how much effort the student
puts toward their college classes [possible range: 0-20]); number_grade (numeric
course grade for the Introduction to Psychology course); final_grade (course
letter grade for the Introduction to Psychology course).
1.) The first example will explore a two-way relationship between the 4
categories of family_income and the 4 categories of class_standing. We would
expect weak relationships between family income and the members of each class;
for example, family income should have no relation with a student being a
freshman, sophomore, junior or senior.
Begin by clicking on Analyze, Data Reduction, Correspondence Analysis...
Next, highlight / select the family_income variable and use the top arrow
button to move it into the Row: box. Then, click the top Define Range... button
and type a 1 for the minimum value and type a 4 for the maximum value. Then
click the Update button; then click the Continue button.
Next, highlight / select the class_standing variable and use the bottom arrow
button to move it to the Column: box. Then, click the Define Range... button.
Next, type a 1 in the Minimum value: box and type a 4 in the Maximum value: box,
then click the Update button. Then, click the Continue button.
Next, click on the Statistics... button. By default the following should be
selected: Correspondence table, Overview of row points, and Overview of column
points. Also select Row profiles and Column profiles, as well as Confidence
Statistics for Row points and Column points. Then, click the Continue button.
Next, click on the Plots... button and select: Row points, Column points,
Transformed row categories, and Transformed column categories. By default, the
Biplot should be selected already. Next, click the Continue button, then click
the OK button.
The output should be similar to what is displayed below.
The Correspondence Table displays the frequency for each category of each
variable; it is essentially a cross-tabulation frequency table.
The Row Profiles table displays the proportions of each column value across
each row. For instance, there are 23 freshmen out of all 207 students whose
family income is 00000 - 25000; 23 is 11.1% of 207. The Mass values across the
bottom refer to the column's proportion of the total sample size. For instance,
213 freshmen represent 14.2% of the 1500 student total sample.
The Column Profiles table displays the proportions of each row value down
each column. For instance, 23 students' family income is 00000 - 25000 out of
all 213 students who are freshmen; 23 is 10.8% of 213. The Mass values down the
right-most column represent each row's proportion of the total sample size. For
instance, 207 students whose family income is 00000 - 25000 represent 13.8% of
the 1500 student total sample.
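The profile and mass values are simple proportions. As a quick check, the
sketch below reproduces the four percentages quoted above using only the counts
mentioned in the text; the variable names are our own and the rest of the table
is not reproduced.

```python
# Counts quoted in the text (the full correspondence table is not shown here).
n_total = 1500        # all students in the sample
n_low_income = 207    # row total: family income 00000 - 25000
n_freshmen = 213      # column total: freshmen
n_cell = 23           # freshmen whose family income is 00000 - 25000

row_profile = n_cell / n_low_income     # share of that income row who are freshmen
column_profile = n_cell / n_freshmen    # share of freshmen who fall in that income row
row_mass = n_low_income / n_total       # the row's share of the whole sample
column_mass = n_freshmen / n_total      # the column's share of the whole sample

for label, value in [
    ("Row profile (freshmen within lowest income)", row_profile),
    ("Column profile (lowest income within freshmen)", column_profile),
    ("Row mass (lowest income)", row_mass),
    ("Column mass (freshmen)", column_mass),
]:
    print(f"{label}: {value:.1%}")
```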
The Summary table displays a variety of useful information. First, we see
that 3 dimensions were derived, but only two are interpretable (i.e. only two
dimensions account for a supposedly meaningful proportion of the total inertia
value). The Singular Value column displays the canonical correlation between the
two variables for each dimension. The Inertia column displays the inertia value
for each dimension (the square of its singular value) and the total inertia
value. The total inertia value represents the amount of variance accounted for
in the original correspondence table by the total model. Each dimension's
inertia value thus refers to the amount of that total variance which is
accounted for by that dimension.
So, for instance, we could say that the total model explains 0.9% of the
variance in the original correspondence table and that dimension 1 accounts for
0.8 of those 0.9 percentage points. Stated another way, our model accounts for
only 0.9% of the variance in the original correspondence table, and most of
that (small) percentage is explained by dimension 1. The chi-square test
evaluates the null hypothesis that the total inertia value is zero. Here, our
sig. or p-value is greater than 0.05 (a common cutoff value), which indicates
our total inertia value is not significantly different from zero. Keep in mind,
this chi-square is not a model fit statistic; it does not lend itself to
comparing models with different variables, as chi-square is often used. It only
tests the inertia value against zero. The Proportion of Inertia columns
represent the proportion of total inertia accounted for by each dimension; for
example, dimension 1 (.008) accounts for 86.6% of total inertia (.009). The
Standard Deviation column refers to the standard deviation of the Singular
Value(s) and the Correlation column refers to the correlation between
dimensions.
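The relationships among the Summary table columns can be checked with a little
arithmetic. The sketch below uses only the rounded values quoted above (.008,
.009, N = 1500), so its results are approximate: the rounded inertias give a
proportion of roughly 88.9% rather than the reported 86.6%, presumably because
SPSS works from unrounded values. It also relies on the standard identity that
total inertia equals the Pearson chi-square divided by N, and on the tabled 5%
critical value of chi-square with 9 degrees of freedom (16.919).

```python
import math

# Rounded values quoted in the text; SPSS itself uses more decimal places.
n_total = 1500
inertia_dim1 = 0.008
total_inertia = 0.009
n_rows, n_cols = 4, 4

# A dimension's inertia is its singular value squared.
singular_value_dim1 = math.sqrt(inertia_dim1)

# Proportion of total inertia accounted for by dimension 1 (approximate,
# because both inputs are rounded to three decimals).
proportion_dim1 = inertia_dim1 / total_inertia

# Total inertia = chi-square / N, so chi-square is roughly N * total inertia.
chi_square = n_total * total_inertia
df = (n_rows - 1) * (n_cols - 1)

# Upper 5% critical value of chi-square with 9 degrees of freedom.
CRITICAL_05_DF9 = 16.919
significant = chi_square > CRITICAL_05_DF9

print(f"singular value, dimension 1: {singular_value_dim1:.3f}")
print(f"proportion of inertia, dimension 1: {proportion_dim1:.1%}")
print(f"approximate chi-square: {chi_square:.1f} on {df} df")
print(f"significant at 0.05: {significant}")
```

The approximate chi-square falls below the critical value, which agrees with
the non-significant result reported in the Summary table.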
The Overview Row Points table displays values which allow the researcher to
evaluate how each row contributes to the dimensions and how each dimension
contributes to the rows. The Mass (as mentioned above) is simply the proportion
of each row to the total (1500). The Score in Dimension columns display each
row's score on dimension 1 and dimension 2. The scores are derived based on the
proportions (mass) for each cell, column, and row when compared to total sample;
the scores are representative of dimensional distance and are used in the graphs
below. The Inertia column shows how much of the total inertia value each row
accounts for. The Contribution Of Point to Inertia of Dimension
columns show the role each row plays in each dimension; these are analogous to
factor or component loadings. The Contribution Of Dimension to Inertia of Point
columns show the role each dimension plays in each row; these are not the
inverse or opposite of the previous two columns because each dimension is
composed of multiple points. The Total column represents the sum of each
dimension's role in the row.
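To make the two kinds of contribution concrete, the sketch below computes them
for the rows of a small table using the standard textbook definitions. The
counts are hypothetical (not the tutorial data), the variable names are our
own, and the coordinates are principal coordinates, which may be scaled
differently from the scores SPSS prints under its chosen normalization.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (illustration only, not the tutorial data).
counts = np.array([
    [20.0, 30.0, 10.0],
    [25.0, 25.0, 30.0],
    [15.0, 20.0, 45.0],
])

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)            # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, _ = np.linalg.svd(S, full_matrices=False)

k = min(counts.shape) - 1                      # dimensions that carry inertia
U, sv = U[:, :k], sv[:k]
F = (U * sv) / np.sqrt(r)[:, None]             # row principal coordinates

# Each row's share of each dimension's inertia: mass * coordinate^2.
weighted_sq = r[:, None] * F ** 2

# Inertia column: how much of the total inertia each row accounts for.
row_inertia = weighted_sq.sum(axis=1)

# Contribution of point to inertia of dimension (each column sums to 1).
point_to_dim = weighted_sq / sv ** 2

# Contribution of dimension to inertia of point (each row sums to 1 here
# because every dimension is kept).
dim_to_point = weighted_sq / row_inertia[:, None]

print("row inertia:", row_inertia)
print("point -> dimension:\n", point_to_dim)
print("dimension -> point:\n", dim_to_point)
```

When fewer dimensions are retained than were derived (two of three in this
tutorial), the dimension-to-point contributions for a row can sum to less than
1, which is what the Total column reports.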
The Overview Column Points table displays values which allow the researcher to
evaluate how each column contributes to the dimensions and how each dimension
contributes to the columns. The Mass (as mentioned above) is simply the
proportion of each column to the total (1500). The Score in Dimension columns
display each column's score on dimension 1 and dimension 2. The scores are
derived based on the proportions (mass) for each cell, column, and row when
compared to total sample; the scores are representative of dimensional distance
and are used in the graphs below. The Inertia column shows how much of the total
inertia value each column accounts for. The Contribution Of Point to Inertia of
Dimension columns show the role each column plays in each dimension; these are
analogous to factor or component loadings. The Contribution Of Dimension to
Inertia of Point columns show the role each dimension plays in each column;
these are not the inverse or opposite of the previous two columns because each
dimension is composed of multiple points. The Total column represents the sum of
each dimension's role in the column.
The confidence points tables display the standard deviation of each point's
dimension score, as well as the correlation between each point's dimension
scores. Recall, the scores themselves are displayed in previous tables (above).
The first two graphs show the score for each category of Family Income on
dimension 1 and dimension 2.
The next two graphs show the score for each category of Class Standing
on dimension 1 and dimension 2.
The next two graphs show the scores for each category on both dimensions (at
once) for Family Income and Class Standing.

Finally, the correspondence map shows each category's score on both dimensions
(at once) for both family income and class standing. Now we can see
the usefulness of scores as measures of distance on the two interpreted
dimensions of our model. The scores allow us to compare categories across
variables in (in this case) two-dimensional space. Remember, correlation is a
standardized measure of relationship between two (typically) continuous
variables. Correspondence is a standardized measure of relationship (in
space/distance) between categories of multiple variables (in this case two). It
is important to note that the dimensions are empirically derived axes or
eigenvectors and not simply the variables entered into the analysis. So, we
could say that juniors appear to have family incomes between 50 and 75 thousand
dollars. But, given that our total inertia value of 0.009 is not significantly
different from zero, we really cannot have confidence in this data's ability to
offer conclusions about the general population. The model is not good at all,
with only 0.9% of the variance in the original correspondence table accounted
for by the total model (all three dimensions, only two of which were
interpreted).
As with most of the tutorials / pages within this site, this page should not
be considered an exhaustive review of the topic covered and it should not be
considered a substitute for a good textbook.