Research and Statistical Support

MODULE 9

Correspondence Analysis

Correspondence analysis is appropriate when attempting to determine the proximal relationships among two or more categorical variables. Using correspondence analysis with categorical variables is analogous to using correlation analysis and principal components analysis for continuous or nearly continuous variables. Both approaches provide the researcher with insight into the relationships among variables and the dimensions or eigenvectors underlying them. A key part of correspondence analysis is the multi-dimensional map produced as part of the output. The correspondence map allows researchers to visualize the relationships among categories spatially on dimensional axes; in other words, it shows which categories are close to other categories on empirically derived dimensions.

Unlike correlation, correspondence analysis is nonparametric and does not offer a statistical significance test of the relationships it displays, because it is not based on a distribution (or distributional assumption). Comparison of different models (e.g., different variables entered/removed) should be done with categorical or logistic regression. Again, correspondence analysis requires categorical variables only. It accepts nominal variables, ordinal variables, and/or discretized interval/ratio variables (e.g., quartiles), although creating discrete categories from a continuous variable is generally discouraged.

For the duration of this tutorial we will be using the IntroPsych_Fall2009.sav file, which is fictitious and contains 1500 participants' responses on the following variables: code (sequential numbers which identify each participant); sex; age; family_income (four income brackets); HS_GPA (high school grade point average brackets); IQ (intelligence as measured by the Wechsler Adult Intelligence Scale, version IV); class_standing (freshman, sophomore, junior, senior); drinks_week (number of alcoholic drinks consumed in a typical week); confidence (self rating of how much confidence the student has in their ability to achieve desired grades in college courses [possible range: 0-20]); hardworker (self rating of how much effort the student puts toward their college classes [possible range: 0-20]); number_grade (numeric course grade for the Introduction to Psychology course); final_grade (course letter grade for the Introduction to Psychology course).

1.) The first example will explore a two-way relationship between the 4 categories of family_income and the 4 categories of class_standing. We would expect only a weak relationship between family income and class standing; for example, family income should be unrelated to whether a student is a freshman, sophomore, junior, or senior.

Begin by clicking on Analyze, Data Reduction, Correspondence Analysis...

Next, highlight / select the family_income variable and use the top arrow button to move it into the Row: box. Then, click the top Define Range... button and type a 1 for the minimum value and type a 4 for the maximum value. Then click the Update button; then click the Continue button.

Next, highlight / select the class_standing variable and use the bottom arrow button to move it to the Column: box. Then, click the Define Range... button. Next, type a 1 in the Minimum value: box and type a 4 in the Maximum value: box, then click the Update button. Then, click the Continue button.

Next, click on the Statistics... button. By default the following should be selected: Correspondence table, Overview of row points, and Overview of column points. Also select Row profiles and Column profiles, as well as Confidence Statistics for Row points and Column points. Then, click the Continue button.

Next, click on the Plots... button and select: Row points, Column points, Transformed row categories, and Transformed column categories. By default, the Biplot should be selected already. Next, click the Continue button, then click the OK button.

The output should be similar to what is displayed below.

The Correspondence Table displays the frequency for each category of each variable; it is essentially a cross-tabulation frequency table.
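To make the idea of a cross-tabulation concrete, here is a minimal sketch in Python of how such a frequency table could be built from coded responses. The records below are made up for illustration and are not taken from IntroPsych_Fall2009.sav; the function name crosstab is likewise hypothetical. It only assumes both variables are coded 1 through 4, as in the Define Range steps above.

```python
from collections import Counter

# Hypothetical (family_income, class_standing) codes, each in the range 1-4.
# These eight records are invented for illustration only.
records = [(1, 1), (1, 2), (2, 2), (3, 3), (4, 4), (2, 2), (3, 1), (4, 3)]

def crosstab(pairs, n_rows=4, n_cols=4):
    """Return an n_rows x n_cols frequency table from (row_code, col_code) pairs."""
    counts = Counter(pairs)  # missing combinations count as 0
    return [[counts[(r, c)] for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]

table = crosstab(records)
for row in table:
    print(row)
```

Each printed row corresponds to one family income category and each position within it to one class standing category, which is the same layout as the Correspondence Table.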

The Row Profiles table displays the proportions of each column value across each row. For instance, there are 23 freshmen out of all 207 students whose family income is 00000 - 25000; 23 is 11.1% of 207. The Mass values across the bottom refer to each column's proportion of the total sample size. For instance, 213 freshmen represent 14.2% of the 1500 student total sample.

The Column Profiles table displays the proportions of each row value down each column. For instance, 23 students' family income is 00000 - 25000 out of all 213 students who are freshmen; 23 is 10.8% of 213. The Mass values down the right-most column represent each row's proportion of the total sample size. For instance, 207 students whose family income is 00000 - 25000 represent 13.8% of the 1500 student total sample.
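The arithmetic behind the profile and mass values can be sketched in a few lines of Python. The 2 x 3 table named demo is hypothetical and is not the tutorial's data; only the last four lines use figures actually quoted in the text (23, 207, 213, and 1500). The function names are illustrative, not part of SPSS.

```python
def row_profiles(table):
    """Each cell divided by its row total, so every row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in table]

def column_profiles(table):
    """Each cell divided by its column total, so every column sums to 1."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / col_totals[j] for j, cell in enumerate(row)] for row in table]

def masses(table):
    """Row and column masses: marginal totals divided by the grand total."""
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    return row_mass, col_mass

# Hypothetical 2 x 3 frequency table (NOT the tutorial's data).
demo = [[10, 20, 30],
        [20, 10, 10]]
print(row_profiles(demo))
print(column_profiles(demo))
print(masses(demo))

# Figures quoted in the text above.
print(round(100 * 23 / 207, 1))    # row profile cell, in percent
print(round(100 * 23 / 213, 1))    # column profile cell, in percent
print(round(100 * 213 / 1500, 1))  # column mass for freshmen, in percent
print(round(100 * 207 / 1500, 1))  # row mass for the lowest income bracket, in percent
```

The last four lines reproduce the 11.1%, 10.8%, 14.2%, and 13.8% values discussed above.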

The Summary table displays a variety of useful information. First, we see that 3 dimensions were derived, but only two are interpretable (i.e., only two dimensions account for a supposedly meaningful proportion of the total inertia value). The Singular Value column displays the canonical correlation between the two variables for each dimension. The Inertia column displays the inertia value for each dimension and the total inertia value. The total inertia value represents the amount of variance accounted for in the original correspondence table by the total model. Each dimension's inertia value thus refers to the amount of that total variance which is accounted for by that dimension. So, for instance, we could say that dimension 1 accounts for 0.8% of the 0.9% of the total variance our model explains in the original correspondence table. Stated another way, our model accounts for only 0.9% of the variance in the original correspondence table, and of that (small) percentage, dimension 1 explains 0.8%. The chi-square test tests the null hypothesis that the total inertia value is not different from zero. Here, our Sig. or p-value is greater than 0.05 (a common cutoff value), which indicates that our total inertia value is not significantly different from zero. Keep in mind that this chi-square is not a model fit statistic; it does not lend itself to comparing models with different variables, as chi-square is often used. It only tests the inertia value against zero. The Proportion of Inertia columns represent the proportion of total inertia for each dimension; for example, dimension 1 (.008) accounts for 86.6% of total inertia (.009). The percentage is computed from unrounded inertia values, so it differs slightly from .008 divided by .009. The Standard Deviation column refers to the standard deviation of the singular value(s), and the Correlation column refers to the correlation between dimensions.
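The link between the chi-square statistic and total inertia can be illustrated with a short Python sketch. It relies on the standard identity that, for a two-way table, total inertia equals the Pearson chi-square statistic divided by the total sample size. The table named demo is hypothetical (not the tutorial's data), and the function names are illustrative. The final lines apply the identity to the rounded values reported above (0.009 and 1500), so the result is only approximate.

```python
def chi_square(table):
    """Pearson chi-square statistic for a two-way frequency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    return stat

def total_inertia(table):
    """Total inertia = chi-square divided by the grand total."""
    n = sum(sum(row) for row in table)
    return chi_square(table) / n

def max_dimensions(table):
    """Largest number of dimensions: min(rows, columns) - 1."""
    return min(len(table), len(table[0])) - 1

def degrees_of_freedom(table):
    """Degrees of freedom for the chi-square test: (rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Hypothetical 2 x 3 frequency table (NOT the tutorial's data).
demo = [[10, 20, 30],
        [20, 10, 10]]
print(chi_square(demo), total_inertia(demo))
print(max_dimensions(demo), degrees_of_freedom(demo))

# Rough check with the rounded values from the Summary table:
# total inertia of about 0.009 with 1500 students implies a chi-square near 13.5.
approx_chi_square = 0.009 * 1500
print(approx_chi_square)
```

For a 4 x 4 table such as family_income by class_standing, max_dimensions gives 3 (matching the three dimensions in the Summary table) and degrees_of_freedom gives 9.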

The Overview Row Points table displays values which allow the researcher to evaluate how each row contributes to the dimensions and how each dimension contributes to the rows. The Mass (as mentioned above) is simply the proportion of each row to the total (1500). The Score in Dimension columns display each row's score on dimension 1 and dimension 2. The scores are derived from the proportions (mass) for each cell, column, and row when compared to the total sample; the scores are representative of dimensional distance and are used in the graphs below. The Inertia column shows the amount of variance each row accounts for of the total inertia value. The Contribution Of Point to Inertia of Dimension columns show the role each row plays in each dimension; these are analogous to factor or component loadings. The Contribution Of Dimension to Inertia of Point columns show the role each dimension plays in each row. These are not the inverse or opposite of the previous two columns because each dimension is composed of multiple points. The Total column represents the sum of each dimension's role in the row.

The Overview Column Points table displays values which allow the researcher to evaluate how each column contributes to the dimensions and how each dimension contributes to the columns. The Mass (as mentioned above) is simply the proportion of each column to the total (1500). The Score in Dimension columns display each column's score on dimension 1 and dimension 2. The scores are derived from the proportions (mass) for each cell, column, and row when compared to the total sample; the scores are representative of dimensional distance and are used in the graphs below. The Inertia column shows the amount of variance each column accounts for of the total inertia value. The Contribution Of Point to Inertia of Dimension columns show the role each column plays in each dimension; these are analogous to factor or component loadings. The Contribution Of Dimension to Inertia of Point columns show the role each dimension plays in each column. These are not the inverse or opposite of the previous two columns because each dimension is composed of multiple points. The Total column represents the sum of each dimension's role in the column.
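The quantities in the two Overview tables can be sketched for the first dimension using the usual textbook formulation of correspondence analysis: a singular value decomposition of the standardized residuals. The Python sketch below is not SPSS's implementation. It finds only the leading dimension, uses simple power iteration in place of a full decomposition, and reports principal coordinates, which may differ by a scaling factor (and by sign) from the scores SPSS prints under its normalization setting. The table named demo and all function names are hypothetical.

```python
import math

def standardized_residuals(table):
    """Return S with S[i][j] = (p_ij - r_i*c_j) / sqrt(r_i*c_j), plus the masses r and c."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]         # row masses
    c = [sum(col) / n for col in zip(*table)]   # column masses
    s = [[(x / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j, x in enumerate(row)] for i, row in enumerate(table)]
    return s, r, c

def first_dimension(table, iterations=500):
    """Leading dimension for the rows, via power iteration on S * S^T.

    Assumes the table shows at least some association and that no row
    profile is exactly equal to the average profile.
    """
    s, r, c = standardized_residuals(table)
    rows, cols = range(len(s)), range(len(c))
    u = [1.0 / (i + 1) for i in rows]           # arbitrary starting vector
    for _ in range(iterations):
        v = [sum(s[i][j] * u[i] for i in rows) for j in cols]   # S^T u
        u = [sum(s[i][j] * v[j] for j in cols) for i in rows]   # S v
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    v = [sum(s[i][j] * u[i] for i in rows) for j in cols]
    singular_value = math.sqrt(sum(x * x for x in v))
    # Principal coordinate of each row on dimension 1.
    coords = [singular_value * u[i] / math.sqrt(r[i]) for i in rows]
    # Contribution of each point to the inertia of dimension 1 (sums to 1).
    point_to_dim = [x * x for x in u]
    # Contribution of dimension 1 to the inertia of each point.
    row_inertia = [sum(x * x for x in row) for row in s]
    dim_to_point = [r[i] * coords[i] ** 2 / row_inertia[i] for i in rows]
    return singular_value, coords, point_to_dim, dim_to_point

# Hypothetical 3 x 3 frequency table (NOT the tutorial's data).
demo = [[30, 10, 5],
        [10, 25, 10],
        [5, 15, 40]]
sv, coords, point_to_dim, dim_to_point = first_dimension(demo)
print("singular value:", sv, "inertia of dimension 1:", sv ** 2)
print("row coordinates:", coords)
print("contribution of point to dimension:", point_to_dim)
print("contribution of dimension to point:", dim_to_point)
```

In this sketch the contributions of points to the dimension sum to 1 across rows, while the contributions of the dimension to each point are proportions of that single row's inertia. That is why the two sets of columns are not simple inverses of one another.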

The confidence points tables display the standard deviation of each point's dimension score, as well as the correlation between each point's dimension scores. Recall, the scores themselves are displayed in previous tables (above).

The first two graphs show the score for each category of Family Income on dimension 1 and dimension 2.

The next two graphs show the score for each category of Class Standing on dimension 1 and dimension 2.

The next two graphs show the scores for each category on both dimensions (at once) for Family Income and Class Standing.

Finally, the correspondence map shows each category score on both dimensions (at once) for both family income and class standing (at once). Now we can see the usefulness of scores as measures of distance on the two interpreted dimensions of our model. The scores allow us to compare categories across variables in (in this case) two-dimensional space. Remember, correlation is a standardized measure of relationship between two (typically) continuous variables. Correspondence is a standardized measure of relationship (in space/distance) between categories of multiple variables (in this case two). It is important to note that the dimensions are empirically derived axes or eigenvectors and not simply the variables entered into the analysis. So, we could say that juniors appear to have family incomes between 50 and 75 thousand dollars. However, given that our total inertia value of 0.009 is not significantly different from zero, we really cannot have confidence in the ability of these data to offer conclusions about the general population. The model is not good at all, with only 0.9% of the variance in the original correspondence table accounted for by the total model (all three dimensions, only two of which were interpreted).

As with most of the tutorials / pages within this site, this page should not be considered an exhaustive review of the topic covered and it should not be considered a substitute for a good textbook.