Ade4TkGUI - A GUI for Multivariate Analysis and Graphical Display in R
Link to the last RSS article here: Tinn-R: A Convenient Script Editor for R on the Win32 Platform - Ed.
By Dr Rich Herrington, ACS Research and Statistical Support Services Consultant
This month we take a look at some advanced functionality in R that is available from a drop-down menu system. Ade4 is a package for multivariate analysis and graphical display for the environmental sciences, and much of its functionality will be useful to researchers in the social sciences as well. The package and accompanying documentation can be downloaded from the CRAN website - Ade4, Ade4TkGUI. A more complete tutorial on using Ade4 can be found at the website for Ade4. Methodologies in this package that will be of interest to researchers include (see Table 1): Principal Component Analysis; Centered and Un-centered Correspondence Analysis; Multiple Correspondence Analysis; Fuzzy Correspondence Analysis; and methods to analyze mixtures of quantitative (interval) and qualitative (ordinal, categorical) variables. Additionally, Ade4 implements: Linear Discriminant Analysis; Canonical Correlation Analysis; and statistical tests for between-group clusters based on Monte-Carlo and permutation techniques.
In this article we will look at how to use Ade4 to implement "Correspondence Analysis" (CA). CA falls under a wider class of multivariate techniques called "ordination methods" (e.g. Principal Component Analysis, Multidimensional Scaling, etc.). These methods order objects on derived continua (subject to some optimization criterion) such that similar objects are nearer one another and dissimilar objects are further from one another. Graphical depiction of these derived continua allows graphical clustering of objects (e.g. row objects and column objects in the data table). The field of Psychometrics has contributed greatly to the development of these methodologies (see the journal Psychometrika).
CA is based on a matrix decomposition algorithm called Singular Value Decomposition (SVD) and bears the greatest resemblance to a class of lesser-known techniques that are more generally known as "optimal scaling" methods - variously known as Dual Scaling, Optimal Scoring, Reciprocal Averaging, Homogeneity Analysis, or Alternating Least Squares (ALS) methods (see the work published under the pseudonym Albert Gifi). These methods derive reduced rank representations (e.g. a reduced set of coordinate axes), or lower dimensional components, of categorical and ordinal data by an iterative algorithm that transforms the categorical scaling of the variables into optimally derived numerical scales. These algorithms (and their variants) are iteratively applied, constrained least squares optimization procedures: an eigenvalue/eigenvector extraction algorithm is applied repeatedly, subject to certain row and column constraints, and the resulting eigenvectors represent independent modes of variation in the original scores (for a readable account see William Jacoby, 1999). As a result, nonlinear multivariate associations between sets of variables can be uncovered. These methods will be valuable in situations where survey researchers are interested in data tables where (for example) the relationship of the rows (respondents) to the columns (items or survey questions) is of interest, regardless of whether the rows and columns represent nominal or ordinal level measured variables. These SVD-based techniques for categorical or ordinal data bear a theoretical resemblance to techniques for interval level data that allow comparisons between respondents (rows) and items (columns) in "reduced subspaces" (coordinate systems), the so-called "latent variable" models (for a readable account see Bollen). For example, Factor Analyses (FA) are applied to interval measured data where the measured (or manifest) data are an indicator of an unobserved, latent, continuous variable.
Item Response Theory Analyses (IRT) are applied to dichotomous or polytomous ordinal measured data that are an indicator of an unobserved, latent, continuous variable. Latent Class Analysis (LCA) is applied to data that are nominal or categorical in composition, assuming an underlying latent category for observed responses. What these methods all have in common is that they derive a reduced set of factors, components, or categories for observed or latent scores where the relationship between rows and columns is of interest (e.g. respondent by category, respondent by item, category by item, item by item, etc.). The methods in Ade4 (e.g. correspondence analysis) are primarily descriptive: they are not model-based (unlike, for example, Maximum Likelihood Factor Analysis) and do not involve the estimation of sampling variability or interval estimation for parameters. There are, however, some nonparametric statistical tests available (i.e. resampling or permutation based tests). Additionally, the eigenvalue/eigenvector extraction procedures (i.e. the SVD algorithm) and the subsequent common scaling of the coordinate system allow researchers to explore item and respondent similarity graphically (e.g. with biplots). As a set of exploratory methods, these techniques are indispensable for reducing the complexity of multivariate data so that interrelationships amongst sets of variables (respondents, items, and other important covariates) may be uncovered. My intention in this article is to demonstrate the steps necessary to produce an analysis with Ade4 and Ade4TkGUI and to demonstrate how ordination based methods can be useful for survey research.
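The optimal scaling idea described above, reciprocal averaging in particular, can be made concrete with a few lines of base R. The following is a minimal sketch under stated assumptions, not the algorithm Ade4 itself uses: the table N is made-up data, and the function name reciprocal_averaging and its arguments are hypothetical. Row scores are repeatedly set to the weighted average of the column scores and vice versa, and the dominant eigenvalue found this way is compared with the one obtained from an SVD of the standardized residuals.

```r
# Hypothetical 4 x 3 table of counts (illustration only, not the SWLS data).
N <- matrix(c(20,  5,  2,
               8, 15,  6,
               3,  9, 18,
               1,  4, 12),
            nrow = 4, byrow = TRUE)

# Reciprocal averaging: alternate weighted averages of row and column scores.
reciprocal_averaging <- function(N, tol = 1e-10, max_iter = 1000) {
  P      <- N / sum(N)
  r_mass <- rowSums(P)                       # row masses (weights)
  c_mass <- colSums(P)                       # column masses (weights)
  y <- seq_len(ncol(N))                      # arbitrary non-constant start
  y <- y - sum(c_mass * y)                   # centre with column masses
  y <- y / sqrt(sum(c_mass * y^2))           # scale to unit weighted variance
  lambda <- NA_real_
  for (iter in seq_len(max_iter)) {
    x     <- as.vector(P %*% y) / r_mass         # row scores from column scores
    y_new <- as.vector(crossprod(P, x)) / c_mass # column scores from row scores
    y_new <- y_new - sum(c_mass * y_new)
    lambda <- sqrt(sum(c_mass * y_new^2))        # shrinkage over one full cycle
    y_new <- y_new / lambda
    converged <- max(abs(y_new - y)) < tol
    y <- y_new
    if (converged) break
  }
  list(eigenvalue = lambda, row_scores = x, col_scores = y, iterations = iter)
}

ra <- reciprocal_averaging(N)

# Same quantity from an SVD of the standardized residuals.
P      <- N / sum(N)
r_mass <- rowSums(P)
c_mass <- colSums(P)
S      <- (P - r_mass %o% c_mass) / sqrt(r_mass %o% c_mass)
lambda_svd <- svd(S)$d[1]^2

c(reciprocal_averaging = ra$eigenvalue, svd = lambda_svd)
```

The two numbers should agree to within the convergence tolerance, which illustrates why reciprocal averaging and SVD-based correspondence analysis are treated as the same technique under different names.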
Table 1. From http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf (page 5)
Correspondence Analysis with the "Satisfaction With Life Scale" (SWLS)
The following are screenshots of the drop-down list boxes that were used in the SWLS survey. The UNT Zope Survey server was used to collect the responses to the SWLS:
Responses are collected on a 7 point ordinal (arguably an interval) scale; the anchors to the points on the scale are: Strongly agree, Agree, Slightly agree, Neither agree nor disagree, Slightly disagree, Disagree, and Strongly disagree:
Some demographic data were collected; however, we will not look at those data here and will only examine the responses to the 5 items Q1-Q5. The data are available for download at: http://www.unt.edu/rss/SWLS.questions.txt. Last month we discussed using the Tinn-R editor as a script editor and pager for the R environment on the Windows platform. Below is a screenshot in Tinn-R of the R commands used to: 1) load the needed packages; 2) download the data from the URL given; 3) export the survey data to a delimited text file so that the data can be read in by the ade4 GUI; and 4) display the data in a window for examination. The ade4TkGUI menu is displayed as well. A screenshot of 16 of the 174 responses to the SWLS (as an example) appears below the Tinn-R screenshot.
R Commands needed to load packages; download data; and start the ade4 GUI:
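For readers without access to the screenshot, the four steps might look roughly like the sketch below. This is a reconstruction under assumptions, not the original script: the helper names read_swls and export_swls, the file name SWLS.txt, and the tab-delimited layout with a header row are all assumed, and a small made-up data frame stands in for the real survey so that the code runs without network access. The GUI is launched only in an interactive session where the ade4TkGUI package is installed.

```r
# Step 1 (assumes both packages are installed):
# library(ade4); library(ade4TkGUI)

swls_url <- "http://www.unt.edu/rss/SWLS.questions.txt"  # URL given in the article

# Step 2: read the survey data.
# ASSUMPTION: the file is tab-delimited with variable names on row one.
read_swls <- function(src) read.table(src, header = TRUE, sep = "\t")

# Step 3: export to a tab-delimited text file for the GUI's "Read Text Data" dialog.
export_swls <- function(dat, path) {
  write.table(dat, file = path, sep = "\t", row.names = FALSE, quote = FALSE)
  invisible(path)
}

# Made-up stand-in data (16 respondents, items Q1-Q5 scored 1-7).
set.seed(1)
demo <- as.data.frame(matrix(sample(1:7, 16 * 5, replace = TRUE), ncol = 5,
                             dimnames = list(NULL, paste0("Q", 1:5))))
out_file   <- export_swls(demo, file.path(tempdir(), "SWLS.txt"))
round_trip <- read_swls(out_file)

# Step 4 and the GUI launch, only when a user is present and the package exists.
if (interactive() && requireNamespace("ade4TkGUI", quietly = TRUE)) {
  # With network access, the real data could be used instead:
  # SWLS <- read_swls(swls_url); export_swls(SWLS, "c:/SWLS.txt")
  View(round_trip)          # display the data for examination
  ade4TkGUI::ade4TkGUI()    # open the ade4 menu window
}
```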
16 out of 174 responses are displayed:
The Initial ade4 GUI:
To begin, we need to read the data into the R working environment. On the Ade4 main menu bar, select File - Read Text Data, which will bring up the following window:
The defaults will work for the tab-delimited SWLS data file (with variable names on row one). Give the dataset a name of SWLS.dat; this will be the name of the data set in the R workspace that we are using. Select the "Set" button and browse to the location of the SWLS text file (for this example the data were exported to the "c:\" drive). Once the file is selected, a data editor appears. Inspect the data visually to make sure they loaded properly, then close the R data editor by clicking the "x" in its upper right corner. For example:
The "read-text" dialog box should disappear and the display returns to the initial Ade4 dialog box. In the "One table analyses" panel, select the "COA" button for "correspondence analysis".
On the following dialog box, click "Set" and fill the "Input data frame" field by selecting the SWLS.dat entry in the next popup window. Then click "Choose" and then "Submit":
The dialog box shows that 174 rows with 5 columns have been selected. Give an "output dudi name" of "SWLSdudi" and click "Submit". The following windows are generated:
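The same analysis can also be requested from the R console rather than through the dialog. The sketch below assumes the ade4 package is installed and uses made-up ratings in place of the real SWLS.dat, so its numbers will not match the article's. In `dudi.coa`, `scannf = FALSE` suppresses the interactive eigenvalue bar chart and `nf = 2` keeps two axes.

```r
# Made-up stand-in for SWLS.dat: 174 respondents, 5 items scored 1-7.
set.seed(42)
SWLS.dat <- as.data.frame(matrix(sample(1:7, 174 * 5, replace = TRUE), ncol = 5,
                                 dimnames = list(NULL, paste0("Q", 1:5))))

if (requireNamespace("ade4", quietly = TRUE)) {
  # Correspondence analysis, keeping two axes without the interactive prompt.
  SWLSdudi <- ade4::dudi.coa(SWLS.dat, scannf = FALSE, nf = 2)

  print(SWLSdudi$eig)        # eigenvalues
  print(head(SWLSdudi$li))   # row (respondent) coordinates on the kept axes
  print(SWLSdudi$co)         # column (item) coordinates on the kept axes

  # ade4::scatter(SWLSdudi)  # joint plot of rows and columns
}
```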
Select "2" axes for the display and click "Choose". The bar chart that is displayed shows the extracted eigenvalues.
The following dialog box appears:
This dialog box gives us information regarding the eigenvalue/eigenvector extraction and the row and column scalings that Ade4 performed ("Vectors" panel). In the Vectors panel we see: column weights, row weights, and eigenvalues. Four eigenvalues were extracted with values of: .039, .023, .014, and .011 (note: with 5 items, a correspondence analysis yields at most 4 non-trivial eigenvalues, one fewer than the number of columns). Clicking the "scatter(SWLSdudi)" button produces an (x, y) plot of the case and item scores (labeled with row and column number IDs) where the scaling of the x and y coordinates is equivalent. The upper left panel of the window displays a shaded bar graph of the two eigenvalues that produce the scores and scalings of the (x, y) coordinate system; these are the two largest modes of variation in the row (respondents) and column (items) objects:
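The quantities reported in the Vectors panel can be reproduced in base R. The following is a sketch on made-up ratings (so the eigenvalues will differ from those quoted above), assuming the usual formulation of correspondence analysis as an SVD of standardized residuals; the variable names are illustrative. It also shows why a table with 5 columns gives at most 4 non-trivial eigenvalues.

```r
# Made-up 174 x 5 table of ratings standing in for the SWLS responses.
set.seed(7)
X <- matrix(sample(1:7, 174 * 5, replace = TRUE), ncol = 5)

P  <- X / sum(X)      # correspondence matrix
rw <- rowSums(P)      # row weights
cw <- colSums(P)      # column weights

# Standardized residuals from the independence model rw %o% cw.
S <- (P - rw %o% cw) / sqrt(rw %o% cw)

eig <- svd(S)$d^2             # eigenvalues are squared singular values
eig <- eig[eig > 1e-12]       # drop the trivial, numerically zero eigenvalue

length(eig)                   # no more than min(nrow(X) - 1, ncol(X) - 1)
round(eig / sum(eig), 3)      # share of the total inertia carried by each axis
```

The sum of the eigenvalues (the total inertia) equals the chi-square statistic of the table divided by its grand total, which is a convenient check on the computation.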
The pair of axes (x, y) represents independent coordinates, or uncorrelated "components of variation" or "units of information", for the row and column objects. The units are scaled in a common metric for both the x and y axes, so that a set of (x, y) pairs describes each of the row objects (n=174 respondents) and column objects (5 items). For a set of perfectly homogeneous items, we would expect the items to cluster fairly close to one another on both axes, with most of the clustering occurring along one axis. Since these items were semantically constructed to elicit the self-report of a theoretically defined variable called "Life Satisfaction", we expect a single, unidimensional construct to emerge across individuals such that items look similar in terms of their (x, y) coordinates. That is, we expect all of the information in the original set of elicited responses to be contained in the transformed values (x, y), with most of the information contained in either x or y. We would expect the (x, y) values to be close for the items that are more homogeneous (i.e. responded to similarly by respondents). Additionally, individual respondents whose (x, y) pairs are close would be considered more similar in their response patterns across the set of items than individuals whose (x, y) values differ substantially. Moreover, the closeness of individuals AND items on (x, y) scores would allow a researcher to cluster "similar individuals" on clusters of "similar items". In our survey of 5 items, we see that items Q1, Q3 and Q5 are more similar to one another on the x coordinate, and Q1, Q2, Q3, and Q4 are more similar to one another on the y coordinate. Item Q5 can be seen as standing apart from the other four items (even more so than Q2). Also notice that the bulk of the respondents fall near items Q1-Q4. Item Q5 might be better reworded or discarded entirely.
One way of thinking of this situation is to see that there are 3 separate TYPES of questions: 1) Low (x) values - [Q1, Q2, Q3, Q4]; 2) Low (y) values - [Q1, Q3, Q5]; and 3) Low (x, y) values - [Q1, Q3]. These patterns of variation would account for three of the largest eigenvalues (i.e. independent modes of variation in the original scores). Perhaps the smallest eigenvalue is accounted for by item Q2, since it resides some distance from both the x and y axes. Notice that respondents 21, 48, 52, 62, 79, 158, & 161 reside a substantial distance away from the bulk of the other respondents.
In summary, our conclusion is that this Correspondence Analysis has helped reveal a potentially informative source of heterogeneity in the set of items and respondents (rows and columns). The original presupposition of a unidimensional construct underlying these items does not seem to hold, at least upon graphical inspection (a conclusion supported by the multiple large eigenvalues). Our next step might be to look for subgroups of individuals that account for the heterogeneity that we see in respondents' responses on certain items (e.g. Q5). Additionally, we might try to clarify the wording of the survey items to better communicate the semantic content that we hope will elicit correlates of the construct "Life Satisfaction" in our respondents' responses.
Statnotes: Topics in Multivariate Analysis, by G. David Garson