The following covers
a few of the SAS procedures for conducting component and factor analysis. Use the Import Wizard to import the
Example Data 4 file using the SPSS File (*.sav) source option and
member name example4. There should be 750 cases or observations with no missing
values and 16 variables.
Make sure the *entire* data set was successfully
imported to SAS with the following syntax:
**PROC MEANS DATA=example4;**
RUN;
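As a supplementary check, the counts stated above (750 observations, 16 variables, no missing values) can be confirmed more directly. This is a minimal sketch that assumes the data set was imported into the WORK library under the member name example4.

```sas
/* Report the number of observations and variables in the imported data set. */
PROC CONTENTS DATA=example4;
RUN;

/* Count non-missing (N) and missing (NMISS) values for every numeric variable. */
PROC MEANS DATA=example4 N NMISS;
RUN;
```

If the import worked, PROC CONTENTS should report 750 observations and 16 variables, and NMISS should be 0 for every variable.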
Before we begin with the analysis syntax, let's take a moment to address and
hopefully clarify one of the most confusing and poorly articulated
issues in statistical teaching and practice literature. An ambitious goal, to be
sure.
**First, Principal Components Analysis (PCA)** is a variable reduction
technique which maximizes the amount of variance accounted for in the observed
variables by a smaller group of variables called COMPONENTS. As an example,
consider the following situation. Let's say we have 500 questions on a survey
we designed to measure persistence. We want to reduce the number of questions
so that it does not take someone 3 hours to complete the survey. It would be appropriate
to use PCA to reduce the number of questions by identifying and removing
redundant questions. For instance, if question 122 and question 356 are
virtually identical (i.e. they ask the exact same thing but in different ways),
then one of them is not necessary. The PCA process allows us to reduce the
number of questions or variables down to their PRINCIPAL COMPONENTS.
PCA is commonly, but very confusingly, called exploratory factor analysis (EFA).
The use of the word *factor* in EFA is inappropriate and confusing because we are really
interested in COMPONENTS, not factors. This issue is made more confusing by some
software packages (e.g. PASW / SPSS) which list or use PCA under the heading
factor analysis.
**Second, Factor Analysis (FA)** is typically used to confirm the latent
factor structure for a group of measured variables. Latent factors are
unobserved variables which typically cannot be directly measured, but they are
assumed to *cause* the scores we observe on the measured or indicator variables.
FA is a model-based technique. It is concerned with modeling the relationships
between measured variables, latent factors, and error.
As stated in O'Rourke, Hatcher, and Stepanski (2005): "Both (PCA & FA) are
methods that can be used to identify groups of observed variables that tend to
hang together empirically. Both procedures can also be performed with the SAS
FACTOR procedure and they generally tend to provide similar results.
Nonetheless, there are some important conceptual differences between principal
component analysis and factor analysis that should be understood at the outset.
Perhaps the most important deals with the assumption of an underlying causal
structure. Factor analysis assumes that the covariation in the observed
variables is due to the presence of one or more latent variables (factors) that
exert causal influence on these observed variables" (p. 436).
Final thoughts. Both PCA and FA can be used for exploratory analysis. However,
PCA is predominantly used in an exploratory fashion and almost never in a
confirmatory fashion. FA can be used in an exploratory fashion, but most of the
time it is used in a confirmatory fashion because it is concerned with modeling
factor structure. The choice between them should be driven by the goals of the
analyst. If you are interested in reducing the observed variables down to their
principal components while maximizing the variance accounted for in the
variables by the components, then you should be using PCA. If you are concerned
with modeling the latent factors (and their relationships) which cause the
scores on your observed variables, then you should be using FA.
### REFERENCE ###
O'Rourke, N., Hatcher, L., & Stepanski, E.J. (2005). A step-by-step approach to
using SAS for univariate and multivariate statistics, Second Edition. Cary, NC:
SAS Institute Inc.
##################
**IX. Principal Components Analysis**
So, here we go with the syntax. The generic syntax for Principal Components
Analysis with options is displayed below.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=PRIN
PRIORS=ONE
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
PROC FACTOR, as stated earlier, can be used for either principal components
analysis or factor analysis (you see why this can be confusing). The DATA=
option should be familiar by now. The SIMPLE option provides simple descriptive
statistics for each of the variables in the analysis (i.e. number of
cases/observations, means, standard deviations). METHOD=PRIN specifies
the extraction method as principal components. PRIORS=ONE specifies the prior
communality estimates. When conducting principal components analysis, you should
always use ONE. The optional NFACT= option allows you to specify the number of
retained components (again, the use of fact or factor makes this confusing).
MINEIGEN=1 specifies the minimum acceptable (or critical) eigenvalue we want a
component to display in order for it to be retained. SCREE simply specifies that
we want a scree plot to be displayed with the output. ROTATE= specifies a
rotation strategy. When components are correlated, we would choose an oblique
rotation strategy (e.g. PROMAX) and when components are not correlated, we would
choose an orthogonal rotation strategy (e.g. VARIMAX). FLAG=.32 specifies that
we want the output to flag (with an *) all loadings whose absolute value is
greater than the number we specify. Here, 0.32 is specified because when
squared, it represents roughly 10% of the variance in the variable accounted for
by the component. The OUT= option specifies a name for a new data set which will
include the original variables and the retained component scores for each
observation. The OUT= option can only be used when the input data is raw data
(as opposed to a correlation or covariance matrix) and the number of components
(NFACT=) has been specified. The OUT= option can be useful for determining
whether or not the components are correlated (e.g. running a PROC CORR on the
newly created data which includes the component scores). The VAR statement is
used to specify all the variables being subjected to the component analysis. It
is **important** to notice the semicolons: one ends the PROC FACTOR statement
(after the last option) and one ends the VAR statement.
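The arithmetic behind the 0.32 flagging threshold can be checked with a short DATA step. This is only an illustrative sketch; the variable names loading and variance_share are made up for the example.

```sas
/* Square a loading of 0.32 and write the result to the SAS log. */
DATA _NULL_;
   loading = 0.32;
   variance_share = loading**2;   /* proportion of variance accounted for */
   PUT variance_share=;
RUN;
```

The log shows variance_share=0.1024, which is where the figure of roughly 10% comes from.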
If the OUT= option is used in the principal component analysis, then you will
likely want to explore the relationships between the components (named factor1
factor2...factorN by default) and the variables. In that case, the syntax below
provides the generic format for doing so. Again, the use of the term factor when
referring to components makes this stuff confusing.
PROC CORR DATA=newdata;
VAR factor1 factor2...factorN;
WITH variable1 variable2...variableN factor1 factor2...factorN;
RUN;
**(1)**
Now we can move on to a practical example. The current example uses Example Data
4 (example4) which contains 15 items or variables and 750 cases or observations.
For an initial components analysis, we specify no number of components to be
retained, no rotation strategy, and we are not interested in creating a new data
file.
**PROC FACTOR DATA=example4**
SIMPLE
METHOD=PRIN
PRIORS=ONE
MINEIGEN=1
SCREE
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
We notice from the output that two items (y14 & y15) do not load on
the first component (always the strongest component without rotation) but create
their own retained component (also with an eigenvalue greater than 1). A
component should have, as a minimum, 3 items/variables, but let's reserve
deletion of items until we can discover whether or not our components are
related.
**(2)** Next, we re-run the PCA specifying NFACT=5, which really means we
are specifying 5 components to be retained. We also specify the creation of a
new data set (ex4comp2) which will contain all the variables used in the PCA
*and* component scores for each observation. Also note, we removed the SIMPLE
option because the descriptive statistics were given with the previous PCA.
**PROC FACTOR DATA=example4**
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
FLAG=.32
OUT=ex4comp2;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
The creation of the new data set allows us to determine if our components are
correlated.
**PROC CORR DATA=ex4comp2;**
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2 factor3
factor4 factor5;
RUN;
We see in this output that our components are not correlated, which indicates we
should use an orthogonal rotation.
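If only the relationships among the component scores are of interest, a more compact request may be easier to read. The sketch below assumes the ex4comp2 data set created above and the default score names factor1 through factor5; NOSIMPLE suppresses the descriptive statistics table.

```sas
/* Correlate the five component scores with one another only. */
PROC CORR DATA=ex4comp2 NOSIMPLE;
   VAR factor1-factor5;   /* numbered range list: factor1, factor2, ..., factor5 */
RUN;
```

With only a VAR statement, PROC CORR prints a square matrix, so the off-diagonal entries are the correlations between components.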
**(3)** Now we can re-run the PCA with a VARIMAX rotation applied.
**PROC FACTOR DATA=example4**
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp3;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
Here we see that the varimax rotation cleaned up the interpretation by
eliminating the global first component (see the Rotated Factor Pattern table).
And, because we created a new data file, we can verify the complete lack of
correlations between the components using the syntax below.
**PROC CORR DATA=ex4comp3;**
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2 factor3
factor4 factor5;
RUN;
**(4)** Finally, we can eliminate the two items which (1) by themselves
create a component (components should have more than 2 items or variables) and
(2) do not load (at all) on the un-rotated or initial component 1.
**PROC FACTOR DATA=example4**
METHOD=PRIN
PRIORS=ONE
NFACT=4
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp4;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
**PROC CORR DATA=ex4comp4;**
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3 factor4;
RUN;
To help clarify the purpose of PCA, consider reviewing the output for PCA **(3)**
with particular attention to the first page of that output (the page above the
scree plot). You will find there a table with the title "Eigenvalues of the
Correlation Matrix: Total = 15, Average = 1". The fourth column in that table is
called "Cumulative" and refers to the cumulative proportion of variance
accounted for by the components. Now focus on the fifth value from the top in
that fourth column. That value of .5503 tells us 55.03% of the variance in the
items (specifically the items' correlation matrix) is accounted for by the
first 5 components. As a comparison, and to highlight the purpose of PCA, look
at the same table for PCA **(4)**, which has the title "Eigenvalues of the
Correlation Matrix: Total = 13, Average = 1". Pay particular attention to the
fourth value in the fourth (cumulative) column. This value of .5517 tells us
55.17% of the variance in the items (specifically the items' correlation
matrix) is accounted for by the first 4 components. So, we have reduced the
number of items from 15 to 13, reduced the number of components from 5 to 4,
and yet have slightly increased the proportion of variance accounted for in the
items by our principal components.
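To compare the cumulative proportions without paging through the listing, the eigenvalue table can be captured as a data set. This is a sketch that assumes the ODS table name Eigenvalues for PROC FACTOR; eig15 is a hypothetical data set name, and y1-y15 is shorthand for the full variable list used above.

```sas
/* Route the eigenvalue table from PCA (3) to a data set named eig15. */
ODS OUTPUT Eigenvalues=eig15;
PROC FACTOR DATA=example4
     METHOD=PRIN
     PRIORS=ONE
     NFACT=5
     MINEIGEN=1
     ROTATE=VARIMAX;
VAR y1-y15;
RUN;

/* Print the first five rows; the cumulative column holds the running proportion. */
PROC PRINT DATA=eig15 (OBS=5);
RUN;
```

Repeating the same steps for PCA **(4)**, with NFACT=4 and VAR y1-y13, gives the second table for a side-by-side comparison.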
**X. Factor Analysis**
The generic syntax for Factor Analysis (FA) with options is displayed below;
the only real changes from the PCA syntax are the extraction method and the
priors. For PCA we used METHOD=PRIN; here, for FA, we will use METHOD=ML, which
refers to maximum likelihood extraction. Some suggest using ULS, which refers
to unweighted least squares extraction. The other change is the use of SMC, or
squared multiple correlations, in the PRIORS= option.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=ML /* or METHOD=ULS */
PRIORS=SMC
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
Continuing with the same data as was used above, we will submit our 15 initial
items to a maximum likelihood FA with VARIMAX rotation and SMC priors. We
leave out the SIMPLE option because we have already seen the descriptive
statistics for each item above. We will leave out the OUT= option because we
do not need to use the factor scores for assessing the relationship between the
factors (we know from above they are not related). However, it is often useful
to save the factor scores for use in another analysis (e.g. SEM). We will leave
out the MINEIGEN criterion so that we ensure all 5 factors are retained (often
only one common factor is retained because only one factor displays
an eigenvalue greater than 1).
**PROC FACTOR DATA=example4**
METHOD=ML
PRIORS=SMC
NFACT= 5
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
On the sixth page of the output, you will see a table titled "Rotated
Factor Pattern" in the middle of that page. This table displays the rotated
factor loadings for each item / variable on each factor retained. Notice that
Factor 5 has no items loading greater than 0.32 (* indicates loadings greater
than 0.32). Also notice that items y14 and y15 do not load on any factor greater
than 0.32. In fact, the greatest loading for y14 is with Factor 5, which is only
0.20; when squared (.04), this represents only 4% of the variance accounted for
in that item by Factor 5. Furthermore, Factor 5 is only supported by two items
(y14 & y15) which themselves are not very good (indicated by the communalities).
For instance, if we look at the seventh page of the output, we find the majority
of a table titled "Final Communality Estimates and Variable Weights" which
displays the communalities for each item / variable. Communalities represent the
sum of the squared loadings for an item. They are interpreted as the proportion
of variance in an item which is explained by all the retained factors after
rotation. So, we can see that both y14 and y15 display very low communalities,
which indicates their variance is not explained by the combined factors. To be
more specific, y14 displays a communality of 0.042, which means
only 4.2% of the variance of item y14 is explained by all five factors
combined. The bottom line interpretation here is that Factor 5 and items y14 and
y15 can be removed.
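The statement that a communality is the sum of an item's squared loadings can be checked directly. The sketch below assumes that the OUTSTAT= data set written by PROC FACTOR stores the rotated loadings in rows with _TYPE_ = 'PATTERN' (one row per factor, one column per item); fa5stats is a hypothetical data set name.

```sas
/* Re-run the 5 factor ML solution, saving the statistics instead of printing. */
PROC FACTOR DATA=example4
     METHOD=ML
     PRIORS=SMC
     NFACT=5
     ROTATE=VARIMAX
     OUTSTAT=fa5stats
     NOPRINT;
VAR y1-y15;
RUN;

/* USS is the uncorrected sum of squares: for each item, the squared */
/* loadings are summed across the five factor rows.                  */
PROC MEANS DATA=fa5stats USS;
WHERE _TYPE_ = 'PATTERN';
VAR y1-y15;
RUN;
```

Under that assumption, the USS value for each item should agree, up to rounding, with the value shown in the "Final Communality Estimates and Variable Weights" table.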
**PROC FACTOR DATA=example4**
METHOD=ML
PRIORS=SMC
NFACT= 4
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
Reviewing the last two pages of the most recent output, we see the "Rotated
Factor Pattern" table and the "Final Communality Estimates and Variable Weights"
table (which starts on the bottom of one page and continues on the last page of
the output). In the Rotated Factor Pattern table we see a clear factor structure
displayed, meaning each item loads predominantly on one factor. For instance,
the first four items load virtually exclusively on Factor 1. Furthermore, if we
look at the communalities we see that all the items displayed a communality of
0.32 or greater, with one exception. The exception is y4, whose communality is a
little lower than we would like. Given that Factor 1 has three other items which
load strongly on it, we may choose to remove item y4 from further analysis
or measurement in the future.
Finally, as an additional example, we can take a look at the same analysis but
with an oblique (PROMAX) rotation strategy.
**PROC FACTOR DATA=example4**
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=PROMAX
FLAG=.40
OUT=ex4comp5;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
**PROC CORR DATA=ex4comp5;**
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3 factor4;
RUN;
When interpreting the output of a run with oblique rotation, remember that the
oblique process is a two-stage process. During the first stage, an orthogonal
rotation solution is produced. The current example provides output (on pages 6 &
7) which is identical to the previous VARIMAX rotated 4 factor and 13 item
solution from above. During the second stage, the factors are allowed to
correlate and the PROMAX rotation is then applied. Interpretation of the oblique
(PROMAX) solution begins on page 8 of the current output. The top of page 10
begins with the table named "Inter-Factor Correlations"; directly below that
table one can find the "Rotated Factor Pattern (Standardized Regression
Coefficients)" table. Here is where the rotated loadings for the PROMAX rotation
are displayed. At the bottom of page 12 and continuing on to page 13 one will
find the communality estimates associated with the PROMAX solution.
**XI. Internal Consistency Analysis (Cronbach's Alpha Coefficient)**
Often when one is conducting principal components analysis or factor analysis,
one will want to conduct an internal consistency analysis. Traditionally, reliability
analysis was used synonymously with internal consistency and/or Cronbach's Alpha
or Coefficient Alpha. However, Cronbach's Alpha is not a statistical measure of
reliability; it is a measure of internal consistency. Reliability generally
refers to whether or not a measurement device provides consistent data across
multiple administrations. Reliability can be assessed by correlating multiple
administrations of the measurement device given to the same population at
different times -- this is known as test-retest reliability. Internal
consistency can be thought of as the relationship between each item and each
other item, *and* as the relationship of each item to the collection of items
or total score. Internal consistency is assessed using (1) the item to total
score correlation and (2) Cronbach's alpha coefficient. In SAS, both are
provided by PROC CORR when the ALPHA option is specified.
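To make the idea of an item to total score correlation concrete, the total can be built by hand and correlated with each item. This is a sketch only: ex4total and total are hypothetical names, and the four items are the ones associated with Factor 1.

```sas
/* Build a total score from the first four items. */
DATA ex4total;
   SET example4;
   total = SUM(OF y1-y4);
RUN;

/* Correlate each item with the total score. */
PROC CORR DATA=ex4total NOSIMPLE;
   VAR y1-y4;
   WITH total;
RUN;
```

Note that each item is itself part of this total, so these are uncorrected item to total correlations. The ALPHA option instead correlates each item with the total of the remaining items, so its values will typically be somewhat lower.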
The generic form of the PROC CORR procedure for producing Cronbach's alpha
follows:
PROC CORR DATA=dataname ALPHA NOMISS;
VAR variable1 variable2 variable3...variableN;
RUN;
The PROC CORR statement for obtaining Cronbach's alpha uses the ALPHA option and
can use the NOMISS option to exclude observations with a missing value on any of
the listed variables (listwise deletion).
As a practical example, consider the following syntax which will provide
Cronbach's alpha coefficient for the first four items of our data set (i.e.
Factor 1):
**PROC CORR DATA=example4 ALPHA;**
VAR y1 y2 y3 y4;
RUN;
The output of PROC CORR with the ALPHA option contains 4 tables. The first table
simply reports the descriptive statistics for each item/variable listed in the
VAR statement. The second table reports the raw and standardized versions of
Cronbach's alpha coefficient. The third table (often critically important)
reports the item-to-total score correlations and the alpha if item deleted for
each item in both raw and standardized forms. Keep in mind, we would expect a
good item to display a high correlation with the total score and a low alpha if
item deleted (i.e. if alpha drops when an item is deleted, then clearly that
item was important). The fourth table reports the Pearson correlation
coefficients among the items.