RSS Matters
By Craig Henderson, Former Employee
Research and Statistical Support Services
Longitudinal Growth Curve Modeling with SAS PROC MIXED
This article is a slightly edited version of an article that
appeared in the July, 1999 issue of Benchmarks Online. Dr.
Herrington will have an article on this topic using S-Plus next
month. His previous article, "Controlling the False Discovery
Rate in Multiple Hypothesis Testing," appeared in April's
Benchmarks Online. The previous article in this series can be
found in the December, 2001 issue of Benchmarks Online:
"Dealing with Outliers in Bivariate Data: Robust Correlation" -
Ed.
Multilevel Modeling
In a nutshell, multilevel modeling (also known as hierarchical
linear modeling and random coefficient modeling) is a flexible
data analysis technique for fitting linear models (e.g., the
general linear model used in conjunction with ANOVA and
regression) to data with a hierarchically nested structure
(Bryk & Raudenbush, 1992). It is actually a more restrictive
form of the mixed effects general linear model. The classic
example is of students nested within classrooms, nested within
schools, nested within school districts, etc. Another
frequently used application is the analysis of individual
growth models designed for exploring longitudinal data on
individuals over time. Basically, multilevel models expand
traditional regression methods by dropping the assumption of
independence of observations and allowing the researcher to
estimate both fixed and random effects on more than one level
of a hierarchical structure simultaneously. Relationships are
no longer assumed to be fixed over contexts (e.g., schools,
time) and are therefore allowed to differ. These models are
more realistic than traditional regression models because they
make less restrictive assumptions; however, as Kreft (1996)
points out, this generality is not without its price.
Multilevel models are not parsimonious: more parameters are
estimated, the outcomes may be more sample specific, larger
data sets are needed for stable solutions, and the estimation
methods are more complex than the ordinary least squares method
applied in traditional linear regression.
Although multilevel models are not a panacea (they are not THE
statistical technique that will finally generate theory for
you), there are several reasons that researchers in the social
sciences need to know about multilevel modeling. First, there
is the problem of nonindependence of observations. Basically,
this problem arises when clusters of individuals in an analysis
have more in common with each other than with other
individuals. Obvious examples are students in the same
classroom and members of the same family. If traditional
methods are used in these cases, standard errors will be
underestimated, leading to an increased probability of a Type I
error. However, other problematic situations are less obvious.
The intraclass correlation is a helpful diagnostic tool for
determining whether a multilevel model will be superior to a
traditional method, such as regression or ANOVA. A rough rule
of thumb is that when the intraclass correlation is over .10,
hidden clusters are present in your data, and a multilevel
model would be a more appropriate data analysis technique.
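As a rough illustration of how the intraclass correlation might be obtained, the sketch below fits an intercept-only model with a random intercept for each cluster. The intraclass correlation is then the estimated intercept variance divided by the sum of the intercept variance and the residual variance. The data set name LONG and the variable names Y and ID are assumptions made for this example, not names from the original article.

```sas
/* Sketch only: LONG, Y, and ID are hypothetical names for a data set  */
/* with one record per observation, an outcome, and a cluster ID.      */
proc mixed data=long covtest noclprint;
  class id;
  model y = / solution;            /* intercept-only fixed part        */
  random intercept / subject=id;   /* between-cluster variance         */
run;
```

For example, if the covariance parameter estimates were 2 for the intercept and 8 for the residual, the intraclass correlation would be 2 / (2 + 8) = .20, above the .10 rule of thumb.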
Second, in the absence of intraclass
correlation, there is no improvement of multilevel
modeling over traditional models in terms of estimating
fixed effects (Kreft, 1996). However, this is not the
case if the researcher is interested in estimating random
effects, particularly random regression coefficients. To
illustrate this point, a multilevel model involves the
following equations:
$$\underline{Y}_{ij} = \underline{a}_j + \underline{b}_j X_{ij} + \underline{e}_{ij} \qquad (1)$$
$$\underline{a}_j = \gamma_{00} + \gamma_{01} Z_j + \underline{u}_{0j} \qquad (2)$$
$$\underline{b}_j = \gamma_{10} + \gamma_{11} Z_j + \underline{u}_{1j} \qquad (3)$$
where underlining indicates a random variable, $X$ is a single
predictor, and $Y$ is the dependent variable. Index $i$ is used
for individuals, and index $j$ is used for contexts. The error
terms $\underline{u}_{0j}$ and $\underline{u}_{1j}$ indicate
that the intercept $\underline{a}_j$ and the slope
$\underline{b}_j$ will vary over contexts. $\gamma_{00}$
indicates the grand mean, while $\underline{u}_{0j}$ measures
the deviation of each context's mean from the grand mean.
Likewise, $\gamma_{10}$ represents the grand regression slope
across contexts and $\underline{u}_{1j}$ the deviation of each
context's slope from the grand slope. The equations for
$\underline{a}_j$ and $\underline{b}_j$ include a fixed
component, $\gamma_{00}$ and $\gamma_{10}$, and a random
component, $\underline{u}_{0j}$ and $\underline{u}_{1j}$.
$\underline{u}_{0j}$ has a variance $\tau_{00}$,
$\underline{u}_{1j}$ has a variance $\tau_{11}$, and
$\underline{u}_{0j}$ and $\underline{u}_{1j}$ have a covariance
$\tau_{01}$. $Z_j$ represents a context-level variable (the
context being, e.g., a school, or a person in the case of
repeated measurements); therefore, equation (2) demonstrates
that the intercept (mean) of each context is a function of the
group level variable and random fluctuation. In equation (3),
the slope is a function of the same group level variable and
random fluctuation. The variances of $\underline{u}_{0j}$ and
$\underline{u}_{1j}$ and their covariance are parameters
estimated in the model, and are found in the matrix $T$, which
has the following structure:
$$T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}$$
In traditional regression, a and b are treated as fixed
effects, and the random fluctuations are not estimated. Why is
this important? By estimating the elements in the T matrix, we
can examine the unique estimates for separate contexts more
efficiently than by fitting a separate regression equation for
each context. Furthermore, we can now examine cross-level
interactions. An example would be the literature on aptitude by
treatment interaction in education. Such research operates on
the theory that teacher styles differ, and that some styles are
more effective for certain students than for others. Instead of
asking which teaching methods are most effective, we can ask
the more useful question: which teaching methods are most
effective, for which students, in which contexts?
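To see where the cross-level interaction comes from, it may help to substitute equations (2) and (3) into equation (1). The worked step below is a sketch that writes the g coefficients as $\gamma$ and drops the underlining of random variables for readability.

```latex
\begin{align*}
Y_{ij} &= (\gamma_{00} + \gamma_{01} Z_j + u_{0j})
        + (\gamma_{10} + \gamma_{11} Z_j + u_{1j}) X_{ij} + e_{ij} \\
       &= \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij}
        + \gamma_{11} Z_j X_{ij}
        + u_{0j} + u_{1j} X_{ij} + e_{ij}
\end{align*}
```

The coefficient $\gamma_{11}$ multiplies the product of the context-level variable $Z_j$ and the individual-level predictor $X_{ij}$, which is the cross-level interaction. The last three terms form the random part of the model.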
Longitudinal Growth Curve Modeling with SAS PROC MIXED
In 1992, SAS introduced the PROC MIXED procedure into its
statistical package. It was written by agricultural and
physical scientists seeking to generalize the standard linear
model to incorporate both fixed and random effects, and
therefore was not designed with the needs of social scientists
in mind. However, by correctly specifying the mixed model, a
researcher can fit the multilevel models and individual growth
curve models that have become quite popular in the social
sciences (Singer, 1997). The material for this article is drawn
from Singer (1997), and interested readers should study her
very informative, understandable article. Using her examples, I
will provide demonstrations of how to fit a longitudinal growth
curve model. These examples are also provided by Bryk and
Raudenbush (1992).
It would be helpful at this point to review the article I
wrote on how to fit cross-sectional multilevel models
with SAS PROC MIXED. In that article, I discussed how the
three fundamental statements in SAS PROC MIXED syntax
used to fit cross-sectional multilevel models are the
CLASS statement, which identifies the categorical
variable, the MODEL statement, which specifies the fixed
effects, and the RANDOM statement which specifies the
random effects. In this article, I will extend the use of
the RANDOM statement to fit individual growth curve
models. I will also discuss how growth curve models can
be fit with the REPEATED statement. As with the
cross-sectional multilevel model, I will begin with the
example of an unconditional linear growth model.
Unconditional Linear Growth Model
In this model, we will begin with a simple two-level
model. The level-1 model is a linear individual growth
model, modeling the way in which each individual changes
over time. The level-2 model expresses variation in the
parameters from the growth model as random effects that
occur between individuals (i.e., the change in
individuals as a group over time); in the unconditional
model, these random effects are unrelated to any
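person-level covariates. The equations for the unconditional
model appear below; the level-1 (within person) parameters are
indicated by $\pi$ and the level-2 (between person) parameters
are indicated by $\beta$:
$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + u_{0j}$$
$$\pi_{1j} = \beta_{10} + u_{1j}$$
where $r_{ij} \sim N(0, \sigma^2)$ and
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix} \right]$$
Substituting the level-2 models into the level-1 model yields:
$$Y_{ij} = [\beta_{00} + \beta_{10}\mathrm{TIME}_{ij}] + [u_{0j} + u_{1j}\mathrm{TIME}_{ij} + r_{ij}]$$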
This model contains two fixed effects, the intercept and the
effect of TIME, and three random effects: the intercept, the
slope for TIME, and the within-person residual, $r_{ij}$. This
model is somewhat unusual in that it requires a data set in
which each individual has a separate record for each time
period in which he/she is observed. Please see Singer (1997)
for details of how to create such a data set. The syntax used
to fit the unconditional linear growth curve model is presented
below:
proc mixed noclprint covtest;
  class id;                                    /* person identifier */
  model y = time / solution ddfm=bw notest;    /* fixed effects */
  random intercept time / subject=id type=un;  /* random effects */
run;
The CLASS statement declares ID as a classification variable,
so that the multiple records over time can be grouped by
individual. The fixed effects are included in the MODEL
statement (the intercept does not need to be written into the
MODEL statement, as SAS will include it by default), and the
random effects are included in the RANDOM statement (here the
intercept is not included by default, so INTERCEPT must be
listed explicitly). The SUBJECT=ID portion of the RANDOM
statement indicates that the data set is composed of a set of
individual subjects and that we want to allow both intercepts
and slopes to vary across people. Depending on how the TIME
variable is coded, the intercept can represent initial status
(by coding the first wave as TIME=0), average status (by
centering TIME), or final status (by letting 0 represent the
last wave of data, with all other time points coded with
negative numbers). It is usually recommended that TIME be coded
in such a way that the intercept represents initial status. The
TYPE= option specifies the structure of the variance-covariance
matrix for the intercepts and slopes. In our example, we are
specifying an unstructured variance-covariance matrix.
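The person-period data set described above, together with the three ways of coding TIME, might be built along the following lines. This is a minimal sketch rather than Singer's own code: the data set names WIDE and LONG, the variables Y1-Y4, and the assumption of four equally spaced waves are all hypothetical.

```sas
/* Sketch only: WIDE, LONG, and Y1-Y4 are hypothetical names; four    */
/* equally spaced waves are assumed.                                   */
data long;
  set wide;                      /* one record per person: id y1-y4    */
  array yw[4] y1-y4;
  do wave = 1 to 4;
    y = yw[wave];
    time     = wave - 1;         /* 0,1,2,3: intercept = initial status */
    time_mid = wave - 2.5;       /* centered: intercept = average status */
    time_end = wave - 4;         /* -3,...,0: intercept = final status  */
    output;                      /* one record per person per wave      */
  end;
  keep id wave time time_mid time_end y;
run;
```

Replacing TIME with TIME_MID or TIME_END in the MODEL and RANDOM statements changes only what the intercept represents; the slope for time keeps the same meaning.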
Adding a Person-Level Covariate
Typically, in growth curve modeling, we are not only
interested in change over time; we are also interested in
how growth may be influenced by background covariates
(e.g., IQ, family size, SES, etc.). This model adds some
complexity to the unconditional growth curve model:
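$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + \beta_{01}\mathrm{COVAR}_j + u_{0j}$$
$$\pi_{1j} = \beta_{10} + \beta_{11}\mathrm{COVAR}_j + u_{1j}$$
where $r_{ij} \sim N(0, \sigma^2)$ and
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix} \right]$$
Centering is important in such a model because, as the model
now stands, the fixed effects $\beta_{00}$ and $\beta_{10}$ are
interpreted for a person whose background covariate is equal to
0. As a covariate value of 0 is most likely not realistic, we
must center the covariate at the grand mean,
$\overline{\mathrm{COVAR}}$, as follows:
$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + \beta_{01}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + u_{0j}$$
$$\pi_{1j} = \beta_{10} + \beta_{11}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + u_{1j}$$
with $r_{ij}$, $u_{0j}$, and $u_{1j}$ distributed as above.
Substituting models yields:
$$Y_{ij} = \beta_{00} + \beta_{10}\mathrm{TIME}_{ij} + \beta_{01}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + \beta_{11}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}})\mathrm{TIME}_{ij} + u_{0j} + u_{1j}\mathrm{TIME}_{ij} + r_{ij}$$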
If we let CCOVAR represent the centered covariate, we
can fit this model with the following syntax:
proc mixed noclprint covtest;
  class id;
  model y = time ccovar time*ccovar / s ddfm=bw notest;
  random intercept time / type=un sub=id gcorr;
run;
We have added two fixed effects to the MODEL statement, CCOVAR
and the TIME*CCOVAR interaction (S is an abbreviation for the
SOLUTION option). The RANDOM statement remains the same, apart
from the abbreviation SUB= for SUBJECT= and the added GCORR
option, which will print the estimated correlation matrix among
the random effects.
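The centered covariate CCOVAR has to be created before PROC MIXED is run. One possible way to do this is sketched below; it assumes a hypothetical person-level data set PERSONS (one record per ID, containing COVAR) and a hypothetical person-period data set LONG. Computing the mean from the person-level file keeps people with more waves from being weighted more heavily.

```sas
/* Sketch only: PERSONS (one record per person, with ID and COVAR)    */
/* and LONG (the person-period data set) are hypothetical names.       */
proc sql;
  create table persons_c as
  select id, covar - mean(covar) as ccovar   /* subtract the grand mean */
  from persons;
quit;

proc sort data=persons_c; by id; run;
proc sort data=long;      by id; run;

data long_c;                     /* attach CCOVAR to every record       */
  merge long persons_c;
  by id;
run;
```

PROC SQL writes a note to the log saying that it remerged the summary statistic with the original data; that is expected here.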
Exploring the Structure of the Within Person Variance-Covariance Matrix
The above syntax examples place a somewhat unrealistic
assumption on the structure of the within person residuals.
"Were we to fit a model in which only the intercepts vary
across persons . . ., we would be assuming a compound symmetric
error covariance matrix for each person" (Singer, 1997, p. 25).
A compound symmetric matrix is a variance-covariance matrix in
which the variances at every occasion are equal and the
covariances between every pair of occasions are equal, so that
any two observations on the same person are assumed to be
equally correlated no matter how far apart in time they were
collected, a rather unrealistic assumption. In addition, when
we fit individual slopes, we introduce heteroscedasticity into
this residual matrix. However, one of the strengths of PROC
MIXED is that it allows the user to explore different
structures of the error covariance matrix. "By considering
alternative structures for Σ [the within-person error
variance-covariance matrix] (that ideally derive from theory),
and by comparing the goodness of fit of resulting models, the
user can determine what type of structure is most appropriate
for the data at hand" (Singer, 1997, p. 25). For details on the
structure of this matrix, the interested reader is referred to
pages 92-102 of the book SAS System for Mixed Models (Littell,
Milliken, Stroup, & Wolfinger, 1996). Σ is referred to as the R
matrix in SAS PROC MIXED terminology. The structure of the R
matrix is specified using a REPEATED statement. With the
assumption that the R matrix is compound symmetric, the PROC
MIXED syntax would be as follows:
proc mixed noclprint covtest noitprint;
  class id wave;
  model y = time / s notest;
  repeated wave / type=cs subject=id r;
run;
In the above syntax, WAVE is included as a CLASS variable. WAVE
refers to the wave of data collection (i.e., 1st collection,
2nd collection, etc.). Because it appears in the CLASS
statement, WAVE is treated as a categorical variable (in
effect, a series of dummy-coded variables), as opposed to TIME,
which is a continuous variable. This is because the variable
specified in the REPEATED statement must be categorical. With
the TYPE= option, we specify the structure of the R matrix, in
this example, compound symmetry (CS). Other possible structures
include UN for unstructured and AR(1) for first-order
autoregressive. The R option above requests that SAS print the
R matrix. The idea is to try several different error structures
and to compare the goodness of fit statistics for the models
specifying different error structures. Please consult pp.
92-102 of the SAS System for Mixed Models for details.
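To make the difference between these structures concrete, the block below writes out each person's R matrix under compound symmetry and under a first-order autoregressive structure. It is a sketch that assumes four waves; the symbols $\sigma^2$, $\sigma_1$, and $\rho$ are notation chosen for this illustration and do not come from the article.

```latex
\[
R_{\mathrm{CS}} =
\begin{pmatrix}
\sigma^2 + \sigma_1 & \sigma_1 & \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma^2 + \sigma_1 & \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma_1 & \sigma^2 + \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma_1 & \sigma_1 & \sigma^2 + \sigma_1
\end{pmatrix},
\qquad
R_{\mathrm{AR(1)}} = \sigma^2
\begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 \\
\rho & 1 & \rho & \rho^2 \\
\rho^2 & \rho & 1 & \rho \\
\rho^3 & \rho^2 & \rho & 1
\end{pmatrix}
\]
```

Both structures use two parameters, whereas an unstructured 4 x 4 matrix uses ten. Under compound symmetry every pair of waves shares the same covariance, while under AR(1) the correlation falls off as the waves get farther apart.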
Now, combining the specification of the R matrix with the model
we previously fit that included a person-specific background
covariate, our SAS PROC MIXED syntax becomes:
proc mixed noclprint covtest noitprint;
  class id wave;
  model y = time ccovar time*ccovar / s ddfm=bw notest;
  random intercept time / type=un sub=id g;
  repeated wave / type=ar(1) subject=id r;
run;
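TYPE=AR(1) specifies a first-order autoregressive structure
(a lag of 1) for the R matrix, and the G option on the RANDOM
statement prints the estimated variance-covariance matrix of
the random effects.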
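One way to carry out the comparison of error structures described earlier is to save the fit statistics from each run and print them side by side. The sketch below does this for the compound symmetric and AR(1) structures. The data set name LONG and the output data set names are assumptions, and Descr and Value are the column names the FitStatistics table is assumed to carry.

```sas
/* Sketch only: LONG and the FIT_ data sets are hypothetical names.    */
ods output FitStatistics=fit_cs;
proc mixed data=long noclprint noitprint;
  class id wave;
  model y = time / s notest;
  repeated wave / type=cs subject=id;
run;

ods output FitStatistics=fit_ar1;
proc mixed data=long noclprint noitprint;
  class id wave;
  model y = time / s notest;
  repeated wave / type=ar(1) subject=id;
run;

data fit_all;                    /* stack the two fit tables            */
  length structure $ 6;
  set fit_cs(in=in_cs) fit_ar1;
  if in_cs then structure = 'CS';
  else structure = 'AR(1)';
run;

proc print data=fit_all noobs;
  var structure descr value;     /* assumed FitStatistics columns       */
run;
```

Because both models have the same fixed effects, their fit statistics can be compared directly; smaller AIC or BIC values point to the better-fitting structure.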
I hope that I have provided you with some information from
which you can jump off into multilevel modeling. My opinion is
that, just as structural equation modeling has increased in
popularity, so will multilevel modeling. The ability to test
variance components and cross-level interactions is a
particularly appealing feature of this up and coming approach.
Enjoy your researching, and good luck.
References
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage Publications.
Kreft, I. G. G. (1996). Are multilevel techniques necessary? An overview including simulation studies. Multilevel Models Project. http://www.ioe.ac.uk:80/multilevel/workpap.html.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute, Inc.
Singer, J. D. (1997). Using proc mixed to fit multilevel models, hierarchical models, and individual growth models. The Journal of Educational and Behavioral Statistics, in press.