RSS Matters

By Craig Henderson, Research and Statistical Support Services

Longitudinal Growth Curve Modeling with SAS Proc Mixed

This article continues the article I contributed to the May Benchmarks Online. I will provide the same information to review multilevel modeling in general and then proceed with an explanation of longitudinal growth curve modeling.

We have begun a new series of articles to be published in the online newsletter put out by academic computing.  In this series of articles, we will be discussing some advanced methods of data analysis and how they can be implemented in the software supported by the Research and Statistical Support office.  In April, Rich Herrington contributed an article on the new conjoint analysis module implemented in SPSS 9.0, SPSS Conjoint. In this month's article, I will discuss hierarchical linear modeling (multilevel modeling) using SAS Proc Mixed.

Multilevel Modeling

In a nutshell, multilevel modeling (also known as hierarchical linear modeling and random coefficient modeling) is a flexible data analysis technique for analyzing linear models (e.g., the general linear model used in conjunction with ANOVA and regression) with a hierarchically nested structure (Bryk & Raudenbush, 1992). It is actually a more restrictive form of the mixed effects general linear model. The classic example is of students nested within classrooms, nested within schools, nested within school districts, etc. Another frequently used application is the analysis of individual growth models designed for exploring longitudinal data (on individuals) over time. Basically, multilevel models expand traditional regression methods by dropping the assumption of independence of observations and allowing the researcher to estimate both fixed and random effects on more than one level of a hierarchical structure simultaneously. Relationships are no longer assumed to be fixed over contexts (e.g., schools, time) and therefore are allowed to differ. These models are more realistic than traditional regression models because they make less restrictive assumptions; however, as Kreft (1996) points out, this generality is not without its price. Multilevel models are not parsimonious: more parameters are estimated, the outcomes may be more sample specific, larger data sets are needed for stable solutions, and the models use more complex estimation methods than the ordinary least squares method applied in traditional linear regression.

Although multilevel models are not a panacea (they are not THE statistical technique that will finally generate theory for you), there are several reasons that multilevel modeling is something that researchers in the social sciences need to know. First, there is the problem of nonindependence of observations. Basically, this problem arises when clusters of individuals in an analysis have more in common with each other than with other individuals. Obvious examples are students in the same classroom and members of the same family. If traditional methods are used in these cases, standard errors will be underestimated, leading to an increased probability of a Type I error. However, other problematic situations are less obvious. The intraclass correlation is a helpful diagnostic tool for determining whether a multilevel model will be superior to a traditional method, such as regression or ANOVA. A rough rule of thumb is that when the intraclass correlation is over .10, hidden clusters are present in your data, and a multilevel model would be a more appropriate data analysis technique.
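
As a rough illustration of the diagnostic just described, the intraclass correlation is commonly defined as the share of the total variance that lies between clusters. The symbols here are assumptions following the usual convention (a between-cluster intercept variance and a within-cluster residual variance), and the numbers are invented purely for the arithmetic:

$$\rho = \frac{\tau_{00}}{\tau_{00} + \sigma^2}, \qquad \text{e.g. } \tau_{00} = 8.6,\ \sigma^2 = 39.1 \ \Rightarrow\ \rho = \frac{8.6}{8.6 + 39.1} = \frac{8.6}{47.7} \approx 0.18$$

A value of this size is above the .10 rule of thumb, so in this hypothetical case a multilevel model would be indicated.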

Second, in the absence of intraclass correlation, there is no improvement of multilevel modeling over traditional models in terms of estimating fixed effects (Kreft, 1996). However, this is not the case if the researcher is interested in estimating random effects, particularly random regression coefficients. To illustrate this point, a multilevel model involves the following equations:

$$Y_{ij} = a_j + b_j X_{ij} + e_{ij} \tag{1}$$
$$a_j = \gamma_{00} + \gamma_{01} Z_j + u_{0j} \tag{2}$$
$$b_j = \gamma_{10} + \gamma_{11} Z_j + u_{1j} \tag{3}$$

where X is a single predictor and Y is the dependent variable; the coefficients $a_j$ and $b_j$ and the error terms $e_{ij}$, $u_{0j}$, and $u_{1j}$ are random variables. Index i is used for individuals, and index j is used for contexts. The error terms $u_{0j}$ and $u_{1j}$ indicate that the intercept $a_j$ and the slope $b_j$ vary over contexts. $\gamma_{00}$ indicates the grand mean, while $u_{0j}$ measures the deviation of each context's mean from the grand mean. Likewise, $\gamma_{10}$ represents the grand regression slope across contexts and $u_{1j}$ the deviation of each context's slope from the grand slope. The equations for $a_j$ and $b_j$ each include a fixed component ($\gamma_{00}$ and $\gamma_{10}$) and a random component ($u_{0j}$ and $u_{1j}$). $u_{0j}$ has a variance $\tau_{00}$, $u_{1j}$ has a variance $\tau_{11}$, and $u_{0j}$ and $u_{1j}$ have a covariance $\tau_{01}$. $Z_j$ represents a contextual-level variable (e.g., measured on the school, or on the person in the case of repeated measurements); therefore, equation (2) demonstrates that the intercept (mean) of each context is a function of the group-level variable and random fluctuation. In equation (3), the slope is a function of the same group-level variable and random fluctuation. The variances of $u_{0j}$ and $u_{1j}$ and their covariance are parameters estimated in the model, and are found in the matrix T, which has the following structure:

$$T = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}$$

In traditional regression, a and b are treated as fixed effects, and the random fluctuations are not estimated. Why is this important? By estimating the elements in the T matrix, we can examine the unique estimates for separate contexts more efficiently than by fitting a separate regression equation for each context. Furthermore, we can now examine cross-level interactions. An example would be the literature on aptitude by treatment interactions in education. Such research operates on the theory that teacher styles differ, and that some styles are more effective for certain students than for others. Instead of asking what teaching methods are most effective, we can ask the more useful question: what teaching methods are most effective, for which students, in which contexts?
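
To make the cross-level interaction explicit, equations (2) and (3) can be substituted into equation (1). This is a sketch of the algebra, with $\gamma$ denoting the level-2 coefficients of equations (2) and (3):

$$Y_{ij} = \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij} + \gamma_{11} Z_j X_{ij} + u_{0j} + u_{1j} X_{ij} + e_{ij}$$

The term $\gamma_{11} Z_j X_{ij}$ is the cross-level interaction: it lets the effect of the individual-level predictor X depend on the context-level variable Z. The last three terms form the random part of the model.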

Longitudinal Growth Curve Modeling with SAS PROC MIXED

In 1992 SAS introduced the PROC MIXED routine into their statistical package. It was written by agricultural and physical scientists seeking to generalize the standard linear model to incorporate both fixed and random effects and therefore did not have the needs of social scientists in mind. However, by correctly specifying the mixed model, a researcher can fit multilevel models and individual growth curve models that have become quite popular in the social sciences (Singer, 1997). The material for this paper is provided by Singer (1997), and interested readers should study her very informative, understandable article. Using her examples, I will provide demonstrations of how to fit a longitudinal growth curve model. These examples are also provided by Bryk and Raudenbush (1992).

It would be helpful at this point to review the article I wrote on how to fit cross-sectional multilevel models with SAS PROC MIXED. In that article, I discussed how the three fundamental statements in SAS PROC MIXED syntax used to fit cross-sectional multilevel models are the CLASS statement, which identifies the categorical variable, the MODEL statement, which specifies the fixed effects, and the RANDOM statement which specifies the random effects. In this article, I will extend the use of the RANDOM statement to fit individual growth curve models. I will also discuss how growth curve models can be fit with the REPEATED statement. As with the cross-sectional multilevel model, I will begin with the example of an unconditional linear growth model.

Unconditional Linear Growth Model

In this model, we will begin with a simple two-level model. The level-1 model is a linear individual growth model, modeling the way in which each individual changes over time. The level-2 model expresses variation in the parameters from the growth model as random effects that occur between individuals (i.e., the change in individuals as a group over time); in the unconditional model, these random effects are unrelated to any person-level covariates. In these equations, index i refers to measurement occasions and index j to persons. The equations to fit the unconditional model appear below; the level-1 (within person) parameters are indicated by $\pi$ and the level-2 (between person) parameters are indicated by $\beta$:

$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + u_{0j}$$
$$\pi_{1j} = \beta_{10} + u_{1j}$$

where $r_{ij} \sim N(0, \sigma^2)$ and

$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}\right]$$

Substituting the models into each other yields:

$$Y_{ij} = [\beta_{00} + \beta_{10}\mathrm{TIME}_{ij}] + [u_{0j} + u_{1j}\mathrm{TIME}_{ij} + r_{ij}]$$

This model contains two fixed effects, the intercept and the effect of TIME, and three random effects: the intercept, the slope for TIME, and the within-person residual, $r_{ij}$. This model is somewhat unusual in that it requires a data set in which each individual has one record for each time period at which he/she is observed. Please see Singer (1997) for details of how to create such a data set. The syntax used to fit the unconditional linear growth curve model is presented below:

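proc mixed noclprint covtest;
  class id;                                   /* person identifier */
  model y = time/solution ddfm=bw notest;     /* fixed effects: intercept (implicit) and TIME */
  random intercept time/subject=id type=un;   /* random intercepts and slopes across persons */
run;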

The CLASS statement declares ID as a classification variable, so that the multiple observations over time are grouped by individual. The fixed effects are included in the MODEL statement (the intercept does not need to be written into the MODEL statement, as SAS includes it by default), and the random effects are included in the RANDOM statement (here the intercept must be listed explicitly with the INTERCEPT keyword, because the RANDOM statement does not include one by default). The SUBJECT=ID option of the RANDOM statement indicates that the data set is composed of a set of individual subjects and that we want to allow both intercepts and slopes to vary across people; the TYPE= option specifies the structure of the variance-covariance matrix for the intercepts and slopes. In our example, TYPE=UN specifies an unstructured variance-covariance matrix. In coding the TIME variable, the intercept can be made to represent initial status (by coding the first wave as TIME=0), average status (by centering TIME), or final status (by letting 0 represent the last wave of data, with all other time points coded with negative numbers). It is usually recommended that TIME be coded in such a way that the intercept represents initial status.
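
The one-record-per-person-per-occasion data set that this model requires can be built with a DATA step. The following is only a minimal sketch, not Singer's own code: the data set names WIDE and LONG, the four waves, and the variables Y1-Y4 are hypothetical and would need to match your own file.

```sas
/* Assumption: WIDE has one record per person, with an ID variable */
/* and the outcome stored in four columns, Y1-Y4 (one per wave).   */
data long;
  set wide;
  array yy{4} y1-y4;        /* the four repeated measurements */
  do wave = 1 to 4;
    y    = yy{wave};        /* outcome at this wave */
    time = wave - 1;        /* TIME = 0 at the first wave: intercept = initial status */
    output;                 /* write one record per person per wave */
  end;
  keep id wave time y;      /* add any person-level covariates you need here */
run;
```

WAVE is kept alongside TIME so that the same file can later be used with a categorical wave indicator as well as a continuous time variable.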

Adding a Person-Level Covariate

Typically, in growth curve modeling, we are not only interested in change over time; we are also interested in how growth may be influenced by background covariates (e.g., IQ, family size, SES, etc.). This model adds some complexity to the unconditional growth curve model:

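$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + \beta_{01}\mathrm{COVAR}_j + u_{0j}$$
$$\pi_{1j} = \beta_{10} + \beta_{11}\mathrm{COVAR}_j + u_{1j}$$

where $r_{ij} \sim N(0, \sigma^2)$ and

$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}\right]$$

Centering is important in such a model, because as the model now stands, the interpretation of the fixed effects, $\beta_{00}$ and $\beta_{10}$, is based on a scenario in which the background covariate is equal to 0. As this is most likely not the case, we must center the covariate at the grand mean as follows:

$$Y_{ij} = \pi_{0j} + \pi_{1j}(\mathrm{TIME})_{ij} + r_{ij}$$
$$\pi_{0j} = \beta_{00} + \beta_{01}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + u_{0j}$$
$$\pi_{1j} = \beta_{10} + \beta_{11}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + u_{1j}$$

with the same distributional assumptions for $r_{ij}$, $u_{0j}$, and $u_{1j}$ as above.

Substituting models yields:

$$Y_{ij} = \beta_{00} + \beta_{10}\mathrm{TIME}_{ij} + \beta_{01}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}}) + \beta_{11}(\mathrm{COVAR}_j - \overline{\mathrm{COVAR}})\mathrm{TIME}_{ij} + u_{0j} + u_{1j}\mathrm{TIME}_{ij} + r_{ij}$$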

If we let CCOVAR represent the centered covariate, we can fit this model with the following syntax:

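proc mixed noclprint covtest;
  class id;                                            /* person identifier */
  model y = time ccovar time*ccovar/s ddfm=bw notest;  /* adds covariate and its interaction with TIME */
  random intercept time/type=un sub=id gcorr;          /* GCORR prints correlations among random effects */
run;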

We have added two fixed effects to the MODEL statement, CCOVAR and the TIME*CCOVAR interaction. The RANDOM statement remains the same. The GCORR option will print the estimated correlation matrix among the random effects.
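
The centered covariate CCOVAR has to be created before the model is fit. One possible way to do this is sketched below; the data set names LONG and LONG_C and the variable name COVAR are hypothetical. Note the stated assumption: the mean is taken over all person-period records, which equals the mean over persons only when every person contributes the same number of records.

```sas
/* Assumption: LONG is the person-period file and COVAR is constant */
/* within each person. PROC SQL remerges the overall mean onto      */
/* every record (SAS writes a note about this to the log).          */
proc sql;
  create table long_c as
  select *,
         covar - mean(covar) as ccovar   /* grand-mean-centered covariate */
  from long;
quit;
```

With unbalanced data, it would be safer to compute the mean from a file with one record per person and merge it back on.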

Exploring the Structure of the Within Person Variance-Covariance Matrix

The above syntax examples place a somewhat unrealistic assumption on the structure of the within-person residuals. "Were we to fit a model in which only the intercepts vary across persons . . ., we would be assuming a compound symmetric error covariance matrix for each person" (Singer, 1997, p. 25). A compound symmetric matrix is a variance-covariance matrix in which the variance is the same at every occasion and the covariance between any two occasions is the same, no matter how far apart in time they are, a rather unrealistic assumption for longitudinal data. In addition, when we fit individual slopes, we introduce heteroscedasticity into this residual matrix. However, one of the strengths of PROC MIXED is that it allows the user to explore different structures of the error covariance matrix. "By considering alternative structures for Σ [the within-person error variance-covariance matrix] (that ideally derive from theory), and by comparing the goodness of fit of resulting models, the user can determine what type of structure is most appropriate for the data at hand" (Singer, 1997, p. 25). For details on the structure of this matrix, the interested reader is referred to pages 92-102 of the book SAS System for Mixed Models (Littell, Milliken, Stroup, & Wolfinger, 1996). Σ is referred to as the R matrix in SAS PROC MIXED terminology.

The structure of the R matrix is specified using a REPEATED statement. With the assumption that the R matrix is compound symmetric, the PROC MIXED syntax would be as follows:

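proc mixed noclprint covtest noitprint;
  class id wave;                        /* WAVE must be categorical for REPEATED */
  model y = time/s notest;              /* fixed effects: intercept (implicit) and TIME */
  repeated wave/type=cs subject=id r;   /* compound symmetric R matrix; R prints it */
run;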

In the above syntax, WAVE is included as a CLASS variable. WAVE refers to the wave of data collection (i.e., 1st collection, 2nd collection, etc.). Because WAVE appears in the CLASS statement, SAS treats it as a categorical variable (in effect, a series of dummy-coded variables), as opposed to TIME, which is a continuous variable. This is because the variable specified in the REPEATED statement must be categorical. With the TYPE= option, we specify the structure of the R matrix, in this example, compound symmetry. Other possible structures include UN for unstructured and AR(1) for first-order autoregressive. The R option above requests that SAS print out the R matrix. The idea is to try several different error structures and to compare the goodness of fit statistics for the models specifying different error structures. Please consult pp. 92-102 of the SAS System for Mixed Models for details.
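
To make two of these structures concrete, the matrices below sketch what compound symmetry and a first-order autoregressive structure look like for one person. Four waves are assumed purely for illustration, and the parameter names ($\sigma^2$, $\sigma_1$, $\rho$) are labels chosen here:

$$R_{CS} = \begin{pmatrix} \sigma^2+\sigma_1 & \sigma_1 & \sigma_1 & \sigma_1 \\ \sigma_1 & \sigma^2+\sigma_1 & \sigma_1 & \sigma_1 \\ \sigma_1 & \sigma_1 & \sigma^2+\sigma_1 & \sigma_1 \\ \sigma_1 & \sigma_1 & \sigma_1 & \sigma^2+\sigma_1 \end{pmatrix}, \qquad R_{AR(1)} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}$$

Under compound symmetry every pair of waves has the same covariance, whereas under the autoregressive structure the correlation shrinks as the waves get farther apart. The unstructured option places no pattern on the matrix and estimates every variance and covariance separately.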

Now, combining the specification of the R matrix with the model we previously fit that included a person-specific background covariate, our SAS PROC MIXED syntax becomes:

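proc mixed noclprint covtest noitprint;
  class id wave;
  model y = time ccovar time*ccovar/s ddfm=bw notest;  /* fixed effects */
  random intercept time/type=un sub=id g;              /* G prints the random-effects covariance matrix */
  repeated wave/type=ar(1) subject=id r;               /* autoregressive within-person errors */
run;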

The AR(1) option indicates an autoregressive structure with a lag of 1.
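
One way to carry out the comparison of error structures described above is to fit the same model under several TYPE= settings and collect the fit statistics. The sketch below is only an illustration under stated assumptions: the macro name FITR and the data set names LONG, FIT_CS, FIT_AR1, and FIT_UN are hypothetical, and it relies on the ODS OUTPUT statement, which requires a SAS release that supports the Output Delivery System.

```sas
/* Hypothetical helper: fit the REPEATED-only model with a given  */
/* R-matrix structure and save its fit statistics to a data set.  */
%macro fitr(type, out);
  ods output FitStatistics=&out;
  proc mixed data=long noclprint covtest noitprint;
    class id wave;
    model y = time / s notest;
    repeated wave / type=&type subject=id r;
  run;
%mend fitr;

%fitr(cs,    fit_cs)    /* compound symmetry          */
%fitr(ar(1), fit_ar1)   /* first-order autoregressive */
%fitr(un,    fit_un)    /* unstructured               */

/* Print the saved statistics (log likelihood, AIC, etc.) for each fit */
proc print data=fit_cs;  title 'TYPE=CS';    run;
proc print data=fit_ar1; title 'TYPE=AR(1)'; run;
proc print data=fit_un;  title 'TYPE=UN';    run;
title;
```

Because all three fits share the same fixed effects, their fit statistics can be compared directly. The unstructured fit estimates the most parameters and may fail to converge when there are many waves.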

I hope that I have provided you with some information from which you can jump off into multilevel modeling. My opinion is that just as structural equation modeling has increased in popularity, so will multilevel modeling. The ability to test variance components and cross-level interactions is a particularly appealing feature of this up-and-coming approach. Psychology's Dr. Ke-Hai Yuan will be teaching a class on multilevel modeling in the fall, for those of you who would like to pick up another methodology class, or for those of you who are faculty and would like to sit in on a class. Please contact me, craigh@unt.edu, if I can assist you in any way in implementing multilevel models, or with other help as well. Enjoy your research, and good luck.

References

    Bryk, A. S., & Raudenbush, S. W.  (1992). Hierarchical linear models.  Newbury Park, CA:  Sage Publications.

    Kreft, I. G. G.  (1996). Are multilevel techniques necessary? An overview including simulation studies. Multilevel Models Project. http://www.ioe.ac.uk:80/multilevel/workpap.html.

    Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D.  (1996). SAS system for mixed models. Cary, NC:  SAS Institute, Inc.

    Singer, J. D.  (1997).  Using proc mixed to fit multilevel models, hierarchical models, and individual growth models. The Journal of Educational and Behavioral Statistics, in press.