RSS Matters

By Craig Henderson, Research and Statistical Support Services

Longitudinal Growth Curve Modeling with SAS Proc Mixed

This article continues the article I contributed to the May Benchmarks Online. I will provide the same information to review multilevel modeling in general and then proceed with an explanation of longitudinal growth curve modeling.

We have begun a new series of articles to be published in the online newsletter put out by academic computing. In this series, we will discuss some advanced methods of data analysis and how they can be implemented in the software supported by the Research and Statistical Support office. In April, Rich Herrington contributed an article on the new conjoint analysis module implemented in SPSS 9.0, SPSS Conjoint. In this month's article, I will discuss hierarchical linear modeling (multilevel modeling) using SAS Proc Mixed.

Multilevel Modeling

In a nutshell, multilevel modeling (also known as hierarchical linear modeling and random coefficient modeling) is a flexible data analysis technique that involves analyzing linear models (e.g., the general linear model used in conjunction with ANOVA and regression) with a hierarchically nested structure (Bryk & Raudenbush, 1992). It is actually a more restrictive form of the mixed effects general linear model. The classic example is of students, nested within classrooms, nested within schools, nested within school districts, etc. Another frequently used application is the analysis of individual growth models designed for exploring longitudinal data (on individuals) over time. Basically, multilevel models expand traditional regression methods by dropping the assumption of independence of observations and allowing the researcher to estimate both fixed and random effects on more than one level of a hierarchical structure simultaneously.
Relationships are no longer assumed to be fixed over contexts (e.g., schools, time) and are therefore allowed to differ. These models are more realistic than traditional regression models because they make less restrictive assumptions; however, as Kreft (1996) points out, this generality is not without its price. Multilevel models are not parsimonious: more parameters are estimated, the outcomes may be more sample specific, larger data sets are needed for stable solutions, and the models use more complex estimation methods than the ordinary least squares method applied in traditional linear regression.

Although multilevel models are not a panacea, finally giving researchers THE statistical technique that will generate theory for you, there are several reasons that multilevel modeling is something that researchers in the social sciences need to know. First, there is the problem of nonindependence of observations. This problem arises when clusters of individuals in an analysis have more in common with each other than with other individuals. Obvious cases are students in the same classroom and members of the same family. If traditional methods are used in these cases, standard errors will be underestimated, leading to an increased probability of a Type I error. However, other problematic situations are less obvious. The intraclass correlation is a helpful diagnostic tool in determining whether a multilevel model will be superior to a traditional method, such as regression or ANOVA. A rough rule of thumb is that when the intraclass correlation is over .10, hidden clusters are present in your data, and a multilevel model would be a more appropriate data analysis technique. Second, in the absence of intraclass correlation, there is no improvement of multilevel modeling over traditional models in terms of estimating fixed effects (Kreft, 1996).
However, this is not the case if the researcher is interested in estimating random effects, particularly random regression coefficients. To illustrate this point, a multilevel model involves the following equations:

$$\underline{Y}_{ij} = \underline{a}_j + \underline{b}_j X_{ij} + \underline{e}_{ij} \qquad (1)$$

$$\underline{a}_j = \gamma_{00} + \gamma_{01} Z_j + \underline{u}_{0j} \qquad (2)$$

$$\underline{b}_j = \gamma_{10} + \gamma_{11} Z_j + \underline{u}_{1j} \qquad (3)$$

where underlining indicates a random variable, $X$ is a single predictor, and $Y$ is the dependent variable. Index $i$ is used for individuals, and index $j$ is used for contexts. The error terms $u_{0j}$ and $u_{1j}$ indicate that the intercept $\underline{a}_j$ and the slope $\underline{b}_j$ will vary over contexts. $\gamma_{00}$ indicates the grand mean, while $u_{0j}$ measures the deviation in means across contexts from the grand mean. Likewise, $\gamma_{10}$ represents the grand regression slope across contexts and $u_{1j}$ the deviation in slopes from the grand slope across contexts. The equations for $\underline{a}_j$ and $\underline{b}_j$ include a fixed component, $\gamma_{00}$ and $\gamma_{10}$, and a random component, $u_{0j}$ and $u_{1j}$. $u_{0j}$ has a variance, $\tau_{00}$, $u_{1j}$ has a variance, $\tau_{11}$, and $u_{0j}$ and $u_{1j}$ have a covariance, $\tau_{01}$. $Z_j$ represents a contextual-level variable, whose coefficients are written here as $\gamma_{01}$ and $\gamma_{11}$.

In traditional regression, $a$ and $b$ are treated as fixed effects, and the random fluctuations are not estimated. Why is this important? By estimating the elements in the $\mathbf{T}$ matrix (the variances $\tau_{00}$ and $\tau_{11}$ and the covariance $\tau_{01}$), we can examine the unique estimates for separate contexts more efficiently than by estimating a separate regression equation for each context. Furthermore, we can now examine cross-level interactions. An example is the literature on aptitude by treatment interactions in education. Such research operates on the theory that teacher styles differ, and that some styles are more effective for certain students than for others. Instead of asking which teaching methods are most effective, we can ask the more useful question: which teaching methods are most effective, for which students, in which contexts?

Longitudinal Growth Curve Modeling with SAS PROC MIXED

In 1992 SAS introduced the PROC MIXED routine into their statistical package. It was written by agricultural and physical scientists seeking to generalize the standard linear model to incorporate both fixed and random effects, and therefore it was not designed with the needs of social scientists in mind. However, by correctly specifying the mixed model, a researcher can fit the multilevel models and individual growth curve models that have become quite popular in the social sciences (Singer, 1997). The material for this article is drawn from Singer (1997), and interested readers should study her very informative, understandable article. Using her examples, I will demonstrate how to fit a longitudinal growth curve model. These examples are also provided by Bryk and Raudenbush (1992). It would be helpful at this point to review the article I wrote on how to fit cross-sectional multilevel models with SAS PROC MIXED.
In that article, I discussed the three fundamental statements in SAS PROC MIXED syntax used to fit cross-sectional multilevel models: the CLASS statement, which identifies the categorical variable; the MODEL statement, which specifies the fixed effects; and the RANDOM statement, which specifies the random effects. In this article, I will extend the use of the RANDOM statement to fit individual growth curve models. I will also discuss how growth curve models can be fit with the REPEATED statement. As with the cross-sectional multilevel model, I will begin with the example of an unconditional linear growth model.

Unconditional Linear Growth Model

In this model, we will begin with a simple two-level model. The level-1 model is a linear individual growth model, modeling the way in which each individual changes over time. The level-2 model expresses variation in the parameters from the growth model as random effects that occur between individuals (i.e., the change in individuals as a group over time); in the unconditional model, these random effects are unrelated to any person-level covariates. The equations to fit the unconditional model appear below; the level-1 (within person) parameters are indicated by $\pi$ and the level-2 (between person) parameters are indicated by $\beta$:

$$Y_{ij} = \pi_{0j} + \pi_{1j}(\text{TIME})_{ij} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^2)$$

and

$$\pi_{0j} = \beta_{00} + u_{0j}, \qquad \pi_{1j} = \beta_{10} + u_{1j}$$

Substituting the models into each other yields:

$$Y_{ij} = [\beta_{00} + \beta_{10}\,\text{TIME}_{ij}] + [u_{0j} + u_{1j}\,\text{TIME}_{ij} + r_{ij}]$$

This model contains two fixed effects, the intercept and the effect of TIME, and three random effects: the intercept, the slope for TIME, and the within person residual, $r_{ij}$. This model is a little unusual in that a data set needs to be created in which each individual has a separate record for each time period in which he/she is observed. Please see Singer (1997) for details of how to create such a data set. The syntax used to fit the unconditional linear growth curve model is presented below:

proc mixed noclprint covtest;

The CLASS statement indicates that the data represent multiple observations over time for individuals. The fixed effects are included in the MODEL statement (the intercept does not need to be written into the MODEL statement, as SAS will include it by default), and the random effects are included in the RANDOM statement (again the intercept is included by default). The SUBJECT=ID portion of the RANDOM statement indicates that we want to allow both intercepts and slopes to vary across people.

In coding the TIME variable, the intercept can be made to represent initial status (by coding the first wave as TIME=0), average status (by centering TIME), or final status (by letting 0 represent the last wave of data, with all other time points coded as negative numbers). It is usually recommended that TIME be coded in such a way that the intercept represents initial status. The SUBJECT= option indicates that the data set is composed of a set of individual subjects; the TYPE= option specifies the structure of the variance-covariance matrix for the intercepts and slopes. In our example, we are specifying an unstructured variance-covariance matrix.
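As a hedged sketch of how the statements described in this section might fit together (this is an illustration, not the original listing from the article), the example below first builds a data set with one record per person per occasion and then fits the unconditional linear growth model. The data set names WIDE and LONG, the variables ID, Y, TIME, WAVE, and Y1-Y4, and the choice of four waves are all assumptions made for illustration. The PROC MIXED call uses only the statements and options discussed above, plus the SOLUTION option, which prints the fixed-effect estimates.

```sas
/* Assumed input: WIDE has one row per person, with an identifier ID  */
/* and the outcome measured at four waves in Y1-Y4 (hypothetical).    */
data long;
  set wide;
  array yw{4} y1-y4;
  do wave = 1 to 4;
    y    = yw{wave};   /* outcome at this wave                        */
    time = wave - 1;   /* TIME = 0 at wave 1: intercept = initial status */
    output;            /* one record per person per wave              */
  end;
  keep id wave time y;
run;

/* Unconditional linear growth model: fixed intercept and TIME slope, */
/* random intercepts and TIME slopes across persons, unstructured     */
/* variance-covariance matrix for those random effects.               */
proc mixed data=long noclprint covtest;
  class id;
  model y = time / solution;
  random intercept time / subject=id type=un;
run;
```

With TIME coded this way, the fixed intercept estimates average initial status and the fixed TIME effect estimates the average rate of change per wave.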
Adding a Person-Level Covariate

Typically, in growth curve modeling, we are not only interested in change over time; we are also interested in how growth may be influenced by background covariates (e.g., IQ, family size, SES, etc.). This model adds some complexity to the unconditional growth curve model:

$$Y_{ij} = \pi_{0j} + \pi_{1j}(\text{TIME})_{ij} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^2)$$

and

$$\pi_{0j} = \beta_{00} + \beta_{01}\,\text{COVAR}_j + u_{0j}, \qquad \pi_{1j} = \beta_{10} + \beta_{11}\,\text{COVAR}_j + u_{1j}$$

Centering is important in such a model, because as the model now stands, the interpretation of the fixed effects, $\beta_{00}$ and $\beta_{10}$, is based on a scenario in which the background covariate equals 0. As 0 is most likely not a meaningful value for the covariate, we center the covariate at the grand mean as follows:

$$Y_{ij} = \pi_{0j} + \pi_{1j}(\text{TIME})_{ij} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^2)$$

and

$$\pi_{0j} = \beta_{00} + \beta_{01}(\text{COVAR}_j - \overline{\text{COVAR}}) + u_{0j}, \qquad \pi_{1j} = \beta_{10} + \beta_{11}(\text{COVAR}_j - \overline{\text{COVAR}}) + u_{1j}$$

Substituting models yields:

$$Y_{ij} = \beta_{00} + \beta_{10}(\text{TIME})_{ij} + \beta_{01}(\text{COVAR}_j - \overline{\text{COVAR}}) + \beta_{11}(\text{COVAR}_j - \overline{\text{COVAR}})(\text{TIME})_{ij} + u_{0j} + u_{1j}(\text{TIME})_{ij} + r_{ij}$$

If we let CCOVAR represent the centered covariate, we can fit this model with the following syntax:

proc mixed noclprint covtest;

We have added two fixed effects to the MODEL statement, CCOVAR and the TIME*CCOVAR interaction. The RANDOM statement remains the same. The GCORR option will print the estimated correlation matrix among the random effects.

Exploring the Structure of the Within Person Variance-Covariance Matrix

The above syntax examples place a somewhat unrealistic assumption on the structure of the within person residuals. "Were we to fit a model in which only the intercepts vary across persons . . ., we would be assuming a compound symmetric error covariance matrix for each person" (Singer, 1997, p. 25). A compound symmetric matrix is a variance-covariance matrix in which every occasion has the same variance and every pair of occasions has the same covariance, no matter how far apart in time the occasions are, a rather unrealistic assumption for longitudinal data. In addition, when we fit individual slopes, we introduce heteroscedasticity into this residual matrix. However, one of the strengths of PROC MIXED is that it allows the user to explore different structures of the error covariance matrix. "By considering alternative structures for $\Sigma$ [the within-person error variance-covariance matrix] (that ideally derive from theory), and by comparing the goodness of fit of resulting models, the user can determine what type of structure is most appropriate for the data at hand" (Singer, 1997, p. 25). For details on the structure of this matrix, the interested reader is referred to pages 92-102 of the book SAS System for Mixed Models (Littell, Milliken, Stroup, & Wolfinger, 1996). $\Sigma$ is referred to as the R matrix in SAS PROC MIXED terminology. The structure of the R matrix is specified using a REPEATED statement.
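To make the idea of a structured R matrix concrete, here is a small sketch for a hypothetical study with four waves. It contrasts compound symmetry with a first-order autoregressive structure, another structure available in PROC MIXED. The symbols $\sigma^2$ (a common variance) and $\rho$ (a correlation) are generic notation assumed for illustration, and PROC MIXED's own parameterization of these structures may be written differently.

```latex
% Illustrative within-person error covariance structures for four waves.
% sigma^2 and rho are generic symbols chosen for this sketch, not
% estimates from any data set. Requires the amsmath package.
\[
\Sigma_{\text{CS}} = \sigma^2
\begin{pmatrix}
1    & \rho & \rho & \rho \\
\rho & 1    & \rho & \rho \\
\rho & \rho & 1    & \rho \\
\rho & \rho & \rho & 1
\end{pmatrix},
\qquad
\Sigma_{\text{AR(1)}} = \sigma^2
\begin{pmatrix}
1      & \rho   & \rho^2 & \rho^3 \\
\rho   & 1      & \rho   & \rho^2 \\
\rho^2 & \rho   & 1      & \rho   \\
\rho^3 & \rho^2 & \rho   & 1
\end{pmatrix}
\]
```

Under compound symmetry the correlation between any two waves is the same; under the first-order autoregressive structure it decays as the waves get farther apart. Both structures use two parameters, whereas an unstructured 4 x 4 matrix would need ten.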
With the assumption that the R matrix is compound symmetric, the PROC MIXED syntax would be as follows:

proc mixed noclprint covtest noitprint;

In the above syntax, WAVE is included as a CLASS variable. WAVE refers to the wave of data collection (i.e., 1st collection, 2nd collection, etc.). As a CLASS variable, WAVE is treated as a series of dummy-coded variables, as opposed to TIME, which is a continuous variable. This is because the variable specified in the REPEATED statement must be categorical. With the TYPE= option, we specify the structure of the R matrix, in this example, compound symmetry. Other possible structures include UN for unstructured and AR(1) for first-order autoregressive. The R option above requests that SAS print out the R matrix. The idea is to try several different error structures and to compare the goodness of fit statistics for the models specifying them. Please consult pp. 92-102 of SAS System for Mixed Models for details.

Now, combining the specification of the R matrix with the model we previously tested that included a person-specific background covariate, our SAS PROC MIXED syntax becomes:

proc mixed noclprint covtest noitprint;

The AR(1) option indicates an autoregressive structure with a lag of 1.

I hope that I have provided you with some information from which you can jump off into multilevel modeling. My opinion is that just as structural equation modeling has increased in popularity, so will multilevel modeling. The ability to test variance components and cross-level interactions is a particularly appealing feature of this up-and-coming approach. Psychology's Dr. Ke-Hai Yuan will be instructing a class on multilevel modeling in the fall, for those of you who would like to pick up another methodology class, or for those of you who are faculty and would like to sit in on a class. Please contact me, craigh@unt.edu, if I can assist you in any way in implementing multilevel models or with other help as well. Enjoy your researching, and good luck.

References

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage Publications.

Kreft, I. G. G. (1996). Are multilevel techniques necessary? An overview including simulation studies. Multilevel Models Project. http://www.ioe.ac.uk:80/multilevel/workpap.html.

Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Institute, Inc.

Singer, J. D. (1997). Using proc mixed to fit multilevel models, hierarchical models, and individual growth models. The Journal of Educational and Behavioral Statistics, in press.