Research and Statistical Support

MODULE 6

VIII. ANOVA and Linear Regression

The following covers some of the common SAS procedures for running intermediate-level statistical analyses. Use the Import Wizard to import the Example Data 1 file using the SPSS File (*.sav) source option, as was done previously.

1. One-way ANOVA

Some sources recommend PROC ANOVA for the one-way (single-factor) analysis; however, PROC ANOVA is designed for balanced data (i.e. each group has an equal number of cases). Although the one-way design is one of the few cases in which PROC ANOVA tolerates unequal group sizes, we frequently do not have balanced cells in more complex designs, so PROC GLM is preferred throughout. The current example compares different stimuli conditions on ability to recall at time 1. Our factor (independent variable), stimuli, has three levels (or conditions): spoken, written, and combined. Our dependent variable is the familiar recall at time 1 (recall1).

PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 = stimuli;
MEANS stimuli;
RUN;
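
Whether the cells are in fact balanced is easy to check before fitting the model. A minimal sketch, assuming only the example1 data set and the stimuli variable used above:

/* Count the cases in each stimuli condition */
PROC FREQ DATA=example1;
TABLES stimuli;
RUN;

Equal counts across the three conditions would indicate a balanced design.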

We can run post hoc tests (here with Tukey's version) by adding options after a slash on the MEANS statement.

PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 = stimuli;
MEANS stimuli / TUKEY;
RUN;

Here we request both the Tukey test and the Ryan-Einot-Gabriel-Welsch multiple range test (REGWQ) for our post hoc comparisons.

PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 = stimuli;
MEANS stimuli / TUKEY REGWQ;
RUN;
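
With unequal group sizes, an alternative sometimes used is to compare the least-squares means rather than the raw means. The sketch below is one way to request that and is not part of the original example; with ADJUST=TUKEY, SAS applies the Tukey-Kramer adjustment when the groups are unbalanced.

PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 = stimuli;
/* PDIFF prints p-values for all pairwise differences; CL adds confidence limits */
LSMEANS stimuli / PDIFF ADJUST=TUKEY CL;
RUN;
QUIT;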

2. Multi-way or Factorial ANOVA

Here we are looking for mean differences across two factors, with six total conditions, on ability to recall at time 1 (recall1). The first factor, stimuli, has three conditions and was described above. The second factor, candy, has two conditions (Skittles & no candy).

PROC GLM DATA=example1;
CLASS candy stimuli;
MODEL recall1 = candy stimuli;
MEANS stimuli / REGWQ;
MEANS candy stimuli;
RUN;
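
The MODEL statement above fits the two main effects only. If the stimuli by candy interaction is also of interest, it can be added as a crossed term. The following is a sketch of that variation, not a replacement for the model above.

PROC GLM DATA=example1;
CLASS candy stimuli;
/* candy|stimuli is shorthand for the same three terms */
MODEL recall1 = candy stimuli candy*stimuli;
/* Least-squares means for each of the six cells */
LSMEANS candy*stimuli;
RUN;
QUIT;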

3. One-way MANOVA

Here, we are testing for group differences on two dependent variables simultaneously using our familiar three groups of stimuli. First, we run a PROC MEANS to take a look at the descriptive statistics for each group across the two dependent variables. Then, we run the actual MANOVA.

PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS stimuli;
VAR recall1 recall2;
RUN;
PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 recall2 = stimuli / SS3;
CONTRAST 'Printed vs Spoken&Printed and Spoken' stimuli 2 -1 -1;
CONTRAST 'Spoken vs Printed and Spoken' stimuli 0 1 -1;
MANOVA h=_all_;
RUN;
QUIT;
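
The CONTRAST coefficients are matched to the levels of stimuli in the order SAS lists them, so it is worth confirming that order before interpreting the contrasts. A minimal sketch, using ODS SELECT to show only the Class Level Information table (ClassLevels is the ODS table name PROC GLM uses for it):

/* Show only the table that lists the CLASS levels in their sorted order */
ODS SELECT ClassLevels;
PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 recall2 = stimuli;
RUN;
QUIT;

The first coefficient in each CONTRAST statement applies to the first level listed in that table, the second coefficient to the second level, and so on.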

Given that our two dependent variables above are really the same variable measured at two points in time, it would be more appropriate to run a repeated measures ANOVA.

PROC GLM DATA=example1;
CLASS stimuli;
MODEL recall1 recall2 = stimuli;
REPEATED TIME 2 (0 1) / SUMMARY;
RUN;

4. Factorial MANOVA

Here, we are looking at differences between stimuli groups, as well as candy groups, on recall at time 1 and age. To begin, we will take a look at some of the descriptive statistics of our variables; then the correlation between our two dependent variables (age & recall1); then run the GLM procedure.

PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS stimuli;
VAR age;
RUN;
PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS candy;
VAR age;
RUN;
PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS stimuli;
VAR recall1;
RUN;
PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS candy;
VAR recall1;
RUN;
PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS stimuli candy;
VAR age recall1;
RUN;
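
The five PROC MEANS steps above can be condensed. One possible sketch uses the WAYS statement, which asks for the one-way breakdowns (by stimuli and by candy separately) and the two-way breakdown (stimuli by candy) in a single step:

PROC MEANS N MEAN STD MIN MAX DATA=example1;
CLASS stimuli candy;
/* 1 = each factor on its own; 2 = the stimuli by candy cells */
WAYS 1 2;
VAR age recall1;
RUN;

Note that a case with a missing value on either CLASS variable is dropped from every breakdown in this version, so the counts can differ slightly from those of the separate steps.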

PROC CORR DATA=example1;
VAR age recall1;
RUN;

PROC GLM DATA=example1;
CLASS stimuli candy;
MODEL age recall1 = stimuli candy / SS3;
CONTRAST 'Printed vs Spoken&Printed and Spoken' stimuli 2 -1 -1;
CONTRAST 'Spoken vs Printed and Spoken' stimuli 0 1 -1;
MANOVA h=_all_ / SUMMARY PRINTE;
RUN;

5. Linear Regression

Use the Import Wizard to import the 'regression_example_data.sav' file using the SPSS File (*.sav) source option and the member name 'red'.

PROC PRINT DATA=red;
RUN;

First, we'll do an ordinary least squares (OLS) linear regression with three predictors (prison, age, & peyrs) and apt as our outcome variable.

PROC REG DATA=red;
MODEL apt = prison age peyrs;
RUN;

SAS produces unstandardized regression coefficients by default. If you also want SAS to produce the standardized coefficients, add the STB (standardized beta) option after a slash following the name of the last predictor, as in the following example:

PROC REG DATA=red;
MODEL apt = prison age peyrs / STB;
RUN;

Next, we'll take a second look at the same regression model, but have SAS save the studentized residual and Cook's distance for each case and then plot one against the other.

PROC REG DATA=red;
MODEL apt = prison age peyrs;
OUTPUT OUT = T STUDENT = RES COOKD = COOKD;
RUN;
QUIT;
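/* Define the symbol and vertical axis that the PLOT statement refers to */
SYMBOL1 VALUE=DOT;
AXIS1 LABEL=(ANGLE=90 'Studentized residual');
PROC GPLOT DATA = T;
PLOT res*cookd = 1 / vaxis=axis1;
RUN;
QUIT;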
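
A common rule of thumb (an assumption here, not something the plot itself provides) flags cases whose Cook's distance exceeds 4/n, where n is the number of observations. A minimal sketch that lists those cases from the data set T created above; the data set name flagged is hypothetical:

DATA flagged;
/* NOBS= stores the number of observations in T in the temporary variable n */
SET T NOBS=n;
/* Keep only cases above the 4/n rule-of-thumb cutoff */
IF cookd > 4/n;
RUN;
PROC PRINT DATA=flagged;
RUN;

Here n counts every row in T, including any with missing values, so the cutoff is approximate if some cases were dropped from the regression.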

Now, we'll review the residual values, which is a three-stage process. We first generate a new variable, rabs, containing the absolute value of the studentized residuals (saved as res above). Then we sort the data on rabs in descending order. Finally, we list the first 50 observations.

DATA T2;
SET T;
RABS = abs(res);
RUN;
PROC SORT DATA=T2;
BY DESCENDING rabs;
RUN;
PROC PRINT DATA=T2 (obs=50);
RUN;

6. Robust Regression

Robust regression is done by Iterated Weighted Least Squares (IWLS), also called iteratively reweighted least squares. The procedure for running robust regression is PROC ROBUSTREG. Several weight functions are available for the M estimation used here; we are going to use the Huber weight function in this example. We can save the final weights created by the IWLS process, which is useful for seeing which cases were down-weighted. We will use the data set T2 generated above, which includes the original data and the diagnostic variables generated from the OLS regression model. Note in the output the presence of robust versions of the AIC and BIC (labeled AICR and BICR) for model fit.

PROC ROBUSTREG DATA=T2 METHOD=m (wf=huber);
MODEL apt = prison age peyrs;
OUTPUT OUT = test1 weight=wgt;
RUN;

Next, we'll take a look at the final weights from the robust regression. Sorting by wgt in ascending order puts the most heavily down-weighted cases first.

PROC SORT DATA=test1;
by wgt;
RUN;
PROC PRINT DATA=test1 (obs=50);
RUN;
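
With the Huber weight function, cases with small residuals keep a weight of 1 and only cases with larger residuals are down-weighted. A sketch that lists just the down-weighted cases, assuming the test1 data set and the wgt variable created above:

PROC PRINT DATA=test1;
/* Exclude missing weights explicitly, since SAS treats missing as less than any number */
WHERE wgt NE . AND wgt < 1;
RUN;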

Now let's compare the results of a regular OLS regression and a robust regression. If the results are very different, you will most likely want to use the results from the robust regression.

ODS LISTING CLOSE;
PROC REG DATA=red;
MODEL apt = prison age peyrs;
ODS OUTPUT PARAMETERESTIMATES = a;
RUN;
QUIT;
PROC ROBUSTREG DATA=T2 METHOD=m (wf=huber);
MODEL apt = prison age peyrs;
ODS OUTPUT PARAMETERESTIMATES = b;
RUN;
QUIT;
ODS LISTING;
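/* End each PROC PRINT with RUN; before changing the title, otherwise */
/* the second TITLE takes effect before the first table is printed    */
TITLE "OLS Regression";
PROC PRINT DATA=a;
RUN;
TITLE "Robust Regression";
PROC PRINT DATA=b;
RUN;
TITLE;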
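
To view the two sets of estimates side by side, the saved tables can be merged. The sketch below assumes that the parameter name is stored in Variable in the PROC REG table and in Parameter in the PROC ROBUSTREG table, and that both tables store the coefficient in Estimate; confirm this with PROC CONTENTS before relying on it. The data set name compare and the variable names ols_est and robust_est are hypothetical.

/* Check the variable names in the saved tables first */
PROC CONTENTS DATA=a;
RUN;
PROC CONTENTS DATA=b;
RUN;
/* One-to-one merge: rows are matched by position, not by parameter name */
DATA compare;
MERGE a (KEEP=Variable Estimate RENAME=(Estimate=ols_est))
      b (KEEP=Parameter Estimate RENAME=(Estimate=robust_est));
RUN;
TITLE "OLS vs. Robust Estimates";
PROC PRINT DATA=compare;
RUN;
TITLE;

Because the rows are matched by position, check that Variable and Parameter agree on each row. The robust table may also carry an extra row for the scale estimate, which has no OLS counterpart.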