If you are not familiar with
Bivariate Regression, then I strongly recommend returning to the previous
tutorial and reviewing it prior to reviewing this tutorial.
Multiple Linear Regression in SPSS.
Multiple regression simply refers to a regression model with multiple
predictor variables. Multiple regression, like any regression analysis, can serve a couple of different purposes.
Regression can be used for prediction or for determining variable importance, that is, how two or
more variables are related in the context of a model. There are a vast number of
types and ways to conduct regression. This tutorial will focus exclusively on
ordinary least squares (OLS) linear regression. As with many of the tutorials on
this web site, this page should not be considered a replacement for a good
textbook, such as:
Pedhazur, E. J. (1997). Multiple regression in behavioral research:
Explanation and prediction (3rd ed.). New York: Harcourt Brace.
For the duration of this tutorial, we will be using the data file RegData001.sav.
Standard Multiple Regression. Standard multiple regression is
perhaps one of the most popular statistical analyses. It is extremely
flexible and allows the researcher to investigate multiple variable
relationships in a single analysis context. The general interpretation of
multiple regression involves: (1) whether or not the regression model is
meaningful, (2) which variables contribute meaningfully to the model. The first
part is concerned with model summary statistics (given the assumptions are met), and
the second part is concerned with evaluating the predictor variables (e.g. their
coefficients).
Assumptions: Please notice the mention of assumptions above.
Regression also likely has the distinction of being the most frequently abused
statistical analysis, meaning it is often used incorrectly. There are many
assumptions of multiple regression analysis. It is strongly urged that one
consult a good textbook to review all the assumptions of regression, such as
Pedhazur (1997). However, some of the more frequently violated assumptions will
be reviewed here briefly. First, multiple regression works best under the
condition of proper model specification; essentially, you should have all the
important variables in the model and no unimportant variables in the model.
Literature reviews on the theory and variables of interest pay big dividends
when conducting regression. Second, regression works best when there is a lack
of multicollinearity. Multicollinearity is a big fancy word for: your predictor
variables are too strongly related, which degrades regression's ability to
discern which variables are important to the model. Third, regression is
designed to work best with linear relationships. There are types of regression
specifically designed to deal with nonlinear relationships (e.g. exponential,
cubic, quadratic, etc.); but standard multiple regression using ordinary least
squares works best with linear relationships. Fourth, regression is designed to
work with continuous or nearly continuous data. This one causes a great deal of
confusion, because 'nearly continuous' is a subjective judgment. A 9-point
Likert response scale item is NOT a continuous, or even nearly continuous,
variable. Again, there are special types of regression to deal with different
types of data, for example, ordinal regression for dealing with an ordinal
outcome variable, logistic regression for dealing with a binary (dichotomous)
outcome, multinomial logistic regression for dealing with a polytomous outcome
variable, etc. Furthermore, if you have one or more categorical predictor
variables, you cannot simply enter them into the model. Categorical predictors
need to be coded using special strategies in order to be included into a
regression model and produce meaningful interpretive output. The use of dummy
coding, effects coding, orthogonal coding, or criterion coding is appropriate
for entering a categorical predictor variable into a standard regression model.
Again, a good textbook will review each of these strategies, as each one lends
itself to particular purposes. Fifth, regression works best when outliers are
not present. Outliers can be very influential on correlation and, therefore, on
regression. Thorough initial data analysis should be used to review the data,
identify outliers (both univariate and multivariate), and take appropriate
action. A single, severe outlier can wreak havoc in a multiple regression
analysis; as an esteemed colleague is fond of saying...know thy data!
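As a rough illustration of the dummy coding strategy mentioned above, the following Python sketch turns a categorical predictor with k levels into k - 1 indicator (0/1) variables, with one level held out as the reference group. The function name dummy_code, the variable group, and its levels are hypothetical and are not part of RegData001.sav.

```python
def dummy_code(values, reference):
    """Return (names, rows): one 0/1 indicator per level, excluding the reference level."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level not found in the data")
    kept = [level for level in levels if level != reference]
    names = [f"is_{level}" for level in kept]
    # Each case gets a 1 in the column matching its level, 0 elsewhere.
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return names, rows


# Hypothetical three-level predictor; "control" is the reference group.
group = ["control", "drug_a", "drug_b", "drug_a", "control"]
names, rows = dummy_code(group, reference="control")
print(names)
for label, row in zip(group, rows):
    print(label, row)
```

Cases in the reference group are coded 0 on every indicator, so each indicator's coefficient is read as a difference from that group.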
To conduct a standard multiple regression using ordinary least squares (OLS),
start by clicking on Analyze, Regression, Linear...
Next, highlight the y variable and use the top
arrow button to move it to the Dependent: box. Then, highlight the x1 and x2
variables and use the second arrow to move them to the Independent(s): box.
Next, click on the Statistics... button. Select Confidence intervals,
Covariance matrix, Descriptives, and Part and partial correlations. Then, click
on the Continue button.
Next, click on Plots... Then, highlight ZRESID and use the top arrow button
to move it to the Y: box. Then, highlight ZPRED and use the bottom arrow button
to move it to the X: box. Then click on the Next
button (marked with a red ellipse here). Then, select Histogram and Normal
probability plot. Then, click the Continue button.
Next, click on the Save... button. Notice here you can have SPSS save a
variety of values into the data file. By selecting these options, SPSS will fill
in subsequent columns to the right of your data file with the values you select
here. It is recommended that one typically save some type of distance measure; here
we used Mahalanobis distance, which can be used to check for multivariate
outliers. Then click the Continue button and then click the OK button.
The output should be very similar to that displayed below, with the exception
of the new variable called MAH_1 which was created in the data set and includes
the values of Mahalanobis distance for each case.
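To make the saved distance values less of a black box, here is a minimal Python sketch of how a Mahalanobis distance can be computed for two predictors. It assumes the saved value is the squared distance of each case from the centroid of the predictors, which is how the measure is usually defined for this purpose; the data values and the function name squared_mahalanobis are hypothetical.

```python
def squared_mahalanobis(x1, x2):
    """Squared Mahalanobis distance of each (x1, x2) case from the predictor centroid."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    # Sample variances and covariance (n - 1 in the denominator).
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    distances = []
    for a, b in zip(x1, x2):
        d1, d2 = a - m1, b - m2
        # d' S^-1 d, written out using the inverse of the 2 x 2 covariance matrix.
        distances.append((s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det)
    return distances


# Hypothetical predictor scores; the last case is far from the others on x1.
x1 = [2, 4, 6, 8, 30]
x2 = [1, 3, 2, 5, 4]
mah = squared_mahalanobis(x1, x2)
for case, d in enumerate(mah, start=1):
    print(case, round(d, 3))
```

The case with the largest value is the one furthest from the centroid once the variances and the covariance of the predictors are taken into account.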
The output begins with the syntax generated by all of the pointing and
clicking we did to run the analysis.
Then, we have the Descriptive Statistics table, which includes the mean,
standard deviation, and number of observations for each variable selected
for the model.
Then, we have a correlation matrix table, which includes the correlation,
p-value, and number of observations for each pair of variables in the model.
Note, if you have an unequal number of observations for each pair, SPSS will
remove cases from the regression analysis which do not have complete data on
all variables selected for the model. This table should not be terribly
useful, as a good researcher will have already taken a look at the
correlations during initial data analysis (i.e. before running the
regression). One thing to notice here is the lack of multicollinearity: the
two predictors are not strongly related (r = .039, p = .350).
This is good, as it indicates adherence to one of the assumptions of
regression.
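One way to put a number on "not strongly related" is the tolerance and its reciprocal, the variance inflation factor (VIF). These statistics are not part of the output requested in this tutorial; the Python sketch below is only an illustration for the two-predictor case, where the R² from regressing one predictor on the other is simply their squared correlation. The function name is hypothetical.

```python
def tolerance_and_vif(r12):
    """Tolerance and VIF for a model with exactly two predictors correlated r12."""
    tolerance = 1.0 - r12 ** 2  # share of a predictor's variance not shared with the other
    return tolerance, 1.0 / tolerance


# Correlation between x1 and x2 reported in the correlation matrix above.
tolerance, vif = tolerance_and_vif(0.039)
print(round(tolerance, 4), round(vif, 4))  # 0.9985 1.0015
```

A VIF this close to 1 says the standard errors of the coefficients are essentially not inflated by overlap between the predictors.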
Next, we have the Variables Entered/Removed table, which as the name
implies, reports which variables were entered into the model.
Then, we have the Model Summary table. This table provides the Multiple
Correlation (R = .784), the Multiple Correlation squared (R²
= .614), the adjusted Multiple Correlation squared (adj.R² = .606),
and the Standard Error of the Estimate. The multiple correlation is the
correlation between the observed values of the outcome and the values
predicted from the full set of predictors. The multiple
correlation squared represents the proportion of variance in the outcome which
is accounted for by the predictors; here, 61.4% of the variance in y is
accounted for jointly by x1 and x2. However, as mentioned in a previous
tutorial, the multiple correlation squared is a bit optimistic, and
therefore, the adjusted R² is more appropriate. More appropriate
still for model comparison and model fit would be the use of the
Akaike Information Criterion (AIC; Akaike, 1974) or Bayesian Information
Criterion (BIC; Schwarz, 1978), neither of which is available in SPSS, but
both can be computed very easily (see the references at the bottom of the
page).
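The adjustment to R² and the two information criteria can be sketched in a few lines of Python. The sample size of 100 is inferred from the degrees of freedom in the ANOVA table, F(2, 97). The AIC and BIC formulas shown are one common least squares form, n·ln(SSE/n) plus a penalty; other texts add constants or count the error variance as an extra parameter, so treat this as a sketch under those assumptions. The residual sum of squares used below is a made-up value, since it is not reported in the text.

```python
from math import log


def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def aic_bic_from_sse(sse, n, n_params):
    """Least squares AIC and BIC, up to an additive constant.

    n_params counts the estimated coefficients, including the intercept.
    """
    base = n * log(sse / n)
    return base + 2 * n_params, base + n_params * log(n)


adj = adjusted_r2(0.614, n=100, k=2)
print(round(adj, 3))  # 0.606, matching the Model Summary table

# Hypothetical residual sum of squares, for illustration only.
aic, bic = aic_bic_from_sse(sse=5000.0, n=100, n_params=3)
print(round(aic, 2), round(bic, 2))
```

Only differences in AIC or BIC between models fitted to the same data are meaningful; the smaller value indicates the preferred model.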
Next, we have the ANOVA
summary table, which indicates that our model's R² is significantly
different from zero, F(2, 97) = 77.286, p < .001.
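The F statistic in the ANOVA table can be recovered from R² and the degrees of freedom. The short Python sketch below uses the rounded R² of .614, so it lands near, rather than exactly on, the reported 77.286; the function name is hypothetical.

```python
def f_from_r2(r2, n, k):
    """F test of R-squared against zero for n cases and k predictors."""
    df_model = k
    df_resid = n - k - 1
    f_value = (r2 / df_model) / ((1.0 - r2) / df_resid)
    return f_value, df_model, df_resid


f_value, df1, df2 = f_from_r2(0.614, n=100, k=2)
print(df1, df2, round(f_value, 2))
```

The small gap between this value and the table is rounding error from using R² to three decimals.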
Next we have the very informative Coefficients table. It is often preferred
to read this table by column from left to right, recognizing that each row of
information corresponds to an element of the regression model. The first two
columns contain unstandardized (or raw score) coefficients and their standard
errors. The Constant coefficient is simply the y-intercept term for the linear
best fit line representing our fitted model. The x1 and x2 unstandardized
coefficients represent the weight applied to each score (for each variable) to
produce new y scores along the best fit line. If predicting new scores is the
goal for your regression analysis, then here is one of the places where you will
be focusing your attention. The unstandardized coefficients are used to build
the linear regression equation one might use to predict new scores of y using
available scores of x1 and x2. The equation for the current example is below:
(1)
y = .810(x1) + .912(x2) + 221.314
or
y = 221.314 + .810(x1) + .912(x2)
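Using the equation above to predict a new score is just arithmetic, as the following Python sketch shows. The coefficients are those reported in the Coefficients table; the x1 and x2 scores passed in are hypothetical.

```python
# Unstandardized coefficients from the Coefficients table.
INTERCEPT = 221.314
B_X1 = 0.810
B_X2 = 0.912


def predict_y(x1, x2):
    """Predicted y for given x1 and x2 scores, using the fitted equation."""
    return INTERCEPT + B_X1 * x1 + B_X2 * x2


# Hypothetical new case with x1 = 10 and x2 = 20.
print(round(predict_y(10, 20), 3))  # 247.654
```

When both predictors are zero, the prediction is simply the constant, 221.314.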
Next, we have the Standardized Coefficients, which are typically reported in
social science journals (rather than the unstandardized coefficients) as a way
of interpreting variable importance because they can be directly compared (they
are in the same metric). They are sometimes referred to as slopes, but the
standardized coefficients use the symbol Beta (the lowercase Greek
letter β). Each one can be interpreted as the expected change in the outcome,
in standard deviation units, for a one standard deviation increase in that
predictor while the other predictors are held constant. Only in bivariate
regression, or when the predictors are uncorrelated with one another, does β
equal the correlation between a predictor and the
outcome variable. There is no constant or y-intercept term when referring to
standardized scores (sometimes called z-scores) because the y-intercept when
graphing them is always zero. The standardization transformation results in a
mean of 0 and a standard deviation of 1 for all variables so transformed. Next,
we have the calculated t-score for each unstandardized coefficient (coefficient
divided by standard error) and their associated p-value. Next, we have the
confidence intervals for each unstandardized coefficient as specified in the
point and click options. Then, we have the correlations for each predictor (as
specified in the options). SPSS labels the semipartial correlation as the Part
correlation.
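The link between the unstandardized and standardized coefficients can be seen in a small worked example. The Python sketch below fits a two-predictor least squares model from sums of squares and cross-products, then rescales each slope by the ratio of the predictor's standard deviation to the outcome's standard deviation. The data and the function name are hypothetical; this is not the RegData001.sav analysis.

```python
from math import sqrt


def fit_two_predictor_ols(x1, x2, y):
    """Least squares fit of y on x1 and x2; returns raw and standardized coefficients."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    syy = sum((c - my) ** 2 for c in y)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return {
        "constant": b0,
        "b1": b1,
        "b2": b2,
        # Standardized slope = raw slope * (SD of predictor / SD of outcome).
        "beta1": b1 * sqrt(s11 / syy),
        "beta2": b2 * sqrt(s22 / syy),
    }


# Hypothetical scores for six cases.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9.5, 7.8, 19.2, 18.1, 28.7, 28.3]
fit = fit_two_predictor_ols(x1, x2, y)
for name, value in fit.items():
    print(name, round(value, 3))
```

Because the standardized slopes are free of the original measurement units, they can be compared with one another, which the raw slopes cannot.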
Next, we have the Coefficient Correlations table, which as the name implies
displays the correlations and covariances among the estimated coefficients of our predictors.
Next, we have the Residuals Statistics table which displays descriptive
statistics for predicted values, adjusted predicted values, and residual
values. Residuals are the differences between the actual values of our
outcome y and the predicted values of our outcome y based on the model we
have specified. The table also produces descriptive summary statistics for
measures of multivariate distance and leverage, which allow us to get an
idea of whether or not we have outliers or influential data points.
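For readers who want to see the arithmetic behind residuals, here is a minimal Python sketch. It assumes a standardized residual is the raw residual divided by the standard error of the estimate, that is, the square root of the residual sum of squares over n - k - 1. The observed and predicted values are hypothetical, as is the function name.

```python
from math import sqrt


def residual_summary(observed, predicted, k):
    """Raw residuals, standard error of the estimate, and standardized residuals."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    sse = sum(r ** 2 for r in residuals)
    se_estimate = sqrt(sse / (n - k - 1))  # k is the number of predictors
    standardized = [r / se_estimate for r in residuals]
    return residuals, se_estimate, standardized


# Hypothetical observed and model-predicted outcome scores for five cases.
observed = [10, 12, 15, 11, 18]
predicted = [11, 12, 14, 12, 17]
residuals, se_estimate, standardized = residual_summary(observed, predicted, k=2)
print(residuals)
print(round(se_estimate, 4))
print([round(z, 4) for z in standardized])
```

A positive residual means the model under-predicted that case; a negative one means it over-predicted.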
Then, we have the Normal P-P Plot of Regression Standardized Residual
values. We expect the values to be very close to (or on top of) the
reference line, which would indicate very little deviation of the expected
values from the observed values.
Next, we have a histogram of the standardized residual values, which we
expect to be close to normally distributed around a mean of zero.
Now, we can return to the data view and evaluate our Mahalanobis distances
(MAH_1) to investigate the presence of outliers. Click on Analyze, Descriptive
Statistics, Explore...
Next, highlight the Mahalanobis Distance variable and use the top arrow
button to move it to the Dependent List: box. Then click on the Statistics...
button.
Next, select Descriptives, M-estimators, Outliers, and Percentiles; then
click the Continue button. Then click on the Plots... button and select
Stem-and-leaf, Histogram, and Normality plots with tests. Then click the
Continue button, then click the OK button.
The output should be similar to what is displayed below.
These first few tables are fairly intuitively named. Case Processing Summary
provides information on the number of cases used for the Explore function.
The Descriptives table provides the usual suspects in terms of
descriptive statistics for the Mahalanobis distances. Remember, you should
not be alarmed by the skewness and kurtosis because Mahalanobis distance
will always be non-normally distributed. If there are values less than zero,
you have a problem, because a distance cannot be negative.
The M-Estimators are robust, maximum-likelihood-type estimates of central
tendency which down-weight extreme cases, and so can be used when
outliers are present to limit their undue influence.
The Percentiles table simply reports the values of the
Mahalanobis distances at selected percentiles.
The Extreme Values table is very helpful and reports the highest and
lowest five cases for the variable specified; here Mahalanobis distance.
This allows us to see just how extreme the most outlying cases are, because
Mahalanobis distance is a multivariate measure of distance from the centroid
(the means of all the predictors).
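The tutorial judges extremity by eye. A convention described in texts such as Tabachnick and Fidell (2001), offered here only as a supplementary sketch, is to compare each squared Mahalanobis distance with a chi-square critical value whose degrees of freedom equal the number of predictors, at a strict alpha such as .001. With two predictors the chi-square distribution has a simple closed form, which the Python below relies on; it does not generalize to other degrees of freedom. The distance values and function names are hypothetical.

```python
from math import exp, log


def chi2_df2_critical(alpha):
    """Upper-tail critical value of a chi-square with 2 degrees of freedom."""
    return -2.0 * log(alpha)


def chi2_df2_pvalue(distance):
    """Upper-tail probability of a chi-square (df = 2) value."""
    return exp(-distance / 2.0)


cutoff = chi2_df2_critical(0.001)
print(round(cutoff, 3))  # 13.816

# Hypothetical squared Mahalanobis distances for five cases.
mah = [0.4, 2.1, 5.9, 14.2, 9.7]
flagged = [index for index, d in enumerate(mah) if d > cutoff]
print(flagged)  # [3]
```

Cases flagged this way are candidates for closer inspection, not automatic deletion.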
The Tests of Normality table reports two tests of normality, meaning they
test whether or not the distribution of the specified variable is
significantly different from a normal distribution. Here, it is not
terribly useful because we know Mahalanobis distance is not
normally distributed (i.e. it is always positively skewed).
The next four graphical displays simply show the distribution of
Mahalanobis distances. Of note is the bottom of the Stem-and-Leaf plot,
which shows that 3 values are extreme; these can also be seen in the Extreme Values
table and the Normal Q-Q Plots on the second row below.
Finally, we have the wonderful box plot which displays the distribution of
Mahalanobis distances intuitively and identifies extreme values with either
a circle (as is the case here) or an asterisk (which is the case when values
are well beyond the whiskers of the box plot).
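The circle versus asterisk distinction follows the usual box plot rule: points more than 1.5 box lengths beyond the edge of the box are marked as outliers, and points more than 3 box lengths beyond it as extreme values. The Python sketch below applies that rule to hypothetical distances. It uses the quartiles from the standard library's statistics.quantiles, which may differ slightly from the hinges SPSS computes, so borderline cases could be classified differently.

```python
from statistics import quantiles


def classify_boxplot_points(values):
    """Label points beyond 1.5 (outlier) or 3 (extreme) box lengths from the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    flagged = []
    for index, value in enumerate(values):
        # Distance outside the box; negative when the value lies inside it.
        outside = max(q1 - value, value - q3)
        if outside > 3 * box_length:
            flagged.append((index, value, "extreme"))
        elif outside > 1.5 * box_length:
            flagged.append((index, value, "outlier"))
    return flagged


# Hypothetical Mahalanobis distances for ten cases.
mah = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.5, 6.0, 12.0]
flagged = classify_boxplot_points(mah)
for index, value, label in flagged:
    print(index, value, label)
```

In this made-up example the value 6.0 would be drawn as a circle and 12.0 as an asterisk.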
This concludes the standard multiple regression section. The
next section focuses on
multiple regression while investigating the influence of a covariate.
REFERENCES & RESOURCES
Achen, C. H. (1982). Interpreting and using regression. Series: Quantitative
Applications in the Social Sciences, No. 29. Thousand Oaks, CA: Sage Publications.
Akaike, H. (1974). A new look at the statistical model identification.
IEEE Transactions on Automatic Control, AC-19, 716–723.
Allison, P. D. (1999). Multiple regression. Thousand Oaks, CA: Pine Forge
Press.
Cohen, J. (1968). Multiple regression as a general data-analytic system.
Psychological Bulletin, 70(6), 426–443.
Hardy, M. A. (1993). Regression with dummy variables. Series:
Quantitative Applications in the Social Sciences, No. 93. Thousand Oaks, CA: Sage
Publications.
Harrell, F. E., Lee, K. L., & Mark, D. B. (1996). Multivariate prognostic
models: Issues in developing models, evaluating assumptions and adequacy, and
measuring and reducing errors. Statistics in Medicine, 15, 361–387.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the
American Statistical Association, 90, 773–795.
Pedhazur, E. J. (1997). Multiple regression in behavioral research:
Explanation and prediction (3rd ed.). New York: Harcourt Brace.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of
Statistics, 6, 461–464.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate
statistics (4th ed.). Boston: Allyn and Bacon.
