Bivariate Linear Regression in SPSS.
Regression analysis can serve a couple of different purposes. Generally,
regression is used to predict values or scores on the outcome variable using one
or more predictor variables. However, regression is also often used to assess
variable importance, meaning how two or more variables are related in the
context of a model. There are a vast number of regression types and ways to
conduct them. This tutorial focuses exclusively on ordinary least squares (OLS)
linear regression. As with many of the tutorials on this web site, this page
should not be considered a replacement for a good textbook, such as:
Pedhazur, E. J. (1997). Multiple regression in behavioral research:
Explanation and prediction (3rd ed.). New York: Harcourt Brace.
For the duration of this tutorial, we will be using RegData001.sav.
Regression Assumptions: Regression is perhaps the most popular form of
statistical analysis. Unfortunately, regression also likely has the distinction of being the most
frequently abused statistical analysis, meaning it is often used incorrectly.
There are many assumptions of regression analysis. It is strongly urged that one
consult a good textbook (such as Pedhazur, 1997) to review all the assumptions
of regression. However, some of the more frequently violated
assumptions will be reviewed here briefly. First, regression works best
under the condition of proper model specification; essentially, you should have
all the important variables in the model and no unimportant variables in the
model. Literature reviews on the theory and variables of interest pay big
dividends when conducting regression. Second, regression works best when there
is a lack of multicollinearity. Multicollinearity is a big fancy word for: your
predictor variables are too strongly related, which degrades regression's
ability to discern which variables are important to the model. Third, regression
is designed to work best with linear relationships. There are types of
regression specifically designed to deal with nonlinear relationships (e.g.
exponential, cubic, quadratic, etc.); but standard multiple regression using
ordinary least squares works best with linear relationships. Fourth, regression
is designed to work with continuous or nearly continuous data. This one causes a
great deal of confusion, because 'nearly continuous' is a subjective judgment. A
9-point Likert response scale item is NOT a continuous, or even nearly
continuous, variable. Again, there are special types of regression to deal with
different types of data, for example, ordinal regression for dealing with an
ordinal outcome variable, logistic regression for dealing with a binary
(dichotomous) outcome, multinomial logistic regression for dealing with a
polytomous outcome variable, etc. Furthermore, if you have one or more
categorical predictor variables, you cannot simply enter them into the model.
Categorical predictors need to be coded using special strategies in order to be
included into a regression model and produce meaningful interpretive output. The
use of dummy coding, effects coding, orthogonal coding, or criterion coding is
appropriate for entering a categorical predictor variable into a standard
regression model. Again, a good textbook will review each of these
strategies, as each one lends itself to a particular purpose. Fifth, regression
works best when outliers are not present. Outliers can be very influential on
correlation and, therefore, on regression. Thorough initial data analysis should be
used to review the data, identify outliers (both univariate and multivariate),
and take appropriate action. A single, severe outlier can wreak havoc in a
multiple regression analysis; as an esteemed colleague is fond of saying: know
thy data!
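Dummy coding, the first of the coding strategies mentioned above, can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure: the function name dummy_code, the group labels, and the choice of "control" as the reference level are all hypothetical and do not come from RegData001.sav.

```python
def dummy_code(values, reference):
    """Build 0/1 indicator columns for every level except the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        level: [1 if value == level else 0 for value in values]
        for level in levels
    }


# Hypothetical three-level categorical predictor (not from RegData001.sav).
group = ["control", "drug_a", "drug_b", "drug_a", "control"]
dummies = dummy_code(group, reference="control")

for level, column in dummies.items():
    print(level, column)
# drug_a [0, 1, 0, 1, 0]
# drug_b [0, 0, 1, 0, 0]
```

A predictor with k levels yields k - 1 indicator columns; the reference level is the one coded zero on all of them, so each coefficient is read as a comparison against that level.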
Bivariate Regression. The simplest form of regression is
bivariate regression, in which one variable is the outcome and one is the
predictor. Very little information can be extracted from this type of analysis.
The most meaningful statistic is likely to be the correlation coefficient
squared (R²), which refers to
the proportion of variance in one variable accounted for by the other.
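For readers who want to see what sits behind these numbers, here is a minimal sketch of a bivariate OLS fit in plain Python. The data are made up for illustration (they are not the values in RegData001.sav), and the function name fit_bivariate is hypothetical.

```python
from math import sqrt


def fit_bivariate(x, y):
    """Return (slope, intercept, r, r_squared) for a one-predictor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx                      # unstandardized coefficient (b)
    intercept = mean_y - slope * mean_x    # y-intercept (a)
    r = sxy / sqrt(sxx * syy)              # Pearson correlation
    return slope, intercept, r, r ** 2


# Hypothetical data, for illustration only.
x1 = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r, r_squared = fit_bivariate(x1, y)
print(f"b = {slope:.3f}, a = {intercept:.3f}, r = {r:.3f}, R^2 = {r_squared:.3f}")
# b = 0.600, a = 2.200, r = 0.775, R^2 = 0.600
```

With a single predictor, R² is simply the squared Pearson correlation between the two variables, which is why it can be read as the proportion of variance in one accounted for by the other.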
Start by clicking on Analyze, Regression, Linear...
Next, highlight the y variable and use the top arrow button to move it to the
Dependent: box. Then, highlight the x1 variable and use the second arrow button
to move it to the Independent(s): box.
Next, click on the Statistics... button and select Confidence intervals and
Covariance matrix (Estimates & Model fit should be selected by default). Then,
click the Continue button.
Next, click on the Plots... button. Then, highlight *ZPRED and use the second
arrow button to move it to the X: box. Then highlight *ZRESID and use the top
arrow button to move it to the Y: box. Then, select Histogram and Normal
probability plot. Next, click the Next button (marked with a
red ellipse in the figure to the right). Finally,
click the Continue button.
Next, click the OK button to complete the regression analysis and produce
output similar to that displayed below.
Interpretation of the Model Summary table shows us that the multiple
correlation coefficient (R; with a single predictor, simply the correlation
between the predictor and the outcome variable) is .547, which when squared
gives us .300. This can be thought of as the proportion of variance in the
outcome variable that is accounted for by the predictor variable, generally
expressed as: 30% of the variance in y was accounted for by x1. However, R²
tends to be slightly optimistic; therefore, a more appropriate metric is the
adjusted multiple correlation coefficient squared (adj. R² = .293). Next,
we see the ubiquitous ANOVA table, which simply tests whether or not our model
is significantly better than just using the mean of y to predict new values of
y. Here, our model is significantly better. Another way of thinking about this
ANOVA table is that it tests whether or not the R² is significantly different
from zero. Next, we see our Coefficients table, which gives the unstandardized
and standardized coefficients (for building a regression equation) as well as a
t test for each. So, if we wanted to predict new raw scores on the
outcome variable (y), we would use the following equation:
(1) y = .779*x1 + 366.485
where .779 is the unstandardized coefficient for the predictor (often called
the b-weight) and 366.485 is the y-intercept term (often called a). The t test
for the constant or y-intercept has virtually no meaning. The t test for the
predictor coefficient tests whether or not the coefficient is significantly
different from zero. The standardized coefficient (often called Beta and given
the symbol β) represents the correlation between the predictor and the outcome.
As you can see, in the case of only one predictor, it is the same as the
multiple correlation (R). If we were interested in predicting new standardized
scores of the outcome (y), then we would use the following regression equation:
(2) Z_{y} = .547*Z_{x1}
where .547 represents the coefficient for the standardized predictor. There is
no intercept term when dealing with standardized scores because the regression
line passes through the point where both variables are at their means (commonly
called the centroid), which for standardized scores is zero on both the x and y
axes.
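Two pieces of the interpretation above can be sketched in Python: the two prediction equations and the adjustment applied to R². The coefficients are the ones reported in the output above. The input scores (x1 = 100, Z_{x1} = 1.0) are made up, and the sample size n = 102 is a hypothetical value chosen for illustration, since this section does not state the actual n. The function names are likewise illustrative.

```python
B_WEIGHT = 0.779      # unstandardized coefficient, from equation (1)
INTERCEPT = 366.485   # y-intercept, from equation (1)
BETA = 0.547          # standardized coefficient, from equation (2)


def predict_raw(x1):
    """Equation (1): predicted raw score on y for a raw score on x1."""
    return B_WEIGHT * x1 + INTERCEPT


def predict_standardized(z_x1):
    """Equation (2): predicted z-score on y for a z-score on x1 (no intercept)."""
    return BETA * z_x1


def adjusted_r_squared(r_squared, n, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Hypothetical inputs, for illustration only.
print(round(predict_raw(100), 3))            # 444.385
print(round(predict_standardized(1.0), 3))   # 0.547
# n = 102 is an assumed sample size, not a value taken from the tutorial.
print(round(adjusted_r_squared(0.300, n=102, n_predictors=1), 3))  # 0.293
```

The adjustment shrinks R² more when the sample is small or the number of predictors is large, which is why the adjusted value is the more conservative figure to report.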
The next table gives us the
correlation and covariance matrix for our coefficient(s). Then, we have the
residual descriptive statistics table, which displays descriptive summary
statistics for the residuals, also called errors of prediction (y - yhat). This
table is followed by a histogram of the residuals, which we expect to be
normally distributed, and finally a diagnostic plot (the normal probability
plot) showing the expected versus observed probability values.
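The residuals summarized in that table are just the differences between observed and predicted scores. The sketch below computes them in Python for a small made-up data set (not RegData001.sav). It assumes that a standardized residual is the raw residual divided by the standard error of the estimate, which is the usual definition; treat that as an assumption rather than a statement about SPSS internals.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, for illustration only.
x1 = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Fit the bivariate OLS line.
mean_x, mean_y = mean(x1), mean(y)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x1, y))
         / sum((a - mean_x) ** 2 for a in x1))
intercept = mean_y - slope * mean_x

# Predicted scores (yhat) and residuals (y - yhat).
y_hat = [intercept + slope * a for a in x1]
residuals = [observed - predicted for observed, predicted in zip(y, y_hat)]

# Standard error of the estimate: sqrt(SS_residual / (n - 2)) for one predictor.
std_error = sqrt(sum(e ** 2 for e in residuals) / (len(y) - 2))
z_residuals = [e / std_error for e in residuals]

print([round(e, 3) for e in residuals])   # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(abs(sum(residuals)) < 1e-9)         # True: OLS residuals sum to zero
print(round(std_error, 3))                # 0.894
```

Because OLS residuals always sum to zero when the model includes an intercept, the mean of the residuals is zero; the minimum, maximum, and standard deviation are the informative entries.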
As shown in a previous tutorial,
we can get an informative scatterplot to represent our bivariate regression by
clicking on Graphs, Legacy Dialogs, Scatter/Dot...
Next, click the Define button (the default
Simple Scatter is appropriate). Then highlight the y variable and use the top
arrow button to move it to the Y Axis: box. Then, highlight the x1 variable and
use the second arrow button to move it to the X Axis: box.
Next, click the OK button to create the simple
scatterplot.
Next, right click on the scatterplot in the
output and select Edit Content, In Separate Window to bring up the chart editor.
Using the Chart editor, right click on the
actual data points in the scatter plot (in the chart editor), at which point
they should turn a yellow color. Then, select Add Fit Line at Total.
Next, simply left click somewhere in the white
space of the output (outside the chart editor). You should now see something
similar to what is displayed below. Note that the y-intercept does not seem to
match what is in the table above because the scale of the x-axis begins at
approximately 80 rather than zero.
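A quick calculation makes the apparent mismatch concrete. Using the coefficients from equation (1), the sketch below evaluates the fit line at x1 = 0 (the true y-intercept) and at x1 = 80 (roughly where the plotted x-axis starts). The function name fitted_line is illustrative, and 80 is only the approximate axis origin mentioned above.

```python
def fitted_line(x1):
    """Fit line from equation (1): y = .779*x1 + 366.485."""
    return 0.779 * x1 + 366.485


print(round(fitted_line(0), 3))    # 366.485, the intercept reported in the table
print(round(fitted_line(80), 3))   # 428.805, height of the line where x1 is 80
```

The line therefore meets the left edge of the plot well above 366.485, even though the intercept in the Coefficients table is correct.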
This concludes the bivariate regression section. The next section focuses on
Multiple Linear Regression.
