Replace Missing Values Task:
Using 'ExampleData002', replace missing values on the Age
variable with the mean of Age.
Start by importing
ExampleData002.sav into the Data Editor window of PASW / SPSS (from this
point forward referred to simply as SPSS).
First, click on "Transform", then "Replace Missing Values..." in the menu bar
at the top of SPSS.
Now, you should see a dialog box similar to the one on the left, below.
Select and use the arrow to move Age [Age] from the available variables list to
the "New Variable(s):" box. By default, "Series mean" is selected as the
"Method"; however, it is not the only option and sometimes not the best
option. So, click on the dropdown arrow to the right of "Series mean" and
review the choices you have available. A review of the choices is shown on the
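right below. These options should be fairly self-explanatory, with one caveat.
Linear interpolation is not regression imputation: it fills each missing value
from the straight line between the last valid value before it and the first
valid value after it (in file order), and it does not replace a missing value
in the first or last case. The option that uses ordinary least squares (OLS)
regression is "Linear trend at point", which regresses the series on a case
index running from 1 to n and imputes the predicted values. Regression
imputation was quite popular until maximum likelihood (ML) imputation became
widely available. UNT does now (Version 19 site license) support the Missing
Values module of SPSS (which includes ML imputation and is covered in the second
tutorial of this Module [Module 6]). For the current example, we will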
use the default (series mean) for illustrative purposes. It is fairly widely
accepted that mean imputation should not be used unless the percentage of
missing data is very low (i.e., well below 5% missing). Click the "OK" button
to complete the imputation and creation of the new variable: Age_1.
This procedure creates a new variable, Age_1, which contains all the existing
values plus the mean of Age imputed in place of the missing values (blank
cells). Keeping the original variable, with its missing values intact, allows
us to try different methods of imputation and compare each method's new
variable to the others so that we can choose the option we feel is most
appropriate.
The output window also displays a table reporting the newly created variable
and the number of values that were imputed.
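For readers who want to see the arithmetic behind series mean imputation outside the dialog, here is a minimal sketch in plain Python. It is not SPSS's own code; the function name impute_series_mean and the ages below are hypothetical and are not taken from ExampleData002. It mirrors what the procedure above does: the original values stay untouched, a new series is created, and the number of replaced values is counted.

```python
from statistics import mean


def impute_series_mean(values):
    """Return a new list in which each None is replaced by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages; None marks a missing value (a blank cell in the Data Editor).
age = [23, 31, None, 45, 27, None, 38]

age_1 = impute_series_mean(age)            # new variable; `age` itself is unchanged
n_imputed = sum(v is None for v in age)    # how many values were replaced

print(age_1)
print("values imputed:", n_imputed)
```

Because the function returns a new list, the original series remains available for trying other imputation methods later.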
As a follow-up for comparison, let's also run the missing value imputation
using linear interpolation. Make sure you click the "Reset" button at the bottom
of the "Replace Missing Values" dialog box. Then, change the dropdown menu from
"Series mean" to "Linear interpolation". Next, select and move the Age [Age]
variable using the arrow button. Next, click your cursor into the "Name:" field
of the "Name and Method" area. Change the new variable name from Age_1 to Age_2
so that we do not overwrite the previously created Age_1 variable. Then, click
"OK" to complete the imputation and creation of Age_2.
Linear interpolation produces another table in the output which is virtually
indistinguishable from the previous table with the series mean.
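As a rough illustration of what linear interpolation does, the sketch below fills each missing value from the straight line between the nearest observed value before it and the nearest observed value after it, in case order. It is a plain Python approximation, not SPSS's implementation; the function name and the ages are hypothetical. It assumes that a missing value at the very start or end of the series, which has no neighbor on one side, is left missing.

```python
def impute_linear_interpolation(values):
    """Fill each None from the line between its nearest observed neighbors (case order)."""
    result = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Index of the last observed value before i and the first observed value after i.
        lo = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        hi = next((j for j in range(i + 1, n) if values[j] is not None), None)
        if lo is None or hi is None:
            continue  # leading or trailing gap: no neighbor on one side, leave missing
        step = (values[hi] - values[lo]) * (i - lo) / (hi - lo)
        result[i] = values[lo] + step
    return result


# Hypothetical ages; None marks a missing value.
age = [23, 31, None, 45, 27, None, None, 39]
age_2 = impute_linear_interpolation(age)
print(age_2)
```

Unlike the series mean, the imputed value here depends on where the missing case sits in the file, so sorting the data differently would change the result.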
Now, we want to find out whether there is a difference between
the imputation techniques. So, we click on "Analyze", "Descriptive Statistics",
"Frequencies..." because we want to compare the mean and standard deviation
(and/or variance), etc. of the two new variables to see if the imputation
techniques differ.
You should now see the Frequencies dialog. First, select (highlight) "SMEAN(Age)[Age_1]";
then, while holding down the Ctrl key on the keyboard, left-click
"LINT(Age)[Age_2]". Both variables should now be selected, and you can
move them with the arrow to the "Variables:" box as a pair. Next, click on the
"Statistics..." button.
You should now see the Statistics dialog. Select all the options just for good
measure. Then click "Continue" and then click the "Charts..." button on the
Frequencies dialog. Select "Histograms:" and "Show normal curve on histogram".
Then click the "Continue" button and, finally, click the "OK" button in the
Frequencies dialog.
You should now see a table in your output similar to the one below. Notice
that both imputation strategies resulted in identical values for all descriptive
statistics. This is because both strategies imputed the same value for cases 8
and 31 (the two cases with missing values on the original Age variable). If we
had a more realistic data set with more cases and more missing values, we might
find differences, which is why this tutorial showed how to compare two missing
value imputation techniques.
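To show how such a comparison can turn out when the two methods do impute different values, here is a small Python sketch. The two lists are hypothetical (they are not the Age_1 and Age_2 values from ExampleData002), and the helper name describe is an assumption for illustration. It computes the same kind of summary that the Frequencies output reports.

```python
from statistics import mean, stdev


def describe(values):
    """Return the count, mean, and sample standard deviation of a complete series."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


# Hypothetical imputed series built from the same seven cases, two of them originally missing.
age_1 = [23, 31, 32.8, 45, 27, 32.8, 38]   # series mean imputed in positions 3 and 6
age_2 = [23, 31, 38.0, 45, 27, 32.5, 38]   # linear interpolation in positions 3 and 6

summary_1 = describe(age_1)
summary_2 = describe(age_2)

for name, summary in (("Age_1", summary_1), ("Age_2", summary_2)):
    print(name, summary)
```

In this made-up example the means come out at about 32.8 and 33.5 and the standard deviations differ as well, which is the kind of difference the side-by-side table is meant to reveal.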
