Research and Statistical Support

UIT | ACUS | Help Desk | Training | About Us | Publications | RSS Home

Return to the SPSS Short Course

MODULE 6

Replace Missing Values

Task: Using 'ExampleData002', replace missing values on the Age variable with the mean of Age.

Start by opening ExampleData002.sav in the Data Editor window of PASW / SPSS (from this point forward referred to simply as SPSS).

First, click on "Transform", then "Replace Missing Values..." in the menu bar at the top of SPSS.

Now, you should see a dialog box similar to the one on the left, below. Select Age [Age] in the available variables list and use the arrow to move it to the "New Variable(s):" box. By default, "Series mean" is selected as the "Method"; however, it is not the only option and often not the best one. Click on the drop-down arrow to the right of "Series mean" to review the available choices, which are shown on the right, below.

Most of these options are fairly self-explanatory. Linear interpolation fills each missing value from the straight line between the last valid value before it and the first valid value after it; a missing value in the first or last case is not replaced. Linear trend at point is the regression-based option: it regresses the existing series on a case index using ordinary least squares (OLS) and imputes the predicted values. Regression imputation of this general kind was quite popular until maximum likelihood (ML) imputation became widely available. UNT now (Version 19 site license) supports the Missing Values module of SPSS, which includes ML imputation and is covered in the second tutorial of this module (Module 6).

For the current example, we will use the default (series mean) for illustrative purposes. It is fairly widely accepted that mean imputation should not be used unless the percentage of missing data is very low (i.e., well below 5% missing). Click the "OK" button to complete the imputation and create the new variable, Age_1.

                 

You'll notice that a new variable called Age_1 is created during this procedure. It contains all the existing values of Age, with the mean of Age imputed in place of the missing values (blank cells). Keeping the original variable, with its missing values, is important because it allows us to try different imputation methods and compare the resulting variables so that we can choose the one we feel is most appropriate.
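For readers who want to see the arithmetic behind series-mean imputation outside of SPSS, here is a minimal Python sketch. It is not SPSS syntax and not part of the original tutorial; the function name `impute_series_mean` and the age values are hypothetical and are not taken from ExampleData002. As in SPSS, the original series is left untouched and the imputed values go into a new series.

```python
from statistics import mean

def impute_series_mean(values):
    """Return a copy of values with None replaced by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages (NOT from ExampleData002); None marks a missing value.
age = [23, 31, None, 40, 28, None, 34]

# New series, analogous to Age_1; the original list is not modified.
age_1 = impute_series_mean(age)

# Number of values that were replaced, analogous to the count in the output table.
n_replaced = sum(v is None for v in age)

print(age_1)
print(n_replaced)
```

Every missing entry receives the same fill value, which is why this method leaves the mean unchanged but reduces the spread of the variable.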

The output window also displays a table showing the newly created variable and the number of missing values that were replaced.

As a follow-up for comparison, let's also run the missing value imputation using Linear Interpolation. First, click the "Reset" button at the bottom of the "Replace Missing Values" dialog box. Then, change the drop-down menu from Series Mean to Linear Interpolation. Next, select the Age [Age] variable and move it using the arrow button. Next, click your cursor into the "Name:" field of the "Name and Method" area. Change the new variable name from Age_1 to Age_2 so that we do not overwrite the previously created Age_1 variable. Then, click "OK" to complete the imputation and create Age_2.

Linear interpolation produces another table in the output, which is virtually indistinguishable from the previous table for the series mean.
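The following Python sketch illustrates plain linear interpolation between the nearest observed neighbours, which is how the SPSS documentation describes the Linear interpolation method. It is an illustration under stated assumptions rather than SPSS's own implementation: the function name `impute_linear_interpolation` and the data are hypothetical, and cases are assumed to be equally spaced in file order.

```python
def impute_linear_interpolation(values):
    """Fill interior runs of None on the straight line between the
    nearest observed values on either side.

    Leading or trailing None entries have no neighbour on one side,
    so they are left as None.
    """
    result = list(values)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of observed positions and fill any gap between them.
    for left, right in zip(observed_idx, observed_idx[1:]):
        gap = right - left
        step = (values[right] - values[left]) / gap
        for k in range(1, gap):
            result[left + k] = values[left] + step * k
    return result

# Hypothetical ages (NOT from ExampleData002); None marks a missing value.
age = [23, 31, None, 40, 28, None, None, 34]

# New series, analogous to Age_2; the original list is not modified.
age_2 = impute_linear_interpolation(age)
print(age_2)
```

Unlike the series mean, the imputed value here depends on where the missing case sits in the file, so sorting the data differently would change the result.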

Now, we want to find out whether the two imputation techniques produce different results. Click on "Analyze", "Descriptive Statistics", "Frequencies..." so that we can compare the mean, standard deviation (and/or variance), and other statistics of the two new variables.

You should now see the Frequencies dialog. First, select (highlight) "SMEAN(Age)[Age_1]"; then hold down the Ctrl key and left-click "LINT(Age)[Age_2]". Both variables should now be selected, and you can move them as a pair to the "Variables:" box with the arrow. Next, click on the "Statistics..." button.

You should now see the Statistics dialog. Select all the options, just for good measure. Then click "Continue" and click the "Charts" button on the Frequencies dialog. Select "Histograms:" and "Show normal curve on histogram". Then click the "Continue" button and, finally, click the "OK" button in the Frequencies dialog.

         

You should now see a table in your output similar to the one below. Notice that both imputation strategies resulted in identical values for all descriptive statistics. This is because both strategies imputed the same value for cases 8 and 31 (the two cases with missing values on the original Age variable). With a more realistic data set containing more cases and more missing values, we might find differences, which is why this tutorial showed how to compare two missing value imputation techniques.
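The same kind of comparison can be sketched in Python using the standard library. The two series below are hypothetical imputed versions of a made-up age variable (not ExampleData002), and the helper name `describe` is an assumption for illustration. Unlike the tutorial data, these made-up values were chosen so that the two methods disagree, which shows what a difference would look like.

```python
from statistics import mean, stdev

def describe(values):
    """Return the basic descriptive statistics used to compare imputed variables."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

# Hypothetical imputed series (NOT from ExampleData002).
# Positions 2 and 5 were missing in the made-up original.
age_1 = [23, 31, 31.2, 40, 28, 31.2, 34]   # series mean imputed
age_2 = [23, 31, 35.5, 40, 28, 31.0, 34]   # linear interpolation imputed

stats_1 = describe(age_1)
stats_2 = describe(age_2)

# Print the statistics side by side, one row per statistic.
for name in ("n", "mean", "sd"):
    print(f"{name:>4}  Age_1={stats_1[name]:.3f}  Age_2={stats_2[name]:.3f}")
```

If the two columns match on every row, as they do for the tutorial data, the choice of method made no difference for that data set.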

Contact Information

Jon Starkweather, PhD

Jonathan.Starkweather@unt.edu

940-565-4066

Richard Herrington, PhD

Richard.Herrington@unt.edu

940-565-2140

Last updated: 01/21/14 by Jon Starkweather.
