Conduct multiple imputation for missing values using a version of the Expectation
Maximization (EM) algorithm. The user manual for the Missing Values module can
be found at the Manuals page. For a more detailed treatment of the more general
topic of missing value analysis, see Little and Rubin (1987).
The SPSS Missing Values module is implemented in two distinct ways. First,
the Missing Values Analysis (MVA) menu option produces a series of tables and
figures which describe the pattern of missingness, estimates of basic
descriptive statistics (means, standard deviations, correlations, covariances)
based on a user-specified method (e.g. EM), and imputes values based on the
specified method. It is important to note that this approach is single (or
simple) imputation rather than multiple imputation, which is widely accepted as
superior. To read about why simple imputation is not a good idea, see
Von Hippel (2004). The second (newer) way the module is implemented is
through the use of the Multiple Imputation menu option which itself contains two
menu options (i.e. functions): Analyze Patterns and Impute Missing Data Values.
These two options will be covered below.
Start off by importing the DataMissing.sav file into the Data Editor window of
SPSS. The data are simulated (i.e. fictitious) and were originally generated
with all the values intact. Subsequently, a function was used to randomly
remove approximately 5% of the values (using the statistical programming
environment R). The data contain 18 variables: one case identification
variable, five categorical variables (nominal and ordinal scaled), and twelve
continuous or nearly continuous variables (considered interval or ratio
scaled). The data contain 1500 cases.
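To make the data preparation step concrete, here is a minimal sketch of removing roughly 5% of the values completely at random. The original removal was done in R and that code is not shown, so this Python version, its variable names (age, absences, score), and its seeds are purely illustrative assumptions.

```python
import random

def remove_at_random(rows, columns, fraction, seed):
    """Return a copy of rows with roughly `fraction` of the cells in `columns` set to None."""
    rng = random.Random(seed)
    damaged = []
    for row in rows:
        new_row = dict(row)  # copy, so the complete data stay intact
        for col in columns:
            if rng.random() < fraction:
                new_row[col] = None  # None plays the role of a blank (missing) cell
        damaged.append(new_row)
    return damaged

# Hypothetical complete data: 1500 cases, an id plus three made-up variables.
rng = random.Random(1)
complete = [
    {"id": i, "age": rng.randint(18, 30), "absences": rng.randint(0, 10),
     "score": rng.gauss(50, 10)}
    for i in range(1500)
]

# The id variable is left out of the column list, so it never loses values.
with_missing = remove_at_random(complete, ["age", "absences", "score"], 0.05, seed=2)
n_missing = sum(row[c] is None for row in with_missing for c in ("age", "absences", "score"))
print(n_missing, "of", 1500 * 3, "values removed")
```

Because each cell is removed independently of every other value, the missingness produced this way does not depend on the data themselves.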
1.) Evaluation of Missing Values
First, click on "Analyze", then "Multiple Imputation",
then "Analyze Patterns..." in the toolbar
at the top of SPSS.
Next, select all the variables (excluding the case identification variable) and
move them to the Analyze Across Variables: box. Only 17 variables are included
in the analysis, so the maximum number of variables displayed (25) is large
enough to show all of them. Further control over what is displayed in the
output can be exercised by changing the minimum percentage missing cutoff
value. The default (10%) indicates that only variables with 10% or more missing
values will be displayed in the output. Since it is generally good practice to
review all the patterns of missingness, we will change the percentage cutoff to
0.01%, thus ensuring all variables are included in the output (right figure
below). Then, we can click the OK button to proceed.
The following output was produced.
First, a figure with three pie charts displays the number and percentage of
variables (left), cases (center), and individual values (right) which are
affected by missing data. Note that green indicates missing; for instance, the
Variables (left) pie indicates that 17 variables (100% of those included in the
analysis) have at least one missing value. The Cases (center) pie indicates
that 860 (57.33%) of the 1500 cases contain at least one missing value. The
Values (right) pie indicates that approximately 5% of all values are missing
(i.e. 17 variables multiplied by 1500 cases equals 25500 values, of which 1246,
or 4.886%, are missing).
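The three pies are simply three ways of counting the same blank cells. The sketch below reproduces that counting on a tiny made-up data set (the variables a, b, and c are hypothetical) and then repeats the arithmetic for the figures reported above.

```python
def missing_summary(rows, columns):
    """Count the variables, cases, and individual values affected by missing data."""
    vars_with_missing = sum(any(row[c] is None for row in rows) for c in columns)
    cases_with_missing = sum(any(row[c] is None for c in columns) for row in rows)
    values_missing = sum(row[c] is None for row in rows for c in columns)
    return {
        "variables": (vars_with_missing, len(columns)),
        "cases": (cases_with_missing, len(rows)),
        "values": (values_missing, len(rows) * len(columns)),
    }

# Toy data: None marks a missing value.
toy = [
    {"a": 1, "b": None, "c": 3},
    {"a": 4, "b": 5, "c": 6},
    {"a": None, "b": None, "c": 9},
    {"a": 10, "b": 11, "c": 12},
]
summary = missing_summary(toy, ["a", "b", "c"])
print(summary)  # 2 of 3 variables, 2 of 4 cases, 3 of 12 values

# The same arithmetic for the Values pie described in the text.
pct_values_missing = 100 * 1246 / (17 * 1500)
print(round(pct_values_missing, 3))  # 4.886
```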
Next, the Variable Summary table displays the variables which contained at
least 0.01% missing values (as specified). The number of missing values,
percentage missing, number of valid values, mean based on valid values, and
standard deviation based on valid values are displayed for each of the 17
variables. Notice that the variables are ordered by the percentage of values
they are missing. Extroversion is listed first because it has the highest
percentage of missing values (5.5%). Neuroticism is listed last
because it has the lowest percentage of missing values (3.9%).
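For readers who want to see how such a summary is assembled, the following sketch computes the same columns (number and percentage missing, number valid, and the mean and standard deviation of the valid values) and sorts the variables from most to least missing. The data and the variable names x and y are invented for illustration.

```python
import statistics

def variable_summary(rows, columns):
    """Build one summary row per variable, ordered by percentage missing (highest first)."""
    table = []
    for col in columns:
        valid = [row[col] for row in rows if row[col] is not None]
        n_missing = len(rows) - len(valid)
        table.append({
            "variable": col,
            "n_missing": n_missing,
            "pct_missing": 100 * n_missing / len(rows),
            "n_valid": len(valid),
            "mean": statistics.mean(valid),   # based on valid values only
            "sd": statistics.stdev(valid),    # sample standard deviation of valid values
        })
    return sorted(table, key=lambda r: r["pct_missing"], reverse=True)

toy = [
    {"x": 1, "y": 2},
    {"x": None, "y": 4},
    {"x": 3, "y": None},
    {"x": None, "y": 8},
    {"x": 5, "y": 10},
]
summary_table = variable_summary(toy, ["x", "y"])
for row in summary_table:
    print(row)
```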
Next, the Missing Value Patterns chart is displayed (it has been greatly
enlarged to enhance interpretability). Each pattern (row) reflects a group of
cases with the same pattern of missing values; in other words, the cases are
grouped according to which variables have missing values. The variables along
the bottom (x-axis) are ordered by the amount of missing values each contains.
Referring to the Variable Summary table above, neuroticism has the lowest
percentage of missing values (3.9%) and is therefore listed first (on the
left), while extroversion, which has the largest percentage of missing values
(5.4%), is listed last (on the right). The first pattern is always the one
which contains no missing values. The second pattern reflects only cases with
missing values on the neuroticism variable. The chart allows one to assess
monotonicity (i.e. missingness that consistently increases or decreases across
the sequence of variables). Essentially, if all the missing cells are
contiguous and all the non-missing cells are contiguous, then monotonicity is
present. Because we have clumps or islands of missing and non-missing cells,
we can conclude that this data's missingness does not display monotonicity and
therefore the monotone method of imputation is not appropriate.
Next, the Pattern Frequencies graph is displayed. This graph shows that the
first pattern (one in which no missing values are present) is the most
prevalent. The other patterns are much less prevalent, but roughly equally so.
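The pattern chart and the pattern frequencies can be mimicked with a few lines of code, which may help clarify what monotonicity means. This is only a sketch of the idea on invented data; it is not how SPSS computes its charts.

```python
from collections import Counter

def missing_patterns(rows, columns):
    """Tally missing-value patterns, with variables ordered from fewest to most missing.

    Each pattern is a tuple of booleans, one per variable (True = missing)."""
    n_missing = {c: sum(row[c] is None for row in rows) for c in columns}
    ordered = sorted(columns, key=lambda c: n_missing[c])
    counts = Counter(tuple(row[c] is None for c in ordered) for row in rows)
    return ordered, counts

def is_monotone(patterns):
    """True if, within every pattern, a missing variable is followed only by missing ones."""
    for pattern in patterns:
        seen_missing = False
        for missing in pattern:
            if seen_missing and not missing:
                return False  # an "island": observed value after a missing one
            seen_missing = seen_missing or missing
    return True

# Monotone toy data: whenever b is missing, c is missing too.
rows_monotone = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": None, "c": None},
    {"a": 1, "b": 2, "c": 3},
]
# Non-monotone toy data: the missing cells are scattered.
rows_islands = [
    {"a": None, "b": 2, "c": 3},
    {"a": 1, "b": None, "c": 3},
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": 2, "c": 3},
]

order_m, counts_m = missing_patterns(rows_monotone, ["a", "b", "c"])
order_i, counts_i = missing_patterns(rows_islands, ["a", "b", "c"])
print(counts_m.most_common())  # pattern frequencies, most prevalent first
print(is_monotone(counts_m), is_monotone(counts_i))  # True False
```

In the monotone toy data the complete pattern is the most frequent one, just as in the Pattern Frequencies graph described above.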
2.) Impute Missing Data Values
Since we would like to be able to reproduce the results (output below) and
multiple imputation relies on random draws (i.e. you can get slightly different
results each time it is done), we must first set the random seed. To do this,
click on Transform, then Random Number Generators...
Next, select Set Active Generator, then Mersenne Twister, then select Set
Starting Point and Fixed Value. Then, click the OK button.
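The effect of fixing the starting point can be demonstrated outside of SPSS. Python's random module also uses the Mersenne Twister generator, so the sketch below shows the same principle: the same seed gives the same sequence of draws. The seed values are arbitrary illustrations, and the numbers produced will not match those SPSS generates.

```python
import random

def draw_values(seed, n=5):
    """Draw n standard normal values from a Mersenne Twister generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

first_run = draw_values(seed=2000000)
second_run = draw_values(seed=2000000)
other_seed = draw_values(seed=12345)

print(first_run == second_run)  # True: same seed, same draws
print(first_run == other_seed)  # a different seed gives different draws
```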
Now we can conduct the multiple imputation. Begin by clicking on Analyze,
Multiple Imputation, then Impute Missing Data Values...
Next, select all the variables (excluding the case identification variable) and
move them to the Variables in Model box. Then, click the Create a new dataset
circle and type in a name for the imputed data set which will be created. This
data set will contain imputed values in place of the missing values. It is
important to note that the operation actually performs 5 imputation runs (the
default, which is fine); that is, five imputations are performed in sequence.
During each imputation the missing values are imputed, so the result is five
completed versions of the data rather than one. This is why the procedure is
called multiple imputation: each missing value is represented by several
plausible values, and the variation among them reflects the uncertainty about
what the missing value really was. The imputed values themselves are not
averaged; instead, the analysis of interest is run on each completed data set
and the resulting estimates are combined (pooled). At the risk of beating a
dead horse, imagine that the fourth case is missing the extroversion score.
That score will be imputed 5 times and stored in 5 imputed data sets, then the
primary analysis of the study will be run on each of those data sets and the 5
sets of results will be pooled. Multiple imputation is a strategy or process,
and there are many methods of going about it. Implementation of the EM
algorithm (often referred to as maximum likelihood imputation) is one, but it
is not the only method (monotone is also available in the Missing Values module
of SPSS, while there are many, many more methods available in R).
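In the multiple imputation literature, results from the separately analyzed imputed data sets are usually combined with what are known as Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and its variance adds the average within-imputation variance to the between-imputation variance. The sketch below illustrates those rules with invented estimates and standard errors; it is not output from the data used here, and it is not a description of SPSS internals.

```python
import math
import statistics

def pool_estimates(estimates, standard_errors):
    """Combine per-imputation estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = statistics.variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical results of the same analysis run on 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 9.0, 13.0]
standard_errors = [2.0, 2.0, 2.0, 2.0, 2.0]

pooled_estimate, pooled_se = pool_estimates(estimates, standard_errors)
print(pooled_estimate, round(pooled_se, 4))
```

Notice that the pooled standard error is larger than any single-imputation standard error; that extra width is how the uncertainty due to the missing values enters the final result.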
Next, click on the Method tab. There are only two methods available: the Markov
Chain Monte Carlo (MCMC) method and Monotone. The Automatic option scans the
data for monotonicity and, if it is found, uses the Monotone method; otherwise,
it defaults to the MCMC method (with 10 iterations and a standard regression
model type). Next, raise the number of iterations from 10 (default) to 100 to
increase the likelihood of attaining convergence (when the MCMC chain reaches
stability, meaning the estimates are no longer fluctuating more than some
arbitrarily small amount). Next, choose the Model type, keeping standard Linear
Regression (default). The alternative is Predictive Mean Matching (PMM). PMM
still uses regression, but the resulting predicted values are adjusted to match
the closest (actual, existing) value in the data (i.e. the nearest non-missing
value to the predicted value). If so desired, all possible two-way interactions
among categorical predictors can be included in the (regression) model. Next,
click the Constraints tab.
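The difference between the two model types is easier to see in a small example. The sketch below fits a simple regression to the observed cases and then fills in the missing values twice: once with the plain predicted value, and once with PMM, where the case with the closest predicted value donates its actual observed value. This is one common, deliberately simplified formulation (one predictor, no random draws); the exact procedure SPSS uses is not reproduced here, and the data are invented.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_regression_and_pmm(x, y):
    """Return two filled-in copies of y: regression predictions and PMM values."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([xi for xi, _ in observed], [yi for _, yi in observed])
    # Each donor is (predicted value, actual observed value).
    donors = [(intercept + slope * xi, yi) for xi, yi in observed]
    by_regression, by_pmm = [], []
    for xi, yi in zip(x, y):
        if yi is not None:
            by_regression.append(yi)
            by_pmm.append(yi)
            continue
        predicted = intercept + slope * xi
        nearest = min(donors, key=lambda d: abs(d[0] - predicted))
        by_regression.append(predicted)  # may be a value never seen in the data
        by_pmm.append(nearest[1])        # always an actual observed value
    return by_regression, by_pmm

x = [1.0, 2.0, 3.5, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, None]
by_regression, by_pmm = impute_regression_and_pmm(x, y)
print(by_regression)
print(by_pmm)
```

Because PMM only ever returns values that already occur in the data, it cannot produce impossible values, which is the main practical reason to prefer it for bounded or skewed variables.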
The Constraints tab allows the user to specify a variety of options for the
process. First, click the Scan Data button, which will scan the data to
fill in the Variable Summary table. This information will be used in the
Define Constraints table. Obviously, some variables have a limited range of
valid values (e.g. a minimum of zero for drinks_week and absences), and so we
would want to constrain the imputed values to those minimum and maximum valid
values. Categorical variables are automatically dummy coded, so Min, Max, and
Rounding are not necessary for them. Note that Min, Max, and Rounding can only
be specified when Linear Regression is specified as the Variable Model. By
default all variables will be considered as both predictor and outcome (imputed)
during the analysis. However, if for some reason you wanted to set one or more
variables as only a predictor or only an outcome, then the Role column can be
used for that purpose. Below the Define Constraints table, one can also specify
that variables be excluded from the analysis if they exceed a specified maximum
percentage of missing values. Lastly, if Min, Max, and/or Rounding are
specified, then the analysis will continue drawing values until those
constraints are satisfied or until the Maximum case draws and/or Maximum
parameter draws (both of which can be changed by the user) have been reached,
at which point an error occurs (i.e. the process is not completed). Next, click
on the Output tab.
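The draw-until-valid behaviour described above can be pictured as a simple loop. The sketch below is an illustration of the idea only: the normal distribution, the parameter values, and the limit of 50 draws are assumptions made for the example, not a reproduction of the SPSS algorithm.

```python
import random

def draw_within_bounds(rng, mean, sd, minimum, maximum, rounding=None, max_draws=50):
    """Redraw a value until it satisfies the constraints or the draw limit is reached."""
    for _ in range(max_draws):
        value = rng.gauss(mean, sd)
        if rounding is not None:
            value = round(value / rounding) * rounding  # e.g. rounding=1 gives whole numbers
        if minimum <= value <= maximum:
            return value
    raise RuntimeError("no draw satisfied the constraints within the allowed number of draws")

rng = random.Random(0)

# A hypothetical count-like variable: cannot be negative, stored as whole numbers.
drinks = draw_within_bounds(rng, mean=2.0, sd=3.0, minimum=0, maximum=40, rounding=1)
print(drinks)

# Constraints that the model can essentially never satisfy end in an error.
try:
    draw_within_bounds(rng, mean=0.0, sd=1.0, minimum=100, maximum=200)
except RuntimeError as error:
    print("failed:", error)
```

The second call shows why overly tight constraints can stop the procedure: if the imputation model rarely produces values inside the allowed range, the draw limit is exhausted first.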
Here, we can specify what is to be displayed in the output window. Select
Descriptive statistics for variables with imputed values and Create iteration
history; naming the dataset to be created "IterHist". The iteration history is
often useful for diagnosing convergence failures or errors (as mentioned above
with the maximum case and parameter draws).
Finally, we can click the OK button. [Note: depending on the number of
iterations and the maximum case and parameter draw specifications, the
processing time can be lengthy.]
The iteration numbers are listed in the lower right corner of the output window
as they occur. Keep in mind, we have specified (above) that 100 iterations
should be run for each of the 5 imputations.
When the process finishes, you should have two new data files (NewImputedData.sav
& IterHist.sav) and quite a lot of new output. First, take a look at the imputed
dataset. You'll notice there are several subtle differences in the data editor
window when compared to the original data (DataMissing.sav). Three things which
stand out are the missing values (blank cells), the new variable on the far left
side of the data, called "Imputation_", and on the far right (pictured in the
second image below) is the little cube of white and yellow cells with a
drop-down menu (with "Original data" shown).
The Imputation_ variable simply labels each of the imputation sets. The first set
is the original data (1500 cases) with the missing values still present. Below
it, the value of the Imputation_ variable changes to "1" (1500 more cases),
which marks the first set of imputed data; that is, the first of the 5 sets we
requested. The other four sets are listed below it. You can also use the
drop-down menu at the extreme right (shown above) to move between the sets.
Notice that SPSS marks the cells which contain imputed values by highlighting
them (i.e. changing the background of the cells to yellow). The little cube
(shown above) which is next to the drop-down menu for moving between imputed
sets is colored white and yellow to identify that this data file is an imputed
data file.
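If the imputed data file is exported for use elsewhere, its stacked layout is easy to work with: every row carries its Imputation_ value, so the sets can be separated by grouping on that variable. The sketch below assumes the original data are coded 0 and the imputed sets 1, 2, and so on; the extro variable and all values are invented for illustration.

```python
import statistics
from collections import defaultdict

def split_by_imputation(stacked_rows):
    """Group the rows of a stacked imputed data file by their Imputation_ value."""
    groups = defaultdict(list)
    for row in stacked_rows:
        groups[row["Imputation_"]].append(row)
    return dict(groups)

# Toy stacked file: 2 cases, the original data (0) plus 2 imputed sets.
stacked = [
    {"Imputation_": 0, "id": 1, "extro": 3.0},
    {"Imputation_": 0, "id": 2, "extro": None},  # still missing in the original data
    {"Imputation_": 1, "id": 1, "extro": 3.0},
    {"Imputation_": 1, "id": 2, "extro": 4.0},   # first imputed value
    {"Imputation_": 2, "id": 1, "extro": 3.0},
    {"Imputation_": 2, "id": 2, "extro": 5.0},   # second imputed value
]

sets = split_by_imputation(stacked)
original = sets.pop(0)  # keep the original data apart from the imputed sets

# Analyze each imputed set separately, then pool the estimates.
means = {k: statistics.mean([r["extro"] for r in rows]) for k, rows in sets.items()}
pooled_mean = statistics.mean(list(means.values()))
print(means, pooled_mean)
```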
Another very important thing to notice about this imputed data file, which is
not readily apparent, is the fact that the Data Editor is aware that this
file is an imputed data file. As such, if you click on Analyze, then some
analysis (e.g. Compare Means), you'll notice that many of the analyses are
compatible with imputed data. In other words, each analysis or function which
shows an icon (shown below), which looks like the cube with a concentric swirl
next to it, will automatically be run on the aggregated imputed data ("pooled"
estimates).
As a quick example of pooled output, if we run a simple independent samples t-test
comparing males and females on number grade, we get the following output,
which displays results for (1) the original data (with missing), (2) each
imputation set separately, and (3) the 'pooled' estimates. [Note: only the Group
Statistics table is shown; the second table (t-test results table) is not
shown.]
The second data set produced by the imputation process is the iteration history
data set (IterHist.sav). This file simply lists the mean and standard deviation
for each interval/ratio scaled variable by iteration and imputation. By plotting
the mean and standard deviation of a particular variable across iterations and
imputations, one can assess the patterns of the imputed values. These plots
should show a fairly random pattern (i.e. no discernable pattern).
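Alongside the plots, a rough numerical check for a trend is to fit a straight line to a statistic across iterations: a slope near zero is consistent with the absence of a systematic drift. The sketch below uses two invented series of means; the layout of the actual IterHist.sav file is not assumed here, and a small slope is only a crude complement to looking at the plots, not proof of convergence.

```python
def trend_slope(iterations, values):
    """Least-squares slope of a statistic (e.g. a variable's mean) across iterations."""
    n = len(iterations)
    mean_i = sum(iterations) / n
    mean_v = sum(values) / n
    sxx = sum((i - mean_i) ** 2 for i in iterations)
    sxy = sum((i - mean_i) * (v - mean_v) for i, v in zip(iterations, values))
    return sxy / sxx

iterations = [1, 2, 3, 4, 5, 6]
stable_means = [50.2, 49.8, 50.1, 49.9, 50.0, 50.0]    # jitters around one level
drifting_means = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]  # still climbing: a warning sign

stable_slope = trend_slope(iterations, stable_means)
drifting_slope = trend_slope(iterations, drifting_means)
print(round(stable_slope, 4), round(drifting_slope, 4))
```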
Next, we can review the output created by the imputation process.
The first table, Imputation Specifications, simply lists what was specified for
the process. The second table, Imputation Constraints, again simply lists what
was specified on the Constraints tab prior to running the process.
The next two tables, Imputation Results and Imputation Models, simply display
what occurred during the imputation procedure. [Note: The majority of the second
table (Imputation Models) is not displayed.]
The next section of output contains one table for each variable, displaying
descriptive statistics for that variable in the original data (with missing)
and at each imputation, both for the imputed values alone and for all the
values of the variable (i.e. all cases after imputation). [Note: Only the
tables for the sex and age variables are shown.]
And that's it. As mentioned above, many analyses in SPSS (here version 19) are
able to work with imputed data sets and offer pooled output (i.e. showing the
results of the analysis for the original data, for each imputation set, and
pooled across the imputation sets, as was done above with the quick t-test
example). For a complete list of the analyses capable of utilizing imputed
data, refer to the module's manual (specifically pages 29 - 31).
IBM (2010). IBM SPSS Missing Values 19: User's Guide. Available at:
Little, R., & Rubin, D. (1987). Statistical analysis with missing data.
New York: Wiley.
Von Hippel, P. (2004). Biases in SPSS 12.0 Missing Value Analysis. The
American Statistician, 58, 160 - 164.