Research and Statistical Support

MODULE 6

Multiple Imputation

Task: Conduct Multiple Imputation for missing values using a version of the Expectation-Maximization (EM) algorithm. The user manual for the Missing Values module can be found at the SPSS Manuals page. For a more detailed treatment of the more general topic of missing value analysis, see Little and Rubin (1987).

The SPSS Missing Values module is implemented in two distinct ways. First, the Missing Values Analysis (MVA) menu option produces a series of tables and figures which describe the pattern of missingness, provides estimates of basic descriptive statistics (means, standard deviations, correlations, covariances) based on a user-specified method (e.g. EM), and imputes values based on the specified method. It is important to note that this approach is single (or simple) imputation rather than multiple imputation, which is widely accepted as superior. To read about why this simple imputation is not a good idea, see Von Hippel (2004). The second (newer) way the module is implemented is through the Multiple Imputation menu option, which itself contains two menu options (i.e. functions): Analyze Patterns and Impute Missing Data Values. These two options will be covered below.

Start off by importing the DataMissing.sav file into the Data Editor window of SPSS. The data is simulated (i.e. fictitious) and was generated originally with all the values intact. Subsequently, a function was used to randomly remove approximately 5% of the values (using the statistical programming environment R). The data contains 18 variables: one case identification variable, five categorical variables (nominal and ordinal scaled), and twelve continuous or nearly continuous variables (considered interval or ratio scaled). The data contains 1500 cases.

1.) Evaluation of Missing Values

First, click on "Analyze", then "Multiple Imputation", then "Analyze Patterns..." in the toolbar at the top of SPSS.

Next, select all the variables (excluding the case identification variable) and move them to the Analyze Across Variables: box. Only 17 variables are included in the analysis, so the default maximum number of variables displayed (25) is large enough to show all of them. Further control over what is displayed in the output can be exercised by changing the minimum percentage missing cutoff value. The default (10%) indicates that only variables with 10% or more missing values will be displayed in the output. Since it is generally good practice to review all the patterns of missingness, we will change the percentage cutoff to 0.01%, thus ensuring all variables are included in the output (right figure below). Then, we can click the OK button to proceed.

The following output was produced.

First, a figure with three pie charts displays the number and percentage of variables (left), cases (center), and individual values (right) that are affected by missing data. Note that green indicates missing; for instance, the Variables (left) pie indicates that 17 variables (100% of those included in the analysis) have at least one missing value. The Cases (middle) pie indicates that 860 (57.33%) of the 1500 cases contain at least one missing value. The Values (right) pie indicates that approximately 5% of all values are missing (i.e. 17 variables multiplied by 1500 cases equals 25500 values, so the 1246 missing values amount to 4.886%).
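The three summaries in that figure are simple counts, and the arithmetic can be checked by hand. The following is a minimal Python sketch, not SPSS's own procedure: the function name missing_summary and the tiny toy data set are invented for illustration, and None stands in for a missing cell.

```python
# Summarize missingness three ways, mirroring the three pie charts.
# The toy data below is invented for illustration; None marks a missing value.
def missing_summary(rows):
    n_cases = len(rows)
    n_vars = len(rows[0])
    # Variables with at least one missing value.
    vars_with_missing = sum(
        any(row[j] is None for row in rows) for j in range(n_vars)
    )
    # Cases with at least one missing value.
    cases_with_missing = sum(any(v is None for v in row) for row in rows)
    # Individual missing cells.
    missing_values = sum(v is None for row in rows for v in row)
    total_values = n_cases * n_vars
    return {
        "variables": (vars_with_missing, 100.0 * vars_with_missing / n_vars),
        "cases": (cases_with_missing, 100.0 * cases_with_missing / n_cases),
        "values": (missing_values, 100.0 * missing_values / total_values),
    }

toy = [
    [1.0, 2.0, None],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
    [None, None, 3.0],
]
summary = missing_summary(toy)
print(summary)

# The arithmetic quoted in the text: 1246 missing cells out of 17 * 1500.
total_cells = 17 * 1500
pct_missing = 100.0 * 1246 / total_cells
print(total_cells, round(pct_missing, 3))
```

Each entry of the returned dictionary is a (count, percentage) pair, matching the way the pie charts report both figures.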

Next, the Variable Summary chart displays, as specified, the variables which contain at least 0.01% missing values. The number of missing values, percentage missing, number of valid values, mean based on valid values, and standard deviation based on valid values are displayed for each of the 17 variables. Notice that the variables are ordered by the number of values they are missing (i.e. the percentage missing). Extroversion is listed first because it has the highest percentage of missing values (5.5%). Neuroticism is listed last because it has the lowest percentage of missing values (3.9%).

Next, the Missing Value Patterns chart is displayed (it has been greatly enlarged to enhance interpretability). Each pattern (row) reflects a group of cases with the same pattern of missing values; in other words, the patterns or groups of cases are displayed based on where the missing values are located (i.e. on each variable). The variables along the bottom (x-axis) are ordered by the number of missing values each contains. Consider the table above: neuroticism has the lowest percentage of missing values (3.9%) and is therefore listed first (on the left), while extroversion, which has the largest percentage of missing values, is listed last (on the right). The first pattern is always the one which contains no missing values. The second pattern reflects only cases with missing values on the neuroticism variable. The chart allows one to assess monotonicity (i.e. a steady decrease or increase across a sequence). Essentially, if all the missing cells are contiguous and all the non-missing cells are contiguous, then monotonicity is present. Because we have clumps or islands of missing and non-missing cells, we can conclude that this data's missingness does not display monotonicity and, therefore, the monotone method of imputation is not justified.
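The visual check for monotonicity can also be expressed as a small test. The sketch below is illustrative Python, not the routine SPSS uses; the function name is_monotone and both toy data sets are assumptions. It orders the variables from fewest to most missing values (as the chart does) and then verifies that, within every case, no observed value appears to the right of a missing one.

```python
def is_monotone(rows):
    """Return True if the missingness pattern is monotone.

    rows is a list of equal-length lists; None marks a missing value.
    """
    n_vars = len(rows[0])
    # Order variables from fewest to most missing values, as in the chart.
    missing_counts = [sum(row[j] is None for row in rows) for j in range(n_vars)]
    order = sorted(range(n_vars), key=lambda j: missing_counts[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one creates an "island".
                return False
    return True

# Hypothetical data: a clean staircase of missing cells.
monotone_example = [
    [1, 2, 3],
    [1, 2, None],
    [1, None, None],
]
# Hypothetical data: scattered missing cells, like the data in this module.
scattered_example = [
    [1, None, 3],
    [None, 2, 3],
    [1, 2, None],
]
print(is_monotone(monotone_example), is_monotone(scattered_example))
```

Data with values removed at random, as in this module, will almost never pass such a check.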

Next, the Pattern Frequencies graph is displayed. This graph shows that the first pattern (one in which no missing values are present) is the most prevalent. The other patterns are much less prevalent, but roughly equally so.

2.) Impute Missing Data Values

Since we would like to be able to reproduce the results (output below), and multiple imputation relies on random draws (i.e. you can get slightly different results each time it is done), we must first set the random seed. To do this, click on Transform, then Random Number Generators...

Next, select Set Active Generator, then Mersenne Twister, then select Set Starting Point and Fixed Value. Then, click the OK button.
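The effect of fixing the starting point can be illustrated outside SPSS. Python's standard random module also happens to use a Mersenne Twister generator, so the sketch below shows the general idea; the seed values and the helper name draw_values are arbitrary assumptions, and the numbers produced will not match those generated by SPSS.

```python
import random

def draw_values(seed, n=5):
    # Python's built-in generator is a Mersenne Twister.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Arbitrary seed values chosen for illustration only.
first_run = draw_values(seed=2000000)
second_run = draw_values(seed=2000000)
other_seed = draw_values(seed=12345)

print(first_run == second_run)  # same seed, same sequence of draws
print(first_run == other_seed)  # different seed, different draws
```

The same principle applies to the imputation: with a fixed starting value, rerunning the procedure with identical settings should give identical imputed values.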

Now we can conduct the multiple imputation. Begin by clicking on Analyze, Multiple Imputation, then Impute Missing Data Values...

Next, select all the variables (excluding the case identification variable) and move them to the Variables in Model box. Then, click the Create a new dataset circle and type in a name for the imputed data set which will be created. This data set will contain imputed values in place of the missing values. It is important to note that the operation actually performs 5 imputations (the default, which is fine), one after another. Each imputation fills in the missing values with a different set of plausible values, and all 5 completed data sets are kept. The imputed values themselves are not averaged into a single value; instead, the primary analysis of the study is run on each completed data set and the resulting estimates are combined ("pooled"), so that the uncertainty about the missing values is carried into the final results. This is why the procedure is called multiple imputation. For instance, imagine that the fourth case is missing the extroversion score. That score will be imputed 5 times, once in each of the 5 data sets, and the variation among those 5 values is reflected in the pooled estimates. Multiple imputation is a strategy or process, and there are many methods of carrying it out. Implementation of the EM algorithm (often referred to as maximum likelihood imputation) is one such method, but it is not the only one (monotone is also available in the Missing Values module of SPSS, while there are many, many more methods available in R).

Next, click on the Method tab. There are only two methods available: the Markov Chain Monte Carlo (MCMC) method and Monotone. The automatic function scans the data for monotonicity and, if it is found, uses the Monotone method; otherwise, it defaults to the MCMC method (with 10 iterations and a standard regression model type). Next, increase the iterations from 10 (default) to 100 to improve the likelihood of attaining convergence (i.e. the point at which the MCMC chain reaches stability, meaning the estimates no longer fluctuate by more than some arbitrarily small amount). Next, choose the Model type, keeping standard Linear Regression (default). The alternative is Predictive Mean Matching (PMM). PMM still uses regression, but the resulting predicted values are adjusted to match the closest (actual, existing) value in the data (i.e. the nearest non-missing value to the predicted value). If so desired, all possible two-way interactions among categorical predictors can be included in the (regression) model. Next, click the Constraints tab.
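To make the PMM description concrete, here is a deliberately simplified Python sketch of the idea as described above: fit a regression on the complete cases, predict the missing value, then substitute the observed value closest to that prediction. The function names (fit_line, pmm_impute) and the toy data are assumptions. This is not the SPSS implementation, which also involves random draws so that each imputation differs.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(xs, ys):
    """Fill None entries of ys with the observed y nearest to the prediction."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line(
        [x for x, _ in observed], [y for _, y in observed]
    )
    observed_ys = [y for _, y in observed]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        predicted = intercept + slope * x
        # Replace the prediction with the closest actual, existing value.
        nearest = min(observed_ys, key=lambda obs: abs(obs - predicted))
        filled.append(nearest)
    return filled

# Hypothetical predictor and outcome; the third outcome value is missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, None, 8.2, 9.9]
imputed = pmm_impute(xs, ys)
print(imputed)
```

Because the substituted value is always one that actually occurs in the data, PMM cannot produce impossible values (e.g. a negative count), which plain linear regression can.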

The Constraints tab allows the user to specify a variety of options for the process. First, click the Scan Data button, which will scan the data to fill in the Variable Summary table. This information will be used in the Define Constraints table. Obviously, some variables have a restricted range of valid values (e.g. a minimum of zero for drinks_week and absences), so we would want to constrain the imputed values to lie between those minimum and maximum valid values. Categorical variables are automatically dummy coded, so Min, Max, and Rounding are not necessary for them. Note, however, that Min, Max, and Rounding can only be specified when Linear Regression is specified as the Variable Model. By default, all variables will be considered as both predictor and outcome (imputed) during the analysis. However, if for some reason you wanted to set one or more variables as only a predictor or only an outcome, then the Role column can be used for that purpose. Below the Define Constraints table, one can also specify that variables be excluded from the analysis if they exceed a specified maximum percentage of missing values. Lastly, if Min, Max, and/or Rounding are specified, then the analysis will continue drawing values until those constraints are satisfied or until the Maximum case draws and/or Maximum parameter draws have been reached (limits which can be changed by the user), at which point an error occurs (i.e. the process is not completed). Next, click on the Output tab.
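The redraw-until-valid behaviour described above can be sketched in a few lines of Python. This is an illustration of the general idea only: the function name constrained_draw, the normal distribution, the limit of 50 draws, and the numbers in the example are all assumptions, not SPSS internals.

```python
import random

def constrained_draw(rng, mean, sd, minimum, maximum, rounding=1.0, max_draws=50):
    """Draw a value, redrawing until it falls within [minimum, maximum].

    Raises RuntimeError if no acceptable value appears within max_draws,
    mimicking the way the procedure stops with an error.
    """
    for _ in range(max_draws):
        value = rng.gauss(mean, sd)
        # Round to the nearest multiple of `rounding` (1.0 gives whole numbers).
        value = round(value / rounding) * rounding
        if minimum <= value <= maximum:
            return value
    raise RuntimeError("no draw satisfied the constraints within max_draws")

rng = random.Random(1)
# Hypothetical settings for a count variable such as drinks per week.
drinks = constrained_draw(rng, mean=2.0, sd=3.0, minimum=0, maximum=30)
print(drinks)

# Constraints far from the distribution exhaust the allowed draws.
try:
    constrained_draw(rng, mean=0.0, sd=1.0, minimum=100, maximum=200)
    failed = False
except RuntimeError:
    failed = True
print(failed)
```

The second call shows why overly tight constraints can stop the procedure: if the model rarely produces values inside the allowed range, the draw limit is reached first.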

Here, we can specify what is to be displayed in the output window. Select Descriptive statistics for variables with imputed values and Create iteration history; naming the dataset to be created "IterHist". The iteration history is often useful for diagnosing convergence failures or errors (as mentioned above with the maximum case and parameter draws).

Finally, we can click the OK button. [Note, depending on number of iterations and maximum case and parameter draw specifications, the processing time can be quite long.]

The iteration numbers are listed in the lower right corner of the output window as they occur. Keep in mind, we have specified (above) that 100 iterations should be run for each of the 5 imputations.

When the process finishes, you should have two new data files (NewImputedData.sav & IterHist.sav) and quite a lot of new output. First, take a look at the imputed dataset. You'll notice there are several subtle differences in the data editor window when compared to the original data (DataMissing.sav). Three things which stand out are the missing values (blank cells), the new variable on the far left side of the data, called "Imputation_", and on the far right (pictured in the second image below) is the little cube of white and yellow cells with a drop-down menu (with "Original data" shown).

The Imputation_ variable simply labels each of the imputation sets. The first set is the original data (1500 cases) with the missing values still present. Below it, you'll notice the value of the Imputation_ variable changes to the number "1" (1500 more cases), which marks the first set of imputed values (i.e. the first of the 5 sets we requested); the other four sets are listed below the original and first sets. You can also use the drop-down menu at the extreme right (shown above) to move between the sets. Notice that SPSS marks the cells which contain imputed values by highlighting them (i.e. changing the background of the cells to yellow). The little cube (shown above) next to the drop-down menu for moving between imputed sets is colored white and yellow to identify this data file as an imputed data file.
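The stacked layout described above (the original data followed by each imputed set, all labeled by Imputation_) is easy to take apart if the file is ever exported for use elsewhere. The Python sketch below is illustrative only: the rows, the variable names id and extro, and the assumption that the original data carry the label 0 are not taken from the actual file.

```python
from collections import defaultdict

def split_by_imputation(rows, key="Imputation_"):
    """Group stacked rows into one list per imputation label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Hypothetical stacked rows: label 0 is assumed to mark the original data.
stacked = [
    {"Imputation_": 0, "id": 1, "extro": None},
    {"Imputation_": 0, "id": 2, "extro": 3.5},
    {"Imputation_": 1, "id": 1, "extro": 3.1},
    {"Imputation_": 1, "id": 2, "extro": 3.5},
    {"Imputation_": 2, "id": 1, "extro": 2.8},
    {"Imputation_": 2, "id": 2, "extro": 3.5},
]
groups = split_by_imputation(stacked)
original = groups[0]
imputed_sets = {label: rows for label, rows in groups.items() if label != 0}
print(sorted(imputed_sets))
```

Note that the observed value (case 2) is identical in every set, while the missing value (case 1) receives a different imputed value in each.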

Another very important thing to notice about this imputed data file, which is not readily apparent, is that the Data Editor is aware that this file is an imputed data file. As such, if you click on Analyze, then some analysis (e.g. Compare Means), you'll notice that many of the analyses are compatible with imputed data. In other words, each analysis or function which shows an icon (shown below) that looks like the cube with a concentric swirl next to it will automatically be run on each imputation set and will report combined ("pooled") estimates.

As a quick example of pooled output, if we run a simple independent samples t-test comparing males and females on number grade, we get the following output -- which displays results for (1) the original data (with missing), (2) each imputation set separately, and (3) the 'pooled' estimates. [Note: only the Group Statistics table is shown, the second table (t-test results table) is not shown.]
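Pooled estimates of this kind are commonly obtained with Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and its variance combines the average within-imputation variance with the between-imputation variance. The Python sketch below illustrates those rules under that assumption; the function name pool_rubin and the five means and standard errors are invented and are not the values from the output above.

```python
import math

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = sum(estimates) / m
    # Average sampling variance within each imputed data set.
    within = sum(variances) / m
    # Variance of the estimates between imputed data sets.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical group means and standard errors from 5 imputed data sets.
estimates = [78.2, 78.6, 78.4, 78.1, 78.7]
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]
pooled_mean, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(round(pooled_mean, 2), round(pooled_se, 4))
```

The pooled standard error is larger than any of the individual standard errors here because it also accounts for the disagreement among the imputations, which is how the uncertainty due to missing data enters the final result.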

The second data set produced by the imputation process is the iteration history data set (IterHist.sav). This file simply lists the mean and standard deviation for each interval/ratio scaled variable by iteration and imputation. By plotting the mean and standard deviation of a particular variable across iterations and imputations, one can assess the patterns of the imputed values. These plots should show a fairly random pattern (i.e. no discernible trend).

Next, we can review the output created by the imputation process.

The first table, Imputation Specifications, simply lists what was specified for the process. The second table, Imputation Constraints, again simply lists what was specified on the Constraints tab prior to running the process.

The next two tables, Imputation Results and Imputation Models, simply display what occurred during the imputation procedure. [Note: The majority of the second table (Imputation Models) is not displayed.]

The next section of output contains one table for each variable, displaying descriptive statistics for that variable in the original data (with missing) and at each imputation, both for the imputed values alone and for all the values of the variable (i.e. all cases after imputation). [Note: Only the tables for the sex and age variables are shown.]

And that's it. As mentioned above, many analyses in SPSS (here version 19) are able to consider imputed data sets and offer pooled output (i.e. showing the results of the analysis for each imputation set, as was done above with the quick t-test example). For a complete list of the analyses capable of utilizing imputed data, refer to the module's manual (specifically pages 29 - 31).

IBM (2010). IBM SPSS Missing Values 19: User's Guide. Available at: http://www.unt.edu/rss/class/Jon/SPSS_SC/Manuals/SPSS_Manuals.htm

Little, R., & Rubin, D. (1987). Statistical analysis with missing data. New York: Wiley.

Von Hippel, P. (2004). Biases in SPSS 12.0 Missing Value Analysis. The American Statistician, 58, 160 - 164.

 Contact Information Jon Starkweather, PhD 940-565-4066 Richard Herrington, PhD Richard.Herrington@unt.edu 940-565-2140

Last updated: 09/27/12 by Jon Starkweather.