Research and Statistical Support

MODULE 9

Categorical Regression (CATREG)

The SPSS CATREG function incorporates optimal scaling and can be used when the predictor(s) and outcome variables are any combination of numeric, ordinal, or nominal. Standard multiple regression can only accommodate an outcome variable which is continuous or nearly continuous (i.e. interval/ratio in scale), and it works best with continuous or nearly continuous predictor variables. Standard regression can, however, accommodate categorical predictors if they are first recoded using one of the following strategies: dummy coding, effects coding, orthogonal coding, or criterion coding. Binomial logistic regression is appropriate when the outcome is a dichotomous variable (i.e. categorical with only two categories). Multinomial logistic regression or discriminant function analysis is appropriate when the outcome variable is polytomous (i.e. categorical with more than two categories).
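To make two of the coding strategies named above concrete, here is a minimal sketch in plain Python that dummy-codes and effects-codes a three-category predictor. The variable `region` and its categories are hypothetical and are not part of the tutorial data; the helper names are ours, not SPSS functions.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per non-reference category."""
    levels = sorted(set(values) - {reference})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


def effects_code(values, reference):
    """Like dummy coding, but rows in the reference category are coded -1."""
    levels = sorted(set(values) - {reference})
    return {
        lvl: [-1 if v == reference else (1 if v == lvl else 0) for v in values]
        for lvl in levels
    }


# Hypothetical nominal predictor with three categories.
region = ["north", "south", "west", "south", "north"]

dummy = dummy_code(region, reference="north")
effects = effects_code(region, reference="north")

print(dummy)    # {'south': [0, 1, 0, 1, 0], 'west': [0, 0, 1, 0, 0]}
print(effects)  # {'south': [-1, 1, 0, 1, -1], 'west': [-1, 0, 1, 0, -1]}
```

A predictor with k categories yields k - 1 coded columns either way; the two schemes differ only in how the reference category is represented, which changes what the intercept and coefficients mean.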

It is recommended that one approach categorical regression as one would approach a data reduction process. That is, it is often necessary to conduct multiple runs of the analysis while slightly changing the options / parameters in an effort to discover the best results (i.e. the best fitting model and the most substantively meaningful interpretation of results).

For the duration of this tutorial we will be using the SPSScatreg.sav file, which contains 1 outcome variable (y) and 5 predictor variables (x1 - x5). The outcome variable was operationally defined as general happiness and was measured with a subjective rating scale. Each of the five predictors was an 8-point Likert scaled questionnaire item believed to measure preferences for various types of social interactions.

(1) Evaluate the variables.

Begin by running the Frequencies function to get an idea of how our variables are distributed. Click on Analyze, Descriptive Statistics, Frequencies...

Next, highlight / select all of the variables and use the arrow button to move them to the Variable(s): box. Then click on the Statistics... button and select the following. Then click Continue.

Next, click on the Charts... button, select Histograms:, and then check the box for Show normal curve on histogram. Then click the Continue button, then click the OK button.

The output should be similar to what is displayed below.

Here, we see that our outcome variable not only has a substantial range of 196 values, but it also has very low values for skewness (.026) and kurtosis (.065), which indicate a fairly normally distributed variable. In fact, we could treat this ordinal variable as numeric or nearly continuous and decide to run a standard multiple regression. However, we would need to use a coding strategy for the predictor variables if we did choose to run the standard multiple regression. Also, as this example shows, it is better to run a categorical regression on this data because of the opportunity to apply optimal scaling and because all the predictors appear to be, and are, nominal or ordinal.
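For readers who want to check skewness and kurtosis values by hand, the sketch below computes the sample-adjusted skewness and excess kurtosis in plain Python. We assume these are the bias-corrected formulas commonly reported by statistical packages; the function names and the small data vector are illustrative only and do not come from SPSScatreg.sav.

```python
import math


def _mean_sd(xs):
    """Return n, the mean, and the sample (n - 1) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n, mean, sd


def skewness(xs):
    """Sample-adjusted skewness; values near 0 indicate symmetry."""
    n, mean, sd = _mean_sd(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in xs)


def excess_kurtosis(xs):
    """Sample-adjusted excess kurtosis; values near 0 match a normal curve."""
    n, mean, sd = _mean_sd(xs)
    term = sum(((x - mean) / sd) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * term
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Illustrative symmetric data: skewness is 0, and the flat shape gives
# negative excess kurtosis.
scores = [1, 2, 3, 4, 5]
print(round(skewness(scores), 3))
print(round(excess_kurtosis(scores), 3))  # -1.2
```

Both statistics are centered on zero for a normal distribution, which is why the small values reported for y support treating it as nearly continuous.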

The frequency table for y has been truncated to save space, but it too shows how broadly distributed the values are on our outcome variable. The other frequency tables show the discrete nature of our predictor variables.

Again, the histograms (with superimposed normal curves) show how well our outcome variable (which is ordinal) displays the characteristics of an interval or ratio variable. Bar charts would be more appropriate for categorical variables (showing the discrete nature of the variables), but we can see that each of the predictors displays a narrow range, across values 1 - 8.

(2) Standard (multiple) Regression for comparison.

Running a standard multiple regression gives us a baseline model for comparison. Click on Analyze, Regression, Linear...

Next, highlight / select y and use the top arrow button to move it to the Dependent: box. Then, select all the predictors and move them to the Independent(s): box. Then, click the Statistics button.

Next, select Descriptives (this will produce a Pearson correlation matrix).

Next, click the Continue button, then click the OK button.

The output should be similar to what is displayed below.

The Descriptive Statistics table shows some of the same information provided by the Frequencies function above. The Correlations table gives us an idea of the relationships among the variables. It provides only an idea because a polychoric correlation matrix (rather than a Pearson correlation matrix) would be more appropriate given the nature of the variables. Pay particular attention to the relationships between each predictor and the outcome variable. Also, notice the lack of multicollinearity (i.e. low magnitude relationships between the predictors). The significance tests reported in this table are likely to be of little use because, with a sample this large (N = 1000), even very small correlations will be statistically significant.
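The point about significance and sample size can be checked with a short calculation. The sketch below computes a Pearson correlation and the t statistic used to test it against zero; the value r = .07 is a hypothetical small correlation chosen for illustration, not a value taken from the output.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic (df = n - 2) for testing whether r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# A hypothetical, substantively trivial correlation of .07:
print(round(t_for_r(0.07, 1000), 2))  # 2.22, beyond the two-tailed .05 cutoff of roughly 1.96
print(round(t_for_r(0.07, 100), 2))   # 0.69, nowhere near significance
```

The same correlation is "significant" at N = 1000 and not at N = 100, so with this data the magnitude of each correlation is more informative than its p-value.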

The Variables Entered/Removed table shows just that: all the predictors were entered into the model and none were removed. The Model Summary table shows the multiple correlation coefficient (R), the squared multiple correlation coefficient (R²), the adjusted squared multiple correlation coefficient (adj.R²), and the standard error of the estimate. According to our model summary, the collection of predictors accounts for 92.6% (adj.R² = .926) of the variance in our outcome variable.
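As a reminder of how the adjusted value relates to R², the sketch below applies the usual adjustment for the number of predictors. The R² of .90 used here is a hypothetical input for illustration; only N = 1000 and the 5 predictors come from this tutorial.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n cases and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# With N = 1000 and 5 predictors the adjustment is tiny...
print(round(adjusted_r2(0.90, n=1000, p=5), 4))  # 0.8995
# ...whereas the same R-squared from only 20 cases shrinks noticeably.
print(round(adjusted_r2(0.90, n=20, p=5), 4))    # 0.8643
```

With a sample as large as ours, R² and adj.R² will be nearly identical, so either can be reported without changing the interpretation.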

The ANOVA table simply tells us our R² is significantly different from zero. Pay particular attention to the magnitude and rank order of the standardized coefficients, or Beta (β) coefficients, for each of our predictors in the Coefficients table. Because the predictors are nearly uncorrelated with one another, each Beta should be close to the bivariate correlation between that predictor and the outcome, as listed above in the Correlations table.
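The link between Beta coefficients and bivariate correlations can be seen by solving the standardized regression equations directly: the Betas are the solution of R_xx β = r_xy, where R_xx holds the correlations among predictors and r_xy the predictor-outcome correlations. The sketch below does this for two predictors with hypothetical correlations (not values from the output), using a small hand-written linear solver.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical predictor-outcome correlations.
r_xy = [0.5, 0.3]

# Uncorrelated predictors: the Betas equal the bivariate correlations.
betas_uncorrelated = solve([[1.0, 0.0], [0.0, 1.0]], r_xy)

# Predictors correlated at .6: the second Beta drops to about zero
# even though its bivariate correlation with the outcome is .3.
betas_correlated = solve([[1.0, 0.6], [0.6, 1.0]], r_xy)

print([round(b, 3) for b in betas_uncorrelated])
print([round(b, 3) for b in betas_correlated])
```

This is why a large gap between a predictor's Beta and its bivariate correlation is a useful warning sign of multicollinearity.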

(3) First Categorical Regression Analysis.

Returning to the Data Window, click on Analyze, Regression, Optimal Scaling (CATREG)...

Next, select the outcome variable (y) and use the top arrow button to move it to the Dependent Variable: box. Then, click on the top Define Scale... button and select Ordinal. You can see here the different levels of scale / measurement available. Then, click the Continue button.

Next, select all 5 predictor variables and use the lower arrow button to move them to the Independent Variable(s): box. Then, click on the lower Define Scale... button and select Ordinal for all 5 predictors. Again, you can see here the different levels of scale / measurement available. Then, click the Continue button.
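Choosing the Ordinal level asks that the values assigned to a variable's categories keep the original category order. One standard way to impose such an order restriction is weighted monotone regression by pooling adjacent violators, sketched below. This is an illustration of the idea only: we are assuming it resembles one step of the iterative procedure, and it is not a description of the SPSS code. The category means and frequencies are hypothetical.

```python
def pava(values, weights):
    """Weighted least-squares non-decreasing fit (pool adjacent violators)."""
    blocks = []  # each block: [weighted mean, total weight, items pooled]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for v, _, c in blocks:
        fitted.extend([v] * c)
    return fitted


# Hypothetical unrestricted values for categories 1-4 of one predictor;
# categories 2 and 3 are out of order.
category_values = [1.0, 3.0, 2.0, 4.0]
category_counts = [1, 1, 1, 1]

print(pava(category_values, category_counts))  # [1.0, 2.5, 2.5, 4.0]
```

Categories that violate the order are tied at their weighted mean, which is why ordinal transformations often show several adjacent categories receiving the same value.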

Next, click on the Output... button and select Correlations of original variables and Correlations of transformed variables. Then click the Continue button. Notice, if you click on the Save... button, you have the ability to save predicted and / or residual values. Next, click the OK button. Keep in mind, the optimal scaling process is iterative and can take a minute or more.

The output should be similar to what is displayed below.

The first two tables are of little to no interest for interpretation; we now know who to thank for the inclusion of this analysis in SPSS, and we can see that there were no missing data.

The two correlation tables show the original relationships between the predictor variables (identical to what we saw in the original Correlations table above), and the correlations between our transformed (i.e. optimally scaled) predictor variables. Again, there is no danger of violating the regression assumption of no multicollinearity; meaning, our predictor variables are not substantially related.

The Model Summary table shows an unrealistic multiple correlation coefficient. Regression assumes correct model specification (i.e. all important variables in the model and no unimportant variables in the model). Given the simulated nature of this data, it is reporting perfect fit because all of the important variables are in the model, which never happens with 'real' data. Of course, the ANOVA table is showing that our R² value is significantly different from zero. The R script file used to generate this data can be found here.

The Coefficients table and the Correlations and Tolerance table display a rather curious pattern of relationships between each predictor and the outcome when compared to the correlations and Beta coefficients tables displayed in the standard multiple regression. Focus on the Beta coefficients, zero-order correlations, partial correlations, part correlations (semi-partial), and importance. In these two tables, the strongest predictor is x2; and x3, x4, x5 are not significant predictors of y. It seems as though the CATREG algorithm is confusing the importance of x1 & x2, inflating the importance of x2, and completely discounting the importance of x3, x4, & x5.
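To help read the Importance column, the sketch below computes Pratt's measure of relative importance, which, as far as we are aware, is what that column reports: each predictor's Beta multiplied by its zero-order correlation, divided by the sum of those products. The Betas and correlations used here are hypothetical and are not the values from the output.

```python
def pratt_importance(betas, zero_order):
    """Relative importance: beta * zero-order r, scaled to sum to 1."""
    products = [b * r for b, r in zip(betas, zero_order)]
    total = sum(products)  # equals R-squared in a standardized regression
    return [p / total for p in products]


# Hypothetical standardized coefficients and zero-order correlations.
betas = [0.6, 0.3, 0.1]
zero_order = [0.65, 0.35, 0.05]

importance = pratt_importance(betas, zero_order)
print([round(i, 2) for i in importance])  # [0.78, 0.21, 0.01]
```

Because the measure is a product, a predictor needs both a sizable Beta and a sizable zero-order correlation to be rated important, so distorted Betas carry straight through to the Importance values.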

In pursuit of a clearer and more realistic interpretation of this data and the relationships between the variables, we can run a second CATREG with the outcome variable specified as numeric.

(4) Second Categorical Regression Analysis.

Returning to the Data Window, click on Analyze, Regression, Optimal Scaling (CATREG)...

You'll notice the previous run of the analysis is still specified. Here, all we need to do is highlight / select y in the Dependent Variable: box, then click on the Define Scale... button (marked here with a red ellipse).

Next, change the scale from Ordinal to Numeric. Then click the Continue button, then click the OK button.

The output should be similar to what is displayed below.

The first two tables are identical to those from the previous run.

The Correlations Original Variables table is identical to the previous run. The correlations between our transformed (i.e. optimally scaled) predictor variables have changed because the iterative optimal scaling process has produced new transformed data. Again, there is no danger of violating the regression assumption of no multicollinearity.

The Model Summary table offers a more realistic estimate of the multiple correlation between all 5 predictors and our outcome variable. The ANOVA table again shows that our R² is significantly different from zero.

The Coefficients table (Beta coefficients) and the Correlations and Tolerance table (zero-order correlations, partial correlations, part (semi-partial) correlations, and importance) show values which more closely resemble the original relationships. This should highlight the importance of (1) knowing the operational definitions of variables, (2) conducting initial data analysis (IDA) by running frequencies and / or descriptive statistics functions, and (3) conducting multiple analyses while modifying the options / parameters to extract as much information as possible from the data. Here, the true relationships of the variables are reflected while running the appropriate analysis for the measurement scale of the variables.

In this example we knew the true relationships between the variables because we used simulation to generate the data. In a genuine research study, it is recommended that one conduct simulation studies in order to more easily recognize the patterns in the data and have confidence in the analysis being performed. However, even if simulation is not used prior to collecting the actual data, one should have at least some understanding of the underlying relationships between the variables of interest based on a thorough literature review (i.e. prior research and theory associated with the area of study).
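As a rough illustration of the kind of simulation recommended above, the sketch below generates five 8-point items and an outcome built from known weights plus noise, then checks that the bivariate correlations recover the intended ordering. This is not the R script used to create SPSScatreg.sav; the weights, noise level, and function names are all hypothetical choices made for this example.

```python
import math
import random


def simulate(n, weights, noise_sd, seed=0):
    """Return (y, xs) rows: xs are 1-8 item responses, y a rounded weighted sum."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        xs = [rng.randint(1, 8) for _ in weights]
        y = sum(w * x for w, x in zip(weights, xs)) + rng.gauss(0, noise_sd)
        rows.append((round(y), xs))
    return rows


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical 'true' model: x1 matters most and x5 least.
true_weights = [5, 4, 3, 2, 1]
rows = simulate(n=1000, weights=true_weights, noise_sd=2.0)

y = [row[0] for row in rows]
correlations = [
    pearson_r([row[1][j] for row in rows], y) for j in range(len(true_weights))
]
print([round(r, 2) for r in correlations])
```

Because the generating weights are known, any analysis that ranks the predictors very differently from this ordering (as the first CATREG run did with its own data) can be recognized as suspect before it is applied to real data.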
