RSS Matters: Robust Statistics in S-Plus
By Rich Herrington, Research and Statistical Support Services
This month we take a look at the RobLib library, which ships with the newest version of S-Plus (6.0). A version of the library for S-Plus 2000 can be downloaded from http://www.insightful.com/roblib/registration.html. RobLib supplements the existing suite of robust statistical functions in S-Plus and provides a graphical user interface.
Introduction to Robust Estimators
It is often assumed in the social sciences that data conform to a normal distribution. Numerous studies have examined real-world data sets for conformity to normality and have strongly questioned this assumption. Sometimes a normal distribution is a good approximation to the data; at other times it is only a rough one. Two approaches have been taken to deal with this reality. The first is a two-stage process in which influential observations are identified and then removed from the data. This so-called outlier analysis involves calculating leverage and influence statistics to help identify influential observations. The second approach, robust estimation, involves calculating estimators that are relatively insensitive to the tails of a data distribution, but which conform to the normal theory approximation at the center of the distribution.
Robust estimators lie somewhere between a nonparametric (distribution-free) approach and a parametric approach. A robust approach distinguishes between the plausible distributions the data may have come from, unlike a nonparametric approach, which treats all possible distributions as equal. The benefit is that robust estimators are very nearly as efficient as the best possible estimators, that is, they are very nearly optimal. An estimator is considered resistant if small changes in many of the observations, or large changes in only a few of them, have only a small effect on its value. For example, the median is a resistant measure of location, while the mean is not.
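The resistance property described above can be illustrated with a small numerical sketch. The code below is written in Python rather than S-Plus, and the sample values are hypothetical, made up purely for illustration. It replaces a single observation with a gross error and measures how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical sample: ten well-behaved measurements near 10.
clean = [9.6, 9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 10.4]

# The same sample with the largest value replaced by a gross error
# (for instance, a misplaced decimal point: 10.4 recorded as 104.0).
contaminated = clean[:-1] + [104.0]


def shift(estimator):
    """Absolute change in an estimate caused by the single bad value."""
    return abs(estimator(contaminated) - estimator(clean))


mean_shift = shift(mean)      # the mean is dragged toward the bad value
median_shift = shift(median)  # the median depends only on the middle values

print("shift in mean:  ", round(mean_shift, 2))
print("shift in median:", round(median_shift, 2))
```

One bad value out of ten moves the mean by most of the spread of the whole sample, while the median does not move, because the two middle order statistics are unchanged.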
The RobLib S-Plus Library
After installing the RobLib library, type library(RobLib) in the Command Window to load it. You can easily obtain both a least squares and a robust linear model fit for the so-called "stack loss" data using the new linear regression dialog box in RobLib. The stack loss data has been used in a number of publications on robust regression, and is known to contain highly influential outliers. The data is included in RobLib as the data frame stack.dat. Open the Data icon in the Object Explorer and select stack.dat. The right-hand pane of the Object Explorer displays the four variables in stack.dat: the response variable Loss, and the three predictor variables Air.Flow, Water.Temp and Acid.Conc. First select the response variable Loss, and then select the three predictor variables. Choose Roblib - Linear Regression from the menubar. The dialog shown below appears.
Because you selected the response variable Loss first, followed by the three predictor variables, the Formula field is automatically filled in with the correct formula, Loss ~ Air.Flow + Water.Temp + Acid.Conc. for modeling Loss in terms of the three predictor variables. The Model page of this dialog looks exactly like that of the Linear Regression dialog in S-PLUS 2000, except for the Options button and the Fitting Option choices. The default choice is LS + Robust (both least squares and robust fits are computed); the alternatives are LS (least squares fit only) and Robust (robust fit only). Click on the Options button to access various optional features of the robust fitting method. Click on the tabs labeled Results, Plot and Predict to look at those dialog pages. You will notice that the Results and Predict pages are identical to those of the Linear Regression dialog in S-PLUS 2000. The Plot page, however, is different: it has three new entries in the Plots region (Std. Resid. vs. Robust Distances, Estimated Residual Density, and Standardized Resid. vs. Index (Time)) and a new Overlaid Plots region with the entries Residuals Normal QQ and Estimated Residual Density.
The Overlaid Plots entries are available only when you have chosen the default LS + Robust option on the Model page. The default choices of plots are indicated by the checked boxes. These defaults encourage you to compare the LS and robust versions of the plots, so that you can quickly determine whether there are any outliers in the data and whether those outliers have an impact on the least squares fit. Click OK to compute both the LS and robust fits, along with the three diagnostic comparison plots and other standard statistical summary information. The results appear in a Report window and four tabbed pages of a Graph Sheet, respectively. Each of the Graph Sheet pages contains a Trellis display for the LS and robust fit. The normal QQ-plot for the LS fit residuals shows at most one outlier, while the one for the robust fit reveals four outliers.
This reveals one of the most important advantages of a good robust fit relative to a least squares fit: the least squares fit is influenced by the outliers so strongly that they are not clearly revealed in its residuals, while the robust fit clearly exposes them. Note also that if you ignore the outliers, a normal distribution is a pretty good model for the residuals in both cases. However, the slope of the central linear portion of the normal QQ-plot of the residuals is noticeably smaller for the robust fit than for the LS fit. This indicates that the normal distribution fitted to the robust residuals, ignoring the outliers, has a substantially smaller standard deviation than the normal distribution fitted to the LS residuals. In this sense, the robust method provides a better fit to the bulk of the data.
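The masking effect described above can be reproduced in a few lines. The following is only a sketch in Python, not the RobLib procedure: it uses hypothetical synthetic data (a straight line with three shifted points at the high end of x) rather than the stack loss data, and it substitutes the Theil-Sen estimator (median of pairwise slopes) for whatever robust estimator RobLib applies. The MAD-based residual scale and the cutoff of 2.5 are likewise illustrative assumptions.

```python
from statistics import mean, median


def least_squares(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    xb, yb = mean(x), mean(y)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sxy / sxx
    return yb - b * xb, b


def theil_sen(x, y):
    """Theil-Sen fit: median of all pairwise slopes, then median intercept."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n)]
    b = median(slopes)
    a = median([yi - b * xi for xi, yi in zip(x, y)])
    return a, b


def mad_scale(residuals):
    """Median absolute deviation, rescaled to estimate a normal std. dev."""
    m = median(residuals)
    return 1.4826 * median([abs(r - m) for r in residuals])


def diagnose(x, y, fit, cutoff=2.5):
    """Fit a line; return slope, residual scale, and flagged indices."""
    a, b = fit(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    scale = mad_scale(residuals)
    flagged = [i for i, r in enumerate(residuals) if abs(r) / scale > cutoff]
    return b, scale, flagged


# Hypothetical data: true line y = 2 + 0.5*x with small alternating noise.
x = list(range(1, 21))
y = [2 + 0.5 * xi + (0.1 if xi % 2 else -0.1) for xi in x]
for i in (17, 18, 19):   # three outliers at the largest x values
    y[i] -= 8.0

ls_slope, ls_scale, ls_flagged = diagnose(x, y, least_squares)
rob_slope, rob_scale, rob_flagged = diagnose(x, y, theil_sen)

print("LS:     slope", round(ls_slope, 3), "scale", round(ls_scale, 3),
      "flagged", ls_flagged)
print("Robust: slope", round(rob_slope, 3), "scale", round(rob_scale, 3),
      "flagged", rob_flagged)
```

On this data the least squares line tilts toward the three shifted points, which inflates its residual scale enough that none of them crosses the cutoff. The robust fit stays with the bulk of the data, so its residual scale is much smaller and the same three points stand out clearly.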
MathSoft. (2001). Robust Library: A Library of New Robust Methods in S-Plus, Version 1.0.