
RSS Matters: Robust Statistics in S-Plus
By Rich Herrington, Research and Statistical Support Services
This month we take a look at the library RobLib,
which is shipping with the newest version of S-Plus
(6.0). The library for S-Plus 2000 can be downloaded at
the following URL: http://www.insightful.com/roblib/registration.html.
This library supplements the already existing suite of
robust statistical functions in S-Plus. The RobLib
library provides a graphical user interface.
Introduction to Robust Estimators
It is often assumed in the social sciences that data
conform to a normal distribution. Numerous studies
have examined real-world data sets for conformity to
normality, and have strongly questioned this assumption.
Sometimes we may believe that a normal distribution is a
good approximation to the data, and at other times we may
believe it to be only a rough approximation. Two
approaches have been taken to deal with this reality.
One approach is a two-stage process whereby influential
observations are first identified and then removed from
the data. This so-called outlier analysis involves
the calculation of leverage and influence statistics to
help identify influential observations. The other
approach, robust estimation, involves calculating
estimators that are relatively insensitive to the tails
of a data distribution, but which conform to normal
theory approximation at the center of the data
distribution. These robust estimators lie somewhere
between a nonparametric (distribution-free)
approach and a parametric approach. Consequently, a
robust approach distinguishes between the plausible
distributions the data may come from, unlike a
nonparametric approach, which treats all possible
distributions as equally plausible. The benefit of this
is that robust estimators are very nearly as efficient
(very nearly optimal estimators) as the best possible
estimators. An estimator is considered
resistant if small changes in many of the observations, or
large changes in only a few data points, have only a small
effect on its value. For example, the median is
a resistant measure of location,
while the mean is not.
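The difference in resistance between the mean and the median
can be seen in a small sketch. The example below is written in
Python rather than S-Plus, purely for illustration, and the
numbers are made up: seven measurements near 10, and a second
copy in which one value has been replaced by a gross error.

```python
from statistics import mean, median

# Made-up measurements clustered around 10 (illustrative only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]

# Same data with the last value replaced by a single gross error.
contaminated = clean[:-1] + [97.0]

# How far each location estimate moves when one point goes bad.
mean_shift = abs(mean(contaminated) - mean(clean))
median_shift = abs(median(contaminated) - median(clean))

print("mean:  ", mean(clean), "->", mean(contaminated))
print("median:", median(clean), "->", median(contaminated))
print("shift in mean:", mean_shift, " shift in median:", median_shift)
```

A large change in one observation drags the mean far from the
bulk of the data, while the median stays where it was. This is
what is meant above by calling the median resistant.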
The RobLib S-Plus Library
After installing
the RobLib library, type
library(RobLib) at the Command Window to load the
library. You can easily obtain both a least squares and a
robust linear model fit for the so-called "stack
loss" data using the new linear regression dialog
box in RobLib. The stack loss data has been used in a
number of publications on robust regression, and is known
to contain highly influential outliers. The stack loss
data is included in RobLib as the data frame stack.dat. Open
the Data icon in the Object Explorer and select stack.dat. The right-hand pane of
the Object Explorer displays the four variables in stack.dat: the
response variable Loss, and
the three predictor variables Air.Flow, Water.Temp and Acid.Conc. First select the response variable
Loss, and
then select the three predictor variables. Then choose Linear
Regression from the RobLib menu on the menubar. The dialog shown
below appears.

Because you
selected the response variable Loss first, followed by the three
predictor variables, the Formula field is automatically
filled in with the correct formula Loss ~ Air.Flow + Water.Temp + Acid.Conc. for modeling Loss in
terms of the three predictor variables. Note that the
Model page of this dialog looks exactly like that of the
Linear Regression dialog in S-Plus 2000, with two exceptions:
the Options button and the Fitting Option choices. The default
choice is LS + Robust (both least squares and robust fits are
computed); the alternate choices are LS (least squares fit
only) and Robust (robust fit only). Click on the Options
button to access various optional features of the robust
fitting method. Click on the tabs labeled Results, Plot
and Predict to look at those dialog pages. You will
notice that the Results and Predict pages are identical
to those of the Linear Regression dialog in S-Plus 2000. However, the Plot
page is different in that it has three new Plots region
entries: Std. Resid. vs. Robust Distances, Estimated
Residual Density and Standardized Resid. vs. Index
(Time), and a new Overlaid Plots region with the
entries: Residuals Normal QQ and Estimated
Residual Density.

The latter are
only available when you have chosen the default choice LS
+ Robust on the Model page. The default choices of plots
are indicated by the checked boxes. These defaults encourage you
to quickly compare the LS and robust versions of these
plots and quickly determine whether or not there are any
outliers in the data, and whether or not the outliers
have an impact on the least squares fit. Click OK to
compute both the LS and robust fits, along with the three
diagnostic comparison plots and other standard
statistical summary information. The results appear in a
Report window and four tabbed pages of a Graph Sheet,
respectively. Each of the Graph Sheet pages contains a
Trellis display for the LS and robust fit. The normal
QQ-plot for the LS fit residuals shows at most one
outlier, while the one for the robust fit reveals four
outliers.

This illustrates one
of the most important advantages of a good robust fit
relative to a least squares fit: the least squares fit is
highly influenced by outliers in such a way that the
outliers are not clearly revealed in the residuals, while
the robust fit clearly exposes the outliers. Note also
that if you ignore the outliers, a normal
distribution is a pretty good model for the residuals in
both cases. However, the slope of the central linear
portion of the normal QQ-plot of the residuals for the
robust fit is noticeably smaller than that for the LS
fit. This indicates that the normal distribution fit to
the robust residuals, ignoring the outliers, has a
substantially smaller standard deviation than the normal
distribution fit to the LS residuals. In this sense, the
robust method provides a better fit to the bulk of the
data.
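The masking effect described above can be reproduced on a toy
problem. The sketch below is in Python rather than S-Plus, uses
made-up data (not the stack loss data), and uses the Theil-Sen
line (median of pairwise slopes) as a simple stand-in for a
robust fit; it is not the estimator RobLib uses. The 1.4826
rescaling of the median absolute deviation and the cutoff of 3
are conventional choices assumed here for illustration.

```python
from statistics import median

# Made-up data: y is roughly 1 + 2x, except the last two points,
# which are gross outliers at the high-leverage end of x.
x = list(range(1, 13))
y = [3.3, 4.8, 7.1, 8.6, 11.2, 13.0, 14.9, 17.4, 18.7, 21.1, 8.0, 9.0]

def least_squares(x, y):
    """Ordinary least squares line; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def theil_sen(x, y):
    """Theil-Sen line: median pairwise slope, then median intercept."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    return intercept, slope

def residuals(x, y, fit):
    intercept, slope = fit
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def mad_scale(res):
    """Median absolute deviation, rescaled to estimate a normal SD."""
    center = median(res)
    return 1.4826 * median([abs(r - center) for r in res])

def flag_outliers(x, res, cutoff=3.0):
    """x values whose residual exceeds `cutoff` robust SDs."""
    scale = mad_scale(res)
    return [xi for xi, r in zip(x, res) if abs(r) > cutoff * scale]

ls_fit, ts_fit = least_squares(x, y), theil_sen(x, y)
ls_res, ts_res = residuals(x, y, ls_fit), residuals(x, y, ts_fit)

print("LS slope:", ls_fit[1], " robust slope:", ts_fit[1])
print("LS residual scale:", mad_scale(ls_res))
print("robust residual scale:", mad_scale(ts_res))
print("flagged by LS residuals:", flag_outliers(x, ls_res))
print("flagged by robust residuals:", flag_outliers(x, ts_res))
```

On this made-up data the least squares line is pulled toward the
two planted outliers, so its residuals are spread across all the
points and none stands out. The Theil-Sen residuals are small for
the bulk of the data and large only for the two planted points,
which mirrors the pattern seen in the normal QQ-plots.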
References
MathSoft. (2001). Robust Library: A
Library of New Robust Methods in S-Plus, Version
1.0.