Page One

Campus Computing News

Winter Break Hours

Forget about Santa Claus; 'GAUSS' is coming to town!

EagleMail News

Lab-of-the-Month: SLIS

IDE RAID Technology

Today's Cartoon

RSS Matters

SAS Corner

The Network Connection

List of the Month

WWW@UNT.EDU

Short Courses

IRC News

Staff Activities

Subscribe to Benchmarks Online
    

Research and Statistical Support Logo

RSS Matters

Dealing with Outliers in Bivariate Data: Robust Correlation

By Dr. Rich Herrington, Research and Statistical Support Consultant

Last time we examined the smoothed bootstrap, this month we demonstrate the calculation of a robust correlation measure. The GNU S language, "R" is used to implement this procedure. R is a statistical programming environment that is a clone of the S and S-Plus language developed at Lucent Technologies. In the following document we illustrate the use of a GNU Web interface to the R engine on the "rss" server, http://rss.acs.unt.edu/cgi-bin/R/Rprog. This GNU Web interface is a derivative of the "Rcgi" Perl scripts available for download from the CRAN Website, http://www.cran.r-project.org (the main "R" Website). Scripts can be submitted interactively, edited, and be re-submitted with changed parameters by selecting the hypertext link buttons that appear below the figures. For example, clicking the "Run Program" button  below creates 100 random numbers from a bivariate normal distribution; uses linear regression to fit a least squares line to two of the columns of data; and calculates person's product moment correlation.  To view any text output, scroll to the bottom of the browser window. To view the scatterplot, select the "Display Graphic" link. The script can be edited and resubmitted by changing the script in the form window and then selecting "Run the R Program". Selecting the browser "back page" button will return the reader to this document.

Pearson's Product Moment Correlation 

Let  wpe40.jpg (1479 bytes)  be a random sample from a bivariate distribution. The sample estimate of the population correlation wpe3D.jpg (738 bytes) is given by:

wpe3E.jpg (3106 bytes)

If X and Y are independent, wpe45.jpg (834 bytes).  A test of wpe46.jpg (1036 bytes) is based on the test statistic:

wpe56.jpg (1464 bytes)

where n is the number of X,Y pairs of scores.  If wpe53.jpg (833 bytes) is true, T has a Student's t distribution with df = n-2 degrees of freedom if at least one of the marginal distributions is normal.  Reject wpe46.jpg (1036 bytes) if  wpe55.jpg (1117 bytes), for the wpe6E.jpg (885 bytes) quantile of Student's t distribution with n-2 degrees of freedom.  When the null hypothesis is true, the distribution of sample correlation coefficients tends to be normally distributed for increasing sample size.  The standard error of this distribution of sample correlations is approximately:

wpe123.jpg (1149 bytes)   

When the sample size is reasonably large (wpe124.jpg (915 bytes)), then it is possible to test the significance of the sample correlation coefficient by forming the usual z statistic and referring it to the normal distribution.  In situations where one wishes to test for a non-zero value of the sample correlation coefficient, many sources have recommended Fisher's r-toZ transformation when computing confidence intervals, but it is not asymptotically correct when sampling from nonnormal distributions.  Moreover, simulation studies do not support Fisher's r-to-Z transformation for small sample sizes.  The S script below samples from a bivariate normal distribution whose population correlation is .40.  A scatterplot is plotted with a best fit regression line added to the scatterplot.  A classical t-test of the correlation coefficient is performed, and the standard error under the null hypothesis is calculated.   Change the script below to have different sample sizes and see the effect of sample size on the standard error; also see how the sample correlation fluctuates from sample to sample around the true population correlation value.  With smaller sample sizes, the sample correlation coefficient can be quite different from the population value. 

Robust Correlation:  The Biweight Midcorrelation

For an introduction to robust statistics and robust measures of location, see the July issue of Benchmarks.  Let  wpe40.jpg (1479 bytes)  be a random sample from a bivariate distribution.  Following Wilcox (1997, page 197), let:

wpe81.jpg (1745 bytes)

where wpe82.jpg (845 bytes) is the median, and wpe83.jpg (998 bytes) is the median absolute deviation.  A sample estimate of the median absolute deviation is given by:

wpe88.jpg (2972 bytes)

where wpe89.jpg (845 bytes) is the sample median.   MAD has a finite sample breakdown point of approximately .5.   Let wpe8A.jpg (777 bytes) be a function of the wpe8B.jpg (769 bytes) scores as is wpe8C.jpg (811 bytes)is for the wpe8E.jpg (804 bytes) scores.  The sample biweight midcovariance between wpeA1.jpg (750 bytes) and wpeA2.jpg (727 bytes) is given by:

wpeA3.jpg (5360 bytes)

where,

wpeA7.jpg (3616 bytes)

An estimate of the biweight midcorrelation between wpeA1.jpg (750 bytes) and wpeA2.jpg (727 bytes) is given by:

wpeC0.jpg (1761 bytes)

where wpeC1.jpg (872 bytes) and wpeC2.jpg (884 bytes) are the biweight midvariance for the wpeA1.jpg (750 bytes) and wpeA2.jpg (727 bytes) scores.  The S script below calculates both the Pearson product-moment correlation and the biweight midcorrelation for a bivariate normal distribution whose population correlation is .40.  An outlier is added to the sample, and then the two correlation coefficients are recalculated and compared again.  A scatterplot with a best fit line is plotted after the outlier is added.  Try changing various parameters and resubmitting the S script.

Calculating Empirical P-values, Empirical Power, and Confidence Intervals for a Robust Correlation Coefficient

The S script below calculates observed p-values, power, and confidence intervals for the biweight midcorrelation coefficient.   A comparison is made between the Pearson correlation coefficient and the biweight midcorrelation before and after an outlier is added.   Confidence intervals are calculated by simulating the null sampling distribution, and also by sampling from the alternate sampling distribution (Hall & Wilson, 1991).  For details on calculating the percentile bootstrap, and calculating power using the percentile bootstrap, see the September issue of Benchmarks.  For details on the percentile bootstrap see the August issue of Benchmarks.  After running the following S script, try changing various parameters:   sample size, population correlation,  correlation coefficient (assign "est" to have value "bicor" or "cor") and the size of the outlier (change the multiplying factor 2.5 to a smaller value).  Changing "est" to have value "cor" will allow one to calculate the p-value and power of the observed sample for the Pearson product-moment correlation.


Results

The output from the S script above are listed below.

wpe10B.jpg (15261 bytes)

 

wpe10C.jpg (18451 bytes)

 

wpe10D.jpg (12201 bytes)

 

wpe110.jpg (9330 bytes)

 

wpe120.jpg (18287 bytes)

 

wpe114.jpg (16649 bytes)

 

wpe115.jpg (16996 bytes)


References

Hall P., & Wilson S.R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics, 47, 757-762.

Wilcox, Rand (1997). Introduction to Robust Estimation and Hypothesis Testing.   Academic Press,  New York.