Last time we examined the smoothed bootstrap; this month we demonstrate the calculation of a robust correlation measure. The GNU S
language, "R," is used to implement this procedure. R is a statistical programming environment that is a
clone of the S and S-Plus languages developed at Lucent Technologies. In the following document we
illustrate the use of a GNU Web interface to the R engine on the "rss" server, http://rss.acs.unt.edu/cgi-bin/R/Rprog. This GNU Web
interface is a derivative of the "Rcgi" Perl scripts available for download from the CRAN Website,
http://www.cran.r-project.org (the main "R" Website). Scripts
can be submitted interactively, edited, and re-submitted with changed parameters by selecting the
hypertext link buttons that appear below the figures. For example, clicking the "Run Program" button
below creates 100 random numbers from a bivariate normal distribution; uses linear regression to fit a
least squares line to two of the columns of data; and calculates Pearson's product-moment correlation. To
view any text output, scroll to the bottom of the browser window. To view the scatterplot, select the
"Display Graphic" link. The script can be edited and resubmitted by changing the script in the form
window and then selecting "Run the R Program". Selecting the browser "back page" button will return the
reader to this document.
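To give a sense of what that button runs, here is a minimal sketch of such a script (it assumes the MASS package for its mvrnorm() function; the script actually installed on the rss server may differ in its details):

# Sketch only: generate 100 bivariate normal observations, fit a least
# squares line, and compute Pearson's product-moment correlation.
library(MASS)                                # provides mvrnorm()

Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)     # assumed population covariance
xy    <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
x     <- xy[, 1]
y     <- xy[, 2]

plot(x, y)                                   # scatterplot of the two columns
abline(lm(y ~ x))                            # least squares line
cor(x, y)                                    # Pearson's product-moment correlation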
Pearson's Product Moment Correlation
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a
random sample from a bivariate distribution. The sample estimate of the population correlation $\rho$ is given
by:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

If X and Y are independent, $\rho = 0$. A test of $H_0\!: \rho = 0$ is based on the test statistic:

$$T = r\sqrt{\frac{n-2}{1-r^2}}$$

where n is the number of X,Y pairs of scores. If $H_0$ is true, T has a Student's t distribution with n - 2 degrees of
freedom, provided at least one of the marginal distributions is normal. Reject $H_0$ if $|T| > t_{1-\alpha/2}$, where $t_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of Student's t distribution with
n - 2 degrees of freedom. When the null hypothesis is true, the distribution of sample correlation
coefficients tends to be normally distributed for increasing sample size. The standard error of this
distribution of sample correlations is approximately:

$$\sigma_r \approx \frac{1}{\sqrt{n-1}}$$

When the sample size is reasonably large, it is possible to test the significance of the sample
correlation coefficient by forming the usual z statistic, $z = r/\sigma_r$, and referring it to the standard normal distribution. In
situations where one wishes to test for a non-zero value of the sample correlation coefficient, many
sources have recommended Fisher's r-to-Z transformation when computing confidence intervals, but it is not
asymptotically correct when sampling from nonnormal distributions. Moreover, simulation studies do not
support Fisher's r-to-Z transformation for small sample sizes. The S script below samples from a
bivariate normal distribution whose population correlation is .40. A scatterplot is plotted with a best
fit regression line added to the scatterplot. A classical t-test of the correlation coefficient is
performed, and the standard error under the null hypothesis is calculated. Change the script below to
have different sample sizes and see the effect of sample size on the standard error; also see how the
sample correlation fluctuates from sample to sample around the true population correlation value. With
smaller sample sizes, the sample correlation coefficient can be quite different from the population
value.
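A rough sketch of such a script, assuming a sample size of 50 and the MASS package, might look like the following; the version in the form window may differ:

# Sketch only: sample from a bivariate normal with rho = .40, plot the data
# with a best-fit line, and test the correlation coefficient.
library(MASS)

n     <- 50                                  # assumed sample size; try others
rho   <- 0.40
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
xy    <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
x     <- xy[, 1]
y     <- xy[, 2]

plot(x, y)
abline(lm(y ~ x))                            # best-fit regression line

r  <- cor(x, y)
T  <- r * sqrt((n - 2) / (1 - r^2))          # classical t statistic
p  <- 2 * pt(-abs(T), df = n - 2)            # two-sided p-value
se <- 1 / sqrt(n - 1)                        # approximate SE of r under the null
c(r = r, T = T, p = p, se.null = se)

cor.test(x, y)                               # built-in version of the same t test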
Robust Correlation: The Biweight Midcorrelation
For an introduction to robust statistics and robust measures of location, see the July issue of Benchmarks. Let
$(X_1, Y_1), \ldots, (X_n, Y_n)$ be a random
sample from a bivariate distribution. Following Wilcox (1997, page 197), let:

$$U_i = \frac{X_i - M_x}{9\,\mathrm{MAD}_x}$$

where $M_x$ is the
median of the X scores, and $\mathrm{MAD}_x$ is
the median absolute deviation. A sample estimate of the median absolute deviation is given by:

$$\mathrm{MAD}_x = \mathrm{median}\{|X_i - M_x|\}$$

where $M_x$ is the
sample median. MAD has a finite sample breakdown point of approximately .5. Let $V_i$ be a function of the
Y scores as $U_i$
is for the X scores. The
sample biweight midcovariance between X and Y is given by:

$$s_{bxy} = \frac{n \sum_{i=1}^{n} a_i b_i (X_i - M_x)(1 - U_i^2)^2 (Y_i - M_y)(1 - V_i^2)^2}{\left[\sum_{i=1}^{n} a_i (1 - U_i^2)(1 - 5U_i^2)\right]\left[\sum_{i=1}^{n} b_i (1 - V_i^2)(1 - 5V_i^2)\right]}$$

where $a_i = 1$ if $-1 \le U_i \le 1$ and $a_i = 0$ otherwise, and $b_i = 1$ if $-1 \le V_i \le 1$ and $b_i = 0$ otherwise.
An estimate of the biweight midcorrelation between X and Y is given by:

$$r_b = \frac{s_{bxy}}{\sqrt{s_{bxx}\, s_{byy}}}$$

where $s_{bxx}$ and $s_{byy}$ are the biweight
midvariances of the X and Y
scores. The S script below calculates both the Pearson product-moment correlation and the biweight
midcorrelation for a bivariate normal distribution whose population correlation is .40. An outlier is
added to the sample, and then the two correlation coefficients are recalculated and compared again. A
scatterplot with a best fit line is plotted after the outlier is added. Try changing various parameters
and resubmitting the S script.
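The sketch below is one way such a script could be written: bicovar() and bicor() are illustrative implementations of the formulas above rather than the exact functions in the form window, and the outlier, placed at 2.5 times the sample maxima, is only an assumed construction:

# Sketch only: biweight midcovariance and midcorrelation (Wilcox, 1997, p. 197).
bicovar <- function(x, y) {
  mx <- median(x); my <- median(y)
  u  <- (x - mx) / (9 * mad(x, constant = 1))   # constant = 1 gives the raw MAD
  v  <- (y - my) / (9 * mad(y, constant = 1))
  a  <- as.numeric(abs(u) <= 1)                 # a_i
  b  <- as.numeric(abs(v) <= 1)                 # b_i
  num <- length(x) * sum(a * b * (x - mx) * (1 - u^2)^2 * (y - my) * (1 - v^2)^2)
  den <- sum(a * (1 - u^2) * (1 - 5 * u^2)) * sum(b * (1 - v^2) * (1 - 5 * v^2))
  num / den
}

bicor <- function(x, y) {
  bicovar(x, y) / sqrt(bicovar(x, x) * bicovar(y, y))   # biweight midcorrelation
}

library(MASS)
n     <- 40                                   # assumed sample size
rho   <- 0.40
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
xy    <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
x <- xy[, 1]; y <- xy[, 2]

c(pearson = cor(x, y), biweight = bicor(x, y))               # before the outlier

xout <- c(x, 2.5 * max(x))                    # contaminated sample: one extreme point
yout <- c(y, 2.5 * max(y))
c(pearson = cor(xout, yout), biweight = bicor(xout, yout))   # after the outlier

plot(xout, yout)
abline(lm(yout ~ xout))                       # best-fit line with the outlier included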
Calculating Empirical P-values, Empirical Power, and Confidence Intervals for a Robust Correlation
Coefficient
The S script below calculates observed p-values, power, and confidence intervals for the biweight
midcorrelation coefficient. A comparison is made between the Pearson correlation coefficient and the
biweight midcorrelation before and after an outlier is added. Confidence intervals are calculated by
simulating the null sampling distribution, and also by sampling from the alternative sampling distribution
(Hall & Wilson, 1991). For details on calculating the percentile bootstrap, and calculating power
using the percentile bootstrap, see the September issue of Benchmarks. For
details on the percentile bootstrap see the August issue of Benchmarks. After
running the following S script, try changing various parameters: sample size, population correlation,
correlation coefficient (assign "est" to have value "bicor" or "cor") and the size of the outlier (change
the multiplying factor 2.5 to a smaller value). Changing "est" to have value "cor" will allow one to
calculate the p-value and power of the observed sample for the Pearson product-moment correlation.
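As a rough illustration of the bootstrap portion of that script, the sketch below computes a 95% percentile bootstrap confidence interval for the biweight midcorrelation, reusing x, y, and bicor() from the sketch above; the p-value and power calculations, and the simulation of the null sampling distribution, are omitted here:

# Sketch only: percentile bootstrap confidence interval for a correlation
# estimator; set est <- cor to use Pearson's correlation instead.
est    <- bicor
B      <- 1000                               # number of bootstrap resamples
boot.r <- numeric(B)
for (b in 1:B) {
  i <- sample(seq_along(x), replace = TRUE)  # resample (X, Y) pairs together
  boot.r[b] <- est(x[i], y[i])
}
quantile(boot.r, c(0.025, 0.975))            # 95% percentile bootstrap interval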
Results
The output from the S script above is listed below.
References
Hall P., & Wilson S.R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics,
47, 757-762.
Wilcox, Rand (1997). Introduction to Robust Estimation and Hypothesis Testing. Academic
Press, New York.