The previous article in this series can be found in a previous issue of Benchmarks Online: Controlling
the False Discovery Rate in Multiple Hypothesis Testing - Ed.

Using Robust Mean and Robust Variance Estimates to Calculate Robust Effect Size

By Dr. Rich Herrington, Research and Statistical Support Consultant
This month we demonstrate the calculation of robust effect sizes. The GNU S language, "R", is used to
implement this procedure. R is a statistical programming environment that is a clone of the S and S-Plus
languages developed at Lucent Technologies. In the following document we illustrate the use of a GNU Web
interface to the R engine on the "rss" server, http://rss.acs.unt.edu/cgi-bin/R/Rprog. This GNU Web
interface is a derivative of the "Rcgi" Perl scripts available for download from the CRAN website,
http://www.cran.r-project.org (the main "R" website). Scripts
can be submitted interactively, edited, and re-submitted with changed parameters by selecting the
hypertext link buttons that appear below the figures. For example, clicking the "Run Program" button
below creates a vector of 100 random normal deviates; displays the results; sorts and displays the results;
then creates a histogram and a density plot of the random numbers. To view any text output, scroll to the
bottom of the browser window. To view any graphical output, select the "Display Graphic" link. The script
can be edited and resubmitted by changing the script in the form window and then selecting "Run the R
Program". Selecting the browser "back page" button will return the reader to this document.

Introduction - Calculating Power and Effect Size

Power analysis involves the relationships among the four variables
involved in statistical inference: sample size ($N$), a significance criterion ($\alpha$), the population effect size, and
statistical power. For any statistical inference,
each of these variables is a function of the other three (Cohen, 1988). For
research planning, it is most useful to determine the $N$ necessary to have a specified power for a given
$\alpha$ and effect size. The statistical power of a test is the long-term
probability of rejecting $H_0$ (the null hypothesis) given a specified criterion $\alpha$ and sample size $N$. When the effect
size is not equal to zero, $H_0$ is false, so a failure to reject $H_0$ is a decision error on the part of the
researcher. This is called a type II error ($\beta$) and is related mathematically to power. The probability of
rejecting the null if it needs to be rejected (power) is one minus the type II error rate ($1 - \beta$). Figure 1 below is
a graphical representation of the relationship between the null distribution, the alternate distribution,
and the critical scores under the null distribution. The area
underneath the $H_1$ distribution (the alternate distribution), past the critical
score of the left tail of $H_0$ and past the critical score of the right tail of $H_0$, represents the power of the
statistical test being performed (the shaded area).
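
The relationship among the four quantities can be sketched with R's built-in power.t.test function for a two-sample
t test. The standardized effect size of 0.5, the .05 criterion, and the .80 power target below are illustrative
assumptions, not values taken from this article.

```r
# Sketch: solve for N given power, then for power given N (two-sample t test).
d     <- 0.5    # assumed standardized mean difference (delta / sd)
alpha <- 0.05   # assumed significance criterion

# Per-group sample size needed for power = 0.80
need <- power.t.test(delta = d, sd = 1, sig.level = alpha, power = 0.80,
                     type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(need$n)

# Power actually achieved at that per-group sample size
got_power <- power.t.test(n = n_per_group, delta = d, sd = 1,
                          sig.level = alpha)$power

# Type II error rate is the complement of power
beta <- 1 - got_power

print(c(n_per_group = n_per_group, power = got_power, beta = beta))
```

Leaving any one of n, delta, or power unspecified asks power.t.test to solve for it from the others, which mirrors
the "function of the other three" relationship described above.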

Effect size is the degree to which $H_0$ (the null hypothesis) is false and is indexed by the discrepancy
between the null hypothesis and the alternate hypothesis. Power analysis
specifies a non-centrality parameter to quantify this discrepancy. The
non-centrality parameter for the difference between means is:

$$\delta = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$

where the difference between estimated population means is scaled in $s_{\bar{X}_1 - \bar{X}_2}$ units (known as the
estimated standard error of the difference between means):

$$s_{\bar{X}_1 - \bar{X}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where

$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$

and $s_p$ is the sample estimate of the population standard
deviation. The denominator of the non-centrality parameter represents the estimated standard deviation of the sampling
distribution for the null hypothesis for differences between means. Usually,
$s_{\bar{X}_1 - \bar{X}_2}$ is calculated on the basis of a formula that assumes
normality in the population, since the standard deviation of the null sampling distribution cannot be calculated
directly on the basis of the observed data without normality assumptions. For
robust measures of location (e.g., an M-estimate), the numerator would be the difference between two M-estimates, and the
denominator would represent the standard deviation of the null hypothesis re-sampling distribution for the difference
between M-estimates. For robust estimates (as well as the sample mean), the
standard error can be estimated directly by calculating the standard deviation of the bootstrap estimates of the
differences between the robust estimates of location (see the September 2001 issue of
Benchmarks).

An alternative effect size for group differences has been advocated by Cohen (1988). Cohen's measure, $d$, is based on the
pooled estimated population standard deviation:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}$$

Cohen provides guidelines for interpreting the practical importance of an effect size based on $d$ when no prior research
is available to anchor $d$ meaningfully.
Cohen's rules of thumb for small, medium, and large effect sizes are based on a wide examination of the typical
differences found in psychological data. A small effect size for $d$ is .20; a medium
effect size for $d$ is .50; and a large effect size for $d$ is .80 (Cohen, 1992).
Equating $\delta$ and $d$ and using algebra, the expression for $\delta$ is:

$$\delta = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

(for equal group sizes $n_1 = n_2 = n$, this reduces to $\delta = d \sqrt{n/2}$).

It is noted that $d$ is not a robust measure
of effect size. The pooled sample standard deviation, which is used to estimate the population standard
deviation ($\sigma$),
will be inflated in the presence of outliers, thereby biasing the effect size measure. Furthermore,
$d$ assumes a
normal distribution in the calculation of power estimates.
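
As a rough illustration, the sketch below computes Cohen's d from a pooled standard deviation, rescales it by
sqrt(n1 * n2 / (n1 + n2)) to obtain the non-centrality-type ratio, and then adds one wild score to show how the pooled
standard deviation reacts. It assumes the usual pooled-variance standard error; the data are simulated and the names
pooled_sd, y_wild, and d_wild are hypothetical.

```r
# Hypothetical scores for two groups (simulated, for illustration only)
set.seed(1)
x <- rnorm(20, mean = 12, sd = 2)   # group 1
y <- rnorm(25, mean = 10, sd = 2)   # group 2
n1 <- length(x); n2 <- length(y)

# Pooled standard deviation s_p
pooled_sd <- function(a, b) {
  sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
         (length(a) + length(b) - 2))
}

d     <- (mean(x) - mean(y)) / pooled_sd(x, y)   # Cohen's d
delta <- d * sqrt(n1 * n2 / (n1 + n2))           # delta recovered from d

# delta coincides with the equal-variance two-sample t statistic
t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)

# One wild score in group 2 inflates s_p (and shifts that group's mean),
# so d changes even though the other 44 scores are untouched
y_wild   <- c(y, 60)
sp_clean <- pooled_sd(x, y)
sp_wild  <- pooled_sd(x, y_wild)
d_wild   <- (mean(x) - mean(y_wild)) / sp_wild

print(c(d = d, delta = delta, t = t_stat))
print(c(sp_clean = sp_clean, sp_wild = sp_wild, d_wild = d_wild))
```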

Measures of Robust Effect Size

Several problems exist with the $d$ measure
of effect size. The assumption of equal variances in the population
is often dealt with by substituting a pooled variance estimate for
$\sigma^2$. With data that appear to have unequal variances, questions arise
about how to interpret $d$. Another criticism of $d$ is that both the location and scale (mean difference and
sample standard deviation) of the sample are non-resistant measures. One
strategy would be to replace the means and standard deviation with more resistant measures of location and
scale. For example, one variation might
be a difference of medians divided by MAD (median absolute deviation):

$$d_{MAD} = \frac{M_1 - M_2}{MAD_1}$$

where
$MAD_1 = \mathrm{median}(|X_{i1} - M_1|)$ and
$M_1$ is the median of the scores in the control group. This effect size
estimator does not seem like a good candidate, since the median and MAD are both known to be
inefficient for Normal distributions compared to the mean and standard deviation.
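
A minimal sketch of this median/MAD variant follows. The scores are made-up values, the helper names mad_raw and es_mad
are hypothetical, and the choice of control minus experimental in the numerator is an assumption about sign convention.
Note that R's mad() multiplies by 1.4826 by default, so constant = 1 is needed to obtain the unscaled MAD.

```r
# Made-up scores: x = control group (with one outlier), y = experimental group
x <- c(21, 23, 19, 25, 22, 24, 20, 60)
y <- c(15, 17, 14, 18, 16, 13, 19, 16)

# Unscaled median absolute deviation about the median
mad_raw <- function(v) median(abs(v - median(v)))

# Difference of medians scaled by the control group's MAD
es_mad <- function(control, treated) {
  (median(control) - median(treated)) / mad_raw(control)
}

es <- es_mad(x, y)

# Classical counterpart scaled by the control group's standard deviation
es_classical <- (mean(x) - mean(y)) / sd(x)

print(c(es_mad = es, es_classical = es_classical))
```

The outlier of 60 barely moves the median or the MAD of the control group, while it pulls both the mean and the standard
deviation upward.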

Robust Effect Size based on M-estimators

Lax (1985) examined the performance of 17 different estimators of
scale with heavy-tailed distributions. Lax examined the performance
of these scale estimators with the Normal distribution; a distribution with Cauchy-like tails (large kurtosis
relative to the Normal; the Slash distribution); and a mixture distribution of N(0,1) and N(0,100) for samples of
size 20. The mixture distribution had 19 points sampled from N(0,1)
and 1 point sampled from N(0,100) (the One-Wild distribution). Lax combined
the efficiencies (see the July 2001 issue
of Benchmarks) of the estimators for the three distributions into
what was defined as triefficiency. The biweight midvariance
estimator (with c=9) performed best, with favorable efficiencies across all scenarios: Normal (86.7%),
One-Wild (85.8%), and Slash (86.1%). Following Wilcox (1997), the
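biweight midvariance can be calculated as follows. Setting (with
$c = 9$, $M$ = sample median, and MAD = median absolute deviation):

$$Y_i = \frac{X_i - M}{c \times MAD}$$

and

$$a_i = \begin{cases} 1, & |Y_i| < 1 \\ 0, & |Y_i| \ge 1 \end{cases}$$

the following is calculated:

$$\hat{\zeta}_{bimid} = \frac{\sqrt{n}\,\sqrt{\sum a_i (X_i - M)^2 (1 - Y_i^2)^4}}{\left|\sum a_i (1 - Y_i^2)(1 - 5 Y_i^2)\right|}$$

The square of $\hat{\zeta}_{bimid}$ is called the biweight midvariance. It
appears to have a breakdown point of approximately .5 (Hoaglin, Mosteller, & Tukey,
1983).

Based on this robust
variance, the following robust effect size can be calculated:

$$d_R = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\hat{\zeta}_{bimid,1}}$$

where $\hat{\theta}_1$ is the robust M-estimator for group 1 (using the Huber objective
function, with k=1.28 for both groups), $\hat{\theta}_2$ is the robust M-estimator for group 2, and
$\hat{\zeta}_{bimid,1}$ is the square root of the biweight midvariance for group 1
(control group). The robust effect size does not assume equal variances among groups, since only the robust variance for the
control group is used (alternatively, a pooled estimate of both the control and experimental group
biweight midvariances could be used, assuming equal variances).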
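
The calculation can be sketched in base R as follows. This is an illustrative sketch rather than the listing used for
this article: the data are simulated heavy-tailed stand-ins, the function names bivar and huber_m are hypothetical, the
Huber estimate holds its scale fixed at the normalized MAD, and the pooled variant borrows the degrees-of-freedom
weighting of the classical pooled variance. The last step estimates the standard error of the difference between the two
M-estimates by bootstrap, as described in the introduction.

```r
# Simulated stand-ins: x = control (group 1), y = experimental (group 2)
set.seed(2002)
x <- 22 + 6 * rt(23, df = 3)
y <- 11 + 6 * rt(22, df = 3)

# Biweight midvariance (squared zeta-hat) with tuning constant 9
bivar <- function(v, c_tune = 9) {
  M <- median(v)
  Y <- (v - M) / (c_tune * median(abs(v - M)))  # unscaled MAD in denominator
  a <- as.numeric(abs(Y) < 1)                   # a_i = 1 if |Y_i| < 1, else 0
  length(v) * sum(a * (v - M)^2 * (1 - Y^2)^4) /
    sum(a * (1 - Y^2) * (1 - 5 * Y^2))^2
}

# Huber M-estimate of location (k = 1.28) by iterative reweighting.
# Assumption: scale is held fixed at the normalized MAD, mad(v).
huber_m <- function(v, k = 1.28, tol = 1e-8, max_iter = 200) {
  s  <- mad(v)
  mu <- median(v)
  for (i in seq_len(max_iter)) {
    w      <- pmin(1, k / abs((v - mu) / s))    # zero residual gets weight 1
    mu_new <- sum(w * v) / sum(w)
    done   <- abs(mu_new - mu) < tol * s
    mu     <- mu_new
    if (done) break
  }
  mu
}

# Robust effect size scaled by the control group's root biweight midvariance
es_robust <- (huber_m(x) - huber_m(y)) / sqrt(bivar(x))

# Equal-variance alternative: pool the two biweight midvariances
n1 <- length(x); n2 <- length(y)
bw_pooled <- ((n1 - 1) * bivar(x) + (n2 - 1) * bivar(y)) / (n1 + n2 - 2)
es_robust_pooled <- (huber_m(x) - huber_m(y)) / sqrt(bw_pooled)

# Bootstrap standard error of the difference between the two M-estimates
B <- 1000
boot_diff <- replicate(B, huber_m(sample(x, replace = TRUE)) -
                          huber_m(sample(y, replace = TRUE)))
se_boot <- sd(boot_diff)

print(c(es_robust = es_robust, es_robust_pooled = es_robust_pooled,
        se_boot = se_boot))
```

Observations with |Y| of 1 or more receive zero weight in bivar, which is what keeps a single wild score from inflating
the scale estimate.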

An Example Using GNU-S ("R")

Doksum & Sievers (1976) report data on a study designed to assess the
effects of ozone on weight gain in rats. The experimental group
consisted of 22 seventy-day-old rats kept in an ozone environment for 7 days (group y). The control group consisted of
23 rats of the same age (group x) that were
kept in an ozone-free environment. Weight gain is measured in grams. The following R code produces
quantile-quantile plots and non-parametric density plots of the two groups of data:

Resulting qqnorm plots and density plots from R code above:

Both groups appear to have "heavy" right and left tails. It appears that the classical mean
difference between the groups will be underestimated (smaller). The R code below estimates both the
classical and robust means, the classical estimated pooled standard deviation, and the estimated pooled
robust root biweight midvariance.