
RSS Matters

The previous article in this series can be found in a previous issue of Benchmarks Online: Controlling the False Discovery Rate in Multiple Hypothesis Testing - Ed.

Using Robust Mean and Robust Variance Estimates to Calculate Robust Effect Size 

By Dr. Rich Herrington, Research and Statistical Support Consultant

This month we demonstrate the calculation of robust effect sizes.  The GNU S language, "R", is used to implement this procedure.  R is a statistical programming environment that is a clone of the S and S-Plus language developed at Lucent Technologies.  In the following document we illustrate the use of a GNU Web interface to the R engine on the "rss" server.  This GNU Web interface is a derivative of the "Rcgi" Perl scripts available for download from the CRAN website (the main "R" website).  Scripts can be submitted interactively, edited, and resubmitted with changed parameters by selecting the hypertext link buttons that appear below the figures.  For example, clicking the "Run Program" button below creates a vector of 100 random normal deviates; displays the results; sorts and displays the results; then creates a histogram and a density plot of the random numbers.  To view any text output, scroll to the bottom of the browser window.  To view any graphical output, select the "Display Graphic" link.  The script can be edited and resubmitted by changing the script in the form window and then selecting "Run the R Program".  Selecting the browser "back page" button will return the reader to this document.

Introduction  - Calculating Power and Effect Size

     Power analysis involves the relationships among four variables involved in statistical inference: sample size ($N$), a significance criterion ($\alpha$), the population effect size ($d$), and statistical power.  For any statistical inference, each of these variables is a function of the other three (Cohen, 1988).  For research planning, it is most useful to determine the $N$ necessary to have a specified power for a given $\alpha$ and $d$.  The statistical power of a test is the long-term probability of rejecting $H_0$ (the null hypothesis) given a specified $\alpha$ criterion, population effect size $d$, and sample size $N$.  When the effect size is not equal to zero, $H_0$ is false, so a failure to reject $H_0$ is a decision error on the part of the researcher.  This is called a type II error; its probability is denoted $\beta$ and is related mathematically to power.  The probability of rejecting the null hypothesis when it should be rejected (power) is one minus the type II error probability ($1 - \beta$).  Figure 1 below is a graphical representation of the relationship between the null distribution, the alternate distribution, and the critical scores under the null distribution.  The area underneath the $H_1$ distribution (the alternate distribution) past the critical score of the left tail of $H_0$, and past the critical score of the right tail of $H_0$, represents the power of the statistical test being performed (the shaded area).

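[Figure 1. The null distribution, the alternate distribution, and the critical scores under the null distribution; the shaded area is the power of the test.]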
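As a rough numerical companion to Figure 1, the sketch below computes the shaded area for a two-tailed test under a normal approximation.  It is written in Python (standard library only) rather than R, and the function name two_tailed_power and its shift argument (the distance between the null and alternate means, in standard-error units) are assumptions made for this illustration, not part of the original article.

```python
from statistics import NormalDist

def two_tailed_power(shift, alpha=0.05):
    """Approximate power of a two-tailed z-test (normal approximation).

    shift: distance between the null and alternate means, expressed in
    standard-error units (a hypothetical input for this sketch).
    """
    z = NormalDist()                    # standard normal distribution
    crit = z.inv_cdf(1 - alpha / 2)     # right-tail critical score under H0
    upper = 1 - z.cdf(crit - shift)     # area of H1 past the right critical score
    lower = z.cdf(-crit - shift)        # area of H1 past the left critical score
    return upper + lower

# When H0 is true (shift = 0) the rejection probability is alpha itself.
print(round(two_tailed_power(0.0), 3))
# A larger shift moves more of the alternate distribution past the critical score.
print(round(two_tailed_power(2.8), 3))
```

Power rises toward 1 as the shift grows, which is the relationship the figure depicts.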

     Effect size is the degree to which $H_0$ (the null hypothesis) is false, and it is indexed by the discrepancy between the null hypothesis and the alternate hypothesis.  Power analysis specifies a noncentrality parameter to quantify this discrepancy.  The noncentrality parameter for the difference between means is:

$$\delta = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$

where the difference between estimated population means is scaled in $s_{\bar{X}_1 - \bar{X}_2}$ units (known as the estimated standard error of the difference between means):

$$s_{\bar{X}_1 - \bar{X}_2} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

$$s = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$$

and $s$ is the sample estimate of the population standard deviation.  The denominator of the noncentrality parameter represents the estimated standard deviation of the sampling distribution for the null hypothesis for differences between means.  Usually, $s_{\bar{X}_1 - \bar{X}_2}$ is calculated on the basis of a formula that assumes normality in the population, since the standard deviation of the null sampling distribution cannot be calculated directly from the observed data without normality assumptions.  For robust measures of location (e.g., M-estimates), the numerator would be the difference between two M-estimates, and the denominator would represent the standard deviation of the null hypothesis re-sampling distribution for the difference between M-estimates.  For robust estimates (as well as the sample mean), the standard error can be estimated directly by calculating the standard deviation of the bootstrap estimates of the differences between the robust estimates of location (see September 2001 issue of Benchmarks).  An alternative effect size for group differences has been advocated by Cohen (1988).  Cohen's $d$ measure is based on the pooled estimated population standard deviation:
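The bootstrap standard error described above can be sketched as follows.  This is a minimal Python illustration, not the article's R code: the median stands in for the M-estimate of location, the function name bootstrap_se_diff and its parameters are hypothetical, and the data are made up for the example.

```python
import random
from statistics import median, stdev

def bootstrap_se_diff(x, y, estimator=median, n_boot=2000, seed=1):
    """Bootstrap standard error of estimator(x) - estimator(y).

    The median is used as a stand-in robust estimator (an assumption);
    any function that maps a list of numbers to a location estimate works.
    """
    rng = random.Random(seed)            # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))    # resample group 1 with replacement
        yb = rng.choices(y, k=len(y))    # resample group 2 with replacement
        diffs.append(estimator(xb) - estimator(yb))
    return stdev(diffs)                  # SD of the bootstrap differences

# Made-up illustrative data, not taken from the article.
x = [12.1, 14.3, 15.0, 15.8, 16.2, 17.9, 18.4, 30.5]
y = [2.1, 9.4, 10.2, 11.1, 11.7, 12.3, 13.0, 13.6]

se = bootstrap_se_diff(x, y)
# Difference in robust location scaled by its bootstrap standard error.
print(se, (median(x) - median(y)) / se)
```

The final ratio plays the role of the noncentrality parameter when a robust estimate of location replaces the mean.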

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s}$$

Cohen provides guidelines for interpreting the practical importance of an effect size based on $d$ when no prior research is available to anchor $d$ meaningfully.  Cohen's rules of thumb for small, medium, and large effect sizes are based on a wide examination of the typical differences found in psychological data.  A small effect size for $d$ is .20; a medium effect size for $d$ is .50; and a large effect size for $d$ is .80 (Cohen, 1992).  Equating the noncentrality parameter $\delta$ and $d$ using algebra, the expression for $\delta$ in terms of $d$ and the group sizes $n_1$ and $n_2$ is:

$$\delta = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$
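A small worked sketch of these two quantities, in Python with the standard library only.  It assumes the usual pooled-variance standard error of the difference between means, under which the noncentrality parameter equals d times the square root of n1*n2/(n1 + n2); the function names and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(x, y):
    """Pooled estimate of the population standard deviation."""
    n1, n2 = len(x), len(y)
    return sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))

def cohens_d(x, y):
    """Mean difference scaled by the pooled standard deviation."""
    return (mean(x) - mean(y)) / pooled_sd(x, y)

def noncentrality(d, n1, n2):
    """Noncentrality parameter implied by d and the two group sizes."""
    return d * sqrt(n1 * n2 / (n1 + n2))

# Made-up illustrative data, not taken from the article.
x = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0]
y = [18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]

d = cohens_d(x, y)
delta = noncentrality(d, len(x), len(y))

# The same value results from scaling the mean difference by its standard error.
se_diff = pooled_sd(x, y) * sqrt(1 / len(x) + 1 / len(y))
print(d, delta, (mean(x) - mean(y)) / se_diff)
```

Unlike $d$, the noncentrality parameter grows with the group sizes, which is why it drives power.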

It is noted that $d$ is not a robust measure of effect size.  The pooled sample standard deviation, which is used to estimate the population standard deviation ($\sigma$), will be inflated in the presence of outliers, thereby biasing the effect size measure.  Furthermore, power estimates calculated from $d$ assume a normal distribution.
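The sensitivity of $d$ to a single outlier is easy to see numerically.  The Python sketch below uses made-up scores (not data from the article) and a hypothetical helper cohens_d; one score in the second group is replaced by a wild value and $d$ is recomputed.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Classical Cohen's d using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    sp = sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    return (mean(x) - mean(y)) / sp

# Made-up illustrative scores.
control = [20, 21, 22, 23, 24, 25, 26]
treated = [15, 16, 17, 18, 19, 20, 21]
d_clean = cohens_d(control, treated)

# Replace the last treated score with one wild observation.
treated_wild = treated[:-1] + [40]
d_wild = cohens_d(control, treated_wild)

print(round(d_clean, 2), round(d_wild, 2))
```

The outlier both pulls the treated mean toward the control mean and inflates the pooled standard deviation, so the estimated effect size shrinks sharply even though six of the seven treated scores are unchanged.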

Measures of Robust Effect Size

     Several problems exist with the $d$ measure of effect size.  The assumption of equal variances in the population is often dealt with by substituting a pooled variance estimate for $\sigma^2$.  With data that appear to have unequal variances, questions arise about how to interpret $H_0$.  Another criticism of $d$ is that both the location and scale measures of the sample (mean difference and sample standard deviation) are non-resistant.  One strategy would be to replace the means and standard deviation with more resistant measures of location and scale.  For example, one variation might be a difference of medians divided by MAD (median absolute deviation):

$$d_{MAD} = \frac{M_C - M_E}{MAD_C}$$

where $MAD_C = \mathrm{median}\,|X_i - M_C|$, $M_C$ is the median of the scores in the control group, and $M_E$ is the median of the scores in the experimental group.  This effect size estimator does not seem like a good candidate, since the median and MAD are both known to be inefficient for Normal distributions compared to the mean and standard deviation.
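A minimal Python sketch of this median and MAD variation follows.  The ordering of the groups (control minus experimental), the use of the unscaled control-group MAD, the function names, and the data are all assumptions made for illustration.

```python
from statistics import median

def mad(x):
    """Unscaled median absolute deviation about the median."""
    m = median(x)
    return median([abs(xi - m) for xi in x])

def median_mad_effect(control, treated):
    """Difference of medians divided by the control-group MAD."""
    return (median(control) - median(treated)) / mad(control)

# Made-up illustrative scores.
control = [20, 21, 22, 23, 24, 25, 26]
treated = [15, 16, 17, 18, 19, 20, 21]
treated_wild = treated[:-1] + [40]    # one wild observation

print(median_mad_effect(control, treated))
print(median_mad_effect(control, treated_wild))
```

Both calls give the same value because the median ignores the wild score: the measure is resistant, which is its appeal, although that resistance comes at the cost of efficiency when the data really are Normal.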

Robust Effect Size based on M-estimators

     Lax (1985) examined the performance of 17 different estimators of scale with heavy-tailed distributions.  Lax examined the performance of these scale estimators with the Normal distribution; a distribution with Cauchy tails (large kurtosis relative to the Normal; the Slash distribution); and a mixture distribution of N(0,1) and N(0,100) for samples of size 20.  The mixture distribution had 19 points sampled from N(0,1) and 1 point sampled from N(0,100) (the One-Wild distribution).  Lax combined the efficiencies (see July 2001 issue of Benchmarks) of the estimators for the three distributions into what was defined as triefficiency.  The biweight midvariance estimator (with c=9) performed best, with favorable efficiencies across all scenarios: Normal (86.7%), One-Wild (85.8%), and Slash (86.1%).  Following Wilcox (1997), the biweight midvariance can be calculated as follows.  Setting (with c=9, M=sample median):

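$$Y_i = \frac{X_i - M}{9 \times MAD} \quad \text{and} \quad a_i = \begin{cases} 1 & \text{if } |Y_i| < 1 \\ 0 & \text{if } |Y_i| \ge 1 \end{cases}$$

the following is calculated:

$$\hat{\zeta}_{bimid} = \frac{\sqrt{n} \sqrt{\sum a_i (X_i - M)^2 (1 - Y_i^2)^4}}{\left| \sum a_i (1 - Y_i^2)(1 - 5 Y_i^2) \right|}$$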

     The square of $\hat{\zeta}_{bimid}$ is called the biweight midvariance.  It appears to have a breakdown point of approximately .5 (Hoaglin, Mosteller, & Tukey, 1983).  Based on this robust variance, the following robust effect size can be calculated:
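The calculation can be sketched in Python as below.  This is an illustrative implementation of the biweight midvariance as given by Wilcox (1997), using the unscaled MAD and c=9; the function name and the test data are made up, and the article's own R code may differ in detail.

```python
from math import sqrt
from statistics import median

def biweight_midvariance(x, c=9.0):
    """Biweight midvariance with tuning constant c (unscaled MAD)."""
    n = len(x)
    m = median(x)
    mad = median([abs(xi - m) for xi in x])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight midvariance is undefined")
    num = 0.0
    den = 0.0
    for xi in x:
        y = (xi - m) / (c * mad)
        if abs(y) < 1:                    # a_i = 1; points with |y| >= 1 get weight 0
            num += (xi - m) ** 2 * (1 - y * y) ** 4
            den += (1 - y * y) * (1 - 5 * y * y)
    return n * num / den ** 2             # square of sqrt(n)*sqrt(num)/|den|

# Made-up illustrative data: a clean sample and the same sample with one wild value.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
wild = [1, 2, 3, 4, 5, 6, 7, 8, 1000]

print(biweight_midvariance(clean), sqrt(biweight_midvariance(clean)))
print(biweight_midvariance(wild), sqrt(biweight_midvariance(wild)))
```

For the clean sample the result is close to the classical variance of 7.5, while the wild value, which would push the classical variance above 100,000, is simply given zero weight.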

$$d_R = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\zeta}_{bimid,1}}$$

where $\hat{\mu}_1$ is the robust M-estimator for group 1 (using the Huber objective function, with k=1.28 for both groups), $\hat{\mu}_2$ is the robust M-estimator for group 2, and $\hat{\zeta}_{bimid,1}$ is the square root of the biweight midvariance for group 1 (control group).  The robust effect size $d_R$ does not assume equal variances among groups, since only the robust variance for the control group is used (alternatively, a pooled estimate of the control and experimental group biweight midvariances could be used, assuming equal variances).
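Putting the pieces together, the Python sketch below computes a robust effect size of this form.  The Huber M-estimate is obtained here by iteratively reweighted averaging with the scale fixed at MAD/0.6745; that is one common way to compute it and is an assumption, not necessarily the routine used in the article's R code.  All function names and the data are made up for illustration.

```python
from math import sqrt
from statistics import median

def mad(x):
    """Unscaled median absolute deviation about the median."""
    m = median(x)
    return median([abs(xi - m) for xi in x])

def huber_location(x, k=1.28, tol=1e-8, max_iter=200):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    scale = mad(x) / 0.6745               # MAD rescaled to estimate sigma under normality
    mu = median(x)                        # resistant starting value
    for _ in range(max_iter):
        # Huber weights: 1 inside k*scale of the current estimate, shrinking outside.
        w = [1.0 if xi == mu else min(1.0, k * scale / abs(xi - mu)) for xi in x]
        new_mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def root_biweight_midvariance(x, c=9.0):
    """Square root of the biweight midvariance (unscaled MAD, c=9)."""
    m, s = median(x), mad(x)
    ys = [(xi - m) / (c * s) for xi in x]
    num = sum((xi - m) ** 2 * (1 - y * y) ** 4 for xi, y in zip(x, ys) if abs(y) < 1)
    den = sum((1 - y * y) * (1 - 5 * y * y) for y in ys if abs(y) < 1)
    return sqrt(len(x) * num) / abs(den)

def robust_effect_size(control, treated):
    """Difference of Huber M-estimates scaled by the control group's robust scale."""
    return (huber_location(control) - huber_location(treated)) / root_biweight_midvariance(control)

# Made-up illustrative scores; the second treated sample has one wild value.
control = [20, 21, 22, 23, 24, 25, 26]
treated = [15, 16, 17, 18, 19, 20, 21]
treated_wild = treated[:-1] + [40]

print(robust_effect_size(control, treated))
print(robust_effect_size(control, treated_wild))
```

The wild score changes the robust effect size only slightly, because the Huber weights limit its pull on the location estimate and the scale is taken from the control group alone.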

An Example Using GNU-S ("R")

Doksum & Sievers (1976) report data from a study designed to assess the effects of ozone on weight gain in rats.  The experimental group consisted of 22 seventy-day-old rats kept in an ozone environment for 7 days (group y).  The control group consisted of 23 rats of the same age (group x), which were kept in an ozone-free environment.  Weight gain is measured in grams.  The following R code produces quantile-quantile plots and non-parametric density plots of the two groups of data:


Resulting qqnorm plots and density plots from R code above:

[Figure: Normal quantile-quantile plots and density plots for the control group (x) and the experimental group (y).]

Both groups appear to have heavy right and left tails.  It appears that the classical mean difference between the groups will be underestimated (smaller).  The R code below estimates the classical means and the robust means, the classical estimated pooled standard deviation, and the estimated pooled robust root biweight midvariance.


Results and Conclusion


The resulting M-estimates suggest that the classical estimate of the control group mean is biased downward (23.24 robust; 22.40 classical) and that the classical estimate of the experimental group mean is biased upward (9.69 robust; 11.01 classical).  Additionally, the robust pooled scale estimate is smaller than the classical pooled scale estimate (14.36 robust; 15.36 classical).  Using these estimates to calculate Cohen's d indicates that the classical effect size is biased downward.  Cohen's d based on classical estimators suggests a medium to large effect size (.74), whereas Cohen's d based on robust estimators suggests a very large effect size (.94).  In terms of sample size planning for future experiments, the robust Cohen's d suggests that a considerably smaller sample would be needed to achieve the same power than the smaller effect size based on non-robust estimators of location and scale would imply: a considerable savings in terms of the data that need to be collected.
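To put a rough number on that savings, the Python sketch below converts the two effect sizes into approximate per-group sample sizes.  It uses the normal-approximation formula n = 2((z_{1-alpha/2} + z_{power}) / d)^2 for a two-tailed, two-group comparison, which is a simplification of a full noncentral t calculation; the target power of .80, the alpha of .05, and the function name are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-tailed, two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical score for the two-tailed test
    z_power = z.inv_cdf(power)            # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Effect sizes reported above: classical d = .74, robust d = .94.
print(n_per_group(0.74))   # 29
print(n_per_group(0.94))   # 18
```

Under these assumptions, planning on the robust effect size calls for roughly 18 rats per group rather than 29.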



Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421-434.

Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983). Understanding robust and exploratory data analysis. New York: Wiley.

Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80(391), 736-741.

Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. New York: Academic Press.