The previous article in this series can be found in a previous issue of Benchmarks Online: Controlling the False Discovery Rate in Multiple Hypothesis Testing - Ed.

# Using Robust Mean and Robust Variance Estimates to Calculate Robust Effect Size

## Introduction - Calculating Power and Effect Size

Power analysis involves the relationships among four variables in statistical inference: sample size ($N$), a significance criterion ($\alpha$), the population effect size ($ES$), and statistical power. For any statistical inference, each of these variables is a function of the other three (Cohen, 1988). For research planning, it is most useful to determine the $N$ necessary to have a specified power for a given $\alpha$ and $ES$. The statistical power of a test is the long-term probability of rejecting $H_0$ (the null hypothesis) given a specified $\alpha$ criterion and sample size $N$. When the effect size is not equal to zero, $H_0$ is false, so a failure to reject it is a decision error on the part of the researcher. This is called a type II error, and its probability ($\beta$) is related mathematically to power: the probability of rejecting the null when it should be rejected (power) is one minus the type II error rate ($1 - \beta$). Figure 1 below is a graphical representation of the relationship between the null distribution, the alternate distribution, and the critical scores under the null distribution. The area underneath the $H_1$ distribution (the alternate distribution) that lies past the critical score of the left tail of $H_0$ or past the critical score of the right tail of $H_0$ represents the power of the statistical test being performed (the shaded area).
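The shaded-area description of power can be sketched numerically in R. The function below is an illustrative assumption rather than part of the original article: `t_test_power` is a hypothetical name, the test is taken to be a two-sided, two-sample t test with equal group sizes, and the noncentrality parameter is taken to be $d\sqrt{n/2}$ for a standardized effect size $d$ and per-group sample size $n$.

```r
# Power of a two-sided, two-sample t test from the noncentral t distribution.
# d: standardized effect size; n: per-group sample size (equal groups assumed).
t_test_power <- function(d, n, alpha = 0.05) {
  df    <- 2 * n - 2
  ncp   <- d * sqrt(n / 2)             # noncentrality parameter under H1
  tcrit <- qt(1 - alpha / 2, df)       # critical score under H0
  # Area of the H1 distribution beyond either critical score of H0
  pt(-tcrit, df, ncp = ncp) + pt(tcrit, df, ncp = ncp, lower.tail = FALSE)
}

power <- t_test_power(d = 0.5, n = 64)
beta  <- 1 - power                     # type II error rate
print(c(power = power, beta = beta))
```

Setting `d = 0` makes the two distributions coincide, so the function then returns approximately `alpha`, the type I error rate.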

Effect size is the degree to which $H_0$ (the null hypothesis) is false, and it is indexed by the discrepancy between the null hypothesis and the alternate hypothesis. Power analysis specifies a noncentrality parameter to quantify this discrepancy. The noncentrality parameter for the difference between means is:

$$\delta = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}}$$

where the difference between estimated population means is scaled in $\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}$ units (known as the estimated standard error of the difference between means):

$$\hat{\sigma}_{\bar{x}_1 - \bar{x}_2} = \sqrt{\hat{\sigma}_{\bar{x}_1}^2 + \hat{\sigma}_{\bar{x}_2}^2}$$

where,

$$\hat{\sigma}_{\bar{x}_j} = \frac{s_j}{\sqrt{n_j}}, \quad j = 1, 2$$

and $s_j$ is the sample estimate of the population standard deviation. The denominator of the noncentrality parameter represents the estimated standard deviation of the sampling distribution, under the null hypothesis, of the difference between means. Usually, $\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}$ is calculated from a formula that assumes normality in the population, since the standard deviation of the null sampling distribution cannot be calculated directly from the observed data without normality assumptions. For robust measures of location (e.g., M-estimates), the numerator would be the difference between two M-estimates, and the denominator would represent the standard deviation of the null hypothesis re-sampling distribution for the difference between M-estimates. For robust estimates (as well as the sample mean), the standard error can be estimated directly by calculating the standard deviation of the bootstrap estimates of the differences between the robust estimates of location (see the September 2001 issue of Benchmarks). An alternative effect size for group differences has been advocated by Cohen (1988). Cohen's $d$ measure is based on the pooled estimated population standard deviation:

$$d = \frac{\hat{\mu}_1 - \hat{\mu}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
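The bootstrap standard error described above can be sketched in R as follows. This is a minimal illustration under stated assumptions: `boot_se_diff` is a hypothetical helper, the median stands in for an M-estimator of location, the number of resamples `B` is arbitrary, and the data are simulated placeholders rather than values from any study.

```r
# Bootstrap standard error of the difference between two location estimates.
# `estimator` may be any location function (the median is used here as a
# stand-in for an M-estimator); `B` is the number of bootstrap resamples.
boot_se_diff <- function(x, y, estimator = median, B = 2000) {
  diffs <- replicate(B, {
    xb <- sample(x, replace = TRUE)    # resample each group independently
    yb <- sample(y, replace = TRUE)
    estimator(xb) - estimator(yb)
  })
  sd(diffs)                            # SD of the bootstrap differences
}

set.seed(1)                            # simulated placeholder data
x <- rnorm(30, mean = 22, sd = 10)
y <- rnorm(30, mean = 11, sd = 10)

se    <- boot_se_diff(x, y)
delta <- (median(x) - median(y)) / se  # robust analogue of the noncentrality ratio
print(c(se = se, delta = delta))
```

Passing `estimator = mean` gives a bootstrap standard error that should sit close to the usual formula-based standard error of the difference between means, which is a convenient sanity check.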

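Cohen provides guidelines for interpreting the practical importance of an effect size based on $d$ when no prior research is available to anchor $d$ meaningfully. Cohen's rules of thumb for small, medium, and large effect sizes are based on a wide examination of the typical differences found in psychological data. A small effect size for $d$ is .20; a medium effect size for $d$ is .50; and a large effect size for $d$ is .80 (Cohen, 1992). Equating $\delta$ and $d$ using algebra (with the pooled standard deviation $s_p$ standing in for the standard deviation of each group, so that the standard error of the difference is $s_p\sqrt{1/n_1 + 1/n_2}$), the expression for $d$ is:

$$d = \delta\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$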

Note that $d$ is not a robust measure of effect size. The pooled sample standard deviation, which is used to estimate the population standard deviation ($\sigma$), will be inflated in the presence of outliers, thereby biasing the effect size measure. Furthermore, power estimates based on $d$ assume a normal distribution.
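The sensitivity of Cohen's d to a single outlier can be illustrated with a short R sketch. The helper names `pooled_sd` and `cohens_d` are hypothetical, and the scores are made up for illustration only.

```r
# Pooled standard deviation and Cohen's d for two independent groups.
pooled_sd <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
}
cohens_d <- function(x, y) (mean(x) - mean(y)) / pooled_sd(x, y)

# Made-up illustrative scores (not data from any study)
x <- c(20, 22, 24, 26, 28)
y <- c(15, 17, 19, 21, 23)
d_clean <- cohens_d(x, y)

# The noncentrality parameter recovered from d: delta = d / sqrt(1/n1 + 1/n2)
delta_clean <- d_clean / sqrt(1 / length(x) + 1 / length(y))

# Replace one score in the first group with a wild low value
x_out <- c(-40, 22, 24, 26, 28)
d_out <- cohens_d(x_out, y)

print(c(d_clean = d_clean, delta_clean = delta_clean, d_with_outlier = d_out))
print(c(sd_clean = pooled_sd(x, y), sd_with_outlier = pooled_sd(x_out, y)))
```

In this toy example the wild score both shifts the mean difference and inflates the pooled standard deviation, so the classical effect size shrinks sharply in magnitude. The value of `delta_clean` matches the equal-variance t statistic returned by `t.test(x, y, var.equal = TRUE)`.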

## Measures of Robust Effect Size

Several problems exist with the $d$ measure of effect size. The assumption of equal variances in the population is often dealt with by substituting a pooled variance estimate for $\sigma^2$. With data that appear to have unequal variances, questions arise about how to interpret $d$. Another criticism of $d$ is that both its location and scale components (the mean difference and the sample standard deviation) are non-resistant measures. One strategy would be to replace the means and standard deviation with more resistant measures of location and scale. For example, one variation might be a difference of medians divided by the MAD (median absolute deviation):

$$d_{MAD} = \frac{Md_C - Md_E}{MAD_C}$$

where $MAD_C = \mathrm{median}\,|x_i - Md_C|$, $Md_C$ is the median of the scores in the control group, and $Md_E$ is the median of the scores in the experimental group. This effect size estimator does not seem like a good candidate, since the median and the MAD are both known to be inefficient for Normal distributions compared with the mean and standard deviation.
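A median/MAD effect size of this kind can be sketched in R as follows. The function name `d_mad` and the scores are hypothetical, and the control-minus-experimental ordering of the numerator is an assumption. Note that R's `mad()` multiplies by 1.4826 by default, so `constant = 1` is passed to obtain the raw median absolute deviation.

```r
# Median-difference effect size scaled by the control group's raw MAD.
# constant = 1 gives the unscaled median absolute deviation.
d_mad <- function(control, experimental) {
  (median(control) - median(experimental)) / mad(control, constant = 1)
}

control      <- c(20, 22, 24, 26, 28)  # made-up illustrative scores
experimental <- c(15, 17, 19, 21, 23)
es_clean <- d_mad(control, experimental)

# The same comparison with one wild control score
es_outlier <- d_mad(c(-40, 22, 24, 26, 28), experimental)

print(c(clean = es_clean, with_outlier = es_outlier))
```

In this toy example the wild score changes neither the control median nor the control MAD, so the estimate is unaffected. That resistance is bought at the cost of the efficiency loss under Normal data noted above.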

## Robust Effect Size based on M-estimators

Lax (1985) examined the performance of 17 different estimators of scale with heavy-tailed distributions. The estimators were compared, for samples of size 20, under three distributions: the Normal distribution; a distribution with Cauchy tails (large kurtosis relative to the Normal; the Slash distribution); and a mixture of N(0,1) and N(0,100) in which 19 points were sampled from N(0,1) and 1 point was sampled from N(0,100) (the One-Wild distribution). Lax combined the efficiencies (see the July 2001 issue of Benchmarks) of the estimators for the three distributions into what was defined as triefficiency. The biweight midvariance estimator (with c = 9) performed best, with favorable efficiencies across all scenarios: Normal (86.7%), One-Wild (85.8%), and Slash (86.1%). Following Wilcox (1997), the biweight midvariance can be calculated as follows. Setting (with c = 9, $M$ = sample median, and $MAD$ = median absolute deviation):

$$Y_i = \frac{X_i - M}{9 \times MAD}$$

and

$$a_i = \begin{cases} 1, & |Y_i| < 1 \\ 0, & |Y_i| \geq 1 \end{cases}$$

the following is calculated:

$$\hat{\zeta}_{bimid} = \frac{\sqrt{n}\,\sqrt{\sum a_i (X_i - M)^2 (1 - Y_i^2)^4}}{\left|\sum a_i (1 - Y_i^2)(1 - 5Y_i^2)\right|}$$

The square of $\hat{\zeta}_{bimid}$ is called the biweight midvariance. It appears to have a breakdown point of approximately .5 (Hoaglin, Mosteller, & Tukey, 1983). Based on this robust variance, the following robust effect size can be calculated:

$$d_R = \frac{\hat{\mu}_{m1} - \hat{\mu}_{m2}}{\hat{\zeta}_{bimid1}}$$

where $\hat{\mu}_{m1}$ is the robust M-estimator for group 1 (using the Huber objective function, with k = 1.28 for both groups), $\hat{\mu}_{m2}$ is the robust M-estimator for group 2, and $\hat{\zeta}_{bimid1}$ is the square root of the biweight midvariance for group 1 (the control group). The robust effect size does not assume equal variances among groups, since only the robust variance for the control group is used (alternatively, a pooled estimate of the control and experimental group biweight midvariances could be used, assuming equal variances).
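The robust effect size described above can be sketched in base R. Everything named here is an illustrative assumption rather than the article's own code: `huber_m`, `biweight_midvar`, and `robust_d` are hypothetical helpers, the Huber M-estimate is computed by iteratively reweighted averaging with the scale fixed at the rescaled MAD, and the data are simulated placeholders with one wild control observation.

```r
# Huber M-estimate of location (scale fixed at the rescaled MAD),
# computed by iteratively reweighted averaging.
huber_m <- function(x, k = 1.28, tol = 1e-8, max_iter = 100) {
  mu <- median(x)
  s  <- mad(x)                         # MAD multiplied by 1.4826
  if (s == 0) return(mu)
  for (i in seq_len(max_iter)) {
    u      <- (x - mu) / s
    w      <- pmin(1, k / abs(u))      # Huber weights: 1 inside [-k, k]
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol * s) break
    mu <- mu_new
  }
  mu_new
}

# Biweight midvariance with tuning constant 9, using the raw (unscaled) MAD.
biweight_midvar <- function(x, c_tune = 9) {
  m   <- median(x)
  y   <- (x - m) / (c_tune * mad(x, constant = 1))
  a   <- as.numeric(abs(y) < 1)        # indicator a_i
  num <- length(x) * sum(a * (x - m)^2 * (1 - y^2)^4)
  den <- sum(a * (1 - y^2) * (1 - 5 * y^2))^2
  num / den
}

# Robust effect size: difference in M-estimates divided by the control
# group's root biweight midvariance (group 1 = control).
robust_d <- function(control, experimental, k = 1.28) {
  (huber_m(control, k) - huber_m(experimental, k)) /
    sqrt(biweight_midvar(control))
}

set.seed(42)                           # simulated placeholder data
control      <- c(rnorm(22, mean = 22, sd = 10), -60)  # one wild score
experimental <- rnorm(22, mean = 11, sd = 10)

print(c(classical = (mean(control) - mean(experimental)) / sd(control),
        robust    = robust_d(control, experimental)))
```

Pooling the two groups' biweight midvariances in the denominator, as mentioned above, would be a small change to `robust_d` for the equal-variance case.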

## An Example Using GNU-S ("R")

Doksum & Sievers (1976) report data from a study designed to assess the effects of ozone on weight gain in rats. The experimental group consisted of 22 seventy-day-old rats kept in an ozone environment for 7 days (group y). The control group consisted of 23 rats of the same age (group x), which were kept in an ozone-free environment. Weight gain is measured in grams. The following R code produces quantile-quantile plots and non-parametric density plots of the two groups of data:

Resulting qqnorm plots and density plots from R code above:

Both groups appear to have "heavy" right and left tails. It appears that the classical mean difference between the groups will be underestimated (smaller). The R code below estimates the classical and robust means, the classical pooled standard deviation, and the pooled robust root biweight midvariance.

## Results and Conclusion

The resulting M-estimators suggest that the classical estimate of the control group population mean is biased downward (robust: 23.24; classical: 22.40) and that the classical estimate of the experimental group population mean is biased upward (robust: 9.69; classical: 11.01). Additionally, the robust pooled scale estimate is smaller than the classical pooled scale estimate (robust: 14.36; classical: 15.36). Using these estimates to calculate Cohen's d indicates that the classical effect size is biased downward. Cohen's d based on classical estimators suggests a medium to large effect size (.74), whereas Cohen's d based on robust estimators suggests a large effect size (.94). In terms of sample size planning for future experiments, the robust Cohen's d suggests that a considerably smaller sample would be needed to achieve the same power than would be implied by the smaller effect size obtained from non-robust estimators of location and scale - a considerable savings in terms of the data that need to be collected.
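The sample size implication can be sketched with `power.t.test` from R's `stats` package. The 80% power target and the two-sided .05 significance level are assumptions made here for illustration, and `power.t.test` uses normal-theory calculations, so the result is only a rough planning guide.

```r
# Per-group n for 80% power (two-sided alpha = .05, two-sample t test),
# treating each effect size as a standardized difference (sd = 1).
n_classical <- power.t.test(delta = 0.74, sd = 1, power = 0.80)$n
n_robust    <- power.t.test(delta = 0.94, sd = 1, power = 0.80)$n

print(ceiling(c(classical_d = n_classical, robust_d = n_robust)))
```

Under these assumptions, the larger robust effect size calls for noticeably fewer rats per group than the classical estimate does.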

## References

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421-434.

Hoaglin, D. C., Mosteller, F., & Tukey, J.W. (1983). Understanding robust and exploratory data analysis. New York: Wiley.

Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80(391), 736-741.

Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. New York: Academic Press.