Power analysis involves the relationships among four variables involved in statistical inference: sample size (N), a significance criterion ($\alpha$), the population effect size (ES), and statistical power. For any statistical inference, each of these variables is a function of the other three (Cohen, 1988). For research planning, it is most useful to determine the N necessary to have a specified power for a given $\alpha$ and ES. The statistical power of a test is the long-term probability of rejecting $H_0$ (the null hypothesis) given a specified effect size, $\alpha$ criterion, and sample size N. When the effect size is not equal to zero, $ES \neq 0$, $H_0$ is false, so a failure to reject $H_0$ is a decision error on the part of the researcher. This is called a type II error, and its probability ($\beta$) is related mathematically to power. The probability of rejecting the null hypothesis when it is false (power) is one minus the probability of a type II error ($1 - \beta$). Figure 1 below is a graphical representation of the relationship between the null distribution, the alternate distribution, and the critical scores under the null distribution. The area underneath the $H_1$ distribution (the alternate distribution), past the critical score of the left tail of $H_0$ and past the critical score of the right tail of $H_0$, represents the power of the statistical test being performed (the shaded area).
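The shaded-area description of power can be sketched numerically. The following is a minimal Python illustration, not the article's own code: it assumes a two-sided z test, so the null distribution is N(0, 1) and the alternate distribution is N(delta, 1), where `delta` is a standardized shift. The function name `two_sided_power` is hypothetical.

```python
from statistics import NormalDist


def two_sided_power(delta: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z test (normal approximation).

    Assumption: the test statistic is N(0, 1) under the null hypothesis
    and N(delta, 1) under the alternate hypothesis.
    """
    z = NormalDist()  # standard normal
    z_crit = z.inv_cdf(1 - alpha / 2)  # right-tail critical score under the null
    # Area of the alternate distribution beyond the right critical score.
    upper = 1 - z.cdf(z_crit - delta)
    # Area of the alternate distribution beyond the left critical score.
    lower = z.cdf(-z_crit - delta)
    return lower + upper


# Power grows as the alternate distribution moves away from the null.
for shift in (0.0, 1.0, 2.0, 2.8):
    print(f"delta = {shift:.1f}  power = {two_sided_power(shift):.3f}")
```

When `delta` is zero the two distributions coincide and the returned value is simply the significance criterion, which is a quick sanity check on the sketch.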

Effect size is the degree to which $H_0$ (the null hypothesis) is false and is indexed by the discrepancy between the null hypothesis and the alternate hypothesis. Power analysis specifies a non-centrality parameter to quantify this discrepancy. The non-centrality parameter for the difference between means is:

$$\delta = \frac{\bar{x}_1 - \bar{x}_2}{\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}}$$

where the difference between estimated population means is scaled in $\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}$ units (known as the estimated standard error of the difference between means):

$$\hat{\sigma}_{\bar{x}_1 - \bar{x}_2} = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Here $n_1$ and $n_2$ are the group sample sizes and $s$ is the sample estimate of the population standard deviation. The denominator of the non-centrality parameter represents the estimated standard deviation of the sampling distribution of the difference between means under the null hypothesis. Usually, $\hat{\sigma}_{\bar{x}_1 - \bar{x}_2}$ is calculated on the basis of a formula that assumes normality in the population, since the standard deviation of the null sampling distribution cannot be calculated directly from the observed data without normality assumptions. For robust measures of location (e.g., M-estimates), the numerator would be the difference between two M-estimates, and the denominator would represent the standard deviation of the null hypothesis re-sampling distribution for the difference between M-estimates. For robust estimates (as well as the sample mean), the standard error can be estimated directly by calculating the standard deviation of the bootstrap estimates of the differences between the robust estimates of location (see http://www.unt.edu/benchmarks/archives/2001/september01/rss.htm). An alternative effect size for group differences has been advocated by Cohen (1988). Cohen's $d$ measure is based on the pooled estimated population standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}$$
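Cohen provides guidelines for interpreting the practical importance of an effect size based on $d$ when no prior research is available to anchor $d$ meaningfully. Cohen's rules of thumb for small, medium, and large effect sizes are based on a wide examination of the typical differences found in psychological data. A small effect size for $d$ is .20; a medium effect size for $d$ is .50; and a large effect size for $d$ is .80 (Cohen, 1992). Equating the non-centrality parameter $\delta$ and $d$ using algebra, the expression for $\delta$, with group sample sizes $n_1$ and $n_2$, is:

$$\delta = \frac{d}{\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

which reduces to $\delta = d\sqrt{n/2}$ when $n_1 = n_2 = n$.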
Note that $d$ is not a robust measure of effect size. The pooled sample standard deviation, which is used to estimate the population standard deviation ($\sigma$), will be inflated in the presence of outliers, thereby biasing the effect size measure. Furthermore, $d$ assumes a normal distribution in the calculation of power estimates.
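The sensitivity of the pooled standard deviation to outliers can be seen in a small numerical sketch. The Python below is only an illustration. The data are made up, the function names are hypothetical, and the conversion to the non-centrality parameter assumes the usual standard error of the pooled standard deviation times the square root of (1/n1 + 1/n2).

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(x, y):
    """Pooled sample standard deviation of two groups."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return sqrt(pooled_var)


def cohens_d(x, y):
    """Classical Cohen's d: mean difference in pooled-SD units."""
    return (mean(x) - mean(y)) / pooled_sd(x, y)


def noncentrality(d, n1, n2):
    """Non-centrality parameter implied by d (assumed standard-error formula)."""
    return d / sqrt(1 / n1 + 1 / n2)


# Hypothetical data, for illustration only.
control = [10.0, 12.0, 14.0, 16.0, 18.0]
treated = [6.0, 8.0, 10.0, 12.0, 14.0]
d_clean = cohens_d(control, treated)

# Replace one control score with a wild value: the mean difference grows,
# but the pooled SD grows faster, so d shrinks.
control_wild = [10.0, 12.0, 14.0, 16.0, 44.0]
d_wild = cohens_d(control_wild, treated)

print(f"d without outlier: {d_clean:.3f}")
print(f"d with outlier:    {d_wild:.3f}")
print(f"delta for d without outlier, n1 = n2 = 5: {noncentrality(d_clean, 5, 5):.3f}")
```

In this toy example a single wild score more than doubles the difference between the group means, yet the classical effect size drops, because the inflated pooled standard deviation dominates.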
Several problems exist with
the
where
and
is the median of the scores
in the control group. This effect size
estimator does not seem
Lax (1985) examined the performance of 17 different estimators of scale with heavy-tailed distributions. Lax examined the performance of these scale estimators, for samples of size 20, with the Normal distribution; a distribution with Cauchy tails, which has large kurtosis relative to the Normal (the Slash distribution); and a mixture distribution of N(0,1) and N(0,100). The mixture distribution had 19 points sampled from N(0,1) and 1 point sampled from N(0,100) (the One-Wild distribution). Lax combined the efficiencies of the estimators for the three distributions into what was defined as triefficiency. The biweight midvariance estimator (with c=9) performed best, with favorable efficiencies across all scenarios: Normal (86.7%), One-Wild (85.8%), and Slash (86.1%). Following Wilcox (1997), the biweight midvariance can be calculated as follows. Setting (with c=9, M=sample median, and MAD=median of $|X_i - M|$):

$$Y_i = \frac{X_i - M}{c \cdot \mathrm{MAD}}, \qquad a_i = \begin{cases} 1 & \text{if } |Y_i| < 1 \\ 0 & \text{otherwise} \end{cases}$$

the following is calculated:

$$\hat{\zeta}_{bimid} = \frac{\sqrt{n}\,\sqrt{\sum_i a_i (X_i - M)^2 (1 - Y_i^2)^4}}{\left|\sum_i a_i (1 - Y_i^2)(1 - 5Y_i^2)\right|}$$

The square of $\hat{\zeta}_{bimid}$ is the biweight midvariance.
where,
is the robust M-estimator for group 1 (using Huber
objective function, with k=1.28 for both groups),
is the robust M-estimator for group 2, and
is the square root of the
biweight midvariance for group 1
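The biweight midvariance discussed above can be sketched in a few lines of Python. This is an illustrative implementation, not the article's code: it assumes the usual definition in which deviations from the sample median are scaled by c times the unscaled median absolute deviation (MAD), and observations with scaled deviation of 1 or more receive zero weight. The function name is hypothetical.

```python
from statistics import median


def biweight_midvariance(x, c=9.0):
    """Biweight midvariance of a sample (sketch; c=9 as in the text).

    Assumption: MAD is the raw median of |x_i - M|, without the 1.4826
    normal-consistency constant.
    """
    n = len(x)
    m = median(x)
    mad = median([abs(xi - m) for xi in x])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight midvariance is undefined")
    numerator = 0.0
    denominator = 0.0
    for xi in x:
        y = (xi - m) / (c * mad)
        if abs(y) < 1:  # points far from the median get zero weight
            numerator += (xi - m) ** 2 * (1 - y * y) ** 4
            denominator += (1 - y * y) * (1 - 5 * y * y)
    return n * numerator / denominator ** 2


# Hypothetical sample with one wild observation, for illustration only.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(f"biweight midvariance:      {biweight_midvariance(sample):.3f}")
print(f"square root (scale units): {biweight_midvariance(sample) ** 0.5:.3f}")
```

Because the wild observation lies well beyond c times the MAD from the median, it receives zero weight, so making it even more extreme leaves the estimate unchanged. The classical variance has no such protection.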
Doksum & Sievers (1976) report data from a study designed to assess the effects of ozone on weight gain in rats. The experimental group consisted of 22 seventy-day-old rats kept in an ozone environment for 7 days (group y). The control group consisted of 23 rats of the same age (group x), which were kept in an ozone-free environment. Weight gain is measured in grams. The following R code produces quantile-quantile plots and non-parametric density plots of the two groups of data:
Resulting qqnorm plots and density plots from R code above:


The resulting M-estimators suggest that the classical estimate of the control group population mean is biased downward (23.24 - robust; 22.40 - classical) and that the classical estimate of the experimental group population mean is biased upward (9.69 - robust; 11.01 - classical). Additionally, the robust pooled scale estimate is smaller than the classical pooled scale estimate (14.36 - robust; 15.36 - classical). Using these estimates to calculate Cohen's d indicates that the classical effect size is biased downward. Cohen's d based on classical estimators suggests a medium to large effect size (.74), whereas Cohen's d based on robust estimators suggests a very large effect size (.94). In terms of sample size planning for future experiments, the larger robust Cohen's d implies that a smaller sample size would be needed to achieve the same power than the smaller effect size obtained from non-robust estimators of location and scale would imply - a considerable savings in the amount of data that needs to be collected.
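To make the sample size comparison concrete, the reported estimates can be plugged into a rough planning calculation. The Python sketch below is an assumption-laden illustration rather than the article's method: it uses the normal approximation for a two-sided, two-group comparison with equal group sizes, a significance criterion of .05, and a target power of .80 (neither of which is specified in the text). The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group test.

    Assumption: normal approximation with equal group sizes; an exact
    t-based calculation would give slightly larger values.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Effect sizes recomputed from the location and scale estimates reported above.
d_classical = (22.40 - 11.01) / 15.36
d_robust = (23.24 - 9.69) / 14.36

print(f"classical d = {d_classical:.2f}, n per group = {n_per_group(d_classical)}")
print(f"robust d    = {d_robust:.2f}, n per group = {n_per_group(d_robust)}")
```

Under these assumptions the sketch gives about 29 rats per group for the classical effect size and about 18 per group for the robust effect size.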

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Doksum, K. A., & Sievers, G. L. (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika, 63, 421-434.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983). Understanding robust and exploratory data analysis. New York: Wiley.
Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80(391), 736-741.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. New York: Academic Press.