This month we demonstrate the Percentile Bootstrap using the GNU S language, "R". R is a statistical programming environment that is a clone of the S and S-Plus language developed at Lucent Technologies. In the following document we illustrate the use of a GNU Web interface to the R engine on the "Terra" server, http://terra.acs.unt.edu/cgi-bin/R/Rprog. This GNU Web interface is a derivative of the "Rcgi" Perl scripts available for download from the CRAN website, http://www.cran.r-project.org (the main "R" website). Scripts can be submitted interactively, edited, and be re-submitted with changed parameters by selecting the hypertext link buttons that appear below the figures. For example, clicking the "Run Program" button below samples 100 random numbers from a normal distribution, then generates a histogram. To view any text output, scroll to the bottom of the browser window. To view the histogram, select the "Display Graphic" link. The script can be edited and resubmitted by changing the script in the form window and then selecting "Run the R Program". Selecting the browser "back page" button will return the reader to this document.
Bootstrapping is an approach to statistical inference that makes few
assumptions about the underlying probability distribution that describes the
data (Efron, 1993). This approach assumes that the empirical cumulative
distribution function is a
reasonable estimate of the unknown, population cumulative distribution function
(or put another way, the empirical density function approximates the population
density function). Using the data
as an approximation to the population density function, data is re-sampled with
replacement from the observed sample to create an empirical sampling
distribution for the test statistic under consideration.
If we knew the population density function (e.g.
normal probability distribution function), we could just sample from from this
probability density function, as in a Monte Carlo simulation, and generate the
sampling distribution to within any degree of precision (or, we could make a
normality assumption regarding the population). 
Lacking this information then, we assume that the observed data is a good
estimate of the population density function.
The empirical cumulative distribution function of the observed data is a
step function, taking jumps of height 1/n at each of the sample points.
As n increases, the function becomes smooth, and looks more like the
population cumulative distribution function. As n increase to infinity, the
observed cumulative distribution function converges to the population cumulative
distribution function. Now, using the observed
data sample as a proxy population, we resample with replacement to produce
another data set whose length is equal to the length of the original observed
data sample. This re-sampling
scheme is known as bootstrap re-sampling, and the re-sampled data sets are known
as bootstrap samples. In this re-sampling
scheme, bootstrap samples may contain duplicate copies of the original data
points in the new sample. For
example, an observed data set (our proxy population) that consists of the data
points: 4,3,5,6,7,2,2,1,7,9 (mean=4.6), could produce the following data with re-sampling (with
replacement): 2,4,3,6,1,7,9,5,5,1 (mean=4.3). For
each bootstrap sample generated, we would calculate a statistic (e.g. location
parameters: mean, median, M-estimator, etc) that is of concern.
The empirical distribution of these calculated statistics is referred to
as an empirical sampling distribution. This
empirical sampling distribution can be used as an approximation to the
theoretical population sampling distribution. To calculate scores that correspond the 2.5th and
97.5th confidence intervals, we simply find the scores that
correspond to the upper and lower
and (1-
) percentiles of the distribution. For example, for 1000 bootstrap samples of the sample mean,
we can calculate confidence intervals by finding the upper and lower 2.5th
and 97.5th percentiles using the following approach:
round((.05/2)x(1000))=25 for lower percentile; and
round((1-(.05/2))x(1000))=975. Moreover,
the standard deviation of the bootstrapped test statistics is an approximation
of the standard error of the test statistic under consideration. This approach to calculating bootstrap confidence intervals is
called the Percentile Bootstrap method (Efron, 1993). The bootstrap distribution, as outlined above is centered
around the estimated population value of the test statistic.
Other variations of bootstrap re-sampling center the bootstrap
distribution around zero, depending on how the null hypothesis is being tested.
Doksum & Sievers (1976) report data on a study designed to assess the effects of ozone on weight gain in rats. The experimental group consisted of 22 seventy-day old rats kept in an ozone environment for 7 days. The experimental group consisted of 23 rats of the same age, and were kept in an ozone-free environment. Weight gain is measured in grams. The following GNU S code (R code) assigns the data to the vectors x and y and then plots the quantile-quantile plots for the normal distribution for each group. Notice that each group has heavy tails, and do not conform to a normal distribution.
Step 1: Generate the bootstrap alternative distribution. A) Re-sample with replacement from vector x with replacement to generate a bootstrap sample, x1, with length of original vector x. B) Re-sample with replacement from vector y with replacement to generate a bootstrap sample, y1, with length of original vector y. C) Calculate measures of location for both bootstrap samples x1 and y1. D) Subtract the two measures of location. This is one bootstrap difference, and represents the difference between measures of location under the empirical alternate distribution. This empirical distribution is centered on the population difference under the alternate hypothesis. Step 2: Calculate the critical scores that correspond to the 2.5th and 97.5th critical alpha regions under the empirical alternate distribution. The critical scores are the scores that correspond to the 2.5th and 97.5th percentiles of the empirical alternate distribution. We can calculate the percentiles using the following approach: round((.05/2)x(#bootstrap samples)) for lower percentile; and round((1-(.05/2))x(#bootstrap samples)). Next, locate the scores that correspond to those percentiles. The following GNU S code implements this algorithm:
The 95% confidence interval that is reported represents our best estimate for the lower and upper bounds of
the difference in M-estimators for the two groups. This interval captures the true population difference in
M-estimators 95% of the time assuming that the null hypothesis is true (no difference between the groups). If
this interval contains zero, we fail to reject the null hypothesis of no difference in the population. If the interval
doesn't contain zero, we take this as a rejection of no difference in the population. Additionally, the narrower
the confidence interval the more precise our estimate is (less error in our estimation).
The Bootstrap re-sampling process is a very computer intensive procedure. Additionally, the iteratively re-weighted least squares algorithm that the M-estimator procedure uses is also computer intensive. Run times for the GNU S code may take a few minutes to run. Allowing the program to finishes before resubmitting will keep the GNU S server from slowing down. In general, having larger numbers of bootstrap samples increases the accuracy of the confidence intervals. However, the cost of increasing the number of bootstrap samples can be substantial.
Doksum,
K.A. & Sievers, G.L. (1976). Plotting with confidence: graphical
comparisons
of two populations. Biometrika 63, 421-434.
Efron, B. (1993). An Introduction to the Bootstrap. New York: Chapman and Hall.