#
#
########### Cross Validation ###########
#
# <<< ### >>>
# The next three paragraphs are an amalgamation of content from Wikipedia:
# http://en.wikipedia.org/wiki/Cross-validation_(statistics)
# Andrew Moore:
# http://www.autonlab.org/tutorials/overfit.html
# and Jeff Schneider
# http://www.cs.cmu.edu/~schneide/tut5/node42.html
# Full references & resources listed at the bottom of the script.
# <<< ### >>>
# Cross validation is a model evaluation method that is better than simply looking
# at the residuals. Cross validation addresses the 'over-fitting' problem.
# Over-fitting refers to the fact that the same data was used to fit/specify the
# model, AND assess that model's fit.
# The problem with residual evaluations is that they do not give an indication of how
# well the model will do when it is asked to make new predictions for data it has
# not already seen. One way to overcome this problem is to not use the entire data
# set when building a model. Some of the data is removed before modeling begins.
# When the model has been built, the data that was removed can be used to test the
# performance of the model on ``new'' data. This is the basic idea for a
# whole class of model evaluation methods called cross validation.
# The holdout method is the simplest kind of cross validation. The data set is
# separated into two sets, called the training set and the testing set. The model
# is fit using the training set only. Then the model is used to predict the outcome
# values for the data in the testing set (it has never seen these output values
# before). The errors it makes are accumulated to give the mean absolute test set
# error, which is used to evaluate the model. The advantage of this method is that
# it is usually preferable to the residual method and takes no longer to compute.
# However, its evaluation can have a high variance. The evaluation may depend
# heavily on which data points end up in the training set and which end up in the
# test set, and thus the evaluation may be significantly different depending on
# how the division is made.
# K-fold cross validation is one way to improve over the holdout method. The data
# set is divided into k subsets, and the holdout method is repeated k times. Each
# time, one of the k subsets is used as the test set and the other k-1 subsets are
# put together to form a training set. Then the average error across all k trials
# is computed. The advantage of this method is that it matters less how the data
# gets divided. Every data point gets to be in a test set exactly once, and gets
# to be in a training set k-1 times. The variance of the resulting estimate is
# reduced as k is increased. The disadvantage of this method is that the training
# algorithm has to be rerun from scratch k times, which means it takes k times as
# much computation to make an evaluation. A variant of this method is to randomly
# divide the data into a test and training set k different times. The advantage of
# doing this is that you can independently choose how large each test set is and
# how many trials you average over.
## CREATE THE POPULATION.
# Using the 'MASS' library to generate some multivariate normal variables; each
# with a mean of 10.
# This is the population of *true* scores for these variables.
N <- 1000000
library(MASS)
Sigma <- matrix(c(1.0, .80, .50, .20, .00, .00, .00, .00,
.80, 1.0, .00, .00, .15, .15, .05, .10,
.50, .00, 1.0, .00, .10, .05, .10, .15,
.20, .00, .00, 1.0, .10, .15, .15, .05,
.00, .15, .10, .10, 1.0, .15, .25, .25,
.00, .15, .05, .15, .15, 1.0, .35, .55,
.00, .05, .10, .15, .25, .35, 1.0, .85,
.00, .10, .15, .05, .25, .55, .85, 1.0), ncol = 8)
x <- mvrnorm(N, Sigma, mu=c(10,10,10,10,10,10,10,10), empirical = TRUE)
# Create a population identification variable (to identify cases among the population).
p.id <- seq(1:N)
# Create the population data frame and re-name the variable columns.
population.df <- data.frame(p.id, x)
names(population.df)[2] <- "y"
names(population.df)[3] <- "x1"
names(population.df)[4] <- "x2"
names(population.df)[5] <- "x3"
names(population.df)[6] <- "x4"
names(population.df)[7] <- "x5"
names(population.df)[8] <- "x6"
names(population.df)[9] <- "x7"
head(population.df)
summary(population.df)
# Cleaning up the workspace.
ls() # Lists all the objects in the workspace.
rm(N, Sigma, p.id, x) # remove the objects no longer needed (rm = remove).
ls() # Lists all the objects in the workspace.
# Detach the 'MASS' package, as it is no longer needed.
detach("package:MASS")
# Notice, only x1, x2, x3 are related to y.
round(cor(population.df[,2:9]), 2)
# The actual or correctly specified population linear model of 'y' only includes x1, x2, x3.
pop.lm <- lm(y ~ x1 + x2 + x3, population.df)
summary(pop.lm)
# Model specification error and multicollinearity are introduced by including
# the other variables; which add nothing to y (we know because the correlation
# between y and x4, x5, x6, and x7 is specified at exactly 0.00 above when we created
# the variance/covariance matrix on which the data was based).
# The model below is named 'bad.model' because of the mis-specification; which
# also introduces multicollinearity because of the intercorrelations of X4, X5,
# x6, & x7; and their relationships with the 'true' predictors (x1, x2, & x3).
bad.model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, population.df)
summary(bad.model)
rm(bad.model)
# Keep in mind, the significance values (p-values) are directly related to the number of
# cases (i.e. more cases = lower p-values). Here, they are extremely deflated with
# the size of the data (population N = 1000000). Also, R-squared and even adjusted
# R-squared are biased (over estimating model fit) due to "overfitting" which refers
# to specifying the model with the same data with which you measure model fit (prediction
# errors).
## CREATE THE SAMPLE.
# Drawing a (first) sample from the population which was created above.
n <- 100
s.id <- seq(1:n)
samp <- sample(population.df[,1], n, replace = FALSE)
sample1.df <- data.frame(s.id,population.df[samp,])
rm(n, s.id, samp)
summary(sample1.df)
nrow(sample1.df)
# Draw a second sample from the population (for later use toward the bottom of the script).
n <- 100
s.id <- seq(1:n)
samp <- sample(population.df[,1], n, replace = FALSE)
sample2.df <- data.frame(s.id,population.df[samp,])
rm(n, s.id, samp)
summary(sample2.df)
nrow(sample2.df)
# Wrote out the 'population.df' data frame as a comma separated values (.csv) data file.
# No need to run here; just showing the script/command as an example.
write.table(population.df,
"C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_population.df.txt",
sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA")
# Wrote out the 'sample1.df' data frame as a comma separated values (.csv) data file.
# No need to run here; just showing the script/command as an example.
write.table(sample1.df,
"C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_sample1.df.txt",
sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA")
# Wrote out the 'sample2.df' data frame as a comma separated values (.csv) data file.
# No need to run here; just showing the script/command as an example.
write.table(sample2.df,
"C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_sample2.df.txt",
sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA")
# Saved out the workspace 'CrossValidation_003.Rdata' to ensure all three data files
# would be accessible for useRs interested this script.
# No need to run here; just showing the script/command as an example.
save.image("C:\\Users\\jds0282\\Desktop\\Workstuff\\Jon_R\\CrossValidation\\CrossValidation_003.RData")
# <<< All four files were/are posted on the web; to be accessed below. >>> #
##################################################################################################################
#
### Close the R session or run the three lines directly below and begin again below; importing the specific
### data I made above to insure repeated or consistent results (rather than creating new 'simulated' data
### each time the script is run, which would result in different values for each object/analysis).
#
##################################################################################################################
ls()
rm(pop.lm, population.df, sample1.df, sample2.df)
ls()
#
#
######## Cross Validation ########
#
# The data used in this script is simulated and contains one continuous (interval/ratio)
# outcome variable and 7 other continuous (interval/ratio) variables. All 8 variables
# have an approximate mean of 10 and are (multivariate) normally distributed.
# Initially, a population was created (N = 1000000) and from it samples could be
# randomly drawn by randomly selecting population identification numbers (p.id).
# The first sample (n = 100) was drawn (read in from the web so that results are
# consistent for all users of this script). The sample contains an additional
# column for identifying each case of the sample (s.id).
# The general idea below is that we are working from a perspective of a researcher
# or research team with a theoretically motivated study in which we 'believe'
# 7 measured variables (x1 - x7) predict the outcome (y). Our goal is to use
# cross validation to estimate the prediction error of our model and if warranted,
# identify a linear model (OLS regression) which offers the lowest prediction error
# estimate(s).
# Of course, the 'real' goal of this script is to show various approaches to cross
# validation (and *true* validation) of a model; as well as showing some of the
# things which can (and often do) compromise the validity of model fit estimates
# and prediction error estimates. Keep in mind throughout; we use a simple (OLS)
# linear regression as an example here, but the ideas conveyed here apply to
# other types of modeling (e.g. GLM, HLM, SEM, etc.).
# Read in the sample data file from the web naming it "sample1.df" as below; and
# get the ubiquitous 'head' and 'summary' of the data to see what it looks like.
sample1.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_sample1.df.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
head(sample1.df)
summary(sample1.df)
# Fit all the original sample 1 data to the linear model and calculate the Average
# Residual Squared Error for all cases in the sample (baseline [and biased] prediction
# error estimate: RSE.n).
samp1.lm <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, sample1.df)
summary(samp1.lm)
RSE.n <- (sum((sample1.df$y - samp1.lm$fitted.values)^2))/nrow(sample1.df)
RSE.n
# These initial estimates (generated above) indicate superb model fit (R-squared &
# Adj. R-squared) and an extremely small prediction error estimate (RSE.n); however,
# they are all compromised by overfitting.
####### Split-half validation (also called 'split-sample' or hold-out validation).
# Divide the sample data into two halves, the training set and the testing set.
nrow(sample1.df)
training.set <- sample1.df[1:50,]
head(training.set)
testing.set <- sample1.df[51:100,]
head(testing.set)
# Specify/fit the model with the training set.
model.1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = training.set)
summary(model.1)
# Apply the specified model (coefficients) to predictor values from the testing set
# to predict (model based) outcome values of the testing set.
attach(testing.set)
model.1$coefficients
y.hat <- model.1$coefficients[1] + model.1$coefficients[2]*x1 + model.1$coefficients[3]*x2 +
model.1$coefficients[4]*x3 + model.1$coefficients[5]*x4 + model.1$coefficients[6]*x5 +
model.1$coefficients[7]*x6 + model.1$coefficients[8]*x7
# Compare these predicted values to the actual values of the outcome in the testing set.
t.test(y.hat, y, paired = TRUE)
cor.test(y.hat, y)
library(car)
scatterplot(y.hat, y)
# The results above look great; no difference between the two (paired) means,
# the correlation is virtually +1....perhaps a bit too good to be true?
# Calculating the residuals (errors).
errors <- y - y.hat
summary(errors)
# Calculating the average error.
avg.err <- sum(y - y.hat)/length(y)
avg.err
# OR:
mean(y - y.hat)
mean(errors)
# Squared Prediction Errors.
pe2 <- (y - y.hat)^2
summary(pe2)
# Average Cross validation sums of squares.
cv.ss <- sum((y - y.hat)^2)/length(y)
cv.ss
# Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n). Notice, they
# are virtually the same, with RSE.n slightly biased (estimating less prediction error).
RSE.n
cv.ss
# Detach the 'testing.set' data.
detach(testing.set)
# Cleaning up the workspace.
ls()
rm(avg.err, cv.ss, errors, model.1, pe2, testing.set, training.set, y.hat)
ls()
###### Leave-One-Out cross-validation (LOOCV).
# The LOOCV strategy in its most basic form, simply takes one observation out of
# the data and sets it aside as the 'testing set' like what was done above. Then
# the model is applied to the training set of n - 1 cases (i.e. the data minus
# the single testing set case) and the resulting coefficients are applied to the
# testing set case to produce a predicted value which in turn is then compared to
# the actual value (of y) of that single case. Below, we avoid the most basic form
# of LOOCV and instead iteratively conduct the procedure across all cases (i.e. each
# case is 'left out' once).
# Setting up the initial conditions and creating an empty vector to store the values
# of y.hat for each iteration.
n <- 0
y.hat <- as.vector(0)
# Running the LOOCV iteratively ('for loop').
for (i in 1 : nrow(sample1.df)){
n <- 1 + n
model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = sample1.df[-n,])
y.hat[i] <- model$coefficients[1] + model$coefficients[2]*sample1.df[n,4] +
model$coefficients[3]*sample1.df[n,5] + model$coefficients[4]*sample1.df[n,6] +
model$coefficients[5]*sample1.df[n,7] + model$coefficients[6]*sample1.df[n,8] +
model$coefficients[7]*sample1.df[n,9] + model$coefficients[8]*sample1.df[n,10]
}
# Calculating the errors and taking a look at them.
errors <- sample1.df$y - y.hat
summary(errors)
hist(errors)
cor(y.hat, sample1.df$y)
scatterplot(y.hat, sample1.df$y)
# Average Cross validation sums of squares.
cv.ss <- sum((sample1.df$y - y.hat)^2)/1000
cv.ss
# Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n); again, both
# are very close to zero; with the cv.ss indicating an even smaller estimate of
# prediction error.
RSE.n
cv.ss
# Cleaning up the workspace.
ls()
rm(cv.ss, errors, i, model, n, y.hat)
ls()
###### Bootstrapped LOOCV.
# Here, we create 10 bootstrapped samples (sample WITH replacement from the original
# sample) and apply the LOOCV from above to each of the 10 bootstrapped samples;
# each time saving the y.hat values and then calculating the errors, cv.ss, and
# correlation between y.hat and the actual values of y in each bootstrapped sample.
# Creating empty objects for storing the output values.
errors <- as.data.frame(matrix(0, ncol = 10, nrow = 50))
cv.ss <- as.vector(0)
corr.yhat <- as.vector(0)
# Running the bootstrapped LOOCV; which takes a minute or so.
for (i in 1 : 10){
boot.id <- sample(sample1.df$s.id, 50, replace = TRUE)
boot.sample <- data.frame(sample1.df[boot.id,])
n <- 0
y.hat <- as.vector(0)
for (k in 1:nrow(boot.sample)){
n <- 1 + n
model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = boot.sample[-n,])
y.hat[k] <- model$coefficients[1] + model$coefficients[2]*boot.sample[n,4] +
model$coefficients[3]*boot.sample[n,5] + model$coefficients[4]*boot.sample[n,6] +
model$coefficients[5]*boot.sample[n,7] + model$coefficients[6]*boot.sample[n,8] +
model$coefficients[7]*boot.sample[n,9] + model$coefficients[8]*boot.sample[n,10]
}
errors[,i] <- boot.sample$y - y.hat
cv.ss[i] <- (sum(boot.sample$y - y.hat)^2)/50
corr.yhat[i] <- cor(boot.sample$y, y.hat)
}
# Summary of the errors for each of the 10 iterations (loops); each column corresponds
# to each loop/bootstrapped sample.
summary(errors)
# Average Cross validation sums of squares for each of the 10 iterations (loops).
cv.ss
# Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n) from each iteration
# and the bootstrapped average of the cv.ss across iterations (an average of the averages).
RSE.n
cv.ss
mean(cv.ss)
# All 10 bootstrapped estimates (and their average) are lower than the RSE.n.
# The average of the averages across iterations provides the more robust estimated
# prediction error.
# Correlations between y and y.hat for each of the 10 iterations; all nearly 1.0.
corr.yhat
# Average correlation of the 10 iterations.
mean(corr.yhat)
# Cleaning up the workspace.
ls()
rm(cv.ss, boot.id, boot.sample, corr.yhat, errors, i, k, model, n, y.hat)
ls()
###### Bootstrapped (cross validation): Estimates of Prediction Error (Efron & Tibshirani, 1993).
# The bootstrapped cross validation approach represents a robust method of estimating
# the prediction error for a model. Each bootstrapped sample (sampling WITH replacement
# from the original sample) has the model fit to it and the subsequent coefficients
# are then used to predict scores on the original sample as well as the bootstrapped
# sample values of the outcome. Natrually, the errors associated with the original sample
# estimates will be larger than the errors associated with the bootstrapped sample estimates;
# because, the bootstrapped sample was used to fit the model and generate the coefficients.
# Then, using the DIFFERENCE between the original sample errors and the bootstrapped
# sample errors provides us with an estimate of bias or "optimism" (Efron & Tibshirani,
# 1993, p. 248). Finally, the average optimism (average of each optimism from all
# the bootstrapping) can be added to the original sample RSE.n; which corrects it
# and provides a more accurate estimate of average prediction error.
# Setting initial conditions and creating an empty data frame to store results.
n <- 100 # number of cases in each bootstrapped sample = number of cases in the sample data.
N <- 200 # Number of bootstrapped samples to take/run.
prediction.errors <- as.data.frame(matrix(0, ncol = 3, nrow = N))
# Running the bootstrapped cross validation (which is surprisingly quick).
for (i in 1 : N){
boot.id <- sample(sample1.df$s.id, n, replace = TRUE)
boot.sample <- data.frame(sample1.df[boot.id,])
model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = boot.sample)
samp.fv <- model$coefficients[1] + model$coefficients[2]*sample1.df$x1 +
model$coefficients[3]*sample1.df$x2 + model$coefficients[4]*sample1.df$x3 +
model$coefficients[5]*sample1.df$x4 + model$coefficients[6]*sample1.df$x5 +
model$coefficients[7]*sample1.df$x6 + model$coefficients[8]*sample1.df$x7
s.error <- sum((sample1.df$y - samp.fv)^2)/n
b.error <- sum((boot.sample$y - model$fitted.values)^2)/n
optimism <- s.error - b.error
prediction.errors[i,] <- c(s.error, b.error, optimism)
}
# The 'prediction.errors' data frame contains the following: the original
# sample-based prediction error estimate (average cross validation sums of squares),
# the bootstrapped sample-based prediction error estimate (average cross validation
# sums of squares), and the optimism estimate (difference between the previous two);
# 'prediction.errors' contains all 3 for each N = 200 iterations (bootstraps).
names(prediction.errors)[1] <- "Sample.Error.Estimates"
names(prediction.errors)[2] <- "Boot.Error. Estimates"
names(prediction.errors)[3] <- "Optimism"
summary(prediction.errors)
# Averages of each.
avg.s.pe <- mean(prediction.errors[,1])
avg.s.pe
avg.b.pe <- mean(prediction.errors[,2])
avg.b.pe
avg.optimism <- mean(prediction.errors[,3])
avg.optimism
# Compare the sample and bootstrapped average cross validation sums of squares ('avg.s.pe' &
# 'avg.b.pe' from above) to the Average Residual Squared Error (RSE.n).
RSE.n
# Improved bootstraped prediction error estimate which adds a bias correction to the
# original RSE.n; because, RSE.n is biased downwardly (i.e. predicted error estimate
# is smaller [less error] than it really should be due to overfitting). See Efron,
# 1993, p. 237 - 249.
avg.optimism + RSE.n
# Cleaning up the workspace.
ls()
rm(avg.b.pe, avg.optimism, avg.s.pe, b.error, boot.id, boot.sample, i, model, n, N,
optimism, prediction.errors, s.error, samp.fv)
ls()
###### Real cross validation (*true* cross validation)
# 'Real cross validation' = collecting another sample of data from the same
# population and using the new data (measured with the same instruments) to
# "validate" the model's accuracy. Of course, in actual research it is often
# impossible to do this, either because of funding, time constraints, or because
# the model is based on fixed effects and not random effects (e.g. the population
# changes because individual members move into and out of it, time is an unmeasured
# confound which influences the model, the effects modeled are really nested and the
# nesting effect was not measured or modeled, etc.).
# Collecting new data: here, this is very easy because we have simulated the data
# by first generating a simulated population and then sampling from it to build
# the model. This is why our estimates are very 'unbiased' above; most of the
# prediction error estimates (regardless of cross validation technique) have
# produced very similar estimates (virtually the same near-zero estimates of
# prediction error).
# Collecting (drawing) a new sample from the population.
# Here; read in the second sample from the web, naming it "sample2.df".
sample2.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_sample2.df.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
head(sample2.df)
summary(sample2.df)
# Now we can use the original sample's linear model coefficients, applied to
# the new (2nd) sample's data (predictor variables), to 'predict' values on the
# outcome variable (y) of the new (2nd) sample data set. Then we can compare the
# 'predicted' values (y.hat) with the actual values of 'y' from the new data.
# First, a summary of the original sample's linear model coefficients.
summary(samp1.lm)
samp1.lm$coefficients
# Applying the coefficients to the new predictor variable data to create the
# 'predicted' values of the outcome (y.hat).
y.hat <- samp1.lm$coefficients[1] + samp1.lm$coefficients[2]*sample2.df$x1 + samp1.lm$coefficients[3]*sample2.df$x2 +
samp1.lm$coefficients[4]*sample2.df$x3 + samp1.lm$coefficients[5]*sample2.df$x4 +
samp1.lm$coefficients[6]*sample2.df$x5 + samp1.lm$coefficients[7]*sample2.df$x6 +
samp1.lm$coefficients[8]*sample2.df$x7
summary(y.hat)
# Comparison of the 'predicted' values (y.hat) to the actual (new) values of the
# outcome (sample2.df$y).
summary(y.hat)
summary(sample2.df$y)
mean(y.hat)
mean(sample2.df$y)
var(y.hat)
var(sample2.df$y)
sd(y.hat)
sd(sample2.df$y)
# The following creates two histograms (in one graphics window); showing y.hat and y.
oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,1))
hist(y.hat, col = "lightgreen", xlim = c(6,14), main = "Histogram of 'y.hat'", xlab = "Mean = 10.025, SD = 1.034")
abline(v = mean(y.hat), col = "blue", lwd = 2, lty = "dashed")
hist(sample2.df$y, col = "lightblue", xlim = c(6,14), main = "Histogram of new 'y'", xlab = "Mean = 10.030, SD = 1.034")
abline(v = mean(sample2.df$y), col = "darkgreen", lwd = 2, lty = "dashed")
par(oldpar)
# Further comparisons.
t.test(y.hat, sample2.df$y, paired = TRUE)
cor.test(y.hat, sample2.df$y)
# So, clearly our model (based on the original sample) is doing an excellent job
# of predicting new values of 'y' based on new values of all the predictors.
# This is due to a few advantages of simulating or creating the data in the manner
# we did: the variables have zero measurement error and the cases were randomly
# sampled from a defined (and constant) population (i.e. zero sampling error/bias);
# although, because this data is simulated, there is a slight chance the same
# case(es) may appear in multiple samples. However, that was not the case here;
# meaning, each case is unique to sample 1 or sample 2. This is largely due to
# the overwhelmingly large population size (N = 1000000) in relation to sample
# sizes (n = 100).
# Furthermore, our samples have greater than a 10 to 1 ratio of cases to variables.
# Simply put, our sample sizes, though not large, are adequate because; for each of our
# 8 variables we have at least 10 cases -- it would obviously be better to have
# a 20 to 1 (or even higher) ratio; but, 10 to 1 is about the lowest acceptable
# ratio. However, one should also remember that the sample size in relation to
# the population size is important. The smaller the sample (in comparison to
# the population), the more likely the sample, even a random sample, will contain
# bias.
# In a 'real-world' setting, it is not uncommon to have violations of one, or
# more of the above conditions -- any one of which would bias the estimate of RSE.n;
# meaning the model would have more prediction error than the RSE.n would indicate.
# Lastly, two other threats to the accuracy of estimating prediction errors for a model
# are model search error and model specification error. Model search error is the
# bias created by the method used to select the model 'form' (i.e. the method
# used to choose linear form vs. quadratic form).
# Model specification error can happen in three ways;
# (1) using the wrong model 'form' (e.g. using a linear model rather than a quadratic
# model when the quadratic more accurately represents the data), (2) errors of
# omission (leaving crucial variables out of the model), and (3) errors of
# inclusion (including non-essential variables in the model). The last of which,
# errors of inclusion; can increase multicollinearity and cause bias in model
# fit estimates, prediction error estimates, and individual parameter estimates
# (i.e. coefficients). To clarify this last point, notice that throughout the
# above results, our model has performed almost perfectly with respect to model
# fit (R-squared & Adj. R-squared), small residuals (errors), and very small
# predicted error estimates. However, closer inspection of the sample(s) data
# (and the population data) will reveal two related errors with our model.
# First, we have model specification errors of inclusion; four of our predictors
# are not related to the outcome (AT ALL!), which means they have no business
# being included in the model. The only thing they do is artificially increase
# fit measures (R-squared & Adj. R-squared). But, these four variables are even
# more disruptive because they bring with them multicollinearity (intercorrelations
# among themselves and to a lesser extent with the actual 3 predictors); a condition
# which decreases the validity of the coefficients in terms of variable importance.
# To 'see' the relationships referred to above, simply take a look at the correlation
# matrix of 'y' and all the 'x' variables. Below, both sample correlation matrices
# are displayed with rounding two places after the decimal (if interested in seeing
# the matrices without rounding use this: cor(sample1.df[,3:10]) and
# this: cor(sample2.df[,3:10])
round(cor(sample1.df[,3:10]), 2)
round(cor(sample2.df[,3:10]), 2)
# Now take a look at the population.
# Read in the population data frame from the web (this takes a couple of minutes
# due to the file being over 100 mb; N = 1000000) and name it "population.df".
# NOTE:
# Because the population data file is so big, the internet connection "times out"
# before all the data can be read. As an example, you will not be able to get
# all 1000000 cases when you run the standard 'read.table' function we typically
# use here; as can be seen with the 'tail' function (which prints the last 6 rows
# of the object -- data.frame):
population.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_population.df.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
tail(population.df)
# There are two ways we can 'get around' this (there are probably more ways, I simply
# have not spent time trying).
# First, (not recommended) we can use the old fashioned method of downloading
# the 'cv_population.df.txt' file from the URL above and save it to your desktop, then
# alter the following code to reflect the path to YOUR desktop in order to read
# the data into your R session (workspace).
population.df <- read.table("C:\\Users\\jds0282\\Desktop\\Workstuff\\Jon_R\\CrossValidation\\cv_population.df.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
tail(population.df)
# The second method (recommended) involves loading the saved workspace which I
# initially created when writing this script (which is available on the web).
# When this workspace is loaded into your R session, all the objects you have
# will still be available, the only additions will be the objects: 'connection'
# and 'population.df'; if you have an 'incomplete' version of 'population.df'
# from above, do not worry, it will be over-written with the complete version
# from the workspace.
ls() # list the objects in your workspace.
connection <- url("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/CrossValidation_003.RData")
load(connection)
ls() # list the objects in your workspace.
# Now, we can get back to what we were doing...
head(population.df)
tail(population.df)
nrow(population.df) # Number of rows matches the population identification (p.id) of the last case from the 'tail' function.
summary(population.df)
# You probably noticed a warning message:
# "closing unused connection 3 (gzcon(http://www.unt.edu/rss/class/....."
# Not to worry, R simply closed the connection; nothing was lost.
# Take a look at the population correlation matrix; note specifically that only
# x1, x2, and x3 are related to y.
round(cor(population.df[,2:9]), 2)
# To compare the 'true' population model with the mis-specified model, run both on
# the population. Be advised, this takes a minute or so due to the 1000000 cases.
pop.lm <- lm(y ~ x1 + x2 + x3, population.df)
summary(pop.lm)
bad.model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, population.df)
summary(bad.model)
# Based on the results of the 'bad.model' it appears superior; smaller residuals,
# larger R-squared and Adj. R-squared, and smaller residual standard errors. But,
# we 'know' those are biased because the correlation matrix(-ces) shows us what is
# *truly* related to the outcome (y) and what is not. Of course, the p-values cannot
# be relied upon due to the enormous number of cases. One final look:
cor(population.df[,2:9])
# If you're not familiar with the scientific notation; where 'e' is + (positive
# exponent) moves the decimal to the right and where 'e' is - (negative exponent)
# moves the decimal to the left. For example;
# 5.23e+04 = 52300 and 5.23e-04 = .000523
5.23e+04
5.23e-04
# Cleaning the workspace for the final time.
ls()
rm(bad.model, connection, oldpar, pop.lm, population.df, RSE.n, samp1.lm,
sample1.df, sample2.df, y.hat)
ls()
# If all this scripting / code intimidates you; have no fear, there are several
# R packages / libraries which are capable of utilizing these methods with specific
# functions (meaning, all this code serves only to inform one of the underlying
# processes of, and threats to, cross-validation and estimates of prediction error).
# Some of those packages / libraries will be explored in another script / tutorial.
##################### REFERENCES / RESOURCES #####################
#
# Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and
# Researchers (2nd ed.). Hoboken, NJ: John Wiley & Sons, Inc.
#
# Efron, B. (1983). Estimating the error rate of a prediction rule: Some improvements on
# cross-validation. Journal of the American Statistical Association, 78, 316 - 331.
#
# Efron, B, & Tibshirani, R. (1993). An introduction to the bootstrap. New York:
# Chapman & Hall.
#
# Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap
# method. Journal of the American Statistical Association, 92(438), 548 - 560.
#
# Harrell, F., Lee, K., & Mark, D. (1996). Tutorial in Biostatistics: Multivariable prognostic
# models: Issues in developing models, evaluating assumptions and adequacy, and measuring
# and reducing errors. Statistics in Medicine, 15, 361 - 387.
# Available at: http://www.yaroslavvb.com/papers/steyerberg-application.pdf
#
# Harrell, F. (1998). Comparisons of strategies for validating binary logistic regression
# models. Available at: http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf
#
# Harrell, F. E. (2001). Regression modeling strategies: With applications to
# linear models, logistic regression, and survival analysis. New York:
# Springer-Verlag, Inc.
#
# Harrell, F. E. (2009). Package 'Design'. Available at CRAN:
# http://cran.r-project.org/web/packages/Design/index.html
#
# Maindonald, J., & Braun, W. J. (2011). Package 'DAAG'. Available at CRAN:
# http://cran.r-project.org/web/packages/DAAG/index.html
#
# Moore, A. (2008). Cross-Validation: tutorial slides. Available at:
# http://www.autonlab.org/tutorials/overfit.html
#
# Ripley, B. (2010). Package 'boot'. Available at CRAN:
# http://cran.r-project.org/web/packages/boot/index.html
#
# Schneider, J. (1997). Cross validation. Availabe at:
# http://www.cs.cmu.edu/~schneide/tut5/node42.html
#
# Wikipedia. (2011). Cross-validation (statistics). Available at:
# http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
#
#
# End; last updated: May 10, 2011.