# # ########### Cross Validation ########### # # <<< ### >>> # The next three paragraphs are an amalgamation of content from Wikipedia: # http://en.wikipedia.org/wiki/Cross-validation_(statistics) # Andrew Moore: # http://www.autonlab.org/tutorials/overfit.html # and Jeff Schneider # http://www.cs.cmu.edu/~schneide/tut5/node42.html # Full references & resources listed at the bottom of the script. # <<< ### >>> # Cross validation is a model evaluation method that is better than simply looking # at the residuals. Cross validation addresses the 'over-fitting' problem. # Over-fitting refers to the fact that the same data was used to fit/specify the # model, AND assess that model's fit. # The problem with residual evaluations is that they do not give an indication of how # well the model will do when it is asked to make new predictions for data it has # not already seen. One way to overcome this problem is to not use the entire data # set when building a model. Some of the data is removed before modeling begins. # When the model has been built, the data that was removed can be used to test the # performance of the model on ``new'' data. This is the basic idea for a # whole class of model evaluation methods called cross validation. # The holdout method is the simplest kind of cross validation. The data set is # separated into two sets, called the training set and the testing set. The model # is fit using the training set only. Then the model is used to predict the outcome # values for the data in the testing set (it has never seen these output values # before). The errors it makes are accumulated to give the mean absolute test set # error, which is used to evaluate the model. The advantage of this method is that # it is usually preferable to the residual method and takes no longer to compute. # However, its evaluation can have a high variance. The evaluation may depend # heavily on which data points end up in the training set and which end up in the # test set, and thus the evaluation may be significantly different depending on # how the division is made. # K-fold cross validation is one way to improve over the holdout method. The data # set is divided into k subsets, and the holdout method is repeated k times. Each # time, one of the k subsets is used as the test set and the other k-1 subsets are # put together to form a training set. Then the average error across all k trials # is computed. The advantage of this method is that it matters less how the data # gets divided. Every data point gets to be in a test set exactly once, and gets # to be in a training set k-1 times. The variance of the resulting estimate is # reduced as k is increased. The disadvantage of this method is that the training # algorithm has to be rerun from scratch k times, which means it takes k times as # much computation to make an evaluation. A variant of this method is to randomly # divide the data into a test and training set k different times. The advantage of # doing this is that you can independently choose how large each test set is and # how many trials you average over. ## CREATE THE POPULATION. # Using the 'MASS' library to generate some multivariate normal variables; each # with a mean of 10. # This is the population of *true* scores for these variables. N <- 1000000 library(MASS) Sigma <- matrix(c(1.0, .80, .50, .20, .00, .00, .00, .00, .80, 1.0, .00, .00, .15, .15, .05, .10, .50, .00, 1.0, .00, .10, .05, .10, .15, .20, .00, .00, 1.0, .10, .15, .15, .05, .00, .15, .10, .10, 1.0, .15, .25, .25, .00, .15, .05, .15, .15, 1.0, .35, .55, .00, .05, .10, .15, .25, .35, 1.0, .85, .00, .10, .15, .05, .25, .55, .85, 1.0), ncol = 8) x <- mvrnorm(N, Sigma, mu=c(10,10,10,10,10,10,10,10), empirical = TRUE) # Create a population identification variable (to identify cases among the population). p.id <- seq(1:N) # Create the population data frame and re-name the variable columns. population.df <- data.frame(p.id, x) names(population.df)[2] <- "y" names(population.df)[3] <- "x1" names(population.df)[4] <- "x2" names(population.df)[5] <- "x3" names(population.df)[6] <- "x4" names(population.df)[7] <- "x5" names(population.df)[8] <- "x6" names(population.df)[9] <- "x7" head(population.df) summary(population.df) # Cleaning up the workspace. ls() # Lists all the objects in the workspace. rm(N, Sigma, p.id, x) # remove the objects no longer needed (rm = remove). ls() # Lists all the objects in the workspace. # Detach the 'MASS' package, as it is no longer needed. detach("package:MASS") # Notice, only x1, x2, x3 are related to y. round(cor(population.df[,2:9]), 2) # The actual or correctly specified population linear model of 'y' only includes x1, x2, x3. pop.lm <- lm(y ~ x1 + x2 + x3, population.df) summary(pop.lm) # Model specification error and multicollinearity are introduced by including # the other variables; which add nothing to y (we know because the correlation # between y and x4, x5, x6, and x7 is specified at exactly 0.00 above when we created # the variance/covariance matrix on which the data was based). # The model below is named 'bad.model' because of the mis-specification; which # also introduces multicollinearity because of the intercorrelations of X4, X5, # x6, & x7; and their relationships with the 'true' predictors (x1, x2, & x3). bad.model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, population.df) summary(bad.model) rm(bad.model) # Keep in mind, the significance values (p-values) are directly related to the number of # cases (i.e. more cases = lower p-values). Here, they are extremely deflated with # the size of the data (population N = 1000000). Also, R-squared and even adjusted # R-squared are biased (over estimating model fit) due to "overfitting" which refers # to specifying the model with the same data with which you measure model fit (prediction # errors). ## CREATE THE SAMPLE. # Drawing a (first) sample from the population which was created above. n <- 100 s.id <- seq(1:n) samp <- sample(population.df[,1], n, replace = FALSE) sample1.df <- data.frame(s.id,population.df[samp,]) rm(n, s.id, samp) summary(sample1.df) nrow(sample1.df) # Draw a second sample from the population (for later use toward the bottom of the script). n <- 100 s.id <- seq(1:n) samp <- sample(population.df[,1], n, replace = FALSE) sample2.df <- data.frame(s.id,population.df[samp,]) rm(n, s.id, samp) summary(sample2.df) nrow(sample2.df) # Wrote out the 'population.df' data frame as a comma separated values (.csv) data file. # No need to run here; just showing the script/command as an example. write.table(population.df, "C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_population.df.txt", sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA") # Wrote out the 'sample1.df' data frame as a comma separated values (.csv) data file. # No need to run here; just showing the script/command as an example. write.table(sample1.df, "C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_sample1.df.txt", sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA") # Wrote out the 'sample2.df' data frame as a comma separated values (.csv) data file. # No need to run here; just showing the script/command as an example. write.table(sample2.df, "C:/Users/jds0282/Desktop/Workstuff/Jon_R/CrossValidation/cv_sample2.df.txt", sep=",", col.names=TRUE, row.names=FALSE, quote=TRUE, na="NA") # Saved out the workspace 'CrossValidation_003.Rdata' to ensure all three data files # would be accessible for useRs interested this script. # No need to run here; just showing the script/command as an example. save.image("C:\\Users\\jds0282\\Desktop\\Workstuff\\Jon_R\\CrossValidation\\CrossValidation_003.RData") # <<< All four files were/are posted on the web; to be accessed below. >>> # ################################################################################################################## # ### Close the R session or run the three lines directly below and begin again below; importing the specific ### data I made above to insure repeated or consistent results (rather than creating new 'simulated' data ### each time the script is run, which would result in different values for each object/analysis). # ################################################################################################################## ls() rm(pop.lm, population.df, sample1.df, sample2.df) ls() # # ######## Cross Validation ######## # # The data used in this script is simulated and contains one continuous (interval/ratio) # outcome variable and 7 other continuous (interval/ratio) variables. All 8 variables # have an approximate mean of 10 and are (multivariate) normally distributed. # Initially, a population was created (N = 1000000) and from it samples could be # randomly drawn by randomly selecting population identification numbers (p.id). # The first sample (n = 100) was drawn (read in from the web so that results are # consistent for all users of this script). The sample contains an additional # column for identifying each case of the sample (s.id). # The general idea below is that we are working from a perspective of a researcher # or research team with a theoretically motivated study in which we 'believe' # 7 measured variables (x1 - x7) predict the outcome (y). Our goal is to use # cross validation to estimate the prediction error of our model and if warranted, # identify a linear model (OLS regression) which offers the lowest prediction error # estimate(s). # Of course, the 'real' goal of this script is to show various approaches to cross # validation (and *true* validation) of a model; as well as showing some of the # things which can (and often do) compromise the validity of model fit estimates # and prediction error estimates. Keep in mind throughout; we use a simple (OLS) # linear regression as an example here, but the ideas conveyed here apply to # other types of modeling (e.g. GLM, HLM, SEM, etc.). # Read in the sample data file from the web naming it "sample1.df" as below; and # get the ubiquitous 'head' and 'summary' of the data to see what it looks like. sample1.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_sample1.df.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) head(sample1.df) summary(sample1.df) # Fit all the original sample 1 data to the linear model and calculate the Average # Residual Squared Error for all cases in the sample (baseline [and biased] prediction # error estimate: RSE.n). samp1.lm <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, sample1.df) summary(samp1.lm) RSE.n <- (sum((sample1.df$y - samp1.lm$fitted.values)^2))/nrow(sample1.df) RSE.n # These initial estimates (generated above) indicate superb model fit (R-squared & # Adj. R-squared) and an extremely small prediction error estimate (RSE.n); however, # they are all compromised by overfitting. ####### Split-half validation (also called 'split-sample' or hold-out validation). # Divide the sample data into two halves, the training set and the testing set. nrow(sample1.df) training.set <- sample1.df[1:50,] head(training.set) testing.set <- sample1.df[51:100,] head(testing.set) # Specify/fit the model with the training set. model.1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = training.set) summary(model.1) # Apply the specified model (coefficients) to predictor values from the testing set # to predict (model based) outcome values of the testing set. attach(testing.set) model.1$coefficients y.hat <- model.1$coefficients[1] + model.1$coefficients[2]*x1 + model.1$coefficients[3]*x2 + model.1$coefficients[4]*x3 + model.1$coefficients[5]*x4 + model.1$coefficients[6]*x5 + model.1$coefficients[7]*x6 + model.1$coefficients[8]*x7 # Compare these predicted values to the actual values of the outcome in the testing set. t.test(y.hat, y, paired = TRUE) cor.test(y.hat, y) library(car) scatterplot(y.hat, y) # The results above look great; no difference between the two (paired) means, # the correlation is virtually +1....perhaps a bit too good to be true? # Calculating the residuals (errors). errors <- y - y.hat summary(errors) # Calculating the average error. avg.err <- sum(y - y.hat)/length(y) avg.err # OR: mean(y - y.hat) mean(errors) # Squared Prediction Errors. pe2 <- (y - y.hat)^2 summary(pe2) # Average Cross validation sums of squares. cv.ss <- sum((y - y.hat)^2)/length(y) cv.ss # Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n). Notice, they # are virtually the same, with RSE.n slightly biased (estimating less prediction error). RSE.n cv.ss # Detach the 'testing.set' data. detach(testing.set) # Cleaning up the workspace. ls() rm(avg.err, cv.ss, errors, model.1, pe2, testing.set, training.set, y.hat) ls() ###### Leave-One-Out cross-validation (LOOCV). # The LOOCV strategy in its most basic form, simply takes one observation out of # the data and sets it aside as the 'testing set' like what was done above. Then # the model is applied to the training set of n - 1 cases (i.e. the data minus # the single testing set case) and the resulting coefficients are applied to the # testing set case to produce a predicted value which in turn is then compared to # the actual value (of y) of that single case. Below, we avoid the most basic form # of LOOCV and instead iteratively conduct the procedure across all cases (i.e. each # case is 'left out' once). # Setting up the initial conditions and creating an empty vector to store the values # of y.hat for each iteration. n <- 0 y.hat <- as.vector(0) # Running the LOOCV iteratively ('for loop'). for (i in 1 : nrow(sample1.df)){ n <- 1 + n model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = sample1.df[-n,]) y.hat[i] <- model$coefficients[1] + model$coefficients[2]*sample1.df[n,4] + model$coefficients[3]*sample1.df[n,5] + model$coefficients[4]*sample1.df[n,6] + model$coefficients[5]*sample1.df[n,7] + model$coefficients[6]*sample1.df[n,8] + model$coefficients[7]*sample1.df[n,9] + model$coefficients[8]*sample1.df[n,10] } # Calculating the errors and taking a look at them. errors <- sample1.df$y - y.hat summary(errors) hist(errors) cor(y.hat, sample1.df$y) scatterplot(y.hat, sample1.df$y) # Average Cross validation sums of squares. cv.ss <- sum((sample1.df$y - y.hat)^2)/1000 cv.ss # Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n); again, both # are very close to zero; with the cv.ss indicating an even smaller estimate of # prediction error. RSE.n cv.ss # Cleaning up the workspace. ls() rm(cv.ss, errors, i, model, n, y.hat) ls() ###### Bootstrapped LOOCV. # Here, we create 10 bootstrapped samples (sample WITH replacement from the original # sample) and apply the LOOCV from above to each of the 10 bootstrapped samples; # each time saving the y.hat values and then calculating the errors, cv.ss, and # correlation between y.hat and the actual values of y in each bootstrapped sample. # Creating empty objects for storing the output values. errors <- as.data.frame(matrix(0, ncol = 10, nrow = 50)) cv.ss <- as.vector(0) corr.yhat <- as.vector(0) # Running the bootstrapped LOOCV; which takes a minute or so. for (i in 1 : 10){ boot.id <- sample(sample1.df$s.id, 50, replace = TRUE) boot.sample <- data.frame(sample1.df[boot.id,]) n <- 0 y.hat <- as.vector(0) for (k in 1:nrow(boot.sample)){ n <- 1 + n model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = boot.sample[-n,]) y.hat[k] <- model$coefficients[1] + model$coefficients[2]*boot.sample[n,4] + model$coefficients[3]*boot.sample[n,5] + model$coefficients[4]*boot.sample[n,6] + model$coefficients[5]*boot.sample[n,7] + model$coefficients[6]*boot.sample[n,8] + model$coefficients[7]*boot.sample[n,9] + model$coefficients[8]*boot.sample[n,10] } errors[,i] <- boot.sample$y - y.hat cv.ss[i] <- (sum(boot.sample$y - y.hat)^2)/50 corr.yhat[i] <- cor(boot.sample$y, y.hat) } # Summary of the errors for each of the 10 iterations (loops); each column corresponds # to each loop/bootstrapped sample. summary(errors) # Average Cross validation sums of squares for each of the 10 iterations (loops). cv.ss # Compare the 'cv.ss' to the Average Residual Squared Error (RSE.n) from each iteration # and the bootstrapped average of the cv.ss across iterations (an average of the averages). RSE.n cv.ss mean(cv.ss) # All 10 bootstrapped estimates (and their average) are lower than the RSE.n. # The average of the averages across iterations provides the more robust estimated # prediction error. # Correlations between y and y.hat for each of the 10 iterations; all nearly 1.0. corr.yhat # Average correlation of the 10 iterations. mean(corr.yhat) # Cleaning up the workspace. ls() rm(cv.ss, boot.id, boot.sample, corr.yhat, errors, i, k, model, n, y.hat) ls() ###### Bootstrapped (cross validation): Estimates of Prediction Error (Efron & Tibshirani, 1993). # The bootstrapped cross validation approach represents a robust method of estimating # the prediction error for a model. Each bootstrapped sample (sampling WITH replacement # from the original sample) has the model fit to it and the subsequent coefficients # are then used to predict scores on the original sample as well as the bootstrapped # sample values of the outcome. Natrually, the errors associated with the original sample # estimates will be larger than the errors associated with the bootstrapped sample estimates; # because, the bootstrapped sample was used to fit the model and generate the coefficients. # Then, using the DIFFERENCE between the original sample errors and the bootstrapped # sample errors provides us with an estimate of bias or "optimism" (Efron & Tibshirani, # 1993, p. 248). Finally, the average optimism (average of each optimism from all # the bootstrapping) can be added to the original sample RSE.n; which corrects it # and provides a more accurate estimate of average prediction error. # Setting initial conditions and creating an empty data frame to store results. n <- 100 # number of cases in each bootstrapped sample = number of cases in the sample data. N <- 200 # Number of bootstrapped samples to take/run. prediction.errors <- as.data.frame(matrix(0, ncol = 3, nrow = N)) # Running the bootstrapped cross validation (which is surprisingly quick). for (i in 1 : N){ boot.id <- sample(sample1.df$s.id, n, replace = TRUE) boot.sample <- data.frame(sample1.df[boot.id,]) model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = boot.sample) samp.fv <- model$coefficients[1] + model$coefficients[2]*sample1.df$x1 + model$coefficients[3]*sample1.df$x2 + model$coefficients[4]*sample1.df$x3 + model$coefficients[5]*sample1.df$x4 + model$coefficients[6]*sample1.df$x5 + model$coefficients[7]*sample1.df$x6 + model$coefficients[8]*sample1.df$x7 s.error <- sum((sample1.df$y - samp.fv)^2)/n b.error <- sum((boot.sample$y - model$fitted.values)^2)/n optimism <- s.error - b.error prediction.errors[i,] <- c(s.error, b.error, optimism) } # The 'prediction.errors' data frame contains the following: the original # sample-based prediction error estimate (average cross validation sums of squares), # the bootstrapped sample-based prediction error estimate (average cross validation # sums of squares), and the optimism estimate (difference between the previous two); # 'prediction.errors' contains all 3 for each N = 200 iterations (bootstraps). names(prediction.errors)[1] <- "Sample.Error.Estimates" names(prediction.errors)[2] <- "Boot.Error. Estimates" names(prediction.errors)[3] <- "Optimism" summary(prediction.errors) # Averages of each. avg.s.pe <- mean(prediction.errors[,1]) avg.s.pe avg.b.pe <- mean(prediction.errors[,2]) avg.b.pe avg.optimism <- mean(prediction.errors[,3]) avg.optimism # Compare the sample and bootstrapped average cross validation sums of squares ('avg.s.pe' & # 'avg.b.pe' from above) to the Average Residual Squared Error (RSE.n). RSE.n # Improved bootstraped prediction error estimate which adds a bias correction to the # original RSE.n; because, RSE.n is biased downwardly (i.e. predicted error estimate # is smaller [less error] than it really should be due to overfitting). See Efron, # 1993, p. 237 - 249. avg.optimism + RSE.n # Cleaning up the workspace. ls() rm(avg.b.pe, avg.optimism, avg.s.pe, b.error, boot.id, boot.sample, i, model, n, N, optimism, prediction.errors, s.error, samp.fv) ls() ###### Real cross validation (*true* cross validation) # 'Real cross validation' = collecting another sample of data from the same # population and using the new data (measured with the same instruments) to # "validate" the model's accuracy. Of course, in actual research it is often # impossible to do this, either because of funding, time constraints, or because # the model is based on fixed effects and not random effects (e.g. the population # changes because individual members move into and out of it, time is an unmeasured # confound which influences the model, the effects modeled are really nested and the # nesting effect was not measured or modeled, etc.). # Collecting new data: here, this is very easy because we have simulated the data # by first generating a simulated population and then sampling from it to build # the model. This is why our estimates are very 'unbiased' above; most of the # prediction error estimates (regardless of cross validation technique) have # produced very similar estimates (virtually the same near-zero estimates of # prediction error). # Collecting (drawing) a new sample from the population. # Here; read in the second sample from the web, naming it "sample2.df". sample2.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_sample2.df.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) head(sample2.df) summary(sample2.df) # Now we can use the original sample's linear model coefficients, applied to # the new (2nd) sample's data (predictor variables), to 'predict' values on the # outcome variable (y) of the new (2nd) sample data set. Then we can compare the # 'predicted' values (y.hat) with the actual values of 'y' from the new data. # First, a summary of the original sample's linear model coefficients. summary(samp1.lm) samp1.lm$coefficients # Applying the coefficients to the new predictor variable data to create the # 'predicted' values of the outcome (y.hat). y.hat <- samp1.lm$coefficients[1] + samp1.lm$coefficients[2]*sample2.df$x1 + samp1.lm$coefficients[3]*sample2.df$x2 + samp1.lm$coefficients[4]*sample2.df$x3 + samp1.lm$coefficients[5]*sample2.df$x4 + samp1.lm$coefficients[6]*sample2.df$x5 + samp1.lm$coefficients[7]*sample2.df$x6 + samp1.lm$coefficients[8]*sample2.df$x7 summary(y.hat) # Comparison of the 'predicted' values (y.hat) to the actual (new) values of the # outcome (sample2.df$y). summary(y.hat) summary(sample2.df$y) mean(y.hat) mean(sample2.df$y) var(y.hat) var(sample2.df$y) sd(y.hat) sd(sample2.df$y) # The following creates two histograms (in one graphics window); showing y.hat and y. oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,1)) hist(y.hat, col = "lightgreen", xlim = c(6,14), main = "Histogram of 'y.hat'", xlab = "Mean = 10.025, SD = 1.034") abline(v = mean(y.hat), col = "blue", lwd = 2, lty = "dashed") hist(sample2.df$y, col = "lightblue", xlim = c(6,14), main = "Histogram of new 'y'", xlab = "Mean = 10.030, SD = 1.034") abline(v = mean(sample2.df$y), col = "darkgreen", lwd = 2, lty = "dashed") par(oldpar) # Further comparisons. t.test(y.hat, sample2.df$y, paired = TRUE) cor.test(y.hat, sample2.df$y) # So, clearly our model (based on the original sample) is doing an excellent job # of predicting new values of 'y' based on new values of all the predictors. # This is due to a few advantages of simulating or creating the data in the manner # we did: the variables have zero measurement error and the cases were randomly # sampled from a defined (and constant) population (i.e. zero sampling error/bias); # although, because this data is simulated, there is a slight chance the same # case(es) may appear in multiple samples. However, that was not the case here; # meaning, each case is unique to sample 1 or sample 2. This is largely due to # the overwhelmingly large population size (N = 1000000) in relation to sample # sizes (n = 100). # Furthermore, our samples have greater than a 10 to 1 ratio of cases to variables. # Simply put, our sample sizes, though not large, are adequate because; for each of our # 8 variables we have at least 10 cases -- it would obviously be better to have # a 20 to 1 (or even higher) ratio; but, 10 to 1 is about the lowest acceptable # ratio. However, one should also remember that the sample size in relation to # the population size is important. The smaller the sample (in comparison to # the population), the more likely the sample, even a random sample, will contain # bias. # In a 'real-world' setting, it is not uncommon to have violations of one, or # more of the above conditions -- any one of which would bias the estimate of RSE.n; # meaning the model would have more prediction error than the RSE.n would indicate. # Lastly, two other threats to the accuracy of estimating prediction errors for a model # are model search error and model specification error. Model search error is the # bias created by the method used to select the model 'form' (i.e. the method # used to choose linear form vs. quadratic form). # Model specification error can happen in three ways; # (1) using the wrong model 'form' (e.g. using a linear model rather than a quadratic # model when the quadratic more accurately represents the data), (2) errors of # omission (leaving crucial variables out of the model), and (3) errors of # inclusion (including non-essential variables in the model). The last of which, # errors of inclusion; can increase multicollinearity and cause bias in model # fit estimates, prediction error estimates, and individual parameter estimates # (i.e. coefficients). To clarify this last point, notice that throughout the # above results, our model has performed almost perfectly with respect to model # fit (R-squared & Adj. R-squared), small residuals (errors), and very small # predicted error estimates. However, closer inspection of the sample(s) data # (and the population data) will reveal two related errors with our model. # First, we have model specification errors of inclusion; four of our predictors # are not related to the outcome (AT ALL!), which means they have no business # being included in the model. The only thing they do is artificially increase # fit measures (R-squared & Adj. R-squared). But, these four variables are even # more disruptive because they bring with them multicollinearity (intercorrelations # among themselves and to a lesser extent with the actual 3 predictors); a condition # which decreases the validity of the coefficients in terms of variable importance. # To 'see' the relationships referred to above, simply take a look at the correlation # matrix of 'y' and all the 'x' variables. Below, both sample correlation matrices # are displayed with rounding two places after the decimal (if interested in seeing # the matrices without rounding use this: cor(sample1.df[,3:10]) and # this: cor(sample2.df[,3:10]) round(cor(sample1.df[,3:10]), 2) round(cor(sample2.df[,3:10]), 2) # Now take a look at the population. # Read in the population data frame from the web (this takes a couple of minutes # due to the file being over 100 mb; N = 1000000) and name it "population.df". # NOTE: # Because the population data file is so big, the internet connection "times out" # before all the data can be read. As an example, you will not be able to get # all 1000000 cases when you run the standard 'read.table' function we typically # use here; as can be seen with the 'tail' function (which prints the last 6 rows # of the object -- data.frame): population.df <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/cv_population.df.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) tail(population.df) # There are two ways we can 'get around' this (there are probably more ways, I simply # have not spent time trying). # First, (not recommended) we can use the old fashioned method of downloading # the 'cv_population.df.txt' file from the URL above and save it to your desktop, then # alter the following code to reflect the path to YOUR desktop in order to read # the data into your R session (workspace). population.df <- read.table("C:\\Users\\jds0282\\Desktop\\Workstuff\\Jon_R\\CrossValidation\\cv_population.df.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) tail(population.df) # The second method (recommended) involves loading the saved workspace which I # initially created when writing this script (which is available on the web). # When this workspace is loaded into your R session, all the objects you have # will still be available, the only additions will be the objects: 'connection' # and 'population.df'; if you have an 'incomplete' version of 'population.df' # from above, do not worry, it will be over-written with the complete version # from the workspace. ls() # list the objects in your workspace. connection <- url("http://www.unt.edu/rss/class/Jon/R_SC/Module9/CrossValidation/CrossValidation_003.RData") load(connection) ls() # list the objects in your workspace. # Now, we can get back to what we were doing... head(population.df) tail(population.df) nrow(population.df) # Number of rows matches the population identification (p.id) of the last case from the 'tail' function. summary(population.df) # You probably noticed a warning message: # "closing unused connection 3 (gzcon(http://www.unt.edu/rss/class/....." # Not to worry, R simply closed the connection; nothing was lost. # Take a look at the population correlation matrix; note specifically that only # x1, x2, and x3 are related to y. round(cor(population.df[,2:9]), 2) # To compare the 'true' population model with the mis-specified model, run both on # the population. Be advised, this takes a minute or so due to the 1000000 cases. pop.lm <- lm(y ~ x1 + x2 + x3, population.df) summary(pop.lm) bad.model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, population.df) summary(bad.model) # Based on the results of the 'bad.model' it appears superior; smaller residuals, # larger R-squared and Adj. R-squared, and smaller residual standard errors. But, # we 'know' those are biased because the correlation matrix(-ces) shows us what is # *truly* related to the outcome (y) and what is not. Of course, the p-values cannot # be relied upon due to the enormous number of cases. One final look: cor(population.df[,2:9]) # If you're not familiar with the scientific notation; where 'e' is + (positive # exponent) moves the decimal to the right and where 'e' is - (negative exponent) # moves the decimal to the left. For example; # 5.23e+04 = 52300 and 5.23e-04 = .000523 5.23e+04 5.23e-04 # Cleaning the workspace for the final time. ls() rm(bad.model, connection, oldpar, pop.lm, population.df, RSE.n, samp1.lm, sample1.df, sample2.df, y.hat) ls() # If all this scripting / code intimidates you; have no fear, there are several # R packages / libraries which are capable of utilizing these methods with specific # functions (meaning, all this code serves only to inform one of the underlying # processes of, and threats to, cross-validation and estimates of prediction error). # Some of those packages / libraries will be explored in another script / tutorial. ##################### REFERENCES / RESOURCES ##################### # # Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and # Researchers (2nd ed.). Hoboken, NJ: John Wiley & Sons, Inc. # # Efron, B. (1983). Estimating the error rate of a prediction rule: Some improvements on # cross-validation. Journal of the American Statistical Association, 78, 316 - 331. # # Efron, B, & Tibshirani, R. (1993). An introduction to the bootstrap. New York: # Chapman & Hall. # # Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap # method. Journal of the American Statistical Association, 92(438), 548 - 560. # # Harrell, F., Lee, K., & Mark, D. (1996). Tutorial in Biostatistics: Multivariable prognostic # models: Issues in developing models, evaluating assumptions and adequacy, and measuring # and reducing errors. Statistics in Medicine, 15, 361 - 387. # Available at: http://www.yaroslavvb.com/papers/steyerberg-application.pdf # # Harrell, F. (1998). Comparisons of strategies for validating binary logistic regression # models. Available at: http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf # # Harrell, F. E. (2001). Regression modeling strategies: With applications to # linear models, logistic regression, and survival analysis. New York: # Springer-Verlag, Inc. # # Harrell, F. E. (2009). Package 'Design'. Available at CRAN: # http://cran.r-project.org/web/packages/Design/index.html # # Maindonald, J., & Braun, W. J. (2011). Package 'DAAG'. Available at CRAN: # http://cran.r-project.org/web/packages/DAAG/index.html # # Moore, A. (2008). Cross-Validation: tutorial slides. Available at: # http://www.autonlab.org/tutorials/overfit.html # # Ripley, B. (2010). Package 'boot'. Available at CRAN: # http://cran.r-project.org/web/packages/boot/index.html # # Schneider, J. (1997). Cross validation. Availabe at: # http://www.cs.cmu.edu/~schneide/tut5/node42.html # # Wikipedia. (2011). Cross-validation (statistics). Available at: # http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 # # # End; last updated: May 10, 2011.