# # ###### Showing how to create empicially defensible Composite (or Indicator) Scores. # # Essentially, we will be using factor analysis to generate the composite # scores. In a very real sense, the best composite scores are factor scores # when there is a known, or strongly supported belief in, structure of the data. # There are a few ways to go about generating the composite, or factor, scores # based on what type of structure you believe the data contains (e.g. single # factor, multiple correlated factors, multiple uncorrelated factors, bifactor # model, hierarchical factor model, etc.); and how the variables are measured # or scaled (e.g., nominal scaled, ordinal or Likert scaled, interval/ratio # scaled, etc.). # The general procedure for generating composite / indicator scores includes the # following steps. (1) Convert, or recode, the nominal or ordinal (Likert) responses # to numeric responses, (2) apply a factor analysis model which reflects the known # structure of the variables, (3) save the factor scores and factor loadings, # (4) rescale the factor scores using the factor loadings, the weighted mean, and # the weighted standard deviation of the original data so that the composite # scores reflect (as nearly as possible) the original semantic (word) meaning # of the original data. In this process, the factor loadings serve as weights # for the weighted mean and weighted standard deviation calculations. # First, import some example data (the data file used below has been simulated). data.df <- read.table( "http://www.unt.edu/rss/class/Jon/R_SC/Module4/CompositeIndicators_001.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) summary(data.df); nrow(data.df) # In this example, we were interested in the activity levels of persons diagnosed # with diabetes. We have 3 groups of questions: # q1 - q4 assessed cognitive activity, # q5 - q9 assessed physical activity, # and q10 - q14 assessed social activity. # The response choices for q1 - q4 and q10 - q14 were the same: # 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree # The response choices for q5 - q9 were: # 1 = lethargic, 2 = not very active, 3 = no difference, 4 = more active, 5 = hyperactive # In the current example, we have three groups of questions; each group measures # a particular latent construct (i.e. indirect measurement) and those three # latent constructs support a general activity construct. The above # statements represent the known or strongly supported hypothesis of the data's # (factor) structure; in this example, a hierarchical factor model. # First, let's consider the situation where you have a set of Likert # scaled items which you believe are the result of one continuously scaled # latent factor, then you would need to recode the ordinal responses # as continuous (interval/ratio), then simply run a one factor model, collect # the factor scores, and rescale the factor scores as composites scores which # reflect the original metric. # As an example, consider q1, q2, q3, q4. subset.1 <- data.df[,7:10]; summary(subset.1) # Next, recode the responses into numbers which reflect the ordinality of the # original responses (the words). This can be tricky sometimes because R cannot # tell if "agree" should be a 1, 2, 3, etc. So, it's best to impose the values # on specific labels using a fairly simple function. recoding.4.point <- function(data){ new.data <- data.frame(matrix(rep(0, nrow(data)*ncol(data)), ncol = ncol(data))) for (j in 1:ncol(data)){ for (i in 1:nrow(data)){ if(data[i,j] == "strongly disagree"){new.data[i,j] <- 1} if(data[i,j] == "disagree"){new.data[i,j] <- 2} if(data[i,j] == "agree"){new.data[i,j] <- 3} if(data[i,j] == "strongly agree"){new.data[i,j] <- 4} } } names(new.data) <- names(data) return(new.data) } head(subset.1) subset.1 <- recoding.4.point(subset.1); head(subset.1) # Since we now have a properly re-coded (and numeric) version of the subset/data, # we can apply a one-factor model. Remember, factor analysis assumes the variables # are linearly related; if they are not, then factor analysis is not appropriate. fa.subset.1 <- factanal(x = subset.1, factors = 1, scores = "regression") fa.subset.1 # To extract the factor scores, simply use the "$scores" operator on the factor # analysis object. subset.1.scores <- as.vector(fa.subset.1$scores) summary(subset.1.scores) # You'll notice the factor scores have a mean of zero, so in order for them to # have (semantic) meaning, we must convert them back into the scale of the # original 1 - 4 responses. To do this, we will need three things; the new # factor scores, the raw data, and the factor loadings. The factor loadings # augment the meaning of the composite scores by providing insight into how # each question contributed to the composite scores. Notice in this example # each of the four questions contributed approximately equally to the # composite scores (i.e. the loadings are roughly equal). However, if you have # loadings which are not (roughly) equal, then you must communicate the loadings # (and why they are important) to anyone interpreting or using the composite scores. fa1.loadings <- fa.subset.1$loadings[,1] fa1.loadings # Small function for calculating the weighted standard devaition; needed below. weighted.sd <- function(x, w){ sum.w <- sum(w) sum.w2 <- sum(w^2) mean.w <- sum(x * w) / sum(w) x.sd.w <- sqrt((sum.w / (sum.w^2 - sum.w2)) * sum(w * (x - mean.w)^2)) return(x.sd.w) } # This rescaling function simply puts the scores back into the metric of the # original questions (Keep in mind, some of the final scores may be slightly # below '1' and some slightly above '4'; this results because we modeled the # latent 'true scores'. re.scale <- function(f.scores, raw.data, loadings){ fz.scores <- (f.scores + mean(f.scores))/(sd(f.scores)) means <- apply(raw.data, 1, weighted.mean, w = loadings) sds <- apply(raw.data, 1, weighted.sd, w = loadings) grand.mean <- mean(means) grand.sd <- mean(sds) final.scores <- ((fz.scores * grand.sd) + grand.mean) return(final.scores) } final.scores.ss1 <- re.scale(subset.1.scores, subset.1, fa1.loadings) # Now, we have one set of scores / one variable, which contains each participant's # score of cognitive engagement (subset.1 = ss1: q1 - q4). summary(final.scores.ss1) # We can go ahead and apply this same general procedure to the other two sets # of questions (q5 - q9; physical engagement, & q10 - q14; social engagement). # However, because q5 - q9 have a 5 point Likert response format, we need a # second recoding function to put those responses into numeric format. recoding.5.point <- function(data){ new.data <- data.frame(matrix(rep(0, nrow(data)*ncol(data)), ncol = ncol(data))) for (j in 1:ncol(data)){ for (i in 1:nrow(data)){ if(data[i,j] == "lethargic"){new.data[i,j] <- 1} if(data[i,j] == "not very active"){new.data[i,j] <- 2} if(data[i,j] == "no difference"){new.data[i,j] <- 3} if(data[i,j] == "more active"){new.data[i,j] <- 4} if(data[i,j] == "hyperactive"){new.data[i,j] <- 5} } } names(new.data) <- names(data) return(new.data) } subset.2 <- recoding.5.point(data.df[,11:15]); summary(subset.2) subset.3 <- recoding.4.point(data.df[,16:20]); summary(subset.3) # Here, we create a single function which will take the numeric data and apply # the 1 factor model, extract the factor scores, extract the factor loadings, # and apply the re-scaling function. This function returns a list object which # includes two elements: the rescaled scores and the factor loadings. get.scores.fun <- function(data){ fact <- factanal(data, factors = 1, scores = "regression") f.scores <- fact$scores[,1] f.loads <- fact$loadings[,1] rescaled.scores <- re.scale(f.scores, data, f.loads) output.list <- list(rescaled.scores, f.loads) names(output.list) <- c("rescaled.scores","factor.loadings") return(output.list) } # Now, we can apply the above function to subset.2 and subset.3. scores.and.loadings.2 <- get.scores.fun(subset.2) scores.and.loadings.2$factor.loadings scores.and.loadings.3 <- get.scores.fun(subset.3) scores.and.loadings.3$factor.loadings # Then, we can extract the scores. final.scores.ss2 <- scores.and.loadings.2$rescaled.scores summary(final.scores.ss2) final.scores.ss3 <- scores.and.loadings.3$rescaled.scores summary(final.scores.ss3) # Now, we can create a data frame which contains just the composite scores # for each subset or section of the questionnaire. composite.data <- data.frame(final.scores.ss1, final.scores.ss2, final.scores.ss3) names(composite.data) <- c("cognitive", "physical", "social") summary(composite.data) # Now that we have numeric composite scores representing the three subsets # of questions, we can focus on the global or general activity construct. # This is relatively easy; we can simply apply the get scores function to # the new composite data frame, because the known model is a hierarchical model. # Keep in mind, we could have simply run one factor analysis, specifying 3 # factors instead of running each subset separately. However, if we had # run one model (with 3 factors), we would have had some cross-loading # questions. If those cross-loadings were substantial, then they might call # into question the factor structure (i.e. question 2 was supposed to load # on factor 1, but instead loaded most on factor 3…). If this data was known # to support a bifactor factor model, then we would submit the 14 original # questions (converted to numeric variables) to the 'get scores function' # or apply the 'omega' function from the 'psych' package. In the bifactor # model, the global or general factor is supported by all the questions; # in the hierarchical factor model, the global or general factor is supported # by the sub-factors. global.activity <- get.scores.fun(composite.data) # Take note of the factor loadings which contributed to the composite scores; # again, here they are all roughly equal, but often they are not and therefore # need to be included in any intepretation or use of the composite scores. global.activity$factor.loadings # Now, we can extract the global composite scores and then add the resulting # vector to the existing composite data frame. global.activity <- global.activity$rescaled.scores composite.data <- data.frame(data.df[,1:6], composite.data, global.activity) summary(composite.data) # Cleaning up the workspace. ls() rm(fa.subset.1, fa1.loadings, final.scores.ss1, final.scores.ss2, final.scores.ss3, get.scores.fun, global.activity, re.scale, recoding.4.point, recoding.5.point, scores.and.loadings.2, scores.and.loadings.3, subset.1, subset.1.scores, subset.2, subset.3, weighted.sd) ls() # End; Feb. 8, 2012.