#
#
###### Showing how to create empirically defensible Composite (or Indicator) Scores.
#
# Essentially, we will be using factor analysis to generate the composite
# scores. In a very real sense, the best composite scores are factor scores
# when there is a known structure of the data, or a strongly supported belief
# about that structure.
# There are a few ways to go about generating the composite, or factor, scores
# based on what type of structure you believe the data contains (e.g. single
# factor, multiple correlated factors, multiple uncorrelated factors, bifactor
# model, hierarchical factor model, etc.); and how the variables are measured
# or scaled (e.g., nominal scaled, ordinal or Likert scaled, interval/ratio
# scaled, etc.).
# The general procedure for generating composite / indicator scores includes
# the following steps:
# (1) convert, or recode, the nominal or ordinal (Likert) responses to
#     numeric responses;
# (2) apply a factor analysis model which reflects the known structure of
#     the variables;
# (3) save the factor scores and factor loadings;
# (4) rescale the factor scores using the factor loadings, the weighted mean,
#     and the weighted standard deviation of the original data so that the
#     composite scores reflect (as nearly as possible) the semantic (word)
#     meaning of the original responses.
# In this process, the factor loadings serve as weights for the weighted mean
# and weighted standard deviation calculations.
# First, import some example data (the data file used below has been simulated).
data.df <- read.table(
"http://www.unt.edu/rss/class/Jon/R_SC/Module4/CompositeIndicators_001.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
summary(data.df); nrow(data.df)
# In this example, we were interested in the activity levels of persons diagnosed
# with diabetes. We have 3 groups of questions:
# q1 - q4 assessed cognitive activity,
# q5 - q9 assessed physical activity,
# and q10 - q14 assessed social activity.
# The response choices for q1 - q4 and q10 - q14 were the same:
# 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree
# The response choices for q5 - q9 were:
# 1 = lethargic, 2 = not very active, 3 = no difference, 4 = more active, 5 = hyperactive
# In the current example, we have three groups of questions; each group measures
# a particular latent construct (i.e. indirect measurement) and those three
# latent constructs support a general activity construct. The above
# statements represent the known or strongly supported hypothesis of the data's
# (factor) structure; in this example, a hierarchical factor model.
# First, consider the situation where you have a set of Likert scaled items
# which you believe are the result of one continuously scaled latent factor.
# In that case, you would recode the ordinal responses as numeric
# (interval/ratio), run a one-factor model, collect the factor scores, and
# rescale the factor scores as composite scores which reflect the original
# metric.
# As an example, consider q1, q2, q3, q4.
subset.1 <- data.df[,7:10]; summary(subset.1)
# Next, recode the responses into numbers which reflect the ordinality of the
# original responses (the words). This can be tricky sometimes because R cannot
# tell if "agree" should be a 1, 2, 3, etc. So, it's best to impose the values
# on specific labels using a fairly simple function.
recoding.4.point <- function(data){
new.data <- data.frame(matrix(rep(0, nrow(data)*ncol(data)), ncol = ncol(data)))
for (j in 1:ncol(data)){
for (i in 1:nrow(data)){
if(data[i,j] == "strongly disagree"){new.data[i,j] <- 1}
if(data[i,j] == "disagree"){new.data[i,j] <- 2}
if(data[i,j] == "agree"){new.data[i,j] <- 3}
if(data[i,j] == "strongly agree"){new.data[i,j] <- 4}
}
}
names(new.data) <- names(data)
return(new.data)
}
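# As a side note, the same recoding can be written more compactly by letting
# factor() impose the label order (a sketch only; the 'recode.likert' name and
# 'four.point' vector are illustrative and are not used elsewhere in this script):

```r
# Illustrative alternative recoder: factor() fixes the label order and
# as.numeric() returns the corresponding integer codes.
recode.likert <- function(x, labels){
as.numeric(factor(x, levels = labels))
}
four.point <- c("strongly disagree", "disagree", "agree", "strongly agree")
recode.likert(c("agree", "strongly disagree", "strongly agree"), four.point)
# 3 1 4
```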
head(subset.1)
subset.1 <- recoding.4.point(subset.1); head(subset.1)
# Since we now have a properly re-coded (and numeric) version of the subset/data,
# we can apply a one-factor model. Remember, factor analysis assumes the variables
# are linearly related; if they are not, then factor analysis is not appropriate.
fa.subset.1 <- factanal(x = subset.1, factors = 1, scores = "regression")
fa.subset.1
# To extract the factor scores, simply use the "$scores" operator on the factor
# analysis object.
subset.1.scores <- as.vector(fa.subset.1$scores)
summary(subset.1.scores)
# You'll notice the factor scores have a mean of zero, so in order for them to
# have (semantic) meaning, we must convert them back into the scale of the
# original 1 - 4 responses. To do this, we will need three things; the new
# factor scores, the raw data, and the factor loadings. The factor loadings
# augment the meaning of the composite scores by providing insight into how
# each question contributed to the composite scores. Notice in this example
# each of the four questions contributed approximately equally to the
# composite scores (i.e. the loadings are roughly equal). However, if you have
# loadings which are not (roughly) equal, then you must communicate the loadings
# (and why they are important) to anyone interpreting or using the composite scores.
fa1.loadings <- fa.subset.1$loadings[,1]
fa1.loadings
# Small function for calculating the weighted standard deviation; needed below.
weighted.sd <- function(x, w){
sum.w <- sum(w)
sum.w2 <- sum(w^2)
mean.w <- sum(x * w) / sum(w)
x.sd.w <- sqrt((sum.w / (sum.w^2 - sum.w2)) * sum(w * (x - mean.w)^2))
return(x.sd.w)
}
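# As a quick sanity check: with equal weights, the weighted standard deviation
# above reduces to the ordinary unbiased standard deviation (the function is
# repeated here only so the snippet runs on its own; the data vector is made up):

```r
# Definition repeated from above so this snippet is self-contained.
weighted.sd <- function(x, w){
sum.w <- sum(w)
sum.w2 <- sum(w^2)
mean.w <- sum(x * w) / sum(w)
sqrt((sum.w / (sum.w^2 - sum.w2)) * sum(w * (x - mean.w)^2))
}
x <- c(1, 2, 2, 3, 4, 4, 4)
# With all weights equal to 1, this matches sd(), i.e. the n - 1 denominator.
all.equal(weighted.sd(x, rep(1, length(x))), sd(x))
# TRUE
```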
# This rescaling function simply puts the scores back into the metric of the
# original questions. (Keep in mind, some of the final scores may be slightly
# below '1' and some slightly above '4'; this results because we modeled the
# latent 'true scores'.)
re.scale <- function(f.scores, raw.data, loadings){
fz.scores <- (f.scores - mean(f.scores))/(sd(f.scores))
means <- apply(raw.data, 1, weighted.mean, w = loadings)
sds <- apply(raw.data, 1, weighted.sd, w = loadings)
grand.mean <- mean(means)
grand.sd <- mean(sds)
final.scores <- ((fz.scores * grand.sd) + grand.mean)
return(final.scores)
}
final.scores.ss1 <- re.scale(subset.1.scores, subset.1, fa1.loadings)
# Now, we have one set of scores / one variable, which contains each participant's
# score of cognitive activity (subset.1 = ss1: q1 - q4).
summary(final.scores.ss1)
# We can go ahead and apply this same general procedure to the other two sets
# of questions (q5 - q9: physical activity, & q10 - q14: social activity).
# However, because q5 - q9 have a 5 point Likert response format, we need a
# second recoding function to put those responses into numeric format.
recoding.5.point <- function(data){
new.data <- data.frame(matrix(rep(0, nrow(data)*ncol(data)), ncol = ncol(data)))
for (j in 1:ncol(data)){
for (i in 1:nrow(data)){
if(data[i,j] == "lethargic"){new.data[i,j] <- 1}
if(data[i,j] == "not very active"){new.data[i,j] <- 2}
if(data[i,j] == "no difference"){new.data[i,j] <- 3}
if(data[i,j] == "more active"){new.data[i,j] <- 4}
if(data[i,j] == "hyperactive"){new.data[i,j] <- 5}
}
}
names(new.data) <- names(data)
return(new.data)
}
subset.2 <- recoding.5.point(data.df[,11:15]); summary(subset.2)
subset.3 <- recoding.4.point(data.df[,16:20]); summary(subset.3)
# Here, we create a single function which will take the numeric data and apply
# the 1 factor model, extract the factor scores, extract the factor loadings,
# and apply the re-scaling function. This function returns a list object which
# includes two elements: the rescaled scores and the factor loadings.
get.scores.fun <- function(data){
fact <- factanal(data, factors = 1, scores = "regression")
f.scores <- fact$scores[,1]
f.loads <- fact$loadings[,1]
rescaled.scores <- re.scale(f.scores, data, f.loads)
output.list <- list(rescaled.scores, f.loads)
names(output.list) <- c("rescaled.scores","factor.loadings")
return(output.list)
}
# Now, we can apply the above function to subset.2 and subset.3.
scores.and.loadings.2 <- get.scores.fun(subset.2)
scores.and.loadings.2$factor.loadings
scores.and.loadings.3 <- get.scores.fun(subset.3)
scores.and.loadings.3$factor.loadings
# Then, we can extract the scores.
final.scores.ss2 <- scores.and.loadings.2$rescaled.scores
summary(final.scores.ss2)
final.scores.ss3 <- scores.and.loadings.3$rescaled.scores
summary(final.scores.ss3)
# Now, we can create a data frame which contains just the composite scores
# for each subset or section of the questionnaire.
composite.data <- data.frame(final.scores.ss1, final.scores.ss2, final.scores.ss3)
names(composite.data) <- c("cognitive", "physical", "social")
summary(composite.data)
# Now that we have numeric composite scores representing the three subsets
# of questions, we can focus on the global or general activity construct.
# This is relatively easy; we can simply apply the get scores function to
# the new composite data frame, because the known model is a hierarchical model.
# Keep in mind, we could have simply run one factor analysis, specifying 3
# factors instead of running each subset separately. However, if we had
# run one model (with 3 factors), we would have had some cross-loading
# questions. If those cross-loadings were substantial, then they might call
# into question the factor structure (i.e. question 2 was supposed to load
# on factor 1, but instead loaded most on factor 3…). If these data were known
# to support a bifactor model, then we would submit the 14 original
# questions (converted to numeric variables) to the 'get scores function'
# or apply the 'omega' function from the 'psych' package. In the bifactor
# model, the global or general factor is supported by all the questions;
# in the hierarchical factor model, the global or general factor is supported
# by the sub-factors.
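# The single-model alternative mentioned above can be sketched on simulated
# numeric items (the seed, sample size, and item construction below are
# illustrative assumptions, not the module's data file); any cross-loadings
# would show up directly in the rotated loadings matrix:

```r
# Sketch: one factanal() call extracting 3 factors from 14 simulated items;
# items 1-4, 5-9, and 10-14 are each built from a separate latent variable.
set.seed(11)
n <- 300
g1 <- rnorm(n); g2 <- rnorm(n); g3 <- rnorm(n)
items <- data.frame(
sapply(1:4, function(i) g1 + rnorm(n, sd = 0.6)),
sapply(1:5, function(i) g2 + rnorm(n, sd = 0.6)),
sapply(1:5, function(i) g3 + rnorm(n, sd = 0.6)))
fa.3 <- factanal(items, factors = 3, rotation = "varimax")
# Suppress small loadings so the block structure (and any cross-loadings)
# is easy to see.
print(fa.3$loadings, cutoff = 0.3)
```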
global.activity <- get.scores.fun(composite.data)
# Take note of the factor loadings which contributed to the composite scores;
# again, here they are all roughly equal, but often they are not and therefore
# need to be included in any interpretation or use of the composite scores.
global.activity$factor.loadings
# Now, we can extract the global composite scores and then add the resulting
# vector to the existing composite data frame.
global.activity <- global.activity$rescaled.scores
composite.data <- data.frame(data.df[,1:6], composite.data, global.activity)
summary(composite.data)
# Cleaning up the workspace.
ls()
rm(fa.subset.1, fa1.loadings, final.scores.ss1, final.scores.ss2, final.scores.ss3,
get.scores.fun, global.activity, re.scale, recoding.4.point, recoding.5.point,
scores.and.loadings.2, scores.and.loadings.3, subset.1, subset.1.scores,
subset.2, subset.3, weighted.sd)
ls()
# End; Feb. 8, 2012.