R code
#Create a dataset. In this arena most stat packages are the same, though with
#SPSS and SAS using old mainframe-style code, I'd probably go with R. However,
#I would avoid text-based entry altogether, as it leaves more room for error
#compared to spreadsheet entry.
#There are many ways to do this in R, but I most often use matrix or
#data.frame. In these examples the only real difference is whether you find
#entering by rows or by columns easier; however, some functions specifically
#require matrices while others require data frames.
#Using the matrix function: c() combines whatever follows in parentheses, nrow
#and ncol give the number of rows and columns respectively, and byrow=T (or
#TRUE) tells it to fill up rows first.
Cars1=matrix(c(
"AMC Concord", 22, 2930, 4099,
"AMC Pacer", 17, 3350, 4749,
"AMC Spirit", 22, 2640, 3799,
"Buick Century", 20, 3250, 4816,
"Buick Electra", 15, 4080, 7827), nrow=5,ncol=4,byrow=T)
colnames(Cars1)=c("Model", "MPG","WEIGHT","PRICE")
Cars1
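#Note: because c() here mixes text and numbers, everything is coerced to
#character, so this is a character matrix; the numeric columns would need
#converting (e.g. as.numeric(Cars1[,"MPG"])) before doing any math on them.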
#Using data.frame, enter each variable at once and combine. Here the object
#names become the column names.
Model=c("AMC Concord","AMC Pacer","AMC Spirit","Buick Century","Buick Electra")
MPG=c(22,17,22,20,15)
WEIGHT=c(2930,3350,2640,3250,4080)
PRICE=c(4099,4749,3799,4816,7827)
Cars1=data.frame(Model,MPG,WEIGHT,PRICE)
#Yet another way, which wraps the matrix call inside the data.frame function.
Cars1=data.frame(matrix(c(
"AMC Concord", 22, 2930, 4099,
"AMC Pacer", 17, 3350, 4749,
"AMC Spirit", 22, 2640, 3799,
"Buick Century", 20, 3250, 4816,
"Buick Electra", 15, 4080, 7827), nrow=5,ncol=4,byrow=T))
colnames(Cars1)=c("Model", "MPG","WEIGHT","PRICE")
Cars1
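#As with the plain matrix version above, the mixed c() call leaves every
#column non-numeric, so convert the numeric ones back before using them, e.g.:
Cars1$MPG=as.numeric(as.character(Cars1$MPG))
Cars1$WEIGHT=as.numeric(as.character(Cars1$WEIGHT))
Cars1$PRICE=as.numeric(as.character(Cars1$PRICE))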
#To save a dataset, after creating or importing one with a given name, e.g.
#Cars1, just give it a filename and a place to go
save("Cars1", file="C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")
#To get it back later, one can then use the attach or load functions
attach("C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")
Cars1
load("C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")
Cars1
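#Note: load() restores the object into the workspace under the name it was
#saved with, while attach() makes the file's contents available on the search
#path without copying them into the workspace.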
library(foreign)
#Import SPSS file
Dataset <- read.spss("C:/Documents and Settings/mjc0016/Desktop/ozone.sav",
use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
#Import Excel file.
#Honestly, I wouldn't do an Excel import from the command line. Whereas SAS
#and SPSS ignore most of what might be in an Excel spreadsheet (and hence can
#break if there are formulas, colored cells, etc.), R uses a generic database
#approach covering Access and others along with Excel, and the code is not too
#intuitive. I would suggest R-commander and the menu approach.
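#If you do want a scripted route, one simple workaround (assuming the sheet
#has first been saved from Excel as a .csv file; the path below is just an
#example) is base R's read.csv:
Dataset=read.csv("C:/Documents and Settings/mjc0016/Desktop/ozone.csv")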
Sort
#Create a dataset
ID=1:10
Age=round(rnorm(10,50,1)) #round makes the ages, with mean of 50 and sd of 1, whole numbers
diag=c("Depression","Bipolar");Diagnosis=sample(diag,10,replace=T)
data=data.frame(ID,Age,Diagnosis)
data
attach(data)
#Sort by Age
databyAge=data[order(Age),]
#Sorting by more than one variable: sorts by Age first, then by Diagnosis
#within age.
databyAgeandDiag=data[order(Age,Diagnosis),]
#Same as above except the oldest will be displayed first (the minus sign
#reverses the sort on Age)
databyreverseAgeDiag=data[order(-Age,Diagnosis),]
Split
#Create a control-treatment design dataset of 50 total. Note that as this is
#random sampling you will not (necessarily) get equal groups unless you force
#it to, so the group sizes for the scores are taken from the sample itself.
group=sample(c("Control","Treatment"),50,replace=T)
ContTreat=sort(group)
n=table(ContTreat) #how many cases landed in each group
x=rnorm(n["Control"],10,2)
y=rnorm(n["Treatment"],15,3)
scores=c(x,y)
dataset=data.frame(ContTreat, scores)
#One way to do the t.test: comparing samples of two groups. The first line
#shows the mean of the two groups; the second does a t.test by telling the
#program what the two samples to be compared are, i.e., the control group vs.
#the treatment group.
lapply(split(scores, ContTreat),mean)
t.test(split(scores, ContTreat)$Control,split(scores, ContTreat)$Treatment)
#Another way to do it, in terms of the general linear model y = bx: in this
#case, scores predicted by the grouping variable
t.test(scores~ContTreat)
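#Note that t.test defaults to the Welch (unequal variances) version; add
#var.equal=TRUE if you want the classic pooled-variance test.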
Merge
#Adding to pre-existing datasets. Note that the 'cbind' and 'rbind' functions
#assume perfectly matching datasets while 'merge' does not.
#Create some variables to work with as here, or import your own dataset and
#work with those.
x=rnorm(50)
y=rnorm(50)
z=rnorm(50)
#Initial dataset only contains 50 cases, and 2 variables
dataset=data.frame(x,y)
#Add a variable (z); the new dataset now has 3 variables x, y, z. Uses cbind,
#i.e. column bind.
newdata.variables=cbind(dataset, z)
newdata.variables
#Alternate method
newdata.variables=data.frame(dataset,z)
#Add cases (n=10). Uses rbind, i.e. row bind. To the new dataset created
#above, add 10 new cases
x=rnorm(10)
y=rnorm(10)
z=rnorm(10)
cases=data.frame(x,y,z)
newdata.cases=rbind(newdata.variables, cases)
newdata.cases
#Adding a variable with missing values
z=rnorm(45)
#create a dataset with 2 variables, ID and Z
zvar=data.frame(ID=1:45,z)
newdata=data.frame(ID=1:50,dataset)
newdata2=merge(newdata,zvar, by="ID", all.x=T)
newdata2
#Adding cases with missing variables
#Having already made the dataset 'cases' above, I've made a new one that only
#contains the first two variables. Note that if I had put [1:2,] it would have
#selected only the first 2 rows.
cases2=cases[,1:2]
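#A sketch of one way to actually append these incomplete cases: rbind requires
#matching columns, so fill the missing variable with NA first. The name
#newdata.cases2 is just an arbitrary choice.
cases2$z=NA
newdata.cases2=rbind(newdata.variables, cases2)
newdata.cases2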
Using Subsets of the data
#Examining only part of a dataset is terribly easy in R. In general,
#Dataset[rows,columns] is all it takes, but it can be done in other ways. In
#the following, assume an original dataset of 20 variables and N = 100.
#Create such a dataset if you want (20*100 = 2000 values are needed)
x=rnorm(2000)
dataset=matrix(sample(x),ncol=20,nrow=100)
#take the first 7 variables
(newdata= dataset[,1:7])
#take variables from the middle of a dataset
(newdata= dataset[,15:19])
#take the first 7 cases
(newdata= dataset[1:7,])
#take the first 50 cases of the first 5 variables
(newdata= dataset[1:50,1:5])
#Get really specific: the first 10 and last 50 cases of the first 3 and last
#2 variables.
#I create objects representing the specific rows and variables for clarity.
rows=c(1:10,51:100)
variables=c(1:3,19:20)
newdata=dataset[rows,variables]
#Alternate method: the 'subset' function. The following uses the airquality
#dataset that comes preloaded with R (in the datasets package).
airquality
#Look at only the months before August
newdata=subset(airquality,Month < 8)
newdata
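#subset can also pick columns at the same time via its select argument, e.g.
#keeping only Ozone and Temp for the months before August:
newdata=subset(airquality, Month < 8, select=c(Ozone,Temp))
newdata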
Recoding Variables
#Numeric into factor: almost always a very bad idea. I kept the sd as it
#typically is for the WAIS, but you might have to rerun or increase the sample
#size to actually get an 'outlier'
(IQ <- round(rnorm(100,100,15)))
IQlabels = cut(IQ, breaks=c(0,70, 130,200), labels=c('Below Average', 'Average',
'Above Average'))
IQlabels
#Make a factor numeric. Note this returns the underlying integer codes
#(1, 2, 3), not the original IQ scores.
(newIQ=as.numeric(IQlabels))
Computing new variables
#R has a built-in suite of operators for calculations on matrices. Because of
#this it is extremely easy to do any sort of computation, and there is often
#nothing special needed; i.e., you don't need a special 'compute' command to
#create new variables, and in general its capabilities here are more advanced
#than those of the other two programs.
newvar=var1+var2
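#A self-contained sketch (var1 and var2 here are just made-up vectors):
var1=rnorm(10)
var2=rnorm(10)
newvar=var1+var2 #arithmetic is vectorized, so this adds element by element
newvar2=log(abs(var1)) #any function of a variable works the same way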
If... else
Use ifelse to create a new variable whose values depend on the values of other variables. R also has a more traditional (separate) if/else construction similar to that of SPSS and SAS. A sketch of ifelse follows.
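#A minimal sketch of ifelse; x and sizeclass are made-up names for
#illustration:
x=rnorm(20)
sizeclass=ifelse(x > 0, "positive", "negative") #one label per case
table(sizeclass)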
Other functions regarding data manipulation: reshape, stack, melt
Obtaining Frequency Information
#Create a variable called grades based on categorical data, i.e. letter
#grades. 'catdata' and 'grades' are arbitrary names I chose, 50 is the size of
#the sample, replace=TRUE says to sample with replacement, and prob= assigns
#the weights by which to sample; in this case most of the grades will be As
#and Bs.
catdata=c("A","B","C","D")
grades=sample(catdata, 50, replace=TRUE, prob=c(.5,.35,.1,.05))
grades
#Simple frequencies
table(grades)
#Barplot based on the table above, with titles and blue bars.
barplot(table(grades), main="Grades for Year 2008", xlab="Letter Grade", col="blue")
#The following is akin to PROC FREQ in SAS or CROSSTABS in SPSS
library(gmodels)
CrossTable(grades)
#create new variables
newvar=rnorm(100,100,15)
newvar2=rnorm(100,100,15)
#make a new dataset of the two variables
dataset=cbind(newvar, newvar2)
#descriptives
summary(dataset)
#alternative approach (there are many); this uses the psych package
library(psych)
describe(dataset)
#histogram in density form
hist(newvar,freq=F, col="wheat")
curve(dnorm(x,mean=mean(newvar),sd=sd(newvar)), col="red",add=TRUE)
#Note that this plots "density" rather than frequencies (hence freq=FALSE;
#with freq=TRUE, or just T, it would have plotted frequencies instead, and
#TRUE is the default, so it is not necessary to add it to the hist function).
#While SPSS and others overlay a normal curve on a frequency histogram, doing
#so is incorrect, rather like implying age and income are on the same scale.
#As the density and frequency graphs look the same, however, this is more of
#a nitpicky statistician thing.
Two dependent samples
Both samples must be specified, along with paired=TRUE:
t.test(sample1, sample2, mu=value, alternative = c("two.sided", "less", "greater"), paired = TRUE, conf.level = value)
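#A minimal sketch with made-up pre/post scores:
pre=rnorm(20,100,15)
post=pre+rnorm(20,5,5) #post scores shifted up by about 5 points on average
t.test(pre, post, paired=TRUE) #tests whether the mean difference is zero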
Multiple t-tests
Regression analysis can be conducted in a multitude of ways via a number of packages. By far the most common is the simple lm function from the base package. Others I use a lot are ols from the Design package, lmrob from the robustbase package, and lmRob from the robust package. All of them are plottable by default, though ols goes out of its way not to make this an intuitive process like the others. All of the model objects contain many more things within them, such as fitted values, residuals, and more.
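#A minimal sketch of lm with made-up variables x and y:
x=rnorm(100)
y=2*x+rnorm(100)
mymodel=lm(y~x) #fit the model
summary(mymodel) #coefficients, R-squared, tests
fitted(mymodel) #fitted values stored in the model object
resid(mymodel) #residuals likewise
plot(mymodel) #basic diagnostic plots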
Testing assumptions
After fitting a model in R-commander, it's terribly easy to test assumptions graphically and statistically in the menus, e.g. Model/Graphs/Basic Diagnostic Plots and Model/Numerical Diagnostics, and that is my recommendation. However, I provide basic code below for the statistical tests, which use the lmtest library (first three) and the base package. Note, however, that there are other tests available.
bptest(DV ~ PREDICTOR, varformula = ~fitted.values(mymodel), studentize=FALSE, data=mydata) #Breusch-Pagan test for heteroscedasticity
dwtest(DV ~ PREDICTOR, alternative="two.sided", data=mydata)
#Durbin-Watson statistic for autocorrelation among residuals
resettest(DV ~ PREDICTOR, power=2:3, type="regressor", data=mydata) #RESET test for nonlinearity
shapiro.test(mymodel$residuals) #Shapiro-Wilk test for non-normality
Dealing with violations of assumptions
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mydata, statistic=bs,
R=1000, formula=mymodelformula)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # predictor coefficient
boot.ci(results, type="bca", index=1) # intercept CI. Type 'bca' is the
# bias-corrected and accelerated version of the bootstrap, but there are other options.
?boot.ci
boot.ci(results, type="bca", index=2) # predictor CI
# the same approach works for R-squared
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
results <- boot(data=mydata, statistic=rsq,
R=1000, formula=mymodelformula)
results
plot(results)
boot.ci(results, type="bca")
Effect Sizes
Whatever effect size you want, even if R doesn't have it built in, you can calculate it with R, and probably easily. It does contain the basic ones such as d, and every program calculates R-squared. R will also easily provide interval estimates, and one can bootstrap and robustify as well; I provide a function for the latter. All of the following (except my function, obviously) require the MBESS package.