R code

Entering Data

#Create a dataset. In this arena most stat packages are much the same, though since SPSS and SAS
#still use old mainframe-style code, I'd probably go with R. However, I would avoid text-based entry
#altogether as it leaves more room for error than spreadsheet entry.


#There are many ways to do this in R, but I most often use matrix or data.frame.
#In these examples the only real difference is whether you find entering by rows or by columns easier.
#Note, however, that some functions specifically require matrices while others require data frames.


#Using the matrix function. "c" combines whatever follows in parentheses, and nrow and ncol give
#the number of rows and columns respectively. byrow=T (or TRUE) tells it to fill rows first.
#Note that because the model names are text, the whole matrix is stored as character data.

Cars1=matrix(c(
"AMC Concord", 22, 2930, 4099,
"AMC Pacer", 17, 3350, 4749,
"AMC Spirit", 22, 2640, 3799,
"Buick Century", 20, 3250, 4816,
"Buick Electra", 15, 4080, 7827), nrow=5,ncol=4,byrow=T)
colnames(Cars1)=c("Model", "MPG","WEIGHT","PRICE")
Cars1

#Using data.frame, enter each variable as its own vector and combine. Here the object names become the column names.
Model=c("AMC Concord","AMC Pacer","AMC Spirit","Buick Century","Buick Electra")
MPG=c(22,17,22,20,15)
WEIGHT=c(2930,3350,2640,3250,4080)
PRICE=c(4099,4749,3799,4816,7827)
Cars1=data.frame(Model,MPG,WEIGHT,PRICE)

#Yet another way, wrapping the matrix function inside data.frame. As in the matrix example, the numeric
#columns end up stored as text and would need converting before analysis.
Cars1=data.frame(matrix(c(
"AMC Concord", 22, 2930, 4099,
"AMC Pacer", 17, 3350, 4749,
"AMC Spirit", 22, 2640, 3799,
"Buick Century", 20, 3250, 4816,
"Buick Electra", 15, 4080, 7827), nrow=5,ncol=4,byrow=T))
colnames(Cars1)=c("Model", "MPG","WEIGHT","PRICE")
Cars1

#To save a dataset, after creating or importing one with a given name, e.g. Cars1, just give it a filename and a place to go.
save("Cars1", file="C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")

#To retrieve it later, one can use the attach or load functions.
attach("C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")
Cars1

load("C:/Documents and Settings/mjc0016/Desktop/5700/Code/Cars1.rda")
Cars1


Importing Data

library(foreign)

#Import SPSS file
Dataset <- read.spss("C:/Documents and Settings/mjc0016/Desktop/ozone.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)


#Import Excel file.
#Honestly, I wouldn't do an Excel import from the command line. SAS and SPSS can stumble over much of what
#might be in an Excel spreadsheet (formulas, colored cells, etc.), and R's command-line route goes through a
#general database interface (the same approach used for Access and other formats) whose code is not very
#intuitive. I would suggest R-commander and the menu approach, or the simple workaround sketched below.
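
#If you do want a command-line route, one simple workaround (just a sketch; the file path here is made up) is
#to save the sheet from Excel as a .csv file and read it with read.csv from base R.
Dataset2 <- read.csv("C:/path/to/ozone.csv", header=TRUE)   #header=TRUE treats the first row as variable names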


Manipulating Data

Sort

#Create a dataset
ID=1:10
Age=round(rnorm(10,50,1)) #round turns the ages (mean 50, sd 1) into whole numbers
diag=c("Depression","Bipolar");Diagnosis=sample(diag,10,replace=T)
data=data.frame(ID,Age,Diagnosis)
data

attach(data)

#Sort by Age
databyAge=data[order(Age),]

#Sorting by more than one variable: by Age first, then by Diagnosis within each age.
databyAgeandDiag=data[order(Age,Diagnosis),]

#Same as above except the oldest are displayed first (the minus sign reverses the order of a numeric variable).
databyreverseAgeDiag=data[order(-Age,Diagnosis),]

Split

#Create a control-treatment design dataset of 50 cases total. Note that because this is random sampling you
#will not (necessarily) get equal groups, so the score vectors below are sized to match the actual group counts.

group=sample(c("Control","Treatment"),50, replace=T)
ContTreat=sort(group)
ncontrol=sum(ContTreat=="Control")  #how many cases ended up in the control group
x=rnorm(ncontrol,10,2)              #control scores
y=rnorm(50-ncontrol,15,3)           #treatment scores
scores=c(x,y)
dataset=data.frame(ContTreat, scores)

#One way to do the t-test is to compare the samples from the two groups directly. The first line below shows
#the mean of each group; the second does a t-test by telling the program which two samples to compare,
#i.e. the control group vs. the treatment group.

lapply(split(scores, ContTreat),mean)
t.test(split(scores, ContTreat)$Control,split(scores, ContTreat)$Treatment)

#Another way to do it, in terms of the general linear model y = bx: here, scores predicted by the grouping variable.

t.test(scores~ContTreat)

Merge

#Adding to pre-existing datasets. Note that the 'cbind' and 'rbind' functions assume perfectly matching
#datasets while 'merge' does not.

#Create some variables to work with as here, or import your own dataset and work with those.

x=rnorm(50)
y=rnorm(50)
z=rnorm(50)

#Initial dataset only contains 50 cases, and 2 variables
dataset=data.frame(x,y)

#Add variables (z), new dataset now has 3 variables x,y,z. Uses cbind i.e. column bind.
newdata.variables=cbind(dataset, z)
newdata.variables

#Alternate method
newdata.variables=data.frame(dataset,z)

#add cases (n=10). Uses rbind i.e. row bind. To the new dataset created above, add 10 new cases
x=rnorm(10)
y=rnorm(10)
z=rnorm(10)
cases=data.frame(x,y,z)
newdata.cases=rbind(newdata.variables, cases)
newdata.cases

#Adding a variable with missing values
z=rnorm(45)
#create a dataset with 2 variables, ID and Z
zvar=data.frame(ID=1:45,z)
newdata=data.frame(ID=1:50,dataset)
newdata2=merge(newdata,zvar, by="ID", all.x=T)  #all.x=T keeps every row of newdata; the 5 IDs with no z get NA
newdata2

#Adding cases with missing variables

#Having already made the dataset 'cases' above, I've made a new one that only contains the first two
#variables. Note that if I had put [1:2,] it would have selected only the first 2 rows.

cases2=cases[,1:2]
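
#One way to finish the job (a sketch): rbind needs matching column names, so give cases2 a z column of NAs
#(missing values) and then append it to the dataset created earlier.
cases2$z=NA
newdata.cases2=rbind(newdata.variables, cases2)
newdata.cases2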

Using Subsets of the data

#Examining only part of a dataset is terribly easy in R. In general Dataset[rows,columns] is all it takes,
#though it can be done in other ways. In the following, assume an original dataset of 20 variables and N = 100.

#create such a dataset if you want
x=rnorm(1000)
dataset=matrix(sample(x),ncol=20,nrow=100)

#take the first 7 variables
(newdata= dataset[,1:7])

#take variables from the middle of a dataset
(newdata= dataset[,15:19])

#take the first 7 cases
(newdata= dataset[1:7,])

#take the first 50 cases of the first 5 variables
(newdata= dataset[1:50,1:5])

#Get really specific, the first 10 and last 50 cases of the first 3 and last 2 variables.
#I create objects representing the specific rows and variables for clarity.

rows=c(1:10,51:100)
variables=c(1:3,19:20)
newdata=dataset[rows,variables]

#Alternate method: the 'subset' function. The following uses the airquality dataset that comes
#preloaded with R (in the datasets package).

airquality

#Look at only the months before August
newdata=subset(airquality,Month < 8)
newdata

Recoding Variables

#Numeric into factor. Almost always a very bad idea. I kept the sd at its typical WAIS value, but you
#might have to rerun or increase the sample size to actually get a score outside the 'Average' range.


(IQ <- round(rnorm(100,100,15)))
IQlabels = cut(IQ, breaks=c(0,70, 130,200), labels=c('Below Average', 'Average', 'Above Average'))
IQlabels

#Make a factor numeric. Note that as.numeric on a factor returns the underlying level codes (here 1, 2, 3),
#not the original IQ scores.
(newIQ=as.numeric(IQlabels))
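
#Related gotcha: if a factor's labels are themselves numbers, as.numeric still returns the level codes, not
#the values. A common fix (a small sketch) is to convert to character first.
f=factor(c("10","20","20","30"))
as.numeric(f)                #level codes: 1 2 2 3
as.numeric(as.character(f))  #the actual values: 10 20 20 30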

Computing new variables

#R has a built-in suite of operators for calculations on vectors and matrices. Because of this it is extremely
#easy to do any sort of computation; there is usually nothing special needed, i.e. you don't need a 'compute'
#command to create new variables, and in general R's capabilities here are more advanced than those of SPSS or SAS.

newvar=var1+var2
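
#A quick runnable illustration with made-up variables: arithmetic and functions work elementwise on whole vectors.
var1=rnorm(20, 50, 10)
var2=rnorm(20, 5, 1)
newvar=var1+var2
zvar1=(var1-mean(var1))/sd(var1)   #e.g. standardizing a variable by hand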

If... else

Use ifelse to create a new variable whose values depend on the values of other variables.  R also has a more traditional, separate if/else construction similar to that of SPSS and SAS.

x = c(0,0,1,1)
y = c(0,0,1,2)
data1 = data.frame(x,y)
data1$z1= ifelse (data1$x==1, 1, 0)  #Create a new variable z1: if x is 1, z1 is 1; otherwise z1 is 0
data1$z1

data1$z2 = ifelse(data1$x == 1, 1, ifelse(data1$y == 2, 1, 0))    #Two conditions, nested: z2 is 1 if x is 1 or if y is 2, and 0 otherwise

data1$w=c(1,0,1,0)
data1$z3=ifelse(data1$x==1 & data1$y==1 & data1$w==1, 1, 0)   #Example with 3 conditions: z3 is 1 only if x, y, and w are all 1; any other combination gives 0
data1$z3

Other functions worth knowing for data manipulation: reshape, stack, and melt (the last from the reshape package). A small stack example follows.
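
A small stack example with made-up data: stack turns a wide data frame into long form, with a 'values' column and an 'ind' column marking the original variable.
wide=data.frame(pre=rnorm(5,10,2), post=rnorm(5,12,2))
long=stack(wide)
long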


Obtaining Frequency Information

#Create a variable called grades based on categorical data, i.e. letter grades. 'catdata' and 'grades' are
#arbitrary names I chose, 50 is the size of the sample, replace=TRUE says to sample with replacement, and
#prob= assigns the weights by which to sample; in this case most of the grades will be As and Bs.

catdata=c("A","B","C","D")
grades=sample(catdata, 50, replace=TRUE, prob=c(.5,.35,.1,.05))
grades

#Simple frequencies
table(grades)

#Barplot based on the table above, with titles and blue bars.
barplot(table(grades), main="Grades for Year 2008", xlab="Letter Grade", col="blue")

#The following is akin to Proc Freq in SAS or CROSSTABS in SPSS
library(gmodels)
CrossTable(grades)

#yet another way
library(prettyR)
freq(grades)


#More complex setting. I sampled based on percentages in this class as of 8-25-08.
progcats=c("Counseling","Clinical","Bmed","Experimental")
program=sample(progcats, 50, replace=TRUE, prob=c(.51,.22,.16,.11))
dataset=data.frame(program, grades)
dataset
CrossTable(program,grades)

#Grouped bar graph. This uses the table function above for the basic frequencies. Regarding the barplot
#function, I had a little fun with the colors. The legend creates the box that lets people know which program
#goes with which color. For "beside": if FALSE, the columns of height are portrayed as stacked bars; if TRUE,
#they are portrayed as juxtaposed (side-by-side) bars. Try it the other way just to see it.
counts <- table(program, grades)
barplot(counts, main="Grades by Program", xlab="Grades", col=c("skyblue","tomato", "wheat","seagreen"), legend = rownames(counts), beside=TRUE)

Getting Measures of Central Tendency and Variability

Central Tendency
#Mean and median.  As simple as it gets.
mean(varname)
median(varname)

#Trimmed mean. Built into the same 'mean' function via the trim= argument (here 20% trimming from each end).
mean(varname, trim=.2)

#Wilcox's functions must be used for winsorized means, M-estimators, etc. Change your working directory to
#wherever these files are, then type the following to have a wealth of robust measures at your disposal.
source("Rallfunv1.v7.txt")
source("Rallfunv2.v7.txt")
win(varname, tr=.2)   #winsorized mean with 20% trimming

#Harmonic mean. Requires the psych package. I actually created my own function to handle missing values and
#the choice of going across rows or columns, if you ever need it.
library(psych)
harmonic.mean(varname)

Variability

sd(varname)
mad(varname)

#If you have not sourced the Wilcox Rallfun libraries as above, do so for the winsorized variance.
winvar(varname)
#Winsorized standard deviation.
sqrt(winvar(varname))

#Going further. Compare them all together. Remember, you can name them whatever you like.
x=rnorm(100, 100,15)
meanx=mean(x)
medianx=median(x)
trmean=mean(x,trim=.2)
winmean=win(x,tr=.2)
hmean=harmonic.mean(x)
sdx=sd(x)
madx=mad(x)
winsdx=sqrt(winvar(x))
 
#Here is a way to get straight to the output: put parentheses around the whole assignment. That removes the
#need for the separate print step we have been using, e.g. data=(blah blah) <return> data.
(means=data.frame(meanx, medianx,trmean,winmean,hmean, sdx, madx, winsdx))

Summarizing Data

#create new variables
newvar=rnorm(100,100,15)
newvar2=rnorm(100,100,15)

#make a new dataset of the two variables
dataset=cbind(newvar, newvar2)

#descriptives
summary(dataset)

#alternative approach (there are many); this uses the psych package
library(psych)
describe(dataset)


#histogram in density form
hist(newvar,freq=F, col="wheat")
curve(dnorm(x,mean=mean(newvar),sd=sd(newvar)), col="red",add=TRUE)


#Note that this plots "density" rather than frequency (hence freq=FALSE; with freq=TRUE (or just T) it would
#have plotted frequencies, and TRUE is the default, so it need not be added to the hist function). Overlaying
#a normal curve on a frequency histogram, as SPSS and others do, is technically incorrect, rather like
#implying age and income are on the same scale. As the density and frequency graphs look the same, however,
#this is more of a nitpicky statistician thing.


t-tests

The t.test function is all that's required minimally.
 
One-sample case
t.test(variablename, mu=value, alternative = c("two.sided", "less", "greater"), conf.level = value)
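
A quick example with made-up numbers: test whether a sample of 30 simulated IQ scores differs from a null value of 100.
iqsample=rnorm(30, 105, 15)
t.test(iqsample, mu=100)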
 
Two independent samples
Two samples must be specified or a formula provided.  Choices include the alternative hypothesis, a specific null value (i.e. you don't have to test a 'nil' difference), and the confidence level. By default Welch's test is performed, assuming unequal variances, i.e. var.equal=FALSE.
 
t.test(sample1, sample2, mu=value, alternative = c("two.sided", "less", "greater"), paired = FALSE, var.equal = TRUEORFALSE, conf.level = value)
t.test(DV~groupvar, mu=value, alternative = c("two.sided", "less", "greater"), paired = FALSE, var.equal = TRUEORFALSE, conf.level = value)
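
For example, using the scores and grouping variable built in the Split section above:
t.test(scores~ContTreat)                   #formula interface; Welch's test by default
t.test(scores[ContTreat=="Control"], scores[ContTreat=="Treatment"], var.equal=TRUE)   #two-sample interface, classic Student's t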

Two dependent samples

Both samples must be specified, along with paired=TRUE.

t.test(sample1, sample2, mu=value, alternative = c("two.sided", "less", "greater"), paired = TRUE, conf.level = value)
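
For example, with made-up pre/post scores on the same 20 people:
pre=rnorm(20, 50, 10)
post=pre + rnorm(20, 3, 5)    #post scores built to run a few points higher
t.test(pre, post, paired=TRUE)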

Multiple t-tests: in


Correlation

There are lots of ways to get a correlation or correlation matrix, as there is not just one correlation-specific function.  The standard way is shown here.  x is either a data frame or, if a single variable, one can specify the y to correlate with it; method can be one of the three commonly seen in stats packages.  If you insist on testing whether a specific correlation is zero or not, use cor.test in a similar fashion.
 
cor(x, y = NULL, use = "all.obs", method = c("pearson", "kendall", "spearman"))
cor.test(x, y, alternative = c("two.sided", "less", "greater"),method = c("pearson", "kendall", "spearman"),conf.level = 0.95)
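
A small worked example with simulated variables:
x=rnorm(100)
y=.5*x + rnorm(100)            #y is built to correlate with x
cor(x, y)                      #Pearson r
cor(x, y, method="spearman")   #rank-based alternative
cor.test(x, y)                 #r plus a significance test and confidence interval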
 
Obtaining robust correlations can be quite easy also.  Here we have the option of the minimum volume ellipsoid, minimum covariance determinant or classical Pearson r.
 
library(MASS)
cov.rob(data, cor = TRUE, method = c("mve", "mcd", "classical"))
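
A sketch with simulated correlated data (the parameters are made up):
library(MASS)
xy=mvrnorm(100, mu=c(0,0), Sigma=matrix(c(1,.5,.5,1),2,2))   #two variables with population r = .5
cov.rob(xy, cor=TRUE, method="mcd")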
 
Assuming you've downloaded the Wilcox libraries to your R folder, you can source them and use the functionality there.  pbcor gives the percentage bend correlation; relplot gives a scatterplot with a boxplot-like summary of the bivariate data.
 
source("Rallfunv1-v7")
source("Rallfunv2-v7")
pbcor(var1,var2)
relplot(var1,var2)

Simple Regression

Regression analysis can be conducted in a multitude of ways via a number of packages.  By far the most common is the lm function in the base package.  Others I use a lot are ols from the Design package, lmrob from the robustbase package, and lmRob from the robust package.  All of the resulting model objects can be plotted, though ols goes out of its way to make this less intuitive than the others, and all of them contain many more things such as fitted values, residuals and more.

modelreg=lm(DV~predictor, data=dataset)
library(Design)
modelreg2=ols(DV~predictor, x=T, y=T, data=dataset) #x and y=T free them up for bootstrapping later
library(robustbase)
modelrob=lmrob(DV~predictor, data=dataset)
library(robust)
modelrob2=lmRob(DV~predictor, data=dataset)
 
Example plot
plot(modelreg)
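
A minimal start-to-finish sketch with simulated data (the names and parameters are made up):
predictor=rnorm(100)
DV=2 + .6*predictor + rnorm(100)
dataset=data.frame(DV, predictor)
modelreg=lm(DV~predictor, data=dataset)
summary(modelreg)     #coefficients, tests, R-squared
plot(modelreg)        #basic diagnostic plots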

Testing assumptions

After fitting a model in R-commander, it's terribly easy to test assumptions graphically and statistically through the menus, e.g. Model/Graphs/Basic Diagnostic Plots and Model/Numerical Diagnostics, and that is my recommendation.  However, I provide basic code for the statistical tests below, which use the lmtest library (first three) and the base package.  Note that other tests are available as well.

library(lmtest)  #needed for the first three tests
bptest(DV ~ PREDICTOR, varformula = ~ fitted.values(mymodel), studentize=FALSE, data=mydata)  #Breusch-Pagan test for heteroscedasticity
dwtest(DV ~ PREDICTOR, alternative="two.sided", data=mydata)   #Durbin-Watson statistic for autocorrelation among residuals
resettest(DV ~ PREDICTOR, power=2:3, type="regressor", data=mydata)  #RESET test for nonlinearity
shapiro.test(mymodel$residuals)  #Shapiro-Wilk test for non-normality
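
For instance, applied to the simulated regression sketched above:
bptest(DV ~ predictor, studentize=FALSE, data=dataset)
dwtest(DV ~ predictor, alternative="two.sided", data=dataset)
resettest(DV ~ predictor, power=2:3, type="regressor", data=dataset)
shapiro.test(residuals(modelreg))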

Dealing with violations of assumptions

To get a sense of bias, use the Design library and ols function as above
modelreg2=ols(DV~predictor, x=T, y=T, data=dataset)
validate(modelreg2)
For heteroscedasticity and dealing with outliers, one can compare results to a robust regression as above, though model misspecification may be the issue
For interval estimates not reliant on the normality assumption, we'll use the boot library and a borrowed function (called bs here).
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mydata, statistic=bs,
   R=1000, formula=mymodelformula)

# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # predictor coefficient
boot.ci(results, type="bca", index=1) # intercept CI.  Type 'bca' is a bias-corrected version of the bootstrap, but there are other options.  ?boot.ci
boot.ci(results, type="bca", index=2) # predictor CI
 
For interval estimates of R2, one can use the MBESS library.
library(MBESS)
ci.R2(R2=myRsquare, N=mysamplesize, K=mynumberofpredictors, conf.level=.95, Random.Predictors=TRUE)
 
However, interval estimates based on noncentral distributions are very wide and as such uninformative for smaller sample sizes, except perhaps as a means of providing a lower bound.  Here is a bootstrap approach that has been noted to perform better.  Again it uses the boot library as above and follows the same basic procedure as we did for the coefficients.
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample
  fit <- lm(formula, data=d)
  return(summary(fit)$r.squared)
}

results <- boot(data=mydata, statistic=rsq,
   R=1000, formula=mymodelformula)

results
plot(results)

boot.ci(results, type="bca")
 

Power
 
Power analysis is best used prior to a study to help determine a viable sample size for correct rejection of null hypotheses.  However, it will not give you the time or resources to actually collect said data, it will not tell you what your power will be if there are outliers, and it won't tell you what your power is if you don't meet your assumptions.  Furthermore, the correct sample size is always "as much as you can get in the time you have."  In any case, here is how to do it in R.
For simple group comparisons, use power.t.test.  There are also power.anova.test and power.prop.test for the ANOVA and proportion-test settings respectively.
 
power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05, power = NULL, type = c("two.sample", "one.sample", "paired"),alternative = c("two.sided", "one.sided"), strict = FALSE)
 
Example.  As sd defaults to one, a delta = .5 would be a Cohen's d of .50.
power.t.test(power = .80, delta = .5, sig.level=.05)
 
For regression. This requires the MBESS package. R2 is the population R-squared, p is the number of predictors.
library(MBESS)
ss.power.R2(Population.R2 = .10, alpha.level = 0.05, desired.power = 0.80, p=1)

Effect Size

Whatever effect size you want, even if R doesn't have it built in, you can calculate it with R, and probably easily.  It does contain the basic ones such as d, and every program calculates R-squared.  However, R will also easily provide interval estimates, and one can bootstrap and robustify as well.  I provide a function for the latter.  All of the following (except my function, obviously) require the MBESS package.

Input your R-squared and other model specifics.  K refers to the number of predictors.
ci.R2(R2=?, N=?, K=?, conf.level=?)
 
Two ways to calculate Cohen's d (i.e. standardized mean difference).
smd(Mean.1=?, Mean.2=?, s.1=?, s.2=?, n.1=?, n.2=?)
smd(Group1name,Group2name)
 
Now, once you have it, get the interval estimate of it.
ci.smd(smd=?, n.1=?, n.2=?, conf.level=?)
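
A filled-in example with made-up summary statistics (two groups of 30, means 105 and 100, both sds of 15):
library(MBESS)
d=smd(Mean.1=105, Mean.2=100, s.1=15, s.2=15, n.1=30, n.2=30)
d
ci.smd(smd=d, n.1=30, n.2=30, conf.level=.95)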
 
 
My robustd function. It uses winvar, so the Wilcox functions must have been sourced first.
robustd= function(g1,g2,trim=.2){
m1=mean(g1,trim=trim)    #trimmed means
m2=mean(g2,trim=trim)
n1=length(g1)
n2=length(g2)
df1=n1-1
df2=n2-1
var1=winvar(g1,tr=trim)  #winsorized variances, from the sourced Wilcox functions
var2=winvar(g2,tr=trim)
pooledwinsd= sqrt((df1*var1 + df2*var2)/(df1+df2))  #pooled winsorized sd
robustd= (m1-m2)/pooledwinsd
return(c("Robust Cohen's d"=robustd))
}
 
Example of use.
robustd(group1, group2)
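
For instance, with simulated groups (this assumes the Wilcox functions providing winvar have been sourced as above):
group1=rnorm(30, 100, 15)
group2=rnorm(30, 110, 15)
robustd(group1, group2)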