VI. SAS Procedures
The following covers some of the most commonly used SAS procedures with which
you can run some basic statistical analyses. Go to File, Import Data... to
import the Example Data 1 file using the Import Wizard with SPSS File (*.sav)
source and member name example1 as was done previously.
Before we really begin, you should consider the use of the OPTIONS
statement when submitting any program (i.e., syntax). The options statement can
be tacked on to just about any program or procedure. It allows you to control
the number of characters per line and lines per page of the output generated by
the program or procedure in which it is included. Because OPTIONS is a global
statement, the settings stay in effect for the rest of the SAS session until
you change them. The generic form of the options statement follows:
OPTIONS LINESIZE=x PAGESIZE=y;
The x refers to the number of characters per line and the y refers to the number
of lines per page. The reason the options statement is mentioned here is
that SAS can be quite costly in terms of the amount of output generated when
one considers printing it or copying and pasting it into a word processing
program. For instance, the sixth edition of the Publication Manual of the
American Psychological Association (APA) generally recommends using Times New
Roman 12 point font on a page with 1 inch margins at top, bottom, left, and
right. This configuration in Microsoft Word results in a page that contains
approximately 78 characters per line and 46 lines per page. Therefore, if you
are accustomed to using the APA Publication Manual guidelines for formatting
documents, you may want to use an options statement to configure each SAS output
so that it fits neatly on a pre-formatted document page. An example of the use
of the options statement is provided in the syntax for the PROC PRINT example
below -- noticeable because, like all usable syntax on these web pages, it is
shown in bold Courier New 10 point font on the web page.
1. PROC PRINT
PROC PRINT is frequently used to check the data being read by SAS. It prints
out the observations in a SAS data set, using all or some of the variables. The
complete syntax for PROC PRINT is as follows:
PROC PRINT DATA= SAS-data-set
DOUBLE
NOOBS
UNIFORM
LABEL
SPLIT= 'split-character'
N
ROUND
HEADING= direction
ROWS= page-format
WIDTH= column-width;
VAR variable-list;
ID variable-list;
BY variable-list;
PAGEBY BY-variable;
SUMBY BY-variable;
SUM variable-list;
The most common use is to place PROC PRINT after the data step to
verify the data. For the current example with ExampleData1.sav (using member
name example1 in SAS), use the following syntax
(with optional OPTIONS statement included):
PROC PRINT DATA=example1;
OPTIONS LINESIZE=78 PAGESIZE=46;
RUN;
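With a sizeable data set, it is often enough to print only some variables and the first few observations. The following is a minimal sketch, assuming the variables id, Sex, recall1, and recall2 used elsewhere on this page exist in example1; the choice of 10 observations is arbitrary.

```sas
/* Print the first 10 observations of selected variables.          */
/* OBS=10 is a data set option; NOOBS drops the observation column. */
PROC PRINT DATA=example1 (OBS=10) NOOBS;
VAR id Sex recall1 recall2;
RUN;
```

The VAR statement controls both which variables are printed and the order in which they appear.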
2. PROC CONTENTS
PROC CONTENTS prints descriptions of the contents of one or more files from a
SAS data library. It is another common way to verify a data set that has been
read into a SAS library, especially a sizeable one. It is crucial, for example,
to check that all observations and variables were read in correctly. The
procedure is also useful for documenting permanent SAS data sets (library
members of DATA type).
Specific information pertaining to the physical characteristics of a member
depends on whether the file is a SAS data set or another type of SAS file.
Syntax:
PROC CONTENTS <DATA= <libref.>member>
<DIRECTORY>
<FMTLEN>
<MEMTYPE= (mtype-list)>
<NODS>
<NOPRINT>
<OUT= SAS-data-set>
<POSITION>
<SHORT>
<DETAILS|NODETAILS>;
For the current example:
PROC CONTENTS DATA=example1;
RUN;
A DATA step with a LABEL statement is often used when first looking at data,
to assign descriptive labels to variables. For the current example, we create a
new data set (example1a) containing our data, but with labels assigned to some
variables.
DATA example1a;
SET example1;
LABEL Sex ="Gender"
recall1 ="Recall at time 1"
recall2 ="Recall at time 2";
RUN;
PROC CONTENTS DATA=example1a;
RUN;
3. PROC MEANS
PROC MEANS computes statistics for an entire SAS data set or for groups of
observations in the data set. If you use a BY statement, PROC MEANS calculates
descriptive statistics separately for groups of observations. Each group is
composed of observations having the same values of the variables used in the BY
statement. The groups can be further subdivided by the use of the CLASS
statement. PROC MEANS can optionally create one or more SAS data sets containing
the statistics calculated.
The full syntax for PROC MEANS is as follows:
PROC MEANS <option-list> <statistic-keyword-list>;
VAR variable-list;
BY variable-list;
CLASS variable-list;
FREQ variable;
WEIGHT variable;
ID variable-list;
OUTPUT <OUT= SAS-data-set> <output-statistic-list>
<MINID|MAXID <(var-1<(id-list-1)>
<...var-n<(id-list-n)>>)>=name-list>;
We can get descriptive statistics for all of the variables using PROC MEANS
as shown below.
PROC MEANS DATA=example1;
RUN;
We can get descriptive statistics separately by gender (i.e., broken down by
SEX) as shown below.
PROC MEANS DATA=example1;
CLASS Sex;
RUN;
We can get descriptive statistics on the outcome or dependent variable recall
at time 1 (recall1) separately by gender (i.e., broken down by Sex) as shown
below.
PROC MEANS DATA=example1;
CLASS Sex;
VAR recall1;
RUN;
We can get descriptive statistics on recall1 separated by gender (i.e., broken
down by Sex) and class standing (cl_st) as shown below.
PROC MEANS DATA=example1;
CLASS Sex cl_st;
VAR recall1;
RUN;
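As noted above, PROC MEANS can also save the statistics it calculates in a SAS data set. The following is a minimal sketch of that use; the output data set name (recall1_stats) and the statistic variable names (n_recall1, mean_recall1, sd_recall1) are hypothetical names chosen for illustration.

```sas
/* NOPRINT suppresses the printed table; OUTPUT writes the statistics */
/* to a new data set (recall1_stats is a hypothetical name).           */
PROC MEANS DATA=example1 NOPRINT;
CLASS Sex;
VAR recall1;
OUTPUT OUT=recall1_stats N=n_recall1 MEAN=mean_recall1 STD=sd_recall1;
RUN;
/* Inspect the saved statistics. */
PROC PRINT DATA=recall1_stats;
RUN;
```

When a CLASS statement is used, the output data set also contains an overall row (with _TYPE_ equal to 0) in addition to one row per level of Sex.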
We can also subset the data to get very specific descriptive statistics. For
instance, if we review the output or know the numeric codes for each value of
our variables, we can request that a subset of the data (example1fj) be
generated from the original data (example1) which contains only persons with
Sex = 1 and cl_st = 3, which corresponds to females whose class standing is
Junior.
DATA example1fj;
SET example1;
IF Sex=1 AND cl_st=3;
RUN;
PROC MEANS DATA=example1fj;
VAR recall1;
RUN;
We can verify we have gotten what we wanted by referring to the previous
output showing descriptive statistics for males and females across all four
levels of class standing. In both the current output and previous output we
notice there were 27 females who were Juniors.
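An alternative to creating a separate subset data set is to restrict the observations within the procedure itself using a WHERE statement. The sketch below assumes Sex and cl_st are stored as numeric codes (1 = female, 3 = Junior), as described above; if either variable were stored as character, the values would need to be quoted.

```sas
/* Assumes Sex and cl_st are numeric codes: 1 = female, 3 = Junior. */
PROC MEANS DATA=example1;
WHERE Sex=1 AND cl_st=3;
VAR recall1;
RUN;
```

This should report the same statistics as the example1fj approach without leaving an extra data set in the library.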
4. PROC UNIVARIATE
This procedure is useful for basic descriptive statistics of the variables. It
provides detail on the distribution of a variable. Features include:
- detail on the extreme values of a variable
- quartiles, such as the median
- several plots to picture the distribution
- frequency tables
- a test to determine whether the data are normally distributed.
If a BY statement is used, descriptive statistics are calculated separately
for groups of observations.
Syntax:
PROC UNIVARIATE DATA= SASdataset
NOPRINT
PLOT
FREQ
NORMAL
PCTLDEF= value
VARDEF= DF|WEIGHT|WGT|N|WDF
ROUND= roundoff unit...;
VAR variables;
BY variables;
FREQ variable;
WEIGHT variable;
ID variables;
OUTPUT OUT= SASdataset keyword= names...;
We can get detailed descriptive statistics for family income using
PROC UNIVARIATE as shown below.
PROC UNIVARIATE DATA=example1;
VAR fam_income;
RUN;
We can also use PROC UNIVARIATE to get conditional univariate summaries using
the BY statement; but first, we need to sort the data by the BY variable.
PROC SORT DATA=example1;
BY Sex;
RUN;
PROC UNIVARIATE DATA=example1;
BY Sex;
VAR recall1;
RUN;
Another very handy use of PROC UNIVARIATE is identification of outliers. To
accomplish this, we insert two options into the basic PROC UNIVARIATE
statement. These options are NORMAL and PLOT.
PROC UNIVARIATE DATA=example1 NORMAL
PLOT;
VAR recall1;
ID id;
RUN;
In the preceding syntax, we ran PROC UNIVARIATE on recall at time 1
(recall1) and used values of the participant identification variable (id) to
identify (ID) outlying values of recall1. In the next syntax we perform the
same basic procedure, but separately for each gender (this produces 7 pages of
output). As before, the data must already be sorted by Sex.
PROC UNIVARIATE DATA=example1 NORMAL
PLOT;
BY Sex;
VAR recall1;
ID id;
RUN;
5. PROC FREQ
The procedure produces one-way to n-way frequency and crosstabulation tables. It shows the distribution of variable values and crosstabulation tables with
combined frequency distributions for two or more variables. For one-way tables,
PROC FREQ can compute chi-square tests for equal or specified proportions. For
two-way tables, PROC FREQ computes tests and measures of association. For n-way
tables, PROC FREQ does stratified analysis, computing statistics within as well
as across strata.
Syntax:
PROC FREQ options;
OUTPUT <OUT= SAS-data-set><output-statistic-list>;
TABLES requests / options;
WEIGHT variable;
EXACT statistic-keywords;
BY variable-list;
We can get a frequency distribution of age using
proc freq as shown below.
PROC FREQ DATA=example1;
TABLES age;
RUN;
We can make a two way table showing the frequencies for class standing by sex as shown below.
PROC FREQ DATA=example1;
TABLES cl_st * Sex;
RUN;
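The tests mentioned at the start of this section are requested with options on the TABLES statement. The following is a minimal sketch using the CHISQ option; whether a chi-square test is appropriate for these particular variables is an assumption made only for illustration.

```sas
PROC FREQ DATA=example1;
/* One-way table: chi-square test for equal proportions. */
TABLES cl_st / CHISQ;
/* Two-way table: chi-square tests of association. */
TABLES cl_st*Sex / CHISQ;
RUN;
```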
Labeling values is a two-step process. First, we must create the label
formats with PROC FORMAT using a VALUE statement. Next, we attach
the label format to the variable with a FORMAT statement. This FORMAT
statement can be used in either PROC or DATA steps. An example of
the PROC FORMAT step for creating the value formats for class standing (cl_st)
follows.
PROC FORMAT;
VALUE cl_stf 1="Fre"
2="Sop"
3="Jun"
4="Sen";
RUN;
Now that the format for class standing (cl_st) has been created, it
must be linked to the variable cl_st. This is accomplished by
including a FORMAT statement in either a PROC or a DATA
step. In the program below, the FORMAT statement is used in a PROC FREQ
step so that the values of cl_st are displayed with their labels.
PROC FREQ DATA=example1;
FORMAT cl_st cl_stf.;
TABLES cl_st;
RUN;
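The FORMAT statement can instead be placed in a DATA step, in which case the format stays attached to the variable in the new data set. The sketch below assumes the cl_stf format created above is still available in the current session; the data set name example1b is a hypothetical name chosen for illustration.

```sas
/* Attach the cl_stf format to cl_st in a new data set */
/* (example1b is a hypothetical name).                  */
DATA example1b;
SET example1;
FORMAT cl_st cl_stf.;
RUN;
/* No FORMAT statement is needed here; the labels carry over. */
PROC FREQ DATA=example1b;
TABLES cl_st;
RUN;
```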
6. PROC TABULATE
PROC TABULATE constructs tables of descriptive statistics using class
variables, analysis variables, and keywords for statistics. Tables can have one
to three dimensions: column; row and column; or page, row, and column.
The statistics that PROC TABULATE computes are many of the same statistics
computed by other descriptive procedures such as MEANS, FREQ, and SUMMARY. In
order for PROC TABULATE to execute, you need either a CLASS or VAR statement,
and a TABLE statement. There are no default variables chosen for the procedure.
Syntax:
PROC TABULATE <option-list>;
CLASS class-variable-list;
VAR analysis-variable-list;
FREQ variable;
WEIGHT variable;
FORMAT variable-list-1 format-1 <...variable-list-n format-n>;
LABEL variable-1='label-1' <...variable-n='label-n'>;
BY <NOTSORTED> <DESCENDING> variable-1
<...<DESCENDING> VARIABLE-N>;
TABLE <<page_expression,> row_expression,> column_expression
</ table-option-list>;
KEYLABEL keyword-1 ='description-1'
<...keyword-n='description-n'>;
We can create a basic table of individuals' recall at time 2 (recall2) by
gender (sex).
PROC TABULATE DATA=example1;
CLASS sex;
VAR recall2;
TABLE (recall2)*mean, sex;
RUN;
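A table with more than one class variable and more than one statistic follows the same pattern. The following is a sketch of a two-dimensional table, with class standing (cl_st) in the rows and, within each level of Sex in the columns, the N and mean of recall2; the particular layout is an illustrative choice.

```sas
PROC TABULATE DATA=example1;
CLASS cl_st Sex;
VAR recall2;
/* Rows: cl_st. Columns: N and mean of recall2 within each Sex. */
TABLE cl_st, Sex*recall2*(N MEAN);
RUN;
```

In a TABLE statement, a comma separates dimensions, an asterisk nests one element within another, and parentheses group elements.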
7. PROC GCHART & PROC GPLOT
Making a simple graph in SAS.
We can make a simple vertical bar chart of recall at time 1. Because
recall1 is a continuous variable, SAS automatically groups its values into
five bins.
TITLE 'Simple Vertical Bar Chart ';
PROC GCHART DATA=example1;
VBAR recall1;
RUN;
You can control the number of bins for a continuous variable with the
levels= option on the vbar statement. The program
below creates a vertical bar chart with nine bins for recall1.
TITLE 'Bar Chart - Control Number of Bins';
PROC GCHART DATA=example1;
VBAR recall1/LEVELS=9;
RUN;
On the other hand, cl_st has only four categories, and SAS's
tendency to bin into five categories and use midpoints would not do justice to
the data. So when you want to use the actual values of the variable to label
each bar, you will want to use the discrete option on the
vbar statement.
We can make a bar chart showing the frequencies of class standing
as shown below.
TITLE 'Bar Chart with Discrete Option';
PROC GCHART DATA=example1;
VBAR cl_st/DISCRETE;
RUN;
Simply changing 'VBAR' to 'HBAR' will produce the same graph horizontally
as opposed to vertically.
TITLE 'Bar Chart with Discrete Option';
PROC GCHART DATA=example1;
HBAR cl_st/DISCRETE;
RUN;
We can create a variety of scatter plots using PROC GPLOT. It
allows us to see the relationship between two continuous variables. The program
below creates a scatter plot for recall2*recall1. This means
that recall2 will be plotted on the vertical axis, and
recall1 will be plotted on the horizontal axis.
TITLE 'Scatterplot - Two Variables';
PROC GPLOT DATA=example1;
PLOT recall2*recall1;
RUN;
You may want to examine the relationship between two continuous variables and
see which points fall into one or another category of a third variable. The
program below creates a scatter plot for recall2*recall1 with each gender (Sex)
marked. You specify recall2*recall1=Sex on the plot statement to have each
level of Sex identified on the plot.
TITLE 'Scatterplot - Male/Female Marked';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
The program below creates a scatter plot for recall2*recall1
with each level of Sex marked. The PROC GPLOT step
is specified exactly the same as in the previous example. The only difference
is the inclusion of symbol statements to control the look of
the graph through the use of the options V=, I=, and C=.
SYMBOL1 V=circle C=black I=none;
SYMBOL2 V=star C=red I=none;
TITLE 'Scatterplot - Different Symbols';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
QUIT;
Symbol1 is used for the lowest value of Sex
and symbol2 is used for the next lowest value.
V= controls the type of point to be plotted. We
requested a circle to be plotted for females, and a
star (asterisk) for males.
I=none causes SAS not to plot a line joining the points.
C= controls the color of the plot. We requested black for
females, and red for males. (Sometimes the C= option is needed for
the other symbol options to take effect.)
To plot a regression line along with the points we use the I operand of the
symbol statement. The program below creates a scatter plot for
recall2*recall1 with such an OLS regression line. The regression line
is produced with the I=R operand on the symbol
statement.
SYMBOL1 V=circle C=blue I=r;
TITLE 'Scatterplot - With Regression Line ';
PROC GPLOT DATA=example1;
PLOT recall2*recall1;
RUN;
QUIT;