Page One

Campus Computing News

EduTex 2002 Proceedings Available

Do you have something to tell Everyone?

Today's Cartoon

RSS Matters

SAS Corner

The Network Connection

Link of the Month


Short Courses

IRC News

Staff Activities

Subscribe to Benchmarks Online
Research and Statistical Support

SAS Corner

By Dr. Karl Ho, Research and Statistical Support Services Manager

Reading Web survey data (*)

If you are a SAS user and interested in a DYI web survey, this article will be handy for you. I provide a sample program that reads in web survey data in one shot. For more details about creating a web survey using FrontPage, consult my ACS Short Course notes at New Technologies for Survey Research I and II.

When FrontPage collects the data from the instrument, it will generates a text data file either delimited by tabs or commas.  The tab-delimited file will look like the following:

"gender" "age" "class" "trans_d" "trans_b" "trans_t" "dessert" "name" "Date" "Time" "Remote Name"
"2" "3" "5" "" "" "1" "4" "Mary Poe" "16 Nov 1999" "12:25:50" ""
"1" "3" "6" "1" "" "1" "2" "Karl Ho" "16 Nov 1999" "12:26:46" ""
"1" "5" "6" "1" "" "" "5" "Steven Hawkins" "16 Nov 1999" "12:27:31" ""
"2" "6" "6" "" "" "1" "1" "Marilyn Monroe" "16 Nov 1999" "12:27:56" ""
"1" "6" "6" "1" "1" "1" "2" "George Bush" "16 Nov 1999" "12:28:30" ""
"2" "3" "4" "" "1" "" "2" "Jane Doe" "16 Nov 1999" "12:29:06" ""
"1" "5" "1" "1" "1" "1" "4" "Henry Ford" "16 Nov 1999" "12:29:24" ""
"2" "4" "5" "1" "" "" "3" "Helen Hunt" "16 Nov 1999" "12:29:43" ""
"1" "2" "2" "1" "" "1" "5" "Peter Smith" "16 Nov 1999" "12:30:25" ""
"2" "6" "6" "" "" "1" "4" "" "16 Nov 1999" "12:31:07" ""

This data set is generated from a sample web survey asking the preference for dessert and transportation mode. I intentionally leave some entries to be blank and have text variables with embedded space.  

FrontPage automatically wraps around values with double quotes. If you read in the data using the Text Import Wizard in SAS, the program will strip the double-quotes but meanwhile treat all values as string variables. One get- around is to pre-process the data by using a text editor to remove all the double quotes. That will be a simple but quite an "ad hoc" process. Another problem is the embedded spaces and delimiters. Usually, text data from web survey are separated or delimited by a delimiter that serves as a mark for reading the next variable. Space and comma are common delimiters. However, space is a bad delimiter for survey data, especially if you have a text variable like name. Even if you separate last name and first name, you will have the redundant delimiter problem reading names like Van Exel or Sue Ellen that carry an embedded space. Comma is not better when you have a comment field. Users will add comma in their long comments, which will trick the program into believing that the comma in the comment is a delimiter. With these two problems in mind (double-quotes and redundant delimiters), I design a SAS program that provides a better way to read data from FrontPage-based web surveys.

1. The omnipresent quotation marks

Well, this is relatively easy. SAS Data step has an option under INFILE statement that takes care of that.  The DSD option will strips off the quotation marks that surround values in the data. An extra bonus is it will treat two consecutive quotation marks (i.e. '' or "") as missing value. Nevertheless, the DSD option assumes the delimiter to be a comma. As mentioned above, we don't really like comma as delimiters.  Well, at least it solves half of our problem. We will fix this using another option that follows right away. 

2. Delimiters, delimiters, delimiters.....

I select tab as the delimiter for survey data instead of space and comma. The tab character is an ASCII character even though we don't see it as a visible text on screen. Its beauty is no respondents can enter a tab as in neither the name field or the comment field. So, the data stream for each case will only contain the machine-generated tab characters and we won't have the redundant delimiter problem. To specify reading tab-delimited data in SAS, we use the DLM='09'x option in the INFILE statement. (**)

Okay, here is the program:

FILENAME MYFILE 'C:\temp\dessertdata.txt';
INPUT gender age class trans_d trans_b trans_t dessert name $
Date :
date11. Time :time10. Remote :$char24.;

This program, simple enough, is to read in a web survey text data file "dessertdata.txt" under the TEMP directory on C drive. The DLM and DSD options are used in the INFILE statement. Note that I start reading the data at the second row, because the first row will be the variable names.   

3. Informats

Hold up one second, we are not done yet. The program is using the list input method to read in the delimited file. For this mode of input method, however, informats could not be used. How do we specify width and formats for time, date and long-string comment variables like we usually do using formatted input? We can add this feature by using a colon (:) as the format modifier. The format modifier instructs the program to read up to the maximum number of bytes specified in the informat (24 columns for $char24), or when it reaches a delimiter, which is another tab. The following is what the data look like when I read them into SAS: 

VIEWTABLE: work._last_

The date and time variables are in SAS date and time values. We will change the variables using date7. and time8. formats and make them look like:

VIEWTABLE: work._last_


Check out the blanks and the missing values. They are all clean, which is what data scrubbers like me are happy to see.  


(*) Adapted from SAS Technical note TS-673 "Reading Delimited Text Files into SAS"
(**) For mainframe users using EBCDIC data, the representation is '05'x.