Accessing and Using ICPSR Datasets

By James Yarbrough, ACS Statistical Consultant (james@cc1.unt.edu)

The University of North Texas, along with other universities throughout the world, belongs to a consortium that offers an extensive archive of data sets to its members. The consortium is called the Inter-university Consortium for Political and Social Research (ICPSR). Typically referred to as "ICPSR datasets," the data files are often very large and usually of very high quality, are based on professional and sophisticated sampling methods, and cover a wide array of social, psychological, economic, political, business and market, and other areas of interest. Many are longitudinal files that lend themselves well to time series approaches. Various countries throughout the world are represented as study populations, often within the same study, which makes foreign and comparative national studies possible. Still others contain historical data for use in historical research. Census data for various regions are also available.

Membership has its privileges

As a member of the Consortium, the University of North Texas (specifically, its faculty and students) may use any of the datasets in the ICPSR archive at no additional cost beyond annual membership fees (i.e., there is no cost for individual users). Over time, as faculty and students request and use ICPSR datasets, the UNT collection grows. Hence, UNT already possesses many ICPSR datasets, which are available for immediate use by faculty and students who know how to access them. Access to our collection of ICPSR datasets requires a minimal knowledge of Job Control Language (JCL), presented below. Datasets that are not already in the UNT archive can be ordered (electronically) and are typically accessible within 2 to 4 weeks. Contact someone in the Statistical and Research Support group at Academic Computing Services, 565-2324, if you are interested in ordering a particular ICPSR study.

Making use of the datasets

ICPSR publishes a comprehensive catalog of its archival holdings annually. Typically, the first step in making use of ICPSR datasets is simply to look through the ICPSR catalog and determine which dataset you are interested in using. This catalog can be found in various locations around the UNT campus: many individual faculty members and academic department offices have copies, a copy is available on reserve at Willis Library, and the Statistical and Research Support group has a copy.

The second step is to find out whether or not UNT has the data file you select in its archival holdings. The easiest way to do this is to search an index file of current holdings. This file is a CMS file called current icpsr d, and it is located on the public d disk of the Academic Mainframe. Use the following steps to search this file:

  1. Log in to your CMS user account.
  2. Browse the file by issuing the command browse current icpsr d from the CMS Ready; prompt.
  3. From the command prompt, enter the command /#### where #### is the four-digit ICPSR study number of the study you are interested in.

Doing this will result in one of two outcomes. One possibility is that CMS will respond with the message DATA NOT FOUND. In this case, UNT does not have the particular study in its archive and you will need to contact Academic Computing to order the dataset if you are still interested in it. Again, it usually takes from 2 to 4 weeks to access a dataset that must be ordered from ICPSR. The second possible outcome from step 3 occurs if UNT does have the particular study in its archive. In this case, after entering /####, CMS will position the screen to the line where information concerning that study is found.

For example, if you were interested in the AIDS Supplement to the 1987 National Health Interview Survey, you would have determined from the ICPSR catalog that the ICPSR study number for that study is 9271. In this instance, in step 3 you would enter /9271. CMS would respond by displaying the screen of information shown below.


Sample ICPSR Output on CMS
  9271
  National Health Interview Survey, 1987: AIDS Supplement
  105595
  9273
  Annual Data on Nine Economic and Military Characteristics of 78 Nations,  48-83
  105160
  9275
  General Social survey Cumulative File, 1972-1989
  105645
  9286
  International Crisis Behavior Project, 1918-1988
  301929             300768    105160
  9287
  Offender Based Transaction Statistics (OBTS), 1987: Alaska, Ca., Del., Minn.,
  105160
  9300
  World Tables of Economic and Social Indicators, 1950-1988
  103652   2   3     105719    105246
  9303
  Detroit Area Study, 1981: A Study of the Family
  105612
  9304

Understanding the output

There are three lines of information for each ICPSR study. The first line contains only the ICPSR study number. Hence, you can see that the first three lines of the page of the above CMS file refer to ICPSR study number 9271. Focusing our attention now on just these three lines, we have:

  1. 9271
  2. National Health Interview Survey, 1987: AIDS Supplement
  3. 105595

Of course, the second line is the ICPSR study name. The third line of information is the Volume/Serial number of the UNT archive tape that contains the data (including codebook(s), dictionary, SPSS or SAS code for reading data, etc., depending on which supplementary files are available for the particular study requested). In this case, we can see that files for ICPSR study number 9271 can be found on the UNT tape with Volume/Serial number 105595.
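For readers who have copied the listing to a workstation, the three-line layout described above can be read mechanically. The following Python sketch is only an illustration, not part of the CMS or MVS procedure: the function names parse_index and find_study are hypothetical, and splitting the third line on whitespace is an assumption, since the article describes that line only as the Volume/Serial number and does not explain entries that carry several values.

```python
def parse_index(text):
    """Group a three-lines-per-study listing into a list of records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    complete = len(lines) - len(lines) % 3  # ignore a trailing partial group
    records = []
    for i in range(0, complete, 3):
        study, title, third = lines[i:i + 3]
        # Assumption: the third line holds one or more whitespace-separated values.
        records.append({"study": study, "title": title, "tapes": third.split()})
    return records


def find_study(records, study_number):
    """Return the record for study_number, or None if it is not listed."""
    for record in records:
        if record["study"] == str(study_number):
            return record
    return None


# A few lines taken from the sample screen; the last group is cut off.
sample = """\
  9271
  National Health Interview Survey, 1987: AIDS Supplement
  105595
  9273
  Annual Data on Nine Economic and Military Characteristics of 78 Nations,  48-83
  105160
  9275
"""

records = parse_index(sample)
hit = find_study(records, 9271)
print(hit["title"], hit["tapes"])
```

A result of None from find_study corresponds to the DATA NOT FOUND case: the study is not in the UNT archive and would have to be ordered.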

Filenames, file numbers, and Job Control Language

The next step in the process of accessing this study data is to learn a little more about the filenames and file numbers of the files associated with the study. In particular, you will need to find the values of certain parameters in order to include the JCL lines listed under "JCL for SPSS Job to Read ICPSR Data" in your SPSS program that reads the data.

These JCL statements (followed by the beginning of an SPSS program) should be included at the beginning of the SPSS program that you SUBMIT to MVS in order to read the data (for a SAS program, substitute EXEC SAS for EXEC SPSSX). In place of idnn you should type in your CMS User-ID; also, type in your name instead of Your Name on line one. In place of mvspw, put your MVS password. Note that your MVS password was originally the same as your CMS logon password; however, if you changed your logon password, your MVS password remains the same as before. Be sure to use the correct password in this field!

The information that you must determine and supply in order to run this program is that which is specific to the particular data file you want to use. This information goes in the Data Definition (DD) statement of the program code, the line that begins with //DATAIN DD. The fields of information unique to this study are the file name (DSN), the type of tape (UNIT), the Volume/Serial number (VOL=SER), and the file number (the first value in LABEL).

As mentioned previously, the Volume/Serial number is determined from a search of the CMS archive listing file, current icpsr d. You will use this Volume/Serial number in a separate MVS program to gather the remaining information (file name, file number, and tape format) needed to write the above Data Definition statement. This latter MVS program is commonly referred to as a "tape map" program. It describes the file names and file numbers for each file contained on a UNT tape. (Note: each Volume/Serial number identifies a unique UNT archive tape, and each tape contains from one to many files. Hence, the file(s) you are interested in may be just one file among many on a UNT archive tape.)


ICPSR Tapemap Program

//idnnMAP JOB (idnn,:30,1),your name,CLASS=B,PASSWORD=XXXXXX
/*ROUTE PRINT UNTVM1.idnn
/*ROUTE PUNCH UNTVM1.idnn
//TAPEMAP  PROC VOL=IDUNNO
//MAPPIT EXEC PGM=TAPEMAPS
//SYSUT1 DD LABEL=(1,BLP,EXPDT=98000),
//  VOL=SER=&VOL,DISP=SHR,UNIT=TAPE9
//STEPLIB DD DSN=SYS2.A000.MVS.UTILS.LOAD,DISP=SHR
//SYSPRINT DD SYSOUT=A
//SYSUDUMP DD SYSOUT=A
// PEND
//MAP EXEC TAPEMAP,VOL=105595


JCL for SPSS Job to Read ICPSR Data

//idnnSPSS JOB (idnn,:05,1),Your Name,CLASS=A,PASSWORD=mvspw
/*ROUTE PUNCH UNTVM1.idnn
/*ROUTE PRINT UNTVM1.idnn
// EXEC SPSSX
//DATAIN DD DSN=ICPSR.DA9271,UNIT=TAPE9,DISP=SHR,
//  VOL=SER=105595,LABEL=(20,SL)
data list file = DATAIN
 /v1 1-2 v2 3-5...

Mapping a Tape

An example of the tape map program that you will need to run in order to determine the specific information you need to access a dataset appears under the heading "ICPSR Tapemap Program." A sample tape map program is also located on the CMS public d disk; you may copy it to your own CMS account, modify it for your particular use, and submit it from your account. To copy this program, called icpsr tapmap d, to your CMS account, issue the following command from the CMS Ready; prompt:

copyfile icpsr tapmap d = = a

The tape map program that you would SUBMIT for the current example (study number 9271 on Volume/Serial = 105595) would appear as shown under "ICPSR Tapemap Program."

In place of idnn, your name, and XXXXXX, put your CMS user id, actual name, and MVS password, respectively. The UNIT parameter refers to the kind of tape the data is stored on in the UNT archive. Possibilities include either reel (designated as TAPE9) or cartridge (designated as TAPECR). You can tell which of these to use by noting the value of the Volume/Serial number associated with the tape. Volume/Serial numbers beginning with a 1 (that is, from 100000 to 199999) are reel format; for these you would use UNIT=TAPE9 in the above tape map program. Volume/Serial numbers beginning with a 3 (that is, from 300000 to 399999) are cartridge format; for these you would use UNIT=TAPECR in the above tape map program. Since, in the present example, the Volume/Serial number begins with a 1 (105595), we would use UNIT=TAPE9. Finally, in the last line of the program, use the actual Volume/Serial number that you found in your search of the current icpsr d document on CMS.
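The rule for choosing the UNIT value can be written down compactly. The Python sketch below simply restates the paragraph above; the function name unit_for_volser is hypothetical, and rejecting any Volume/Serial number that does not begin with 1 or 3 is an assumption, because the article gives no rule for other ranges.

```python
def unit_for_volser(volser):
    """Return the JCL UNIT value for a six-digit UNT Volume/Serial number."""
    volser = str(volser).strip()
    if len(volser) != 6 or not volser.isdigit():
        raise ValueError("expected a six-digit Volume/Serial number: %r" % volser)
    if volser[0] == "1":   # 100000 to 199999: reel
        return "TAPE9"
    if volser[0] == "3":   # 300000 to 399999: cartridge
        return "TAPECR"
    # Assumption: the article describes no other ranges, so refuse to guess.
    raise ValueError("no tape type is documented for %s" % volser)


print(unit_for_volser("105595"))
```

For the tape in the current example, 105595, the function returns TAPE9, matching the UNIT=TAPE9 setting in the tape map program.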

After you submit the tape map program to MVS, the output will be returned to your reader list. All of the information relevant to your needs will be located near the end of this output. For the current example, the output of interest looks like that in the sample table.

Interpreting the output

It is useful to be able to make heads or tails of this information. You will notice that on this particular UNT tape there are a total of 22 files. All of the files appear to be various kinds of ICPSR files; that is, each is associated with a particular ICPSR study. Notice that only one file is associated with the study we are interested in: file number 20, with the file name ICPSR.DA9271. Information about the nature of each file is contained in the file name itself. The letters preceding the ICPSR study numbers in the file names essentially tell what kind of file each is. The following key may be useful in interpreting these letters:

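From the tape map program, the following information is gathered: the file name (ICPSR.DA9271) and the file number (20). The type of tape (reel, hence UNIT=TAPE9) follows from the Volume/Serial number (105595).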

This information is included in the data definition (DD) statement of the SPSS or SAS program that you submit to read the data. The listing under "Sample JCL to Access File ICPSR.DA9271" is an example of the JCL required to access this data; the filename, file number, type of tape, and Volume/Serial number all appear in its //DATAIN DD statement.
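To show how the four values fit together, the Python sketch below assembles the two-line DD statement in the same layout used in this article's JCL listings. It is a convenience for illustration only: the function name datain_dd and its parameters are hypothetical, and the sketch assumes a standard-label (SL) tape opened with DISP=SHR, as in the article's examples.

```python
def datain_dd(dsn, unit, volser, file_number, ddname="DATAIN"):
    """Format a two-line DD statement for one file on a UNT archive tape."""
    # First line ends with a comma, so the statement continues on the next line.
    first = "//%s DD DSN=%s,UNIT=%s,DISP=SHR," % (ddname, dsn, unit)
    # Continuation line: Volume/Serial number, then file number and label type.
    second = "//  VOL=SER=%s,LABEL=(%s,SL)" % (volser, file_number)
    return first + "\n" + second


# Values for the current example, taken from the index search and the tape map.
print(datain_dd("ICPSR.DA9271", "TAPE9", "105595", 20))
```

The same function covers a codebook file by changing the arguments, for example ddname="SYSUT1" with file name ICPSR.CB8352 and file number 12.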


Click here for a sample table


In order to write an SPSS data list or SAS input statement to read the data, you need to know which variables are located in which columns of the raw data. This information is usually found in a codebook associated with the particular study/dataset. For many ICPSR studies, codebooks come in hard copy format and are available at Willis Library. Some studies, however, do not come with hard copy versions of the codebook; these usually have electronic versions that can be downloaded from tape, stored in your CMS account, and read within CMS (e.g., using browse) or sent from CMS to the printer in ISB to generate a printout version. In order to download an electronic version of an ICPSR codebook from MVS to your CMS account, you need to know the same information about the codebook file as you did for the actual data file. That is, you need to know the Volume/Serial number, filename, file number, and the type of tape it is stored on in the UNT tape archive. Codebook files are located on the same tapes as the corresponding study data; hence, the Volume/Serial number is the same as for the data file. The other information (filename, file number, type of tape) is found in the same way as for the data file: it appears in the same tape map output that was generated to find information on the data file.

Another Example

In the above example, there was no codebook associated with the study. In cases such as this, check with the library to see if a hard copy is available. If, instead of the health survey, we had been interested in ICPSR study number 8352, the German Election Study, 1983, we would have found an electronic version of the codebook. In the tape map output, we can see that file number 12, with filename ICPSR.CB8352, is the codebook for the German Election data. Since it is located on the same Volume/Serial tape (105595), it is also on reel (hence, UNIT=TAPE9). With this information, it is possible to run an IEBGENER program that will read this file (the codebook) off the MVS tape and write it to your CMS account. The program to do this is shown under "Sample JCL to Read and Write a Codebook File."

Sample JCL to Access File ICPSR.DA9271

//idnnSPSS JOB (idnn,:05,1),Your Name,CLASS=A,PASSWORD=XXXXXX
/*ROUTE PUNCH UNTVM1.idnn
/*ROUTE PRINT UNTVM1.idnn
// EXEC SPSSX
//DATAIN DD DSN=ICPSR.DA9271,UNIT=TAPE9,DISP=SHR,
//  VOL=SER=105595,LABEL=(20,SL)
data list file = DATAIN
 /v1 1-2 v2 3-5...

The information derived from the tape map procedure appears in the //DATAIN DD statement; be sure to customize this file by supplying your own CMS user ID, your name, and your MVS password where appropriate.

After this job executes successfully, a file (i.e., the codebook) will appear in your CMS reader, which you may then receive into your CMS files. You can then read the codebook directly or make a printout of it (using the mainframe printer) in order to obtain the information necessary to write the data list (SPSS) or input (SAS) statement to read the data. From there on out, it's your research... you can do any statistical analysis of the data that you wish.


Sample JCL to Read and Write a Codebook File

//idnnCDBK JOB (idnn,1,15),Your Name,CLASS=B,PASSWORD=XXXXXX
/*ROUTE PUNCH UNTVM1.idnn
/*ROUTE PRINT UNTVM1.idnn
// EXEC PGM=IEBGENER
//SYSPRINT DD SYSOUT=(A,,LP2X)
//SYSIN DD DUMMY
//SYSUT1 DD DSN=ICPSR.CB8352,UNIT=TAPE9,DISP=SHR,
// VOL=SER=105595,LABEL=(12,SL)
//SYSUT2 DD SYSOUT=A
/*
//
