1
MARC Content Designation and Utilization

  • An Empirical Investigation of Metadata Utilization by Library Catalogers


  • The MARC Content Designation Utilization Project


  • William E. Moen
  • <wemoen@unt.edu>
    School of Library and Information Sciences
    Texas Center for Digital Knowledge
    University of North Texas
2
Metadata – Rules & Practice
  • Library catalogers create metadata – bibliographic records
  • Follow cataloging rules and other standards to create the bibliographic data
  • Encode the bibliographic data into MARC records
  • MARC – communications format and metadata scheme
  • Approximately 2,000 structures for encoding data
3
Richness of MARC
4
What do catalogers use?
  • Given the cataloging rules…
  • Given the detailed structuring of bibliographic data in MARC records…
  • Given training of the catalogers…
  • Given local policies and practices…
  • What can we learn by examining a large set of MARC bibliographic records?
5
The MCDU Project
  • MARC Content Designation Utilization
    • Provide empirical evidence of catalogers’ use of MARC content designation
    • Identify commonly used elements of bibliographic records
    • Contribute to community discussion about core elements in MARC bibliographic records
    • Explore the evolution of MARC content designation
    • Develop research approach to understand the factors influencing levels of MARC content designation use
6
Metadata Record as Artifact
  • Metadata creation as process
  • Decisions by metadata creators
  • Influenced by… ?
  • Artifact: something created by humans for some practical or utilitarian purpose
  • Artifact reflects decisions, policies…
  • Artifact reflects metadata utilization decisions
    • Decisions to use or not use available metadata elements
7
Metadata Utilization Analysis
  • Developing a re-usable methodology
  • Developing software tools and database design for storing large datasets
  • Identifying questions to be answered
  • Determining methods for analyzing data to address questions
  • Compiling results


8
Dataset and preparation
  • 56,177,383 MARC 21 Bibliographic Records from OCLC WorldCat
  • Decomposed the records to store in MySQL
    • Parsing Tool
    • 82 hours to process and load records
    • 295 GB final database size (with indexing)
  • Structure of the decomposed records aligns with the analytical questions
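A minimal sketch of what "decomposing" a record could look like, assuming the records arrive in the standard MARC 21 exchange layout (24-character leader, 12-character directory entries, field terminator 0x1E, subfield delimiter 0x1F). This is not the project's Parsing Tool, which the slides do not describe. The row shapes, the function names `decompose` and `build_record`, and every column name other than `06_RecType` are hypothetical.

```python
FIELD_END = b"\x1e"   # MARC 21 field terminator
SUBFIELD = b"\x1f"    # MARC 21 subfield delimiter
RECORD_END = b"\x1d"  # MARC 21 record terminator


def decompose(record: bytes, record_id: int):
    """Split one MARC 21 record into leader, field, and subfield rows."""
    leader = record[:24].decode("ascii")
    base = int(leader[12:17])            # base address of data
    directory = record[24:base - 1]      # directory ends just before its terminator
    leader_row = {
        "record_id": record_id,
        "00_RecLength": int(leader[0:5]),  # hypothetical column name
        "05_RecStatus": leader[5],         # hypothetical column name
        "06_RecType": leader[6],           # column name used in the slides' SQL
        "17_EncLevel": leader[17],         # hypothetical column name
        "18_DescForm": leader[18],         # hypothetical column name
    }
    field_rows, subfield_rows = [], []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base + start: base + start + length - 1]  # drop terminator
        if tag < "010":
            # Control fields (00X) have no indicators or subfields.
            field_rows.append((record_id, tag, None, None, data.decode("utf-8")))
            continue
        field_rows.append((record_id, tag, chr(data[0]), chr(data[1]), None))
        for chunk in data[2:].split(SUBFIELD)[1:]:
            if chunk:
                subfield_rows.append(
                    (record_id, tag, chr(chunk[0]), chunk[1:].decode("utf-8"))
                )
    return leader_row, field_rows, subfield_rows


def build_record(fields):
    """Assemble a tiny test record from (tag, body) pairs; for illustration only."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


sample = build_record([
    ("001", b"ocm00000001"),
    ("245", b"10" + SUBFIELD + b"aTitle /" + SUBFIELD + b"cAuthor."),
])
leader_row, field_rows, subfield_rows = decompose(sample, record_id=1)
```

Each returned list maps naturally to one table (leader, fields, subfields), which is one way the stored structure could line up with counting questions about fields and subfields.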
9
Analytical questions
  • What is the average length of the records?
  • What is the frequency of types of records?
  • What is the status of the record?
  • What is the frequency of encoding levels?
  • What is the frequency of the descriptive cataloging forms?
  • What are the total occurrences of all control and data fields?
  • What are the total occurrences of each control and data field?
  • What are the total occurrences of all subfields?
  • What are the total occurrences of each subfield?
  • What percentage of records contains at least one occurrence of each control and data field?
  • What percentage of records contains at least one occurrence of each subfield?
10
Additional data preparation
  • Analysis required frequency counts for each of ten formats of material
  • Concern about significant differences in patterns of utilization between Library of Congress and OCLC member cataloging
  • Partitioned decomposed data into 20 databases
    • Based on source of cataloging
    • Based on format of material
11
Querying database
  • Q08: What is the distribution of records by Type of Record?
  • Natural Language query: Interrogate table LEADER, field 06_RecType. Count occurrence of each possible value of field 06_RecType and group by 06_RecType.
  • SQL query: select `06_RecType`, count(`06_RecType`), count(`06_RecType`)/(select count(`06_RecType`) from `Leader`) from `Leader` group by `06_RecType`
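The Q08 query can be tried on a toy table. The sketch below uses Python's built-in `sqlite3` module rather than MySQL, so it assumes two small adaptations: identifiers are double-quoted, and the ratio is multiplied by 1.0 because SQLite would otherwise perform integer division. The sample values are invented and are not project data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Leader" (record_id INTEGER PRIMARY KEY, "06_RecType" TEXT)'
)

# Invented Type of Record codes for eight toy records.
sample_types = ["a", "a", "a", "c", "g", "a", "e", "a"]
conn.executemany(
    'INSERT INTO "Leader" ("06_RecType") VALUES (?)',
    [(value,) for value in sample_types],
)

# Same shape as Q08: count per value, plus each value's share of all records.
query = """
SELECT "06_RecType",
       COUNT("06_RecType") AS occurrences,
       1.0 * COUNT("06_RecType")
           / (SELECT COUNT("06_RecType") FROM "Leader") AS share
FROM "Leader"
GROUP BY "06_RecType"
ORDER BY occurrences DESC, "06_RecType"
"""
distribution = conn.execute(query).fetchall()

for rec_type, occurrences, share in distribution:
    print(f"{rec_type}: {occurrences} records ({share:.1%})")
```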
12
Questions to Answer
  • Two categories of questions
    • General profile of the dataset
      • What is the distribution of records by Type of Record?
      • What is the distribution of records by Length of Record?
    • Occurrences of content designation structures
      • What is the number of total occurrences of all control and data fields and how many unique field tags are used?
      • In how many and in what percentage of records is each unique field/subfield combination used at least once?
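The "at least once" question differs from a raw occurrence count because a repeated field must be counted only once per record. A minimal sketch of that calculation follows, assuming subfield rows shaped as (record id, field tag, subfield code). The function name `presence_profile` and the sample rows are hypothetical.

```python
from collections import Counter


def presence_profile(rows, total_records):
    """Return {(tag, code): (record_count, percent_of_records)}.

    rows: iterable of (record_id, tag, subfield_code) tuples.
    """
    # Collapse repeats so each record contributes at most once per combination.
    seen = {(record_id, tag, code) for record_id, tag, code in rows}
    counts = Counter((tag, code) for _, tag, code in seen)
    return {
        key: (n, 100.0 * n / total_records) for key, n in counts.items()
    }


# Invented rows: record 1 repeats 650 $a, which must still count once.
rows = [
    (1, "245", "a"), (1, "650", "a"), (1, "650", "a"),
    (2, "245", "a"), (3, "245", "a"), (4, "245", "a"),
]
profile = presence_profile(rows, total_records=4)
```

Here 650 $a has two occurrences but is present in one of four records, so its presence rate is 25%, while its share of total occurrences would be a different figure.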
13
Occurrence Summary
14
Example Results
  • 7,595,887 LC-created records in dataset
  • Type of Record: Books, Pamphlets, and Printed Sheets
  • Total number of unique fields: 167
  • Number of fields accounting for 80% of occurrences: 14 fields (8.3%)
  • Number of fields accounting for 90% of occurrences: 21 fields (12.6%)
  • Approximately 110 fields occur in less than 1% of all records.
  • [Note: Fields are cataloger-supplied, not system-supplied]
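Figures such as "fields accounting for 80% of occurrences" can be derived by ranking fields by occurrence count and accumulating until the target share is reached. The sketch below shows one way to do that. The function name `fields_to_cover` is hypothetical and the counts are invented for illustration, not taken from the project's results.

```python
def fields_to_cover(occurrences, share):
    """Number of most-frequent fields needed to reach `share` of all occurrences.

    occurrences: dict mapping field tag -> total occurrence count.
    share: target fraction of all occurrences, e.g. 0.8 for 80%.
    """
    total = sum(occurrences.values())
    ranked = sorted(occurrences.values(), reverse=True)
    running = 0
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return rank
    return len(ranked)


# Invented occurrence counts for five fields (100 occurrences in total).
toy_counts = {"245": 50, "260": 25, "650": 14, "700": 8, "246": 3}

n80 = fields_to_cover(toy_counts, 0.8)   # 50 + 25 + 14 = 89 of 100
n90 = fields_to_cover(toy_counts, 0.9)   # 50 + 25 + 14 + 8 = 97 of 100
pct80 = 100.0 * n80 / len(toy_counts)    # share of unique fields needed
```

Dividing the rank by the number of unique fields gives the percentage of fields, which is how a pairing such as "14 fields" out of 167 unique fields would be expressed.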
15
 
16
Making sense of numbers
  • Frequency counts provide raw but informative data
  • Threshold – concept to delineate a change in trend in utilization
  • Determining commonly occurring elements
    • Comparing to recommended core records
    • Comparing to recommendations for national level records
    • Comparing to the FRBR user tasks data
17
Implications
  • Empirical basis for decisions about core elements in a metadata scheme
  • Profiling repositories of metadata for aggregators
  • Organizations can use methodologies and tools to analyze local utilization levels
  • Contributions to changes in cataloging rules, practices, policies, and standards
18
References
  • MARC Content Designation Utilization Project
    • http://www.mcdu.unt.edu/
  • Assessing Metadata Utilization: An Analysis of MARC Content Designation Use
    • http://www.unt.edu/wmoen/publications/MARCPaper_Final2003pdf.pdf
  • Format Content Designation Analysis: Set Profiling and Analysis Queries
    • http://www.mcdu.unt.edu/wp-content/FANSetProfilingAnalysisQuerieswemFinal20Dec05.pdf