|
1
|
- An Empirical Investigation of Metadata Utilization by Library Catalogers
- The MARC Content Designation Utilization Project
- William E. Moen
- <wemoen@unt.edu>
School of Library and Information Sciences
Texas Center for Digital Knowledge
University of North Texas
|
|
2
|
- Library catalogers create metadata – bibliographic records
- Follow cataloging rules and other standards to create the bibliographic
data
- Encode the bibliographic data into MARC records
- MARC – communications format and metadata scheme
- Approximately 2,000 structures for encoding data
|
|
3
|
|
|
4
|
- Given the cataloging rules…
- Given the detailed structuring of bibliographic data in MARC records…
- Given training of the catalogers…
- Given local policies and practices…
- What can we learn by examining a large set of MARC bibliographic
records?
|
|
5
|
- MARC Content Designation Utilization
- Provide empirical evidence of catalogers’ use of MARC content
designation
- Identify commonly used elements of bibliographic records
- Contribute to community discussion about core elements in MARC
bibliographic records
- Explore the evolution of MARC content designation
- Develop research approach to understand the factors influencing levels
of MARC content designation use
|
|
6
|
- Metadata creation as process
- Decisions by metadata creators
- Influenced by… ?
- Artifact: something created by humans for some practical or utilitarian
purpose
- Artifact reflects decisions, policies…
- Artifact reflects metadata utilization decisions
- Decisions to use or not use available metadata elements
|
|
7
|
- Developing a re-usable methodology
- Developing software tools and database design for storing large datasets
- Identifying questions to be answered
- Determining methods for analyzing data to address questions
- Compiling results
|
|
8
|
- 56,177,383 MARC 21 Bibliographic Records from OCLC WorldCat
- Decomposed the records to store in MySQL
- Parsing Tool
- 82 hours to process and load records
- 295 GB final database size (with indexing)
- Structure of the decomposed records aligns with the analytical questions
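As a rough illustration of what "decomposing" a record might involve, the sketch below splits one MARC 21 record into leader, field, and subfield rows that could each be loaded into a relational table. It is not the project's Parsing Tool. Apart from `06_RecType`, which appears in the project's own query, the column names are hypothetical, and `build_record` exists only to fabricate a tiny test record.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def build_record(fields):
    """Assemble a minimal MARC 21 record from (tag, body) pairs (demo helper only)."""
    directory = b""
    data = b""
    for tag, body in fields:
        field_bytes = body + FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(field_bytes):04d}{len(data):05d}".encode("ascii")
        data += field_bytes
    base = 24 + len(directory) + 1          # leader + directory + its terminator
    total = base + len(data) + 1            # plus the record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_TERMINATOR + data + RECORD_TERMINATOR


def decompose(record):
    """Split one record into a leader row, field rows, and subfield rows."""
    leader = record[:24].decode("ascii")
    base = int(leader[12:17])               # base address of data
    directory = record[24:base - 1]
    leader_row = {
        "RecLength": int(leader[0:5]),      # hypothetical column name
        "05_RecStatus": leader[5],          # hypothetical column name
        "06_RecType": leader[6],            # name taken from the project's query
        "17_EncLevel": leader[17],          # hypothetical column name
        "18_DescCatForm": leader[18],       # hypothetical column name
    }
    field_rows, subfield_rows = [], []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = record[base + start:base + start + length - 1]  # drop terminator
        if tag.startswith("00"):
            # Control fields carry no indicators or subfields.
            field_rows.append((tag, None, None, body.decode("utf-8")))
        else:
            field_rows.append((tag, chr(body[0]), chr(body[1]), None))
            for chunk in body[2:].split(SUBFIELD_DELIMITER)[1:]:
                subfield_rows.append((tag, chr(chunk[0]), chunk[1:].decode("utf-8")))
    return leader_row, field_rows, subfield_rows


record = build_record([
    ("001", b"ocm00000001"),
    ("245", b"10\x1faExample title /\x1fcAn author."),
])
leader_row, field_rows, subfield_rows = decompose(record)
```

Keeping leader values, fields, and subfields in separate row sets is one way to make the later frequency questions simple grouped counts; the project's actual schema may differ.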
|
|
9
|
- What is the average length of the records?
- What is the frequency of types of records?
- What is the status of the record?
- What is the frequency of encoding levels?
- What is the frequency of the descriptive cataloging forms?
- What are the total occurrences of all control and data fields?
- What are the total occurrences of each control and data field?
- What are the total occurrences of all subfields?
- What are the total occurrences of each subfield?
- What percentage of records contains at least one occurrence of each
control and data field?
- What percentage of records contains at least one occurrence of each
subfield?
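The last few questions distinguish total occurrences of a field from the share of records that use it at least once. A minimal sketch of that distinction, assuming each record has already been reduced to a list of its field tags (the sample data here is invented for illustration):

```python
from collections import Counter


def field_utilization(records):
    """Return {tag: (total occurrences, records using it, percent of records)}."""
    total_occurrences = Counter()
    records_with_field = Counter()
    for tags in records:
        total_occurrences.update(tags)        # every occurrence counts
        records_with_field.update(set(tags))  # each record counts once per tag
    n = len(records)
    return {
        tag: (
            total_occurrences[tag],
            records_with_field[tag],
            100.0 * records_with_field[tag] / n,
        )
        for tag in sorted(total_occurrences)
    }


# Invented sample: four records, each listed by the field tags it contains.
sample_records = [
    ["245", "650", "650"],
    ["245", "100"],
    ["245", "650"],
    ["245"],
]
utilization = field_utilization(sample_records)
```

In this sample, field 650 occurs three times but appears in only two of the four records, which is why the two measures are reported separately.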
|
|
10
|
- Analysis required determining frequency counts for each of ten formats of
material
- Concern about significant differences in patterns of utilization between
Library of Congress and OCLC member cataloging
- Partitioned decomposed data into 20 databases
- Based on source of cataloging
- Based on format of material
|
|
11
|
- Q08: What is the distribution of records by Type of Record?
- Natural language query: Interrogate table LEADER, field 06_RecType.
Count the occurrences of each value of field 06_RecType, grouped by
06_RecType.
- SQL query: select `06_RecType`, count(`06_RecType`),
count(`06_RecType`)/(select count(`06_RecType`) from `Leader`) from
`Leader` group by `06_RecType`
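To try the Q08 query without a MySQL server, the sketch below runs it against an in-memory SQLite table using Python's standard `sqlite3` module. Two things are assumptions rather than part of the project: the four sample rows, and the `1.0 *` factor, which is added because SQLite performs integer division on integer counts whereas MySQL's `/` returns a decimal. The `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single-column stand-in for the project's Leader table (sample data is invented).
conn.execute("CREATE TABLE `Leader` (`06_RecType` TEXT)")
conn.executemany(
    "INSERT INTO `Leader` VALUES (?)",
    [("a",), ("a",), ("a",), ("g",)],
)

query = """
SELECT `06_RecType`,
       COUNT(`06_RecType`),
       1.0 * COUNT(`06_RecType`) / (SELECT COUNT(`06_RecType`) FROM `Leader`)
FROM `Leader`
GROUP BY `06_RecType`
ORDER BY `06_RecType`
"""
# Each row: (record type code, number of records, proportion of all records).
distribution = conn.execute(query).fetchall()
conn.close()
```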
|
|
12
|
- Two categories of questions
- General profile of the dataset
- What is the distribution of records by Type of Record?
- What is the distribution of records by Length of Record?
- Occurrences of content designation structures
- What is the number of total occurrences of all control and data fields
and how many unique field tags are used?
- In how many and in what percentage of records is each unique
field/subfield combination used at least once?
|
|
13
|
|
|
14
|
- 7,595,887 LC-created records in dataset
- Type of Record: Books, Pamphlets, and Printed Sheets
- Total number of unique fields: 167
- Number of fields accounting for 80% of occurrences: 14 fields (8.3%)
- Number of fields accounting for 90% of occurrences: 21 fields (12.6%)
- Approximately 110 fields occur in less than 1% of all records.
- [Note: Fields are cataloger-supplied, not system-supplied]
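Figures such as "fields accounting for 80% of occurrences" can be derived from per-field occurrence counts by ranking fields from most to least frequent and accumulating until the target share is reached. A minimal sketch, with invented counts and a hypothetical function name:

```python
def fields_covering(occurrences, percent):
    """Number of most-frequent fields needed to reach `percent` of all occurrences."""
    total = sum(occurrences.values())
    running = 0
    ranked = sorted(occurrences.items(), key=lambda item: item[1], reverse=True)
    for rank, (_tag, count) in enumerate(ranked, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return rank
    return len(occurrences)


# Invented occurrence counts per field tag (100 occurrences in total).
sample_counts = {"245": 50, "260": 30, "650": 10, "100": 6, "700": 4}
fields_for_80 = fields_covering(sample_counts, 80)
fields_for_90 = fields_covering(sample_counts, 90)
```

With these sample counts, two of the five fields cover 80% of occurrences and three cover 90%, the same kind of concentration the slide reports for the LC-created records.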
|
|
15
|
|
|
16
|
- Frequency counts provide raw but informative data
- Threshold – concept to delineate a change in trend in utilization
- Determining commonly occurring elements
- Comparing to recommended core records
- Comparing to recommendations for national level records
- Comparing to FRBR user tasks data
|
|
17
|
- Empirical basis for decisions about core elements in a metadata scheme
- Profiling repositories of metadata for aggregators
- Organizations can use methodologies and tools to analyze local
utilization levels
- Contributions to changes in cataloging rules, practices, policies, and
standards
|
|
18
|
- MARC Content Designation Utilization Project
- Assessing Metadata Utilization: An Analysis of MARC Content Designation
Use
- http://www.unt.edu/wmoen/publications/MARCPaper_Final2003pdf.pdf
- Format Content Designation Analysis: Set Profiling and Analysis Queries
- http://www.mcdu.unt.edu/wp-content/FANSetProfilingAnalysisQuerieswemFinal20Dec05.pdf
|