UNT Website Scheduled for Harvest*
By Cathy N. Hartman, Assistant Dean, Digital and Information Technologies, University of North Texas Libraries, and Fellow, Texas Center for Digital Knowledge
Information Technology Services (ITS) in the UNT Libraries will conduct a harvest of the www.unt.edu Web presence next week. This snapshot of the UNT Web presence will be taken each academic year during the fall and spring semesters. The captured files will become part of the University Archives, recording the history of our Web publishing.
Who will conduct the harvest—the Digital Projects Unit, Willis Library. The Digital Projects Unit of the Libraries has significant experience with Web harvesting; working on a research project, it harvested more than 40 million pages of Web content in the past few months.
What will be harvested—the www.unt.edu Web presence down to the third domain name level.
Dates:
Monday, November 21 – Tuesday, November 22, 2005
About the Harvesters
We will be running two different harvests of the Web, one pass for each of the following pieces of software:
- [Heritrix] - Heritrix Web Harvester created by the Internet Archive
- [HTTrack] - Harvester we used to successfully harvest 2.3 TB of .gov content last December.
- IP for the harvester: 129.120.92.227
- User agents for the harvesters:
  - Mozilla/5.0 (compatible; unt_heritrix/1.4.0 +http://libharvest.library.unt.edu)
  - Mozilla/5.0 (compatible; unt_httrack/1.4.0 +http://libharvest.library.unt.edu)
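Administrators who want to confirm which requests came from the harvest can match the announced IP and user-agent tokens against their server access logs. A minimal sketch, assuming an Apache-style combined log format (the sample log lines below are illustrative, not real traffic):

```python
# Identify harvester traffic in an access log by the announced IP
# (129.120.92.227) or the user-agent tokens unt_heritrix / unt_httrack.
HARVESTER_IP = "129.120.92.227"
UA_TOKENS = ("unt_heritrix", "unt_httrack")

def harvester_lines(lines):
    """Yield log lines that came from the announced harvesters."""
    for line in lines:
        if line.startswith(HARVESTER_IP) or any(t in line for t in UA_TOKENS):
            yield line

# Illustrative log lines in Apache combined format
sample = [
    '129.120.92.227 - - [21/Nov/2005:10:00:00 -0600] "GET / HTTP/1.0" 200 512 '
    '"-" "Mozilla/5.0 (compatible; unt_heritrix/1.4.0 +http://libharvest.library.unt.edu)"',
    '10.0.0.5 - - [21/Nov/2005:10:00:01 -0600] "GET /a HTTP/1.1" 200 128 "-" "Mozilla/4.0"',
]
print(len(list(harvester_lines(sample))))  # prints 1 (only the first line matches)
```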
Web server administrators can help by:
- Reviewing your robots.txt file. The file should exclude content that should not be harvested in this project. Remember, the idea is to capture as much "good" content as we can, so keep that in mind when you are editing your robots.txt files.
- Being patient. Web harvesting at this scale is still an inexact process. If the harvester behaves strangely on your server, please contact Mark Phillips [mphillips@library.unt.edu].
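A robots.txt record that excludes one area from these harvesters while leaving the rest of the site open might look like the following (the /private/ path is illustrative; substitute the directories you actually need to exclude):

```
# Rules addressed to the announced harvesters by their user-agent tokens
User-agent: unt_heritrix
User-agent: unt_httrack
Disallow: /private/
```

Remember that an empty `Disallow:` line permits everything, while `Disallow: /` blocks the entire site from that agent.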
* DEFINITION: Web harvesting (also known as Web farming, Web mining, and Web scraping) is the process of gathering and organizing unstructured information from pages and data on the World Wide Web.
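The gather-and-organize step in that definition can be sketched in a few lines. This is only an illustration of the core idea, not how Heritrix or HTTrack work; real harvesters add politeness delays, robots.txt checks, link queues, deduplication, and archival storage. The sample page below is made up:

```python
# Minimal illustration of Web harvesting: parse a fetched page and
# collect the links it contains, which a crawler would then follow.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# In a real crawl the HTML would come from an HTTP fetch of each URL.
page = '<html><body><a href="/about">About</a> <a href="/news">News</a></body></html>'
print(extract_links(page))  # prints ['/about', '/news']
```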