Benchmarks Online


Campus Computing News

UNT Website Scheduled for Harvest*

By Cathy N. Hartman, Assistant Dean, Digital and Information Technologies
University of North Texas Libraries and Fellow, Texas Center for Digital Knowledge

Information Technology Services (ITS) in the UNT Libraries will harvest the www.unt.edu Web presence next week. This snapshot of the UNT Web presence will be taken each academic year during the fall and spring semesters. The captured files will become part of the University Archives, recording the history of our Web publishing.

Who will conduct the harvest: the Digital Projects Unit, Willis Library. The Digital Projects Unit of the Libraries has significant experience with Web harvesting. Working on a research project, the unit harvested more than 40 million pages of Web content in the past few months.

What will be harvested: the www.unt.edu Web presence down to the third domain name level.

Dates: Monday, November 21 – Tuesday, November 22, 2005

About the Harvesters

We will run two separate harvests of the Web presence, one pass with each of the following pieces of software:

[Heritrix] - Heritrix Web Harvester created by the Internet Archive

[HTTrack] - Harvester we used to successfully harvest 2.3 TB of .gov content last December.
 
IP for the harvester: 129.120.92.227
 
User Agent for the harvester:
 
Mozilla/5.0 (compatible; unt_heritrix/1.4.0 +http://libharvest.library.unt.edu)

Mozilla/5.0 (compatible; unt_httrack/1.4.0 +http://libharvest.library.unt.edu)
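Administrators who want to pick out harvest traffic in their own logs could match on the IP address and User Agent strings listed above. The following is only a minimal sketch: the function name `is_unt_harvester` and the sample requests are hypothetical, and how the IP and User Agent are read out of a log depends on each server's log format.

```python
# Values taken from the announcement above.
HARVESTER_IP = "129.120.92.227"
HARVESTER_AGENT_TOKENS = ("unt_heritrix", "unt_httrack")


def is_unt_harvester(remote_ip: str, user_agent: str) -> bool:
    """Return True when a request matches the announced harvester.

    Both the IP address and one of the User Agent tokens must match,
    because a User Agent string alone can be sent by any client.
    """
    agent = user_agent.lower()
    has_token = any(token in agent for token in HARVESTER_AGENT_TOKENS)
    return remote_ip == HARVESTER_IP and has_token


if __name__ == "__main__":
    # Hypothetical (ip, user agent) pairs, as they might be read from a log.
    sample_requests = [
        ("129.120.92.227",
         "Mozilla/5.0 (compatible; unt_heritrix/1.4.0 "
         "+http://libharvest.library.unt.edu)"),
        ("192.0.2.10", "Mozilla/5.0 (compatible; SomeOtherBot/1.0)"),
    ]
    for ip, agent in sample_requests:
        label = "harvester" if is_unt_harvester(ip, agent) else "other"
        print(f"{ip}  {label}")
```

Requiring both values to match is a deliberate choice in this sketch; matching on the User Agent token alone would also flag any client that copies the string.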

Web Server Administrators can help by:

  • Reviewing your robots.txt file. The file should exclude content that should not be harvested in this project. Remember, the idea is to capture as much "good" content as we can, so keep that in mind when you are editing your robots.txt file.
  • Being patient. Web harvesting at this scale is still an inexact process. If the harvester behaves strangely on your server, please contact Mark Phillips [mphillips@library.unt.edu].
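One way to review a robots.txt file before the harvest is to test a few representative paths against it. The sketch below uses Python's standard `urllib.robotparser` module. The sample rules and paths are placeholders rather than real UNT content, and it assumes the harvesters honor `User-agent: *` rules the way this parser does, so treat the result as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the excluded paths are placeholders.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /drafts/
"""

# Short tokens taken from the two announced User Agent strings.
HARVESTER_TOKENS = ("unt_heritrix", "unt_httrack")


def check_paths(robots_txt, paths, agents=HARVESTER_TOKENS):
    """Map each (agent, path) pair to True if the rules allow fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }


if __name__ == "__main__":
    paths = ["/", "/news/index.htm", "/drafts/old.htm"]
    for (agent, path), allowed in check_paths(SAMPLE_ROBOTS_TXT, paths).items():
        status = "allowed" if allowed else "excluded"
        print(f"{agent:13} {path:20} {status}")
```

If a page you consider "good" content shows up as excluded, the corresponding `Disallow` line is probably broader than intended.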

* DEFINITION: Web harvesting (also known as Web farming, Web mining and Web scraping) is the process of gathering and organizing unstructured information from pages and data on the World Wide Web.


 

Please note that information published in Benchmarks Online is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - http://www.unt.edu . You can also search Benchmarks Online - http://www.unt.edu/benchmarks/archives/back.htm as well as consult the UNT Helpdesk - http://www.unt.edu/helpdesk/ Questions and comments should be directed to
benchmarks@unt.edu

 
