Cluster Computing

By Duane Gustavus, UNIX Research Analyst

Cluster computing has become a popular topic in discussions of computing environments. The sheer volume of PCs manufactured has driven prices down to the point where clusters of low-cost systems are more attractive than the single, more costly machines that have traditionally been the focus of high-performance computing. The classic battle between the "army of ants" and the "bull elephant" has shown that at least some types of processing can be implemented less expensively using parallel processing on large arrays of PC-class hardware [1]. Where maximizing performance is not the sine qua non, however, another interesting feature of compute clusters comes to the fore.

As PC hardware progresses rapidly to ever more powerful processors, it leaves behind a trail of "mid-life cycle" hardware in the form of last year's models. Often the monitor and special-purpose hardware will be salvaged, but that still leaves a significant chunk of processing equipment looking for a new application. A cluster of these older systems can be configured relatively inexpensively to give adequate performance on certain classes of computing jobs which might otherwise have to be done on the desktop.

Desktop computing systems are designed to place a premium on responsiveness because that measure is most important to the interactive user. In this environment, work is considered compute-intensive if it degrades interactive performance, or even if it merely requires a long period to complete (where "long period" is defined by each user on an individual basis). The prospect of moving these non-interactive tasks to another computing platform, either to speed up processing or to minimize the impact on local computing performance, is behind this investigation. The goal of this paper is to document the implementation of a "render-farm" which uses recycled desktop systems to increase the speed of producing an image with a raytracer. This application was chosen because raytracing is a simple problem to adapt to parallel processing. It should be noted that in most cases the software problem (programming a parallel solution) is much more demanding than the implementation of cluster hardware. Nonetheless, it is hoped this discussion will prove interesting and assist in increasing the availability of compute clusters for research purposes.

Requirements

A cluster obviously requires a group of computer systems connected via a network. In order to make this group of networked PCs look like a single "virtual" machine, an application program interface (API) needs to provide the ability to start processes on other nodes in the cluster and exchange information with them. One such freely available API is PVM, or Parallel Virtual Machine. PVM assumes a single master and multiple slave nodes, where the master distributes work to the slaves. In order to make this cluster accessible from a public network, the master will require two Network Interface Cards (NICs), one configured for the public network and one for the cluster network. It is possible to use the public network for cluster access as well; however, network congestion will degrade the performance of the cluster, and the cluster will add to network congestion. The cost of an extra NIC for the master and a hub to connect cluster nodes should be considerably under $100.00. To maximize performance, high-speed switched network connections can be used, but as cluster price and the recycling of older technology are prime considerations in this discussion, 10Base-T network speeds and a hub will be assumed.

The cluster nodes in this example are running FreeBSD release 3.3 [2] and PVM release 3.4.2 [3]. PVM has been developed on UNIX platforms, but there is a Windows port of PVM 3.4.2. One of the freely available versions of UNIX (Linux or a BSD variant) would seem to meet the cost consideration best. While a node with 16 MB of RAM is possible, 32 MB is a more useful memory configuration. Disk storage space for the slave nodes is reasonably met by a single 1 GB drive. It is possible, of course, to get by with about half that space, but the current low cost of disk drives argues for the 1 GB drives for ease of installation and maintenance. The master node should have adequate disk space to provide users with home directory storage. The master node will also need a monitor, and all the nodes require a keyboard of some sort.

The POV raytracer is a popular, freely available application used to generate images from description files. It has been modified to utilize PVM in order to take advantage of the parallelization available on a cluster. This "unofficial" modification is available on the network [4] in source code form. The pvmpov program is the application example used for this cluster, but any PVM-capable program could be used.
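To make the master/slave division of labor concrete, the following is a minimal sketch in Python of how an image might be cut into blocks and handed to whichever node is idle. It is a simulation only and makes no PVM calls; the 32-pixel block size, the function names make_tiles and simulate_dispatch, and the relative node speeds are all assumptions chosen for illustration, not details taken from pvmpov.

```python
import heapq


def make_tiles(width, height, tile=32):
    """Split a width x height image into (x, y, w, h) blocks, row by row.

    The 32-pixel default is an illustrative assumption.
    """
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            tiles.append((x, y, min(tile, width - x), min(tile, height - y)))
    return tiles


def simulate_dispatch(tiles, node_speeds):
    """Hand each tile to whichever node becomes idle first.

    node_speeds maps a node name to pixels rendered per unit of time
    (made-up relative figures, not measurements).
    Returns (finish_time, tiles_rendered_per_node).
    """
    # Priority queue of (time at which the node is idle, node name).
    idle = [(0.0, name) for name in sorted(node_speeds)]
    heapq.heapify(idle)
    counts = {name: 0 for name in node_speeds}
    finish = 0.0
    for (_x, _y, w, h) in tiles:
        t, name = heapq.heappop(idle)
        done = t + (w * h) / node_speeds[name]
        counts[name] += 1
        finish = max(finish, done)
        heapq.heappush(idle, (done, name))
    return finish, counts


if __name__ == "__main__":
    tiles = make_tiles(640, 480)
    # Hypothetical speeds loosely proportional to clock rate.
    speeds = {"master": 166.0, "slave1": 200.0, "slave2": 200.0}
    finish, counts = simulate_dispatch(tiles, speeds)
    print(len(tiles), "tiles rendered:", counts)
```

Because each block is independent of the others, a faster node simply ends up rendering more blocks, which is why raytracing adapts so readily to a cluster of mismatched machines.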

Implementation

As an example, these notes describe an installation with four slave nodes that have 200 MHz Pentium MMX processors, and one master with a 166 MHz Pentium CPU. All nodes have 32 MB of RAM and a 10Base-T network interface, with a second NIC added to the master node. The slave nodes each have a 1 GB IDE hard disk, while the master node has a 6 GB IDE hard disk. PVM is installed on all the nodes under /usr/local/pvm3. For PVM applications to work, the source directory and application executables must be available on all nodes using the same path. For this reason, home directories are placed in a separate file system on the master node (/home in this case) which is exported to all other nodes. The amd program (BSD's automounter) is used to mount the home directory on a slave node as required. The master node also provides NIS service supplying the password and group files to the cluster as well as the automounter map (/etc/amd.home in this case). Each slave node is an NIS slave which can only bind to the master node. This allows a user to be added to the cluster by making an account on the master, and adding that user to the NIS map files. There is no requirement that the slave nodes have any user-specific information on them, so changing the membership of slave nodes in the cluster is less complicated.

Installation

The PVM 3.4.2 release compiles without modification on FreeBSD 3.3. The single modification made in this case was to change the remote command used to start pvmd on slave nodes from rsh to ssh. The FREEBSD.def architecture configuration file can be found in the conf subdirectory if you wish to make this modification. The instructions for building PVM are accurate; it is important to set the environment variable PVM_ROOT to /usr/local/pvm3 before you build the software. Each node should have the system-wide shell configuration files modified so that users have appropriate values for the environment variables PVM_ROOT, PVM_ARCH and RCMD_CMD. In addition, the normal user search path should be extended to include /usr/local/pvm3/bin/FREEBSD and /usr/local/pvm3/lib. The PVM source contains some test programs to verify functionality. You should ensure that PVM is working correctly before proceeding to install pvmpov.
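A quick way to confirm that a node's shell configuration matches the settings described above is a small check script. The sketch below is in Python; the function name check_pvm_env is hypothetical, the PVM_ARCH value of FREEBSD is inferred from the FREEBSD.def file and the bin/FREEBSD directory named in the text, and since no value for RCMD_CMD is given here, the script only checks that it is set.

```python
import os

PVM_ROOT = "/usr/local/pvm3"
PVM_ARCH = "FREEBSD"  # inferred from the bin/FREEBSD directory name


def check_pvm_env(env):
    """Return a list of problems found in env, a dict of environment variables."""
    problems = []
    if env.get("PVM_ROOT") != PVM_ROOT:
        problems.append("PVM_ROOT should be " + PVM_ROOT)
    if env.get("PVM_ARCH") != PVM_ARCH:
        problems.append("PVM_ARCH should be " + PVM_ARCH)
    # The expected value is site-specific, so only test that it exists.
    if not env.get("RCMD_CMD"):
        problems.append("RCMD_CMD is not set")
    path = env.get("PATH", "").split(":")
    for needed in (PVM_ROOT + "/bin/" + PVM_ARCH, PVM_ROOT + "/lib"):
        if needed not in path:
            problems.append("PATH is missing " + needed)
    return problems


if __name__ == "__main__":
    for line in check_pvm_env(dict(os.environ)) or ["PVM environment looks complete"]:
        print(line)
```

Running the script as an ordinary user on each node before trying the PVM test programs should catch a missing variable or path entry early.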

The source code for the pvmpov application is available for Linux [5] and requires a couple of simple modifications to compile on FreeBSD:

1. The Makefile provided needs to be modified to reflect the system-wide installation of software rather than a personal installation. The libz and libpng libraries should be installed as packages or from the ports tree, and their locations corrected in the Makefile. Also, the TCL/TK paths will need to reflect reality on your system. I used TCL/TK 8.0 with no problems.
2. At least one source code module has an "#include <features.h>" which needs to be changed to "#include <stddef.h>".

The instructions for installing this code on a Linux system [6] are a little confusing but essentially correct. On this cluster it was installed under /usr/local/pvmpov. As per the instructions, you will also need to grab the source code for povray 3.1 [7]. To propagate pvmpov to the slave nodes after the software is built, tar the entire pvm3 and pvmpov directories and extract them on each slave system.

The process of adding a new user to such a cluster involves these steps:

  • add the user to /etc/master.passwd and run pwd_mkdb -p /etc/master.passwd (which also regenerates /etc/passwd)
  • add the new group to /etc/group
  • mkdir /home/<new user> and chown so user owns directory
  • cp /usr/local/pvmpov/povrayrc ~<new user>/.povrayrc
  • add user to /var/yp/master.passwd
  • add user to /etc/amd.home
  • cd /var/yp; make to build and push NIS maps
  • run amq -f on each node to flush automounter maps

The first three steps are the standard process of adding a new user to the FreeBSD system that is running on the master node. The .povrayrc file contains some reasonable default settings for the raytracer. The final four steps add the user to the NIS database and reload the automounter maps on the slave nodes so they can find the new user's login directory. The process could be "automated" with a perl script to make cluster user management as easy as adding a new user to a UNIX workstation.
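As a starting point for that kind of automation, here is a minimal sketch, written in Python rather than perl, that only builds the ordered list of steps and executes nothing. The function name add_user_plan, the user-name pattern, the node names, and the use of ssh to run amq -f on each node are assumptions for illustration; the file edits are left as manual steps because the entry formats are site-specific.

```python
import re


def add_user_plan(user, nodes):
    """Return the ordered steps for adding user as (kind, text) pairs.

    kind is "edit" for a manual file edit and "run" for a shell command.
    Nothing is executed; the caller decides what to do with the plan.
    """
    # Conservative user-name check (an assumption, not a FreeBSD rule).
    if not re.fullmatch(r"[a-z_][a-z0-9_-]*", user):
        raise ValueError("unexpected user name: %r" % user)
    home = "/home/" + user
    plan = [
        ("edit", "add %s to /etc/master.passwd" % user),
        # pwd_mkdb takes the master.passwd-format file as its argument;
        # -p also regenerates /etc/passwd from it.
        ("run", "pwd_mkdb -p /etc/master.passwd"),
        ("edit", "add the new group to /etc/group"),
        ("run", "mkdir %s" % home),
        ("run", "chown %s %s" % (user, home)),
        ("run", "cp /usr/local/pvmpov/povrayrc %s/.povrayrc" % home),
        ("edit", "add %s to /var/yp/master.passwd" % user),
        ("edit", "add %s to /etc/amd.home" % user),
        ("run", "cd /var/yp && make"),
    ]
    # Flush the automounter maps on every node (ssh is an assumption).
    for node in nodes:
        plan.append(("run", "ssh %s amq -f" % node))
    return plan


if __name__ == "__main__":
    for kind, text in add_user_plan("alice", ["node1", "node2"]):
        print("%-4s %s" % (kind, text))
```

Printing the plan first and running it by hand is a cautious way to start; the "run" entries could later be passed to a shell once the output has been checked on a real master node.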

Results

The following table summarizes execution times for a "standard" test file named skyvase.pov. This image contains reflections, texture mapping and anti-aliasing to provide a useful benchmark for comparing systems. The tests are run first on the master node alone, then with the slave nodes added one at a time. As a reminder, the master node has a slower CPU than the slave nodes.

number of nodes      time in secs   speed-up factor
master alone                576.8               1.0
master + 1 slave            200.8               2.9
master + 2 slaves           124.2               4.6
master + 3 slaves            93.8               6.1
master + 4 slaves            73.6               7.8

This informal test shows the performance gains that can be obtained with minimal optimizations for this type of application. The "speed-up factor" is the master-alone time divided by the cluster time for each configuration. A database of results for different types of hardware running this same test is maintained on the net [8]. The image size for the POV benchmark is 640x480 by convention. A more useful size for many applications would be 1024x768. The next table shows the effects of increasing the image size.

number of nodes      time in secs   speed-up factor
master alone               1217.4               1.0
master + 1 slave            435.4               2.8
master + 2 slaves           265.6               4.6
master + 3 slaves           193.6               6.3
master + 4 slaves           152.2               8.0

The 1024x768 version of the skyvase.pov image has 2.56 times as many pixels as the previous image. For every configuration, however, the compute time increases by only slightly more than a factor of 2 (between about 2.06 and 2.17), so run time grows somewhat more slowly than the pixel count. In addition, the full cluster shows a slight advantage on the larger job (a speed-up factor of 8.0 versus 7.8).
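The arithmetic behind the two tables can be reproduced with a few lines of Python. The timings below are copied from the tables; the variable and function names are only illustrative.

```python
# Times in seconds: master alone, then master + 1..4 slaves.
TIMES_640 = [576.8, 200.8, 124.2, 93.8, 73.6]
TIMES_1024 = [1217.4, 435.4, 265.6, 193.6, 152.2]


def speedups(times):
    """Master-alone time divided by each configuration's time, to one decimal."""
    base = times[0]
    return [round(base / t, 1) for t in times]


def pixel_ratio(big, small):
    """Ratio of pixel counts between two (width, height) image sizes."""
    return (big[0] * big[1]) / (small[0] * small[1])


def time_ratios(big_times, small_times):
    """Per-configuration ratio of large-image time to small-image time."""
    return [b / s for b, s in zip(big_times, small_times)]


if __name__ == "__main__":
    print("640x480 speed-ups: ", speedups(TIMES_640))
    print("1024x768 speed-ups:", speedups(TIMES_1024))
    print("pixel ratio:", pixel_ratio((1024, 768), (640, 480)))
    print("time ratios:", [round(r, 2) for r in time_ratios(TIMES_1024, TIMES_640)])
```

The same functions can be reused with timings from another cluster to see whether its scaling behaves similarly.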

Enhancements

The most obvious enhancement to such a cluster would be to add more or faster nodes. As can be seen from the benchmark figures above, each node added decreases the time to completion, but not by the same factor. This means that at some point adding another node will not decrease the time to completion by a noticeable amount. This point of limited returns is related to the amount of work each node is required to do as compared to the amount of communication with the master involved in assigning the work. The most efficient size for the cluster will depend on the task, the speed of the cluster network, and the speed of the individual systems. The cluster nodes do not have to be the same speed or architecture, but there is increased installation and maintenance overhead if the architecture varies.
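The diminishing return is visible in the 640x480 timings themselves. The short Python sketch below computes how many seconds each additional slave saved; the timings come from the first results table, and the function name marginal_savings is hypothetical.

```python
# 640x480 times in seconds: master alone, then master + 1..4 slaves.
TIMES = [576.8, 200.8, 124.2, 93.8, 73.6]


def marginal_savings(times):
    """Seconds saved by each added node compared with the previous configuration."""
    return [round(prev - cur, 1) for prev, cur in zip(times, times[1:])]


if __name__ == "__main__":
    for n, saved in enumerate(marginal_savings(TIMES), start=1):
        print("adding slave %d saved %.1f seconds" % (n, saved))
```

The first slave saves 376.0 seconds, while the fourth saves only 20.2, so on this benchmark each additional node buys noticeably less than the one before it.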

Future investigations

The goal of documenting this implementation is to encourage investigation of cluster computing on campus. The use of older hardware and freely available software can lead to a relatively inexpensive computation engine capable of significant work in some application areas. Two additional cluster configurations will be investigated as part of this project. The first involves experimenting with process migration from one node to the next using MOSIX [9]. This has possible implications for scalable configurations with large numbers of interactive users. The second is the Parallel Virtual File System [10], which uses cluster node disk space to "stripe" the file system across nodes. This has possible implications for applications requiring large amounts of disk storage. Experiences with both of these configurations will be described in papers made available to the community [11].

The translation was initiated by Duane Gustavus on 1999-12-07

Endnotes

[1] http://beowulf.gsfc.nasa.gov/
[2] http://www.freebsd.org/
[3] http://www.epm.ornl.gov/pvm/pvm_home.html
[4] http://www.luga.de/~flierl/pvmpov/
[5] http://www.luga.de/~flierl/pvmpov/
[6] http://www.cris.com/~rjbono/html/pondermatic.html
[7] make extract in the ports area for povray31 will fetch the files and place them in /usr/ports/distfiles/povray31.
[8] http://www.haveland.com/povbench/
[9] http://www.mosix.org/
[10] http://www.parl.clemson.edu/pvfs/
[11] This document was prepared using the LyX graphical front-end to the TeX text processing system.