A cluster system controls a group of compute nodes which are assigned tasks by a batch scheduler. The scheduler selects work for the compute nodes from a series of batch queues. These queues manage job requests (shell scripts, generally referred to as jobs) submitted by users. In other words, to get your computations done by the cluster, you must submit a job request to a specific batch queue. The scheduler will assign your job to a compute node in the order determined by the policy on that queue and the availability of an idle compute node.

Why batch queues?

Batch processing from job queues may seem an unusual approach in an age of personal computers. The assumption is that the work to be done is computationally significant (i.e. it will require hours, days or even several weeks of execution time). In that event, it may well be more efficient for you to "farm out" the work instead of running it in the background on your personal computer. Almost inevitably, this implies sharing a resource with other users, and the immediate question is how.

The idea behind batch queues is that each queue is available only to a subset of users. The group of users assigned to a specific queue is usually determined by department affiliation, or perhaps application type. This allows different groups to decide what constraints are reasonable for jobs submitted through the group's queue. Those constraints set the policy for each queue, and are arrived at through discussion with the various members of that group. Once policy is set, all jobs submitted to that queue are automatically subject to the agreed upon constraints.

Job Queues:

In order to submit a job to a queue for processing, you must first log in to the batch processor system. The only form of remote access supported by cluster controllers is secure shell (ssh, slogin, etc.). Next you need to know which queues will accept work from you. This is determined by your group membership.
If you type the "id" command, your identification on the system will be printed:

    duane@tbX [40] id
    uid=102(duane) gid=102(duane) groups=102(duane),11(acsop),1131(chem)

In this example, user duane is a member of groups duane, acsop and chem. The queue names conform to the pattern - as in chem-small, phys-med or math-big. An attempt to submit work to a queue that does not service any of your groups will be rewarded with an "Unauthorized access" error message.

In order to match your group memberships with work queues, you still need to know the names of all available queues. This can be found with the command "qstat -q":

    duane@tbX [86] qstat -q

    server: tbX.acs.unt.edu

    Queue            Memory CPU Time Walltime Node Run Que Lm State
    ---------------- ------ -------- -------- ---- --- --- -- -----
    chem-small         --      --       --       1   0   0  6  D S
    chem-med           --      --       --       1   0   0  6  D S
    chem-big           --      --       --       1   0   0  6  E R
    akw                --      --       --       1   0   0  8  D S
    phys-small         --      --       --       1   0   0  2  D S
    phys-med           --      --       --       1   0   0  2  D S
    phys-big           --      --       --       1   0   0  1  E R
    eesat-small        --      --       --       1   0   0  4  D S
    eesat-med          --      --       --       1   0   0  2  D S
    eesat-big          --      --       --       1   0   0  2  E R
    math-small         --      --       --       1   0   0  2  D S
    math-med           --      --       --       1   0   0  2  D S
    math-big           --      --       --       1   0   0  2  E R
                                                   --- ---
                                                     0   0

The tabular information displayed in the qstat output is explained below:

Queue - the queue name

Memory - the maximum amount of memory a job in the queue may request. This value is generally unlimited since the memory is not shared between compute nodes, allowing one job to use all available memory. This does not imply the node is guaranteed to have enough memory for any job, but rather that a job will not be denied to a queue on the basis of a request for memory size. The job may fail when executed if the compute node does not have enough memory to run it.

CPU Time - the maximum amount of cpu time a job in the queue may request. A job requesting more cpu time will not be accepted. A job using more cpu time than the queue limit will be terminated when that limit is reached.
Terminated jobs may not be restarted. This value may be unlimited, but such a queue will be vulnerable to jobs that never finish execution. This value is generally unlimited on a controller.

Walltime - the maximum amount of wall time a job in the queue may request. This is treated the same as the CPU Time limit, except that it measures elapsed time.

Node - the maximum number of nodes a job in the queue may request. Currently each job may request no more than one cpu node because parallel execution is not supported, so this value is not useful.

Run - the number of jobs in the queue in the running state.

Que - the number of jobs in the queue in the queued state.

Lm - the maximum number (limit) of jobs that may be run from the queue concurrently. This is the limit on the number of compute nodes any particular queue may employ.

State - the state of the queue, given by a pair of letters:
    - either the letter E if the queue is Enabled or D if Disabled
    - the letter R if the queue is Running or S if Stopped, or Q when a job is queued waiting to be assigned to a compute node.

In the example output above, all the *-big queues are enabled and running while the rest of the queues are disabled and stopped.

Submitting work:

You should now have a pretty good idea which queues are intended for your work (if not, please ask!). In order to submit work to a queue, it must be encapsulated in a shell script. If you are familiar with UNIX, this should be pretty simple, but even if you are not, it is not a major obstacle. A shell script is a plain text file that contains commands you would normally enter from the keyboard in order to do your work. As a simple example, consider a file that contains the following commands:

    date
    hostname

Each of these is a UNIX command; they output the time and the name of the host, respectively.
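As a minimal sketch, the two-line script above could be created and run on the local host as follows. The file name qtest is the one used in this walkthrough; the scratch directory from mktemp is an assumption, used only so the sketch does not touch your real files.

```shell
#!/bin/sh
# Sketch: create the two-line script and run it locally (not on the cluster).
# Assumption: a throwaway directory from mktemp stands in for your working directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the two commands into a file named qtest.
cat > qtest <<'EOF'
date
hostname
EOF

# Mark the file executable, then run it on the local host and show its output.
chmod 755 qtest
local_output=$(./qtest)
echo "$local_output"
```

Run this way, the script reports the name of the host you are logged in to, since nothing has been handed to the batch system yet.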
If these two lines are edited into a file named qtest, you can execute the commands by setting the permissions on the qtest file to indicate it is an executable file (chmod 755 qtest). Now you can test the script by entering its name from the keyboard (./qtest). The output will print a time stamp and the name of the host. The qtest script executed in this fashion did not run on the cluster (it ran on the local host). The following command, however, would submit the script as a job to the batch system for execution on the cluster using the math-small queue:

    duane@tbX [47] qsub -q math-small qtest
    64.tbX.acs.unt.edu     <-- this is the response to command above

The immediate response will be the identification of the job to the batch system. The identifier is formed from a sequence number and the hostname where the job was submitted. The batch system will probably queue, schedule and execute this job before you poke around enough to watch, leaving the output (what appeared on your display when you executed qtest directly) in two files named qtest.oXXX and qtest.eXXX, where the X's are the numerals shown in the response to the qsub command (i.e. qtest.o64 and qtest.e64). The qtest.oXXX file contains standard output, and the qtest.eXXX file contains the output to standard error (error messages generally appear here). If you examine the standard output file, you should find the same command output as above, except that the output of the hostname command is nodeXX, where XX can be between 0 and the highest compute node number. This is an indication that your job was scheduled and run on that compute node rather than on the controller.

In practice, your job scripts will generally be a little more complex because there are batch directives that are useful for given applications. We are maintaining a collection of useful templates for job scripts in the same directory where you found this file (/usr/share/).
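The naming of the two output files described above can be sketched in shell. This is only an illustration: the function name output_names is hypothetical, and it assumes the job identifier always has the form sequence number, dot, hostname, as in the qsub response shown earlier.

```shell
#!/bin/sh
# Sketch: derive the output file names from the identifier printed by qsub.
# Assumption: the identifier looks like <sequence>.<hostname>.
output_names() {
    script=$1             # name of the submitted script, e.g. qtest
    jobid=$2              # identifier printed by qsub, e.g. 64.tbX.acs.unt.edu
    seqnum=${jobid%%.*}   # keep only the text before the first dot
    echo "$script.o$seqnum $script.e$seqnum"
}

output_names qtest 64.tbX.acs.unt.edu   # prints "qtest.o64 qtest.e64"
```

The first name is the standard output file and the second is the standard error file.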
These job templates will start with lines that have '#' as the first character. These lines are comments and PBS directives, and should not be modified until you understand what they are for. Following them will be lines which do not begin with a '#' character, and these are the lines that actually execute your application. You will want to modify these lines to suit your naming conventions and application. As an example, this is a script for running a Gaussian 98 job on the cluster:

    #!/bin/bash
    #PBS -q chem-med
    #PBS -mabe
    #PBS -W group_list=chem
    g98 $HOME/g98/f1.1input $HOME/g98/f1.1output

This template script executes the command g98 followed by a path to the input script required by the application and a path to the file that should contain the output from the g98 run. The batch directives (lines that start with #PBS) identify the queue for submission (chem-med in this example), request that mail be sent on the events abort, begin and end, and state that you are a member of the group chem, so you will be allowed to submit the work to the chem-med queue. To use this script template, you would make a personal copy, edit the command line to make it pertain to your work, and submit the job with the command "qsub ". This example assumes, of course, that you are a member of group chem, etc.

Template job scripts will be accumulated in the /usr/share/ directory with .job extensions, so you may search there for ideas on where to start with job scripts. Most of the scripts will be a little more complex than this example to improve performance; feel free to ask (email) about features in the job scripts you do not understand. Once you have a working job script, it is generally trivial to modify it for different runs.

Performance:

The execution environment for a batch job will have your login directory available for reading input data and writing output data. This is made possible by allowing the compute node to mount your controller login directory over the network.
You should be aware that network writes are much slower than local writes to disk. If your job builds temporary files or writes often to checkpoint files, you should arrange for those files to be written under the /export directory (indeed it is often most efficient to copy all your work files to /export before the run and then back to $HOME after execution is complete). Each compute node has local disk space available at /export which you can use for the duration of your job (the amount of scratch space varies between nodes, but you should be concerned if your run requires more than 15G of space in /export). Writing temp files there will often decrease the execution time required for your job. Please be aware, however, that these temporary files will be removed from the node when the job terminates, and are not directly accessible to you while the program executes (in other words, this is not the same as writing files to :/export which you cannot do).

Clusters and Controllers:

This facility is intended to make available generic compute nodes for your use. The design is scalable (meaning it is relatively simple to add more compute nodes to the system) and inexpensive (meaning the per-node cost is fairly low). The batch server is intended to serve as a staging area and job scheduler. Please do not run compute intensive jobs on the controller; rather, submit them to a batch queue. Please do not use the controller as your personal workstation or development platform. Any jobs run directly on the controller instead of through the batch system are subject to immediate termination without prior notification.

Constraints on services:

The disk space on a cluster controller is intended as a staging area for work to be done, and a collection area for results. It is not intended to provide archive services. In other words, you will need to provide your own long-term storage facilities and move results off of the controller to that storage on a regular basis.
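The copy-in, run, copy-back pattern recommended in the Performance section above (work files to /export before the run, results back to $HOME afterwards) might look like the following sketch. The function name run_staged, the per-job subdirectory and the demo directories are all hypothetical; inside a real job script the scratch root would be /export and the home directory would be somewhere under $HOME.

```shell
#!/bin/sh
# Sketch of the stage-in / run / stage-out pattern.
# Assumptions: run_staged and the job.XXXXXX subdirectory are illustrative
# names; on a compute node the scratch root would be /export.
run_staged() {
    home_dir=$1       # directory holding your work files
    scratch_root=$2   # node-local scratch area
    shift 2           # the remaining arguments are the command to run

    # Create a private working directory on the local disk.
    scratch=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1

    # Stage in: copy the work files to the local disk.
    cp -R "$home_dir/." "$scratch/" || return 1

    # Run the command from inside the scratch directory.
    ( cd "$scratch" && "$@" )
    status=$?

    # Stage out: copy everything back before the job ends, then clean up.
    cp -R "$scratch/." "$home_dir/"
    rm -rf "$scratch"
    return $status
}

# Demo with throwaway directories standing in for $HOME and /export.
demo_home=$(mktemp -d)
demo_scratch=$(mktemp -d)
echo "input data" > "$demo_home/input.txt"
run_staged "$demo_home" "$demo_scratch" sh -c 'tr a-z A-Z < input.txt > output.txt'
cat "$demo_home/output.txt"   # prints "INPUT DATA"
```

Copying the results back inside the job itself matters because the files under /export are removed from the node when the job terminates.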
Please read the following carefully:

There is absolutely NO backup of your disk space on the controller or of the cluster nodes. In case of a system error or, heaven forbid, a user error, there is no possibility of recovering lost data. Please do not leave your only copy of valuable data on the controller!

The reason for this constraint is two-fold. First, this is a compute facility and would be organized differently for archive purposes. Second, it would be very expensive to back up the amount of data expected to pass through a controller. The topic of a cluster file system designed for parallel I/O with terabyte scale storage is under investigation for a future project. Your input is solicited, but please understand that the controller as currently configured should not be expected to perform this function. You must develop your own plans for storage of the data generated on the cluster or accept the low but nonetheless real risk of losing data. You have been forewarned.

We would prefer to operate the controller without using disk quotas. The most efficient usage of available space can be had if users on a particular disk (and therefore in the same group) negotiate usage with each other. This might, for instance, allow a large amount of storage to be available to one user for a limited period when others don't need the space, in exchange for the same privilege for other users at a later time. If the members of a particular group cannot or will not negotiate in good faith with others in the group, then a quota will be set for the entire group. With quotas, there is no provision for extenuating circumstances, so it is to everyone's advantage to have user-managed disk allocation. Please negotiate with your colleagues.

Questions and comments:

It is our goal to run a useful computing facility for researchers at UNT. Contemporary research activity, however, spans a spectrum of computer applications far in excess of the services supplied by this cluster.
If you have a compute intensive job (as opposed to, say, an I/O intensive one like a database or network server), we would be happy to discuss with you the potential of running it on this cluster. If you're not sure whether some particular application would be a good fit for the cluster, perhaps the following guidelines will help.

The classic application for this cluster will be jobs that do a lot of computation on a little data with typical runtimes in the hours to days range. If you have source code that will compile on Linux, that is best. If you have binaries built for Linux/Intel, that may well work. While we are happy to provide information to help you port an application to this environment, we do not have staff to provide you with much programming assistance.

Tasks which require user interaction at runtime will not work on the cluster. Tasks which provide services to other computers (such as web servers or file system sharing) will not be supported on this cluster. Activities which require creation/deletion of temporary accounts (i.e. class accounts) cannot be managed on this cluster. Tasks which can easily be performed on individual workstations (such as email, web surfing or reading net news) should not require a cluster.

The use of commercially licensed software on the cluster (the batch system and operating system software are free) may be viable in some cases but not in others. If the software has a site license and does not have constraints on the number of simultaneous users, then it might be possible to run it on the cluster. If the software has to be "metered" by license managers, it is not a candidate for the cluster. In any case, the purchase of the software and management of the licenses is not a service provided by ACS. Please discuss your project with us before purchasing software intended for use on this cluster, to avoid misunderstandings and wasted time and money.

Please send questions or comments to duane@unt.edu.
I will be happy to meet with you to discuss your application, and how it might be run effectively on this facility. Even if your application is not, for some reason, a good fit for this particular cluster implementation, it will be useful to know your requirements when planning for future projects.

Duane Gustavus
ACS UNIX Consultant
duane@unt.edu