
WWW@UNT.EDU

Recommendation for Dynamic Web Content Hosting Environment at UNT

By Shannon Eric Peevey, UNT Central Web Support

As completely static Web content slowly disappears from the World Wide Web, more and more Websites are relying on various dynamic content engines to generate the pages that we see. (For example, the University of North Texas uses Macromedia ColdFusion, PHP, and Zope to generate many of its pages.) This means that our Web hosting environment is no longer a simple Web server with a location for content manipulation, which in turn means that our environments are becoming much more complex. We now have five or six points of failure instead of one. Obviously, as a setup becomes more complex, we have all the more reason to implement some sort of mechanism to make sure that when one of the components of our Web hosting environment fails, our end users' experience is not interrupted. In an attempt to implement these mechanisms, it has been my privilege to explore many of the options that are available to us. In the rest of this article, I will discuss the component technologies currently in production at UNT and the mechanisms available to improve quality of service for end users, and then evaluate and recommend among those mechanisms based on testing and reviews.

Background

It has been my pleasure as a system administrator to work with many of our great Web developers during my time at UNT. Some of them are full-time staff, like myself, but there are also a plethora of student workers, graduate assistants, and outside contractors. This variety has been a pleasure, as every one of these people brings something unique to the table with regard to technology, technique, and design. Some are geared more towards design, some more towards technology, but all are excited about doing something unique to place their stamp on the university's Web presence. In response to these diverse needs, it has been necessary to offer a variety of technologies to meet the majority of our developers' needs. Therefore, we currently offer the following technologies:

1. Apache Web server

2. Macromedia ColdFusion MX

3. PHP

4. Zope server (languages: Python, DTML, TAL)

5. MySQL database server

6. PostgreSQL database server

7. Microsoft SQL Server (a derivative of the Sybase database server)

We have seen a great deal of user satisfaction with the technologies that we offer, and our developer base has grown as developers begin to integrate dynamic content into their static sites. This has been an exciting time for us, as the dynamic content server has grown from 100,000 hits a month when I started to over 2,000,000 hits a month at the present time. But this has also placed quite a bit of load on each of the component technologies, as we have moved from a dual 300 MHz machine with 256 MB of RAM, to a dual 1 GHz machine with 4 GB of RAM, to five machines running one or two component tools apiece. (All in a little over two years!) With this complexity has come failure. Most of the time these failures can be traced to either the Macromedia ColdFusion server or the Microsoft SQL Servers, but nevertheless, the downtime can be catastrophic to our quality of service. Therefore, it has been decided that it is time to take the next step and put some sort of high availability mechanism into place. It has been my group's job to explore these mechanisms.

High Availability

High Availability is infrastructure design that minimizes downtime for a system. In other words, it is preventative design that, in our case, allows an online system to automatically fall back on another machine running exactly the same services when a component fails. In simple terms, it is just duplication: we run multiple machines with identical copies of a component technology, with identical configurations, so that a duplicate can be brought online in the event of a failure. Our team understood this.

What we didn't understand was the correct technology for this type of high availability. We had heard of clustering, load-balancing, and every other buzzword spoken on this topic, but we didn't understand what they were. In short, clustering is the sharing of processing power and/or load over multiple machines that act as one, usually coordinated by software on the machines themselves. Load-balancing is the sharing of load over multiple identical machines by an outside tool, such as a load-balancing router or proxy server.

Doesn't seem like much of a difference, does it? In the next section, we will begin to explore the complexity of our undertaking, and how these tools are going to be used in conjunction with our component tools to bring high availability to our dynamic hosting environment.

Complexity

Our findings led us to the conclusion that our environment is too complex for a single high availability solution of only clustering or only load-balancing.

First of all, we have to look at the technology of the component tools and how they retrieve data from their sources. A Web server is a daemon, or service, that sits and listens for incoming calls from clients on ports 80 and 443. It then reacts to these calls in a variety of ways; for our purposes here, we are only interested in requests for HTML pages.

The server has been configured to know the location of these files, so it makes an input/output call for the file on the storage device and then returns the page to the client. These calls are fairly simple (and relatively slow), so it is possible to place all of these files on a central server and mount that location on the Web server machine. (This makes the file location appear to be local to the Web server, though it is actually reached through remote calls across a distributed file system.)

Thousands, even tens of thousands, of Web servers can all mount this remote directory, and it is therefore possible for them to access the contents of the files concurrently (depending on how the distributed file system deals with file locking). In this case, you would place a tool, often a router, in front of the machines, and the router would send requests to the servers randomly or through some sort of load-checking algorithm.
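
To make that dispatching idea concrete, here is a minimal Python sketch of the two selection strategies just mentioned: simple rotation through the servers and a basic load check. The host names are hypothetical, and this only illustrates the algorithm, not what a load-balancing router actually does internally.

```python
# A toy dispatcher, not a real load-balancer. It chooses which of
# several identical back-end Web servers should receive the next
# request, either by simple rotation or by picking the least-loaded
# host. The host names below are hypothetical.
import itertools

BACKENDS = ["web1.example.edu", "web2.example.edu", "web3.example.edu"]

_rotation = itertools.cycle(BACKENDS)

def pick_backend(current_load=None):
    """Return the back-end host that should serve the next request.

    If current_load (a dict of host -> open connections) is supplied,
    choose the least-loaded host; otherwise rotate through the list.
    """
    if current_load:
        return min(BACKENDS, key=lambda host: current_load.get(host, 0))
    return next(_rotation)

# Three requests dispatched by rotation, then one by load:
for _ in range(3):
    print(pick_backend())
print(pick_backend({"web1.example.edu": 12,
                    "web2.example.edu": 3,
                    "web3.example.edu": 9}))
```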

The key factor here is that distributed file systems are "slow", or at least slower than calls to the local file system. Therefore, it is OK to run many cheaper machines attached to a central storage unit, with a load-balancer in front of them taking care of the distribution of load. But in the case of a database server, the server needs to access and manipulate rapidly changing data, which needs to be correct all of the time. (If you are familiar with relational, or even object, databases, then you know that one mistake in the execution of a command can leave your data and your database corrupted and basically unusable.)

Also, the ratio of database transactions to Web page calls is weighted quite heavily towards the database server. (For example, one call to a popular Web page on one of our machines makes over 40 calls to the database server. This page is called about 200,000 times a month, for around 8,000,000 calls to the database per month, for one page.) Therefore, this "slow" distributed file system is going to become a bottleneck for our hosting environment. There might also be problems with calls getting out of sync with each other because of the distributed file system, again creating a potential situation for data corruption.

How do we create a fail-over environment for databases, then? Clustering is the mechanism used most often in this situation. The clustering software is usually integrated into the database server itself and is given control of the transactions as they are sent to the data repository. It also controls the redundancy of the data, because most of the data is duplicated across multiple machines, and it even controls how that data is updated on each machine, usually through a master/slave arrangement.
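
The master/slave idea can be sketched in a few lines of Python. This is a toy illustration of the concept only, not any particular database's replication protocol: the master appends every write to a log, and each slave remembers how far it has read and asks only for the statements it has not yet applied.

```python
# A toy sketch of master/slave replication -- not any real database's
# protocol. The master appends writes to an ordered log; each slave
# tracks its own position in that log and replays only what is new.

class Master:
    def __init__(self):
        self.log = []                      # ordered transaction log

    def execute(self, statement):
        self.log.append(statement)         # record the write

    def transactions_since(self, position):
        return self.log[position:]         # what a slave still needs


class Slave:
    def __init__(self, master):
        self.master = master
        self.position = 0                  # last log entry applied
        self.applied = []

    def catch_up(self):
        for stmt in self.master.transactions_since(self.position):
            self.applied.append(stmt)      # replay the write locally
            self.position += 1


master = Master()
slave = Slave(master)
master.execute("INSERT INTO hits VALUES (1)")
master.execute("UPDATE hits SET count = 2")
slave.catch_up()
print(slave.applied)    # both statements have been replayed on the slave
```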

High Availability implementation

Put simply, we need hardware and/or software that will automate fail-over and load-balancing for us in our dynamic Web hosting environment.  As mentioned in the last section, the complexity of our setup forces us to use a mixture of high availability technologies.  In this section, we are going to discuss this mixture in detail.

As you may have gathered from the technology list in the Background section and the information delivered in the previous section, we will:

1. implement load-balancing on all Web servers.

2. implement clustering on all database servers (where available).

Let's break it down in detail:

1. Apache - is a Web server, so it will be load-balanced. This means that we will be running multiple machines (three) with identical configurations and a distributed file system, probably NFS (Network File System).

2. Macromedia ColdFusion MX - this is an application server, which hasn't been discussed yet. (It is really a Java application server, so most of the rules that apply to other Java engines, like IBM WebSphere, apply to Macromedia ColdFusion MX.) This is a server much like the Web server, in that it accepts calls from the Web server and, in turn, calls other tools to perform content manipulation. It is different in that the ColdFusion MX server needs to access rapidly changing data, in the form of session variables, scoped variables, etc., much like a database. Therefore, this component will need to be clustered (see the sketch after this list).

3. PHP - is, for our purposes, an Apache module that plugs into the Web server. It acts like the Macromedia ColdFusion server, but is limited by the fact that it runs as part of Apache. Therefore, it will be load-balanced, because the Apache Web server is load-balanced.

4. Zope - is an application server, so will be clustered.

5. MySQL database server - should be clustered, but does not yet have true clustering support. At this time, there is only a replication service, in which the slaves poll the master for updates. (MySQL Manual Section 4.10.3)

6. PostgreSQL database server - should be clustered, but does not have true clustering support.  As with MySQL, replication is the only option at this time.  (Macdonald)

7. Microsoft SQL Server - is a database server, so will be clustered. 
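
Why do the application servers above need clustering rather than plain load-balancing? The following toy Python sketch (hypothetical code, not ColdFusion MX or Zope) shows what goes wrong when session data lives only in one server's memory, and why clustered, shared session state fixes it.

```python
# A toy illustration of why servers that hold session state cannot
# simply be load-balanced the way stateless Web servers can.
# Hypothetical code, not the ColdFusion MX or Zope implementation.

# Each application server keeps sessions in its own memory:
server_a_sessions = {}
server_b_sessions = {}

# A user signs in and the load-balancer happens to send them to server A.
server_a_sessions["user42"] = {"authenticated": True, "cart": ["item1"]}

# Their next request is dispatched to server B, which has never heard
# of the session -- the user appears to be logged out.
print(server_b_sessions.get("user42"))   # None

# Clustering replicates (or centrally stores) session state so that
# every node in the cluster sees the same data:
shared_sessions = {}
shared_sessions["user42"] = {"authenticated": True, "cart": ["item1"]}
print(shared_sessions.get("user42"))     # visible from any node
```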

Recommendations

At this time, we would like to make our recommendations for creating a highly available Web hosting environment here at the University of North Texas.

First, we will list the high availability options that have already been made for us. These are dictated by the component tools themselves, so there is no choice to make. They are:

  1. Macromedia ColdFusion MX – ColdFusion MX must be clustered using the Macromedia JRun server. Macromedia ColdFusion MX is basically a tag library for Java now, so we set the server up on top of the Macromedia JRun Java engine. Macromedia JRun itself has clustering capabilities which allow for load-balancing and fail-over for each server. (Macromedia, Inc.)
     
  2. Zope – Zope is unique in that “all three pieces of a three-tier architecture (presentation, logic, and data) can be managed in one object-based facility.” (Zope Corporation ZEOFAQ)  Therefore, ZEO, or Zope Enterprise Objects, allows “all three tiers to be scaled and distributed in one facility.” (Zope Corporation ZEOFAQ)  By placing ZEO, which is basically a single object, into the flow of data, we allow clustering from any instance of Zope in the world.
     
  3. MySQL database server – MySQL “supports one-way replication internally. One server acts as the master, while one or more other servers act as slaves.” (MySQL Manual 4.10.1)  Basically, the master keeps track of all transactions in a logfile, and the slave contacts the master and asks for any transactions that have taken place since their last conversation. As mentioned before, this is not really clustering, but until true clustering is implemented (they say sometime next year), this will have to work.
     
  4. PostgreSQL database server – eRServer is the replication tool that is available for PostgreSQL. It was released under the BSD license in August 2003 and takes the place of the older rserv utilities. “It is a trigger-based single-master/multi-slave asynchronous replication system.” (GBorg development team)  This works essentially like MySQL replication, but is driven by triggers. Triggers are actions in the database that are executed when a particular sequence of events takes place. (A toy sketch of the trigger idea follows this list.)
     
  5. Microsoft SQL Server – This proprietary database is, like most of Microsoft’s products, married to the Microsoft Windows operating system. Therefore, we are limited to using Microsoft's clustering and database tools. Our recommendation is to use Microsoft SQL Server 2000 (there isn’t a true 2003 version yet) on the Microsoft Windows Server 2003 operating system. The reason for Windows Server 2003 is that server clustering comes as part of the operating system, so you do not have to buy any additional products to cluster the OS. Microsoft SQL Server has replication features that work very well with the operating system clustering, so there is really no choice here.
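
As promised in item 4, here is a toy Python sketch of the trigger idea behind eRServer-style replication. It is hypothetical code, not the eRServer implementation: a trigger is simply a callback the database fires after a write, and here the trigger copies each change into a queue that an asynchronous slave drains later.

```python
# A toy illustration of trigger-based replication -- hypothetical code,
# not the eRServer implementation. The "trigger" is a callback fired
# after each insert; it records the change in a queue that a slave
# applies to its own copy of the table later.

class Table:
    def __init__(self):
        self.rows = []
        self.triggers = []                  # callbacks fired on insert

    def insert(self, row):
        self.rows.append(row)
        for trigger in self.triggers:       # fire AFTER INSERT triggers
            trigger(row)


replication_queue = []                      # changes waiting for the slave

def replicate_on_insert(row):
    replication_queue.append(("INSERT", row))

master_table = Table()
master_table.triggers.append(replicate_on_insert)
master_table.insert({"page": "/", "hits": 1})

# The asynchronous slave periodically drains the queue and applies the
# recorded changes to its own copy of the table.
slave_table = Table()
while replication_queue:
    _action, row = replication_queue.pop(0)
    slave_table.insert(row)

print(slave_table.rows)                     # the slave now matches the master
```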

Finally, the issue of Web server high availability arises. We are basically locked into clustering or load-balancing solutions on every component except the Web servers. Our group has found that there are essentially two options available for load-balancing our Web servers: we can use the load-balancing routers that are already available from the Unix Services team, or we can create our own highly available Web infrastructure using Linux-HA.

The Linux-HA option

First, let us discuss the Linux-HA option for a moment. The goal of the High Availability Linux project is to “Provide a high-availability (clustering) solution for Linux which promotes reliability, availability, and serviceability (RAS) through a community development effort.” (High-Availability Linux Project)  This project uses a group of Unix tools to accomplish this, including Heartbeat, which pulses a machine to see if it is live, and Fake, which takes over the IP address of a failed machine. These tools basically allow machines to monitor each other and then take each other's place in the case of failure. On each machine, you will need duplicate component tools and the Linux-HA tools configured to take over the place of the live server. Oftentimes there is also a front-end machine that acts as a load balancer, using various mechanisms to redirect HTTP calls to the back-end machines.
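
The monitoring idea is simple enough to sketch in Python. This is hypothetical code, not the Heartbeat implementation: one node pulses its peer at regular intervals, and after enough consecutive missed pulses it declares the peer dead, which is the point where a tool like Fake would bring up the failed node's IP address. The peer address and thresholds below are made up, and the ping flags are the Linux ones.

```python
# A toy liveness monitor -- hypothetical code, not Linux-HA's Heartbeat.
# One node "pulses" its peer with a ping; after several consecutive
# missed pulses it declares the peer dead. A real cluster would then
# take over the peer's IP address (what Fake does) and its services.
import subprocess
import time

PEER_ADDRESS = "192.168.1.10"   # hypothetical address of the peer node
MAX_MISSES = 3                  # missed pulses before declaring failure
PULSE_INTERVAL = 1              # seconds between pulses

def peer_is_alive(address):
    """One pulse: a single ICMP ping with a one-second timeout (Linux flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def monitor(address):
    misses = 0
    while misses < MAX_MISSES:
        misses = 0 if peer_is_alive(address) else misses + 1
        time.sleep(PULSE_INTERVAL)
    # This is where the standby node would bring up the failed node's
    # IP address and start the duplicate services.
    print(f"{address} missed {MAX_MISSES} pulses in a row -- taking over")

if __name__ == "__main__":
    monitor(PEER_ADDRESS)
```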

Linux-HA is used at a large number of production sites. The Weather Channel reports, “From my experience, I was extremely impressed with the ease of installation/compilation on linux, and the stability of the cluster. This cluster has been running for approximately eight months (with forced manual failovers for updates and maintenance), and heartbeat has been running solid with virtually no interruptions in service.” (Joe Henggeler)  Motorola says, “All machines are reclaimed - obsoleted from the desktop when we moved to Win2K. We moved to Linux for this service because SMB on Solaris is flaky as hell (our experience) and on HP-UX is slow slow slow.” (Damian Ohara)

Load-balancing routers

Second, we can take advantage of the load-balancing routers that are currently being used by many of the other servers here at UNT. These are Radware Application Switch II load-balancers that are configured for gigabit-speed IP applications. The Unix Services team is responsible for them, meaning that we wouldn’t have to take the time to configure and maintain the hardware, and they are optimized to work efficiently with “asymmetric traffic characteristics… can obtain throughput of 15 to 20 Gbps”. (Radware)  They also have health monitoring and traffic redirection built in, so we do not need to program these capabilities or install software to monitor the health of the machines.

Basically, the Radware Application Switch II has everything that we need for load-balancing our Web servers. Plus, we do not have to administer these machines, the setup time is limited to the time it takes for the Unix Services team to get it set up for us, and the university has been using them successfully for over two years (load-balancing our www farm, which gets over 36 million hits a month). They have been very dependable, with only an occasional hiccup in service. (This is very rare, as there is a fail-over Application Switch II ready to take over in the event of a failure at the primary switch.) Therefore, my group recommends this solution for load-balancing our Web servers.

Bibliography

MySQL AB. “4.10.3 Replication Implementation Details”. 2003. http://www.mysql.com/doc/en/Replication_Implementation_Details.html

Macdonald, Patrick.  "Re: FW: Clustering". June 4, 2003. http://sources.redhat.com/ml/rhdb/2003-q2/msg00047.html

Macromedia, Inc.  “Features – Performance and Reliability”.  1995 – 2003. http://www.macromedia.com/software/jrun/productinfo/features/4/04_performance_reliability/index.html#03

Zope Corporation.  “ZEOFAQ”. 2003. http://zope.org/Products/ZEO.bak/ZEOFAQ

MySQL AB. “4.10.1 Introduction”. 2003. http://www.mysql.com/doc/en/Replication_Intro.html

GBorg development team. “The erserver Project”. 2000-2003. http://gborg.postgresql.org/project/erserver/projdisplay.php

High-Availability Linux Project. http://www.linux-ha.org/

Henggeler, Joe. “Heartbeat Success Stories”. http://www.linux-ha.org/heartbeat/users.html

Ohara, Damian. “Heartbeat Success Stories”. http://www.linux-ha.org/heartbeat/users.html

Radware. “Application Switch II”. 2002. http://209.218.228.203/content/products/as2/default.asp