Server Load Balancing in the Enterprise

My company previously used a low-end F5 load balancer for many of our internal web applications.  It was hugely expensive, but once we got it in house it worked very well for many critical services.  After some 5-6 years of service, F5 was no longer offering support for it, so my company re-evaluated the market.  We wanted to go with F5 again, but in the end we settled on a ServerIron due to cost.  I can’t comment much on what happened after that, as my group wasn’t in control of the box, but about three years later we’ve had nothing but problems.  Those old F5 units?  They are still handling several of our most critical applications.

The responsibility for load balancing has now been passed from the Server group to the department I work in.  After seeing the size of our budget to replace the ServerIron, we investigated the load balancing options built into our Cisco 6509.  We found that it can handle probably 90% or more of our load balancing.  The server load balancing (SLB) features built into the 6509 aren’t nearly as full-featured as even our original F5 units, but they worked well in our testing.

Everything the ServerIron and F5 are doing involves NAT.  The 6509 does support NAT, but our tests showed it to be somewhat of a resource hog, with a significant processor hit in NAT mode.  You don’t want a core network switch running at high processor utilization.  Trust me: it doesn’t end well.

We did find, however, that a mode Cisco calls dispatched mode (called direct server return, or DSR, by others) works extremely well on the 6509.  Connection setup and teardown are still handled by the processor, but once that is done, everything is hardware switched.  In this mode, the switch passes each packet unchanged to the MAC address of the server farm member it selects for a given connection.  That server replies directly back to the client, which means the server has to have the Virtual IP configured on it as well.
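For reference, a minimal IOS SLB configuration in dispatched mode looks something like the sketch below.  The farm name, VIP, and real server addresses are made-up examples; the key point is that omitting a `nat server` statement under the vserver is what leaves you in dispatched mode:

```
ip slb serverfarm WEBFARM
 real 10.1.1.10
  inservice
 real 10.1.1.11
  inservice
!
ip slb vserver WEB-VIP
 virtual 10.1.2.100 tcp 80
 serverfarm WEBFARM
 inservice
```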

That brings me to a little Pro/Con action:

Low processor utilization
Server sees the actual client IP (makes for easier troubleshooting)
Full backplane speed switching
Fewer hops involved
Passive Health Checks

Let’s run down the Pros…  We already mentioned processor utilization (on the 6509); less is better.  The server seeing the actual client IP (instead of the NATed IP presented by the F5/ServerIron) is better: aside from easier troubleshooting, the application can tell the IP of the client, which could be useful depending on the app.  Full backplane speed is REALLY fast, so you aren’t limited by the number of Gig links between the load balancer and the switch.  There are fewer hops, since the request doesn’t have to go to the F5, then to the server, then back to the F5, then finally back to the client.  And the 6509 supports passive health checks, in that it monitors connection attempts, and if it sees a certain number of them fail, it marks that server as down.
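The passive checks are configured per real server with the `faildetect` command.  A sketch, with the thresholds chosen arbitrarily (here, eight consecutive failed connections from at least two distinct clients marks the real as down):

```
ip slb serverfarm WEBFARM
 real 10.1.1.10
  faildetect numconns 8 numclients 2
  inservice
```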

That’s a pretty nice list.  So, what are we giving up?

Additional configuration must be done on each server
Application must be configured to listen on the Virtual IP
Clients must be on a different VLAN from the Servers
No port mapping
Advanced Health Checks unavailable
URL Redirection unavailable
No remarks, server names, etc.

Ok – Let’s go over it in detail.  More config on the server?  Yes!  You must add a loopback adapter and configure the Virtual IP on each member of the serverfarm.  That means working more closely with your server team than you are probably used to, but that is probably a good thing.  The app has to know?  Yes, in many cases it does.  If it’s a single-purpose server, it’s probably listening on all IPs, but if it’s running multiple web sites in IIS, or if Apache is configured similarly, you have to set each web site to listen on the Virtual IP.  The client can’t be on the same VLAN as the servers?  This one has the potential to be big, depending on how your network is designed.  For most people this isn’t a major problem, but if some of your load-balanced services (LDAP, for example) are consumed primarily by OTHER SERVERS, then this might be an issue; you might consider NAT for those services.  There’s no port mapping because we aren’t rewriting anything in the packet.  The 6509 does support some active health checks, but we chose not to use them because, again, they take a hit on the processor.  URL redirection is simply something the 6509 does not do.  Finally, there are no remarks, server names, or anything else to help you organize what you are doing.  Wait, WHAT?
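For what it’s worth, on a Linux real server the per-server piece might look like the following (the VIP is a made-up example; a Windows box accomplishes the same thing with the Microsoft Loopback Adapter).  The ARP sysctls keep the server from answering ARP for the VIP, which would otherwise conflict with the VIP living on the switch:

```
# Put the VIP on the loopback so the server accepts traffic addressed to it
ip addr add 10.1.2.100/32 dev lo

# Don't answer ARP for addresses that exist only on the loopback,
# and don't use them as a source address when ARPing
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```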

The last one stings the most in our environment.  If you are setting up a server farm, or even three, it’s not a big deal.  But when you have over 100 Virtual IPs, organization is most definitely a BIG deal; managing that many Virtual IPs and ServerFarms without something better would be a nightmare.  To make matters worse, this isn’t something management wants us to manage on a day-to-day basis.  They want the Network Operations team to do it.  Even if every member of that team is technically very capable, the simple fact that it’s such a large team, working with such meager organization built into the 6509, means it would probably end up poorly managed.

From a performance standpoint, load balancing on the 6509 should mean we never have a bottleneck at the load balancer.  It should always have more than enough bandwidth for hundreds of Virtual IPs.

Now, while we are still asking management for a small F5 to handle specialized things, the vast majority of our load balancing will be handled right through our 6509.  But I did have to do some work to make things a bit better.

1. I wrote a Rails-based WebGUI to add organization, and to limit the number of level 15 accounts needed by making it difficult for someone to seriously break the network.  The WebGUI allows Virtual Servers, ServerFarms, and Reals to be defined in a web interface.  Reals are associated with host names; on the Hosts page, you can mark a single host down or up no matter how many farms the host is a member of, even if multiple IPs are associated with the host name.  This isn’t a simple database that we put information into when we configure new things; I’ve written it to actually understand IOS configuration.  When you are ready to make a change, you simply perform an Audit of a switch.  This pulls the current switch config (show run partition common) and parses it, then reads the database and generates the commands that should be on the switch.  It compares the pulled config to the generated config and determines which commands are needed to move from the current config to the desired one.  Finally, it displays the commands it thinks are needed, and the user can look them over and hit “Apply” when they are ready.
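At its heart, the Audit step is a diff between two sets of commands.  A much-simplified Ruby sketch of the idea follows; real IOS config is hierarchical, so the actual parser has to track sub-modes, and the command strings here are invented for illustration:

```ruby
# Given the commands currently on the switch and the commands the
# database says should be there, work out what to push.  Removing a
# command in IOS means prefixing it with "no".
def audit(current_cmds, desired_cmds)
  {
    add:    desired_cmds - current_cmds,
    remove: (current_cmds - desired_cmds).map { |cmd| "no #{cmd}" }
  }
end
```

The WebGUI would then show the `:add` and `:remove` lists for human review before anything touches the switch.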

2. I added redirects for maintenance windows.  Several of our applications have maintenance windows each week or month, and the application teams would like a nice-looking “down” page to show up, instead of everyone calling the support center when the app doesn’t work.  I did this through the Rails interface I had already built, using the “backup serverfarm” feature.  Basically, when the job runs to take a serverfarm down, I read all the Reals out of the database and go into the 6509’s config to shut them down.  The backup serverfarm points to a server that I control, which has a rewrite rule applied to it, directing all incoming requests to a redirect.php file.  This file reads the Rails app’s database, looks at the DNS name the user was trying to reach, then redirects them to the matching “down URL” as defined in the database.  The tricky part here was adding the Virtual IPs to this server; a simple script handles that, reading from the Rails database to determine which VIPs the redirect server needs.
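Assuming the redirect box runs Apache with mod_rewrite, the rewrite piece can be as simple as the fragment below.  The file name matches the redirect.php mentioned above; everything else is illustrative:

```
RewriteEngine On
# Send every request (except the handler itself) to redirect.php,
# which looks up the requested hostname and issues the real redirect
RewriteCond %{REQUEST_URI} !^/redirect\.php
RewriteRule ^ /redirect.php [L]
```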

3. Stats.  Currently, we can’t see much information on the ServerIron.  I know there are a lot of stats exposed via SNMP on the 6509, so I may write a Stats page to give easy access to that information.  I’m not sure how helpful it will be, but I’d like a page that pulls real health data from the 6509 instead of having to do it by CLI, especially if I can write a nice WebGUI to drill down into the various sections.
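If I go down that road, the data would come from Cisco’s SLB MIB.  A quick, hedged example of poking at it from the command line; the community string and hostname are placeholders, and the OID below is the CISCO-SLB-MIB root:

```
snmpwalk -v2c -c COMMUNITY core-6509 1.3.6.1.4.1.9.9.161
```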

4. Advanced Health Checks.  I’m considering writing something (perhaps from scratch, perhaps integrating with another tool) to allow for more advanced health checks against the servers’ real IPs.  This would only help for applications that can listen on both the VIP and their main IP.  The plan would be to run a check against each server in the farm; if one failed, I’d know exactly which one, could mark it down in the core switch, and email the appropriate application team, then continue to monitor it and mark it back up when it recovers.  A failing server would be marked down without waiting for several actual client connections to fail, and the 6509 wouldn’t even try to send it additional traffic until it actually recovered.  I don’t know if this is worth writing, but it sounds like a good challenge, so I might just do it for that.
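A first cut at such a checker could be quite small.  This Ruby sketch probes each real server directly over HTTP and returns the ones that failed; the health-check path, port, and timeouts are all assumptions, and marking servers down in the switch or sending email would happen elsewhere:

```ruby
require 'net/http'

# Probe each real server on its own IP and return the list of servers
# that failed the check (connection error, timeout, or non-2xx response).
def failed_reals(real_ips, path: '/healthcheck', port: 80, timeout: 2)
  real_ips.reject do |ip|
    begin
      res = Net::HTTP.start(ip, port,
                            open_timeout: timeout, read_timeout: timeout) do |http|
        http.get(path)
      end
      res.is_a?(Net::HTTPSuccess)
    rescue StandardError
      false # any network error counts as a failed check
    end
  end
end
```

A cron job could call this every minute or so and act on whatever comes back.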

With my programming and some creativity, I’ve actually taken care of the bottom two Cons on my list, and if I do #4 above, that will be the bottom three Cons taken out…  Perhaps I can come up with a way to take out some more…

May 12, 2011 at 10:03 pm
