Managing thousands of network devices

September 9, 2013 at 11:54 pm

I’m a network designer, and the tools I’ve written manage 700+ networks for one company.  Each of these networks contains a Cisco router (various models), between two and six Cisco switches (mostly 2960 variants), and a CheckPoint UTM.  Altogether, that’s over 3000 switches.  We have heard hints from management that we are growing to around 1000-1100 sites in the near future.  With our network design, expanding to that many sites will be almost effortless.

My day-to-day job involves a lot of configurations and a lot of data.  We don’t use any software from Cisco or any third party to manage our configurations.  The tools that generate and manage the configurations were all written by me.  I have not yet seen a piece of COTS software that can manage router and switch configurations in a manner suitable for our business.

Generating Configurations

Each device config is generated from a template.  Most hardware models have a unique template.  These templates contain placeholders for items that are unique for each location.  Various database tables track these unique values, and my tools drop the right values in the right spot.
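To make the placeholder idea concrete, here is a minimal sketch in Python.  It is not the actual tooling (the post doesn’t say what language that is written in); the template text, the placeholder names, and the `site_row` dictionary standing in for a database row are all made up for illustration.

```python
from string import Template

# Hypothetical template fragment; the real templates run to hundreds of lines.
ROUTER_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface Loopback0\n"
    " ip address $loopback_ip 255.255.255.255\n"
    "snmp-server location $site_name\n"
)


def render_config(template: Template, site_values: dict) -> str:
    """Fill every placeholder; substitute() raises KeyError if a value is missing."""
    return template.substitute(site_values)


# Stand-in for the per-site values that would come from the database tables.
site_row = {
    "hostname": "site0042-rtr1",
    "loopback_ip": "10.0.42.1",
    "site_name": "Site 0042",
}

config = render_config(ROUTER_TEMPLATE, site_row)
print(config)
```

Using `substitute()` rather than `safe_substitute()` is deliberate in this sketch: a missing database value stops generation instead of quietly leaving a placeholder in a config that might get pushed to a device.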

In the case of the routers, these templates are almost 1000 lines of commands.  Routers have very complex configurations and the widest range of variables.  Over 125 variable substitutions occur for every router configuration.

The switch templates are a little simpler, but are still over 400 lines of commands each.  One database table tracks the admin status of each switch port, along with the speed and duplex setting.  Those settings are tracked so that if that switch config gets regenerated, the admin status, speed and duplex settings are retained in the resultant config.
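As a rough illustration of how tracked port state can survive a regeneration, the sketch below rebuilds the interface stanzas from stored rows.  The row layout (name, admin status, speed, duplex) and the function name are assumptions; the real table surely has more columns.

```python
def render_port(name: str, admin_up: bool, speed: str, duplex: str) -> str:
    """Build one interface stanza from the stored port settings."""
    lines = [
        f"interface {name}",
        f" speed {speed}",
        f" duplex {duplex}",
        " no shutdown" if admin_up else " shutdown",
    ]
    return "\n".join(lines)


# Hypothetical rows from the port-state table for one switch.
port_rows = [
    ("FastEthernet0/1", True, "auto", "auto"),
    ("FastEthernet0/2", False, "100", "full"),
]

port_section = "\n".join(render_port(*row) for row in port_rows)
print(port_section)
```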

Oh, the switch configurations can be a bit tricky, partially because we have different switch designs at different sites.  Tracking that is no problem, though, thanks again to the database.  We also name our configurations, and each location has a slot in the database to track that as well.

The UTMs are a bit different, in that they house a 3G connection.  This is backed by a Ruby on Rails web app that lets the Network Operations team pair the UTM with a 3G modem from one of two vendors, and assign a unique static IP.  This database also tracks serial numbers and phone numbers for modems and SIM cards.  Once a device is assigned to a location, a configuration is generated within a couple of minutes.  This database is versioned, and I’ve provided a web interface so the NOC can “go back in time”, so to speak, and see exactly what changes have been made to this database, and when.  This is very important because the database tracks actual hardware, and people make mistakes when faced with an easy-to-alter database.  Other tools coordinate the configuration of routers with each UTM, so that a backup WAN link can run over this 3G connection.

Managing Configurations

Notice that the title of this post contains the word Managing.  My job isn’t done at just generating complex config files that work together.  We have a team of network operations folks that handle the day-to-day care of these devices.  They need to have a high level of access to do their job, but they occasionally change things in the course of troubleshooting and don’t always put everything back.

I hate it when that happens.

So, I audit.  I have a series of tools that work hand-in-hand with the generation tools that I mentioned above.  Every day, these audit tools read the configuration of every router, switch, and UTM.  The expected configuration for each device is generated from its template and diffed against what is actually on the device.  When differences are found, they are pinpointed (down to the interface, or sub-section of the config) and emailed out to the team, highlighting the lines that are either missing (part of the template, but not in the actual config of the device) or present but not expected (extra lines in the config that don’t exist in the template).  This allows us to quickly find and clean up the human error that slips in whenever humans are involved.
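The missing/extra comparison can be sketched roughly like this in Python.  It is an illustration under simplifying assumptions, not the real audit tool: it treats any unindented line as the start of a section and ignores line order, and the function names are invented.

```python
def config_pairs(config: str) -> set:
    """Return (section, line) pairs so each difference is tied to its interface or sub-section."""
    pairs = set()
    section = ""
    for raw in config.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.strip() == "!":
            continue  # skip blank lines and IOS "!" separators
        if not line.startswith(" "):
            section = line  # an unindented line opens a new section
        pairs.add((section, line.strip()))
    return pairs


def audit(expected: str, actual: str) -> dict:
    """Compare the generated config with the one read from the device."""
    exp, act = config_pairs(expected), config_pairs(actual)
    return {
        "missing": sorted(exp - act),  # in the template, not on the device
        "extra": sorted(act - exp),    # on the device, not in the template
    }


expected = "interface Fa0/1\n switchport mode access\n spanning-tree portfast\n"
actual = "interface Fa0/1\n switchport mode access\n speed 100\n"
report = audit(expected, actual)
print(report)
```

Because each difference carries its section header, a report built from this can say which interface a stray line lives under rather than just listing the line.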

This is vitally important.

About a year and a half ago, I was tasked with incorporating our network with that of another company.  While the network team of that company had kept things running, their configurations were far from standard.  If an issue arose at one site, a band-aid was applied to work around the problem.  In many cases, problems were forgotten and a proper solution was never implemented.  Rinse.  Repeat.  This appeared to have happened for years, with various technicians implementing their “fixes” on their own.  There was a “port standard” for VLANs, but that was at least partially abandoned in most locations.  The result was that practically every site was “a one-off”, a unique non-standard configuration.  This makes standardization a nightmare.

By performing daily audits, we can catch these sorts of problems.  Network techs who come from an environment where they could change whatever they felt like are more conscientious, knowing that their actions are being monitored to ensure that the configurations stay standardized.  While only a few issues are caught each week by these audits, it’s easy to see how it keeps our network constantly snapping back to a desired state.

Remember, above, when I mentioned the switch configurations having names?  That goes hand-in-hand with the auditing tools.  A switch is audited against the switch configuration style that shows up in the database for that location.  So, if you implement a new config style in 10 locations, the audit tools will be auditing the switches at those locations against the new templates, not against the templates driving the configurations in the other locations.

Audits also serve another important purpose.  Once every couple of months or so (sometimes much more frequently), configuration templates change.  The audit tools are written such that these differences can be programmed for.  If a particular set of differences is found, the audit tool itself will actually perform the commands to get the device configured properly.  In the event that hardware is being upgraded, routers may be configured months ahead of deployment.  If configuration changes happen in the meantime (as they often do), the next audit after the device installation will bring the config up to the standard.
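One way to “program for” a known set of differences is a small rule table that maps an exact drift pattern to the commands that fix it.  The sketch below is a guess at the general shape, not the real tool; the rule contents, the `Remediation` type, and `plan_fix` are hypothetical.

```python
from typing import NamedTuple, Optional


class Remediation(NamedTuple):
    missing: frozenset  # lines the template expects but the device lacks
    extra: frozenset    # lines on the device that the template no longer has
    commands: tuple     # commands that move the device to the new standard


# Hypothetical rule: the template switched NTP servers.
RULES = [
    Remediation(
        missing=frozenset({"ntp server 10.0.0.2"}),
        extra=frozenset({"ntp server 10.0.0.1"}),
        commands=("no ntp server 10.0.0.1", "ntp server 10.0.0.2"),
    ),
]


def plan_fix(missing, extra, rules=RULES) -> Optional[list]:
    """Return fix commands only when the drift matches a rule exactly, else None."""
    for rule in rules:
        if rule.missing == frozenset(missing) and rule.extra == frozenset(extra):
            return list(rule.commands)
    return None  # unknown drift goes to a human instead


print(plan_fix({"ntp server 10.0.0.2"}, {"ntp server 10.0.0.1"}))
print(plan_fix(set(), {"speed 100"}))
```

Requiring an exact match keeps the automation conservative in this sketch: anything that doesn’t look precisely like a planned template change is left for the emailed report.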

In addition to auditing each of the remote site devices, another tool audits the various central routers that the remote sites connect into.  These routers literally contain many thousands of lines of configuration, all of which must be exactly correct in order to properly work.

Making Big Changes

I’m currently involved in a project to consolidate two networks.  Essentially, we have a pair of central routers with high-speed links connecting to them.  These routers connect to an old network that is slowly going away.  While technicians are on-site at the location, we will be implementing changes to swing these locations to the portion of the network that will remain.  We’ve attached these central routers to the new network via a Gig interface, which is part of a new VRF.  Once techs are on-site to strip the old gear out, leaving only our new gear, we can make the required configuration changes to swing them from the old network to the new network, just by running a script.  It’s actually much more complex than that, but that’s all the Operations team will have to know, as the intricacies of the changes are mostly hidden from them.  These hidden changes include not just the central routers that the circuits terminate on, but database changes, and another pair of core switches and firewalls that require route changes at cut-over time.  By using the template approach, all of the network side changes are possible without significant programming, planning, scheduling, or implementation effort.

Data, Data, and Even More Data

Managing these networks doesn’t just mean managing their configurations.  In addition, there’s lots of data collection that goes on.

Every router, switch, and UTM has various information polled each day.  The device model and firmware version are pulled from each, along with other hardware-specific data.  This ensures that, for example, if a device experiences a hardware failure and gets replaced with a piece of hardware running another version of IOS, it gets noticed reasonably soon so that it can be corrected.  In the case of the UTMs, this daily data pull includes the firewall policy that is active on the UTM.  The CheckPoint management server occasionally has issues where not all devices get updated, and a simple sortable HTML table showing this data lets us easily see which devices haven’t been updated to the latest policy yet.  A simple HTML table for the routers and switches gives totals of what model of hardware is running which version of IOS, as well as how many of each model are in the field.

The above paragraph just hits the easy stuff.  In addition, each day the entire MAC address table is pulled from every switch, along with the ARP table from the corresponding routers.  By cross referencing the MAC addresses (associated with the ports) with the router’s ARP table, we can tell which IP device is attached to which physical switch port.  CDP info is also pulled.  The result is a web page where you can enter a site ID and get a chart of all switches at that site, the state of each switch port (including how many days since the last status change), the cross connections between switches (and other CDP capable devices), and every IP device attached to the switch, right down to the port.  (Devices that don’t communicate across the router much may not show up in the ARP table, so a couple of devices might be missing, but this catches 95% or more of them.)   Various people across different areas of I.T. use this data daily to help them quickly locate and troubleshoot equipment.  This info is also very valuable if you are trying to move toward a standard layout of equipment to specific ports on specific switches.
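The cross reference itself is essentially a join on MAC address.  Here is a minimal Python sketch, assuming the tables have already been collected and parsed into simple tuples; the tuple layouts and function names are illustrative only.

```python
def normalize_mac(mac: str) -> str:
    """Reduce any MAC notation (0011.2233.4455, 00:11:22:33:44:55) to 12 hex digits."""
    return "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")


def locate_devices(mac_rows, arp_rows):
    """Join switch MAC-table rows to router ARP rows on the MAC address.

    mac_rows: iterable of (switch, port, mac)
    arp_rows: iterable of (ip, mac)
    """
    ip_by_mac = {normalize_mac(mac): ip for ip, mac in arp_rows}
    located = []
    for switch, port, mac in mac_rows:
        key = normalize_mac(mac)
        # ip is None when the device never showed up in the router's ARP table
        located.append({"switch": switch, "port": port, "mac": key, "ip": ip_by_mac.get(key)})
    return located


mac_rows = [("sw1", "Fa0/3", "0011.2233.4455"), ("sw1", "Fa0/7", "00aa.bbcc.ddee")]
arp_rows = [("10.0.42.20", "00:11:22:33:44:55")]
devices = locate_devices(mac_rows, arp_rows)
for device in devices:
    print(device)
```

A real version would also need to leave out uplink ports, which learn the MAC addresses of everything behind the neighboring switch; the CDP data is one plausible way to identify those.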

Another relatively recent addition to our data tracking is WAN interface errors.  We poll for interface errors on the WAN link throughout the day.  Sites with any errors get polled more frequently.  If these sites with previous errors continue to rack up more on subsequent polling, emails are sent to alert the NOC of a continuing issue with the WAN link.  A beautiful dot chart created with this data lets us see trends in these errors over the course of the month, with a different background color for weekends, when we’d expect fewer vendor changes on the MPLS network.  This has even helped us find problems with the uplink from a Central Office to the MPLS cloud, when we noticed that numerous sites in the same vicinity all started having a similar pattern of errors.
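The escalation logic described above might look something like the following sketch.  The polling intervals, the `SiteState` record, and the counter-reset handling are assumptions added for illustration; the post doesn’t give the real values.

```python
from dataclasses import dataclass

NORMAL_INTERVAL = 3600  # seconds between polls for a clean site (assumed value)
FAST_INTERVAL = 300     # seconds between polls for a site showing errors (assumed value)


@dataclass
class SiteState:
    last_count: int = 0     # error counter seen on the previous poll
    watching: bool = False  # True once a poll has shown new errors


def process_poll(state: SiteState, current_count: int):
    """Update the site's state and return (next_poll_interval, send_alert)."""
    if current_count >= state.last_count:
        delta = current_count - state.last_count
    else:
        delta = current_count  # counter was cleared or the router reloaded
    send_alert = state.watching and delta > 0  # errors continued after we started watching
    state.watching = delta > 0
    state.last_count = current_count
    interval = FAST_INTERVAL if state.watching else NORMAL_INTERVAL
    return interval, send_alert


site = SiteState()
for count in (0, 50, 120, 120):
    print(count, process_poll(site, count))
```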

To be clear on these WAN interface errors, these are problems that we were not tracking at all until very recently, but they are very real issues.  By looking into the WAN link when a location is getting a few thousand errors in a day, we might head off a T1 circuit outage.

3G Link Monitoring

Most monitoring systems don’t have a great method of monitoring a secondary link that’s really a backup WAN link.  I’ve seen it done by routing the monitoring traffic across the backup link, but that depends on routes being configured properly.

In our case, we chose a different path.  I wrote a monitoring tool that logs into the central router for these backup links every 5 minutes.  It pulls the EIGRP neighbor table to see which locations have operational 3G links.  Further, it pulls the route table to find out which locations are actively routing across the 3G links.  Some database and parsing magic turns this into a monitoring system that sends SNMP traps to our NMS station, raising a “3G link is active” alarm when the normal WAN link is down and a “3G link is inoperable” alarm when the 3G link itself is down.  The data this tool collects is also available on a web page, where timestamps are displayed showing the “Last Active on 3G”, “Last Contact”, and other similar fields.
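Once the two tables are parsed, deciding which alarm applies to a site comes down to set membership.  This Python sketch assumes the parsing has already reduced each table to a set of site IDs; the function name, the site IDs, and the “standby” label for a healthy idle link are hypothetical, and sending the SNMP traps is left out.

```python
def classify_3g(sites, eigrp_neighbors, routed_via_3g):
    """Map each site to a 3G status based on the two tables pulled from the central router.

    sites:           every site that should have a 3G backup link
    eigrp_neighbors: sites currently seen as EIGRP neighbors over 3G
    routed_via_3g:   sites whose routes currently point across the 3G link
    """
    status = {}
    for site in sites:
        if site not in eigrp_neighbors:
            status[site] = "3G link is inoperable"  # backup link itself is down
        elif site in routed_via_3g:
            status[site] = "3G link is active"      # primary WAN is down, traffic is on 3G
        else:
            status[site] = "standby"                # 3G is up but not carrying traffic
    return status


sites = ["0041", "0042", "0043"]
result = classify_3g(sites, eigrp_neighbors={"0041", "0042"}, routed_via_3g={"0042"})
print(result)
```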

To be completely honest with you, the above method of monitoring the 3G link doesn’t sound like it would be very effective.  I had some doubts when writing it.  To my happy surprise, it’s extremely efficient, taking only a couple of seconds to do all that, once every 5 minutes.  It has been monitoring our 3G links now for about 3 years (since soon after installing them), and it works amazingly well.

Managing Big Networks

You can make things easier if you have control over the entire design from the beginning, but who is that lucky?

Liberal use of databases, combined with competent programming, is the key to managing networks of any significant size without losing all your hair.


Entry filed under: Networking, Programming General.
