Monitoring a network with EIGRP

September 18, 2015 at 9:59 pm

Most network monitoring involves polling.

You have a server (or a farm of them) reaching out across the WAN every minute or so and talking to every remote device to confirm that it is up and running.

There are a number of products out there that do this, but what if you could do it smarter?

At my day job, we have hundreds of remote sites connected via T1, each with an alternate link that will soon be LTE across the company.  We run EIGRP across our links so our routers know which links are available for traffic.  Yes, even our LTE links.  They all terminate on GRE tunnels on one router.  We set the EIGRP Hello time to 20 seconds and the Hold time to 60 seconds.  If 60 seconds pass without the router seeing a Hello, the neighbor is dropped and the link gets marked down.

I wrote a PHP program to handle this monitoring in a very efficient way.  Every minute, it opens an SSH session to this router and runs a “show ip eigrp neighbors” command to get a list of all active neighbors.  This tells me that each of those neighbors was active at the time I ran the command.  I log this info to a database table.  I also run a command like “show ip route | inc Tu”.  Thanks to our database, my program knows which EIGRP neighbor and which route belong to each location.  If I see a connected route to any Tunnel, I know we are actively running traffic across the LTE link to that location.  Since this is done every minute, I’m logging each time that a remote device has an EIGRP connection to headquarters.  I track the state of all the locations and send SNMP traps to our central manager to create alarms in two cases: when an EIGRP connection that should be there is missing, and when a route exists (meaning the LTE link is being actively used).
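The PHP program itself isn't reproduced here.  As a rough illustration of the per-minute check, here is a minimal sketch in Python rather than PHP.  Everything in it is an assumption made for the example: the sample command output (real IOS output varies by version, so the patterns would need checking against your own router), the site tables standing in for the database, and the function names.  The SSH session, the database logging and the SNMP traps are left out.

```python
import re

# Illustrative output only; column layout differs between IOS versions.
SAMPLE_NEIGHBORS = """\
EIGRP-IPv4 Neighbors for AS(100)
H   Address                 Interface              Hold Uptime   SRTT   RTO  Q  Seq
                                                   (sec)         (ms)       Cnt Num
1   10.0.0.6                Tu102                    51 03:12:55     25   200  0  98
0   10.0.0.2                Tu101                    47 1d02h        20   200  0  1234
"""

SAMPLE_ROUTES = """\
C        10.0.1.0/30 is directly connected, Tunnel102
L        10.0.1.1/32 is directly connected, Tunnel102
"""

# Hypothetical site tables; the post keeps these mappings in a database.
NEIGHBOR_TO_SITE = {
    "10.0.0.2": "store-101",
    "10.0.0.6": "store-102",
    "10.0.0.10": "store-103",
}
TUNNEL_TO_SITE = {
    "Tunnel101": "store-101",
    "Tunnel102": "store-102",
    "Tunnel103": "store-103",
}

# A neighbor row: index, IPv4 address, interface, hold time, then the rest.
NEIGHBOR_RE = re.compile(r"^\s*\d+\s+(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)\s+\d+\s")
# A connected ("C") route whose outgoing interface is a tunnel.
CONNECTED_RE = re.compile(
    r"^C\s+\S+ is directly connected, (Tunnel\d+)[ \t]*$", re.MULTILINE
)


def parse_neighbors(output):
    """Return the set of neighbor IP addresses listed in the command output."""
    found = set()
    for line in output.splitlines():
        match = NEIGHBOR_RE.match(line)
        if match:
            found.add(match.group(1))
    return found


def parse_connected_tunnels(output):
    """Return the set of tunnel interfaces that have a connected route."""
    return set(CONNECTED_RE.findall(output))


def evaluate(neighbor_output, route_output):
    """Return (sites missing their neighbor, sites with a connected tunnel route)."""
    up = parse_neighbors(neighbor_output)
    missing = sorted(site for ip, site in NEIGHBOR_TO_SITE.items() if ip not in up)
    tunnels = parse_connected_tunnels(route_output)
    on_tunnel = sorted(TUNNEL_TO_SITE[t] for t in tunnels if t in TUNNEL_TO_SITE)
    return missing, on_tunnel


missing, on_tunnel = evaluate(SAMPLE_NEIGHBORS, SAMPLE_ROUTES)
print("missing neighbors:", missing)
print("tunnel route present:", on_tunnel)
```

In a real run, the two lists would be compared with the previous minute's state so that a trap is only sent when a site changes state, not on every poll.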

The database tracks the total number of polls and the number of successful polls.  This lets me calculate an “Availability” number for that GRE Tunnel.  Note that this isn’t a real “Availability” number for the LTE link.  It’s an Availability number for the Tunnel, so it can easily be worse than the LTE link availability (if the remote router is down, for example).
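The arithmetic is just successful polls divided by total polls.  A small Python sketch, with a hypothetical function name and made-up poll counts:

```python
def availability_percent(successful_polls, total_polls):
    """Percentage of polls in which the tunnel's EIGRP neighbor was seen."""
    if total_polls <= 0:
        return None  # no polls recorded yet, so availability is undefined
    return 100.0 * successful_polls / total_polls


# Hypothetical day: 1440 one-minute polls, 3 of which found no neighbor.
print(round(availability_percent(1437, 1440), 3))  # prints 99.792
```

At one poll per minute, each missed poll costs about 0.07 percentage points of a day's availability, which is the granularity this method can report.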

If you described this to me as a monitoring solution, I wouldn’t expect it to work well.  The fact is that we’ve been running with this sort of solution for several years.  The difference now is that I’ve reduced the polling cycle from every 5 minutes to every minute to give me better granularity.  And it still works great, even with 150+ sites.  The beauty of this system is that adding more sites doesn’t really add more time (technically, it does, but it’s such a small number that it’s pretty much irrelevant).
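To put a rough number on that granularity, here is a back-of-the-envelope Python sketch of how long a failure can go unnoticed.  It assumes the link fails completely, that the Hold timer restarts on each Hello received, and that the time taken by the SSH session and the command is negligible; the function name is made up for the example.

```python
def detection_window(hello_s, hold_s, poll_s):
    """Return (earliest, latest) seconds between a link failure and the
    first poll that can report the neighbor as missing."""
    # The last Hello arrived up to hello_s seconds before the failure, so
    # the Hold timer expires between (hold_s - hello_s) and hold_s seconds
    # after it.  The next poll then runs up to poll_s seconds later.
    earliest = hold_s - hello_s
    latest = hold_s + poll_s
    return earliest, latest


print(detection_window(20, 60, 300))  # 5-minute polling: (40, 360)
print(detection_window(20, 60, 60))   # 1-minute polling: (40, 120)
```

Under those assumptions, going from a 5-minute to a 1-minute cycle cuts the worst case from about six minutes to about two, while the best case stays bounded by the EIGRP timers.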

Entry filed under: General, Networking.
