Visualizing Error Data with gRaphaël

September 17, 2013 at 10:24 pm

Visualize Errors?

For the last several years, we haven’t been collecting our WAN interface errors.  But then, practically no one else does either.  This data may signal an upcoming issue, or it may just mean that the circuit had a bad day.

If you could alert on this data, it might give you a heads-up that a circuit is having issues, possibly long before the situation gets bad enough to bring the circuit down.  But, you don’t want to jump at every problem that crops up.  You want to make sure there is some sustained issue, not a single bounce.

If you could trend on this data, it might show you circuits that are having a significant number of errors each and every day, but are generally staying below the radar.  If done very well, you might find that you can pick out issues you don’t even expect to be able to see.

Who cares about a few errors?

Generally speaking, most network techs don’t care about a few hundred errors showing up on an interface.  If the errors aren’t counting up over a relatively short period, most techs would say “Fugitaboutit”, and move on.

I strive to be better than “most” network techs, however.

For alerting on errors, the trick is to be able to pick out circuits that have ongoing problems, not just a single circuit bounce.  Many times, a single circuit bounce can be attributed to a one-time event, if it can be attributed to anything at all.  Alerting on a single bounce doesn’t help you find real problems.  If a problem has some consistency though, the vendor may be able to pinpoint the source and take action to fix it.  If, for example, a circuit has taken 2000+ errors an hour for the last 3 hours, there’s a good chance the vendor will be able to see the problem and do something about it.

When it comes to trending, the trick is to represent the data in a visually pleasing form that makes trends in errors easy to pinpoint.  This is easier said than done, however.  A single line or bar graph per site would be time-consuming to look through.  A chart consisting of the actual error counts over the course of a month might be wider than your screen, if kept at a size where the numbers were readable.

In every case, you don’t want to overwhelm yourself with data.

Collect the Data

Before you can graph anything, you have to collect data.

I wrote a quick program that reads every location’s WAN interface errors via SNMP.  It runs every hour.  During the 5 AM, 11 AM, 5 PM, and 11 PM runs, it pulls back errors from every remote location’s WAN interface.  We’ll call that a Full run.  During every other hour, it performs what I’ll call here a Mini run.  During these Mini runs, it polls only the locations that had errors above a threshold during the last Full run.  It tracks the last actual SNMP value along with the most recent delta value in a database table, with one row per location.  When each poll occurs, the newly polled “actual” value is compared with the stored value to calculate the delta.  If the new “actual” value is less than the previous value, either the counter has rolled over, or the device has rebooted.  The program also accumulates a total of WAN errors for the day.
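To make the schedule and the delta calculation concrete, here is a rough Python sketch.  It is not the actual program.  The names run_type and counter_delta, the 32-bit counter size, and the max_plausible cutoff are all assumptions made for illustration, and the way it tells a rollover from a reboot is only one possible heuristic.

```python
# Hours of the day (24-hour clock) that get a Full run: 5 AM, 11 AM, 5 PM, 11 PM.
FULL_RUN_HOURS = {5, 11, 17, 23}


def run_type(hour):
    """Return "full" for the four Full run hours and "mini" for every other hour."""
    return "full" if hour in FULL_RUN_HOURS else "mini"


def counter_delta(previous, current, counter_max=2**32, max_plausible=1_000_000):
    """Errors counted between two polls of an ever-increasing SNMP counter.

    Assumption: the counter is 32 bits wide. If the new value is lower than
    the old one, the counter either rolled over or the device rebooted.
    Heuristic (also an assumption): accept the rollover-adjusted delta only
    if it is plausibly small; otherwise assume a reboot reset the counter to
    zero and treat the new value as the delta.
    """
    if current >= previous:
        return current - previous
    wrapped = counter_max - previous + current
    if wrapped <= max_plausible:
        return wrapped
    return current


print(run_type(5), run_type(6))        # full mini
print(counter_delta(1000, 1540))       # 540
print(counter_delta(2**32 - 10, 5))    # 15 (rollover)
print(counter_delta(500_000, 20))      # 20 (treated as a reboot)
```

The cutoff is a judgment call: set it too low and a genuine rollover during a very bad hour gets treated as a reboot, set it too high and a reboot produces a huge bogus delta.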

Alerting on Errors

During the four Full runs that happen each day, no alerts are ever sent.  At the conclusion of each hourly Mini run, a check is performed to see if any location has taken additional errors above a threshold since the last value stored in the database, which comes from either the last Full run or the last alert.  If so, an alert is sent and the delta error value in the database is updated to the newly calculated delta value.

Did you follow that?  I wrote it, and I’m not even sure I fully follow it.

It’s a bit confusing, so let’s have an example:

Threshold is set to 500 errors.

5 AM Full run:  Delta value for location 191 is 540 errors.  There were no errors at the last 11 PM Full run for this location, so this delta is for the 6 hours from 11 PM to 5 AM.  No alert is sent during any Full run.
6 AM Mini run: Delta value for location 191 is 150 errors since the 5 AM run.  During this run, we saw errors, but no alert is sent because we haven’t hit the 500 error threshold.
7 AM Mini run: Delta value for location 191 since the 5 AM run is 290 errors.  We had even more errors, but still under 500.  No alert is sent, DB not updated.
8 AM Mini run: Delta value for location 191 since the 5 AM run is 525 errors.  We’ve hit the threshold.  An alert is sent, and the database is updated.
9 AM Mini run: Delta value for location 191 since the 8 AM run is 490 errors.  Under the 500 threshold since 8 AM (the last time the DB was updated with a new delta), so no alert, and no DB update.
10 AM Mini run: Delta value for location 191 since the 8 AM run is 980 errors.  An alert is sent and the database is updated.

To recap, in the above example, an alert was sent at 8 AM stating that we had 525 errors since 5 AM for location 191.  At 10 AM, another alert goes out stating that we have had another 980 errors since the 8 AM alert.  So, we’ve reduced the potential for 6 alerts (one at each hour) to only 2 alerts.
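The example above can be replayed in a few lines of Python.  This is only a sketch of the logic as described, not the production code.  The function name mini_run and the raw counter readings are made up (the readings are chosen so the deltas match the example), and treating exactly 500 errors as “hitting” the threshold is an assumption.

```python
THRESHOLD = 500  # errors, as in the example above


def mini_run(baseline, current, threshold=THRESHOLD):
    """Check one location during a Mini run.

    Returns (alert_delta, new_baseline). alert_delta is None when no alert
    is sent; in that case the stored baseline is left alone.
    """
    delta = current - baseline
    if delta >= threshold:
        return delta, current   # alert, and move the baseline forward
    return None, baseline       # no alert, DB not updated


# Hypothetical raw counter readings for location 191.
baseline = 1540  # value stored by the 5 AM Full run (no alert on Full runs)
readings = [
    ("6 AM", 1690),
    ("7 AM", 1830),
    ("8 AM", 2065),
    ("9 AM", 2555),
    ("10 AM", 3045),
]

alerts = []
for label, value in readings:
    delta, baseline = mini_run(baseline, value)
    if delta is not None:
        alerts.append((label, delta))

print(alerts)  # [('8 AM', 525), ('10 AM', 980)]
```

Because the baseline only moves when an alert fires, a slow trickle of errors still adds up and eventually trips the threshold, while a quiet circuit never alerts at all.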

This is the best way that I’ve come up with to alert our operations team on “non-production down” issues.  They don’t get bombarded with alerts, and they don’t have to wait until the next day to see if there is a problem.  I still feel like there is a lot of room for improvement here, though.

Trending Errors

Every time the above example says “database is updated”, the program also adds the new errors to a running total by date, in addition to the alert tracking.  That total is a separate entry for each day and location.

What we end up with is a table full of location numbers, dates, and error counts.
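Here is a minimal sketch of that kind of table, using Python’s built-in sqlite3 module with an in-memory database.  The table name daily_errors, its column names, and the helper add_errors are hypothetical, since the real schema isn’t shown here.  The date and the two error counts are just sample values borrowed from the alerting example.

```python
import sqlite3
from datetime import date


def add_errors(conn, location, day, errors):
    """Add errors to the running total for one location on one date."""
    cur = conn.execute(
        "UPDATE daily_errors SET errors = errors + ? WHERE location = ? AND day = ?",
        (errors, location, day.isoformat()),
    )
    if cur.rowcount == 0:
        # No row yet for this location and date, so create one.
        conn.execute(
            "INSERT INTO daily_errors (location, day, errors) VALUES (?, ?, ?)",
            (location, day.isoformat(), errors),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_errors ("
    "location INTEGER, day TEXT, errors INTEGER, PRIMARY KEY (location, day))"
)

# Two updates for the same location on the same (made-up) date.
add_errors(conn, 191, date(2013, 9, 17), 525)
add_errors(conn, 191, date(2013, 9, 17), 980)

rows = conn.execute("SELECT location, day, errors FROM daily_errors").fetchall()
print(rows)  # [(191, '2013-09-17', 1505)]
```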

While surfing the web looking for a good tool for visualizing data, we ran across gRaphaël.  There are numerous kinds of graphs you can make with this tool, but the one that caught my eye was the dot chart.  This chart allows us to have each router name listed down the left side of the graph, with each day of the month going across the top.  Where the date and router name intersect, it places a dot.  The size and color of the dots are relative to the number of WAN interface errors for each day.  Hovering over a dot displays the actual number of errors for that location, for that date.  We did alter the formula that gRaphaël uses to determine the breakdown of the different sizes.  Our changes make a large red dot appear when you have roughly 1000 WAN interface errors in a day.
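As far as I understand it, the dot chart wants three parallel lists: an x value, a y value, and a data value for every dot.  The Python sketch below shows one way to flatten a table of (location, day, errors) rows into that shape, with zeros filled in where a location had no errors on a given day.  The function name to_dotchart_arrays and the sample rows are made up for illustration, and this is not the page I actually wrote.

```python
import json


def to_dotchart_arrays(rows, locations, days):
    """Flatten (location, day, errors) rows into parallel xs, ys, data lists.

    Every location/day pair gets an entry, so missing combinations become 0.
    """
    totals = {(loc, day): errors for loc, day, errors in rows}
    xs, ys, data = [], [], []
    for row_index, loc in enumerate(locations):
        for day in days:
            xs.append(day)          # column: day of the month
            ys.append(row_index)    # row: position of the location in the list
            data.append(totals.get((loc, day), 0))
    return xs, ys, data


# Made-up sample: two locations over the first three days of a month.
rows = [(191, 1, 540), (191, 3, 1505), (205, 2, 12)]
xs, ys, data = to_dotchart_arrays(rows, locations=[191, 205], days=[1, 2, 3])

print(json.dumps(xs))    # [1, 2, 3, 1, 2, 3]
print(json.dumps(ys))    # [0, 0, 0, 1, 1, 1]
print(json.dumps(data))  # [540, 0, 1505, 0, 12, 0]
```

The JSON strings can be dropped straight into the page as JavaScript arrays.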

The concept we had was good.  I was able to quickly whip up a page that read the current month of data from the database and filled variables in the format appropriate for gRaphaël.  After a full month, we plot about 21000 dots on this chart.  It takes 30-60 seconds to load, but it works great.

I wanted to break the chart up a bit, because with over 700 locations, you have to scroll to see everything.  So, I placed a gRaphaël bar graph under the dot chart.  This bar graph is populated with data so that it draws a bar for each date in the month that falls on a Saturday or Sunday.
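The data behind that weekend bar graph is simple to generate.  Here is a small Python sketch, where the function name weekend_bars is made up, that returns one value per day of the month: 1 for a Saturday or Sunday and 0 for a weekday.

```python
import calendar
from datetime import date


def weekend_bars(year, month):
    """Return one value per day of the month: 1 on weekends, 0 on weekdays."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        1 if date(year, month, day).weekday() >= 5 else 0  # 5 = Saturday, 6 = Sunday
        for day in range(1, days_in_month + 1)
    ]


bars = weekend_bars(2013, 9)
weekend_days = [day for day, flag in enumerate(bars, start=1) if flag]
print(weekend_days)  # [1, 7, 8, 14, 15, 21, 22, 28, 29]
```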

Here’s how it looks (the red dot with a number tag was being hovered over):

[Image: dotChart]

Using this graph, we can easily identify locations that have not-so-severe, but continuing issues.  With a cooperative vendor, we can fix problems before circuits go down hard.

Even more, after we had been running this for some time, I spotted a group of locations that started having the same pattern of errors, on the same day.  This pattern continued, with very similar numbers of errors across those locations each day.  A little research showed that these locations all terminated at the same Central Office.  When we contacted the vendor, they were able to determine that the issue was a back-haul circuit from that CO.

We would like to think that the vendor would have found this on their own, but the problem went on for almost a week before we reported it, and they fixed it the next day.  This is how good trending can show you errors that aren’t even on your circuits!

 

Entry filed under: Networking.
