Archive for September, 2013

A simple YNAB feature that delights me

This may seem odd, but there is a little feature in YNAB where I can exactly match up my current bank account balance to my YNAB account balance.  It’s actually pretty simple, but each time I try it and it works the way I hope, I just love it.

Go with this for a minute.  I know it won’t make sense up front, but bear with me:

I wrote a check out to a friend some months back.  I’m guessing either he lost it, forgot about it, or never intended to cash it.  I had started using YNAB to track my spending, so this check is listed in there, but month after month, he doesn’t cash the check.

In YNAB, you don’t have a balance column by default, but this is one you need for a Checking or Savings account, in my opinion.  To add one, hit the “+” sign in the upper right of your account screen and you’ll be able to edit your columns.  At this point, your YNAB account screen probably looks like something out of Quicken, which is to be expected.  If you are like me, you enter future transactions in YNAB well ahead of the actual dates (perhaps you’ve written out checks for future bills or scheduled on-line payments).

So, go to your bank’s website and log in to view your cleared transactions, then mark as cleared in YNAB every transaction your bank shows as cleared.

When you have all the transactions that have cleared your bank marked in YNAB appropriately, hover your mouse over the balance column for today’s date.

It will pop up a small bubble showing your cleared bank balance, your uncleared total, and your working balance as of the date you hovered over.  (Your working balance is your cleared balance plus your uncleared total, where uncleared outflows count as negative amounts.)  And yes, in my case, the cleared bank balance YNAB shows actually matches my bank account balance to the penny.  It has for months.  It doesn’t matter that my friend’s check has never been cashed.  YNAB knows that, and it factors it in.
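The arithmetic behind that bubble is tiny.  Here’s a sketch with made-up numbers (uncleared amounts are signed, so an uncashed check shows up as a negative):

```python
# Illustrative numbers only; this is my understanding of the bubble's math,
# not YNAB's actual code.
cleared_balance = 1200.00   # everything the bank has processed
uncleared_total = -50.00    # the friend's uncashed check, still outstanding
working_balance = cleared_balance + uncleared_total
print(working_balance)      # 1150.0 -- what you'd have once everything clears
```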

I’m not sure exactly why, but to me it is very reassuring to see that my account balance in YNAB for the current date matches the bank’s records.

September 27, 2013 at 10:20 pm

Visualizing Error Data with gRaphaël

Visualize Errors?

For the last several years, we weren’t collecting our WAN interface errors.  But then, practically no one else does either.  This data may indicate an upcoming issue, or it may just mean that the circuit had a bad day.

If you could alert on this data, it might give you a heads-up that a circuit is having issues, possibly long before the situation gets bad enough to bring the circuit down.  But, you don’t want to jump at every problem that crops up.  You want to make sure there is some sustained issue, not a single bounce.

If you could trend on this data, it might show you circuits that are having a significant number of errors each and every day, but are generally staying below the radar.  If done very well, you might find that you can pick out issues you don’t even expect to be able to see.

Who cares about a few errors?

Generally speaking, most network techs don’t care about a few hundred errors showing up on an interface.  If the errors aren’t counting up over a relatively short period, most techs would say “Fugitaboutit”, and move on.

I strive to be better than “most” network techs, however.

For alerting on errors, the trick is to be able to pick out circuits that have ongoing problems, not just a single circuit bounce.  Many times, a single circuit bounce can be attributed to a one-time event, if it can be attributed to anything at all.  Alerting on a single bounce doesn’t help you find real problems.  If a problem has some consistency though, the vendor may be able to pinpoint the source and take action to fix it.  If, for example, a circuit has taken 2000+ errors an hour for the last 3 hours, there’s a good chance the vendor will be able to see the problem and do something about it.

When it comes to trending, the trick is to represent the data in a visually pleasing form that makes it easy to pinpoint trends in errors.  This is easier said than done, however.  A single line or bar graph per site would be time consuming to look through.  A chart consisting of the actual error counts over the course of a month might be wider than your screen, if kept at a size where the numbers were readable.

In every case, you don’t want to overwhelm yourself with data.

Collect the Data

Before you can graph anything, you have to collect data.

I wrote a quick program that reads every location’s WAN interface errors via SNMP.  It runs every hour.  During the 5 AM, 11 AM, 5 PM, and 11 PM runs, it pulls back errors from every remote location’s WAN interface.  We’ll call that a Full run.  During every other hour, it performs what I’ll call here a Mini run.  During these Mini runs, it polls every location that had errors above a threshold during the last Full run.  It tracks the last actual SNMP value along with the most recent delta value in a database table, with one row per location.  When each poll occurs, the newly polled “actual” value is used to calculate the delta.  If the new “actual” value is less than the previous value, either the counter has rolled over or the device has rebooted.  It also accumulates a total of WAN errors for the day.
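Roughly, the per-poll bookkeeping looks something like this.  This is a simplified sketch; the function names and row layout here are illustrative, not my actual code:

```python
# Per-poll delta calculation for one location's WAN interface errors.
# Assumes a 32-bit SNMP counter (e.g. ifInErrors) that only ever counts up.

def compute_delta(previous_raw: int, current_raw: int) -> int:
    """Errors accrued since the last poll, handling rollover/reboot."""
    if current_raw >= previous_raw:
        return current_raw - previous_raw
    # Counter went backwards: either the 32-bit counter rolled over, or
    # the device rebooted and started counting from zero.  Without knowing
    # which, the new raw value is the safest delta estimate.
    return current_raw

def record_poll(row: dict, current_raw: int) -> int:
    """Update one location's database row with the newest poll."""
    delta = compute_delta(row["last_raw"], current_raw)
    row["last_raw"] = current_raw   # newest actual SNMP value
    row["last_delta"] = delta       # most recent delta
    row["daily_total"] += delta     # accumulated errors for the day
    return delta
```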

Alerting on Errors

During the four Full runs that happen each day, no alerts are ever sent.  At the conclusion of each hourly Mini run, a check is performed to see if any locations had additional errors above a threshold, beyond the delta value pulled in that last Full run count.  If so, an alert is sent and the delta error value in the database is updated to the newly calculated value.

Did you follow that?  I wrote it, and I’m not even sure I fully follow it.

It’s a bit confusing, so let’s have an example:

Threshold is set to 500 errors.

5 AM Full run:  Delta value for location 191 is 540 errors.  There were no errors at the last 11 PM Full run for this location, so this delta is for the 6 hours from 11 PM to 5 AM.  No alert is sent during any Full run.
6 AM Mini run: Delta value for location 191 is 150 errors since the 5 AM run.  During this run, we saw errors, but no alert is sent because we haven’t hit the 500 error threshold.
7 AM Mini run: Delta value for location 191 since the 5 AM run is 290 errors.  We had even more errors, but still under 500.  No alert is sent, DB not updated.
8 AM Mini run: Delta value for location 191 since the 5 AM run is 525 errors.  We’ve hit the threshold.  An alert is sent, and the database is updated.
9 AM Mini run: Delta value for location 191 since the 8 AM run is 490 errors.  Under the 500 threshold since 8 AM (the last time the DB was updated with a new delta), so no alert, and no DB update.
10 AM Mini run: Delta value for location 191 since the 8 AM run is 980 errors.  An alert is sent and the database is updated.

To recap, in the above example, an alert was sent at 8 AM stating that we had 525 errors since 5 AM for location 191.  At 10 AM, another alert goes out stating that we have had another 980 errors since the 8 AM alert.  So, we’ve reduced the potential for 6 alerts (one at each hour) to only 2 alerts.
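Replaying that example in a few lines of code may make it clearer.  The state dict here stands in for the database row; it’s an illustrative sketch, not the actual tool:

```python
# Mini-run alert decision: alert only when errors accumulated since the
# last database update (Full run or last alert) reach the threshold.

THRESHOLD = 500

def mini_run_check(cumulative_errors: int, state: dict):
    """Return the error count to alert on, or None if below threshold."""
    delta = cumulative_errors - state["last_recorded"]
    if delta >= THRESHOLD:
        state["last_recorded"] = cumulative_errors  # DB update on alert
        return delta                                # errors to report
    return None                                     # below threshold: silent

state = {"last_recorded": 0}                 # baseline set by the 5 AM Full run
hourly_totals = [150, 290, 525, 1015, 1505]  # 6-10 AM, cumulative since 5 AM
alerts = [mini_run_check(total, state) for total in hourly_totals]
print(alerts)  # [None, None, 525, None, 980]: alerts at 8 AM and 10 AM only
```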

This is the best way that I’ve come up with to alert our operations team on “non-production down” issues.  They don’t get bombarded with alerts, and they don’t have to wait until the next day to see if there is a problem.  I still feel like there is a lot of room for improvement here, though.

Trending Errors

Each time the database is updated in the above example, in addition to the alert tracking, the total errors for that date are incremented.  That’s a separate entry for each day and location, accumulating errors.

What we end up with is a table full of location numbers, dates, and error counts.

While surfing the web looking for a good tool for visualizing data, we ran across gRaphaël.  There are numerous graph types you can make with this tool, but the one that caught my eye was the dot chart.  This chart lets us list each router name down the left side of the graph, with each day of the month going across the top.  Where a date and router name intersect, it places a dot.  The size and color of each dot are relative to the number of WAN interface errors for that day.  Hovering over a dot displays the actual number of errors for that location, on that date.  We did alter the formula that gRaphaël uses to determine the size breakdown; our changes make a large red dot appear at roughly 1000 WAN interface errors in a day.

The concept we had was good.  I was able to quickly whip up a page that read the current month of data from the database and filled variables in the format appropriate for gRaphaël.  After a full month, we plot about 21000 dots on this chart.  It takes 30-60 seconds to load, but it works great.
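The data-shaping step is simple.  Here’s a hypothetical sketch of building the parallel x/y/value arrays a dot chart wants; gRaphaël itself is JavaScript and runs client-side, and the row format here is an assumption:

```python
# Shape (location, day_of_month, error_count) rows from the database into
# the three parallel arrays a dot chart consumes.

def build_dot_data(rows, locations):
    """rows: iterable of (location, day_of_month, error_count) tuples."""
    y_index = {loc: i for i, loc in enumerate(locations)}  # one row per router
    xs, ys, values = [], [], []
    for loc, day, errors in rows:
        xs.append(day)            # column: day of the month
        ys.append(y_index[loc])   # row: router's position on the left axis
        values.append(errors)     # drives dot size/color (big red near 1000)
    return xs, ys, values

xs, ys, values = build_dot_data(
    [("rtr-0001", 1, 120), ("rtr-0002", 1, 1350), ("rtr-0001", 2, 40)],
    ["rtr-0001", "rtr-0002"])
```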

I wanted to break the chart up a bit, because with over 700 locations, you have to scroll to see everything.  So, I placed a gRaphaël bar graph under the dot chart.  This bar graph is populated so that it draws a bar corresponding to the dates during the month that fall on Saturdays and Sundays.

Here’s how it looks (the red dot with a number tag was being hovered over):


Using this graph, we can easily identify locations that have not-so-severe but continuing issues.  With a cooperative vendor, we can fix problems before circuits go down hard.

Even more, after we had been running this for some time, I spotted a group of locations that started having the same pattern of errors, on the same day.  The pattern continued, with each day’s error counts very similar across the group.  A little research showed that these locations all terminated at the same Central Office.  When we contacted the vendor, they were able to determine that the issue was a back-haul circuit from that CO.

We would like to think that the vendor would have found this on their own, but the problem went on for almost a week before we reported it, and they fixed it the next day.  This is how good trending can show you errors that aren’t even on your circuits!


September 17, 2013 at 10:24 pm

Managing thousands of network devices

I’m a network designer and the tools I’ve written manage 700+ networks for one company.  Each of these networks contains a Cisco router (various models), between two and six Cisco switches (mostly various 2960 models), and a CheckPoint UTM.  Altogether, it’s over 3000 switches.  We have heard hints from management that we are growing to around 1000-1100 sites in the near future.  With our network design, expanding to that many sites will be almost effortless.

My day to day job involves a lot of configurations and a lot of data.  We don’t use any software from Cisco or any 3rd party to manage our configurations.  The tools that generate and manage the configurations were all written by me.  I have not yet seen a piece of COTS software that can manage router & switch configurations in a manner suitable for our business.

Generating Configurations

Each device config is generated from a template.  Most hardware models have a unique template.  These templates contain placeholders for items that are unique for each location.  Various database tables track these unique values, and my tools drop the right values in the right spot.
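In spirit, the substitution step is nothing fancy.  A minimal sketch, assuming {NAME}-style placeholders and a per-location values dict pulled from the database (the real templates and variable names aren’t shown here):

```python
# Fill a config template's placeholders with one location's values.

def render_config(template: str, values: dict) -> str:
    config = template
    for key, value in values.items():
        config = config.replace("{" + key + "}", str(value))
    return config

template = ("hostname {HOSTNAME}\n"
            "interface Serial0/0\n"
            " ip address {WAN_IP} {WAN_MASK}\n")
print(render_config(template, {"HOSTNAME": "rtr-0191",
                               "WAN_IP": "10.19.1.1",
                               "WAN_MASK": "255.255.255.252"}))
```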

In the case of the routers, these templates are almost 1000 lines of commands.  Routers have very complex configurations and the widest range of variables.  Over 125 variable substitutions occur for every router configuration.

The switch templates are a little simpler, but are still over 400 lines of commands each.  One database table tracks the admin status of each switch port, along with the speed and duplex setting.  Those settings are tracked so that if that switch config gets regenerated, the admin status, speed and duplex settings are retained in the resultant config.

Oh, the switch configurations can be a bit tricky, partially because we have different switch designs at different sites.  Tracking that is no problem, though, thanks again to the database.  We also name our configurations, and each location has a slot in the database to track that as well.

The UTMs are somewhat unique, in that they house a 3G connection.  This is backed by a Ruby on Rails web app that lets the Network Operations team pair the UTM with a 3G modem from one of two vendors and assign a unique static IP.  This database also tracks serial numbers and phone numbers for modems and SIM cards.  Once a device is assigned to a location, a configuration is generated within a couple of minutes.  The database is versioned, and I’ve provided a web interface so the NOC can “go back in time”, so to speak, and see exactly when and what changes were made.  This is very important, as it helps track the actual hardware, since people make mistakes when faced with an easy-to-alter database.  Other tools coordinate the configuration of routers with each UTM, so that a backup WAN link can run over this 3G connection.

Managing Configurations

Notice that the title of this post contains the word Managing.  My job isn’t done at just generating complex config files that work together.  We have a team of network operations folks who handle the day-to-day care of these devices.  They need a high level of access to do their jobs, but they occasionally change things in the course of troubleshooting and don’t always put everything back.

I hate it when that happens.

So, I audit.  I have a series of tools that work hand-in-hand with the generation tools mentioned above.  Every day, these audit tools read the configuration of every router, switch, and UTM.  Configurations are generated and diffs are performed.  When differences are found, they are pinpointed (down to the interface or sub-section of the config) and emailed out to the team, highlighting the lines that are either missing (part of the template, but not in the actual config of the device) or present but not expected (extra lines in the config that don’t exist in the template).  This allows us to quickly find and clean up the human error that slips in whenever humans are involved.
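The diff step can be sketched with Python’s standard difflib.  This is a simplified illustration, not the actual audit tool:

```python
# Diff a generated (template-driven) config against the device's actual
# config.  "Missing" lines are in the template but absent from the device;
# "extra" lines are on the device but not in the template.

import difflib

def audit(generated: str, actual: str):
    missing, extra = [], []
    diff = difflib.unified_diff(generated.splitlines(),
                                actual.splitlines(),
                                lineterm="", n=0)
    for line in diff:
        if line.startswith("-") and not line.startswith("---"):
            missing.append(line[1:])   # expected but not present
        elif line.startswith("+") and not line.startswith("+++"):
            extra.append(line[1:])     # present but not expected
    return missing, extra
```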

This is vitally important.

About a year and a half ago, I was tasked with incorporating our network with that of another company.  While the network team of that company had kept things running, their configurations were far from standard.  If an issue arose at one site, a band-aid was applied to work around the problem.  In many cases, problems were forgotten and a proper solution was never implemented.  Rinse.  Repeat.  This appeared to have happened for years, with various technicians implementing their “fixes” on their own.  There was a “port standard” for VLANs, but that was at least partially abandoned in most locations.  The result was that practically every site was “a one-off”, a unique non-standard configuration.  This makes standardization a nightmare.

By performing daily audits, we can catch these sorts of problems.  Network techs who come from an environment where they could change whatever they wanted become more conscientious, knowing that their actions are being monitored to ensure the configurations stay standardized.  While only a few issues are caught each week by these audits, it’s easy to see how they keep our network constantly snapping back to a desired state.

Remember, above, when I mentioned the switch configurations having names?  That goes hand-in-hand with the auditing tools.  A switch is audited against the switch configuration style that shows up in the database for that location.  So, if you implement a new config style in 10 locations, the audit tools will be auditing the switches at those locations against the new templates, not against the templates driving the configurations in the other locations.

Audits also serve another important purpose.  Once every couple of months or so (sometimes much more frequently), configuration templates change.  The audit tools are written such that these differences can be programmed for.  If a particular set of differences is found, the audit tool itself will actually perform the commands to get the device configured properly.  In the event that hardware is being upgraded, routers may be configured months ahead of deployment.  If configuration changes happen in the meantime (like they often do), the next audit after the device installation will bring the config up to the standard.

In addition to auditing each of the remote site devices, another tool audits the various central routers that the remote sites connect into.  These routers literally contain many thousands of lines of configuration, all of which must be exactly correct in order to properly work.

Making Big Changes

I’m currently involved in a project to consolidate two networks.  Essentially, we have a pair of central routers with high-speed links connecting to them.  These routers connect to an old network that is slowly going away.  We’ve attached these central routers to the new network via a Gig interface, which is part of a new VRF.  Once techs are on-site to strip the old gear out, leaving only our new gear, we can make the required configuration changes to swing a location from the old network to the new network, just by running a script.  It’s actually much more complex than that, but that’s all the Operations team will have to know, as the intricacies of the changes are mostly hidden from them.  These hidden changes include not just the central routers that the circuits terminate on, but database changes, and another pair of core switches and firewalls that require route changes at cut-over time.  By using the template approach, all of the network-side changes are possible without significant programming, planning, scheduling, or implementation effort.

Data, Data, and Even More Data

Managing these networks doesn’t just mean managing their configurations.  In addition, there’s lots of data collection that goes on.

Every router, switch, and UTM has various information polled each day.  The model of device and level of firmware is pulled from each, along with other hardware specific data.  This ensures that, for example, if a device experiences a hardware failure and gets replaced with a piece of hardware running another version of IOS, it gets noticed reasonably soon so that it can be corrected.  In the case of the UTMs, this daily data pull includes the firewall policy that is active on the UTM.  The CheckPoint management server occasionally has issues where not all devices get updated, and a simple sortable html table showing this data lets us easily see which devices haven’t been updated to the latest policy yet.  A simple html table for the routers and switches gives totals of what model of hardware is running which version of IOS, as well as how many of each model are in the field.

The above paragraph just hits the easy stuff.  In addition, each day the entire MAC address table is pulled from every switch, along with the ARP table from the corresponding routers.  By cross-referencing the MAC addresses (associated with the ports) with the router’s ARP table, we can tell which IP device is attached to which physical switch port.  CDP info is also pulled.  The result is a web page where you can enter a site ID and get a chart of all switches at that site, the state of each switch port (including how many days since the last status change), the cross-connections between switches (and other CDP-capable devices), and every IP device attached to the switch, right down to the port.  (Devices that don’t communicate across the router much may not show up in the ARP table, so a couple of devices might be missing, but this catches 95+% of devices.)  Various people across different areas of I.T. use this data daily to help them quickly locate and troubleshoot equipment.  This info is also very valuable if you are trying to move toward a standard layout of equipment on specific ports of specific switches.
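The cross-reference itself is a simple join.  A sketch, assuming the tables have already been parsed into dicts (the formats here are illustrative):

```python
# Join the switch MAC table (port -> MAC) with the router ARP table
# (MAC -> IP) to map each switch port to the IP device plugged into it.

def map_ports_to_ips(mac_table: dict, arp_table: dict) -> dict:
    """mac_table: {port: mac}; arp_table: {mac: ip} -> {port: ip}."""
    return {port: arp_table[mac]
            for port, mac in mac_table.items()
            if mac in arp_table}   # quiet devices may be absent from ARP

mac_table = {"Gi0/1": "0011.2233.4455", "Gi0/2": "aabb.ccdd.eeff"}
arp_table = {"0011.2233.4455": "10.19.1.50"}   # Gi0/2's device never ARPed
print(map_ports_to_ips(mac_table, arp_table))  # {'Gi0/1': '10.19.1.50'}
```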

Another relatively recent addition to our data tracking is WAN interface errors.  We poll for interface errors on the WAN link throughout the day.  Sites with any errors get polled more frequently.  If these sites with previous errors continue to rack up more on subsequent polling, emails are sent to alert the NOC of a continuing issue with the WAN link.  A beautiful dot chart created with this data lets us see trends in these errors over the course of the month, with a different background color for weekends, when we’d expect fewer vendor changes on the MPLS network.  This has even helped us find problems with the uplink from a Central Office to the MPLS cloud, when we noticed that numerous sites in the same vicinity all started having a similar pattern of errors.

To be clear on these WAN interface errors, these are problems that we were not tracking at all until very recently, but they are very real issues.  By looking into the WAN link when a location is getting a few thousand errors in a day, we might head off a T1 circuit outage.

3G Link Monitoring

Most monitoring systems don’t have a great method of monitoring a secondary link that’s really a backup WAN link.  I’ve seen them implemented by having the route for monitoring go across the backup link, but that is dependent on routes being configured properly.

In our case, we chose a different path.  I wrote a monitoring tool that logs into the central router for these backup links every 5 minutes.  It pulls the EIGRP neighbor table to see which locations have operational 3G links.  Further, it pulls the route table to find out which locations are actively routing across the 3G links.  Some database and parsing magic combine to give us a monitoring system that sends SNMP traps to our NMS station that will give us an alarm that “3G link is active” (when the normal WAN link is down) and another alarm “3G link is inoperable” when the 3G link itself is down.  The data this tool collects is also available on a web page, where timestamps are displayed showing the “Last Active on 3G”, “Last Contact”, and other similar fields.
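The neighbor-table parsing can be sketched like this.  The sample output and the peer-IP-to-location mapping are illustrative assumptions, not my actual parser:

```python
# Pull live 3G locations out of "show ip eigrp neighbors" output by
# matching each neighbor's peer IP against a known IP-to-location map.

import re

def operational_locations(neighbor_output: str, ip_to_location: dict) -> set:
    """Locations whose 3G link currently has an EIGRP adjacency."""
    live = set()
    for line in neighbor_output.splitlines():
        match = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", line)
        if match and match.group(1) in ip_to_location:
            live.add(ip_to_location[match.group(1)])
    return live

sample = ("H   Address        Interface   Hold Uptime   SRTT\n"
          "1   10.200.0.191   Gi0/1       14   2d03h    25\n"
          "0   10.200.0.205   Gi0/1       12   5w1d     30\n")
peers = {"10.200.0.191": 191, "10.200.0.205": 205}
# operational_locations(sample, peers) returns the set {191, 205}
```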

To be completely honest with you, the above method of monitoring the 3G link doesn’t sound like it would be very effective.  I had some doubts when writing it.  To my happy surprise, it’s extremely efficient, taking only a couple of seconds to do all that, once every 5 minutes.  It has been monitoring our 3G links now for about 3 years (since soon after installing them), and it works amazingly well.

Managing Big Networks

You can make things easier if you have control over the entire design from the beginning, but who is that lucky?

Liberal use of databases, combined with competent programming, is the key to managing networks of any significant size without losing all your hair.

September 9, 2013 at 11:54 pm

Diagnose Your Spending Addiction

Are you a Spending Addict?

The first step to recovery is realizing you have a problem.

Does this describe you?

You don’t budget because you think you make enough money that you shouldn’t need to budget.
Several times a month you go out to eat, on a whim.
While shopping at department or discount stores, if you spot something you like, you just buy it.
While surfing online, you spot something cool and order it without much thought.

Read each of the questions below slowly, really think about your answers, and be completely honest with yourself.

Do you avoid tracking how much consumer debt you have (all credit cards and personal loans)?
Do you avoid logging into your credit card accounts to track your spending throughout the month?
Do you semi-consciously avoid tracking your spending, so that you can keep spending without guilt?

I believe a lot of people have trouble controlling their spending because they purposely keep themselves in the dark.  This way, when it is time to make a spending decision, since they don’t KNOW how bad things are, it’s easier to give themselves permission to make that purchase.  If you do this, you are withholding vital information from yourself, information that could help you make better financial decisions.

Think about it.  Does the above sound hauntingly familiar?

Ok, I admit it, I’m a Spending Addict, Help!

If you see most of the above in yourself, you might be a Spending Addict.

Once you have realized that you have a problem, you can take steps to remedy it.

My prescription for Spending Addicts is to head over to the YNAB website to get re-educated.  That website contains valuable free information about the YNAB (YouNeedABudget) method of budgeting.  Yes, YNAB is a software program, but their method for budgeting is something you could use without buying their software, though I highly recommend it.

You can even get started with their software with a free 34 day trial.  If you try their software, I highly suggest watching the video tutorials to get a feel for starting out.

They even have free on-line classes (click here to sign up) several times a week.  (Who else in the land of consumer finance does this?)  You don’t even have to be a customer to attend the classes.

If you do decide to buy YNAB, you can save $6 off the purchase price by using my refer-a-friend link.

Why did you write this article?

I know the symptoms of Spending Addiction so well because I practiced them myself.

I am so excited by the way in which my life has changed by using YNAB that I want to spread the message, so more and more people can feel what I feel.

September 4, 2013 at 7:22 pm

