
The top 4 causes of network downtime

Business people often believe that in order to have a reliable system, you need reliable components. Using reliable components to build your network certainly helps, but it’s not essential. The task of an engineer is to build a reliable whole out of less reliable parts; a well-engineered network built out of somewhat less reliable parts beats a somewhat less well-engineered network built out of more reliable parts. Which is a good thing, because super-reliable devices and services are much, much more expensive than regular devices and services.

Every device and cable (be it a power or network cable) will eventually fail. And the chain is only as strong as its weakest link. If your network depends on the operation of 20 components that each have 99.9% uptime, the chance that at least one of them is down at any given moment is about 2%, which adds up to a week of downtime per year. Using components that are ten times as reliable cuts this down to 0.2%, but that’s still the better part of a day per year. Another way to go is to install redundant components that can keep the network going if the primary component goes down. With 40 components that are 99.9% reliable, the chance that some component is down rises to 4%, but the chance of a primary and a secondary component fulfilling the same function being down at the same time is only 0.0001%: that’s only about half a minute of downtime per year! But of course it’s not quite that simple. Installing a second router, switch, fiber or power line is relatively easy. Automatically switching from the first one to the second one when there’s a failure is much harder.
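To make the arithmetic concrete, here’s a quick back-of-the-envelope sketch in Python, using the example numbers above and assuming failures are independent; it’s an illustration of the estimate, not a full reliability model.

```python
# Rough availability arithmetic for the examples above.
# Assumes independent failures; "chain" means every component must be up.

MINUTES_PER_YEAR = 365 * 24 * 60

def chain_downtime_fraction(n_components: int, availability: float) -> float:
    """Fraction of time at least one of n serial components is down."""
    return 1 - availability ** n_components

def pair_downtime_fraction(availability: float) -> float:
    """Fraction of time both members of a redundant pair are down at once."""
    return (1 - availability) ** 2

a = 0.999  # 99.9% uptime per component

chain = chain_downtime_fraction(20, a)
pair = pair_downtime_fraction(a)
print(f"20-component chain: {chain:.2%} down, about {chain * 365:.1f} days per year")
print(f"One redundant pair: {pair:.4%} down, about {pair * MINUTES_PER_YEAR:.1f} minutes per year")
```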

Keeping the above in mind, what are the most common reasons for network downtime, and what can be done about them?

Hardware single points of failure

Most networks have a reasonable amount of redundancy built in, but it’s still quite common to see hardware that’s a single point of failure. Perhaps servers have only a single Ethernet port connected; if the switch that port leads to fails, those servers become unreachable. So try to minimize single points of failure, and when they’re unavoidable, make sure broken hardware can be replaced quickly.

For instance, if it’s impossible to connect servers to multiple switches, it helps to be able to replace a broken switch with another one that has an identical configuration. So either use those switches in their default port/VLAN configuration, or set up a spare switch with an appropriate basic configuration in advance so simply replacing the hardware is enough to restore connectivity.
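One low-tech way to keep such a spare ready is to store a known-good “golden” configuration dump and periodically diff the spare’s current configuration against it. Below is a minimal sketch, assuming the configurations have been exported to plain text files; the file names are placeholders.

```python
# Compare a saved "golden" switch configuration against the spare's current config.
# File names are placeholders; configs are assumed to be plain-text exports.
import difflib
from pathlib import Path

golden = Path("golden-access-switch.cfg").read_text().splitlines()
spare = Path("spare-switch-01.cfg").read_text().splitlines()

diff = list(difflib.unified_diff(golden, spare,
                                 fromfile="golden", tofile="spare", lineterm=""))
if diff:
    print("Spare switch configuration has drifted from the golden config:")
    print("\n".join(diff))
else:
    print("Spare switch matches the golden configuration.")
```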

Power issues

Obviously, having some kind of backup power is key. But even if the power never fails, there always comes a time when maintenance requires the power to be turned off. Ideally, all network components have redundant power supplies connected to different circuits, so they can keep running when there’s a failure or maintenance on one feed. Make sure each circuit can provide adequate power by itself, and that the circuits have as few components in common as possible. Another reason to have redundant power supplies is that in many devices, the power supply is the part that fails most often.

If equipment doesn’t have redundant power supplies, it’s important to have two components that provide backup for each other connected to different power circuits, so if one circuit goes down, that doesn’t take out both devices.
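A quick sanity check against this is to walk through an inventory of devices, their redundancy partners and their circuit assignments, and flag any pair of single-supply devices that ends up on the same circuit. Here’s a minimal sketch; the inventory, device names and circuit labels are made up for the example.

```python
# Flag redundant device pairs whose single power supplies sit on the same circuit.
# The inventory is a made-up example; in practice it would come from an asset database.
inventory = {
    "switch-a": {"circuits": {"feed-1"}, "partner": "switch-b"},
    "switch-b": {"circuits": {"feed-1"}, "partner": "switch-a"},  # same feed: problem
    "router-1": {"circuits": {"feed-1", "feed-2"}, "partner": None},  # dual power supply
}

checked = set()
for name, info in inventory.items():
    partner = info["partner"]
    if not partner or (partner, name) in checked:
        continue
    checked.add((name, partner))
    shared = info["circuits"] & inventory[partner]["circuits"]
    single_psu = len(info["circuits"]) == 1 and len(inventory[partner]["circuits"]) == 1
    # Single-supply devices that back each other up should not depend on the same feed.
    if single_psu and shared:
        print(f"WARNING: {name} and {partner} both depend on circuit(s): {', '.join(shared)}")
```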

Routing problems

Installing a second ISP connection and running BGP buys a lot of peace of mind: you no longer have to worry about your internet connection going down. If ISP A goes down, traffic is rerouted over ISP B within a minute or two, and often much faster. However, BGP, the internal (OSPF) routing and the Ethernet spanning tree (and the interactions between them!) add a lot of complexity. So sometimes these protocols don’t repair connectivity issues the way they should. Or worse, they create outages of their own that wouldn’t have occurred in a simpler network.

As such, it’s important to be very careful and consider all eventualities when designing a network. Often, there is a choice between letting the Ethernet Spanning Tree Protocol (STP) repair outages or letting an IP routing protocol such as OSPF handle this. STP is more flexible, but different vendors support different versions of STP that handle Ethernet VLANs differently. So letting OSPF or another IP layer routing protocol handle this when possible is usually simpler and more robust.

Most of the time, BGP will route around outages. However, sometimes packets disappear into a black hole: for some reason, the packets are lost, but the routing protocols don’t notice that there’s a problem, so traffic continues to flow towards the black hole rather than being rerouted. One way to detect this problem is to monitor the reachability of key remote services. Another is to look at total traffic, which will be much lower than usual in the presence of a black hole. Once detected, recovering from a routing black hole affecting one ISP is very simple: shut down the BGP session towards that ISP until they’ve fixed the problem.
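On the detection side, here’s a minimal sketch of a reachability probe; the service list, ports and threshold are placeholders, and in practice the result would feed into whatever monitoring and alerting is already in place.

```python
# Probe a handful of key remote services over TCP. If most of them stop answering
# while the local network looks healthy, traffic may be vanishing into a black hole.
# Host names, ports and the threshold are placeholders for this example.
import socket

KEY_SERVICES = [("www.example.com", 443), ("dns.example.net", 53), ("192.0.2.10", 443)]
TIMEOUT_SECONDS = 3

def reachable(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False

failures = [f"{host}:{port}" for host, port in KEY_SERVICES if not reachable(host, port)]
if len(failures) > len(KEY_SERVICES) // 2:
    print("Possible routing black hole, unreachable services: " + ", ".join(failures))
```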

Sometimes routing issues are caused by software bugs. As such, it can be useful to have equipment from different vendors, so that if one device is affected by a bug, the other one isn’t. On the other hand, using the same vendor and software version for everything avoids additional complexity.

Human error

Because the proper application of redundant hardware and protocols can recover from so many problems, the remaining reachability problems are usually caused by a failure to follow known best practices. Just as early aviators could fly “by the seat of their pants” but that’s no longer advisable for pilots of jet airliners, growing networks need more planning, procedures and checklists to avoid simple but costly mistakes. Most of all, consistency is key: make sure the weakest link is just as strong as all the others. On the other hand, when it’s crunch time, some creativity and willingness to take unorthodox measures can make the difference.