Categories: Blog

BGP path hunting

Before we look at BGP path hunting, let’s first discuss the RIP count to infinity problem. RIP, the Routing Information Protocol, is pretty much the simplest bgp table imaginable. RIP simply broadcasts the contents of the local routing table over UDP every 30 seconds. When a neighboring router can reach a destination over fewer hops than the local router, the local router installs the route broadcast by the neighboring router. This makes RIP the quintessential example of a distance vector routing protocol. One basic limitation of this type of routing protocols is that good news travels fast, but bad news travels slow. In other words, when a new destination is connected to the network, the information is propagated by RIP pretty quickly. But when a destination becomes unreachable, it takes very long for RIP to realize this. This is the “count to infinity” problem.

Consider the three routers in the image. Router 2 and router 3 both receive network 192.0.2.0 from router 1 and then subsequently re-advertise that route to each other. So the routing tables look as follows:

Router Network Reachable through Hops
Router 2 192.0.2.0 Router 1
Router 3
1
2
Router 3 192.0.2.0 Router 1
Router 2
1
2

 

But now router 1 goes down. So the routing tables are updated:

Router Network Reachable through Hops
Router 2 192.0.2.0 Router 3 2
Router 3 192.0.2.0 Router 2 2

 

As a result, router 3 now sees the path that router 2 advertises increase from 1 hop to 2 hops (and router 2 sees 3 hops from router 3), so after the next round of updates the tables look like this:

Router Network Reachable through Hops
Router 2 192.0.2.0 Router 3 3
Router 3 192.0.2.0 Router 2 3

 

And then:

Router Network Reachable through Hops
Router 2 192.0.2.0 Router 3 4
Router 3 192.0.2.0 Router 2 4

 

This continues until the number of hops required to reach 192.0.2.0 reaches infinity. Fortunately, for the purposes of the RIP protocol, infinity is 16. So with 30 seconds between updates, it takes 8 minutes for RIP to determine that a destination has become unreachable.

Unreachability—do we care?

If network 192.0.2.0 had truly become unreachable, it wouldn’t really matter whether the routers would know that immediately, or they’d need 8 minutes to come to that conclusion: in both cases, the packets aren’t going to be delivered to their destination.

However, there is a case where it does matter: when a more specific prefix becomes unreachable while a less specific prefix remains reachable. With RIP, this can’t happen because RIP doesn’t understand overlapping prefixes. But with BGP, this is not uncommon, as we’ll see shortly.

“Good news travels fast”

Border Gateway Protocol is a real-time system, so when a BGP session is established or a new prefix is injected into BGP, the new information typically propagates very quickly. You can monitor this using one of the many “looking glasses” that are available around the globe. (Look for “BGP looking glass” in a search engine.) However, sometimes it takes a little longer of prefixes to propagate to all corners of the internet. The main reason for this is the minimum route advertisement interval (MRAI). RFC 4271 specifies that a certain minimum amount of time should elapse between eBGP updates for “routes to a particular destination”. The suggested value for the MRAI is 30 seconds.

The idea is that if there’s an update for the same prefix every five seconds, the first update is propagated, and then nothing happens for the next 30 seconds. After the 30 seconds, the most recent information is sent to the neighbor or neighbors in question. So the updates at 5, 10, 15, 20 and 25 seconds aren’t propagated. This reduces the instability that may be injected into the BGP system.

When a new prefix is announced because it’s newly injected into BGP or a session that has been down comes back up, usually the MRAI doesn’t come into play, because the update typically travels fastest over the best path. So that best path is installed in the BGP table and used. The update then comes in over longer paths, but those updates aren’t propagated because they contain paths that are worse than the already known path. (Exceptions are possible when for traffic engineering or policy reasons, a path that’s longer than the shortest path is preferred.

But bad news travels more slowly: path hunting

However, propagation of information is much slower when routes are removed from BGP, and to a lesser degree when routes are made less preferred, for instance, by path prepending. In that case, BGP will start “path hunting”. The reason for this is that at its core, BGP works much the same as RIP. But fortunately, it’s a bit smarter. Consider the topology in the figure below.

ASes 20, 30, 40 and 50 all receive the prefix that AS 10 advertises. AS 20 propagates the prefix to AS 30, AS 30 propagates it to AS 40, and AS 40 propagates it to AS 50. This results in the BGP tables shown below. The greater-than sign indicates the best path.

AS Prefix AS path
20 198.51.100.0/24 > 10
30 198.51.100.0/24  20   10
> 10
40 198.51.100.0/24  30   10
> 10
50 198.51.100.0/24  40   10
> 10

 

When at this point AS 10 goes down or otherwise stops advertising prefix 198.51.100.0/24, each AS will remove the direct path towards AS 10, as shown in table 3.

AS Prefix AS path
20 198.51.100.0/24
30 198.51.100.0/24 >20   10
40 198.51.100.0/24 >30   10
50 198.51.100.0/24 >40   10

 

Table 3. AS paths immediately after AS 10 becomes unreachable

Now, the following updates will happen:

  • AS 40 tells AS 50 that the path is now 30 10
  • AS 30 tells AS 40 that the path is now 20 10
  • AS 20 tells AS 30 that 198.51.100.0/24 is now unreachable

After this round of updates, the BGP tables look like this:

AS Prefix AS path
20 198.51.100.0/24
30 198.51.100.0/24
40 198.51.100.0/24 >30  20  10
50 198.51.100.0/24 >40  30  10

 

Immediately after receiving its update from AS 30, AS 40 needs to send another update to AS 50, informing it that the path is now even longer. And AS 30 no longer has a route towards 198.51.100.0/24, so it needs to send a withdrawal towards AS 40.

However… This is where the minimum route advertisement interval kicks in, as AS 30 already just sent an update to AS 40 and AS 40 one to AS 50. So for 30 seconds, nothing happens. Then the updates are sent, resulting in theses BGP tables:

AS Prefix AS path
20 198.51.100.0/24
30 198.51.100.0/24
40 198.51.100.0/24
50 198.51.100.0/24 >40  30  20  10

 

AS 40 now sees the withdrawal from AS 30 and needs to propagate that withdrawal to AS 50. But the MRAI delays the update once again, so this withdrawal is delayed by another 30 seconds. After that delay, we reach a new stable situation where 198.51.100.0/24 is considered unreachable by every AS, as shown here:

AS Prefix AS path
20 198.51.100.0/24
30 198.51.100.0/24
40 198.51.100.0/24
50 198.51.100.0/24

 

The MRAI is intended to work on individual prefixes, but implementations may group prefixes together and apply the MRAI to the group in order to save bookkeeping overhead. As a result, the MRAI may slow down a prefix even if it’s updated only once rather than multiple times.

So the path hunting behavior means that when a prefix becomes unreachable, BGP will explore longer and longer paths before the prefix disappears from routing tables everywhere. The difference with RIP is that the presence of the path attribute allows BGP to ignore paths with loops in them immediately, so BGP doesn’t quite count all the way to infinity the way RIP does.

So be careful advertising more specific prefixes

When a prefix really becomes unreachable, it’s not a problem that for about two minutes, packets try to follow longer and longer paths before they’re ultimately lost anyway. However, path hunting can be very problematic when the prefix in question is a more specific prefix, and there’s also a less specific prefix covering the same address range.

For instance, suppose that a network has prefix 198.51.100.0/23 and is connected to two ISPs. In order to get their traffic engineering to work the way they want, they decide to split the /23 into two /24s, and announce 198.51.100.0/24 to one ISP and 198.51.101.0/24 to the other ISP. The /23 is announced to both ISPs as a backup. The “longest match first” rule will make sure that as long as the /24s are present in routing tables, routers will use those paths and not look at the /23 that covers the same address space.

If now the link to the first ISP goes down, path hunting will happen for 198.51.00.0/24. Once the path hunting is over, traffic will flow towards the /23, but during path hunting, there will be instability, as paths keep changing every 30 seconds. Some of these paths may work, others may not. Typically, this takes two minutes, with pings coming through during some of the 30-second periods and going unanswered during others.

So it’s best to not depend on a less specific advertisement as a backup for a more specific advertisement. In the situation with two ISPs, it would have been better if each /24 were advertised to both ISPs, but with different properties to influence traffic engineering. However, it’s possible for prefixes announced with many prepends to still receive incoming traffic, so this option isn’t always ideal, either.