ccie blog

Book Notes

I plan to read approximately 6 books. These books are listed below. The heading of each book will be in bold, followed by a summary, and then my notes for it underneath.

1. OSPF. Anatomy of an Internet Routing Protocol (John T. Moy)

2. Practical BGP (2004), (Russ White, Danny McPherson, Sangli Srihari)

3. IPv6 theory, protocol, and practice, 2nd edition (2003)

4. MPLS and VPN Architectures

5. Cisco QOS Exam Certification Guide (IP Telephony Self-Study) (2nd Edition)

6. Developing IP Multicast Networks, Volume 1

My notes may contain paraphrases from each of the books, so all credit on this page should go towards the authors listed above.

OSPF. Anatomy of an Internet Routing Protocol


The book covers why OSPF was developed and the design decisions that were made along the way. It took me about 3 days to read, and I felt it was worth my time. I would recommend it to anyone else studying for their CCIE, but only up to chapter 8. Anything beyond that is out of scope or legacy.


RIP was the main IGP back in the 1980s, and EGP was the main routing protocol used between ASs. Both scaled badly as more people started using the internet, due to infrequent routing updates, large updates, and slow convergence. So the author, John Moy (one of the OSPF developers, who also wrote the OSPF RFC), and the rest of the design group got together to figure out how to address these problems in a new routing protocol design.

Some of the functional requirements they set out to achieve are listed below:

  • A more descriptive routing metric. RIP used hop count, which introduced two issues: bandwidth and delay could not be factored into the metric, and the network diameter was restricted to 15 hops.
  • Equal-cost multipath. They wanted to support this, but without OSPF itself dictating the load-sharing method. Now, in 2015, we use CEF to choose whether it’s round-robin, per src-dst-ip, per src-mac, or whatever.
  • Routing hierarchy. Required for scaling up the network.
  • Separate internal and external routes. According to John, “Autonomous Systems running RIP were having trouble knowing which information to trust. You always want to trust information gained first hand about your own internal routing domain over external routing information that has been injected into your domain”. And RIP could not distinguish between the two.
  • Support for more flexible subnetting schemes. They predicted that people would want to make more efficient use of address space, so they wanted to incorporate VLSM into OSPF so that it could route arbitrary subnet and mask combinations.
  • Security. They wanted to protect against any old person just joining the routing domain without permission.

Link State Decision and Encapsulation Choice

Up until this point, no multi-vendor link state routing protocol had ever been developed, and it was debated whether it would even be possible. The main problem with distance vector protocols was that they took a long time to converge and consumed a lot of bandwidth with updates. This had been seen with the original ARPANET, RIP, and BGP (all distance vector protocols, by the way). It was really just the track record of the existing distance vector protocols that gave people negative feelings towards making OSPF distance vector, although that on its own wasn’t a good reason not to use it. One of the pros of link state protocols, though, was that every router has a complete overview of the entire topology, whereas with distance vector protocols a router only knows what the upstream router thinks is the best path to a destination prefix. Link state protocols instead give the neighbor router the complete topology so that it can make its own independent decision about how to forward traffic for a prefix. That looked and sounded much more appealing, since routers had more information about the network and could therefore make a more informed decision about the best path to a destination.

When designing a TCP/IP routing protocol there are three choices for encapsulating your new protocol’s packets: they can run directly over the link layer, directly over the IP network layer, or over one of IP’s transport protocols (TCP/UDP). Running over the link layer (as IS-IS does) means you need to build your own fragmentation/reassembly services into the protocol. If you run at the IP network layer, you can use IP’s network layer services, which conveniently include fragmentation/reassembly. And since IP runs over just about every type of modern network, it seemed logical to run OSPF at this layer. OSPF also did not need TCP’s reliability, as link state protocols have reliability built into the protocol itself. UDP (at the time) was considered less secure than raw IP, because sending raw IP packets needed super-user privileges on UNIX systems, whereas sending UDP packets did not. UDP would also have added an 8-byte header, whose main benefit was a checksum to verify packet integrity (a nice feature, but the overhead outweighed the benefits in the eyes of the developers).

One issue with RIP at the time was that it sent broadcast updates to everyone, so all hosts on the same VLAN would get the packets too. IP multicasting was just emerging in the world of networking, so they decided that OSPF would use multicast updates instead. This meant that only devices that wanted to receive OSPF packets got the data.

Side Notes:

  • RIP runs over UDP port 520
  • RIPng runs over UDP port 521
  • BGP uses TCP port 179
  • EIGRP runs directly over IP, using IP protocol number 88
  • OSPF runs directly over IP, using IP protocol number 89

Dealing With Different Link Types

During the development of OSPF, the creators needed to enable the protocol to work over various types of link layer protocols (frame-relay, X.25, ARPANET, Ethernet etc). So they started to review how they would achieve such a scenario. When looking at Ethernet, they saw that they could end up needing a really high number of  OSPF neighbor relationships over the same VLAN if multiple routers existed on the same subnet. There would be a point-to-point type relationship between every single router, which meant on a network with 200 routers, every router would potentially need 199 relationships/neighbors (i.e. there would be thousands of neighbor relationships). So they wanted to create a star topology where every router peered with just one central device, and that central device was responsible for updating every single router.  So the Designated Router (DR) was born. This meant that all 199 routers would just peer with the DR, and that was it. There would just be 199 relationships rather than thousands.

Since it was decided that OSPF updates would be multicast, link layer protocols that did not support broadcast/multicast, such as frame-relay and X.25, needed a way to unicast hellos between neighbors instead. So when you run OSPF over a non-broadcast capable network, you have to statically configure the neighbor under the OSPF process so that hellos are unicast. They did want to keep the same DR structure used on broadcast networks, for the same reason as before (to stop thousands of unnecessary neighbor relationships forming over the same subnet), so they kept the DR for the non-broadcast network type. The modern name for this kind of setup is a Non-Broadcast Multi-Access (NBMA) network (basically just a bunch of routers on the same subnet, using a non-broadcast capable protocol between them at the data link layer). Later on, the Backup DR (BDR) was introduced to account for a failure of the main DR.
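To put numbers on the DR idea, here is a small Python sketch of the arithmetic. The function names are my own, and the DR + BDR count assumes every other router forms an adjacency with both of them, plus one adjacency between the DR and BDR themselves.

```python
def full_mesh_adjacencies(routers: int) -> int:
    """Adjacencies if every router on the segment peers with every other router."""
    return routers * (routers - 1) // 2


def dr_only_adjacencies(routers: int) -> int:
    """Adjacencies if every other router peers only with the DR."""
    return max(routers - 1, 0)


def dr_bdr_adjacencies(routers: int) -> int:
    """Adjacencies with a DR and a BDR: each remaining router peers with both,
    and the DR and BDR peer with each other."""
    if routers < 2:
        return 0
    return 2 * (routers - 2) + 1


if __name__ == "__main__":
    n = 200
    print(full_mesh_adjacencies(n))  # 19900
    print(dr_only_adjacencies(n))    # 199
    print(dr_bdr_adjacencies(n))     # 397
```

So for the 200 router example, a full mesh needs 19,900 adjacencies on the segment, while the DR design needs 199 (or 397 once the BDR is added).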

External Route Tag

When creating OSPF, the developers wanted a way to tag externally injected routes as they moved through the OSPF domain. This meant that they could identify this exact set of routes anywhere in the network and know precisely which router injected them into the domain. At this point they didn’t know what they would actually do with this information, they just wanted to be able to identify it. Later on it got used for filtering. A router can be configured to say that RIP routes injected into OSPF and tagged with a value of 1234 should or shouldn’t be re-injected into another external part of the network, BGP for example. The router connecting OSPF to the BGP domain would see the value of 1234 and just opt not to inject them.

Hierarchical Abstraction

To scale OSPF, they wanted to split the OSPF domain into regions/areas. The idea is that routers within one area know all the details of that area, but those details should not be passed between areas. To achieve the scalability they wanted, they needed to compact/reduce the information known about an area before passing it to another area. However, the developers ran into problems where this type of abstraction ended up hiding metrics, so they decided against it. Instead they just allow you to summarise addresses within an area and then forward this summary address to the next area in a type 3 LSA. Since this is very similar to what distance vector protocols do, they wanted to avoid having too much hidden information passing over multiple areas (imagine some /28s in area 1, which are then summarised into area 2, then summarised again using an even shorter mask into area 3, etc). So they said, right, we will just have one special area (the backbone area) that all other areas must connect to. That way, you only hide information for networks that are one area away.


When they started testing a bunch of different design scenarios with OSPF and vendor interoperability, they found some bugs, malformed packets, issues with sequence numbers etc. They’ve revised the protocol a few times to fix these issues. One of the interesting problems/inefficiencies I read about is below. Assume the AS numbers here are for BGP.

Forwarding Address


Say R4 only exchanges routing information with R1. Now let’s say the IGP running between R1, R2, and R3 is OSPF. R3 could learn the prefixes in AS200 from R1, but it would then have to route traffic via R2-R1-R4 to reach anything in AS200. Assuming the bandwidth/delay of all links is the same, this is not an optimal path. Ideally, if R3 needs to reach something in AS200, it should just forward traffic directly to R4. So the OSPF developers created something called a forwarding address that achieves exactly that. It allows routes from R4 to be learned via R1, but R1 can specify that the next hop is R4 and not itself. This is actually done in the background rather than configured directly. You have to meet like 5 or 6 criteria and the forwarding address will be advertised (these criteria are on the Cisco website if you Google OSPF forwarding address).

Another scenario they ran into was running OSPF on routers with minimal resources. Back in the day, when OSPF was first deployed at some of the very first businesses to use it, some of the routers involved had hardly any processing power or memory. The businesses wanted every router, or as many as possible, to use OSPF so that they didn’t have too many IGPs running. So here came the development of OSPF stub areas. This allowed these routers to run OSPF, but caused them to only take a default route.

Corrupted LSAs

Each LSA carries a checksum, calculated by the router that originated it, which is used to check the LSA for corruption. A router verifies the checksum of an LSA received from a neighbor during flooding. A router also periodically verifies the checksums of all the LSAs in its database to make sure the database has not been corrupted.

LS Age

An LSA is always refreshed when its age reaches 30 minutes: the originating router floods a new instance of the LSA with a higher sequence number. If the originating router has failed, the LSA will be discarded by the other routers once its age reaches 1 hour (the MaxAge time). An LSA that reaches MaxAge is re-flooded with its LS Age set to MaxAge (60 minutes), which in turn causes all other routers to delete the LSA from their databases.
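The ageing rules above can be sketched in a few lines of Python. The constant and function names are my own, and the sketch only covers the two timers mentioned here (30 minute refresh, 60 minute MaxAge), not the rest of the ageing procedure.

```python
LS_REFRESH_TIME = 30 * 60  # seconds; the originator refreshes its own LSA at this age
MAX_AGE = 60 * 60          # seconds; an LSA this old is flushed from the database


def lsa_action(age_seconds: int, self_originated: bool) -> str:
    """Return what a router does with an LSA of the given age.

    'flush'   -> re-flood with LS Age = MaxAge and remove it from the LSDB
    'refresh' -> originate a new instance with a higher sequence number
    'keep'    -> nothing to do yet
    """
    if age_seconds >= MAX_AGE:
        return "flush"
    if self_originated and age_seconds >= LS_REFRESH_TIME:
        return "refresh"
    return "keep"


if __name__ == "__main__":
    print(lsa_action(10 * 60, self_originated=True))   # keep
    print(lsa_action(30 * 60, self_originated=True))   # refresh
    print(lsa_action(45 * 60, self_originated=False))  # keep
    print(lsa_action(60 * 60, self_originated=False))  # flush
```

Note that only the originating router refreshes an LSA. Everyone else just lets it age, which is why a dead originator’s LSAs eventually hit MaxAge.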

The Router-LSA


  • The LS Age is always 0 on new LSAs.
  • The Options describes the capabilities that the router supports.
  • LS Type is 1, for router-lsa.
  • Link State ID is the OSPF router-id of the router.
  • Each of the 3 links underneath are neighbors.
  • The Link Type is a number, and that number determines what the Link ID and Link Data contain.
    • 1 = point-to-point. Link ID = Neighboring Router ID
    • 2 = connection to a transit network. Link ID = IP address of DR
    • 3 = connection to a stub network. Link ID = Subnet
    • 4 = virtual link. Link ID = Neighboring Router ID
  • Metric is the relative cost of sending data over that link.

OSPF Packet Types

Type 1 = Hello

Type 2 = Database Description (DD) packets

Type 3 = Link State Request (LSR) packets

Type 4 = Link State Update (LSU) packets

Type 5 = Link State Acknowledgement (LSAck) packets

Initial Database Synchronization

When a new connection comes up, an OSPF router does not immediately send its entire database over to the neighbor. Instead it just sends the LSA headers, and the neighbor then requests the LSAs that are more recent than its own. The idea is that if the router already knows about 90% of the LSAs it has just been told about (i.e. it already has the same sequence number and info in its LSDB through some other connection into the main network), then there is no point sending all the details of those LSAs over again. It’s far more efficient to send the new router a list of the current LSDB’s LSA headers. If the new router doesn’t have the latest copy of an LSA, or doesn’t have a particular LSA at all, it uses a Link State Request (LSR) to ask for the complete LSA.

As soon as the OSPF Hello protocol has determined that a link to a neighbor is bidirectional, OSPF goes through the database exchange process (if required). Some routers never actually become fully adjacent and stay in the 2-way state. This is found on broadcast and non-broadcast networks, where the DROthers (i.e. routers that are neither the DR nor the BDR) are only OSPF neighbors of each other. They only go into ExStart with the DR and BDR to begin the database exchange process. Anyway, if a decision is made to synchronize the databases, the neighboring routers send each other a copy of all the LSA headers in their LSDBs. The headers are sent in Database Description (DD) packets. They also flood any future LSA updates to each other. If one of the routers receives an LSA header with a higher sequence number than it currently holds for that LSA in its LSDB, or doesn’t have that LSA at all, then it uses the LSR, LSU, and LSAck packets to request, receive, and acknowledge the full LSA from the neighbor.
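Here is a rough Python sketch of that "compare headers, then request what’s missing or newer" step. It is a simplification under my own assumptions: an LSA is identified by a (type, link state ID, advertising router) tuple, and "newer" is decided on sequence number alone, which is all these notes describe. The names are made up for illustration.

```python
from typing import Dict, List, Tuple

# (LS type, link state ID, advertising router) identifies one LSA
LsaKey = Tuple[int, str, str]


def lsas_to_request(local_db: Dict[LsaKey, int],
                    neighbor_headers: Dict[LsaKey, int]) -> List[LsaKey]:
    """Return the LSAs to ask the neighbor for in a Link State Request.

    Both arguments map an LSA key to its sequence number. An LSA is
    requested if we do not have it at all, or the neighbor's copy has a
    higher sequence number than ours.
    """
    wanted = []
    for key, neighbor_seq in neighbor_headers.items():
        local_seq = local_db.get(key)
        if local_seq is None or neighbor_seq > local_seq:
            wanted.append(key)
    return wanted


if __name__ == "__main__":
    r1 = (1, "1.1.1.1", "1.1.1.1")
    r2 = (1, "2.2.2.2", "2.2.2.2")
    r3 = (1, "3.3.3.3", "3.3.3.3")
    local = {r1: 5, r2: 7}
    from_neighbor = {r1: 5, r2: 8, r3: 1}
    print(lsas_to_request(local, from_neighbor))  # only r2 (newer) and r3 (missing)
```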

Reliable Flooding

As new LSAs are originated, or current LSAs are updated, the new information is sent throughout the OSPF domain using a procedure called “reliable flooding”. Say a link goes down and the router needs to tell everyone to remove that link from the LSDB. It sends an LSU out all its interfaces stating that the LSA’s Age field is MaxAge. When another router receives this LSU, it sees the Age is set to MaxAge, so it sends an LSAck back to the router it received the LSU from, then forwards the new info out all interfaces except the one the LSU was received on. The process continues until all routers are updated.
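The flooding procedure can be sketched as a walk over the topology. This is a deliberately simplified model of my own: every router processes one LSU at a time, acknowledges a new LSA to whoever sent it, and forwards it out every other interface. Duplicate copies are simply dropped here, whereas real OSPF has more detailed rules for them, and retransmission timers are left out entirely.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def flood(topology: Dict[str, List[str]],
          origin: str) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Flood one LSA from `origin` across `topology` (router -> neighbors).

    Returns the set of routers that ended up with the LSA and the list of
    (receiver, sender) acknowledgements that were generated.
    """
    have_lsa = {origin}
    acks = []
    # Each queue entry is one LSU in flight: (sender, receiver)
    in_flight = deque((origin, nbr) for nbr in topology[origin])
    while in_flight:
        sender, receiver = in_flight.popleft()
        if receiver in have_lsa:
            continue  # duplicate copy: dropped in this sketch
        have_lsa.add(receiver)
        acks.append((receiver, sender))  # LSAck goes back to the sending neighbor
        for nbr in topology[receiver]:
            if nbr != sender:  # never send back out the receiving interface
                in_flight.append((receiver, nbr))
    return have_lsa, acks


if __name__ == "__main__":
    square = {
        "R1": ["R2", "R3"],
        "R2": ["R1", "R4"],
        "R3": ["R1", "R4"],
        "R4": ["R2", "R3"],
    }
    routers, acks = flood(square, "R1")
    print(sorted(routers))
    print(acks)
```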

Designated Router Election

The first OSPF router on an IP subnet always becomes the DR. The second OSPF router becomes the BDR. Only if the DR or BDR fails will an election occur. The router with the highest router priority value becomes the DR (or BDR, depending on which one failed). In the event of a tie, the highest router-id wins.
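The tie-break order (priority first, then router-id) is easy to get wrong when comparing router-ids as strings, so here is a small Python sketch. The `Router` type and `elect` function are hypothetical names of mine. The one rule I have added that isn’t in my notes above is that a priority of 0 makes a router ineligible, which is standard OSPF behaviour.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Router(NamedTuple):
    router_id: str  # dotted-quad router-id
    priority: int   # OSPF interface priority, 0 = never DR/BDR


def elect(candidates: List[Router]) -> Optional[Router]:
    """Pick the winner among the remaining candidates on a segment.

    Highest priority wins; on a tie the numerically highest router-id wins.
    Routers with priority 0 are not eligible.
    """
    eligible = [r for r in candidates if r.priority > 0]
    if not eligible:
        return None
    return max(
        eligible,
        key=lambda r: (r.priority, int(ipaddress.IPv4Address(r.router_id))),
    )


if __name__ == "__main__":
    segment = [
        Router("1.1.1.1", 1),
        Router("10.0.0.2", 1),
        Router("9.9.9.9", 0),
    ]
    print(elect(segment).router_id)  # 10.0.0.2
```

The router-id is converted to an integer before comparing, because as plain strings "9.9.9.9" would sort above "10.0.0.2".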

Virtual Links

There is a requirement with OSPF that all areas must connect to the backbone area. This is not necessarily a physical requirement, and can be met logically using virtual links. In the diagram below, this means that a virtual link could be created between R2 and R5, and a separate virtual link could be created between R2 and R4, to make the network work. What virtual links effectively do is allow summary LSAs to be tunnelled across non-backbone areas. So when a summary LSA from R5 reaches R1, the cost to reach a link on R5 would be the cost to reach R2 + the cost of the virtual link + the cost of the link connected to R5 (i.e. notice the cost between R2-R3-R5 wasn’t counted, since the virtual link acts like a point-to-point connection to the backbone, so it takes the cost of the virtual link instead).

Virtual Link


OSPF Areas

The developers created a bunch of different types of areas for OSPF due to resource requirements. Stub areas require the least resources, followed by NSSA’s. The problem they were trying to overcome was that there could potentially be thousands of external LSAs or Summary LSAs, which will require more resources on the router. So by limiting what types of LSA can go into different areas, they could reduce the load for selected routers in the network.

Stub Areas

By default these types of area only accept default routes and summary routes from Area Border Routers (ABRs). The idea is to minimize the router resource requirements by not sending external LSAs into this type of area at all. These types of area can also be configured to not accept Summary LSAs either (the syntax is #area xxx stub no-summary). The area ends up being called a “Totally Stubby Area” if you do decide to configure this. Because external routes can’t pass through stub areas, ASBRs are not supported in this area type. Virtual links also cannot be configured through this type of area. So in my diagram above, only area 1 and area 3 could become stub areas. Area 0 is the backbone area so can’t become a stub, and area 2 needs to support a virtual link tunnel for area 1, so it also cannot become a stub.

As a side note, stub areas are always located at the boundary of an OSPF domain.

NSSA Areas

I learned something I never knew about NSSAs from this book. NSSAs act as a one-way filter for external route injection. In the diagram below, the NSSA area injects RIP routes into OSPF and forwards them to the rest of the OSPF domain, but it never learns the external BGP routes.

In the diagram below please assume a virtual link exists between R2 and R4.


The OSPF designers were still looking at ways to reduce the OSPF resource requirements. They initially came up with stub areas, which were ideal. But now they wanted a way to do something like this: in the diagram above, they could get the external RIP routes into the OSPF domain, but the external BGP routes did not come into the NSSA area. The reason is that there would be maybe 40,000 external BGP routes, but there might only be 20 RIP routes. The main problem is the high quantity of BGP routes using too much memory on R4 (because R4 is old and has low memory). So they wanted the area to just use a default route to reach any of the external BGP routes instead. This was the design goal, and how NSSA areas came about. What they ended up doing was stopping type 5 LSAs going into the NSSA. When R5 injects RIP routes into the NSSA, R5 (the ASBR) converts them to a new LSA type, LSA type 7, and then forwards these routes to R4 (the ABR). R4 then converts these type 7 LSAs to type 5s so that the rest of the network can learn about them. However, R4 does not convert any type 5 LSAs to type 7 LSAs. This way, the BGP routes will never be able to get into the NSSA area.

Translation of type 7 LSAs to type 5 LSAs is always done on the router with the highest OSPF router ID, providing there is more than one router connecting to the backbone within the NSSA area.

As a side note, NSSAs are always located at the boundary of an OSPF domain.

I’ve read about 60% of this book now, and the rest of it is talking about MOSPF, SNMP, Multicasting, other IGPs, DVMRP, and how to configure OSPF as well as debug it. 20% of this is legacy (nobody uses DVMRP or MOSPF, nor is it in the CCIE v5 blueprint). And the rest is covered in other study sections, or the INE workbook (for the configuration/debugging section). So I’m done with reading it. I liked learning how the protocol was designed and learned a few things along the way. I also now understand why they created the NSSA area (I was always clueless to the point of such an area).


Practical BGP


The book gives some good troubleshooting approaches and also identifies a couple of really good problems with BGP peering that you can run into (and should be aware of for the CCIE lab exam and real life). Overall I found it a good BGP refresher, and I learnt some really useful stuff from it, especially from the troubleshooting section and in terms of enterprise network design. There are a lot of spelling mistakes and sections that reference routers that aren’t there, making it very hard to understand what the author is talking about. There are also a few mis-configs in various chapters. Overall it’s a decent book that looks at BGP from an enterprise level, but I wouldn’t spend more than 3-4 days reading it, otherwise it’s wasting too much study time (in my opinion). For anyone else reading this specifically for the CCIE lab exam, I’d read up to chapter 8 and skip the last 2 chapters.



Hot potato routing = Take the closest point out of the network

Cold potato routing = Take the best path towards the destination (AKA best exit routing)

Link State vs Distance Vector vs Path Vector

Link state routing protocols like OSPF rely on each router advertising the state of each of its links to every other router within the local domain. Routers then have a complete topology map and store it in a database. They pass this information on to adjacent peers without modifying or manipulating it in any way. The information is flooded throughout the routing domain, unchanged, just as the originating router advertised it. Once the database is populated, the router runs shortest path first to build a tree with itself at the root. The shortest path to each reachable destination within the network is found by traversing the tree.
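As an illustration of the shortest path first step, here is a minimal Dijkstra sketch in Python. The link state database is modelled as a plain dictionary of router -> {neighbor: cost}, which is my own simplification and not how a real LSDB is stored. It returns only the total cost to each router, not the tree itself.

```python
import heapq
from typing import Dict


def shortest_path_costs(lsdb: Dict[str, Dict[str, int]], root: str) -> Dict[str, int]:
    """Run Dijkstra over the link state database with `root` as the tree root."""
    costs = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > costs.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return costs


if __name__ == "__main__":
    lsdb = {
        "R1": {"R2": 10, "R3": 1},
        "R2": {"R1": 10, "R4": 1},
        "R3": {"R1": 1, "R4": 1},
        "R4": {"R2": 1, "R3": 1},
    }
    print(shortest_path_costs(lsdb, "R1"))
    # R2 is reached via R3 and R4 (cost 3) rather than the direct link (cost 10)
```

Because every router holds the same database, each one can run this with itself as the root and arrive at consistent paths.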

Distance vector algorithms advertise the path (vector) and the distance (metric) for each destination reachable within the network to adjacent peers. This info is placed in a local database and an algorithm is used to determine the best path for each reachable destination. Only the best path is then shared with the neighbors. RIP uses Bellman-Ford, EIGRP uses DUAL.

A Path Vector protocol does not rely on the cost of reaching a destination to determine if the path is loop free. It relies on the analysis of the path instead (i.e. it checks the AS’s advertised in the update to see if its own AS was already in the path).

BGP as a path vector protocol

BGP treats an entire AS as a single hop in the AS path to hide the topological details of the AS. For this reason BGP can only detect loops between ASs and cannot guarantee loop-free paths within an AS.

BGP speakers use TCP 179 to create a session to other BGP speakers. If two speakers do this at exactly the same time, the highest RID wins which means the router with the lowest RID drops the TCP session that it initiated.

BGP update messages are used to advertise/withdraw routes. Multiple prefixes with the same attributes can be advertised in a single update, otherwise they go in separate updates.

The BGP update packet is constructed of 1 set of attributes (it’s only possible to actually have one set of attributes per update), and then multiple prefixes that match these attributes. So it’s a one-to-many relationship. I included a packet capture below. There can be lots of prefixes all with the same set of attributes.


Any group of prefixes sharing the same attributes can therefore be packed into a single BGP update, providing it doesn’t go over the maximum update size of 4096 octets.
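The one-to-many relationship between an attribute set and its prefixes can be sketched like this in Python. Attributes are modelled as a tuple of (name, value) pairs so they can be used as a dictionary key. That representation and the function name are my own, and the sketch does not enforce the 4096 octet limit.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One attribute set, e.g. (("as_path", "65001 65002"), ("next_hop", "192.0.2.1"))
Attributes = Tuple[Tuple[str, str], ...]


def pack_updates(routes: List[Tuple[str, Attributes]]) -> Dict[Attributes, List[str]]:
    """Group prefixes by attribute set: one update per distinct set."""
    updates = defaultdict(list)
    for prefix, attrs in routes:
        updates[attrs].append(prefix)
    return dict(updates)


if __name__ == "__main__":
    attrs_a = (("as_path", "65001 65002"), ("next_hop", "192.0.2.1"))
    attrs_b = (("as_path", "65003"), ("next_hop", "192.0.2.1"))
    routes = [
        ("198.51.100.0/25", attrs_a),
        ("203.0.113.0/24", attrs_b),
        ("198.51.100.128/25", attrs_a),
    ]
    for attrs, prefixes in pack_updates(routes).items():
        print(dict(attrs), prefixes)
```

Three prefixes end up in two updates here, because two of them share exactly the same attributes.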

Interior vs Exterior Peering

Four main differences between iBGP and eBGP:

  1. Routes learned from an iBGP peer are not (normally) advertised to other iBGP peers.
  2. Attributes of paths learned via iBGP peers are not (normally) changed to influence the path selected to reach the same outside network. The best path chosen throughout the AS must be consistent to prevent routing loops.
  3. The AS path is not manipulated when advertising a prefix to an iBGP peer
  4. The BGP next hop is not normally changed when advertising prefixes between iBGP peers.

BGP Notifications

It’s possible that between BGP speakers, one of them sends a malformed packet and the other speaker doesn’t understand. Another problem could be that one of the BGP speakers has been configured wrong. So when a peer closes the connection for one of these reasons, it sends a BGP notification stating what the problem is.

BGP Capabilities

A BGP speaker’s capabilities are sent during the OPEN message exchange, but can also be sent once the session is up.

BGP Attributes

Well-known mandatory attributes. Self explanatory

Well-known discretionary attributes. Attributes must be recognized by all BGP speakers, and may be advertised in updates (i.e. it’s not required)

Optional transitive attributes. These attributes may be recognized by some BGP speakers, but not all. They should be preserved regardless of whether the speaker recognizes them.

Optional nontransitive attributes. These attributes may be recognized by some BGP speakers, but not all. Unrecognized attributes should be removed before advertising the route to another peer. So you can set something like MED towards an eBGP peer, but the eBGP peer will strip the MED value off before advertising the prefix to any other eBGP peer (interestingly though, it can advertise the MED value to another iBGP peer because it’s within the same AS domain).


Origin is a well-known mandatory attribute. An origin code of Incomplete can be caused by aggregation, redistribution, or other indirect ways of installing routes into BGP within the originating AS.

AS Path is a well-known mandatory attribute.

Next Hop is a well-known mandatory attribute.

Multi-Exit Discriminator (MED) is an optional non-transitive attribute. MED is only taken into account when the AS path lengths are the same, and by default only between paths learned from the same neighboring AS.

Local Preference is a well-known discretionary attribute.

Communities are an optional transitive attribute. The well-known communities are NO_EXPORT, NO_ADVERTISE, and NO_EXPORT_SUBCONFED. Normal communities are 32 bits long, and extended communities are 64 bits long.

Attributes and Aggregation

Aggregation or summarisation hides reachability information, but also hides topology information. In BGP this means hiding the AS Path and other attributes of the prefixes being aggregated. When aggregating prefixes, BGP can put the ASs that the prefixes came from into an AS Set and then inject that into the AS Path (the ASs inside squiggly brackets are the AS Set). The atomic-aggregate attribute is set by the router that did the aggregation to flag that path information has been lost, and the aggregator attribute, which carries the AS and IP address of the aggregating router, is advertised with the prefix in the BGP update.

Single Homing to a Service Provider

There is no point running BGP

Dual Homing to a Single Service Provider

-Usually this is the case when you have multiple connections to a single provider, but at different locations.

-You don’t need your own AS number when dual homing. You can use a private AS to eBGP peer with your supplier, and then your supplier can strip your private AS off of updates upstream.

-To make the service provider prefer one inbound path to your AS over another, you can use MED. But the MED value is regularly stripped from updates by the service provider, so it’s best to check and agree with them first. Another, more common method of influencing your service provider’s decision on the path to reach your network is by using communities. Another way would be to do some as-prepends.

-When dual homing to one ISP, you should check that your service provider will configure eBGP multipath, so that their AS can install two paths to the same destination.

Dual Homing to Multiple Service Providers

-You can run into a problem where, say, you are connected to two SPs, SP_A and SP_B. You want to use the address block that SP_A has provided you (maybe you were only ever peered with them at the start, and then you brought in a new service provider, SP_B, to load share some of the traffic). So now you want to advertise your prefix out via SP_B, and have SP_B advertise it upstream. The problem is, let’s say you were given a /26. SP_A will most likely be aggregating this upstream to a /24 or something. However, SP_B won’t. This means routers on the internet would prefer the longest match via SP_B. So you have to ask SP_A to advertise your specific /26 so that the internet can compare the equal length routes. You can then try a bit of AS prepending to load share the traffic.

-A conditional advertisement can be used in this setup. Basically, the idea is that you would only want to use one link at a time, i.e. the path via SP_B would only be used if the link to SP_A fails. The conditional advertisement would be based on SP_B’s configuration. It would state that, if the prefix is not being received from SP_A, then SP_B will advertise it. You’d have to do it using some sort of tag value to know it’s from SP_A. In my blog post I used a community to tag the route from SP_A.
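The longest-match problem described above (SP_A’s aggregate versus the more specific route via SP_B) can be demonstrated with Python’s standard `ipaddress` module. The prefixes are documentation addresses I picked for illustration, and `best_route` is a hypothetical helper, not how a real router stores its table.

```python
import ipaddress
from typing import Dict, Optional


def best_route(table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the exit for `destination` using longest prefix match.

    `table` maps a prefix (e.g. "203.0.113.0/24") to the name of the
    provider it was learned through.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, exit_name in table.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, exit_name)
    return best[1] if best else None


if __name__ == "__main__":
    internet_view = {
        "203.0.113.0/24": "SP_A",   # SP_A's aggregate covering your block
        "203.0.113.64/26": "SP_B",  # your /26, advertised unaggregated by SP_B
    }
    print(best_route(internet_view, "203.0.113.70"))  # SP_B
    print(best_route(internet_view, "203.0.113.10"))  # SP_A
```

Everything destined for the /26 goes via SP_B no matter how much prepending you do, because prefix length is compared before any BGP attribute. That is why SP_A needs to advertise the /26 as well.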

Controlling Outbound Traffic

If you are dual homed and you want to load share to one of the ISPs for your OUTBOUND traffic, you could get the ISP to peer with one of your routers via a loopback. You’d then route towards that loopback via two equal cost paths, probably by a static route.

Another way to load balance in a dual homed scenario (when taking full routing tables in on at least one router) is by using a prefix list on the neighbor to only accept half the routing table in. So you’d use something like the below. Where roughly half the routing table goes via R1’s neighbor because the routes are more specific, and the rest go via R2’s default route.

R1#neighbor x prefix-list HALF_ROUTES in 
prefix-list HALF_ROUTES permit ip

R2#neighbor x prefix-list DEFAULT_ONLY in 
prefix-list DEFAULT_ONLY permit ip

A problem you can run into with dual homed AS’s is from firewalls. Since traffic inbound and outbound is often asymmetric, it means that if you have a setup like below, then traffic that leaves FW1 and goes via R1 towards the internet creates an entry in the state table. However if the return traffic comes back via R2 and hits FW2, there would be no state entry. So it wouldn’t work.


The solution is to ensure you split your public IP block out so that traffic goes via one particular router. So say you are using one public /24 from Service Provider A (SP_A) and also getting SP_B to advertise this upstream. Now you want to ensure traffic that came from FW1 goes back to FW1. You would put say the first /25 part of that subnet on the WAN interfaces of FW1, and the other /25 on FW2. Now when the return traffic comes to R2 (that was initiated from FW1), R2 will route it via R1 and R1 towards FW1.

Why Does iBGP have to be full mesh

When routes traverse multiple BGP ASs, each AS is added onto the AS_PATH attribute. If a router receives a BGP update with its own AS inside the AS_PATH, it knows to reject that update, since the true path to the destination can never go via multiple hops and then back through itself. That would indicate a loop. However, within an autonomous system, the AS path does not change and BGP has no mechanism to ensure loop-free paths. This is why the iBGP rule, “if you learn a prefix from an iBGP peer, you cannot advertise it out to any other iBGP peer”, was created. You cannot cause a BGP loop if you follow this rule.
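A tiny Python sketch of the eBGP side of this: prepend your own AS when advertising to another AS, and reject anything that already contains it. The function names and the private AS numbers are assumptions for illustration.

```python
from typing import List


def advertise_to_ebgp(local_as: int, as_path: List[int]) -> List[int]:
    """Prepend our own AS before sending the route to an eBGP peer."""
    return [local_as] + as_path


def accept_update(local_as: int, as_path: List[int]) -> bool:
    """Reject any update whose AS_PATH already contains our own AS."""
    return local_as not in as_path


if __name__ == "__main__":
    path = advertise_to_ebgp(65001, [])    # originated in AS 65001
    path = advertise_to_ebgp(65002, path)  # passed on by AS 65002
    print(path)                            # [65002, 65001]
    print(accept_update(65003, path))      # True, AS 65003 is not in the path
    print(accept_update(65001, path))      # False, the route has looped back
```

Between iBGP peers the path is not touched, so this check can never fire inside an AS, which is exactly the gap the full mesh rule fills.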

The problem with iBGP full mesh peering is that adding new peers takes a lot of management overhead and resources. Imagine you have a 100 router iBGP full mesh topology & add another router. You need to add 100 peers on the new router + 1 peer on each of the existing routers. This wastes memory and also processing power. This is why route reflectors were invented.
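The scaling problem is just the n(n-1)/2 formula for a full mesh. A quick sketch of the arithmetic (function name is mine):

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions needed to fully mesh the given number of routers."""
    return routers * (routers - 1) // 2


if __name__ == "__main__":
    before = full_mesh_sessions(100)
    after = full_mesh_sessions(101)
    print(before, after, after - before)
```

Going from 100 to 101 routers adds 100 sessions, one on each existing router plus 100 on the new box.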

Route Reflectors

The rules of route reflection are:

  • If the route was received from a nonclient peer, reflect the route to all client peers.
  • If the route was received from a client peer, reflect the route to all nonclient peers and client peers

Because a route reflector reflects routes learned from a client to both nonclient and client peers, clients within a cluster do not need to be fully meshed anymore.
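The two reflection rules can be written down as a small Python function. This is a sketch of the rules exactly as listed above (it ignores eBGP-learned routes), and the peer names are hypothetical.

```python
from typing import Dict, List


def reflect_to(source: str, peers: Dict[str, str]) -> List[str]:
    """Return the iBGP peers a route reflector sends a route to.

    peers maps a peer name to "client" or "nonclient"; source is the
    peer the route was learned from.
    """
    if peers[source] == "nonclient":
        # Learned from a nonclient: reflect to clients only.
        targets = [p for p, kind in peers.items() if kind == "client"]
    else:
        # Learned from a client: reflect to every other client and nonclient.
        targets = [p for p in peers if p != source]
    return sorted(targets)


if __name__ == "__main__":
    peers = {"C1": "client", "C2": "client", "N1": "nonclient", "N2": "nonclient"}
    print(reflect_to("N1", peers))
    print(reflect_to("C1", peers))
```

Note that a route from N1 is never passed to N2, which is why the nonclients still need to be fully meshed with each other.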

When you build hierarchies of route reflectors & a mixture of nonclient iBGP peers, you can end up with routing loops because nobody tracks where the route was originally learned from. So the originating router can get stuck in a routing loop, advertising the route to an iBGP neighbor & then receiving that same route back and preferring it because of something like local preference. The Cluster List and Originator ID were therefore created to deal with this. The Originator ID is set by the Route Reflector (RR) to the router-id of the RR client or non-client it learned the route from, and a router that receives a route carrying its own router-id as the Originator ID discards it. The RR also adds its cluster-id (by default its own router-id) to the Cluster List, so that if it gets a BGP Update for the same prefix and sees itself in the Cluster List, it discards the route.

The Cluster List provides path information within the autonomous system, much as the AS Path provides path info between autonomous systems. The Cluster List and Originator ID are both optional nontransitive BGP attributes, which means that they are removed at the border of an autonomous system & will never be advertised to eBGP peers.
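Here is a rough Python sketch of how the Originator ID and Cluster List stop a reflected route from looping. The Route class and function names are made up for illustration, and router-ids are just strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Route:
    prefix: str
    originator_id: Optional[str] = None
    cluster_list: List[str] = field(default_factory=list)


def reflect(route: Route, cluster_id: str, learned_from: str) -> Route:
    """What an RR does to a route before reflecting it."""
    return Route(
        prefix=route.prefix,
        # Keep an existing Originator ID; otherwise stamp the peer we learned it from.
        originator_id=route.originator_id or learned_from,
        # Prepend our cluster-id so we can recognise the route if it comes back.
        cluster_list=[cluster_id] + route.cluster_list,
    )


def rr_accepts(route: Route, cluster_id: str) -> bool:
    """An RR discards a route that already carries its own cluster-id."""
    return cluster_id not in route.cluster_list


def router_accepts(route: Route, router_id: str) -> bool:
    """Any router discards a route whose Originator ID is its own router-id."""
    return route.originator_id != router_id


if __name__ == "__main__":
    r = reflect(Route("10.1.0.0/16"), cluster_id="1.1.1.1", learned_from="3.3.3.3")
    r = reflect(r, cluster_id="2.2.2.2", learned_from="1.1.1.1")
    print(r)
    print(rr_accepts(r, "1.1.1.1"), router_accepts(r, "3.3.3.3"))
```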

When reflecting a route, a RR will never change the BGP next hop, local pref, MED, or AS Path attributes of the route. This is also to prevent routing loops.

An RR client may learn routes from more than one RR.

As a general recommendation it is not a good idea to have mixed RR clusters that incorporate both client-to-client reflection and also direct peering sessions between clients. Either go full mesh or ONLY use RR’s. Don’t mix it cause it’s a cluster fuck to manage.

When you have multiple RR’s serving the same set of clients, it’s best to set the cluster ID to be the same address on both RR’s. This stops the unnecessary duplication of updates towards the clients.

As a rule of thumb, your logical topology should be identical to your physical topology for the RR cluster.

BGP Confederation

Two new attributes are defined for BGP confederations:

  • AS Confederation Sequence. This is an ordered list of all the sub-AS’s a route has passed through within the confederation
  • AS Confederation Set. This is an un-ordered list of all the sub-AS’s a route has passed through within the confederation

These attributes are carried as part of the AS_PATH, and are only used for aggregation purposes (similar to how normal BGP aggregation works with the AS Set and AS Sequence).

The rules for BGP updates within a confederation are as follows:

  • If advertising the update to an iBGP peer, normal processing applies.
  • If advertising to an eBGP peer within the same BGP confederation (shares the same confederation ID), prepend the sub-AS number to the list of sub-ASes in the AS Confederation Sequence.
  • If advertising to an eBGP peer outside the BGP confederation (does not share the same BGP confederation ID), remove the AS Confederation Sequence and Set attributes, and prepend the confederation ID to the AS Sequence in the AS Path.
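The three rules above can be sketched in Python. I model the AS Path as two plain lists (the confederation sequence and the normal AS sequence) and ignore the Set variants. The peer_kind labels and AS numbers are made up for the example.

```python
from typing import List, Tuple

# (AS Confederation Sequence, normal AS Sequence)
AsPath = Tuple[List[int], List[int]]


def advertise(path: AsPath, sub_as: int, confed_id: int, peer_kind: str) -> AsPath:
    """Return the AS Path as it would be sent to the given kind of peer."""
    confed_seq, as_seq = path
    if peer_kind == "ibgp":
        # Same sub-AS: the path is passed on unchanged.
        return (list(confed_seq), list(as_seq))
    if peer_kind == "confed_ebgp":
        # Another sub-AS in the same confederation: prepend our sub-AS.
        return ([sub_as] + confed_seq, list(as_seq))
    if peer_kind == "external_ebgp":
        # Outside the confederation: strip sub-AS info, prepend the confederation ID.
        return ([], [confed_id] + as_seq)
    raise ValueError(f"unknown peer kind: {peer_kind}")


if __name__ == "__main__":
    path: AsPath = ([], [65100])                      # learned from external AS 65100
    path = advertise(path, 64512, 200, "confed_ebgp")  # leaves sub-AS 64512
    path = advertise(path, 64513, 200, "confed_ebgp")  # leaves sub-AS 64513
    print(path)
    print(advertise(path, 64514, 200, "external_ebgp"))
```

From the outside the whole confederation looks like the single AS 200, no matter how many sub-AS’s the route crossed.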

BGP Peer Groups
Peer groups can reduce convergence time significantly in places like BGP peering exchanges in POPs, where you have a large number of prefixes being advertised and a lot of eBGP peers. Let’s say you are a large tier 1 or 2 ISP with many eBGP peers. You will be taking a full internet routing table and potentially advertising it to a good chunk of your customers (other smaller ISP’s). If you have say 400 customers each taking a 500,000 route internet routing table, then for each neighbor the router must check the routing table, generate the BGP update, and store it in that neighbor’s output queue until it is sent (i.e. the 500,000 routes in the routing table are scanned 400 times in order to create the 400 sets of BGP updates). With peer groups, the routing table is checked just once, and then a BGP update message is formed and replicated to all peers in the group. Russ White, the author, shows an example of having 100,00 routes being sent to 420 eBGP peers & shows that it takes about 850 seconds to advertise all the routes to all the neighbors. But with those same neighbors configured using a peer-group it takes about 120 seconds.

BGP Peer Templates
These are similar to peer groups, which group neighbors together so that one BGP update can be built and sent to all of them. The trick with peer templates though is that they hold a block of session or policy settings that a neighbor (or another template) can inherit. So if you have a tonne of eBGP neighbors connecting to multiple AS’s, you can just define the policy in a couple of templates, and then get the neighbors to inherit it. Then when you make changes in the future, you do it in just one or two templates and that’s it! It’s really a very large scale feature designed for tier 1 or 2 ISPs in my opinion.

Keepalive and Hold down Timers

BGP keeps a TCP session open with the peer providing it’s received a keepalive within the hold time. The hold time is the amount of time a BGP speaker will wait to receive a keepalive before resetting the session.

BGP speakers advertise their configured hold time at the beginning of a session with another speaker. The hold time used for the session is the lower of the two hold times advertised.
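The negotiation is just a min() of the two advertised values. A tiny sketch, with function names of my own and example timer values that are only illustrative:

```python
def negotiated_hold_time(local_hold: int, peer_hold: int) -> int:
    """The session uses the lower of the two advertised hold times (seconds)."""
    return min(local_hold, peer_hold)


def session_expired(seconds_since_last_keepalive: int, hold_time: int) -> bool:
    """The session is reset once nothing has been heard for the whole hold time."""
    return seconds_since_last_keepalive >= hold_time


if __name__ == "__main__":
    hold = negotiated_hold_time(180, 90)
    print(hold, session_expired(60, hold), session_expired(90, hold))
```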

Retry Timer

The BGP retry time is the amount of time between retries to establish a connection to configured peers which have gone down for one reason or another. By default it is 120 seconds. The idea is to prevent a router from causing the network to fall into a state of continual churn if it keeps taking routes and then withdrawing them for whatever reason (maybe not enough memory).

Open Delay Timer

This is to stop two BGP speakers trying to form a BGP session at the same time. The time when the BGP speakers attempt to peer is basically just jittered. The time is some random small number, plus another random small number.

Minimum Route Advertisement Interval

This is the amount of time a BGP speaker must wait before sending new information about a prefix to its peers. If a route is flapping like crazy, say once a second, this timer stops the constant stream of BGP updates that would otherwise be sent: a withdraw, followed straight away by another update saying the route is restored. The default values are particularly important to know since they are not what I thought they were, especially for regular iBGP:

  • eBGP sessions not in a VRF: 30 seconds
  • eBGP sessions in a VRF: 0 seconds
  • iBGP sessions: 0 seconds

BGP Fast-External-Failover

This is enabled by default on Cisco IOS. Basically, when an eBGP peer is connected over a physical link and peered using that physical link’s IP address (as opposed to loopbacks), then the neighbor is brought down straight away if that link goes down. The idea is that if you only have one connection to an eBGP peer and your physical interface towards them goes down, then there is no other path to reach the neighbor. So there is no need to wait for the hold timer to expire (3 missed keepalives) before failing the neighbor. It just fails it straight away.


I learned something with ACLs I didn’t know before. It’s kind of a weird ACL to see and I’ve never used it before. Here is the example

access-list 1 permit

This would match anything with .0 in the last octet (i know it seems obvious but I have never used an all 0’s wildcard except to explicitly match a /32 prefix). So what this mean is that, or, or would be matched by this ACL.

Another interesting thing I found. If you use a distribute list for BGP filtering and reference an ACL that is non-existent, then all routes are permitted. Which is the opposite for 90% of other applications which would block all of the traffic. The same goes for prefix-lists referenced on BGP neighbor statements.

BGP Communities

The recommended encoding for communities is in the form AA:NN (AA = AS number, NN = Network Number). The default display in Cisco IOS software is a single 32-bit decimal number, with AA in the upper 16 bits and NN in the lower 16 bits. In order to display communities in the AA:NN format, ip bgp-community new-format must be configured in global config. The catch you can run into with communities is this.

R1(config)#ip community-list 1 permit 200:666
R1#sh ip community-list 1
         permit 13107866

So even though you configured it in the new-format, the router matches it based on decimal. I spent a little bit of time working out how to convert the number from the 4 byte number to decimal.

So the numbers in the first left brackets = 200, and the second brackets = 666.


Then I bung this entire 32 character number into the scientific calculator as binary and convert it to dec. = 13107866.
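The calculator step can be done in a couple of lines of Python instead. The function names are mine. The maths is just AA shifted into the upper 16 bits with NN in the lower 16.

```python
def community_to_decimal(community: str) -> int:
    """Convert an AA:NN community into the 32-bit decimal form."""
    aa, nn = (int(part) for part in community.split(":"))
    return (aa << 16) | nn


def decimal_to_community(value: int) -> str:
    """Convert the 32-bit decimal form back into AA:NN."""
    return f"{value >> 16}:{value & 0xFFFF}"


if __name__ == "__main__":
    value = community_to_decimal("200:666")
    print(value)                     # 13107866
    print(format(value, "032b"))     # the 32 character binary string
    print(decimal_to_community(value))
```

So 200:666 is 200 x 65536 + 666 = 13107866, which is the number the router showed.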

Type of communities

There are 4 types of community list as shown below. But only 2 of them are actually different (standard and expanded), it’s just that you can either name them or use a number for them, similar to ACLs.

R2(config)#ip community-list ?
  <1-99>     Community list number (standard)
  <100-500>  Community list number (expanded)
  expanded   Add an expanded community-list entry
  standard   Add a standard community-list entry

Standard community lists match a given set of communities per line. For example, the line below only matches a route that carries all 3 of the communities listed in community list 1.

r1(config)#ip community-list 1 permit 200:1 200:2 200:3


Expanded community lists match communities based on a regular expression. So say I configured the below. It would match anything ending with the community value :200.

R2(config)#ip community-list 100 permit :200$
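To show the difference between the two list types, here is a rough Python model. It assumes a standard entry matches when the route carries every community on the line, and that an expanded entry runs its regex against the route's communities joined into one space-separated string. That second part is my assumption about the matching, so treat it as a sketch rather than exact IOS behaviour.

```python
import re
from typing import List


def standard_match(entry: List[str], route_communities: List[str]) -> bool:
    """Standard list line: every community on the line must be on the route."""
    return set(entry).issubset(route_communities)


def expanded_match(pattern: str, route_communities: List[str]) -> bool:
    """Expanded list line: regex applied to the community string (assumed format)."""
    return re.search(pattern, " ".join(route_communities)) is not None


if __name__ == "__main__":
    print(standard_match(["200:1", "200:2", "200:3"], ["200:1", "200:2"]))
    print(expanded_match(r":200$", ["100:1", "65000:200"]))
```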

By default communities are not sent to peers. So you have to enable it per neighbor under router bgp by using

neighbor x.x.x.x send-community

One other good feature regarding communities is that you can search the BGP database for a specific community, or via a community list, as shown below.

R2#sh ip bgp ?
  A.B.C.D            Network in the BGP routing table to display
  A.B.C.D/nn         IP prefix /, e.g.,
  all                All address families
  cidr-only          Display only routes with non-natural netmasks
  community          Display routes matching the communities
  community-list     Display routes matching the community-list

You could then configure a community-list 1 and make it match say community 200:200. Then in the sh ip bgp statement, you would reference community-list 1 to find only routes that have a community of 200:200.

AS Path Access Lists

This matches prefixes based on the AS Path via a regular expression. You create them using the syntax below. And then you reference them on a neighbor statement using a “filter-list” keyword.

router(config)#ip as-path access-list ?

    Regular expression access list number

If you reference an as-path filter on say an outbound routing policy for a particular neighbor but don’t define the actual as-path filter itself, then in this case traffic would be blackholed outbound. This is different to distribute-lists and ACL’s that get referenced but not defined, where the default action would be to permit all traffic.

One other thing about regex in general is that you can use it to check the BGP table for routes matching a certain AS path. For example

sh ip bgp re ^1234_200$

This would search for anything that has the exact AS path of 1234 200.

^ means start of string

$ means end of string

_ means match any: space, bracket, or the start or end of a string
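If you want to test an AS path regex without a router, you can approximate it with Python's re module. The only special handling needed is the underscore, which I translate into "start, end, or a delimiter character". The translation and the function name are my own approximation, not something from the book.

```python
import re

# Approximation of the IOS "_" metacharacter: start of string, end of string,
# or one delimiter (space, comma, brace, parenthesis).
UNDERSCORE = r"(?:^|$|[ ,{}()])"


def as_path_matches(ios_pattern: str, as_path: str) -> bool:
    """Check a space-separated AS path string against an IOS-style regex."""
    translated = ios_pattern.replace("_", UNDERSCORE)
    return re.search(translated, as_path) is not None


if __name__ == "__main__":
    print(as_path_matches("^1234_200$", "1234 200"))      # exact path 1234 200
    print(as_path_matches("^1234_200$", "1234 300 200"))  # extra AS in the middle
    print(as_path_matches("_200$", "1234 1200"))          # 1200 is not AS 200
    print(as_path_matches("^$", ""))                      # empty path, locally originated
```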


I have always found route-maps easy, but there is an AMAZING head scratcher of a route-map in chapter 6 under the local preference section. I really liked it.

A good route-map tip with communities is this. If you are writing a route-map to “set” the community value for a prefix, it will over-write any currently set community value that the prefix already has attached to it. So if you want to just append another community to the prefix just use the key word additive. EG:

R1(config)# route-map test permit 10
set community 200:000 additive

I’ve just read about a cool feature with route-maps that I didn’t know before. See the syntax below

R2(config)#route-map test permit 10
R2(config-route-map)#match ip address 1
R2(config-route-map)#continue ?
    Route-map entry sequence number

R2(config-route-map)#continue 15

So this says once a match is found, don’t exit the route-map just yet. Instead continue on to route-map sequence 15. Apparently you can only have one continue statement within a route-map.


BGP Maximum Prefixes

The maximum-prefix command used on a neighbor statement can be used to limit the number of prefixes accepted from a neighbor. One good use case that Russ describes is if you were to take a partial routing table from your ISP. So if you were to take maybe 5-6 routes from the ISP, you could set the maximum prefixes to 10, which would ensure that any cock ups in your ISP’s filtering, or your own filtering, don’t wreck your router by taking a full internet table. The syntax for the command is below.

router(config-router)#neighbor maximum-prefix 10 ?

        Threshold value (%) at which to generate a warning msg

 restart       Restart bgp connection after limit is exceeded

 warning-only  Only give warning message when limit is exceeded
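The way the three options interact can be modelled roughly like this in Python. The 75% default threshold is my assumption, and the return strings are just labels for the sketch, not real log messages.

```python
def maximum_prefix_action(received: int, limit: int,
                          threshold_pct: int = 75,
                          warning_only: bool = False) -> str:
    """Decide what happens for a neighbor given how many prefixes it has sent.

    threshold_pct defaults to 75 here, which is an assumed default.
    """
    if received > limit:
        # Over the limit: either just warn, or tear the session down.
        return "warning" if warning_only else "teardown"
    if received * 100 > limit * threshold_pct:
        # Past the warning threshold but still within the limit.
        return "warning"
    return "ok"


if __name__ == "__main__":
    for count in (5, 8, 11):
        print(count, maximum_prefix_action(count, limit=10))
```

With the restart keyword the router would bring the session back up after the configured interval instead of leaving it down.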

Stopping Your AS Becoming a Transit Path

If you are dual homed to two ISPs, there is a chance you could leak a full BGP table, or a partial table to one of the ISPs and become transit to reach a destination. One very simple bit of BGP config can be applied to ensure you don’t ever sit in the transit path, as shown below. Basically it just makes sure that the only prefixes you advertise are ones originated within your own AS.


ip as-path access-list 1 permit ^$
router bgp 65001
 neighbor remote-as 65002
 neighbor filter-list 1 out


MED is nontransitive and is usually used for eBGP peers to influence their decision towards our network. By non-transitive this means that I can send the neighbor a MED value, but the MED value should not be re-advertised to any other eBGP peer that my neighbor has. The neighbor can, however, send the MED value to another iBGP peer since it’s within the same autonomous system.

MED is only compared between prefixes received from the same AS (unless using always-compare-med)
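A small Python sketch of that comparison rule. The Path tuple and function name are made up, and it only covers the MED step, not the rest of the best path algorithm.

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    neighbor_as: int
    med: int


def prefer_by_med(a: Path, b: Path, always_compare_med: bool = False) -> Optional[Path]:
    """Return the path preferred on MED, or None if MED does not decide it."""
    if a.neighbor_as != b.neighbor_as and not always_compare_med:
        # Different neighboring AS: MED is skipped entirely.
        return None
    if a.med == b.med:
        return None
    # Lower MED wins.
    return a if a.med < b.med else b


if __name__ == "__main__":
    p1, p2, p3 = Path(65002, 50), Path(65002, 100), Path(65003, 10)
    print(prefer_by_med(p1, p2))
    print(prefer_by_med(p1, p3))
    print(prefer_by_med(p1, p3, always_compare_med=True))
```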

Using a dynamically created value for MED is not a good idea, since when the IGP metric changes, it can cause route flaps and get your prefix dampened.

BGP Graceful Restart

Basically a router can flap its control plane, but continue sending traffic in the data plane with graceful restart. The trick is to make sure all your BGP peers are compatible with graceful restart, and if you are learning any next-hop value via an IGP, that the IGP’s peers are also graceful restart compatible. Otherwise you can run into routing problems. That’s basically the moral of the story.

BGP Peering Troubleshooting

Apparently it’s possible to configure two BGP speakers to form a multihop session so long as they are peering to the correct IP (for this example assume they are peering using loopbacks), even if only one of them is using the correct update-source IP. This can cause forwarding problems in the network, so always double check the update source is right.

BGP Network Statement

If you use network then all prefixes from through are advertised regardless of the prefix length.

aggregate-address with summary-only will always create a local route to Null0 upon creation. This command requires that at least one route within the summary range is in the BGP table, e.g. advertised by a network statement. So if a single route within this range is in the routing table, but not in the BGP table, then the summary is not advertised. You must use a network statement (or redistribution) in conjunction with the aggregate-address command.

Duplicate BGP Router-ID

BGP will drop prefixes whose originator is itself (i.e. if someone else advertises us a route, but the router who originated the route has the same router-id as us, then the prefix is dropped). One easy way to find and fix this is using this:

access-list 101 permit ip host any
debug ip bgp update 101

In here you will see the message “Denied due to originator-id is us”. You can use this same method for spotting duplicate cluster-id’s. To fix a cluster-id problem, you should always follow the rule that all route reflector clients must peer with each route reflector in the same route reflector cluster.

BGP Recursion Oscillation

There are loads of ways that you can run into a recursion problem for the next hop with BGP updates, one of which is shown in this BGP Recursive Routing Failure blog post. One way to spot this kind of issue (where the next hop resolves to itself via BGP) is if the routes are in the routing table & then disappear within 60 seconds. To troubleshoot it, you can also run the sh ip bgp [prefix] command and check if the route is alternately marked as accessible and inaccessible. If you find that this is the case, then check what the next hop is & you will find the recursion problem.

BGP Security

Basic MD5 syntax is shown below, but it’s really weak. The password is called a shared secret and doesn’t get changed periodically unless you reference a key-chain. The MD5 authentication is also not negotiated on session startup, so you just configure it and make sure the other BGP speaker has it configured. Also note that the key itself is never transmitted between neighbors; each TCP segment just carries an MD5 digest computed using it.

r1(config-router)#neighbor x.x.x.x password [password]

Also note that MD5 does not hide (encrypt) the content of the packet. It’s just there to prove both sides hold the same shared secret. To add that functionality, you would use a one-hop IPsec tunnel. The author shows an example, but it’s long and I think it’s probably out of scope of the CCIE exam. We probably just need to know about MD5, and TTL security.