ShortestPathFirst Network Architecture and Design, and Information Security Best Practices


Carrier Grade NAT and the DoS Consequences

Written by Stefan Fouant

Republished from Corero DDoS Blog:

The Internet has a very long history of utilizing mechanisms that breathe new life into older technologies, stretching them out so that newer technologies may be delayed or obviated altogether. IPv4 addressing, and the well-known depletion associated with it, is one such area that has seen a plethora of mechanisms employed in order to give it more shelf life.

In the early 90s, the IETF gave us Classless Inter-Domain Routing (CIDR), which dramatically slowed the growth of global Internet routing tables and delayed the inevitable IPv4 address depletion. Later came DHCP, another protocol which assisted via the use of short term allocation of addresses which would be given back to the provider's pool after use. In 1996, the IETF was back at it again, creating RFC 1918 private addressing, so that networks could utilize private addresses that didn't come from the global pool. Utilizing private address space gave network operators a much larger pool to use internally than would otherwise have been available if utilizing globally assigned address space -- but if they wanted to connect to the global Internet, they needed something to translate those addresses. This is what necessitated the development of Network Address Translation (NAT).

NAT worked very well for many, many years and slowed the address depletion a great deal. But in order to perform that translation, you still needed to acquire at least one globally addressable IP. As such, NAT only served to slow down depletion, not prevent it - carriers were still required to provide that globally addressable IP from their own address space. With the explosive growth of the Internet of Things, carriers likewise began to run out of address space to allocate.

NAT came to the rescue again. Carriers took notice of the success of NAT in enterprise environments and wanted to do the same within their own networks; after all, if it worked for their customers, it should work for the carriers as well. This prompted the IETF to develop Carrier Grade NAT (CGN), also known as Large Scale NAT (LSN). CGN aims to provide a similar solution for carriers by obviating the need to allocate publicly available address space to their customers. By deploying CGN, carriers could oversubscribe their pool of global IPv4 addresses while still providing seamless connectivity, i.e. no truck roll.

So while the world is spared from address depletion yet again, the use of CGN technologies opens a new can of worms for carriers. No longer does one globally routable IP represent a single enterprise or customer - due to the huge oversubscription which is afforded through CGN, an IP can service potentially thousands of customers.

This brings us to the crossroads of the Denial of Service (DoS) problem. In the past, when a single global IP represented only one customer network, there was typically no collateral damage to other customer networks. If the DoS attack was large enough to impact the carrier's network, or if there was collateral damage, the carrier would simply blackhole that customer IP to prevent it from transiting their network. However, with CGN deployments, and potentially thousands of customers being represented by a single IP, blackhole routing is no longer an option.

CGN deployments are vulnerable to DoS in a few different ways. The main issue with CGN is that it must maintain a stateful record of the translations between external addresses and ports and internal addresses and ports. A device which has to maintain these stateful tables is vulnerable to any type of DoS activity that may exhaust those stateful resources. As such, a CGN device may be impacted in both the inbound and the outbound direction. An outbound attack is usually the result of malware on a customer's machine sending a large amount of traffic towards the Internet and consuming the state tables in the CGN. Inbound attacks usually target a particular customer and take the form of a DoS attack or a Distributed Denial of Service (DDoS) attack. Regardless of the direction of the attack, a large amount of resources is consumed in the CGN state table, which reduces overall port availability. Left unregulated, these attacks can easily cause impact not only to the intended victim, but potentially to the thousands of other customers being serviced by that CGN.

With the inability to simply blackhole a given IP using edge Access Control Lists (ACLs), carriers must look at other options for protecting their customer base. While some CGN implementations have the ability to limit the number of ports that are allocated to a single customer, these limits only work in discrete cases and can be difficult to manage. They also do not protect customers if the CGN device is itself the target of the attack.

The solution to this problem is the use of a purpose-built DDoS mitigation device, or what is more commonly referred to as a "scrubbing" device in IT circles. Dedicated DDoS mitigation devices attempt to enforce that everyone plays nicely, by limiting the maximum number of sessions to or from a given customer. This is done by thorough analysis of the traffic in flight and rate-limiting or filtering traffic through sophisticated mitigation mechanisms to ensure fair use of the public IP and port availability across all customers. Through the use of dedicated DDoS mitigation devices, CGN devices and their associated customers are protected from service disruptions, while still ensuring legitimate traffic is allowed unencumbered. Lastly, another important aspect of DDoS mitigation technology is that these devices tend to be "bumps in the wire"; that is to say, they don't have an IP address assigned to them and as such cannot themselves be the target of an attack.


Is DDoS Mitigation as-a-Service Becoming a De Facto Offering for Providers?

Written by Stefan Fouant

Republished from Corero DDoS Blog:

It’s well known in the industry that DDoS attacks are becoming more frequent and increasingly debilitating, turning DDoS mitigation into a mission critical initiative. From the largest of carriers to small and mid-level enterprises, more and more Internet connected businesses are becoming a target of DDoS attacks. What was once a problem that only a select few dealt with is now becoming a regularly occurring burden faced by network operators.

In my daily engagements with various customers of all shapes and sizes, it’s truly interesting to see how the approach to DDoS mitigation is changing. Much of this is the result of DDoS mitigation services shifting from a “nice to have” technology to a “must-have”, essential in order to maintain business continuity and availability.

When I built DDoS mitigation and detection services for Verizon back in 2004, the intent was to offer value-added, revenue-producing services to subscribers, in an effort to build out our security offerings. For many years, this concept was one that pretty much every provider I worked with was looking into: build a service with the intent of generating new revenue opportunities from customers at a time when traditional avenues such as simple connectivity and bandwidth offerings were contracting.

However, in the past several months, as I interact with everyone from large-scale carriers to data center hosting providers, I am seeing a common thread starting to emerge - that is, attracting new customers and retaining existing ones is becoming more difficult in the absence of differentiated value. Compounding this issue is that customers are starting to expect some of these services as part of their connectivity fees. What I'm seeing is more and more providers investigating the option of offering DDoS mitigation services to their customers as a virtue of being connected to them, in an effort to attract customers away from other providers with limited service offerings and capabilities.

Could it be that DDoS mitigation services will become a standard offering on a provider's network? Is it feasible that at some point in the future DDoS mitigation will become an inherent capability provided by service providers?

In order for this approach to become a reality, the economics of the game have to change. Inserting DDoS mitigation elements into the network needs to be reasonably inexpensive in order for carriers and hosting providers to justify the cost. The technology also needs to be simple and as close to automatic as possible, as an inherent service offering will not justify the huge expense and uplift of having a team of operations personnel managing the service. Attacks need to be mitigated dynamically and quickly, without the need for manual intervention or the requirement to pick up a phone to get assistance. And lastly, whatever mechanisms are in place need to ensure a "do no harm" approach so that there is no collateral damage to good traffic.

At Corero, we believe that we are doing just that: changing not only the economics of the game, but also fundamentally looking at the problem in a different way. Corero enables real-time, algorithmic identification of network anomalies and subsequent mitigation of the attack traffic, eliminating the DDoS challenge before attacks transit the network and ultimately impact downstream customers.

This concept is realized through dynamic mitigation bandwidth licensing, a new economic model built around highly scalable DDoS mitigation technology. Modernizing DDoS protection by taking advantage of always-on mitigation through emerging and proven deployment models, such as dedicated in-line deployment of appliance-based DDoS mitigation at peering and transit points, is becoming a more common practice with the help of Corero Network Security.


Juniper Networks Announces New Network Design Training Curriculum and Certification Program

Written by Stefan Fouant

Juniper took a big step forward in rounding out its certification programs by announcing a new Design Training and Certification curriculum, focusing on best practices and techniques that can be used across the spectrum of network architecture and design. The program is also slated to include technologies around software-defined networking (SDN) and network functions virtualization (NFV).

This is a huge step forward for Juniper's training and certification program and will round out their education portfolio with something similar to Cisco's design certification. Furthermore, with the advent of network automation, and with SDN and NFV technologies becoming more commonplace, the benefits of such a training and certification curriculum can't be overstated.

The design curriculum will eventually include a portfolio of training offerings, starting with the first course which is available now, the Juniper Networks Design Fundamentals course. These courses and their corresponding design certifications will focus on the latest techniques, resources, and tools that companies can use to fully design, secure, and automate their networks. Training will range all the way from design fundamentals through to more advanced courses covering the design-specific requirements of Data Center and WAN networks. The first certification, the Juniper Networks Certified Design Associate (JNCDA), is available for registration now and will eventually be followed by certifications at the Specialist level (JNCDS) and the Professional level (JNCDP).

This looks to be a very exciting offering indeed and should help those interested in Juniper technologies keep pace with the myriad changes taking place in the networking world, and assist them in making proper design choices. I thoroughly look forward to reviewing these materials and providing an update to the community once I've had an opportunity to take a look at them.


What’s a Steiner Tree?

Written by Stefan Fouant

Any of you who have worked with VPLS or NG-MVPNs are likely already familiar with using Point-to-Multipoint (P2MP) LSPs to get traffic from a single ingress PE to multiple egress PEs.  The reason that P2MP LSPs are desired in these cases is that it can reduce unnecessary replication by doing so only where absolutely required, for example where a given P2MP LSP must diverge in order to reach two different PEs.

However, typically the sub-LSPs which are part of a given P2MP LSP traverse the shortest-path from ingress to egress based on whatever user defined constraints have been configured.  While this is fine for many applications, additional optimizations might be required such that additional bandwidth savings can be realized.

We will take a look at something called a Steiner Tree, which can help the network operator realize these additional savings, when warranted, reducing the overall bandwidth used in the network and fundamentally changing the way in which paths are computed.

Let's start by taking a look at a simple example in which RSVP is used to signal a particular P2MP LSP, but no constraints are defined.  All the links in this network have a metric of 10.  In this case, the sub-LSPs will simply traverse along the shortest path in the network, as can be seen in the diagram below.

Here we see a P2MP LSP where PE1 is the ingress PE and PE2, PE3, and PE4 are all egress nodes.  Since no constraints have been defined, the calculated ERO for each of the sub-LSPs will follow along the shortest path, where we can see one sub-LSP taking the PE1-P1-P2-PE2 path, another taking the PE1-P1-P3-PE3 path, and the third taking the PE1-P1-P4-PE4 path.  In this case, each sub-LSP has a total end-to-end cost of 30.

Shortest Tree
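For illustration, here is a minimal sketch of how such a P2MP LSP might be signaled with RSVP in Junos; the LSP names and loopback addresses are hypothetical, and no constraints are applied, so CSPF simply computes the shortest path for each sub-LSP:

protocols {
    mpls {
        /* Each sub-LSP is a separate label-switched-path, tied together by the shared p2mp name */
        label-switched-path to-pe2 {
            to 10.0.0.2;        /* hypothetical PE2 loopback */
            p2mp video-tree;
        }
        label-switched-path to-pe3 {
            to 10.0.0.3;        /* hypothetical PE3 loopback */
            p2mp video-tree;
        }
        label-switched-path to-pe4 {
            to 10.0.0.4;        /* hypothetical PE4 loopback */
            p2mp video-tree;
        }
    }
}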

Under many circumstances this type of tree would be perfectly acceptable, especially when the end goal is to minimize end-to-end latency; however, there are other cases where we may want to introduce additional hops in an effort to reduce overall bandwidth utilization.  This is where the concept of a minimum-cost tree, otherwise known as a Steiner Tree, comes into play.

This may seem counter-intuitive at first; after all, doesn't a shortest-path tree attempt to minimize costs?  The answer is yes, but it usually only does so by looking at costs in terms of end-to-end metrics or hops through a network.  Once you understand the mechanics of the Steiner Tree algorithm, and how it attempts to minimize the total number of interconnects, it starts to make more sense.

According to Wikipedia, "the Steiner tree problem, or the minimum Steiner tree problem, named after Jakob Steiner, is a problem in combinatorial optimization, which may be formulated in a number of settings, with the common part being that it is required to find the shortest interconnect for a given set of objects".

That's a pretty fancy way of saying it's attempting to optimize the path to be the shortest path possible while at the same time reducing the total number of interconnects between all devices to only those that are absolutely required.

Steiner Tree optimizations are very useful where an ingress PE must send large amounts of data to multiple PEs and it is preferable to ensure that overall bandwidth utilization is reduced, perhaps because of usage-based billing scenarios which require that overall circuit utilization be reduced as much as possible in order to save money.

Let's take a look at an example, once again using the same network as before, but this time performing a Steiner Tree optimization whereby cost is measured in terms of overall bandwidth utilization.  In this case we still see that we have the requirement to build the P2MP LSP from PE1 to PE2, PE3, and PE4.  However, this time we are going to compute an ERO such that replication will only take place where absolutely necessary in order to reduce the total number of interconnects and hence overall bandwidth utilization.

After performing a Steiner Tree path computation, we determine that PE3 is a more logical choice to perform the replication to PE2 and PE4, even though it increases the overall end-to-end metric cost to 40.  The reason for this is we have now effectively eliminated the bandwidth utilization on the P1-P2, P2-PE2, P1-P4, and P4-PE4 links.  In effect, we've gone from utilizing bandwidth across seven links to only five.  If the P2MP LSP was servicing a 100 Mbps video stream, we have just effectively reduced overall bandwidth utilization on the network as a whole by 200 Mbps.

Steiner Tree

One of the interesting side-effects of this approach is that PE3 is now not only an egress node but also a transit node (for the sub-LSPs terminating at PE2 and PE4).  Due to this, we'll also see that in these types of scenarios the Penultimate Hop Popping (PHP) behavior is different on P3, in that we don't want it popping the outer label before sending frames to PE3, since PE3 may need to accommodate labeled packets heading to PE2 or PE4.  We will cover some of this in a subsequent article on the signaling mechanisms inherent in P2MP LSPs and some of the changes to the behavior in MPLS forwarding state.

Path computation for P2MP LSPs can be complex, especially when the goal is to create Steiner Trees.  The reason for this added complexity when computing Steiner Trees is that sub-LSP placement has a direct correlation with other sub-LSPs, which is contrary to what happens when shortest-path trees are calculated, where each sub-LSP may be signaled along its own unique path without regard to the placement of other sub-LSPs.

As with traditional LSPs, similar methods of determining the paths through the network and hence the ERO can be used, i.e. manual, offline computation.

The easiest approach would be to use constructs like Link Coloring (Affinity Groups for you Cisco wonks) to influence path selection, for example, by coloring the PE1-P1, P1-P3, P3-PE3, PE3-PE2, and PE3-PE4 links with an included color, or coloring the remaining links with a different color and excluding that color from the LSP configuration.
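As a rough sketch of that coloring approach (the group name, interfaces, and addresses below are hypothetical), the included color might be applied in Junos along these lines:

protocols {
    mpls {
        admin-groups {
            steiner 10;                      /* hypothetical color name and value */
        }
        interface ge-0/0/1.0 {
            admin-group steiner;             /* color only the links the tree should use */
        }
        label-switched-path to-pe3 {
            to 10.0.0.3;                     /* hypothetical PE3 loopback */
            p2mp video-tree;
            admin-group {
                include-any steiner;         /* constrain CSPF to the colored links */
            }
        }
    }
}

The same include-any (or a corresponding exclude) would be applied to each sub-LSP belonging to the P2MP LSP.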

However, this approach is merely a trick.  We are feeding elements into the CSPF algorithm such that the shortest path which is calculated essentially mimics that of a Steiner Tree.  In other words, it's not a true Steiner Tree calculation because the goal was not to reduce the total number of interconnects, but rather to only utilize links of an included color.

Furthermore, such an approach doesn't easily accommodate failure scenarios in which PE3 may go down, because even though Fast Reroute or Link/Node Protection may be desired, if the remaining links do not have the included colors they may be unable to compute an ERO for signaling.

Workarounds to this approach are to configure your Fast Reroute Detours or your Link/Node Protection Bypass LSPs to have more relaxed constraints, such that any potential path might be used.  However, more commonly what you'll see is that some type of additional computations might be performed using traditional offline approaches (using modeling tools such as those provided by vendors such as WANDL, OPNET, or Cariden) which factors both steady-state as well as failure scenarios to assist the operator in determining optimal placement of all elements.

An interesting side-note is that there are some pretty significant developments underway whereby online computation can be performed in such a way as to optimize all P2MP LSPs network-wide, using something known as Path Computation Elements (PCEs).  A PCE is essentially any entity capable of performing path computation for any set of paths throughout a network by applying various constraints.  It is something that looks to be especially useful in large carrier networks consisting of many LSPs, and especially so in the case of Steiner Tree P2MP LSPs where sub-LSP placement is highly dependent on that of others.  See the charter of the PCE Working Group in the IETF for more information on this and other related developments.

As a side note, it should be fairly evident that in order to perform path optimizations on anything other than shortest-path trees (i.e. Steiner Trees or any other type of tree based on user-defined constraints), RSVP signaling must be used in order to signal a path along the computed ERO.  LDP certainly can be used to build P2MP LSPs (aka mLDP), however much like traditional LSPs built via LDP, the path follows the traditional IGP path.

Stay tuned as we will cover more exciting articles on P2MP LSPs and some of the other underpinnings behind many of the next generation MPLS services being commonly deployed...


JNCIE Tips from the Field :: Summarization Made Easy

Written by Stefan Fouant

Today we'll start with a series of articles covering tips and techniques that might be utilized by JNCIE candidates, whether pursuing the JNCIE-SP, JNCIE-ENT, or even the JNCIE-SEC.  The tips and techniques I will be covering might prove to be useful during a lab attempt but could also be used in real-world scenarios to save time and minimize configuration burden in addition to eliminating mistakes that might otherwise be made.  I want everyone to understand that what I am about to write is simply a technique.  I am not divulging any materials or topics which are covered under NDA.

NOTE: For full disclosure, I must reveal that I am an employee of Juniper Networks in their Education Services department.  As such, I take the responsibility of protecting the content and integrity of the exam as well as the certification credentials very seriously.  I would never reveal anything which would allow a candidate to have in-depth knowledge of any specific topics or questions that may appear on the exam.  Not only that, I worked REALLY, REALLY hard to achieve my JNCIE certifications, and I believe everyone else should too! It's certainly more rewarding that way too don't you think?!

So without further delay, let's take a look at today's technique.

It is well known that summarization is a key aspect of any type of practical exam involving routing of some sort.  As those who have ever taken a CCIE Routing & Switching or CCIE Service Provider exam can attest, summarization is one thing every expert-level candidate needs to master.  It is no different with Juniper.  In fact, Juniper's certification web page clearly lists as one of the JNCIE-ENT exam objectives the requirement to "Filter/summarize specific routes".

What I will show you next is a technique which I find quite handy when attempting to determine the best summary for a given set of routes, and you can do so without having to resort to pen and paper and figuring it out the old-fashioned way, i.e. looking at prefixes in binary. This technique, rather, allows you to use the power of Junos to your advantage to perform these tasks.  What I will reveal will also show you a fundamental difference between IOS and Junos and highlight why I believe Junos to be a more flexible, powerful, and superior network operating system.  You simply can't do what I am about to do on a Cisco platform running IOS.

So let's start off by looking at a diagram.  Let's say we have a network that has several OSPF areas, and we must summarize some information for each respective area towards the backbone without encompassing routing information that might exist outside of that area.


Here we can see we have a backbone area, consisting of two routers, P1 and P2.  P1 is acting as an ABR for Area 1 and is connected to both R1 and R2. P2 is acting as an ABR for Area 2 and is connected to R3.  As you can see from the diagram, I have configured more than a single set of IP addresses on many of the physical interfaces as well as the loopbacks.  This way I can represent many more networks and therefore create multiple Network LSAs for purposes of summarization.

So let's assume that we need to create the largest aggregate possible for a particular area and advertise only that aggregate towards the core, without encompassing any routes which might be outside the area the summary describes.  Now normally, one would take a look at the diagram, get out a pen and paper, and start a lengthy exercise of supernetting based on binary addresses.  This can take several minutes or more and is valuable time that could certainly be used on a wide variety of other more important tasks, like setting up MPLS LSPs or troubleshooting that Layer 2 VPN connectivity.  So let's take a look at a simple trick that actually takes advantage of Junos to determine what the summary should be.

What we are going to do is take advantage of a feature inside Junos which automatically shows us the range of prefixes which match a given CIDR block.  The Junos operating system has built-in route matching functionality which allows us to specify a given CIDR block and returns all routes with a mask length equal to or greater than that which is specified.  So by applying this principle, what we want to do is look at the diagram for a particular area, choose the lowest IP within that area as our base, and then apply a subnet mask to it which attempts to encompass that route as well as others.
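As a purely illustrative sketch (the addresses below are hypothetical and not taken from the diagram), the workflow looks something like this: start with the lowest address in the area and progressively shorten the mask until the output covers everything in the area and nothing more:

sfouant@p1# run show route 10.0.1.1/32
sfouant@p1# run show route 10.0.1.0/28
sfouant@p1# run show route 10.0.0.0/22

Each successively shorter mask returns every route whose prefix falls within that block and whose mask length is equal to or greater than the one specified, so the shortest mask that returns all of the area's routes, and only the area's routes, identifies the summary.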

For example, looking at this diagram, we see that the lowest IP address being used in Area 1 is the address assigned to R1's loopback.  So let's start by using this as our base for our summary, and then simply apply a subnet mask to it which we think might encompass additional routes:

sfouant@p1# run show route   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both     *[OSPF/10] 8w4d 19:04:48, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:48, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:43, metric 1
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0     *[OSPF/10] 8w4d 19:04:43, metric 1
                    > to via ge-0/0/1.0
                      to via ge-0/0/1.0

Note: We can do this on any router within Area 1 since the Link-State Database is the same on all devices, but I prefer to perform the work on the ABR since this is where I will be performing the aggregation.  Also, the ABR may have other local and/or direct routes (or perhaps routes from other protocol sources) so we want to see things from the perspective of the ABR.

What we see here is that we have just now determined the summary route which in fact encompasses all the loopback addresses on both R1 as well as R2, but we need to keep going because this doesn't incorporate the Gigabit Ethernet links between all the devices:

sfouant@p1# run show route   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:45, metric 1
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0     *[OSPF/10] 8w4d 19:04:45, metric 1
                    > to via ge-0/0/1.0
                      to via ge-0/0/1.0

Not quite. Let's keep trying:

sfouant@p1# run show route   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both     *[OSPF/10] 8w4d 19:04:55, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:55, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:50, metric 1
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to via ge-0/0/1.0
                      to via ge-0/0/1.0

Nope, still not there yet. Let's try again:

sfouant@p1# run show route   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both     *[OSPF/10] 8w4d 19:04:58, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:58, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:53, metric 1
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0     *[OSPF/10] 8w4d 19:04:53, metric 1
                    > to via ge-0/0/1.0
                      to via ge-0/0/1.0     *[OSPF/10] 00:36:26, metric 2
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0
                      to via ge-0/0/0.0     *[Direct/0] 8w4d 19:36:13
                    > via ge-0/0/1.0     *[Local/0] 8w4d 19:36:13
                      Local via ge-0/0/1.0     *[Direct/0] 8w4d 19:51:31
                    > via ge-0/0/0.0     *[Local/0] 8w4d 19:51:31
                      Local via ge-0/0/0.0     *[OSPF/10] 00:36:26, metric 2
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0
                      to via ge-0/0/0.0     *[Direct/0] 8w4d 19:36:13
                    > via ge-0/0/1.0     *[Local/0] 8w4d 19:36:13
                      Local via ge-0/0/1.0

Ok, this looks more like it.  Here we can see we have all the Gigabit Ethernet links connecting all devices, as well as the loopback addresses.  This might be a suitable summary.  Let's keep going to see what happens:

sfouant@p1# run show route   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both      *[Direct/0] 8w4d 19:51:37
                    > via lo0.0      *[Direct/0] 8w4d 19:36:19
                    > via lo0.0      *[OSPF/10] 00:28:41, metric 1
                    > to via fe-0/0/2.0
                      to via fe-0/0/2.0      *[OSPF/10] 00:28:41, metric 1
                    > to via fe-0/0/2.0
                      to via fe-0/0/2.0     *[Direct/0] 8w4d 19:05:28
                    > via fe-0/0/2.0     *[Local/0] 8w4d 19:05:28
                      Local via fe-0/0/2.0     *[Direct/0] 8w4d 19:05:28
                    > via fe-0/0/2.0     *[Local/0] 8w4d 19:05:28
                      Local via fe-0/0/2.0     *[OSPF/10] 8w4d 19:05:04, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:05:04, metric 1
                    > to via ge-0/0/0.0     *[OSPF/10] 8w4d 19:04:59, metric 1
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0     *[OSPF/10] 8w4d 19:04:59, metric 1
                    > to via ge-0/0/1.0
                      to via ge-0/0/1.0     *[OSPF/10] 00:36:32, metric 2
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0
                      to via ge-0/0/0.0     *[Direct/0] 8w4d 19:36:19
                    > via ge-0/0/1.0     *[Local/0] 8w4d 19:36:19
                      Local via ge-0/0/1.0     *[Direct/0] 8w4d 19:51:37
                    > via ge-0/0/0.0     *[Local/0] 8w4d 19:51:37
                      Local via ge-0/0/0.0     *[OSPF/10] 00:36:32, metric 2
                      to via ge-0/0/1.0
                    > to via ge-0/0/1.0
                      to via ge-0/0/0.0     *[Direct/0] 8w4d 19:36:19
                    > via ge-0/0/1.0     *[Local/0] 8w4d 19:36:19
                      Local via ge-0/0/1.0

Clearly from this command, we can see we have now gone beyond what might be considered a suitable summary, because we are now encompassing routes that exist within the backbone Area 0.  So it should be clear from this simple set of commands which of the previously tested prefixes would be the suitable summary for Area 1.

We could easily apply a similar example to Area 2 to quickly determine what the best summary would be.  We see from looking at the diagram that the lowest IP within Area 2 is the loopback address applied to R3.  When we use that as our base and go through the steps above, we can find our summary:

sfouant@p2# run show route   
inet.0: 27 destinations, 27 routes (27 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both     *[OSPF/10] 01:00:57, metric 1
                      to via fe-0/0/3.0
                    > to via fe-0/0/3.0     *[OSPF/10] 01:00:57, metric 1
                    > to via fe-0/0/3.0
                      to via fe-0/0/3.0    *[Direct/0] 01:13:48
                    > via fe-0/0/3.0    *[Local/0] 01:13:48
                      Local via fe-0/0/3.0    *[Direct/0] 01:13:48
                    > via fe-0/0/3.0    *[Local/0] 01:13:48
                      Local via fe-0/0/3.0

And there you have it! As you can see it's really quite simple, and if you haven't stumbled upon this already you may be saying to yourself, "Why didn't I think of that before?".  I hear from many candidates that they spend considerable time determining summaries the old-fashioned way, and I always ask myself why.  As you can see, there is an easier way!
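Once the summary has been identified, applying it on the ABR is a single statement.  A minimal sketch, using a hypothetical area number and summary prefix, might look like this:

protocols {
    ospf {
        area 0.0.0.1 {
            area-range 10.0.0.0/22;      /* hypothetical summary advertised toward the backbone */
        }
    }
}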

Clearly the benefit of using a technique such as the above is to easily find the route that best summarizes a bunch of more specific routes.  Such an approach, while very useful during a practical exam, might be considerably less necessary in the real world, where it is likely that hierarchy has already been built into the network and you have network diagrams at your disposal.  On the other hand, there may be situations where you inherit a network that was developed with hierarchy in mind but where summarization was never employed, or was employed improperly.  In such cases, the above technique can be a real time saver, allowing you to spend less time doing binary math and more time doing the fun stuff - like troubleshooting why that MPLS LSP isn't getting established!

Stay tuned for additional articles covering time saving tips and techniques which can be used during your next lab attempt!  Good luck, and may the force be with you!


IETF Provides New Guidance on IPv6 End-Site Addressing

Written by Stefan Fouant

I've always been at odds with the recommendation in RFC 3177 towards allocating /48 IPv6 prefixes to end-sites.  To me this seemed rather short-sighted, akin to saying that 640K of memory should be enough for anybody.  It's essentially equivalent to giving out /12s in the IPv4 world which in this day and age might seem completely ridiculous, but let us not forget that in the early days of IPv4 it wasn't uncommon to get a /16 or even a /8 in some cases.

Granted, I know there are quite a few more usable bits in IPv6 than there are in IPv4, but allocating huge swaths of address space simply because it's there and we haven't thought of all the myriad ways it could be used in the future just seems outright wasteful.

So you can imagine my surprise and also my elation last week when the IETF published RFC 6177, entitled 'IPv6 Address Assignment to End Sites'.  In it, the general recommendation of allocating /48s to end-sites that has long been the de facto standard since the original publication of RFC 3177 in 2001 has finally been reversed.

It seems that sanity has finally prevailed and the IAB/IESG have decided to take a more pragmatic approach towards address allocation in IPv6.  The recommendations in RFC 6177 attempt to balance the conservation of IPv6 addresses while at the same time continuing to make it easy for IPv6 adopters to get the address space they require, without complex renumbering or other scaling inefficiencies in the long term.  It is clear that acting too conservatively and allocating very small address spaces could act as a disincentive and possibly stifle widespread adoption of IPv6.

The current recommendations for address allocations are as follows:

  • /48 in the general case, except for very large subscribers
  • /64 when it is known that one and only one subnet is needed by design
  • /128 when it is absolutely known that one and only one device is connecting

It goes on to state other recommendations and offers guidance to operators with regard to when to allocate certain prefix lengths.  But essentially, what this means is that individual network operators now have more options regarding which prefix size to allocate, and can move away from strict general guidelines.  In essence, operators make the decision as to what prefix size to allocate based on an analysis of the needs of particular customers.

Perhaps this practical conservation may never be needed given the vast amount of address space available in IPv6, but maybe, just maybe, in the very distant future if IPv6 is still in widespread use, it could very well be due to some of these recommendations being put in place today.  After all, 640K did turn out to be a rather small number, didn't it?


Book Review :: JUNOS High Availability: Best Practices for High Network Uptime

Written by Stefan Fouant

JUNOS High Availability: Best Practices for High Network Uptime
by James Sonderegger, Orin Blomberg, Kieran Milne, Senad Palislamovic
Paperback: 688 pages
Publisher: O'Reilly Media
ISBN-13: 978-0596523046

High Praises for JUNOS High Availability (5 stars)

Building a network capable of providing connectivity for simple business applications is a fairly straightforward and well-understood process. However, building networks capable of surviving varying degrees of failure and providing connectivity for mission-critical applications is a completely different story. After all, what separates a good network from a great network is how well it can withstand failures and how rapidly it can respond to them.

While there are a great deal of books and resources available to assist the network designer in establishing simple network connectivity, there aren't many books which discuss the protocols, technologies, and the myriad ways in which high availability can be achieved, much less tie it all together into one consistent thread. "JUNOS High Availability" does just that, in essence providing a single, concise resource covering all of the bits and pieces which are required in highly available networks, allowing the network designer to build networks capable of sustaining five, six, or even seven nines of uptime.

In general, there are a lot of misconceptions and misunderstandings amongst Network Engineers with regards to implementing high availability in Junos. One only needs to look at the fact that Graceful Restart (GR) protocol extensions and Graceful Routing Engine Switchover (GRES) are often mistaken for the same thing, thanks in no small part to the fact that these two technologies share similar letters in their acronyms. This book does a good job of clarifying the difference between the two and steers clear of the pitfalls typically prevalent in coverage of the subject matter. The chapter on 'Control Plane High Availability' covers the technical underpinnings of the underlying architecture on most Juniper platforms; coverage of topics like the separation between the control and forwarding planes, and kernel replication between the Master and Backup Routing Engine give the reader a solid foundation to understand concepts like Non-Stop Routing, Non-Stop Bridging, and In-Service Software Upgrades (ISSU). In particular I found this book to be very useful on several consulting engagements in which seamless high availability was required during software upgrades as the chapter on 'Painless Software Upgrades' discusses the methodology for achieving ISSU and provides a checklist of things to be performed before, during, and after the upgrade process. Similarly, I found the chapter on 'Fast High Availability Protocols' to be very informative as well, providing excellent coverage of BFD, as well as the differences between Fast Reroute vs. Link and Node Protection.

Overall I feel this book is a valuable addition to any networking library and I reference it often when I need to implement certain high availability mechanisms, or simply to evaluate the applicability of a given mechanism versus another for a certain deployment. The inclusion of factoring costs into a high availability design is a welcome addition and one that all too many authors fail to cover. Naturally, it only makes sense that costs should be factored into the equation, even when high availability is the desired end-state, in order to ensure that ultimately the business is profitable. If I had to make one suggestion for this book it is that there should be additional coverage of implementing High Availability on the SRX Series Services Gateways using JSRP, as this is a fundamental high availability component within Juniper's line of security products. To the authors' credit, however, this book was written just as the SRX line was being released, so I don't fault them for providing limited coverage. Perhaps more substantial coverage could be provided in the future if a Second Edition is published.

The bottom line is this - if you are a Network Engineer or Architect responsible for the continuous operation or design of mission-critical networks, "JUNOS High Availability" will undoubtedly serve as an invaluable resource. In my opinion, the chapters on 'Control Plane High Availability', 'Painless Software Upgrades', and 'Fast High Availability Protocols' are alone worth the entire purchase price of the book. The fact that you get a wealth of information beyond that in addition to the configuration examples provided makes this book a compelling addition to any networking library.


Reality Check: Traditional Perimeter Security is Dead!

Written by Stefan Fouant

Recently I came across a marketing event promoted by a network integrator which touted industry leading solutions to assist customers in determining "what was lurking outside their network".

In this day and age, it still surprises me when supposedly network savvy folks are still thinking of network security in terms of a traditional perimeter made up of firewalls or IPS devices. The truth of the matter is that the traditional perimeter vanished quite a few years ago.

Only looking at the perimeter gives the end-user a false sense of protection. It completely fails to recognize the dangers of mobility in today's workplace environment. Users roam. They might bring viruses or other Trojans INSIDE your network, where they are free to roam unencumbered. In the worst of these cases, the perimeter is only secured in one direction, giving outbound traffic unfettered access and completely ignoring data which might be leaked from hosts inside your network to hosts outside your network, as might be the case with keyloggers or other similar types of rogue programs.

Furthermore, in today's environment composed of virtualized machines, the line gets even blurrier which is why we are starting to see solutions from startup vendors such as Altor Networks. It’s one thing when we are dealing with physical hosts in the traditional sense, but what about the situation when you are dealing with a multitude of virtual machines on the same physical hosts which must talk to each other?

When you take a data-focused approach instead of a technology-focused approach, the problem and its solutions start to make more sense.   The perimeter should be viewed as the demarcation between the data and any I/O fabric providing connectivity between that data and some external entity. This is the domain of things like Data Loss Prevention (DLP), Network Access Control (NAC), and Virtual Hypervisor Firewalls in addition to that of traditional security devices.


To deal with the realities of today, we must start to think of network security in terms of Hotels vs. Castles. In the Castle model, we have a big wall around our infrastructure. We might have a moat and some alligators, and perhaps we only lower our drawbridge for very special visitors. This model tends to keep a good majority of the enemies at bay, but it completely ignores that which might already be inside your network (think in terms of the Trojan horse as told in Virgil's epic poem 'The Aeneid').

What is more commonly being employed is that of the Hotel Model.  Initially, to gain entrance into the hotel itself, we must check in with the Concierge and get our room key.  Once we have our room key, we have limited access to our own room, and perhaps some shared facilities like the pool or the gym.  In this model, we are unable to enter into a room in which we do not have access.  The key word here is LIMITED access.

An all-inclusive security posture looks at the network from a holistic point of view.  The principles of Defense-in-Depth will make evident the failings of the traditional perimeter model.  The traditional perimeter is dead.  The perimeter is wherever the data is.


What’s the BFD with BFD?

Written by Stefan Fouant

Many networks today are striving for "five nines" high availability and beyond. What this means is that network operators must configure the network to detect and respond to network failures as quickly as possible, preferably on the order of milliseconds. This is in contrast to the failure detection inherent in most routing protocols, which is typically on the order of several seconds or more. For example, the default hold-time for BGP in JUNOS is 90 seconds, which means that in certain scenarios BGP will have to wait for upwards of 90 seconds before a failure is detected, during which time a large percentage of traffic may be blackholed. It is only after the failure is detected that BGP can reconverge on a new best path.

Another example is OSPF, which has a default dead interval of 40 seconds, or IS-IS, which has a default hold-time of 9 seconds (for DIS routers) and 27 seconds (for non-DIS routers). For many environments which support mission-critical data, or those supporting Voice/Video or any real-time applications, any type of failure which isn't detected in the sub-second range is too long.

While it is possible to lower timers in OSPF or IS-IS to such an extent that a failure between two neighbors can be detected rather quickly (~1 second), it comes at the cost of increased protocol state and a considerable burden on the Routing Engine's CPU.  As an example, let us consider the situation in which a router has several hundred neighbors. Maintaining subsecond Hello messages for all of these neighbors will dramatically increase the amount of work that the Routing Engine must perform. Therefore, it is a widely accepted view that a reduction in IGP timers is not the best overall solution to the problem of fast failure detection.
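To put that in perspective, here is a minimal sketch of what aggressively tuned OSPF timers might look like on a single interface (the interface and values are illustrative only); multiplied across hundreds of neighbors, the additional load on the Routing Engine becomes clear:

protocols {
    ospf {
        area 0.0.0.0 {
            interface ge-0/0/0.0 {
                hello-interval 1;       /* hellos every second */
                dead-interval 3;        /* neighbor declared down after 3 seconds */
            }
        }
    }
}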

Another reason that adjusting protocol timers is not the best solution is that there are many protocols which don't support a reduction of timers to such an extent that fast failure detection can be realized. For example, the minimum BGP holdtime is 20 seconds, which means that the best an operator can hope for is a bare minimum of 20 seconds for failure detection.

Furthermore, lowering protocol timers does nothing for situations in which there is no protocol at all, for example, Ethernet environments in which two nodes are connected via a switch, as can be seen in the figure below.  In this type of environment, R1 has no idea that R2 is not reachable, since R1's local Ethernet segment connected to the switch remains up.  Therefore, R1 can't rely on an 'Interface Down' event to trigger reconvergence on a new path and instead must wait for higher layer protocol timers to age out before determining that the neighbor is not reachable.  (Note to the astute reader: Yes, Ethernet OAM is certainly one way to deal with this situation, but that is a discussion which is beyond the scope of this article).


Essentially, at the root of the problem is either a lack of suitable protocols for fast failure detection of lower layers, or worse, no protocol at all.  The solution to this was the development of Bidirectional Forwarding Detection, or BFD, developed jointly by Cisco and Juniper.  It has been widely deployed and is continuing to gain widespread acceptance, with more and more protocols being adapted to use BFD for fast failure detection. 

So what is the Big Freaking Deal with Bidirectional Forwarding Detection anyway, and why are so many operators implementing it in their networks?  BFD is a simple hello protocol with the express purpose of rapidly detecting failures at lower layers.  The developers wanted to create a low overhead mechanism for exchanging hellos between two neighbors without all the nonessential bits which are typical in an IGP hello or BGP keepalive.  Furthermore, the method developed had to be able to quickly detect faults in the bidirectional path between two neighbors in the forwarding plane.  Originally, BFD was developed to provide a simple mechanism to be used on Ethernet links, as in the example above, prior to the development of Ethernet OAM capabilities.  Hence, BFD was developed with this express purpose in mind, with the intent of providing fault identification in an end-to-end path between two neighbors.

Once BFD was developed, the protocol designers quickly found that it could be used for numerous applications beyond simply Ethernet.  In fact, one of the main benefits of BFD is that it provides a common method to provide for failure detection for a large number of protocols, allowing a singular, centralized method which can be reused.  In other words, let routing protocols do what they do best - exchange routing information and recalculate routing tables as necessary, but not perform identification of faults at lower layers.  An offshoot of this is that it allows network operators to actually configure higher protocol timer values for their IGPs, further reducing the burden placed on the Routing Engine.

BFD timers can be tuned such that failure detection can be realized in just a few milliseconds, allowing for failure detection and reconvergence to take place in similar timeframes to that of SONET Automatic Protection Switching.  A word of caution - while BFD can dramatically decrease the time it takes to detect a failure, operators should be careful when setting the intervals too low.  Very aggressive BFD timers could cause a link to be declared down even when there is only a slight variance in the link quality, which could cause flapping and other disastrous behavior to ensue.  The best current practice with regards to BFD timers is to set a transmit and receive interval of 300ms and a multiplier of 3, which equates to 900ms for failure detection.  This is generally considered fine for most environments, and only the most stringent of environments should need to set their timers more aggressively than this.
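As a hedged example of the best current practice described above, BFD might be enabled for an OSPF neighbor in Junos along these lines (the interface and values are illustrative):

protocols {
    ospf {
        area 0.0.0.0 {
            interface ge-0/0/0.0 {
                bfd-liveness-detection {
                    minimum-interval 300;   /* transmit and receive interval, in milliseconds */
                    multiplier 3;           /* three missed hellos, i.e. 900ms, declares the neighbor down */
                }
            }
        }
    }
}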

One question that is commonly asked is how BFD can send hello packets at millisecond intervals without becoming a burden on the router.  The answer lies in the fact that BFD was intended to be lightweight and run in the forwarding plane, as opposed to the control plane (as is the case with routing protocols).  It is true that while early implementations of BFD ran on the control plane, most of the newer implementations run in the forwarding plane, taking advantage of the dedicated processors built into the forwarding plane and alleviating the burden which would otherwise be placed on the RE.  In JUNOS releases prior to JUNOS 9.4, BFD Hello packets were generated by RPD running on the RE.  In order to enable BFD to operate in the PFE in JUNOS versions prior to JUNOS 9.4, the Periodic Packet Management Daemon (PPMD) had to be delegated this task using the command 'set routing-options ppm delegate-processing'.  In JUNOS 9.4 and higher this is the default behavior, and BFD Hello packets are automatically handled by PPMD operating within the PFE.


Implementing Provider-Provisioned VPNs using Route Reflectors

Written by Stefan Fouant

MPLS/BGP Provider-Provisioned VPNs, such as those proposed in RFC 4364 (formerly RFC 2547) or draft-kompella variants, suffer from some scalability issues due to the fact that all PE routers are required to have a full iBGP mesh in order to exchange VPN-IPv4 NLRI and associated VPN label information.  In a modern network consisting of a large number of PE devices, it becomes readily apparent that this requirement can quickly become unmanageable.

The formula to compute the number of sessions for an iBGP full mesh is n * (n-1)/2.  10 PE devices would only require a total of 45 iBGP sessions (10 * 9/2 = 45).  However, by simply adding 5 additional PEs into this environment, the total number of sessions grows quadratically to 105.  Scalability issues arise because maintaining this number of iBGP sessions on each PE is an operational nightmare; similarly, control plane resources are quickly exhausted.

An alternative that has gained widespread adoption is to utilize Route Reflectors to reflect the VPN-IPv4 NLRI and associated VPN label between PE devices.  However, several issues arise when using Route Reflectors in such an environment.  In a normal environment without the use of Route Reflectors, MPLS tunnels exist between each PE router such that when the VPN-IPv4 NLRI and associated VPN label are received, a PE router can recurse through its routing table to find the underlying MPLS tunnel used to reach the remote BGP next-hop within the VPN-IPv4 NLRI.  In the Route Reflection model, the Route Reflector typically doesn't have an MPLS tunnel to each PE for which it is receiving VPN-IPv4 NLRI.  As a result, these routes never become active and are therefore not candidates for reflection back to other client and non-client peers.

A few methods have been developed to circumvent this issue.  One method is to simply define MPLS tunnels from the Route Reflector to each PE.  This solves the problem by allowing the Route Reflector to find a recursive match (i.e. an MPLS tunnel) in order to reach the remote PE.  However, this approach suffers from the drawback that it requires a large number of MPLS tunnels to be configured which only serve to allow the received VPN-IPv4 NLRI to be considered active.  Remember, these tunnels are completely useless in that they will never be used for the actual forwarding of data; they are only used within the control plane to instantiate routes.

An alternative and much more graceful solution to this problem is to configure the Route Reflector with a static discard route within the routing table which is used to resolve BGP next-hops in MPLS environments (inet.3, for example, in JUNOS).  This static discard route serves only to function as a recursive match when incoming VPN-IPv4 NLRI are received, for the express purpose of making these routes active and therefore candidates for redistribution.  In JUNOS, one can accomplish this with a default discard route in inet.3, using the following configuration:

routing-options {
    rib inet.3 {
        static {
            route 0.0.0.0/0 discard;
        }
    }
}
With the above, any VPN-IPv4 NLRI received from a PE router is immediately made active due to the fact that a static route now exists in inet.3, which is the routing table used in JUNOS to recurse for BGP next-hops in MPLS environments.
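For completeness, a sketch of what the corresponding BGP configuration on the Route Reflector might look like is shown below; the group name and addresses are hypothetical:

protocols {
    bgp {
        group rr-clients {
            type internal;
            local-address 10.0.0.100;    /* hypothetical Route Reflector loopback */
            cluster 10.0.0.100;          /* marks the peers in this group as route reflector clients */
            family inet-vpn {
                unicast;                 /* carry VPN-IPv4 NLRI */
            }
            neighbor 10.0.0.1;           /* PE1 */
            neighbor 10.0.0.2;           /* PE2 */
        }
    }
}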

An excellent whitepaper entitled "BGP Route Reflection in Layer 3 VPN Networks" expands upon this and describes the benefits of using Route Reflection in such environments. It also builds the case for using a distributed Route Reflection design to further enhance scalability and redundancy.

One thing to keep in mind is that with the Route Reflector approach, we have merely moved the problem set from the PE device to the Route Reflector.  Although it minimizes the number of iBGP sessions required on PE devices, the Route Reflector must be capable of supporting a large number of iBGP sessions and, in addition, must be able to store all of the VPN-IPv4 NLRI for all of the VPNs it is servicing.  It is highly recommended that adequate amounts of memory are in place on the Route Reflector in order to store this large amount of routing information.

Finally, while using Route Reflectors is an acceptable solution in the interim to addressing scaling concerns with Provider-Provisioned VPNs, it is not clear if this approach is sufficient for the long term.  There are several other options being examined, with some of them outlined in a presentation entitled "2547 L3VPN Control Plane Scaling" given at APRICOT in Kyoto, Japan in 2005.