ShortestPathFirst Network Architecture and Design, and Information Security Best Practices

21Sep/114

What’s a Steiner Tree?

Any of you who have worked with VPLS or NG-MVPNs are likely already familiar with using Point-to-Multipoint (P2MP) LSPs to get traffic from a single ingress PE to multiple egress PEs.  The reason that P2MP LSPs are desired in these cases is that it can reduce unnecessary replication by doing so only where absolutely required, for example where a given P2MP LSP must diverge in order to reach two different PEs.

However, typically the sub-LSPs which are part of a given P2MP LSP traverse the shortest-path from ingress to egress based on whatever user defined constraints have been configured.  While this is fine for many applications, additional optimizations might be required such that additional bandwidth savings can be realized.

We will take a look at something called a Steiner-Tree which can help the network operator to realize these additional savings, when warranted, reducing the overall bandwidth used in the network and fundamentally changing the way in which paths are computed.

Let's start by taking a look at a simple example in which RSVP is used to signal a particular P2MP LSP, but no constraints are defined.  All the links in this network have a metric of 10.  In this case, the sub-LSPs will simply traverse along the shortest path in the network, as can be seen in the diagram below.

Here we see a P2MP LSP where PE1 is the ingress PE and PE2, PE3, and PE4 are all egress nodes.  Since no constraints have been defined the calculated ERO for each of the sub-LSPs will follow along the shortest path where we can see one sub-LSP taking the PE-P1-P2-PE2 path, another is taking the PE1-P1-P3-PE3 path, and the third is taking the PE1-P1-P4-PE4 path.  In this case, each sub-LSP has a total end-to-end cost of 30.

Shortest Tree

Under many circumstances this type of tree would be perfectly acceptable, especially when the end-goal is the minimize end-to-end latency, however there are other cases where we may want to introduce additional hops in an effort to reduce overall bandwidth utilization.  This is where the concept of a minimum-cost tree, otherwise known as a Steiner Tree, comes into play.

This may seem counter-intuitive at first; after all, doesn't a shortest-path tree attempt to minimize costs?  The answer is yes, but it usually only does so by looking at costs in terms of end-to-end metrics or hops through a network.  Once you understand the mechanics of the Steiner Tree algorithm, and how it attempts to minimize the total number of interconnects, it starts to make more sense.

According to Wikipedia, "the Steiner tree problem, or the minimum Steiner tree problem, named after Jakob Steiner, is a problem in combinatorial optimization, which may be formulated in a number of settings, with the common part being that it is required to find the shortest interconnect for a given set of objects".

That's a pretty fancy way of saying it's attempting to optimize the path to be the shortest path possible while at the same time reducing the total number of interconnects between all devices to only those that are absolutely required.

Steiner Tree optimizations are very useful where an ingress PE must send large amounts of data to multiple PEs and it is preferable to ensure that overall bandwidth utilization is reduced, perhaps because of usage-based billing scenarios which require that overall circuit utilization be reduced as much as possible in order to save money.

Let's take a look at an example, once again using the same network as before, but this time performing a Steiner Tree optimization whereby cost is measured in terms of overall bandwidth utilization.  In this case we still see that we have the requirement to build the P2MP LSP from PE1 to PE2, PE3, and PE4.  However, this time we are going to compute an ERO such that replication will only take place where absolutely necessary in order to reduce the total number of interconnects and hence overall bandwidth utilization.

After performing a Steiner Tree path computation, we determine that PE3 is a more logical choice to perform the replication to PE2 and PE4, even though it increases the overall end-to-end metric cost to 40.  The reason for this is we have now effectively eliminated the bandwidth utilization on the P1-P2, P2-PE2, P1-P4, and P4-PE4 links.  In effect, we've gone from utilizing bandwidth across seven links to only five.  If the P2MP LSP was servicing a 100 Mbps video stream, we have just effectively reduced overall bandwidth utilization on the network as a whole by 200 Mbps.

Steiner Tree

One of the interesting side-effects of this approach is that we now see that PE3 is not only an egress node, but it is now also a transit node as well (for the sub-LSPs terminating at PE2 and PE4).  Due to this, we'll also see that in these types of scenarios the Penultimate Hop Popping (PHP) behavior is different on P3 in that we don't want it popping the outer label before sending frames to PE3 since PE3 may need to accommodate labeled packets heading to PE2 or PE3.  We will cover some of this in a subsequent article on the signaling mechanisms inherent in P2MP LSPs and some of the changes to the behavior in MPLS forwarding state.

Path computation for P2MP LSPs can be complex, especially when the goal is create Steiner Trees.  The reason for this added complexity when computing Steiner Trees is that sub-LSP placement has a direct correlation with other sub-LSPs, which is contrary to what happens when shortest-path trees are calculated where each sub-LSP may be signaled along their own unique path without regard to the placement of other sub-LSPs.

As with traditional LSPs, similar methods of determining the paths through the network and hence the ERO can be used, i.e. manual, offline computation.

The easiest approach would be to use constructs like Link Coloring (Affinity Groups for you Cisco wonks) to influence path selection, for example, by coloring the PE1-P1, P1-P3, P3-PE3, PE3-PE2, and PE3-PE4 links with an included color, or coloring the remaining links with a different color and excluding that color from the LSP configuration.

However, this approach is merely a trick.  We are feeding elements into the CSPF algorithm such that the shortest path which is calculated essentially mimics that of a Steiner Tree.  In other words, it's not a true Steiner Tree calculation because the goal was not to reduce the total number of interconnects, but rather to only utilize links of an included color.

Furthermore, such an approach doesn't easily accommodate failure scenarios in which PE3 may go down, because even though Fast Reroute or Link/Node Protection may be desired, if the remaining links do not have the included colors they may be unable to compute an ERO for signaling.

Workarounds to this approach are to configure your Fast Reroute Detours or your Link/Node Protection Bypass LSPs to have more relaxed constraints, such that any potential path might be used.  However, more commonly what you'll see is that some type of additional computations might be performed using traditional offline approaches (using modeling tools such as those provided by vendors such as WANDL, OPNET, or Cariden) which factors both steady-state as well as failure scenarios to assist the operator in determining optimal placement of all elements.

An interesting side-note is that there are some pretty significant developments underway whereby online computation can be performed in such a way as to optimize all P2MP LSPs network-wide, using something known as Path Computation Elements (PCEs).  These are essentially any entity which is capable of performing path computation for any set of paths throughout a network by applying various constraints.  It is something that looks to be especially useful in large carrier networks consisting of many LSPs, and especially so in the case of Steiner Tree P2MP LSPs where the sub-LSP placement is highly dependent on others.  See the charter of the PCE Working Group in the IETF for more information on this and other related developments.

As a side note, it should be fairly evident that in order to perform path optimizations on anything other than shortest-path trees (i.e. Steiner Trees or any other type of tree based on user-defined constraints), RSVP signaling must be used in order to signal a path along the computed ERO.  LDP certainly can be used to build P2MP LSPs (aka mLDP), however much like traditional LSPs built via LDP, the path follows the traditional IGP path.

Stay tuned as we will cover more exciting articles on P2MP LSPs and some of the other underpinnings behind many of the next generation MPLS services being commonly deployed.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

21Jun/1112

JNCIE Tips from the Field :: Summarization Made Easy

Today we'll start with a series of articles covering tips and techniques that might be utilized by JNCIE candidates, whether pursuing the JNCIE-SP, JNCIE-ENT, or even the JNCIE-SEC.  The tips and techniques I will be covering might prove to be useful during a lab attempt but could also be used in real-world scenarios to save time and minimize configuration burden in addition to eliminating mistakes that might otherwise be made.  I want everyone to understand that what I am about to write is simply a technique.  I am not divulging any materials or topics which are covered under NDA.

NOTE: For full disclosure, I must reveal that I am an employee of Juniper Networks in their Education Services department.  As such, I take the responsibility of protecting the content and integrity of the exam as well as the certification credentials very seriously.  I would never reveal anything which would allow a candidate to have in-depth knowledge of any specific topics or questions that may appear on the exam.  Not only that, I worked REALLY, REALLY hard to achieve my JNCIE certifications, and I believe everyone else should too! It's certainly more rewarding that way too don't you think?!

So without further delay, let's take a look at today's technique.

It is well known that sumarization is a key aspect of any type of practical exam involving routing of some sort.  Those who have ever taken a CCIE Routing & Switching or CCIE Service Provider exam can attest, summarization is one thing every expert level candidate needs to master.  It is no different with Juniper.  In fact, Juniper's certification web page clearly lists as one of the JNCIE-ENT exam objectives the requrement to "Filter/summarize specific routes". 

What I will show you next is a technique which I find quite handy when attempting to determine the best summary for a given route, and you can do so without having to resort to pen and paper and figuring it out the old fashioned way, i.e. looking at prefixes in binary. This technique, rather, allows you to use the power of Junos to your advantage to perform these tasks.  What I will reveal will also show you a fundamental difference between IOS and Junos and highlights why I believe Junos to be a more flexible, powerful, and superior network operating system.  You simply can't do what I am about to do on a Cisco platform running IOS.

So let's start off by looking at a diagram.  Let's say we have a network that has several OSPF areas, and we must summarize some information for each respective area towards the backbone without encompassing routing information that might exist outside of that area.

ospf-summarization-example

Here we can see we have a backbone area, consisting of two routers, P1 and P2.  P1 is acting as an ABR for Area 1 and is connected to both R1 and R2. P2 is acting as an ABR for Area 2 and is connected to R3.  As you can see from the diagram, I have configured more than a single set of IP addresses on many of the physical interfaces as well as the loopbacks.  This way I can represent many more networks and therefore create multiple Network LSAs for purposes of summarization.

So let's assume that we need to create the largest aggregate possible for a particular area and advertise only that aggregate towards the core without encompassing any routes which might be outside the area from which the summary describes.  Now normally, one would take a look at the diagram, get out a pen and paper, and start a lengthy exercise of supernetting based on binary addresses.  This can take several minutes or more and is valuable time that could certainly be used on wide variety of other more important tasks like setting up MPLS LSPs or troubleshooting that Layer 2 VPN connectivity.  So let's take a look at a simple trick that actually takes advantage of Junos to determine what the summary should be.

What we are going to do is take advantage of a feaure inside Junos which automatically shows us a range of prefixes which match a given CIDR block.  The Junos operating system has built-in route matching functionality which allows us to specify a given CIDR block and returns all routes with a mask length equal to or greater than that which is specified.  So by applying this principle, what we want to do is look at the diagram for a particular area, choose the lowest IP within that area as our base, and then apply a subnet mask to it which attempts to encompass that route as well as others. 

For example, looking at this diagram, we see that the lowest IP address being used in Area 1 is the 168.10.32.1 address assigned to R1's loopback.  So let's start by using this as our base for our summary, and then simply apply a subnet mask to it which we think might encompass additional routes:

sfouant@p1# run show route 168.10.32.0/22   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.32.1/32     *[OSPF/10] 8w4d 19:04:48, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.2/32     *[OSPF/10] 8w4d 19:04:48, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.3/32     *[OSPF/10] 8w4d 19:04:43, metric 1
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
168.10.32.4/32     *[OSPF/10] 8w4d 19:04:43, metric 1
                    > to 168.10.48.10 via ge-0/0/1.0
                      to 168.10.60.10 via ge-0/0/1.0


Note: We can do this on any router within Area 1 since the Link-State Database is the same on all devices, but I prefer to perform the work on the ABR since this is where I will be performing the aggregation.  Also, the ABR may have other local and/or direct routes (or perhaps routes from other protocol sources) so we want to see things from the perspective of the ABR.

What we see here is that we have just now determined the summary route which in fact encompasses all the loopback addresses on both R1 as well as R2, but we need to keep going because this doesn't incorporate the Gigabit Ethernet links between all the devices:

sfouant@p1# run show route 168.10.32.0/21   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.32.1/32     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.2/32     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.3/32     *[OSPF/10] 8w4d 19:04:45, metric 1
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
168.10.32.4/32     *[OSPF/10] 8w4d 19:04:45, metric 1
                    > to 168.10.48.10 via ge-0/0/1.0
                      to 168.10.60.10 via ge-0/0/1.0


Not quite. Let's keep trying:

sfouant@p1# run show route 168.10.32.0/20   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.32.1/32     *[OSPF/10] 8w4d 19:04:55, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.2/32     *[OSPF/10] 8w4d 19:04:55, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.3/32     *[OSPF/10] 8w4d 19:04:50, metric 1
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
168.10.32.4/32     *[OSPF/10] 8w4d 19:04:50, metric 1
                    > to 168.10.48.10 via ge-0/0/1.0
                      to 168.10.60.10 via ge-0/0/1.0


Nope, still not there yet. Let's try again:

sfouant@p1# run show route 168.10.32.0/19   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.32.1/32     *[OSPF/10] 8w4d 19:04:58, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.2/32     *[OSPF/10] 8w4d 19:04:58, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.3/32     *[OSPF/10] 8w4d 19:04:53, metric 1
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
168.10.32.4/32     *[OSPF/10] 8w4d 19:04:53, metric 1
                    > to 168.10.48.10 via ge-0/0/1.0
                      to 168.10.60.10 via ge-0/0/1.0
168.10.48.4/30     *[OSPF/10] 00:36:26, metric 2
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
                      to 168.10.60.2 via ge-0/0/0.0
168.10.48.8/30     *[Direct/0] 8w4d 19:36:13
                    > via ge-0/0/1.0
168.10.48.9/32     *[Local/0] 8w4d 19:36:13
                      Local via ge-0/0/1.0
168.10.60.0/30     *[Direct/0] 8w4d 19:51:31
                    > via ge-0/0/0.0
168.10.60.1/32     *[Local/0] 8w4d 19:51:31
                      Local via ge-0/0/0.0
168.10.60.4/30     *[OSPF/10] 00:36:26, metric 2
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
                      to 168.10.60.2 via ge-0/0/0.0
168.10.60.8/30     *[Direct/0] 8w4d 19:36:13
                    > via ge-0/0/1.0
168.10.60.9/32     *[Local/0] 8w4d 19:36:13
                      Local via ge-0/0/1.0


Ok, this looks more like it.  Here we can see we have all the Gigabit Ethernet links connecting all devices, as well as the loopback addresses.  This might be a suitable summary.  Let's keep going to see what happens:

sfouant@p1# run show route 168.10.32.0/18   
inet.0: 28 destinations, 28 routes (28 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.0.1/32      *[Direct/0] 8w4d 19:51:37
                    > via lo0.0
168.10.0.2/32      *[Direct/0] 8w4d 19:36:19
                    > via lo0.0
168.10.0.3/32      *[OSPF/10] 00:28:41, metric 1
                    > to 168.10.18.2 via fe-0/0/2.0
                      to 168.10.26.2 via fe-0/0/2.0
168.10.0.4/32      *[OSPF/10] 00:28:41, metric 1
                    > to 168.10.18.2 via fe-0/0/2.0
                      to 168.10.26.2 via fe-0/0/2.0
168.10.18.0/30     *[Direct/0] 8w4d 19:05:28
                    > via fe-0/0/2.0
168.10.18.1/32     *[Local/0] 8w4d 19:05:28
                      Local via fe-0/0/2.0
168.10.26.0/30     *[Direct/0] 8w4d 19:05:28
                    > via fe-0/0/2.0
168.10.26.1/32     *[Local/0] 8w4d 19:05:28
                      Local via fe-0/0/2.0
168.10.32.1/32     *[OSPF/10] 8w4d 19:05:04, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.2/32     *[OSPF/10] 8w4d 19:05:04, metric 1
                    > to 168.10.60.2 via ge-0/0/0.0
168.10.32.3/32     *[OSPF/10] 8w4d 19:04:59, metric 1
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
168.10.32.4/32     *[OSPF/10] 8w4d 19:04:59, metric 1
                    > to 168.10.48.10 via ge-0/0/1.0
                      to 168.10.60.10 via ge-0/0/1.0
168.10.48.4/30     *[OSPF/10] 00:36:32, metric 2
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
                      to 168.10.60.2 via ge-0/0/0.0
168.10.48.8/30     *[Direct/0] 8w4d 19:36:19
                    > via ge-0/0/1.0
168.10.48.9/32     *[Local/0] 8w4d 19:36:19
                      Local via ge-0/0/1.0
168.10.60.0/30     *[Direct/0] 8w4d 19:51:37
                    > via ge-0/0/0.0
168.10.60.1/32     *[Local/0] 8w4d 19:51:37
                      Local via ge-0/0/0.0
168.10.60.4/30     *[OSPF/10] 00:36:32, metric 2
                      to 168.10.48.10 via ge-0/0/1.0
                    > to 168.10.60.10 via ge-0/0/1.0
                      to 168.10.60.2 via ge-0/0/0.0
168.10.60.8/30     *[Direct/0] 8w4d 19:36:19
                    > via ge-0/0/1.0
168.10.60.9/32     *[Local/0] 8w4d 19:36:19
                      Local via ge-0/0/1.0


Clearly from this command, we can see we have now gone beyond what might be considered a suitable summary because we are now encompassing routes that exist within the backbone Area 0.  So it should be clear from this simple set of commands that the 168.10.32.0/19 would be a suitable address to use for our summary.

We could easily apply a similar example to Area 2 to quickly determine what the best summary would be.  We see from looking at the diagram the lowest IP within Area 2 is the 168.10.96.1 loopback address applied to R3.  When we use that as our base and go through the steps above, we can find our summary:

sfouant@p2# run show route 168.10.96.0/19   
inet.0: 27 destinations, 27 routes (27 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
168.10.96.1/32     *[OSPF/10] 01:00:57, metric 1
                      to 168.10.112.2 via fe-0/0/3.0
                    > to 168.10.118.2 via fe-0/0/3.0
168.10.96.2/32     *[OSPF/10] 01:00:57, metric 1
                    > to 168.10.112.2 via fe-0/0/3.0
                      to 168.10.118.2 via fe-0/0/3.0
168.10.112.0/30    *[Direct/0] 01:13:48
                    > via fe-0/0/3.0
168.10.112.1/32    *[Local/0] 01:13:48
                      Local via fe-0/0/3.0
168.10.118.0/30    *[Direct/0] 01:13:48
                    > via fe-0/0/3.0
168.10.118.1/32    *[Local/0] 01:13:48
                      Local via fe-0/0/3.0


And there you have it! As you can see it's really quite simple and if you haven't stumbled upon this already you may be saying to yourself, "Why didn't I think of that before?".  I hear from many candidates that they spend considerable time the old fashioned way to determine summaries and I always ask myself why.  As you can see, there is an easier way!

Clearly the benefit to using a technique such as the above is to easily find the routes that best summarize a bunch of more specific routes.  The utility of such an approach, while very useful during a practical exam, might be considerably lessened in the real-world where it is likely that hierarchy has already been built into the network and you have network diagrams at your disposal.  On the other hand, there may be situations where you inherit a network that was developed with hierarchy in mind, however summarization was never employed, or it was employed improperly.  In such cases, the above technique can be a real time saver, allowing you to spend less time doing binary math and more time doing the fun stuff - like troubleshooting why that MPLS LSP isn't getting established!

Stay tuned for additional articles covering time saving tips and techniques which can be used during your next lab attempt!  Good luck, and may the force be with you!

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

11Apr/112

IETF Provides New Guidance on IPv6 End-Site Addressing

I've always been at odds with the recommendation in RFC 3177 towards allocating /48 IPv6 prefixes to end-sites.  To me this seemed rather short-sighted, akin to saying that 640K of memory should be enough for anybody.  It's essentially equivalent to giving out /12s in the IPv4 world which in this day and age might seem completely ridiculous, but let us not forget that in the early days of IPv4 it wasn't uncommon to get a /16 or even a /8 in some cases.

Granted, I know there are quite a few more usable bits in IPv6 than there are in IPv4, but allocating huge swaths of address space simply because it's there and we haven't thought of all the myriad ways it could be used in the future just seems outright wasteful.

So you can imagine my surprise and also my elation last week when the IETF published RFC 6177 entitled 'IPv6 Address Assignment to End Sites'.  In it, the general recommendation of allocating /48s to end-sites that has long been the defacto standard since the original publication of RFC 3177 in 2001 has finally been reversed.

It seems that sanity has finally prevailed and the IAB/IESG have decided to take a more pragmatic approach towards address allocation in IPv6.  The recommendations in RFC 6177 attempt to balance the conservation of IPv6 addresses while at the same time continuing to make it easy for IPv6 adopters to get the address space that they require without requiring complex renumbering and dealing with other scaling inefficiencies in the long term.  It is clear that acting too conservatively and allocating very small address spaces could act as a disincentive and possibly stifle widespread adoption of IPv6.

The new current recommendations for address allocations are as follows:

  • /48 in the general case, except for very large subscribers
  • /64 when it is known that one and only one subnet is needed by design
  • /128 when it is absolutely known that one and only one device is connecting

It goes on to state other recommendations and offers guidance to operators with regards to when to allocate certain prefix lengths.  But essentially, what this means is that now individual network operators have more options regarding which prefix size to allocate, and allows them to move away from strict general guidelines.  In essence, operators make the decision as to what prefix size to allocate based on an analysis of the needs of particular customers.

Perhaps this practical conservation may never be needed given the trillions of address space available in IPv6, but maybe, just maybe, in the very distant future if IPv6 is still in widespread use, it could very well be due to some of these recommendations being put in place today.  After all, 640K did turn out to be a rather small number didn't it?

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

27Jul/101

Book Review :: JUNOS High Availability: Best Practices for High Network Uptime

JUNOS_High_AvailabilityJUNOS High Availability: Best Practices for High Network Uptime
by James Sonderegger, Orin Blomberg, Kieran Milne, Senad Palislamovic
Paperback: 688 pages
Publisher: O'Reilly Media
ISBN-13: 978-0596523046

5starsHigh Praises for JUNOS High Availability

Building a network capable of providing connectivity for simple business applications is a fairly straightforward and well-understood process. However, building networks capable of surviving varying degrees of failure and providing connectivity for mission-critical applications is a completely different story. After all, what separates a good network from a great network is how well it can withstand failures and how rapidly it can respond to them.

While there are a great deal of books and resources available to assist the network designer in establishing simple network connectivity, there aren't many books which discuss the protocols, technologies, and the myriad ways in which high availability can be achieved, much less tie it all together into one consistent thread. "JUNOS High Availability" does just that, in essence providing a single, concise resource covering all of the bits and pieces which are required in highly available networks, allowing the network designer to build networks capable of sustaining five, six, or even seven nines of uptime.

In general, there are a lot of misconceptions and misunderstandings amongst Network Engineers with regards to implementing high availability in Junos. One only needs to look at the fact that Graceful Restart (GR) protocol extensions and Graceful Routing Engine Switchover (GRES) are often mistaken for the same thing, thanks in no small part to the fact that these two technologies share similar letters in their acronyms. This book does a good job of clarifying the difference between the two and steers clear of the pitfalls typically prevalent in coverage of the subject matter. The chapter on 'Control Plane High Availability' covers the technical underpinnings of the underlying architecture on most Juniper platforms; coverage of topics like the separation between the control and forwarding planes, and kernel replication between the Master and Backup Routing Engine give the reader a solid foundation to understand concepts like Non-Stop Routing, Non-Stop Bridging, and In-Service Software Upgrades (ISSU). In particular I found this book to be very useful on several consulting engagements in which seamless high availability was required during software upgrades as the chapter on 'Painless Software Upgrades' discusses the methodology for achieving ISSU and provides a checklist of things to be performed before, during, and after the upgrade process. Similarly, I found the chapter on 'Fast High Availability Protocols' to be very informative as well, providing excellent coverage of BFD, as well as the differences between Fast Reroute vs. Link and Node Protection.

Overall I feel this book is a valuable addition to any networking library and I reference it often when I need to implement certain high availability mechanisms, or simply to evaluate the applicability of a given mechanism versus another for a certain deployment. The inclusion of factoring costs into a high availability design is a welcome addition and one that all too many authors fail to cover. Naturally, it only makes sense that costs should be factored into the equation, even when high availability is the desired end-state, in order to ensure that ultimately the business is profitable. If I had to make one suggestion for this book it is that there should be additional coverage of implementing High Availability on the SRX Series Services Gateways using JSRP, as this is a fundamental high availability component within Juniper's line of security products. To the authors credit however, this book was written just as the SRX line was being released, so I don't fault the authors for providing limited coverage. Perhaps more substantial coverage could be provided in the future if a Second Edition is published.

The bottom line is this - if you are a Network Engineer or Architect responsible for the continuous operation or design of mission-critical networks, "JUNOS High Availability" will undoubtedly serve as an invaluable resource. In my opinion, the chapters on 'Control Plane High Availability', 'Painless Software Upgrades', and 'Fast High Availability Protocols' are alone worth the entire purchase price of the book. The fact that you get a wealth of information beyond that in addition to the configuration examples provided makes this book a compelling addition to any networking library.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

21Jun/101

Reality Check: Traditional Perimeter Security is Dead!

Recently I came across a marketing event promoted by a network integrator which touted industry leading solutions to assist customers in determining "what was lurking outside their network".

In this day and age, it still surprises me when supposedly network savvy folks are still thinking of network security in terms of a traditional perimeter made up of firewalls or IPS devices. The truth of the matter is that the traditional perimeter vanished quite a few years ago.

Only looking at the perimeter gives the end-user a a false sense of protection. It completely fails to recognize the dangers of mobility in today's traditional workplace environment. Users roam. They might bring in viruses or other Trojans INSIDE your network where they are free to roam unencumbered. In the worst of these cases, the perimeter is only secured in one direction, giving outbound traffic unfettered access and completely ignoring that data which might be leaked from hosts inside your network destined to hosts outside your network, as might be the case with Keyloggers or other similar types of rogue programs.

Furthermore, in today's environment composed of virtualized machines, the line gets even blurrier which is why we are starting to see solutions from startup vendors such as Altor Networks. It’s one thing when we are dealing with physical hosts in the traditional sense, but what about the situation when you are dealing with a multitude of virtual machines on the same physical hosts which must talk to each other?

When you take a data-focused approach instead of a technology-focused approach, the problem and its solutions start to make more sense.   The perimeter should be viewed as the demarcation between the data and any I/O fabric providing connectivity between that data and some external entity. This is the domain of things like Data Loss Prevention (DLP), Network Access Control (NAC), and Virtual Hypervisor Firewalls in addition to that of traditional security devices.

trojan-horse

To deal with the realities of today, we must start to think of network security in terms of Hotels vs. Castles. In the Castle model, we have a big wall around our infrastructure. We might have a moat and some alligators, and perhaps we only lower our drawbridge for very special visitors. This model tends to keep a good majority of the enemies at bay, but it completely ignores that which might already be inside your network (think in terms of the Trojan horse as told in Virgil's epic poem 'The Aeneid').

What is more commonly being employed is that of the Hotel Model.  Initially, to gain entrance into the hotel itself, we must check in with the Concierge and get our room key.  Once we have our room key, we have limited access to our own room, and perhaps some shared facilities like the pool or the gym.  In this model, we are unable to enter into a room in which we do not have access.  The key word here is LIMITED access.

An all-inclusive security posture looks at the network from a holistic point of view.  The principles of Defense-in-Depth will make evident the failings of the traditional perimeter model.  The traditional perimeter is dead.  The perimeter is wherever the data is.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

1Feb/101

What’s the BFD with BFD?

Many networks today are striving for "five nines" high availability and beyond. What this means is that network operators must configure the network to detect and respond to network failures as quickly as possible, preferably on the order of milliseconds. This is in contrast to the failure detection inherent in most routing protocols, which is typically on the order of several seconds or more. For example, the default hold-time for BGP in JUNOS is 90 seconds, which means that in certain scenarios BGP will have to wait for upwards of 90 seconds before a failure is detected, during which time a large percentage of traffic may be blackholed. It is only after the failure is detected that BGP can reconverge on a new best path.

Another example is OSPF which has a default dead interval of 40 seconds, or IS-IS which has a default hold-time of 9 seconds (for DIS routers), and 27 seconds (for non-DIS routers). For many environments which support mission-critical data, or those supporting Voice/Video or any real-time applications, any type of failure which isn't detected in the sub-millisecond range is too long.

While it is possible to lower timers in OSPF or IS-IS to such an extent that a failure between two neighbors can be detected rather quickly (~1 second), it comes at a cost of increased protocol state and considerable burden on the Routing Engine's CPU.  As an example, let us consider the situation in which a router has several hundred neighbors. Maintaining subsecond Hello messages for all of these neighbors will dramatically increase the amount of work that the Routing Engine must perform. Therefore, it is a widely accepted view that a reduction in IGP timers is not the overall best solution to solve the problem of fast failure detection.

Another reason that adjusting protocol timers is not the best solution is that there are many protocols which don't support a reduction of timers to such an extent that fast failure detection can be realized. For example, the minimum BGP holdtime is 20 seconds, which means that the best an operator can hope for is a bare minimum of 20 seconds for failure detection.

Notwithstanding, this does nothing about situations in which there is no protocol at all, for example, Ethernet environments in which two nodes are connected via a switch as can be seen in the figure below.  In this type of environment, R1 has no idea that R2 is not reachable, since R1's local Ethernet segment connected to the switch remains up.  Therefore, R1 can't rely on an 'Interface Down' event to trigger reconvergence on a new path and instead must wait for higher layer protocol timers to age out before determining that the neighbor is not reachable.  (Note to the astute reader: Yes, Ethernet OAM is certainly one way to deal with this situation, but that is a discussion which is beyond the scope of this article).

L2-Connectivity

Essentially, at the root of the problem is either a lack of suitable protocols for fast failure detection of lower layers, or worse, no protocol at all.  The solution to this was the development of Bidirectional Forwarding Detection, or BFD, developed jointly by Cisco and Juniper.  It has been widely deployed and is continuing to gain widespread acceptance, with more and more protocols being adapted to use BFD for fast failure detection. 

So what is the Big Freaking Deal with Bidirectional Forwarding Detection anyway and why are so many operators implementing it in their networks?  BFD is a simplistic hello protocol with the express purpose of rapidly detecting failures at lower layers.  The developers wanted to create a low overhead mechanism for exchanging hellos between two neighbors without all the nonessential bits which are typical in an IGP hello or BGP Keepalives.  Furthermore, the method developed had to be able to quickly detect faults in the Bidirectional path between two neighbors in the forwarding plane.  Originally, BFD was developed to provide a simple mechanism to be used on Ethernet links, as in the example above, prior to the development of Ethernet OAM capabilities.  Hence, BFD was developed with this express purpose in mind with the intent of providing fault identification in an end-to-end path between two neighbors.

Once BFD was developed, the protocol designers quickly found that it could be used for numerous applications beyond simply Ethernet.  In fact, one of the main benefits of BFD is that it provides a common method to provide for failure detection for a large number of protocols, allowing a singular, centralized method which can be reused.  In other words, let routing protocols do what they do best - exchange routing information and recalculate routing tables as necessary, but not perform identification of faults at lower layers.  An offshoot of this is that it allows network operators to actually configure higher protocol timer values for their IGPs, further reducing the burden placed on the Routing Engine.

BFD timers can be tuned such that failure detection can be realized in just a few milliseconds, allowing for failure and reconvergence to take place in similar timeframes to that of SONET Automatic Protection Switching.  A word of caution - while BFD can dramatically decrease the time it takes to detect a failure, operators should be careful when setting the intervals too low.  Very aggressive BFD timers could cause a link to be declared down even when there is only a slight variance in the link quality, which could cause flapping and other disastrous behavior to ensue.  The best current practice with regards to BFD timers is to set a transmit and receive interval of 300ms and a multiplier of 3, which equates to 900ms for failure detection.  This is generally considered fine for most environments, and only the most stringent of environments should need to set their timers more aggressive than this.

One question that is commonly asked is how is it that BFD can send hello packets in the millisecond range without becoming a burden on the router.  The answer to this question lies in the fact that BFD was intended to be lightweight and run in the forwarding plane, as opposed to the control plane (as is the case with routing protocols).  It is true that while early implementations of BFD ran on the control plane, most of the newer implementations run in the forwarding plane, taking advantage of the dedicated processors built into the forwarding plane and alleviating the burden which would otherwise be place on the RE.  In JUNOS releases prior to JUNOS 9.4, BFD Hello packets were generated via RPD running on the RE.  In order to enable BFD to operate in the PFE in JUNOS versions prior to JUNOS 9.4, the Periodic Packet Management Daemon (PPMD) had to be enabled, using the command 'set routing-options ppm delegate processing'.  In JUNOS 9.4 and higher this is the default behavior and BFD Hello packets are automatically handled by PPMD operating within the PFE.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

8Dec/090

Implementing Provider-Provisioned VPNs using Route Reflectors

MPLS/BGP Provider-Provisioned VPNs, such as those proposed in RFC 4364 (formerly RFC 2547) or draft-kompella variants, suffer from some scalability issues due to the fact that all PE routers are required to have a full iBGP mesh in order to exchange VPN-IPv4 NLRI and associated VPN label information.  In a modern network consisting of a large number of PE devices, it becomes readily apparent that this requirement can quickly become unmanageable.

The formula to compute the number of sessions for an iBGP full mesh is n * (n-1)/2.  10 PE devices would only require a total of 45 iBGP sessions (10 * (9)/2 = 45).  However, by simply adding 5 additional PEs into this environment your total number of sessions increases exponentially to 105.  Scalability issues arise because maintaining this number of iBGP sessions on each PE is an operational nightmare; similarly control plane resources are quickly exhausted.

An alternative to this that has gained widespread adoption is to utilize Route Reflectors to reflect the VPN-IPv4 NLRI and associated VPN label between PE devices.  However, several issues arise when using Route Reflectors in such an environment.  In a normal environment without the use of Route Reflectors, MPLS tunnels exist between each PE router such that when the VPN-IPv4 NLRI and associated VPN label are received, a PE router can recurse through its routing table to find the underlying MPLS tunnel used to reach the remote BGP next-hop within the VPN-IPv4 NLRI.  In the Route Reflection model, the Route Reflector typically doesn't have an MPLS tunnel to each PE for which it is receiving VPN-IPv4 NLRI.  Therefore, these routes never become active and are therefore not candidates for reflection back to other client and non-client peers.

A few methods have been developed which circumvent this issue.  One method is to simply define MPLS tunnels from the Route Reflector to each PE.  This solves the problem by allowing the Route Reflector to find a recursive match (i.e. MPLS tunnel) in order to reach the remote PE.  However, this approach suffers from the drawback in that it requires a whole bunch of MPLS tunnels to be configured which only serve to allow the received VPN-IPv4 NLRI to be considered active.  Remember, these tunnels are completely useless in that they will never be used for the actual forwarding of data, they are only used within the control plane to instantiate routes.

An alternative and much more graceful solution to this problem is to configure the Route Reflector with a static discard route within the routing table which is used to reference BGP next-hops in MPLS environments (inet.3 for example in JUNOS).  This static discard route only serves to function as a recursive match when incoming VPN-IPv4 NLRI are received for the express purpose of making these routes active and therefore candidates for redistribution.  In JUNOS, one can accomplish this using the following configuration:

routing-options {
    rib inet.3 {
        static {
            route 0.0.0.0/0 discard;
        }
    }
}

With the above, any VPN-IPv4 NLRI received from a PE router is immediately made active due to the fact that a static route has been created in inet.3 which is the routing table used in JUNOS to recurse for BGP next-hops in MPLS environments.

An excellent whitepaper entitled "BGP Route Reflection in Layer 3 VPN Networks" expands upon this and describes the benefits of using Route Reflection in such environments. It also builds the case for using a distributed Route Reflection design to further enhance scalability and redundancy.

One thing to keep in mind is that with the Route Reflector approach, we have merely moved the problem set from that of the PE device to that of the Route Reflector.  Although it minimizes the number of iBGP sessions required on PE devices, the Route Reflector must be capable of supporting a large number of iBGP sessions and in addition, must be able to store all of the VPN-IPv4 NLRI for all of the VPNs for which it is servicing.  It is highly recommended that adequate amounts of memory are in place on the Route Reflector in order to store this large amount of routing information.

Finally, while using Route Reflectors is an acceptable solution in the interim to addressing scaling concerns with Provider-Provisioned VPNs, it is not clear if this approach is sufficient for the long term.  There are several other options being examined, with some of them outlined in a presentation entitled "2547 L3VPN Control Plane Scaling" given at APRICOT in Kyoto, Japan in 2005.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

3Dec/090

Book Review :: IS-IS: Deployment in IP Networks

IS-IS_DeploymentIS-IS: Deployment in IP Networks
by Russ White, Alvaro Retana
Hardcover: 320 pages
Publisher: Pearson Education
ISBN-13: 978-0201657722

2starsBetter off choosing an alternative selection

As IS-IS is one of the more esoteric protocols, understood only by a few people in large scale ISP environments, I thought this book would be a welcome addition to my library as there isn't much else on the market covering this protocol. There are of course ISO 10589 and RFC 1195 which covers these protocols, but seeing as this is a short book I thought it might be able to shed some light on an otherwise complex protocol.

In reviewing this book I've come up disappointed in general. There are certainly a few golden nuggets and I give the book a couple of stars just for attempting to bridge the gap between the purely theoretical and the purely vendor specific. However, the book comes up short on most other points. Often times I found myself wanting to scrap this book in favor of some of the other selections on the market, but since I have respect for these authors I read the whole book hoping that they might be able to redeem themselves by the time I finished.

Obviously the authors have a great deal of knowledge about the subject, and I don't fault them entirely. The quality of the editing is poor with many grammatical and syntactical errors littered throughout the text. There are abundant instances throughout the book where the diagrams used do not match the text describing them. I was rather disappointed because I usually find that Addison-Wesley publishes some of the best texts on the market.

All in all, I thought this book could have been a lot better than it was. After all, these authors have several other titles under their belt, most notably "Advanced IP Network Design". But in this case, I would say that you are better off looking for other similar titles available on the market, such as Jeff Doyle's "Routing TCP/IP Volume 1" or "The Complete IS-IS Routing Protocol" by Hannes Gredler and Walter Goralski.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

30Nov/091

Book Review :: MPLS-Enabled Applications: Emerging Developments and New Technologies

MPLS Enabled Applications

MPLS-Enabled Applications: Emerging Developments and New Technologies
by Ina Minei, Julian Lucek
Paperback: 526 pages
Publisher: Wiley
ISBN-13: 978-0470986448

5starsExcellent coverage of VPLS, and Multicast over Layer 3 VPNs

Recently I had to work on a project which involved demonstrating Multicast over Layer 3 VPN interoperability between Cisco and Juniper. I spent several days reading through all the RFCs and working-group drafts which pertained to this subject matter, after which I still had many unanswered questions. In order to round out my understanding, I decided to order the Second Edition of 'MPLS-Enabled Applications'. Looking back, I wish I had read this book instead of wasting my time reading the various RFCs and working-group drafts. This book answered all of my questions and went above and beyond to give me a solid understanding of the concepts and their application. As other reviewers have pointed out, often one needs to read a book to understand the technology basics, and then refer to RFCs or working-group drafts in order to keep abreast of the latest changes. Not so with this book... In fact, this book is so current that reading the working-group drafts is largely unnecessary. It is incredibly comprehensive, concise, and gives the reader a thorough understanding of the business drivers. Furthermore, it illustrates the various ways in which MPLS services can be offered and outlines the pros and cons of each approach so that the network designer can make intelligent decisions with regards to implementation.

In addition to the great coverage that was provided by the First Edition, the Second Edition has updated the text to reflect newer trends and applications such as the transport of IPv6 over an IPv4 MPLS core, and detailed coverage of end-to-end and local protection schemes in MPLS networks. Likewise, the chapter previously called "Point-to-Multipoint LSPs" has now been renamed to "MPLS Multicast", with much more detailed coverage of the P2MP hierarchy and the forwarding-plane and control-plane operation. The biggest value for me was the addition of a completely new chapter on "Multicast over Layer 3 VPNs" which provided comprehensive coverage of this emerging technology and fully illustrates the full gamut of operation of either the PIM/GRE approach, or the NG-VPN approach utilizing BGP and P2MP LSPs. Finally, the addition of a chapter on "MPLS in Access Networks" was well deserved seeing as Ethernet is quickly becoming the access technology of choice and MPLS will likely be utilized as an overlay in order to realize the full potential of Ethernet in these environments.

This book has earned a spot on my bookshelf as one of my most coveted resources, and I refer to it quite often to refresh my memory on the myriad workings of various functions within MPLS. I wish I could give this book a rating higher than five stars! I can't overemphasize how exceptional this book is. If you are in the market for a book covering MPLS and emerging applications offered on MPLS networks, this single book should be at the top of your list!

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati

12Nov/091

Hardening DNS Against Reflection Attacks

Generally, the most prevalent types of attacks and the ones which are most likely to target a DNS infrastructure are reflection/amplification attacks (which generally use a DNS servers resources against other third-parties).  Understanding the attack-vector employed in most reflection attacks is necessary in order to understand how to harden an environment against these types of attacks.

In a typical DNS reflection attack, attackers send a large number of queries from a spoofed source address.  The address which is spoofed is typically that of the victim.  When the requests are received by the nameserver, all ensuing responses to these queries are directed back towards the spoofed IP of the victim.  The amplification factor comes into play in these types of attacks because the attacker will typically query for a record of a large size, typically that of a root record which by it's very nature is very large in size.  By simply sending in small queries, on the order of around 90 Bytes, an attacker can typically get a multiplication factor of five times that of the original query.  This allows an attacker to use a smaller number of hosts in the botnet and cause considerably more impact than would otherwise be possible if these devices were sending traffic directly towards the victim.

Due to the fact that the purpose of this attack is to reflect packets back towards the victim, all of the source IPs of the DNS queries contain the same address.  This makes it fairly easy to spot a reflection attack utilizing your infrastructure.  A simple observation of a spike in DNS queries from a singular IP is a clear indication that this is going on.  One would think that these types of attacks can be mitigated fairly easily by implementing simple ACLs on routers to prevent the incoming queries from those spoofed IP, and in fact that is a method commonly used by network administrators to protect against these types of attacks.  However, most security experts suggest that the filtering should actually be implemented on the nameserver itself - in fact this has been considered an industry best practice for quite some time now.  The reason for this is that implementing an ACLs on a router is largely a reactive countermeasure which can only be deployed after the fact.  An administrator will still need to identify the target of the attack before filters can be put in place; furthermore these types of filters only serve to cause a legitimate Denial of Service when that particular victim actually attempts to query anything for which the nameserver is actually authoritative for.  As an alternative to ACLs, some proposed configurations below can be used to eliminate this problem in it's entirety.

At the very onset of your investigation into the vulnerabilities of your DNS infrastructure and potential remedies one of the very first things a network administrator must determine is if the nameservers allow anyone on the outside world to query for root (.).  Using the example below, one can easily check to see if their nameserver responds with a root-referral when queried for root.  If you see something along these lines, you can be fairly certain your nameserver is responding with a root-referral:

/usr/bin/dig . NS @dns.example.com
 
; <<>> DiG 9.2.4 <<>> . NS @DNS.EXAMPLE.COM ; (2 servers found) ;; global options: printcmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54368 ;; flags: qr rd; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 13
 
;; QUESTION SECTION:
;. IN NS
 
;; ANSWER SECTION:
. 86400 IN NS A.ROOT-SERVERS.NET.
. 86400 IN NS B.ROOT-SERVERS.NET.
. 86400 IN NS C.ROOT-SERVERS.NET.
. 86400 IN NS D.ROOT-SERVERS.NET.
. 86400 IN NS E.ROOT-SERVERS.NET.
. 86400 IN NS F.ROOT-SERVERS.NET.
. 86400 IN NS G.ROOT-SERVERS.NET.
. 86400 IN NS H.ROOT-SERVERS.NET.
. 86400 IN NS I.ROOT-SERVERS.NET.
. 86400 IN NS J.ROOT-SERVERS.NET.
. 86400 IN NS K.ROOT-SERVERS.NET.
. 86400 IN NS L.ROOT-SERVERS.NET.
. 86400 IN NS M.ROOT-SERVERS.NET.
 
;; ADDITIONAL SECTION:
A.ROOT-SERVERS.NET. 86400 IN A 198.41.0.4
B.ROOT-SERVERS.NET. 86400 IN A 192.228.79.201
C.ROOT-SERVERS.NET. 86400 IN A 192.33.4.12
D.ROOT-SERVERS.NET. 86400 IN A 128.8.10.90
E.ROOT-SERVERS.NET. 86400 IN A 192.203.230.10
F.ROOT-SERVERS.NET. 86400 IN A 192.5.5.241
G.ROOT-SERVERS.NET. 86400 IN A 192.112.36.4
H.ROOT-SERVERS.NET. 86400 IN A 128.63.2.53
I.ROOT-SERVERS.NET. 86400 IN A 192.36.148.17
J.ROOT-SERVERS.NET. 86400 IN A 192.58.128.30
K.ROOT-SERVERS.NET. 86400 IN A 193.0.14.129
L.ROOT-SERVERS.NET. 86400 IN A 199.7.83.42
M.ROOT-SERVERS.NET. 86400 IN A 202.12.27.33
 
;; Query time: 43 msec
;; SERVER: www.example.com#53(www.example.com)
;; WHEN: Thu Nov 12 17:01:51 2009
;; MSG SIZE rcvd: 436
 

Normally, most Internet-facing authoritative nameservers should not respond to recursive third-party queries for root.  If an authoritative nameserver responds to queries for root with a root-referral, attackers will likely use your nameservers as an amplification vector to launch attacks.  It is better to remove the temptation altogether, by not allowing these in the first place.  Furthermore, instead of responding with the root-referral, nameservers should be configured to respond with REFUSED or SERVFAIL or another similar type of message.  The reason for this is simple - a REFUSED message is only on the order of around 50 Bytes.  If a countermeasure such as this is not employed, attackers can send in a relatively small spoofed query and will get roughly a 512 Byte response.  However, if we respond with a REFUSED message, the amplification factor is on the order of 1:1.  From an efficiency standpoint this does not provide much if any amplification, and therefore the attackers will simply look for other providers whom don’t apply such stringent security measures.

NOTE: Of course if you are in the business of providing recursive DNS service, that is an entirely different story - if that is the case, network administrators should take extra precautions by strictly enabling this function on the Recursive DNS servers (not the Authoritatives) and combining firewall-filters to limit recursive service to only IP blocks of paying customers.
 
While we're on the subject, another current best practice in the industry is to apply a similar methodology to requests for records for which a given DNS Server is not authoritative for.  Some resolvers may respond to these types of requests by providing a root-referral, and in the worst cases a resolver may actually perform a recursive query on behalf of the original source.  An authoritative resolver should respond to any non-existent domain requests with either the REFUSED message or other similar type of message, rather than providing a root referral or performing a recursive query.  In fact, BCP 140 (Preventing Use of Recursive Nameservers in Reflector Attacks, RFC 5358) looked at this problem and concluded that sending REFUSED was the best general guidance that can be given.  While BCP 140 applies guidance to recursive servers, returning REFUSED to queries which are not within the namespace served by authoritative servers is entirely consistent with BCP 140.  You can generally find out if your nameservers allows for recursion or if it responds with a root-referral, by using a command such as the following:

/usr/bin/dig +recurs @yournameserver_ip www.facebook.com
 

If you see a response which looks like the large root-referral response above, or some other type of response other than a REFUSED or SERVFAIL, you should take steps to harden your resolver.  One can also look for an "RA" entry in the "Flags" section of the DNS response which should give you some indication as to whether the resolver allows for recursion.

In conclusion, there are several such steps which allow administrators to prevent from being used as an unwitting pawn in an attack against third-parties.  Firewall filters are effective, but are reactive in nature.  Therefore, it is recommended to follow some of the steps below in order to effectively harden your DNS infrastructure against reflection attacks.

SYNOPSIS - Steps to prevent a DNS infrastructure from being used for reflection/amplification type attacks:

  1. Disable querying for root on Authoritative resolvers, return REFUSED
  2. Filter queries to Recursives from only paying customers, via ACLs
  3. Apply BCP 140 or similar rules to cause any non-existent domain queries to respond with a REFUSED.

An excellent whitepaper entitled "Anatomy of Recent DNS Reflector Attacks from the Victim and Reflector Point of View" by Ken Silva, Frank Scalzo and Piet Barber covers this topic in more detail.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Google Buzz Send Gmail Post to LinkedIn Post to Slashdot Post to Technorati