Rethinking SD-Access Design with Cisco: Navigating the Internet Border Complex

This post digs deeper into a recurring topic: the Internet connection. It is a critical part of network design, but it is handled differently from one network to the next, and sometimes poorly when it comes to High Availability (HA). I’ll be referring to this area as “The Kluge Zone” (cue eerie music).

The Internet connection becomes an especially intricate design topic when using SDA Transit.

The Single Exit Challenge:

A single exit is relatively straightforward. LISP provides a default route, directing traffic to a Border Node (or HA pair) to leave the SD-Access domain. This tunnel typically terminates in the data center, close to the Internet exit path. Additional considerations come into play if there’s a site or two between the data center and the Internet exit. That situation is rare, but the discussion below still applies to it. With SDA Transit, you can choose one or more default exits, such as your Internet blocks or data center blocks, but not both.

The Dual Exit Dilemma:

When two sites provide Internet exit, achieving failover becomes more challenging. However, there are simple, yet non-obvious, ways to build it.

Cisco’s “border prioritization” feature is intended to specify a preferred border site for Internet egress, with failover to another site. Consult Cisco SDA experts for more information.

Cisco has discussed a priority feature that establishes primary and fallback exits. Additionally, a new feature called “Affinity” was announced in October 2022 for SDA. A separate blog post will follow to discuss this feature.

Established and Trusted Solutions for Dual Exits:

Despite the challenges, you can achieve a high degree of multi-site high availability, provided there is a high-speed interconnection between the sites. One simple solution is running IP Transit or “VRF-Lite on all underlay links.” This approach becomes messy as the number of VNs/VRFs grows, because each VRF needs its own subinterface and routing adjacency on every link.

This method allows you to choose prefixes and metrics for traffic direction. BGP is a popular choice, while link state protocols like OSPF and IS-IS can make traffic engineering more difficult.
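To make the "prefixes and metrics" idea concrete, here is a minimal Python sketch of preferred-exit selection with failover. It is a toy model, not BGP itself: it borrows only the rule that a higher local preference wins among reachable paths. The names `ExitRoute` and `pick_exit`, the site labels, and the preference values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExitRoute:
    """One candidate default route toward an Internet exit (hypothetical model)."""
    site: str
    local_pref: int          # higher wins, mirroring BGP LOCAL_PREF
    reachable: bool = True   # False once the route has been withdrawn


def pick_exit(routes):
    """Return the reachable route with the highest local_pref, or None."""
    candidates = [r for r in routes if r.reachable]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.local_pref)


# Normal operation: DC1 is the preferred exit.
normal = [ExitRoute("DC1", 200), ExitRoute("DC2", 100)]
print(pick_exit(normal).site)

# DC1's route is withdrawn: traffic shifts to DC2.
degraded = [ExitRoute("DC1", 200, reachable=False), ExitRoute("DC2", 100)]
print(pick_exit(degraded).site)
```

Giving both sites the same preference would model running without a preferred site, at which point a real design needs some other tie-breaker to keep flows pinned to one firewall.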

Dual Exit Diagram and Explanation:

You can visualize SDA Transit running into two pairs of “External Border Nodes” at the two data centers. VRF-Lite runs into the Fusion Firewalls (FFWs) and everything in the FFWs is global routing only.

This design decouples the SDA network from the firewalls and edge, simplifying traffic management in the event of a firewall problem. The challenge lies in maintaining firewall state for flows and return traffic, especially for long-lived flows.

Running BGP on the FFWs allows for stateful traffic shifting to another site during a problem. This enables the choice of running with a preferred site or not.

LISP Pub/Sub allows you to designate Internet exits and round-robin between multiple sites. The decoupling in the diagram enables dual site HA and flexible routing options, like single preferred exit or dual with more complexity.

Firewall Failures and Failover:

The primary purpose of FFWs is to control traffic between VNs/VRFs. The secondary purpose is controlling user-to-server access. Understanding failure modes is crucial to ensuring a resilient network.

Potential failures include an External Border pair failure, IP Transit or LISP failover, LISP without Pub/Sub not failing over, and a failure between the border pair and the FFW or between the FFW and the core switches. Employing dynamic routing (with the FFW participating) can help traffic shift to the other data center when there is a problem. If your preferred Internet exit is at the data center with the failure(s), traffic should still be able to reach it across the crosslink (albeit sub-optimally).

The whole FFW block story becomes similar to the core switch/Internet firewalls/outer switches (or routers) story in terms of routing and failover. Essentially, this approach decouples packet delivery to one of the data centers (IP Transit or LISP-based SDA Transit) from firewalling and Internet routing. The result is a modular design with reduced complexity.

Caveats and Variations:

The above assumes that the two exits are geographically close, with latency being a non-issue. If the exits are far apart, location-aware exit priorities become essential, which is what Affinity can achieve. Load-sharing across exit complexes, however, is more challenging due to the need to preserve firewall state. Policy routing based on source IP block might be a workable solution.
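As a rough illustration of policy routing by source IP block, the Python sketch below pins each source block to one exit so that a given flow always crosses the same firewall. The address blocks, site names, and the `exit_for_source` helper are assumptions made up for this example; on real gear this would be expressed as policy-based routing configuration rather than code.

```python
import ipaddress

# Hypothetical policy: each source block is pinned to one exit site.
POLICY = [
    (ipaddress.ip_network("10.1.0.0/16"), "DC1"),
    (ipaddress.ip_network("10.2.0.0/16"), "DC2"),
]
DEFAULT_EXIT = "DC1"  # used when no block matches (assumption)


def exit_for_source(src_ip):
    """Return the exit site for a source address; first matching block wins."""
    addr = ipaddress.ip_address(src_ip)
    for block, site in POLICY:
        if addr in block:
            return site
    return DEFAULT_EXIT


print(exit_for_source("10.2.5.9"))     # matches the 10.2.0.0/16 block
print(exit_for_source("192.168.1.1"))  # no match, falls back to the default
```

Because the mapping depends only on the source address, both directions of a flow can be kept on one firewall as long as the return path is steered to the same site.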

Cross-site firewall clustering can address stateful return paths but may introduce complexity and risk. Although it allows for the use of secondary site devices and links, it could also create new failure modes.

Note that if cross-links fail between sites, problems may still arise. Adding cross-site failover to a routing scheme that uses both Internet exits statefully increases complexity further.

Affinity seems to help address these concerns.

Static routes are not recommended. Using public addressing internally, with NAT on the firewalls, can simplify state preservation. Advertising a /23 and one of the /24s out of each site might be useful in this case.
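One reading of the /23 plus /24 idea is that it steers Internet return traffic by longest-prefix match: each site's /24 attracts its own traffic, and the shared /23 catches everything if a site disappears. The Python sketch below models only that matching rule. The 198.18.0.0/23 block is a placeholder, and `best_site` is a hypothetical helper, not a router API.

```python
import ipaddress


def best_site(dst_ip, advertisements):
    """Return the site whose advertised prefix is the longest match, or None.

    advertisements is a list of (prefix_string, site_name) tuples.
    """
    addr = ipaddress.ip_address(dst_ip)
    matches = []
    for prefix, site in advertisements:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, site))
    if not matches:
        return None
    # Longest prefix wins; a tie on length is not modeled here.
    return max(matches, key=lambda m: m[0])[1]


# Both sites advertise the /23; each also advertises its own /24.
both_up = [
    ("198.18.0.0/23", "DC1"), ("198.18.0.0/24", "DC1"),
    ("198.18.0.0/23", "DC2"), ("198.18.1.0/24", "DC2"),
]
print(best_site("198.18.0.10", both_up))   # DC1's /24 is the longest match

# DC1 fails and its advertisements are withdrawn.
dc2_only = [a for a in both_up if a[1] == "DC2"]
print(best_site("198.18.0.10", dc2_only))  # covered by DC2's /23
```

In the failure case the return traffic arrives at the surviving site, but any firewall state for existing flows still has to exist there for them to continue.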

If more control is needed, consider IP Transit, particularly if you don’t expect to have more than a couple of VNs (VRFs) in your SD-Access network. Keep in mind that switching from IP Transit to SDA Transit later could be painful.

Conclusion:

This discussion should help you think about your Internet exits, redundancy, and failover strategies. Many designs lack automated dual-site Internet failover or only work for certain failure modes. In the past, some network professionals may have been hesitant to trust dynamic routing on firewalls, but recent advancements have made it more reliable.

By considering the various solutions and design elements discussed in this blog post, you can create a robust and resilient SD-Access network that addresses your organization’s unique requirements for Internet connectivity and high availability.