Internet-Draft | EVPN Anycast Aliasing | July 2023 |
Rabadan, et al. | Expires 11 January 2024 | [Page] |
The current Ethernet Virtual Private Network (EVPN) all-active multi-homing procedures in Network Virtualization Over Layer-3 (NVO3) networks provide the required Split Horizon filtering, Designated Forwarder Election and Aliasing functions that the network needs in order to handle the traffic to and from the multi-homed CE in an efficient way. In particular, the Aliasing function addresses the load balancing of unicast packets from remote Network Virtualization Edge (NVE) devices to the NVEs that are multi-homed to the same CE, irrespective of which of those NVEs have learned the CE's MAC/IP information. This document describes an optional optimization of the EVPN multi-homing Aliasing function - EVPN Anycast Aliasing - that is specific to the use of EVPN with NVO3 tunnels (i.e., IP tunnels) and, in typical Data Center designs, may provide savings in terms of data plane and control plane resources in the routers.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 11 January 2024.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Ethernet Virtual Private Network (EVPN) is the de facto standard control plane in Network Virtualization Over Layer-3 (NVO3) networks deployed in multi-tenant Data Centers [RFC8365][I-D.ietf-nvo3-evpn-applicability]. EVPN provides Network Virtualization Edge (NVE) auto-discovery, tenant MAC/IP dissemination and advanced features required by NVO3 networks, such as all-active multi-homing. The current EVPN all-active multi-homing procedures in NVO3 networks provide the required Split Horizon filtering, Designated Forwarder Election and Aliasing functions that the network needs in order to handle the traffic to and from the multi-homed CE in an efficient way. In particular, the Aliasing function addresses the load balancing of unicast packets from remote NVEs to the NVEs that are multi-homed to the same CE, irrespective of which of those NVEs have learned the CE's MAC/IP information. This document describes an optional optimization of the EVPN multi-homing Aliasing function - EVPN Anycast Aliasing - that is specific to the use of EVPN with NVO3 tunnels (i.e., IP tunnels) and, in typical Data Center designs, may provide some savings in terms of data plane and control plane resources in the routers.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Figure 1 depicts the typical Clos topology in multi-tenant Data Centers, simplified to show only three Leaf routers and two Spines. The NVEs or Leaf routers run EVPN for NVO3 tunnels, as in [RFC8365]. We assume VXLAN is used as the NVO3 tunnel, given that VXLAN is highly prevalent in multi-tenant Data Centers. This diagram is used as a reference throughout this document. In very large scale Data Centers, though, the number of Tenant Systems, Leaf routers and Spines (in multiple layers) may be significant.¶
In the example of Figure 1, the Tenant Systems TS1 and TS2 are multi-homed to Leaf routers L1 and L2, and the Ethernet Segment Identifiers ESI-1 and ESI-2 represent the Ethernet Segments of TS1 and TS2 in the EVPN control plane for the Split Horizon filtering, Designated Forwarder and Aliasing functions [RFC8365].¶
Taking Tenant Systems TS1 and TS3 as an example, the EVPN all-active multi-homing procedures guarantee that, when TS3 sends unicast traffic to TS1, Leaf L3 does per-flow load balancing towards Leaf routers L1 and L2. As explained in [RFC7432] and [RFC8365], this is possible because Leaf routers L1 and/or L2 advertise TS1's MAC address in an EVPN MAC/IP Advertisement route that includes ESI-1 in the Ethernet Segment Identifier field. When the route is imported in Leaf L3, TS1's MAC address is programmed with a destination associated with the ESI-1 next hop list. This ESI-1 next hop list is built from the EVPN A-D per ES and A-D per EVI routes for ESI-1 received from Leaf routers L1 and L2. Assuming the Ethernet Segment ES-1 links are operationally active, Leaf routers L1 and L2 advertise the EVPN A-D per ES/EVI routes for ESI-1 and Leaf L3 adds L1 and L2 to its next hop list for ESI-1. Unicast flows from TS3 to TS1 are therefore load balanced to Leaf routers L1 and L2, and L3's ESI-1 next hop list is what we refer to as the "overlay ECMP-set" for ESI-1 in Leaf L3. In addition, once Leaf L3 selects one of the next hops in the overlay ECMP-set, e.g., L1, Leaf L3 does a route lookup of the L1 address in the Base router route table. The lookup yields a list of two next hops, Spine-1 and Spine-2, which we refer to as the "underlay ECMP-set". Therefore, for a given unicast flow to TS1, Leaf L3 does per-flow load balancing at two levels: a next hop in the overlay ECMP-set is selected first, e.g., L1, and then a next hop in the underlay ECMP-set is selected, e.g., Spine-1.¶
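The two-level selection described above can be illustrated with a minimal Python sketch. It is not taken from any implementation: the hash function, the `pick` and `forward` helpers, the table layout and the flow 5-tuple are all assumptions made for illustration, and real routers use their own hardware hashing.

```python
import hashlib

def pick(flow, candidates, salt):
    """Deterministically pick one candidate for a flow (hash-based ECMP sketch)."""
    digest = hashlib.sha256(repr((salt, flow)).encode()).digest()
    return candidates[int.from_bytes(digest[:4], "big") % len(candidates)]

# Hypothetical forwarding state on Leaf L3, mirroring the description of Figure 1.
overlay_ecmp = {"ESI-1": ["L1", "L2"]}            # overlay ECMP-set per ESI
underlay_ecmp = {                                 # underlay ECMP-set per Leaf address
    "L1": ["Spine-1", "Spine-2"],
    "L2": ["Spine-1", "Spine-2"],
}

def forward(flow, esi):
    """Select the egress Leaf first (overlay), then the Spine towards it (underlay)."""
    egress_leaf = pick(flow, overlay_ecmp[esi], salt="overlay")
    spine = pick(flow, underlay_ecmp[egress_leaf], salt="underlay")
    return egress_leaf, spine

# Illustrative flow: (source IP, destination IP, protocol, source port, destination port).
flow = ("10.0.0.3", "10.0.0.1", 6, 49152, 443)
print(forward(flow, "ESI-1"))
```

All packets of the same flow hash to the same pair, so a flow stays on one egress Leaf and one Spine, while different flows spread across both ECMP-sets.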
While aliasing [RFC7432] provides an efficient method to load balance unicast traffic to the Leaf routers attached to the same all-active Ethernet Segment, there are some challenges in very large Data Centers where the number of Ethernet Segments and Leaf routers is significant:¶
Inefficient forwarding during a failure: A further consequence of ECMP being performed in the overlay ECMP-set is the potential for in-flight packets sent by remote Leaf routers being rerouted in an inefficient way. Some examples follow:¶
There are existing proprietary multi-chassis Link Aggregation Group implementations, collectively and commonly known as MC-LAG, that attempt to work around the above challenges by using the concept of "Anycast VTEPs", or the use of a shared loopback IP address that the Leaf routers attached to the same multi-homed Tenant System can use to terminate VXLAN packets. As an example in Figure 1, if Leaf routers L1 and L2 used an Anycast VTEP address "anycast-IP1" to identify VXLAN packets to Tenant System TS1:¶
However, the use of proprietary MC-LAG technologies in EVPN NVO3 networks is being abandoned due to the superior functionality of EVPN Multi-Homing, including mass withdraw [RFC7432], advanced Designated Forwarder election [RFC8584] or weighted load balancing [I-D.ietf-bess-evpn-unequal-lb], to name a few features.¶
This document specifies an EVPN Anycast Aliasing extension that can be used as an alternative to EVPN Aliasing [RFC7432]. EVPN Anycast Aliasing replaces the per-flow overlay ECMP load-balancing with a simplified per-flow underlay ECMP load balancing, in a similar way to how proprietary MC-LAG solutions do it, but in a standard way and keeping the superior advantages of EVPN Multi-Homing, such as the Designated Forwarder Election, Split Horizon filtering or the mass withdraw function, all of them described in [RFC8365] and [RFC7432]. The solution uses the A-D per ES routes to advertise the Anycast VTEP address to be used when sending traffic to the Ethernet Segment and suppresses the use of A-D per EVI routes for the Ethernet Segments configured in this mode. This solution addresses the challenges outlined in Section 1.2.¶
The solution is valid for all NVO3 tunnels, or even for IP tunnels in general. Sometimes the description uses VXLAN as an example, given that VXLAN is highly prevalent in multi-tenant Data Centers. However, the examples and procedures are valid for any NVO3 tunnel type.¶
This specification makes use of two BGP extensions that are used along with the A-D per ES routes [RFC7432].¶
The first extension is the flag "A" or "Anycast Aliasing mode". IANA is requested to allocate it in bit 2 of the EVPN ESI Multihoming Attributes registry for the 1-octet Flags field in the ESI Label Extended Community, as follows:¶
Where the following Flags are defined:¶
Name | Meaning | Reference |
---|---|---|
RED | Multihomed site redundancy mode | [I-D.ietf-bess-rfc7432bis] |
SHT | Split Horizon type | [I-D.ietf-bess-evpn-mh-split-horizon] |
A | Anycast Aliasing mode | This document |
When the NVE advertises an A-D per ES route with the A flag set, it indicates that the Ethernet Segment is working in Anycast Aliasing mode. The A flag is set only if RED = 00 (All-Active redundancy mode), and MUST NOT be set if RED is different from 00.¶
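The constraint between the A flag and the RED field can be sketched in Python as below. The bit masks are assumptions, not taken from this text: the sketch assumes that RED occupies the two low-order bits of the Flags octet and that "bit 2" is counted from the most significant bit (bit 0). The `EsiFlags` class is hypothetical, and the SHT field is left out for brevity.

```python
from dataclasses import dataclass

RED_ALL_ACTIVE = 0b00     # RED = 00, All-Active redundancy mode (from the text)
RED_MASK = 0b0000_0011    # assumption: RED is in the two low-order bits
A_MASK = 0b0010_0000      # assumption: "bit 2", counting bit 0 as the most significant

@dataclass(frozen=True)
class EsiFlags:
    """Hypothetical model of the 1-octet Flags field (SHT omitted)."""
    red: int
    anycast: bool

    def __post_init__(self):
        if not 0 <= self.red <= 0b11:
            raise ValueError("RED is a 2-bit field")
        # The A flag MUST NOT be set if RED is different from 00.
        if self.anycast and self.red != RED_ALL_ACTIVE:
            raise ValueError("A flag requires RED = 00 (All-Active)")

    def encode(self) -> int:
        """Return the Flags octet under the assumed bit layout."""
        return (A_MASK if self.anycast else 0) | (self.red & RED_MASK)

flags = EsiFlags(red=RED_ALL_ACTIVE, anycast=True)
print(f"{flags.encode():08b}")
```

Rejecting the invalid combination at construction time keeps a sender from ever encoding the A flag together with a non-zero RED value.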
The second extension that this document introduces is the encoding of the "Anycast VTEP" address in the BGP Tunnel Encapsulation Attribute, Tunnel Egress Endpoint Sub-TLV (code point 6) [RFC9012]. NOTE from the authors: a new Sub-TLV may also be considered in future versions of this document, depending on the feedback of the Working Group.¶
This document proposes an OPTIONAL "EVPN Anycast Aliasing" procedure that provides a solution to optimize the behavior in case the challenges described in Section 1.2 become a problem. The description makes use of the terms "Ingress NVE" and "Egress NVE". In this document, Egress NVE refers to an NVE that is attached to an Ethernet Segment working in Anycast Aliasing mode, whereas Ingress NVE refers to the NVE that transmits unicast traffic to a MAC address that is associated with a remote Ethernet Segment that works in Anycast Aliasing mode. In addition, the concepts of Unicast VTEP and Anycast VTEP are used. A Unicast VTEP is a loopback IP address that is unique in the Data Center fabric and is owned by a single NVE terminating VXLAN (or NVO3) traffic. An Anycast VTEP is a loopback IP address that is shared among the NVEs attached to the same Ethernet Segment and is used to terminate VXLAN (or NVO3) traffic on those NVEs. An Anycast VTEP in this document MUST NOT be used as BGP next hop of any EVPN route NLRI. This is due to the need for the Multi-Homing procedures to uniquely identify the originator of the EVPN routes via their NLRI next hops.¶
The solution consists of the following alternative modifications of the [RFC7432] EVPN Aliasing function:¶
The default behavior for an Egress NVE attached to an Ethernet Segment follows [RFC8365]. The Anycast Aliasing mode MUST be explicitly configured for a given all-active Ethernet Segment. When the Egress NVE Ethernet Segment is configured to follow the Anycast Aliasing behavior, the egress NVE:¶
Advertises EVPN A-D per ES routes for the Ethernet Segment with:¶
The Ingress NVE that supports this document:¶
Non-upgraded NVEs ignore the Anycast Aliasing flag value and the BGP tunnel encapsulation attribute.¶
It is important to note that this solution MUST NOT be used in the following cases:¶
Consider the example of Figure 3 where three Leaf routers run EVPN over VXLAN tunnels. Suppose Leaf routers L1, L2 and L3 support Anycast Aliasing as per Section 3 and Ethernet Segment ES-1 is configured as an Anycast Aliasing Ethernet Segment, all-active mode, with Anycast VTEP IP12. The three Leaf routers use VNI-1 to identify the Broadcast Domain BD1. Leaf routers L1 and L2 both advertise an A-D per ES route for ESI-1 with the Anycast Aliasing flag set and Anycast VTEP IP12. Suppose only Leaf L1 learns TS1 MAC address, hence only L1 advertises a MAC/IP Advertisement route for TS1 MAC with ESI-1.¶
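A minimal Python sketch of how an Ingress NVE such as Leaf L3 might resolve the outer destination for this example follows. The `AdPerEsRoute` record and the `resolve_outer_destinations` helper are hypothetical. Requiring every A-D per ES route for the ESI to carry the flag and the same Anycast VTEP, and falling back to the list of originators otherwise, are simplifying assumptions rather than rules stated here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AdPerEsRoute:
    """Hypothetical view of a received A-D per ES route."""
    originator: str                    # Leaf that advertised the route (BGP next hop)
    esi: str
    anycast_flag: bool                 # A flag in the ESI Label Extended Community
    anycast_vtep: Optional[str] = None

def resolve_outer_destinations(esi, ad_routes):
    """Return the candidate outer destination addresses for MACs on this ESI."""
    routes = [r for r in ad_routes if r.esi == esi]
    if not routes:
        return []
    vteps = {r.anycast_vtep for r in routes}
    # Assumption: all routes must agree on Anycast Aliasing mode and on the VTEP.
    if all(r.anycast_flag for r in routes) and len(vteps) == 1 and None not in vteps:
        return [next(iter(vteps))]     # single destination; ECMP happens in the underlay
    return sorted(r.originator for r in routes)   # simplified overlay ECMP-set

received = [
    AdPerEsRoute("L1", "ESI-1", True, "IP12"),
    AdPerEsRoute("L2", "ESI-1", True, "IP12"),
]
print(resolve_outer_destinations("ESI-1", received))
```

With both routes carrying the A flag and Anycast VTEP IP12, the MAC learned only from L1 resolves to the single shared address rather than to a per-Leaf next hop list.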
In this example:¶
Spine-1 and Spine-2 also create underlay ECMP-sets for Anycast VTEP IP12 with next hops L1 and L2. Therefore, in case of:¶
While the solution described in Section 3 suppresses the advertisement of an A-D per EVI route per Ethernet Segment per Broadcast Domain, it also requires the underlay routing protocol to advertise an additional Anycast VTEP IP address per Ethernet Segment. In very large scale Data Centers, the injection of as many /32 or /128 prefixes as Ethernet Segments may have a significant impact on the Forwarding Information Base tables of the Leaf and Spine routers. Therefore the use of Anycast Aliasing becomes a trade-off between the number of A-D per EVI routes in regular EVPN Aliasing and the number of additional Anycast VTEP loopback addresses injected in the underlay routing protocol in the case of Anycast Aliasing. As an example, suppose two Leaf routers L1 and L2 are attached to the same 128 Ethernet Segments and each Ethernet Segment has four Attachment Circuits (in four different Broadcast Domains). In this case:¶
Section 4 discusses solutions to minimize the impact of Anycast Aliasing on the underlay Forwarding tables. We refer to those solutions as Multi Ethernet Segment Anycast (MESA) Aliasing.¶
The procedures described in this section minimize the impact of Anycast Aliasing on the underlay, while preserving the benefits of the solution. The additional extensions build upon the procedure described in Section 3, with some modifications as follows:¶
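The trade-off in the example with 128 Ethernet Segments can be tallied with a short Python sketch. The variable names are illustrative, and the count assumes one A-D per EVI route per Ethernet Segment per Broadcast Domain from each Leaf router and one additional underlay prefix per Ethernet Segment, as described above.

```python
leaf_routers = 2
ethernet_segments = 128
broadcast_domains_per_es = 4

# Regular EVPN Aliasing: one A-D per EVI route per ES per Broadcast Domain, per Leaf.
ad_per_evi_per_leaf = ethernet_segments * broadcast_domains_per_es   # 512
ad_per_evi_total = leaf_routers * ad_per_evi_per_leaf                # 1024

# Anycast Aliasing: one extra Anycast VTEP prefix (/32 or /128) per Ethernet Segment.
anycast_prefixes = ethernet_segments                                 # 128

print(ad_per_evi_per_leaf, ad_per_evi_total, anycast_prefixes)
```

The overlay saves 1024 A-D per EVI routes across the two Leaf routers, at the cost of 128 additional host prefixes in the underlay.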
On the Egress NVEs:¶
On the Ingress NVEs:¶
In most of the use cases in multi-tenant Data Centers, there are two Leaf routers per rack that share all the Ethernet Segments of Tenant Systems in the rack. In this case, a single Anycast VTEP address per rack is injected in the underlay routing protocol, making the solution highly scalable. In addition, in this common use case the "anycast-aliasing-threshold" is set to 2. In case of link failure on the Ethernet Segment, this limits the amount of "fast-rerouted" traffic to only the in-flight packets.¶
Consider the example of Figure 1. Suppose Leaf routers L1, L2 and L3 support Multi Ethernet Segment Anycast Aliasing as per Section 4. Leaf routers L1 and L2 both advertise an A-D per ES route for ESI-1, and an A-D per ES route for ESI-2. Both routes will carry the Anycast Aliasing flag set and the same Anycast VTEP IP12. Following the described procedure, Leaf L3 is configured with anycast-aliasing-threshold = 2 and collect-timer = t. Upon receiving MAC/IP Advertisement routes for the two Ethernet Segments and the expiration of "t" seconds, Leaf L3 determines that the number of NVEs for ESI-1 and ESI-2 is equal to the threshold. Therefore, when sending unicast packets to Tenant Systems TS1 or TS2, L3 uses the Anycast VTEP address as outer IP address.¶
Suppose now that the link TS1-L1 fails. Leaf L1 then sends an MP_UNREACH_NLRI for the A-D per ES route for ESI-1. Upon the reception of the message, Leaf L3 changes the resolution of the ESI-1 destination from the Anycast VTEP to the Unicast VTEP derived from the MAC/IP Advertisement route next hop. Packets sent to Tenant System TS2 (on ES-2) still use the Anycast VTEP. In-flight packets sent to TS1 but still arriving at Leaf L1 are "fast-rerouted" to Leaf L2 as per Section 5.¶
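The threshold check on Leaf L3 and the fallback after the failure can be sketched in Python as follows. The `outer_destination` helper and the data layout are hypothetical, and the collect-timer is not modeled: the function is assumed to run after it expires. Treating "at least the threshold" as sufficient is also an assumption, since the example only covers the case where the count equals the threshold.

```python
def outer_destination(esi, ad_per_es, mac_route_next_hop, threshold):
    """Pick the outer destination IP for traffic to a MAC on the given ESI.

    ad_per_es maps ESI -> {originating NVE: advertised Anycast VTEP}.
    """
    advertisers = ad_per_es.get(esi, {})
    vteps = set(advertisers.values())
    # Assumption: use the Anycast VTEP when enough NVEs advertise the same one.
    if len(advertisers) >= threshold and len(vteps) == 1:
        return next(iter(vteps))
    # Otherwise fall back to the Unicast VTEP from the MAC/IP route next hop.
    return mac_route_next_hop

ad_per_es = {
    "ESI-1": {"L1": "IP12", "L2": "IP12"},
    "ESI-2": {"L1": "IP12", "L2": "IP12"},
}
print(outer_destination("ESI-1", ad_per_es, "L1", threshold=2))

# Link TS1-L1 fails: L1 withdraws its A-D per ES route for ESI-1 only.
del ad_per_es["ESI-1"]["L1"]
print(outer_destination("ESI-1", ad_per_es, "L2", threshold=2))
print(outer_destination("ESI-2", ad_per_es, "L1", threshold=2))
```

After the withdrawal only one NVE remains for ESI-1, so traffic to TS1 switches to a Unicast VTEP, while ESI-2 still meets the threshold and keeps using the Anycast VTEP.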
The proposal in Section 4 uses a shared VTEP for all the Ethernet Segments in a common Egress NVE group. In case the number of Egress NVEs sharing the group of Ethernet Segments is limited to two, an alternative proposal is to still use a different Anycast VTEP per Ethernet Segment, but to allocate all those Anycast VTEP addresses from the same subnet. A single IP Prefix for that subnet is announced in the underlay routing protocol by the Egress NVEs. The benefit of this proposal is that, in case of link failure in one individual Ethernet Segment, e.g., link TS1-L1 in Figure 1, Leaf L2 detects the failure (based on the withdrawal of the A-D per ES and ES routes) and can immediately announce the specific Anycast VTEP address (/32 or /128) into the underlay. Based on a Longest Prefix Match when routing NVO3 packets, Spines can immediately reroute packets (destined to the Anycast VTEP for ESI-1) to Leaf L2. This may reduce the amount of fast-rerouted VXLAN packets and spares the Ingress NVE from having to change the resolution of the Ethernet Segment destination from the Anycast VTEP to the Unicast VTEP.¶
The procedures in Section 3 and Section 4 may lead to some temporary situations in which traffic destined to an Anycast VTEP for an Ethernet Segment arrives at an Egress NVE where the Ethernet Segment link is in a failed state. In that case, the Egress NVE SHOULD re-encapsulate the traffic into an NVO3 tunnel following the procedures described in [I-D.burdet-bess-evpn-fast-reroute], section 7.1, with the following modifications:¶
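The effect of the Longest Prefix Match on a Spine can be sketched with Python's standard `ipaddress` module. The subnet 192.0.2.0/24, the per-segment addresses and the `lookup` helper are hypothetical values chosen for illustration, and the routing table is reduced to a plain dictionary.

```python
import ipaddress

def lookup(rib, destination):
    """Return the next hops of the longest prefix in rib that covers destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hops) for net, hops in rib.items() if dest in net]
    if not matches:
        return []
    _, hops = max(matches, key=lambda match: match[0].prefixlen)
    return hops

# Hypothetical addressing: one subnet holds the Anycast VTEPs of all Ethernet Segments.
anycast_vtep_esi1 = "192.0.2.1"
anycast_vtep_esi2 = "192.0.2.2"
spine_rib = {ipaddress.ip_network("192.0.2.0/24"): ["L1", "L2"]}

print(lookup(spine_rib, anycast_vtep_esi1))   # covered by the shared subnet prefix

# Link TS1-L1 fails: L2 announces the specific /32 for the ESI-1 Anycast VTEP.
spine_rib[ipaddress.ip_network("192.0.2.1/32")] = ["L2"]

print(lookup(spine_rib, anycast_vtep_esi1))   # the /32 now wins
print(lookup(spine_rib, anycast_vtep_esi2))   # still resolved via the subnet prefix
```

Only the address of the failed Ethernet Segment is pulled towards L2; traffic for the other segments keeps using the single subnet prefix with both Leaf routers as next hops.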
In addition, when rerouting traffic, the Egress NVE uses the Anycast VTEP of the Ethernet Segment as outer source IP address of the NVO3 tunnel. Note this is the only case in this document where the use of the Anycast VTEP as source IP address is allowed. When an Egress NVE receives NVO3-encapsulated packets where the source VTEP matches a local Anycast VTEP, there are two implicit behaviors on the Egress NVE:¶
The procedures described in this document are applicable also to IP Aliasing use cases in [I-D.sajassi-bess-evpn-ip-aliasing]. Details will be added in future versions of this document.¶
To be added.¶
To be added.¶
IANA is requested to allocate the flag "A" or "Anycast Aliasing mode" in bit 2 of the EVPN ESI Multihoming Attributes registry for the 1-octet Flags field in the ESI Label Extended Community.¶