Internet-Draft | OSPF Flooding Reduction in MSDCs | July 2023 |
Xu, et al. | Expires 26 January 2024 |
OSPF is one of the underlay routing protocols used in Massively Scalable Data Center (MSDC) networks. A given OSPF router within a CLOS topology receives multiple copies of exactly the same LSA from multiple OSPF neighbors. In addition, two OSPF neighbors may send each other the same LSA simultaneously. Because each OSPF router in a CLOS topology has many OSPF neighbors, this unnecessary link-state flooding wastes a large share of the routers' processing resources. This document proposes extensions to OSPF that reduce OSPF flooding within such MSDC networks and thereby improve their scalability. These modifications are applicable to both OSPFv2 and OSPFv3.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on 26 January 2024.
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.
OSPF is commonly used as an underlay routing protocol for Massively Scalable Data Center (MSDC) networks, where CLOS is the most popular topology. MSDCs are also called Large-Scale Data Centers.
A given OSPF router within the CLOS topology receives multiple copies of exactly the same LSA from multiple OSPF neighbors. In addition, two OSPF neighbors may send each other the same LSA simultaneously. This unnecessary link-state flooding wastes the processing resources of OSPF routers, and as a result OSPF does not scale well in MSDC networks. Some MSDC operators have therefore chosen BGP as the routing protocol in their data centers [RFC7938]. However, with the emergence of high-performance Ethernet networks for AI and high-performance computing (HPC), visibility of the whole network topology, and even of link load information, is crucial for end-to-end path load balancing. Link-state routing protocols such as OSPF would therefore have to be reconsidered as the routing protocol for large-scale AI and HPC Ethernet networks. The prerequisite, of course, is that the scaling issue described above can be addressed.
This document describes a pragmatic approach to the above scaling issue. The basic idea is as follows: instead of flooding link-state information among neighboring OSPF routers within the MSDC network fabric, the link-state information originated by each OSPF router is collected by centralized controllers, which in turn reflect the collected link-state information to all OSPF routers within the MSDC. As shown in Figure 1, all OSPF routers within an MSDC network fabric are connected to one or more centralized controllers via a dedicated Local Area Network (LAN), referred to as the link-state collection and distribution LAN, which is used to collect and distribute link-state information. For redundancy, there should be at least two link-state collection and distribution LANs.
        +----------+                  +----------+
        |Controller|                  |Controller|
        +----+-----+                  +-----+----+
             |DR                            |BDR
             |                              |
             |                              |
---+---------+---+----------+-----------+---+---------+- LS Collection&Distribution LAN
   |             |          |           |             |
   |Non-DR       |Non-DR    |Non-DR     |Non-DR       |Non-DR
   |             |          |           |             |
   |         +---+--+       |       +---+--+          |
   |         |Router|       |       |Router|          |
   |         *------*-      |      /*---/--*          |
   |        /       \ --    |    //    /    \         |
   |        /       \   --  |  //      /    \         |
   |       /         \    --|//       /      \        |
   |       /         \     /*-       /        \       |
   |      /          \   // | --     /        \       |
   |     /           \ //   |   --  /          \      |
   |     /           /X     |     --            \     |
   |    /          //  \    |     / --          \     |
   |    /        //    \    |    /    --         \    |
   |   /       //       \   |    /      --        \   |
   |   /     //         \   |   /         --      \   |
   |  /    //            \  |  /            --     \  |
   |  /  //              \  |  /              --   \  |
 +-+- //*                +\\+-/-+               +---\-++
 |Router|                |Router|               |Router|
 +------+                +------+               +------+

                        Figure 1
With the assistance of these controllers, which act as the OSPF Designated Router (DR) and Backup Designated Router (BDR) for the link-state collection and distribution LAN, OSPF routers within the MSDC network do not need to exchange any type of OSPF packet other than the OSPF Hello packet among themselves. As specified in [RFC2328], these Hello packets are used to establish and maintain neighbor relationships, to ensure bidirectional communication between OSPF neighbors, and to elect the DR/BDR where those OSPF routers are connected to a broadcast network. To obtain the full topology information (i.e., the fully synchronized link-state database) of the MSDC network, these OSPF routers only need to exchange link-state information with the controllers elected as OSPF DR/BDR for the link-state collection and distribution LAN.
To further suppress the flooding of multicast OSPF packets originated by OSPF routers over the link-state collection and distribution LAN, OSPF routers do not send multicast OSPF Hello packets over that LAN. Instead, they initially just wait for OSPF Hello packets originated by the controllers elected as OSPF DR/BDR. Once the OSPF DR/BDR for the link-state collection and distribution LAN have been discovered, the routers start to send OSPF Hello packets periodically and directly (as unicasts) to the OSPF DR/BDR. In addition, OSPF routers send the other types of OSPF packets (e.g., Database Description packets, Link State Request packets, Link State Update packets, and Link State Acknowledgment packets) to the OSPF DR/BDR for the link-state collection and distribution LAN as unicasts as well. In contrast, the controllers elected as OSPF DR/BDR send OSPF packets as specified in [RFC2328]. As a result, OSPF routers within the MSDC do not receive OSPF packets from one another unless these packets are forwarded as unknown unicasts over the link-state collection and distribution LAN. Through these modifications to the legacy OSPF router behaviors, OSPF flooding is greatly reduced, which improves the overall scalability of MSDC networks. The modifications specified in this document are applicable to both OSPFv2 [RFC2328] and OSPFv3 [RFC5340].
The mechanism for OSPF refresh and flooding reduction in stable topologies described in [RFC4136] may be considered as well.
After the exchange of OSPF Hello packets among OSPF routers, the OSPF neighbor relationships among them transition to and remain in the 2-WAY state. OSPF routers originate Router-LSAs and/or Network-LSAs as appropriate for the link types. Note that neighbors in the 2-WAY state are advertised in these Router-LSAs and/or Network-LSAs. This differs slightly from the legacy OSPF router behavior specified in [RFC2328], where neighbors in the 2-WAY state are not advertised. However, these self-originated LSAs no longer need to be exchanged directly among the routers. Instead, they are sent solely to the controllers elected as OSPF DR/BDR for the link-state collection and distribution LAN.
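The difference in which neighbors are advertised can be sketched in Python as follows. This is a minimal illustration only: the `Neighbor` type, the `advertised_neighbors` function, and its `flooding_reduction` flag are hypothetical names introduced here, and treating every state at or beyond 2-WAY as advertisable is an assumption of this sketch rather than a statement of this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class NeighborState(IntEnum):
    """Neighbor states of RFC 2328, in order of progression."""
    DOWN = 0
    ATTEMPT = 1
    INIT = 2
    TWO_WAY = 3
    EXSTART = 4
    EXCHANGE = 5
    LOADING = 6
    FULL = 7


@dataclass(frozen=True)
class Neighbor:
    """Hypothetical record for one neighbor on a fabric link."""
    router_id: str
    state: NeighborState


def advertised_neighbors(neighbors, flooding_reduction=True):
    """Return the Router IDs of the neighbors to list in the Router-LSA.

    Legacy behavior lists only FULL neighbors. With the behavior described
    above, neighbors that have reached 2-WAY are listed as well (assumption:
    any state at or beyond 2-WAY qualifies).
    """
    threshold = NeighborState.TWO_WAY if flooding_reduction else NeighborState.FULL
    return [n.router_id for n in neighbors if n.state >= threshold]


neighbors = [
    Neighbor("192.0.2.1", NeighborState.TWO_WAY),
    Neighbor("192.0.2.2", NeighborState.FULL),
    Neighbor("192.0.2.3", NeighborState.INIT),
]
reduced = advertised_neighbors(neighbors)
legacy = advertised_neighbors(neighbors, flooding_reduction=False)
```

In this example the neighbor that is still in the Init state is omitted in both modes, while the 2-WAY neighbor appears only when the flooding-reduction behavior is enabled.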
To further reduce the flooding of multicast OSPF packets over the link-state collection and distribution LAN, OSPF routers SHOULD send OSPF packets as unicasts. More specifically, OSPF routers SHOULD send unicast OSPF Hello packets periodically to the controllers elected as OSPF DR/BDR. In other words, OSPF routers SHOULD NOT send any OSPF Hello packet over the link-state collection and distribution LAN until they have found an OSPF DR/BDR for that LAN. Note that OSPF routers within the MSDC SHOULD NOT be elected as OSPF DR/BDR for the link-state collection and distribution LAN (this is done by setting the Router Priority of those OSPF routers to zero). As a result, OSPF routers do not see each other over the link-state collection and distribution LAN. Furthermore, OSPF routers SHOULD send all OSPF packets other than Hello packets to the controllers elected as OSPF DR/BDR as unicasts as well.
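The sending rule for a fabric router on the link-state collection and distribution LAN can be sketched as below. The class and method names are hypothetical and the addresses are documentation examples. The sketch assumes that the DR and BDR addresses are learned from the Designated Router and Backup Designated Router fields of Hello packets received from the controllers, and it does not model timers or per-packet-type handling.

```python
from dataclasses import dataclass
from typing import List, Optional

# A Router Priority of zero makes a router ineligible for DR/BDR election.
ROUTER_PRIORITY_INELIGIBLE = 0


@dataclass
class CollectionLanInterface:
    """Hypothetical per-interface state of a fabric router on the LAN."""
    router_priority: int = ROUTER_PRIORITY_INELIGIBLE
    dr_address: Optional[str] = None
    bdr_address: Optional[str] = None

    def on_controller_hello(self, dr: Optional[str], bdr: Optional[str]) -> None:
        """Record the DR/BDR addresses carried in a received Hello."""
        self.dr_address = dr
        self.bdr_address = bdr

    def unicast_targets(self) -> List[str]:
        """Controllers to which Hello and other OSPF packets are unicast.

        The list is empty until a DR or BDR has been discovered, so the
        router stays silent on the LAN until then.
        """
        return [a for a in (self.dr_address, self.bdr_address) if a]


iface = CollectionLanInterface()
before_discovery = iface.unicast_targets()
iface.on_controller_hello(dr="198.51.100.1", bdr="198.51.100.2")
after_discovery = iface.unicast_targets()
```

Because the router never multicasts on this LAN and its Router Priority is zero, the other fabric routers neither hear its Hello packets nor consider it in the DR/BDR election.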
To prevent data traffic from being forwarded across the link-state collection and distribution LAN, the cost of every OSPF router interface attached to that LAN SHOULD be set to the maximum value.
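The effect of the maximum cost on path selection can be illustrated with a small shortest-path computation. The three-router topology, the node names, and the fabric link costs below are hypothetical. The sketch assumes the 16-bit OSPF interface cost, whose largest value is 65535, and models the LAN as a transit node reached at the interface cost and left at cost zero.

```python
import heapq

MAX_INTERFACE_COST = 0xFFFF  # 65535, largest 16-bit OSPF interface cost


def shortest_path(graph, source, target):
    """Plain Dijkstra over {node: {neighbor: cost}}; returns (cost, path)."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


# Hypothetical fabric: two leaves, one spine, and the collection LAN ("lan").
graph = {
    "leaf1": {"spine1": 1, "lan": MAX_INTERFACE_COST},
    "leaf2": {"spine1": 1, "lan": MAX_INTERFACE_COST},
    "spine1": {"leaf1": 1, "leaf2": 1, "lan": MAX_INTERFACE_COST},
    "lan": {"leaf1": 0, "leaf2": 0, "spine1": 0},
}

cost, path = shortest_path(graph, "leaf1", "leaf2")
```

With these costs, the path between the two leaves runs through the spine, and the LAN would only be used if no fabric path remained.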
When a given OSPF router loses its connection to the link-state collection and distribution LAN, it SHOULD actively establish FULL adjacencies with all of its OSPF neighbors within the MSDC network. In this way, it can obtain the full LSDB of the MSDC network while flooding its self-originated LSAs to the rest of the network. That is to say, by default a given OSPF router within the MSDC network does not actively establish a FULL adjacency with an OSPF neighbor in the 2-WAY state. However, it SHOULD NOT refuse to establish a FULL adjacency with a given OSPF neighbor when it receives Database Description packets from that neighbor.
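The fallback rule above amounts to a small decision function, sketched here. The function name, its parameters, and the returned labels are hypothetical; how a router detects the loss of its LAN connection is outside the scope of this sketch.

```python
def adjacency_action(lan_connected: bool, dd_received: bool) -> str:
    """Decide how to treat a fabric neighbor that is in the 2-WAY state.

    lan_connected: the router still reaches the collection LAN.
    dd_received:   the neighbor has sent a Database Description packet.
    """
    if not lan_connected:
        # Fallback: actively bring the neighbor up to FULL.
        return "initiate"
    if dd_received:
        # The neighbor started the exchange; do not refuse it.
        return "respond"
    # Default: remain in 2-WAY and rely on the controllers.
    return "stay-2way"


default_case = adjacency_action(lan_connected=True, dd_received=False)
peer_lost_lan = adjacency_action(lan_connected=True, dd_received=True)
self_lost_lan = adjacency_action(lan_connected=False, dd_received=False)
```

The "respond" branch is what lets a router that has lost the LAN synchronize with neighbors that are themselves still attached to it.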
The controllers elected as OSPF DR/BDR send OSPF packets as multicasts or unicasts as per [RFC2328]. In addition, it is RECOMMENDED that Link State Acknowledgment packets be sent as unicasts rather than multicasts.
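A controller's choice of destination address might be sketched as follows for OSPFv2. The function name and the `retransmission` flag are hypothetical, and the mapping is a simplified reading of the [RFC2328] rules for a DR on a broadcast network (Hello packets and newly flooded Link State Updates to AllSPFRouters, everything else to the neighbor), with Link State Acknowledgments always unicast as recommended above.

```python
from enum import Enum

ALL_SPF_ROUTERS = "224.0.0.5"  # OSPFv2 AllSPFRouters multicast address


class PacketType(Enum):
    """OSPF packet types with their protocol type codes."""
    HELLO = 1
    DATABASE_DESCRIPTION = 2
    LINK_STATE_REQUEST = 3
    LINK_STATE_UPDATE = 4
    LINK_STATE_ACK = 5


def controller_destination(packet_type, neighbor_address, retransmission=False):
    """Pick the destination address for a packet sent by a DR/BDR controller."""
    if packet_type is PacketType.HELLO:
        return ALL_SPF_ROUTERS
    if packet_type is PacketType.LINK_STATE_UPDATE and not retransmission:
        return ALL_SPF_ROUTERS
    # Database exchange, retransmitted updates, and (per the recommendation
    # above) acknowledgments go directly to the neighbor.
    return neighbor_address


flood_dst = controller_destination(PacketType.LINK_STATE_UPDATE, "198.51.100.10")
ack_dst = controller_destination(PacketType.LINK_STATE_ACK, "198.51.100.10")
```

Sending acknowledgments as unicasts keeps each one away from every router other than the one whose update is being acknowledged.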
The authors would like to thank Acee Lindem and Mohamed Boucadair for their valuable comments and suggestions on this document.
TBD.