Internet-Draft | Broadening the Scope of ESP | August 2023 |
Rossberg, et al. | Expires 16 February 2024 | [Page] |
There are certain use cases where the Encapsulating Security Payload (ESP) protocol in its current form cannot reach its maximum potential regarding security, features and performance. Although these scenarios are quite different, the shortcomings could be remedied by three measures: introducing more fine-grained sub-child-SAs, adapting the ESP header and trailer format, and allowing parts of the transport layer header to be unencrypted. These mechanisms are neither completely interdependent, nor are they entirely orthogonal, as the implementation of one measure does influence the integration of another. Although an independent specification and implementation of these mechanisms is possible, it may be worthwhile to consider a combined solution to avoid a combinatorial explosion of optional features.¶
Therefore, this document does not yet propose a specific change to ESP. Instead, it explains the relevant scenarios, details possible modifications of the protocol, collects arguments for (and against) these changes, and discusses their implications.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 16 February 2024.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
This document does not (yet) describe an addition to IPsec. Rather, it attempts to describe scenarios where ESP currently cannot be used optimally. Afterwards, possible solutions for those scenarios are discussed and evaluated.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Especially for, but not limited to, intra-data-center traffic, there are several challenges when deploying IPsec. In particular, these challenges originate in implementing one of the following techniques, or a combination thereof:¶
As these challenges share the same root causes and can therefore be solved by the same set of measures, they are addressed jointly in this document. In particular, these root causes are:¶
Before discussing possible solutions, the following sections will elaborate how the techniques above collide with these root causes.¶
Because IPsec is often processed in software, small-packet throughput significantly above 10 Gbit/s is currently only achievable by scaling to multiple CPU cores. However, this scaling only works if cores do not have to synchronize tightly. In particular, it is impossible to synchronize anti-replay windows and sequence counters efficiently, even when using atomic CPU instructions. Detailed explanations may be found in [I-D.pwouters-ipsecme-multi-sa-performance]. Consequently, scaling over multiple cores leads to multiple packet streams, one per processing core. These streams may advance independently and thus introduce packet reordering. This reordering conflicts with the concept of an anti-replay window, which does not tolerate packets that are too far out of order. Consequently, packets might be dropped unpredictably.¶
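The effect of such reordering on a single shared window can be illustrated with a small simulation. This is only a sketch under stated assumptions: the ReplayWindow class is a simplified bitmap window, not the normative ESP procedure, and the two-core arrival model (odd sequence numbers from one core arriving in order, even ones from a second core arriving `lag` packet slots late) is hypothetical.

```python
class ReplayWindow:
    """Simplified bitmap-based anti-replay window (illustrative only)."""

    def __init__(self, size=64):
        self.size = size
        self.top = 0       # highest sequence number accepted so far
        self.bitmap = 0    # bit i set => sequence number (top - i) was seen

    def accept(self, seq):
        if seq < 1:
            return False                      # sequence numbers start at 1
        if seq > self.top:                    # new highest number: slide window
            shift = seq - self.top
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:               # left of the window: too old
            return False
        if self.bitmap & (1 << offset):       # already seen: replay
            return False
        self.bitmap |= 1 << offset
        return True


def simulate(num_packets=1000, lag=200, window=64):
    """Return how many packets a single shared window drops.

    Hypothetical setup: odd sequence numbers are sent by core A and arrive
    in order; even ones are sent by core B and arrive `lag` slots late.
    """
    arrivals = sorted(range(1, num_packets + 1),
                      key=lambda s: s if s % 2 else s + lag)
    rw = ReplayWindow(window)
    return sum(0 if rw.accept(s) else 1 for s in arrivals)
```

In this model nothing is dropped as long as the lag stays below the window size, whereas a lag of 200 slots against a 64-packet window causes most of the second core's packets to fall to the left of the window and be discarded.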
Similarly, traffic may be categorized into different classes to provide quality of service. QoS classes do not belong to the traffic selector of a Child SA. So using different QoS classes for the same traffic selector will introduce reordering of packets within a child SA. In contrast to multicore software processing, this type of packet reordering is intentional and not accidental. The consequences, however, are comparable.¶
A sender may also decide to send packets to a single receiver via multiple paths, e.g., by using multiple uplinks in an SD-WAN scenario. Depending on the characteristics of the uplinks, this shows similarities to the multicore scenario (uplinks with relatively similar characteristics) or the QoS scenario (uplinks with rather different characteristics).¶
A multicast scenario with only a single sender does not pose an issue, as the sender can simply increment its sequence counter. Each receiver has a complete view of the traffic and can thus maintain its replay window as usual. But as soon as there are multiple senders, they would need to coordinate their sequence number usage, which is even less efficiently implementable than in the multicore case. Finally, as currently only the lower half of the sequence number is actually transmitted in the packet, a receiver joining late cannot guess the respective upper half. Therefore, replay protection is usually disabled in multicast scenarios.¶
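The late-joiner problem can be made concrete with a sketch of how a receiver reconstructs the upper half of an extended sequence number. The function below loosely follows the approach described for extended sequence numbers in [RFC4303]; it is a simplified illustration, not a verbatim rendering, and the function and parameter names are hypothetical.

```python
MOD = 1 << 32  # the transmitted lower half is 32 bits wide


def infer_seq_high(seq_low, top_high, top_low, window):
    """Guess the upper 32 bits for a received lower half `seq_low`.

    (top_high, top_low) is the highest sequence number the receiver has
    authenticated so far, i.e. the right edge of its replay window.
    """
    bottom_low = (top_low - window + 1) % MOD   # left edge, lower half
    if top_low >= window - 1:
        # The window lies within a single 2^32 subspace.
        return top_high if seq_low >= bottom_low else top_high + 1
    # The window straddles a wrap of the lower half.
    return top_high - 1 if seq_low >= bottom_low else top_high
```

The inference depends entirely on `top_high` and `top_low`, which a receiver only has after following the stream from the start. A receiver that joins late has no such state and therefore nothing to base the guess on.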
Increasing link speeds and basically constant packet sizes lead to higher and higher packet rates that need to be processed. Multi-core processing as described in Section 2.1 can only partially compensate for this, as, e.g., a single TCP flow cannot be parallelized across multiple cores. Thus, modern implementations must be highly optimized to cope with the high packet rates. However, ESP's split header/trailer makes this unnecessarily complicated, as header and trailer end up in different cache lines. Similarly, the alignment to a 4-byte boundary is too short for modern architectures. Please refer to [PRGS20] for a more elaborate discussion.¶
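The cache-line argument can be checked with simple offset arithmetic. The sketch below rests on several assumptions that are not taken from this document: 64-byte cache lines, a buffer in which the ESP header starts on a cache-line boundary, an 8-byte IV, a 16-byte ICV, and padding to a 4-byte boundary. All names are hypothetical.

```python
CACHE_LINE = 64  # assumption: 64-byte cache lines


def esp_layout(payload_len, iv_len=8, icv_len=16, align=4):
    """Return byte offsets and cache-line indices of ESP header and trailer.

    Offsets are relative to the start of the ESP header (SPI field).
    """
    header_off = 0
    payload_off = 8 + iv_len                   # SPI (4) + sequence number (4) + IV
    pad_len = (-(payload_len + 2)) % align     # pad so payload+pad+2 is aligned
    trailer_off = payload_off + payload_len + pad_len  # Pad Length, Next Header
    total_len = trailer_off + 2 + icv_len
    return {
        "header_line": header_off // CACHE_LINE,
        "trailer_off": trailer_off,
        "trailer_line": trailer_off // CACHE_LINE,
        "total_len": total_len,
    }
```

Under these assumptions, a 1400-byte payload puts the trailer fields more than twenty cache lines away from the header, so processing a packet touches at least two distinct lines for ESP metadata alone. Only very small packets keep header and trailer in the same line.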
Software-defined networking often uses information from the transport header, e.g., the port numbers, for identifying flows, steering and microsegmentation. This currently cannot be combined with ESP, as this information is encrypted. On the other hand, in many scenarios this encryption is intended to avoid leaking flow information. An adaptable approach could cater for both needs.¶
Besides the obvious requirement of not impairing security, the following shall be considered:¶
There are several approaches to deal with the presence of multiple independent packet streams:¶
SDN could be enabled by:¶
The ESP header/trailer format could be modernized by:¶
A straightforward solution would be to simply disable replay protection. For example, PSP was designed without replay protection (see [PSP]).¶
Advantages:¶
Disadvantages:¶
This approach may be acceptable for specific scenarios (e.g., multicast), but not for the general case. It is especially problematic for any multicore scenarios, as the status quo without parallelization provides replay protection. This approach is therefore not discussed any further.¶
For some scenarios, it might be reasonable to set up multiple, separate IKE SAs.¶
Advantages:¶
Disadvantages:¶
In summary, the main issue of this approach is scalability. It may be appropriate for certain scenarios where the total number of additional IKE SAs is low. It is not suited for general usage in large deployments. In particular, deploying several of the techniques described in Section 2 leads to a combinatorial explosion of the number of required SAs. For example, if one intends to transport traffic with 8 QoS classes between two gateways with 32 cores, there would already be 256 SAs solely between these two gateways. Even if the data plane and IKE daemon can support such a setup, too much complexity may be pushed into the operational domain. Therefore, this approach is not generally applicable.¶
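The growth in the number of SAs is a plain product over the combined techniques, as the following sketch shows. The function name and the keyword names are hypothetical, and the count follows the simple per-gateway-pair counting used in the example above.

```python
from math import prod


def required_sas(**dimensions):
    """Number of separate SAs when every combination needs its own SA."""
    return prod(dimensions.values())
```

For instance, `required_sas(qos_classes=8, cores=32)` reproduces the 256 SAs from the example, and adding a further hypothetical dimension such as `paths=4` multiplies the result again.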
This approach has been proposed recently as a draft [I-D.pwouters-ipsecme-multi-sa-performance]. The draft is restricted to the multicore scenario outlined in Section 2.1. It is similar to establishing multiple IKE SAs, but avoids a significant portion of their overhead by restricting the multiple instantiations to child SAs.¶
Advantages:¶
Disadvantages:¶
Using multiple child SAs is a significant step forward for the multicore scenario. It is a simple (in the positive sense), straightforward solution harvesting low-hanging fruit. But this simplicity inherits some drawbacks from the multiple-IKE-SAs approach, caused by the independence of the child SAs regarding setup, state, rekeying and failure. These disadvantages get worse the more child SAs are required. Therefore, the per-CPU child SAs approach is not an ideal fit for the other scenarios described in Section 2, or a combination of the scenarios.¶
This approach differs from the previous two as it does not attempt to create multiple replay windows, but instead accommodates the traffic within a single anti-replay window. This fits the QoS scenario depicted in Section 2.2 if any higher-prioritized traffic does not advance the anti-replay window too far for the lower-prioritized traffic. The idea is not applicable to the multicore or multicast scenarios, as larger windows can only solve the problem of packets being reordered by the network, but do not allow unsynchronized sequence counters (as, e.g., [RFC4303] requires strict monotonicity).¶
Advantages:¶
Disadvantages:¶
As discussed previously, this approach is only suitable for the QoS and multipath scenarios. A comparison with other mechanisms requires an estimation of the required window sizes. The time low-priority packets may be delayed by shapers and queues depends on many parameters, e.g., the actual and admitted traffic rates, the sizes of admissible bursts, strict-priority scheduling, etc.¶
An attempt to simplify the problem is to make windows large enough to admit packets that are delayed up to a certain time threshold T. Consider a packet being "stuck" in the network due to other packets being prioritized. Those packets advance the replay window. Let their Ethernet size be S and their throughput TP. It makes sense for TP to be an interface speed, as otherwise the delayed packet would not be stuck. We therefore end up with the following packet rates R:¶
S [byte] | TP [Gbit/s] | R [Mp/s] |
---|---|---|
64 | 10 | 14.881 |
1518 | 10 | 0.813 |
64 | 100 | 148.810 |
1518 | 100 | 8.127 |
For T = 100 ms, this would mean that the windows must, in the worst case, accommodate between roughly 81,000 and 14.9 million packets. It might be argued that the upper bound is currently unrealistic, as it would require a 100 Gbit/s link to be saturated with small, prioritized packets. On the other hand, 100 ms is the acceptable delay for VoIP, whereas for applications with low priority demands, it might make sense to deliver even older packets.¶
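The figures in the table and the resulting window sizes can be reproduced with the short calculation below. It is a sketch that assumes 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap); this overhead is not stated in the text but matches the tabulated rates. The function names are hypothetical.

```python
ETH_OVERHEAD_BYTES = 20  # assumption: 8 B preamble/SFD + 12 B inter-frame gap


def packet_rate(frame_bytes, gbit_per_s):
    """Packets per second on a saturated link for a given Ethernet frame size."""
    bits_on_wire = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return gbit_per_s * 1e9 / bits_on_wire


def window_packets(frame_bytes, gbit_per_s, t_seconds):
    """Packets that may overtake a packet delayed by up to t_seconds."""
    return packet_rate(frame_bytes, gbit_per_s) * t_seconds
```

For T = 100 ms, `window_packets(1518, 10, 0.1)` yields about 81,000 packets and `window_packets(64, 100, 0.1)` about 14.9 million, which are the two extremes of the range discussed above.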
The final possibility is standardizing a new approach that tries to combine the advantages of the approaches discussed previously. In essence, it is the idea of allowing multiple sequence counters (and thus use multiple anti-replay windows) per child SA. These sequence counters must allow incrementing independently of each other, making the approach applicable to all outlined scenarios. It is also possible to think of the individual counter/windows pairs as sub-SAs within a child SA.¶
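A receiver for this idea could be sketched as follows. This is a minimal illustration, assuming that each packet somehow carries a sub-SA identifier; how that identifier is signalled is left open here, and the class name, the `sub_sa_id` parameter and the bitmap window are hypothetical simplifications.

```python
class SubSaReceiver:
    """One child SA holding an independent counter/window pair per sub-SA."""

    def __init__(self, window=64):
        self.window = window
        self.state = {}   # sub_sa_id -> (top, bitmap)

    def accept(self, sub_sa_id, seq):
        if seq < 1:
            return False
        top, bitmap = self.state.get(sub_sa_id, (0, 0))
        if seq > top:
            # New highest number for this sub-SA: slide only its own window.
            mask = (1 << self.window) - 1
            bitmap = ((bitmap << (seq - top)) | 1) & mask
            top = seq
        else:
            offset = top - seq
            if offset >= self.window or bitmap & (1 << offset):
                return False              # too old or already seen
            bitmap |= 1 << offset
        self.state[sub_sa_id] = (top, bitmap)
        return True
```

Because each sub-SA advances only its own window, a sub-SA at sequence number 5 is unaffected by another sub-SA that has already reached 1000, while replays within one sub-SA are still rejected.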
First of all, receivers must be able to distinguish those sub-SAs. There are multiple possibilities to achieve this:¶
In any case, the approach necessitates some additional clarifications:¶
Using a single key for multiple sub-SAs has implications on security:¶
Advantages:¶
Disadvantages:¶
Compared to setting up separate IKE or child SAs, it might be argued that the idea of sub-SAs keeps the complexity and overhead away from the VPN's operation. Furthermore, storing an SPI, a 64-bit sequence number, and a replay window for 64 packets for 64 different QoS classes requires a total of 10240 bits. This is significantly less than even the lower bound established for the approach described in Section 4.4. However, of the discussed alternatives, it is the most complex change to existing standards and implementation semantics.¶
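The storage figure follows from a simple per-sub-SA sum. The sketch below assumes a 32-bit SPI and one bit of window state per packet; the constant and function names are hypothetical.

```python
SPI_BITS = 32   # assumption: standard 32-bit SPI
SEQ_BITS = 64   # full 64-bit sequence number


def sub_sa_state_bits(num_sub_sas, window_packets=64):
    """Receiver state in bits, with one window bit per tracked packet."""
    return num_sub_sas * (SPI_BITS + SEQ_BITS + window_packets)
```

With 64 sub-SAs and 64-packet windows this gives 64 * 160 = 10240 bits, compared with a single bitmap of roughly 80,000 bits for the smallest window considered in Section 4.4.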
It would be possible to add an encryption offset to ESP, signalling that a number of bytes at the beginning of the packet are not encrypted. Note that they can still be authenticated, e.g., as Additional Authenticated Data in modern AEAD modes. This approach was chosen for [PSP].¶
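The effect of such an offset can be sketched without any actual cryptography. In the hypothetical helpers below, the first `crypt_offset` bytes of the inner packet stay in cleartext (and would be fed to the AEAD as additional authenticated data), while the remainder would be encrypted. The names and the choice of a UDP header as the example are assumptions.

```python
import struct


def split_at_offset(inner, crypt_offset):
    """Split an inner packet into (cleartext prefix, part to encrypt)."""
    if not 0 <= crypt_offset <= len(inner):
        raise ValueError("offset outside of packet")
    return inner[:crypt_offset], inner[crypt_offset:]


def visible_ports(cleartext):
    """What an on-path device can read: source and destination port."""
    return struct.unpack("!HH", cleartext[:4])


# Example: a UDP datagram with an offset of 4 exposes only the two ports.
payload = b"secret"
udp = struct.pack("!HHHH", 40000, 53, 8 + len(payload), 0) + payload
clear, to_encrypt = split_at_offset(udp, 4)
```

An offset of 0 corresponds to today's behaviour with everything encrypted, so the sender can choose per SA how much flow information it is willing to reveal.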
Advantages:¶
Disadvantages:¶
As a more drastic approach, one could insert the ESP header not between the network and transport layer headers, but, e.g., between the transport layer header and the payload. Alternatively, the transport layer header could be "copied out".¶
Advantages:¶
Disadvantages:¶
Transmitting the entire sequence number makes processing easier, and enables receivers to join late in multicast scenarios. In the later versions of [I-D.ponchon-ipsecme-anti-replay-subspaces], it was proposed to transmit a larger portion of the sequence number.¶
Advantages:¶
Disadvantages:¶
The trailer could be removed by moving its fields to the header. Contrary to Ethernet, there appears to be no requirement for "cut-through" packet processing.¶
Advantages:¶
Disadvantages:¶
Dropping the 4-byte alignment probably does not warrant a change of ESP on its own. However, when the ESP frame format is updated for other reasons, it is worth considering, as modern architectures and their SIMD instructions typically require larger alignment. Please note that removing the cryptographic padding (which is not required for all current AEAD modes) would allow even more simplification, but would also significantly limit cryptographic agility.¶
Advantages:¶
Disadvantages:¶
Please note: For any of the sub-child-SA approaches it is essential for the receiver to steer traffic generated by a CPU core of the sender to a determined CPU core that handles the incoming traffic. For example, if two CPU cores at the sender generate large amounts of traffic in one QoS class, it is not sufficient to perform RSS only on the child SAs or sub-child SAs, as this would not prevent the two streams from being mapped to the same receiver CPU.¶
This memo includes no request to IANA.¶
TODO: In its current state, this draft discusses multiple alternatives. Please refer to Section 4 for a discussion including remarks on security.¶