Broadening the Scope of Encapsulating Security Payload (ESP) Protocol

Internet-Draft	Broadening the Scope of ESP	August 2023
Rossberg, et al.	Expires 16 February 2024

Abstract

There are certain use cases where the Encapsulating Security Payload (ESP) protocol in its current form cannot reach its maximum potential regarding security, features and performance. Although these scenarios are quite different, the shortcomings could be remedied by three measures: Introducing more fine-grained sub-child-SAs, adapting the ESP header and trailer format, and allowing parts of the transport layer header to be unencrypted. These mechanisms are neither completely interdependent, nor are they entirely orthogonal, as the implementation of one measure does influence the integration of another. Although an independent specification and implementation of these mechanisms is possible, it may be worthwhile to consider a combined solution to avoid a combinatorial explosion of optional features.¶

Therefore, this document does not yet propose a specific change to ESP. Instead, it explains the relevant scenarios, details possible modifications of the protocol, collects arguments for (and against) these changes, and discusses their implications.¶

2. Envisioned Scenarios

Especially for, but not limited to, intra-data-center traffic, there are several challenges when deploying IPsec. In particular, these challenges originate in implementing one of the following techniques, or a combination thereof:¶

Multicore Software Processing¶
Quality-of-Service (QoS)¶
Multipath¶
Multicast¶
High-Speed Links¶
Software-Defined Networking (SDN)¶

As these challenges are due to the same root causes and can therefore be solved by the same set of measures, they are addressed jointly in this document. In particular, these root causes are:¶

The idea of an SA forming a single stream of packets that is generally in-order.¶
The concept of a flow as a five-tuple, including TCP and UDP port numbers that are invisible to the network equipment due to ESP's encryption.¶
The header/trailer format that does not always match current hardware's realities (4-byte sequence number, header/trailer split, and the 4-byte alignment requirement).¶

Before discussing possible solutions, the following sections will elaborate how the techniques above collide with these root causes.¶

2.1. Multicore Software Processing

Due to IPsec being often processed in software, small-packet throughputs of significantly above 10 Gbit/s are currently only achievable when scaling to multiple CPU cores. However, this scaling only works if cores do not have to synchronize tightly. In particular, it is impossible to synchronize anti-replay windows and sequence counters efficiently, even when using atomic CPU instructions. Detailed explanations may be found in [I-D.pwouters-ipsecme-multi-sa-performance]. Consequently, scaling over multiple cores leads to multiple packet streams, one per processing core. These streams may advance independently, and thus introduce packet reordering. This reordering contradicts the concept of an anti-replay window, which does not allow for packets being too far out of order. Consequently, packets might be dropped unpredictably.¶
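The effect can be illustrated with a minimal sketch. It assumes a simplified sliding-window check in the style of [RFC4303] (no extended sequence numbers) and a hypothetical sender that hands out sequence numbers alternately to two cores, one of which runs `lag` packets ahead of the other. The class and function names are illustrative only and are not taken from any implementation.

```python
class ReplayWindow:
    """Simplified sliding anti-replay window (no extended sequence numbers)."""

    def __init__(self, size=64):
        self.size = size
        self.top = 0      # highest sequence number accepted so far
        self.bitmap = 0   # bit i set => sequence number (top - i) was seen

    def accept(self, seq):
        if seq > self.top:                      # packet advances the window
            shift = seq - self.top
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:                 # too old: left of the window
            return False
        if self.bitmap & (1 << offset):         # duplicate
            return False
        self.bitmap |= 1 << offset
        return True


def simulate(lag, total=400, window=64):
    """Count drops when one core's stream runs `lag` packets ahead of the other."""
    fast = list(range(2, total + 1, 2))   # sequence numbers sent by core 0
    slow = list(range(1, total + 1, 2))   # sequence numbers sent by core 1
    arrival = fast[:lag]                  # head start of the faster core
    for i in range(len(slow)):
        if lag + i < len(fast):
            arrival.append(fast[lag + i])
        arrival.append(slow[i])
    rw = ReplayWindow(window)
    return sum(0 if rw.accept(seq) else 1 for seq in arrival)


print("drops with synchronized cores:", simulate(lag=0))
print("drops with one core 100 packets ahead:", simulate(lag=100))
```

As long as both streams stay within one window of each other, nothing is lost. Once the faster stream has advanced the window past the slower one, every packet of the slower stream is rejected as too old, although none of them is a replay.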

2.2. Implementing QoS mechanisms

Similarly, traffic may be categorized into different classes to provide quality of service. QoS classes do not belong to the traffic selector of a Child SA. So using different QoS classes for the same traffic selector will introduce reordering of packets within a child SA. In contrast to multicore software processing, this type of packet reordering is intentional and not accidental. The consequences, however, are comparable.¶

2.3. Multipath

A sender may also decide to send packets to a single receiver via multiple paths, e.g., by using multiple uplinks in an SD-WAN scenario. Depending on the characteristics of the uplinks, this shows similarities to the multicore scenario (uplinks with relatively similar characteristics) or the QoS scenario (uplinks with rather different characteristics).¶

2.4. Multicast

A multicast scenario with only a single sender does not pose an issue, as the sender can simply increment its sequence counter. Each receiver has a complete view of the traffic and can thus maintain its replay window as usual. But as soon as there are multiple senders, they would need to coordinate their sequence number usage, which is even less efficiently implementable than in the multicore case. Finally, as currently only the lower half of the sequence number is actually transmitted in the packet, a receiver joining late cannot guess the respective upper half. Therefore, replay protection is usually disabled in multicast scenarios.¶
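The late-joiner problem can be made concrete with a short sketch. It is modelled on the procedure for inferring the upper 32 bits of an extended sequence number described in Appendix A of [RFC4303], and assumes a 64-packet window. The function name and the example values are hypothetical.

```python
MOD = 1 << 32


def infer_high(seq_low, top, window=64):
    """Guess the upper 32 bits of a 64-bit sequence number from receiver state.

    seq_low: the 32 bits carried in the packet
    top:     highest 64-bit sequence number authenticated so far
    """
    t_high, t_low = divmod(top, MOD)
    bottom_low = (t_low - window + 1) % MOD     # lower edge of the window
    if t_low >= window - 1:                     # window lies within one 2^32 block
        return t_high if seq_low >= bottom_low else t_high + 1
    # window spans two 2^32 blocks
    return t_high - 1 if seq_low >= bottom_low else t_high


true_seq = (5 << 32) | 1001                     # sender is already in block 5

# A receiver that has followed the stream knows the current block.
established = infer_high(true_seq % MOD, top=(5 << 32) | 1000)

# A receiver joining late starts with empty state and guesses block 0.
late_joiner = infer_high(true_seq % MOD, top=0)

print(established, late_joiner)
```

The established receiver reconstructs the correct upper half, while the late joiner derives a different value from its empty state. Integrity verification, which covers the full 64-bit number, would then fail for the late joiner.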

2.5. High-Speed Links

Increasing link speeds and basically constant packet sizes lead to higher and higher packet rates that need to be processed. Multi-core processing as described in Section 2.1 can only partially compensate for this, as, e.g., a single TCP flow cannot be parallelized across multiple cores. Thus, modern implementations must be highly optimized to cope with the high packet rates. However, ESP's split header/trailer makes this unnecessarily complicated, as header and trailer end up in different cache lines. Similarly, the alignment to a 4-byte boundary is too small for modern architectures. Please refer to [PRGS20] for a more elaborate discussion.¶

2.6. Software-Defined Networking (SDN)

Software-defined networking often uses information from the transport header, e.g., the port numbers, for identifying flows, steering and microsegmentation. This can currently not be combined with ESP, as this information is encrypted. On the other hand, in many scenarios this is intended to avoid leaking flow information. An adaptable approach could cater for both needs.¶

4. Discussion of possible approaches

There are several approaches to deal with the presence of multiple independent packet streams:¶

Disabling replay protection (Section 4.1)¶
Using multiple IKE SAs (Section 4.2)¶
Using multiple child SAs (Section 4.3)¶
Increasing anti-replay window sizes (Section 4.4)¶
Using sub-child SAs (Section 4.5)¶

SDN could be enabled by:¶

Using an encryption offset (Section 4.6)¶
Moving the ESP header (Section 4.7)¶

The ESP header/trailer format could be modernized by:¶

Transmitting the sequence number entirely (Section 4.8)¶
Removing the trailer by moving its fields to the header (Section 4.9)¶
Dropping the 4-byte alignment requirement (Section 4.10)¶

4.1. Disabling Replay Protection

A straightforward solution would be to simply disable replay protection. For example, PSP was designed without replay protection (see [PSP]).¶

Advantages:¶

Trivially solves all the reordering and synchronization issues discussed previously. Note: This may still violate existing RFCs, which require sequence numbers to be generated in order, but this violation should not have an impact.¶

Disadvantages:¶

The approach significantly lowers the level of security. Although most upper layer protocols (e.g., TCP) provide protection from duplicated data, this cannot be assumed for the general case. Even if the duplicates are never delivered to a user application, they usually do trigger responses from the receivers' network stack, e.g., TCP RSTs or ICMP errors. This in turn enables an attacker to trigger ciphertext generation, possibly facilitating subsequent attacks. Such attacks have practically been used against WiFi encryption in the early 2000s.¶
It is unclear how an SA protecting multiple plaintext flows can be distributed to multiple cores on the receiver. Receive-Side Scaling (RSS) or explicit steering rules need some indication which packets carry the same plaintext flow and thus need to be sent to the same core. Otherwise, intra-flow reordering is introduced, which may severely disturb higher level protocols, e.g., TCP's congestion control or VoIP audio streams. Thus, efficient multicore processing is not possible for the receiver.¶

This approach may be acceptable for specific scenarios (e.g., multicast), but not for the general case. It is especially problematic for any multicore scenarios, as the status quo without parallelization provides replay protection. This approach is therefore not discussed any further.¶

4.2. Using multiple IKE SAs

For some scenarios, it might be reasonable to set up multiple, separate IKE SAs.¶

Advantages:¶

As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.¶
Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.¶
No changes to existing standards required.¶

Disadvantages:¶

There is a time and communication overhead due to the negotiation of every IKE SA requiring network round trips, packet processing, asymmetric cryptography, etc. The initial setup could be accelerated by a reactive instead of a proactive SA negotiation, i.e., delaying the setup of the SA for a specific core or QoS class until the first packet arrives on the core or with the respective QoS tag. However, this is a highly debatable strategy, as it induces either drops or large delays for the initial packets of these flows.¶
There is a state/memory overhead due to completely separate state of every SA, e.g., traffic selectors, keys, lifetimes. To a large extent, these states will hold identical information.¶
During operation, there is overhead due to the regular rekeying of each SA and, if enabled, dead peer detection.¶
Additional effort to configure the required number of SAs must be made. Furthermore, monitoring larger networks becomes more complex, as multiple SAs now map to identical connections.¶
The failure model is unspecified if a subset of the IKE SAs cannot be established. For example, in the multicore scenario, this leads to packet loss or at least performance fluctuations on some plaintext flows, depending on the core they are processed on. Such situations historically have a bad track record, e.g., partially loading websites with (non-persistent) HTTP; SIP-working-but-RTP-failing conditions in VoIP, etc.¶

In summary, the main issue of this approach is scalability. It may be appropriate for certain scenarios, where the total number of additional IKE SAs is low. It is not suited for general usage in large deployments. In particular, deploying multiple of the techniques described in Section 2 leads to a combinatorial explosion of the number of required SAs. For example, if one intends to transport traffic with 8 QoS classes between two gateways with 32 cores, there would be already 256 SAs solely between these two gateways. Even if the data plane and IKE daemon can support such a setup, there may be too much complexity pushed into the operational domain. Therefore, this approach is not generally applicable.¶
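The growth in the number of SAs is a plain product over the deployed techniques, as the following sketch shows. The function name is hypothetical, and the third dimension (two uplinks) is an assumption added only to illustrate how quickly the count grows.

```python
from math import prod


def required_sas(**dimensions):
    """Number of SAs when every combination of the given dimensions needs its own SA."""
    return prod(dimensions.values())


# The example from the text: 8 QoS classes between gateways with 32 cores.
print(required_sas(qos_classes=8, cores=32))

# Hypothetical extension: the same setup spread over two uplinks.
print(required_sas(qos_classes=8, cores=32, paths=2))
```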

4.3. Using multiple (per-CPU) child SAs

This approach has been proposed recently as a draft [I-D.pwouters-ipsecme-multi-sa-performance]. The draft is restricted to the multicore scenario outlined in Section 2.1. It is similar to establishing multiple IKE SAs, but avoids a significant portion of their overhead by restricting the multiple instantiations to child SAs.¶

Advantages:¶

There is significantly less overhead compared to setting up independent IKE SAs.¶
As there are independent sequence numbers and anti-replay windows, there is no need to synchronize between multiple CPU cores or senders.¶
Distinct SPIs allow RSS or explicit steering, and thus enable processing without reordering.¶
The draft incurs only a small change in standards and existing source code, as multiple child SAs are already possible in IKEv2 [RFC7296], and the draft simply adds a mechanism to negotiate them explicitly.¶

Disadvantages:¶

Due to the setup of child SAs via separate CREATE_CHILD_SA exchanges, there is still communications overhead, especially for larger numbers of SAs. As for multiple IKE SAs, both a proactive and a reactive setup are possible, resulting in a longer establishment time or a less predictable runtime behavior, respectively.¶
There is still some per-child-SA state overhead in the data plane. However, as the IKE daemon knows about those SAs being per-queue children of the same IKE SA, an optimized implementation might be able to reduce that overhead to a minimum.¶
During operation, there is overhead traffic due to the regular rekeying.¶
Similar to separate IKE SAs, there is the possibility of a partially working SA if some of the child SAs fail to set up. It is not immediately clear what the correct reaction should be, especially in the scope of a large VPN deployment, compared to the all-or-nothing failure model when parallel child SAs are not used.¶

Using multiple child SAs is a significant step forward for the multicore scenario. It is a simple (in the positive sense), straightforward solution harvesting low-hanging fruit. But this simplicity inherits some drawbacks from the multiple-IKE-SAs approach caused by the independence of the child SAs regarding setup, state, rekeying and failure. These disadvantages get worse the more child SAs are required. Therefore, the per-CPU child SAs approach is not an ideal fit for the other scenarios described in Section 2, or a combination of the scenarios.¶

4.4. Increasing anti-replay window sizes

This approach differs from the previous two as it does not attempt to create multiple replay windows, but to accommodate the traffic within a single anti-replay window. This fits the QoS scenario depicted in Section 2.2 if any higher-prioritized traffic does not advance the anti-replay window too far for the lower-prioritized traffic. The idea is not applicable to the multicore or multicast scenarios, as larger windows can only solve the problem of packets being reordered by the network, but do not allow unsynchronized sequence counters (as, e.g., [RFC4303] requires strict monotonicity).¶

Advantages:¶

No changes to standards are required, as the anti-replay window size is a local matter.¶
The approach inherits the advantages of a single child SA, e.g., there is no setup overhead, less state overhead than with multiple child SAs (only the larger replay windows) and no complex failure model.¶

Disadvantages:¶

Even in software implementations, the anti-replay windows cannot grow indefinitely large. Especially in latency-sensitive deployments, i.e., where one would use QoS, achieving throughput above 10 Gbit/s depends on the ability to keep state in the CPU caches, even for a larger number of peers.¶
Complex configuration: Choosing a correct value for the window size depends not only on the number of QoS classes, but also on the maximum divergence of sequence numbers, which in turn depends on the QoS configuration, the possible throughput and the traffic mix.¶

As discussed previously, this approach is only suitable for the QoS and multipath scenarios. A comparison with other mechanisms requires an estimation of the required window sizes. The time low-priority packets may be delayed by shapers and queues depends on many parameters, e.g., the actual and admitted traffic rates, the sizes of admissible bursts, strict-priority scheduling, etc.¶

An attempt to simplify the problem is to make windows large enough to admit packets that are delayed up to a certain time threshold T. Consider a packet being "stuck" in the network due to other packets being prioritized. Those packets advance the replay window. Let their Ethernet size be S and their throughput TP. It makes sense for TP to be an interface speed; otherwise, the delayed packet would not be stuck. We therefore end up with the following packet rates R:¶

Table 1: Packet rates
S [byte]	TP [Gbit/s]	R [Mp/s]
64	10	14.881
1518	10	0.813
64	100	148.810
1518	100	8.127

For T = 100 ms, this would mean that the windows must, in the worst case, accommodate between 80,000 and 14.8 million packets. It might be argued that the higher boundary is currently unrealistic, as it would require a 100 Gbit/s link to be saturated with small, prioritized packets. On the other hand, 100 ms is the acceptable delay for VoIP, whereas for applications with low priority demands, it might make sense to deliver even older packets.¶
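The figures in Table 1 and the resulting window sizes can be reproduced with a few lines. The sketch assumes that each frame occupies 20 additional bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap). This overhead is not stated in the text, but it is the value that yields the rates listed in the table. The function names are illustrative.

```python
ETH_OVERHEAD = 20  # assumption: preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12)


def packet_rate(frame_bytes, link_gbps):
    """Packets per second on a saturated link for a given Ethernet frame size S."""
    return link_gbps * 1e9 / ((frame_bytes + ETH_OVERHEAD) * 8)


def window_packets(frame_bytes, link_gbps, threshold_s):
    """Packets that can overtake one delayed packet within threshold_s seconds."""
    return int(packet_rate(frame_bytes, link_gbps) * threshold_s)


for size, speed in [(64, 10), (1518, 10), (64, 100), (1518, 100)]:
    rate = packet_rate(size, speed)
    window = window_packets(size, speed, 0.1)   # T = 100 ms
    print(f"{size:>5} B  {speed:>3} Gbit/s  {rate / 1e6:8.3f} Mp/s  window: {window:,}")
```

With one bit of bitmap per packet, the window for T = 100 ms would therefore occupy roughly 10 KB in the most favorable case and close to 2 MB in the least favorable one.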

4.5. Using Sub-Child SAs

The final possibility is standardizing a new approach that tries to combine the advantages of the approaches discussed previously. In essence, it is the idea of allowing multiple sequence counters (and thus use multiple anti-replay windows) per child SA. These sequence counters must allow incrementing independently of each other, making the approach applicable to all outlined scenarios. It is also possible to think of the individual counter/window pairs as sub-SAs within a child SA.¶

First of all, receivers must be able to distinguish those sub-SAs. There are multiple possibilities to achieve this:¶

Using the SPI: The SPI would be allocated per sub-SA, i.e., a range of SPIs would belong to a single child SA. Therefore, it is possible to embed, e.g., the ID of the sending core in some bits of the SPI.¶
Using the sequence number: Some bits of the sequence number would be used to indicate the sub-SA. This approach reduces the available sequence numbers. Note that the consequences depend on whether the traffic is distributed evenly among the individual sub-SAs (e.g., multicore scenario) or not (e.g., QoS scenario).¶
Using an additional field: Of course, it is also possible to introduce a new field to the ESP header. This can lead to a simpler design, but also constitutes the largest change to existing standards. It was proposed by [I-D.ponchon-ipsecme-anti-replay-subspaces] in the later versions of the draft.¶
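The first option, encoding the sub-SA in the SPI, can be sketched as follows. The split of 24 bits for the child SA and 8 bits for the sub-SA ID is an assumption chosen for illustration, as are the function names. The text does not fix a particular layout.

```python
SUB_SA_BITS = 8                        # assumption: up to 256 sub-SAs per child SA
SUB_SA_MASK = (1 << SUB_SA_BITS) - 1


def make_spi(base_spi, sub_sa_id):
    """Combine a child SA's base SPI with a sub-SA ID in the low bits."""
    if not 0 <= base_spi <= 0xFFFFFFFF or base_spi & SUB_SA_MASK:
        raise ValueError("base SPI must be 32 bits with its low bits clear")
    if not 0 <= sub_sa_id <= SUB_SA_MASK:
        raise ValueError("sub-SA ID out of range")
    return base_spi | sub_sa_id


def split_spi(spi):
    """Recover (base SPI, sub-SA ID) on the receiver."""
    return spi & ~SUB_SA_MASK & 0xFFFFFFFF, spi & SUB_SA_MASK


# Sender side: core 5 of the child SA whose SPI range starts at 0xC0FFEE00.
spi = make_spi(0xC0FFEE00, 5)

# Receiver side: the base SPI selects the child SA (keys, selectors),
# the sub-SA ID selects the sequence counter and anti-replay window.
base, sub_sa = split_spi(spi)
print(hex(spi), hex(base), sub_sa)
```

Because the sub-SA ID is part of a field that NICs can commonly match on, a layout of this kind would also let the receiver steer each sub-SA to its own queue.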

In any case, the approach necessitates some additional clarifications:¶

The receiver may use the steering capabilities of its NIC to map ingress packets to its sub-SAs, e.g., to different queues, to allow for efficient multicore utilization. This is especially important for the multicore scenario, as software redirects to other cores must be avoided for performance reasons. The simplest case is the sub-SA being encoded in the SPI, as many NICs already provide features for matching on SPIs. For the other two distinguishing mechanisms, flexible or raw matchers may be used.¶
The setup and renewal of sub-SAs should happen in bulk, i.e., there is only one exchange to set up the child SA. This leads to reliable performance characteristics, as there is no on-demand sub-SA creation. Furthermore, the failure model is very simple: The child SA with all its sub-SAs exists, or it does not.¶
Only the sequence counters and anti-replay windows would be allocated per sub-SA.¶
All other properties of the SA are per child SA, i.e., traffic selectors, mode, but also the key material. Using the same key for all sub-SAs needs to be done with care to avoid effects on security (details will follow shortly). However, if there were different keys, neither the scalability (bulk setup and rekeying) nor the predictable failure model would be possible.¶

Using a single key for multiple sub-SAs has implications on security:¶

It must be ensured that this approach cannot lead to reused IVs for counter modes. For example, in the case of AES-GCM [RFC4106], this means either the salt must be different for each sub-SA, or the IV space must be partitioned accordingly. Note that partitioning the IV space is not possible with implicit IV modes ([RFC8750]), as [RFC4303] requires sequence numbers to be initialized to zero.¶
Hard limits for packet and byte counters must be scaled accordingly. For example, if no more than 2^64 packets should be transmitted using a given key, and the child SA consists of 2^8 sub-SAs, then every sub-SA must not be allowed to send more than 2^56 packets, in case no fine-grained synchronization is possible. In case transmission happens on the same CPU core, overcommitting may be possible as long as the total number of packets or bytes is ensured to be never exceeded.¶
Rekey limits must apply to all sub-SAs combined. For example, if a child SA is configured to be rekeyed after transmission of X bytes or Y packets, then the rekey must be triggered if the sum of bytes or packets on all sub-SAs reaches X or Y. For situations where overcommitting is not possible, we suggest referencing the sub-SA with the maximum number of bytes/packets already sent, say X'_max and Y'_max. X'_max and Y'_max are multiplied by the number of sub-SAs, and if that value exceeds X or Y, a rekeying is initiated.¶
In case SPIs or an explicit header field are used to encode sub-SAs, it may (theoretically) be possible to send more than 2^64 packets using a single key. This may pose a problem for ciphers such as AES-GCM. In this case, a hard limit of at most 2^64 packets MUST be enforced.¶
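The three points above can be sketched together. The example assumes 2^8 sub-SAs sharing one key, the 4-byte salt plus 8-byte IV nonce layout of AES-GCM in ESP [RFC4106], and an IV space partitioned by placing the sub-SA ID in the top bits of the IV. The partitioning scheme and all names are illustrative assumptions, not a proposed wire format.

```python
import struct

SUB_SA_BITS = 8              # assumption: 2**8 sub-SAs share one key
KEY_PACKET_LIMIT = 2**64     # example limit for packets under a single key


def gcm_nonce(salt, sub_sa_id, counter):
    """12-byte nonce: 4-byte salt, then an 8-byte IV split into sub-SA ID and counter."""
    counter_bits = 64 - SUB_SA_BITS
    if len(salt) != 4:
        raise ValueError("salt must be 4 bytes")
    if not 0 <= sub_sa_id < 1 << SUB_SA_BITS:
        raise ValueError("sub-SA ID out of range")
    if not 0 <= counter < 1 << counter_bits:
        raise OverflowError("per-sub-SA IV space exhausted")
    return salt + struct.pack("!Q", (sub_sa_id << counter_bits) | counter)


def per_sub_sa_limit(total_limit=KEY_PACKET_LIMIT, sub_sa_bits=SUB_SA_BITS):
    """Hard limit per sub-SA when the sub-SAs cannot coordinate their counters."""
    return total_limit >> sub_sa_bits


def rekey_needed(sent_per_sub_sa, limit):
    """Conservative check: the largest counter times the number of sub-SAs exceeds the limit."""
    return max(sent_per_sub_sa) * len(sent_per_sub_sa) > limit


# Two sub-SAs using the same counter value still produce distinct nonces.
print(gcm_nonce(b"salt", 1, 0).hex(), gcm_nonce(b"salt", 2, 0).hex())
print(per_sub_sa_limit() == 2**56)
print(rekey_needed([10, 40, 20, 30], limit=150))
```

The rekey check overestimates the true total (here 160 against an actual sum of 100), which is the price of not having to sum counters across cores.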

Advantages:¶

Independent sequence numbers and anti-replay windows are available.¶
The approach allows for RSS or explicit steering, especially if the SPI-encoding is used.¶
Most scalable approach: The child SA setup requires exchanging, e.g., an SPI range but does not depend on the number of sub-SAs allocated. Similarly, there is only an ID, sequence counters, and an anti-replay window to store per sub-SA. The remainder of state can be shared.¶
There is no rekeying overhead, as just a single Child SA needs to be rekeyed.¶
Predictable performance characteristics due to the batched, proactive establishment.¶
Clean failure model due to the all-or-nothing setup.¶

Disadvantages:¶

There are potential security implications, which must be discussed thoroughly, to avoid weakening security at any point.¶
The change in the data plane may seem to be a somewhat more complex change compared to per-CPU child SAs. Nevertheless, fallback SAs as mentioned in [I-D.pwouters-ipsecme-multi-sa-performance] are avoided.¶

Compared to setting up separate IKE or child SAs, it might be argued that the idea of sub-SAs keeps the complexity and overhead away from the VPN's operation. Furthermore, storing an SPI, a 64-bit sequence number, and a replay window for 64 packets for 64 different QoS classes requires a total of 10,240 bits. This is significantly less than even the lower boundary established for the approach described in Section 4.4. However, of the discussed alternatives, it is the most complex change to existing standards and implementation semantics.¶
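The state estimate can be checked with a short calculation. The comparison value assumes that a single large window needs one bitmap bit per packet and uses the most favorable case from Section 4.4 (1518-byte frames at 10 Gbit/s for 100 ms, with 20 bytes of per-frame Ethernet overhead as an assumption).

```python
SPI_BITS = 32
SEQ_BITS = 64
WINDOW_BITS = 64      # one bit per packet in a 64-packet replay window
SUB_SAS = 64          # one sub-SA per QoS class

sub_sa_state_bits = SUB_SAS * (SPI_BITS + SEQ_BITS + WINDOW_BITS)

# Smallest single window from Section 4.4, assuming one bitmap bit per packet.
single_window_bits = int(10e9 / ((1518 + 20) * 8) * 0.1)

print(sub_sa_state_bits, single_window_bits)
```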

4.6. Using an encryption offset

It would be possible to add an encryption offset to ESP, signalling that a number of bytes at the beginning of the packet are not encrypted. Note that they can still be authenticated, e.g., as Additional Authenticated Data in modern AEAD modes. This approach was chosen for [PSP].¶
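A minimal sketch of the idea, leaving out the actual cipher call: the inner packet is split at the offset, the leading part stays readable and would be supplied to the AEAD as Additional Authenticated Data, and only the remainder would be encrypted. The offset of 4 bytes (exposing just the UDP port numbers), the port values and the function name are assumptions for illustration.

```python
import struct


def split_at_offset(inner, offset):
    """Return (clear, secret): `clear` stays readable and would be passed as AAD."""
    if not 0 <= offset <= len(inner):
        raise ValueError("offset outside packet")
    return inner[:offset], inner[offset:]


# A UDP datagram: source port, destination port, length, checksum, payload.
udp = struct.pack("!HHHH", 40000, 4789, 8 + 5, 0) + b"hello"

# Expose only the two port numbers; length, checksum and payload stay hidden.
clear, secret = split_at_offset(udp, 4)

# An intermediate device can read the ports without holding any key.
src_port, dst_port = struct.unpack("!HH", clear)
print(src_port, dst_port, len(secret))
```

An offset of zero reproduces today's behavior, in which the whole inner packet is encrypted.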

Advantages:¶

Enables SDN use cases.¶
Is optional (zero offset), and can even be applied to just a subset of the packets.¶

Disadvantages:¶

Significant change to IPsec's semantics and security guarantees.¶
Intermediate devices need to implement ESP to parse the header. This is a significant issue for flow matching engines implemented in hardware.¶

4.7. Moving the ESP header

As a more drastic approach, one could insert the ESP header not between network and transport layer headers, but, e.g., between transport layer header and the payload. Alternatively, the transport layer header could be "copied out".¶

Advantages:¶

Transparent for intermediate devices, i.e., no changes to their hardware or software necessary.¶

Disadvantages:¶

Significant change to IPsec's semantics due to the layering violation.¶
The receiver needs to know where the ESP header can be found. This is only simple if all senders use the same logic, otherwise, a complex negotiation is required.¶
Transport layer length and checksum fields must be adapted if they are checked by any device on the path.¶

4.8. Transmitting the sequence number entirely

Transmitting the entire sequence number makes processing easier, and enables receivers to join late in multicast scenarios. In the later versions of [I-D.ponchon-ipsecme-anti-replay-subspaces], it was proposed to transmit a larger portion of the sequence number.¶

Advantages:¶

Relatively minor change.¶

Disadvantages:¶

Uses more bandwidth, although at least for counter-based cipher modes, this can be compensated by using implicit IVs (see [RFC8750]).¶

4.9. Removing the trailer

The trailer could be removed by moving its fields to the header. Contrary to Ethernet, there appears to be no requirement for "cut-through" packet processing.¶

Advantages:¶

Software packet processing benefits from cache locality.¶
Parsing is simpler as there is no variable-length payload in between.¶

Disadvantages:¶

Larger change to the existing packet layout.¶

4.10. Dropping the 4-byte alignment requirement

Dropping the 4-byte alignment probably does not warrant a change of ESP on its own. However, when the ESP frame format is updated for other reasons, it is worth considering, as modern architectures and their SIMD instructions typically require larger alignment. Please note that removing the cryptographic padding (which is not required for all current AEAD modes) would allow even more simplification, but also significantly limit cryptographic agility.¶
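For reference, the padding implied by the current rule can be computed as in the following sketch. It assumes the trailer layout of [RFC4303], in which the payload, the padding and the two one-byte fields Pad Length and Next Header together must end on a 4-byte boundary, and it ignores any additional padding a block cipher might need. The `alignment` parameter and the value 16 are hypothetical and only show how a different boundary would change the result.

```python
def esp_pad_length(payload_len, alignment=4):
    """Padding bytes so that payload + padding + Pad Length + Next Header is aligned."""
    trailer_fields = 2  # Pad Length (1 byte) + Next Header (1 byte)
    return -(payload_len + trailer_fields) % alignment


for length in (5, 6, 1400):
    print(length, esp_pad_length(length), esp_pad_length(length, alignment=16))
```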

Advantages:¶

Relatively minor change.¶

Disadvantages:¶

Enables only minimal simplification of processing on its own.¶
