QUIC I/O Architecture

This document discusses possible implementation options for the I/O architecture internal to the libssl QUIC implementation, the underlying design constraints driving this decision, and the resulting I/O architecture. It also identifies potential hazards to existing applications and how those hazards are mitigated.

Objectives

The requirements which have formed the basis for the QUIC implementation include the following:

  • The application must have the ability to be in control of the event loop without requiring callbacks to process the various events. An application must also have the ability to operate in “blocking” mode.

  • High performance applications (primarily server based) use the existing libssl APIs together with custom network interaction BIOs in order to get the best performance at the network level as well as in their OS interactions (I/O handling, thread handling, use of fibres). Such applications would prefer to keep using the existing APIs rather than throw away what they have; where QUIC necessitates a change, they are willing to make minor changes.

As such, there are several objectives for the I/O architecture of the QUIC implementation:

  • We want to support both blocking and non-blocking semantics for application use of the libssl APIs.

  • In the case of non-blocking applications, it must be possible for an application to do its own polling and make its own event loop.

  • We want to support custom BIOs on the network side and to the extent feasible, minimise the level of adaptation needed for any custom BIOs already in use on the network side. More generally, the integrity of the BIO abstraction layer should be preserved.

QUIC-Related Requirements

Note that implementation of QUIC will require that the underlying network BIO passed to the QUIC implementation be configured to support datagram semantics instead of bytestream semantics as has been the case with traditional TLS over TCP. This will require applications using custom BIOs on the network side to make substantial changes to the implementation of those custom BIOs to model datagram semantics. These changes are not minor, but there is no way around this requirement.

It should also be noted that implementation of QUIC requires handling of timer events as well as the circumstances where a network socket becomes readable or writable. In many cases we need to handle these events simultaneously (e.g. wait until a socket becomes readable, or writable, or a timeout expires, whichever comes first).

Note that the discussion in this document primarily concerns usage of blocking vs. non-blocking I/O in the interface between the QUIC implementation and an underlying BIO provided to the QUIC implementation to provide it access to the network. This is independent of and orthogonal to the application interface to libssl, which will support both blocking and non-blocking I/O.

Blocking vs. Non-Blocking Modes in Underlying Network BIOs

The above constraints make it effectively a requirement that non-blocking I/O be used for the calls to the underlying network BIOs. To illustrate this point, we first consider how QUIC might be implemented using blocking network I/O internally.

To function correctly and provide blocking semantics at the application level, our QUIC implementation must be able to block such that it can respond to any of the following events for the underlying network read and write BIOs immediately:

  • The underlying network write BIO becomes writeable;
  • The underlying network read BIO becomes readable;
  • A timeout expires.

Blocking sockets and select(3)

Firstly, consider how this might be accomplished using the Berkeley sockets API. Blocking on all three wakeup conditions listed above would require use of an API such as select(3) or poll(3), regardless of whether the network socket is configured in blocking mode or not.
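
For illustration, the following is a minimal sketch, using poll(2) directly on a POSIX system rather than any BIO abstraction, of how a single wait covering all three wakeup conditions might look; the timeout_ms value is assumed to have been derived from the QUIC implementation's next timer deadline:

#include <poll.h>

/*
 * Illustrative helper: block until the socket is readable, writable (if
 * requested), or the relative timeout expires, whichever comes first.
 * Returns the poll(2) result: 0 on timeout, >0 if the FD is ready, <0 on error.
 */
static int wait_for_quic_events(int fd, int want_write, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd      = fd;
    pfd.events  = POLLIN | (want_write ? POLLOUT : 0);
    pfd.revents = 0;

    return poll(&pfd, 1, timeout_ms);
}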

While in principle APIs such as select(3) can be used with a socket in blocking mode, this is not an advisable usage mode. If a socket is in blocking mode, calls to send(3) or recv(3) may block for some arbitrary period of time, meaning that our QUIC implementation cannot handle incoming data (if we are blocked on send), send outgoing data (if we are blocked on receive), or handle timeout events.

Though it can be argued that a select(3) call indicating readability or writeability should guarantee that a subsequent send(3) or recv(3) call will not block, there are several reasons why this is an extremely undesirable solution:

  • It is quite likely that there are buggy OSes out there which perform spurious wakeups from select(3).

  • The fact that a socket is writeable does not necessarily mean that a datagram of the size we wish to send can be written without blocking, so a send(3) call could block anyway.

  • This usage pattern precludes multithreaded use barring some locking scheme due to the possibility of other threads racing between the call to select(3) and the subsequent I/O call. This undermines our intentions to support multi-threaded network I/O on the backend.

Moreover, our QUIC implementation does not drive the Berkeley sockets API directly but instead uses the BIO abstraction to access the network, so these issues are compounded by the limitations of our existing BIO interfaces. We do not have a BIO interface which provides select(3)-like functionality or which can implement the required semantics above.

Even if we used select(3) directly, it only guarantees (under a non-buggy OS) that a single syscall will not block; nothing in the API contract of BIO_read(3) or BIO_write(3) guarantees that any given BIO implementation maps such a call onto only a single system call (or onto any system call at all), so this does not work either. Trying to implement QUIC on top of blocking I/O in this way would therefore require violating the BIO abstraction layer, and would not work with custom BIOs (even if the poll descriptor concept discussed below were adopted).

Blocking sockets and threads

Another conceptual possibility is that blocking calls could be kept ongoing in parallel threads. Under this model, there would be three threads:

  • a thread which exists solely to execute blocking calls to the BIO_write of an underlying network BIO,
  • a thread which exists solely to execute blocking calls to the BIO_read of an underlying network BIO,
  • a thread which exists solely to wait for and dispatch timeout events.

This could potentially be reduced to two threads if it is assumed that BIO_write calls do not take an excessive amount of time.

The premise here is that the front-end I/O API (SSL_read, SSL_write, etc.) would coordinate and synchronise with these background worker threads via threading primitives such as condition variables.

This has a large number of disadvantages:

  • There is a hard requirement for threading functionality in order to be able to support blocking semantics at the application level. Applications which require blocking semantics would only be able to function in thread assisted mode. In environments where threading support is not available or desired, our APIs would only be usable in a non-blocking fashion.

  • Several threads are spawned which the application is not in control of. This undermines our general approach of providing the application with control over OpenSSL's use of resources, such as allowing the application to do its own polling or provide its own allocators.

  • At a minimum for a client, there must be two threads per connection. This means if an application opens many outgoing connections, there will need to be 2n extra threads spawned.

  • Blocking in BIO_write calls precludes correct implementation of QUIC. Unlike any analogue in TLS, QUIC packets are time sensitive and intended to be transmitted as soon as they are generated. QUIC packets contain fields such as the ACK Delay value, which is intended to describe the time between a packet being received and a return packet being generated. Correct calculation of this field is necessary for correct calculation of the connection RTT. It is therefore important to only generate packets when they are ready to be sent, otherwise suboptimal performance will result. This usage model aligns naturally with non-blocking I/O and cannot be accommodated by blocking I/O.

  • Since existing custom BIOs will not be expecting concurrent BIO_read and BIO_write calls, they will need to be adapted to support this, which is likely to require substantial rework of those custom BIOs (trivial locking of calls obviously does not work since both of these calls must be able to block on network I/O simultaneously).

Moreover, this does not appear to be a realistically implementable approach:

  • The question is posed of how to handle connection teardown, which does not seem to be solvable. If parallel threads are blocked in blocking BIO_read and BIO_write calls on some underlying network BIO, there needs to be some way to force these calls to return once SSL_free is called and we need to tear down the connection. However, the BIO interface does not provide any way to do this. At best we might assume the BIO is a BIO_s_dgram (but cannot assume this in the general case), but even then we can only accomplish teardown by violating the BIO abstraction and closing the underlying socket.

This is the only portable way to ensure that a recv(3) call to the same socket returns. This obviously is a highly application-visible change (and is likely to be far more disruptive than configuring the socket into non-blocking mode).

Moreover, it is not workable anyway because it only works for a socket-based BIO and violates the BIO abstraction. For BIOs in general, there does not appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications will still need to change to using network BIOs with datagram semantics. For applications using custom BIOs, this is likely to require substantial rework of those BIOs. There is no possible way around this. Thus, even if this solution were adopted (notwithstanding the issues which preclude this noted above) for the purposes of accommodating applications using custom network BIOs in a blocking mode, these applications would still have to completely rework their implementation of those BIOs. In any case, it is expected to be comparatively rare that sophisticated applications implementing their own custom BIOs will do so in a blocking mode.

Use of non-blocking I/O

By comparison, use of non-blocking I/O and select(3) or similar APIs on the network side makes satisfying our requirements for QUIC easy, and also allows our internal approach to I/O to be flexibly adapted in the future as requirements may evolve.

This is also the approach used by all other known QUIC implementations; it is highly unlikely that any QUIC implementations exist which use blocking network I/O, as (as mentioned above) it would lead to suboptimal performance due to the ACK delay issue.

Note that this is orthogonal to whether we provide blocking I/O semantics to the application. We can use non-blocking I/O internally while providing either blocking or non-blocking semantics to the application, based on what the application requests.

This approach in general requires that a network socket be configured in non-blocking mode. Though some OSes support a MSG_DONTWAIT flag which allows a single I/O operation to be made non-blocking, not all OSes support this (e.g. Windows), thus this cannot be relied on. As such, we need to configure any socket FD we use into non-blocking mode.
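
As an illustration, configuring a socket FD into non-blocking mode is a small, portable operation; the sketch below shows the underlying OS calls involved (the real implementation is expected to do this via the BIO layer rather than on raw FDs):

#ifdef _WIN32
# include <winsock2.h>
#else
# include <fcntl.h>
#endif

/* Sketch: put a socket FD into non-blocking mode. Returns 1 on success. */
static int set_nonblocking(int fd)
{
#ifdef _WIN32
    u_long nonblock = 1;

    return ioctlsocket((SOCKET)fd, FIONBIO, &nonblock) == 0;
#else
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return 0;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
#endif
}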

Of the approaches outlined in this document, the use of non-blocking I/O has the fewest disadvantages and is the only approach which appears to actually be implementable in practice. Moreover, most of the disadvantages can be readily mitigated:

  • We rely on having a select(3) or poll(3) like function available from the OS.

    However:

    • Firstly, we already rely on select(3) in our code, at least in non-no-sock builds, so this does not appear to raise any portability issues;

    • Secondly, we have the option of providing a custom poller interface which allows an application to provide its own implementation of a select(3)-like function. In fact, this has the potential to be quite powerful and would allow the application to implement its own pollable BIOs, and therefore perform blocking I/O on top of any custom BIO.

    For example, while historically none of our own memory-based BIOs have supported blocking semantics, a sophisticated application could if it wished choose to implement a custom blocking memory BIO and implement a custom poller which synchronises using a custom poll descriptor based around condition variables rather than sockets. Thus this scheme is highly flexible.

    (It is worth noting also that the implementation of blocking semantics at the application level also does not rely on any privileged access to the internals of the QUIC implementation and an application could if it wished build blocking semantics out of a non-blocking QUIC instance; this is not particularly difficult, though providing custom pollers here would mean there should be no need for an application to do so.)

  • Configuring a socket into non-blocking mode might confuse an application.

    However:

    • Applications will already have to make changes to any network-side BIOs, for example switching from a BIO_s_socket to a BIO_s_dgram, or from a BIO pair to a BIO_s_dgram_pair. Custom BIOs will need to be substantially reworked to switch from bytestream semantics to datagram semantics. Such applications will already need substantial changes, and this is unavoidable.

    Of course, application impacts and migration guidance can (and will) all be documented.

    • In order for an application to be confused by us putting a socket into non-blocking mode, it would need to be trying to use the socket in some way. But it is not possible for an application to pass a socket to our QUIC implementation, and also try to use the socket directly, and have QUIC still work. Using QUIC necessarily requires that an application not also be trying to make use of the same socket.

    • There are some circumstances where an application might want to multiplex other protocols onto the same UDP socket, for example with protocols like RTP/RTCP or STUN; this can be facilitated using the QUIC fixed bit. However, these use cases cannot be supported without explicit assistance from a QUIC implementation and this use case cannot be facilitated by simply sharing a network socket, as incoming datagrams will not be routed correctly. (We may offer some functionality in future to allow this to be coordinated but this is not for MVP.) Thus this also is not a concern. Moreover, it is extremely unlikely that any such applications are using sockets in blocking mode anyway.

  • The poll descriptor interface adds complexity to the BIO interface.

Advantages:

  • An application retains full control of its event loop in non-blocking mode.

    When using libssl in application-level blocking mode via a custom poller interface, the application would actually be able to exercise more control over I/O than it can at present when using libssl in blocking mode.

  • Feasible to implement and already working in tests. Minimises further development needed to ship.

  • Does not rely on creating threads and can support blocking I/O at the application level without relying on thread assisted mode.

  • Does not require an application-provided network-side custom BIO to be reworked to support concurrent calls to it.

  • The poll descriptor interface will allow applications to implement custom modes of polling in the future (e.g. an application could even build blocking application-level I/O on top of a custom memory-based BIO using condition variables, if it wished). This is actually more flexible than the current TLS stack, which cannot be used in blocking mode when used with a memory-based BIO.

  • Allows performance-optimal implementation of QUIC RFC requirements.

  • Ensures our internal I/O architecture remains flexible, allowing future evolution without breaking compatibility.

Use of Internal Non-Blocking I/O

Based on the above evaluation, implementation has been undertaken using non-blocking I/O internally. Applications can use blocking or non-blocking I/O at the libssl API level. Network-level BIOs must operate in a non-blocking mode or be configurable by QUIC to this end.

Support of arbitrary BIOs

We need to support not just socket FDs but arbitrary BIOs as the basis for the use of QUIC. The use of QUIC with e.g. BIO_s_dgram_pair, a bidirectional memory buffer with datagram semantics, is to be supported as part of MVP. This must be reconciled with the desire to support application-managed event loops.

Broadly, the intention so far has been to enable the use of QUIC with an application event loop in application-level non-blocking mode by exposing an appropriate OS-level synchronisation primitive to the application. On *NIX platforms, this essentially means we provide the application with:

  • An FD which should be polled for readability, writability, or both; and
  • A deadline (if any is currently applicable).

Once either of these conditions is met, the QUIC state machine can be (potentially) advanced meaningfully, and the application is expected to reenter the QUIC state machine by calling SSL_tick() (or SSL_read() or SSL_write()).
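
The following sketch shows the shape of such an application-managed event loop iteration; it assumes the application has already obtained the pollable FD and converted the current deadline (if any) into a relative timeout in milliseconds, and the helper name is illustrative only:

#include <poll.h>
#include <openssl/ssl.h>

/*
 * Illustrative event loop iteration: 'fd' is the pollable FD and 'timeout_ms'
 * is the relative timeout derived from the QUIC deadline (-1 if no deadline
 * is currently applicable).
 */
static void event_loop_iteration(SSL *qconn, int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd      = fd;
    pfd.events  = POLLIN;
    pfd.revents = 0;

    /* Wait until the FD becomes readable or the deadline expires. */
    poll(&pfd, 1, timeout_ms);

    /* Reenter the QUIC state machine. */
    SSL_tick(qconn);
}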

This model is readily supported when the read and write BIOs we are provided with are socket BIOs:

  • The read-pollable FD is the FD of the read BIO.
  • The write-pollable FD is the FD of the write BIO.

However, things become more complex when we are dealing with memory-based BIOs such as BIO_s_dgram_pair, which do not naturally correspond to any OS primitive which can be used for synchronisation, or when we are dealing with an application-provided custom BIO.

Pollable and Non-Pollable BIOs

In order to accommodate these various cases, we draw a distinction between pollable and non-pollable BIOs.

  • A pollable BIO is a BIO which can provide some kind of OS-level synchronisation primitive, which can be used to determine when the BIO might be able to do useful work once more.

  • A non-pollable BIO has no naturally associated OS-level synchronisation primitive, but its state only changes in response to calls made to it (or to a related BIO, such as the other end of a pair).

Supporting Pollable BIOs

“OS-level synchronisation primitive” is deliberately vague. Most modern OSes use unified handle spaces (UNIX, Windows), though there are likely more obscure APIs on these platforms which have other handle spaces. However, this unification is not necessarily significant in practice.

For example, Windows sockets are kernel handles and thus, like any other kernel object, can be passed to the generic Win32 WaitForSingleObject() API, but not in a useful manner; the generic readiness mechanism for Windows handles is not plumbed in for socket handles, so sockets are simply never considered ready for the purposes of this API and such a wait will never return. Instead, the WinSock-specific select() call must be used. On the other hand, other kinds of synchronisation primitive, such as a Win32 Event, must use WaitForSingleObject().

Thus, while in theory most modern operating systems have unified handle spaces, in practice there are substantial usage differences between different handle types. As such, an API to expose a synchronisation primitive should be of a tagged union design, supporting possible variation.

A BIO object will provide methods to retrieve a pollable OS-level synchronisation primitive which can be used to determine when the QUIC state machine can (potentially) do more work. This maintains the integrity of the BIO abstraction layer. Equivalent SSL object API calls which forward to the equivalent calls of the underlying network BIO will also be provided.

The core mechanic is as follows:

#define BIO_POLL_DESCRIPTOR_TYPE_NONE        0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD     1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START     8192

#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM       4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void        *ptr;
            uint64_t    u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

int BIO_get_rpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);

int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);

Currently only a single descriptor type is defined, which is a FD on *NIX and a Winsock socket handle on Windows. These use the same type to minimise code changes needed on different platforms in the common case of an OS network socket. (Use of an int here is strictly incorrect for Windows; however, this style of usage is prevalent in the OpenSSL codebase, so for consistency we continue the pattern here.)

Poll descriptor types at or above BIO_POLL_DESCRIPTOR_CUSTOM_START are reserved for application-defined use. The value.custom field of the BIO_POLL_DESCRIPTOR structure is provided for applications to store values of their choice in. An application is free to define the semantics.

libssl will not know how to poll custom poll descriptors itself, thus these are only useful when the application will provide a custom poller function, which performs polling on behalf of libssl and which implements support for those custom poll descriptors.
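
Purely by way of illustration (none of the names below are part of any proposed API), an application-defined BIO might fill in a descriptor such as the following, which the application's custom poller would then know how to interpret:

/* Hypothetical application-defined poll descriptor type. */
#define APP_POLL_DESCRIPTOR_TYPE_CV  (BIO_POLL_DESCRIPTOR_CUSTOM_START + 0)

/*
 * Sketch: fill in a poll descriptor referring to an application-defined
 * synchronisation object (for example a structure containing a mutex and a
 * condition variable). The application's custom poller is responsible for
 * knowing how to wait on it.
 */
static void app_fill_poll_descriptor(BIO_POLL_DESCRIPTOR *desc, void *sync_obj)
{
    desc->type                = APP_POLL_DESCRIPTOR_TYPE_CV;
    desc->value.custom[0].ptr = sync_obj;
}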

For BIO_s_ssl, the BIO_get_[rw]poll_descriptor functions are equivalent to the SSL_get_[rw]poll_descriptor functions. The SSL_get_[rw]poll_descriptor functions are equivalent to calling BIO_get_[rw]poll_descriptor on the underlying BIOs provided to the SSL object. For a socket BIO, this will likely just yield the socket's FD. For memory-based BIOs, see below.
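
For example, an application using an ordinary OS socket underneath might retrieve and poll the read descriptor along these lines (sketch only; error handling abbreviated):

#include <poll.h>
#include <openssl/bio.h>
#include <openssl/ssl.h>

/*
 * Sketch: obtain the read poll descriptor for a QUIC SSL object and, if it is
 * a socket FD, wait for it to become readable or for the timeout to expire.
 */
static int wait_for_readable(SSL *qconn, int timeout_ms)
{
    BIO_POLL_DESCRIPTOR desc;
    struct pollfd pfd;

    if (!SSL_get_rpoll_descriptor(qconn, &desc)
            || desc.type != BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD)
        return 0; /* non-pollable BIO, or descriptor retrieval unsupported */

    pfd.fd      = desc.value.fd;
    pfd.events  = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms) > 0;
}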

Supporting Non-Pollable BIOs

Where we are provided with a non-pollable BIO, we cannot provide the application with any primitive used for synchronisation and it is assumed that the application will handle its own network I/O, for example via a BIO_s_dgram_pair.

When libssl calls BIO_get_[rw]poll_descriptor on the underlying BIO, the call fails, indicating that a non-pollable BIO is being used. Thus, if an application calls SSL_get_[rw]poll_descriptor, that call also fails.

There are various circumstances which need to be handled:

  • The QUIC implementation wants to write data to the network but is currently unable to (e.g. BIO_s_dgram_pair is full).

    This is not hard, as our internal TX record layer allows arbitrary buffering. The only limit comes when QUIC flow control (which only applies to application stream data) imposes a limit; in that case, calls to e.g. SSL_write must fail with SSL_ERROR_WANT_WRITE.

  • The QUIC implementation wants to read data from the network but is currently unable to (e.g. BIO_s_dgram_pair is empty).

    Here, calls such as SSL_read need to fail with SSL_ERROR_WANT_READ; in this way we support libssl's classic non-blocking I/O interface.
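
From the application's perspective this is the familiar libssl non-blocking pattern; the sketch below illustrates it, with the datagram-shuffling helpers being hypothetical, application-specific functions:

#include <openssl/ssl.h>

/* Hypothetical application hooks for moving datagrams in and out of a
 * BIO_s_dgram_pair (or other non-pollable BIO); implementations are
 * application-specific. */
extern void feed_incoming_datagrams(void);
extern void drain_outgoing_datagrams(void);
extern void handle_data(const unsigned char *buf, size_t len);

/* Sketch of the classic non-blocking read pattern applied to a QUIC SSL object. */
static int read_loop(SSL *qconn)
{
    unsigned char buf[4096];
    size_t nread;

    for (;;) {
        if (SSL_read_ex(qconn, buf, sizeof(buf), &nread) > 0) {
            handle_data(buf, nread);
            continue;
        }

        switch (SSL_get_error(qconn, 0)) {
        case SSL_ERROR_WANT_READ:
            feed_incoming_datagrams();   /* inject datagrams, then retry */
            break;
        case SSL_ERROR_WANT_WRITE:
            drain_outgoing_datagrams();  /* drain datagrams, then retry */
            break;
        default:
            return 0;                    /* fatal error or connection closed */
        }
    }
}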

It is worth noting that theoretically a memory-based BIO could be implemented which is pollable, for example using condition variables. An application could implement a custom BIO, custom poll descriptor and custom poller to facilitate this.
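
A sketch of what such an application-defined synchronisation object might look like, purely for illustration (none of these names are part of any proposed API):

#include <pthread.h>

/*
 * Hypothetical synchronisation object an application could associate with a
 * custom memory-based BIO via a custom poll descriptor. The BIO would signal
 * the condition variable whenever its state changes; the application's custom
 * poller waits on it instead of calling select(3).
 */
typedef struct app_bio_sync_st {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             readable, writable;
} APP_BIO_SYNC;

/* Called by the custom poller: wait until the BIO reports readiness. */
static void app_bio_sync_wait(APP_BIO_SYNC *s)
{
    pthread_mutex_lock(&s->lock);
    while (!s->readable && !s->writable)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
}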

Configuration of Blocking vs. Non-Blocking Mode

Traditionally, an SSL object has operated in either blocking or non-blocking mode without requiring explicit configuration: if a socket returns EWOULDBLOCK or similar it is handled appropriately, and if a socket call blocks, there is no issue. Since the QUIC implementation is built on non-blocking I/O, this implicit configuration of non-blocking mode is not feasible.

Note that Windows does not have an API for determining whether a socket is in blocking mode, so it is not possible to use the initial state of an underlying socket to determine if the application wants to use non-blocking I/O or not. Moreover this would undermine the BIO abstraction.

As such, an explicit call is introduced to configure an SSL (QUIC) object into non-blocking mode:

int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);

Applications desiring non-blocking operation will need to call this API to configure a new QUIC connection accordingly. Blocking mode is chosen as the default for parity with traditional Berkeley sockets APIs and to make things simpler for blocking applications, which are likely to be seeking a simpler solution. However, blocking mode cannot be supported with a non-pollable BIO, and thus blocking mode defaults to off when used with such a BIO.
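
For example, an application wishing to manage its own event loop would opt out of the blocking default as follows (sketch; error handling abbreviated):

#include <openssl/ssl.h>

/* Sketch: configure a newly created QUIC connection for non-blocking use. */
static SSL *new_nonblocking_quic_conn(SSL_CTX *ctx)
{
    SSL *qconn = SSL_new(ctx);  /* ctx assumed to use a QUIC client method */

    if (qconn == NULL)
        return NULL;

    /* Opt out of the blocking default; the application will poll for itself. */
    SSL_set_blocking_mode(qconn, 0);
    return qconn;
}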

A method is also needed for the QUIC implementation to inform an underlying BIO that it must not block. The SSL object will call this function when it is provided with an underlying BIO. For a socket BIO this can set the socket as non-blocking; for a memory-based BIO it is a no-op; for BIO_s_ssl it is equivalent to a call to SSL_set_blocking_mode().

Internal Polling

When blocking mode is configured, the QUIC implementation will call BIO_get_[rw]poll_descriptor on the underlying BIOs and use a suitable OS function (e.g. select()) or, if configured, custom poller function, to block. This will be implemented by an internal function which can accept up to two poll descriptors (one for the read BIO, one for the write BIO), which might be identical.
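
A rough sketch of what such an internal helper might look like on a POSIX platform, assuming both descriptors are socket FDs (the real implementation must also dispatch to any configured custom poller and handle other descriptor types):

#include <poll.h>

/*
 * Sketch of an internal blocking helper: wait until the read descriptor is
 * readable, the write descriptor is writable, or the timeout expires. Both
 * descriptors are assumed to be of type BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD and
 * may refer to the same FD.
 */
static int block_until_ready(const BIO_POLL_DESCRIPTOR *rd,
                             const BIO_POLL_DESCRIPTOR *wr,
                             int timeout_ms)
{
    struct pollfd pfd[2];
    nfds_t n = 0;

    if (rd != NULL) {
        pfd[n].fd      = rd->value.fd;
        pfd[n].events  = POLLIN;
        pfd[n].revents = 0;
        ++n;
    }
    if (wr != NULL) {
        pfd[n].fd      = wr->value.fd;
        pfd[n].events  = POLLOUT;
        pfd[n].revents = 0;
        ++n;
    }

    return poll(pfd, n, timeout_ms);
}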

Blocking mode cannot be used with a non-pollable underlying BIO. If BIO_get_[rw]poll_descriptor is not implemented for either of the underlying read or write BIOs, blocking mode cannot be enabled and defaults to off.