A High-Performance I/O Subsystem with QoS

For this proposal, there are three important I/O subsystem components, each of which must be appropriately tuned to support QoS and maximize efficiency:

  1. The system interconnect (network and network devices)

  2. Network protocol processing

  3. Disk and filesystem I/O

We have identified a set of common principles and optimizations that are essential for both network and filesystem I/O to maximize efficiency and provide QoS guarantees. These optimizations include:

    Binding an appropriately-prioritized periodic task to each data flow to allow QoS guarantees for protocol and filesystem processing.

    Using a zero-copy buffer management system that works generally (with at least both the filesystem and network protocol stack) to eliminate unnecessary data-copying overheads.

    Processing I/O events according to some specification, rather than in response to system activities, to minimize interrupts and interferences between applications.





  1. The system interconnect: APIC


    Figure 1: APIC Architecture

    A high-level schematic of our proposed system architecture is shown in Figure 1. At the heart of this architecture is a daisy-chained interconnect containing a number of ATM Port Interconnect Controller (APIC) chips. Individual chips interface either directly to the main system bus, or through a dual-ported memory to an I/O device. One of the I/O chips on the interconnect is connected to the external ATM network through a link interface. ATM cells arriving from the network pass from one APIC to another until they reach an APIC chip that is directly connected to the destination device (perhaps, the system bus). Similarly, data originating from a source device is passed as an ATM cell stream from the APIC directly connected to that device down the APIC interconnect towards the link interface, and finally out to the network. It is also possible for two local devices to communicate over the APIC interconnect, without cells ever having to leave the desk area. Thus, for example, a video cell stream originating at a camera could traverse the interconnect to reach the local display device. The APIC that interfaces directly to the system bus can be used for traffic originating from, or destined for, the system's main memory. All other traffic movement happens only over the APIC interconnect, sparing the main system bus this burden. In our prototype implementation, each APIC will interface to another APIC over two unidirectional 1.2 Gb/s links.

    To improve throughput and reduce latency for applications running on the host CPU, the APIC includes support for several innovative techniques, many of which have been developed locally by the APIC design team. Some of these are described below.

    Protected I/O is a mechanism that is used to allow user-level protocol libraries programmatic I/O access to memory-mapped registers on the APIC, with specific access permissions to individual registers specified by the OS kernel. Protected DMA is a mechanism that allows user-space protocol implementations to exchange data buffers directly with the APIC without kernel involvement. Together, these mechanisms allow data transfers from user space directly to the APIC without system calls or any other kernel intervention (see Figure 3) while respecting the security and protection mechanisms of the operating system. This significantly improves throughput and latency characteristics for end applications.

  2. Pool DMA with Packet Splitting is a mechanism that enables construction of zero-copy kernel-resident protocol stacks, using page remapping to avoid data copying.

    Interrupt Demultiplexing is a technique used by the APIC to significantly reduce the frequency of interrupts to the host processor. This technique exploits the connection-oriented nature of ATM to differentiate interrupts associated with distinct flows, and to mask out interrupts corresponding to flows that have already been queued for service.


  3. Network protocol processing

    Protocol implementations in conventional operating systems (such as UNIX or Windows NT) suffer from several drawbacks that prevent them from performing periodic protocol processing and data delivery in real-time. At the communication subsystem layer, incoming packets are processed in FIFO order. Thus, high priority packets can be delayed by an immediately preceding burst of low priority packets. Furthermore, process scheduling is optimized for time-sharing rather than real-time operation. So variations in the priority of an application process and unpredictable background system load make it impossible to provide deterministic data delivery.

    We have developed protocol implementations based on real-time upcalls to provide periodic protocol processing and data delivery for constant-bandwidth applications. The RTU mechanism exploits the iterative nature of protocol processing in these applications to deliver greater efficiency than alternative mechanisms (such as real-time threads).



    Figure 2: Multithreaded Protocol Processing with QoS

    Figure 2 shows how the RTU mechanism can be used to do periodic processing for network I/O. Incoming packets are demultiplexed directly to the destination process's buffers and other packet-header buffers. Moving data directly into an application's address space or shared space eliminates data copying, and the early demultiplexing ensures that there is no priority inversion in packet processing for different applications. Using the RTU mechanism, the protocol data in these buffers is processed periodically in real-time. By choosing appropriate parameters for its RTUs, an application can receive its own quality of service, independent of other applications. Note that these parameters should be set by the QoS-mapping component of the ORB's Object Adapter. Eventually, all protocol and ORB middleware processing will be done in a vertically-integrated fashion by a single RTU running in the Object Adapter. Thus, inefficiencies resulting from layered implementations are eliminated. Processed data is then consumed by the application, again using the periodic model provided by the RTU mechanism.

    The periodic RTU model described above is good for applications that have periodic data communication needs, but not for applications that need transaction processing with very low latencies. To support such low-latency applications, we have introduced reactive RTUs. As its name implies, a reactive RTU is made runnable in response to some system event, rather than according to a specified period. There are two types of reactive RTUs: low-delay RTUs, all of which get priority over all periodic RTUs, and best-effort RTUs, all of which have lower priority than any periodic RTU. Low-delay RTUs would presumably be more useful in this setting, as they would usually provide lower, and more predictable, latencies. Reactive RTUs of either type are associated with a priority within their class.

    A reactive RTU is typically associated with a connection, and is made runnable by the network interface driver in response to packet arrivals on that connection. The advantage to using reactive RTUs for incoming traffic, rather than the traditional responsive mechanisms, is that the RTUs may be prioritized. Our prototype implementation of the TCP/IP stack in NetBSD has demonstrated that our approach allows high performance at very low runtime costs, and can support QoS for both high-bandwidth and low-latency applications.



    Figure 3: Multiprocessor Scheduling

    In multi-threaded OS kernels (such as Solaris or Windows NT) on multiprocessor platforms, there are several other important issues that must be addressed. For example, the global state of a protocol or protocol stack must be kept consistent, though that state is shared among multiple concurrent threads. But since the global state is usually updated only at connection setup or teardown time, simple locking of shared data is an acceptable solution. For multiprocessors, each protocol processing thread has to be assigned to a processor, and a thread may have to migrate across processors during the lifetime of the connection. In our proposed system, network connections (periodic and aperiodic) requiring QoS guarantees will be assigned to processors using the first-fit or next-fit heuristics, and scheduled using a rate-monotonic algorithm. We believe binding a connection to a particular processor and avoiding frequent migration of connections to be the best solution because:

    Connections with periodic data exhibit a very high instruction cache affinity, and

    Aperiodic streams (e.g. low-delay messages) exhibit both data cache and instruction cache affinity for the lifetime of the connection. In rare cases (e.g., where processor loads are greatly imbalanced), we will consider migrating a connection across processors.



  4. Disk and filesystem I/O

    High-performance storage servers that provide remote access to stored multimedia and other information over high-speed networks will be important components of the emerging computing and communication infrastructure. To support large numbers of concurrent accesses with QoS guarantees, such servers must support efficient periodic zero-copy data transfers from the storage system to the network interface subsystem. However, the conventional UNIX- and Windows NT-based storage servers do not support such low-overhead data transfers, and are plagued with performance bottlenecks.



    Figure 4: Networked File I/O in existing OSs

    To minimize (slow) disk I/O, conventional operating systems perform all disk reads and writes in large (typically, 8192-byte) blocks, and cache the blocks in a sophisticated buffer cache. For traditional text and binary file accesses, the buffer cache has been very effective. Multimedia data such as audio, video, and animations, however, do not possess any caching properties: first, they have a ravenous appetite for storage space and second, any piece of data is relevant only for very short time. Retrieving such data through a buffer cache, therefore, does not provide any performance benefits.

    Another major source of inefficiency for applications that do data transfer from a disk file to the network is excessive data copying. Typically, in response to a read system call from the application, the data is copied from the disk into the buffer cache, and from there into the application buffer. The application then performs a write system call, which copies data again into the small buffers (perhaps mbufs) used by network protocol stack. The data may even get copied one last time into the network interface's on-board memory before it is sent over the network. Each unnecessary data copy wastes precious system bus and memory bandwidth. Clearly, the use of these two different buffer systems, which were designed with different objectives, leads to the excessive copying and system-call overheads.

    Current UNIX systems are optimized for interactive computing, and do not provide any mechanisms for applications to request periodic, deadline-constrained data transfers from storage devices or to network interfaces.


    We have been prototyping a new data path within the file system of the NetBSD operating system that can provide very efficient data transfer between the disk and network, and can support QoS provisions via priorities. An important component of this data path is a new buffer system called multimedia mbufs (mmbufs). Mmbufs provide a zero-copy data path between the disk and network. Buffers in this system can be dynamically allocated and chained, much like traditional mbufs. But an mmbuf simultaneously looks like a (cluster) mbuf that network modules can process, and a traditional filesystem buffer that disk drivers handle.


    Figure 5: Modified filesystem I/O with QoS support

    As shown in Figure 5, we retain the old buffer-cache-based data path for applications that use the well-established read and write interfaces. For those applications requiring QoS guarantees, however, we provide new system calls with different semantics to leverage the power of mmbufs. The stream_read call completely bypasses the old buffer cache, instead reading data from a file directly into an mmbuf chain. An application can then send that data to the network interface using the stream_send call, which transforms the mmbuf chain into a cluster mbuf chain (without any data copies!) and passes it to the protocol processing code.

    To provide QoS guarantees for data transfers between the disk and network, applications use RTUs to schedule disk and network I/O. We have also modified the SCSI disk driver to queue requests from the stream_read calls into a real-time request queue (RTQ), while ordinary read calls go into a non-real-time queue (NRTQ). The driver always services RTQ first, and maintains a separate queue for each application or stream (early demultiplexing) so that a (prioritized) RTU can do subsequent processing without any priority inversion or interference from other streams. Because TCP does congestion control, which leads to a variable data transfer rate to the network, a flow control mechanism is needed between the network and file I/O subsystems. This could be implemented by dynamically changing the RTU parameters in response to, for example, TCP window size.

    Our early experiments suggest that this proposed approach can support multiple applications doing networked disk I/O at high data rates, each with QoS guarantees.

    Our objective is to generalize these network and file I/O subsystems and demonstrate them in NetBSD, Solaris, Windows NT, and other operating systems running on uni- and multi-processor hardware platforms. We also hope to integrate the I/O subsystem with the OO middleware and global resource allocation system.