For this proposal, there are three important I/O subsystem components, each of which must be appropriately tuned to support QoS and maximize efficiency:
We have identified a set of common principles
and optimizations that are essential for both network and filesystem I/O
to maximize efficiency and provide QoS guarantees. These optimizations include:
Binding an
appropriately-prioritized periodic task to each data flow to allow QoS
guarantees for protocol and filesystem processing.
Using a zero-copy buffer
management system that works generally (with at least both the
filesystem and network protocol stack) to eliminate unnecessary
data-copying overheads.
Processing I/O events
according to some specification, rather than in response to system
activities, to minimize interrupts and interferences between
applications.
- The system interconnect: APIC

Figure 1: APIC Architecture
A high-level schematic of our proposed system architecture is shown in
Figure 1. At the heart of this architecture is a daisy-chained
interconnect containing a number of ATM
Port Interconnect Controller (APIC) chips. Individual chips
interface either directly to the main system bus, or through a
dual-ported memory to an I/O device. One of the I/O chips on the
interconnect is connected to the external ATM network through a link
interface. ATM cells arriving from the network pass from one APIC to
another until they reach an APIC chip that is directly connected to
the destination device (perhaps, the system bus). Similarly, data
originating from a source device is passed as an ATM cell stream from
the APIC directly connected to that device down the APIC interconnect
towards the link interface, and finally out to the network. It is
also possible for two local devices to communicate over the APIC
interconnect, without cells ever having to leave the desk area. Thus,
for example, a video cell stream originating at a camera could
traverse the interconnect to reach the local display device. The APIC
that interfaces directly to the system bus can be used for traffic
originating from, or destined for, the system's main memory. All
other traffic movement happens only over the APIC interconnect,
sparing the main system bus this burden. In our prototype
implementation, each APIC will interface to another APIC over two
unidirectional 1.2 Gb/s links.
To improve throughput and reduce latency for applications running on
the host CPU, the APIC includes support for several innovative techniques,
many of which have been developed locally by the APIC design team. Some
of these are described below.
Protected
I/O is a mechanism that is used to allow user-level protocol libraries programmatic
I/O access to memory-mapped registers on the APIC, with specific access
permissions to
individual registers specified by the OS kernel. Protected DMA is
a mechanism that allows user-space
protocol implementations to exchange data buffers directly with the APIC without
kernel involvement. Together, these mechanisms allow data transfers
from user space directly to the APIC without system calls or any other kernel
intervention (see
Figure 3) while respecting the security and protection mechanisms
of the operating system.
This significantly improves throughput and latency characteristics for
end applications.
Pool
DMA with Packet Splitting is a mechanism that enables construction
of zero-copy kernel-resident
protocol stacks, using page remapping to avoid data copying.
Interrupt
Demultiplexing is a technique used by the APIC to significantly reduce
the frequency of interrupts
to the host processor. This technique exploits the connection-oriented
nature of ATM to differentiate
interrupts associated with distinct flows, and to mask out interrupts corresponding
to flows that have
already been queued for service.
- Network protocol processing
Protocol implementations in conventional operating systems (such
as UNIX or Windows NT) suffer from several drawbacks that prevent them
from performing periodic protocol processing and data delivery in
real-time. At the communication subsystem layer, incoming packets are
processed in FIFO order. Thus, high priority packets can be delayed
by an immediately preceding burst of low priority packets.
Furthermore, process scheduling is optimized for time-sharing rather
than real-time operation. So variations in the priority of an
application process and unpredictable background system load make it
impossible to provide deterministic data delivery.
We have developed protocol implementations based on real-time upcalls to
provide periodic protocol processing and data delivery for constant-bandwidth
applications. The RTU mechanism exploits the iterative nature
of protocol processing in these applications to deliver greater efficiency
than alternative mechanisms (such as real-time threads).

Figure
2: Multithreaded Protocol Processing with QoS
Figure 2
shows how the RTU mechanism can be used to do periodic processing for network
I/O. Incoming packets are demultiplexed directly to the destination
process's buffers and other packet-header buffers. Moving data directly
into an application's address space or shared space eliminates
data copying, and the early demultiplexing ensures that there is
no priority inversion in packet processing for different applications.
Using the RTU mechanism, the protocol data in these buffers is processed
periodically in real-time. By choosing appropriate parameters for its RTUs,
an application can receive its own quality of service, independent of other
applications. Note that these parameters should be set by the QoS-mapping component
of the ORB's Object Adapter. Eventually, all protocol and
ORB middleware processing will be done in a vertically-integrated fashion by a single RTU running
in the Object Adapter. Thus, inefficiencies resulting from layered
implementations are eliminated. Processed data is then consumed by the application, again
using the periodic model provided by the RTU mechanism.
The periodic RTU model described above is good for applications that
have periodic data communication needs, but not for applications that
need transaction processing with very low latencies. To support such
low-latency applications, we have introduced reactive RTUs.
As its name implies, a reactive RTU is made runnable in response to
some system event, rather than according to a specified period. There
are two types of reactive RTUs: low-delay RTUs, all of which get
priority over all periodic RTUs, and best-effort RTUs, all of which
have lower priority than any periodic RTU. Low-delay RTUs would
presumably be more useful in this setting, as they would usually
provide lower, and more predictable, latencies. Reactive RTUs of
either type are associated with a priority within their class.
A reactive RTU is typically associated with a connection, and is made
runnable by the network interface driver in response to packet
arrivals on that connection. The advantage to using reactive RTUs
for incoming traffic, rather than the traditional responsive
mechanisms, is that the RTUs may be prioritized. Our prototype
implementation of the TCP/IP stack in NetBSD has demonstrated that our
approach allows high performance at very low runtime costs, and can
support QoS for both high-bandwidth and low-latency applications.

Figure 3: Multiprocessor Scheduling
In multi-threaded OS kernels (such as Solaris or Windows NT) on
multiprocessor platforms, there are several other important issues
that must be addressed. For example, the global state of a protocol
or protocol stack must be kept consistent, though that state is shared
among multiple concurrent threads. But since the global state is
usually updated only at connection setup or teardown time, simple
locking of shared data is an acceptable solution. For
multiprocessors, each protocol processing thread has to be assigned to
a processor, and a thread may have to migrate across processors during
the lifetime of the connection. In our proposed system, network
connections (periodic and aperiodic) requiring QoS guarantees will be
assigned to processors using the first-fit or next-fit
heuristics, and scheduled using a rate-monotonic algorithm. We
believe binding a connection to a particular processor and avoiding
frequent migration of connections to be the best solution because:
Connections
with periodic data exhibit a very high instruction cache affinity, and
Aperiodic
streams (e.g. low-delay messages) exhibit both data cache and instruction
cache affinity for the lifetime of
the connection. In rare cases (e.g., where processor loads are greatly
imbalanced), we will consider migrating
a connection across processors.
- Disk and filesystem I/O
High-performance storage servers that provide remote access to
stored multimedia and other information over high-speed networks will be
important components of the emerging computing and communication infrastructure.
To support large numbers of concurrent accesses with QoS guarantees,
such servers must support efficient periodic zero-copy data transfers from
the storage system to the network interface subsystem. However, the conventional
UNIX- and Windows NT-based storage servers do not support such low-overhead data transfers,
and are plagued with performance bottlenecks.

Figure 4: Networked File I/O in existing OSs
To minimize (slow) disk
I/O, conventional operating systems perform all disk reads and writes
in large (typically, 8192-byte) blocks, and cache the blocks in a
sophisticated buffer cache. For traditional text and binary file
accesses, the buffer cache has been very effective. Multimedia data
such as audio, video, and animations, however, do not possess any
caching properties: first, they have a ravenous appetite for storage
space and second, any piece of data is relevant only for very short
time. Retrieving such data through a buffer cache, therefore, does
not provide any performance benefits.
Another major source of
inefficiency for applications that do data transfer from a disk file
to the network is excessive data copying. Typically, in response to a
read system call from the application, the data is copied from
the disk into the buffer cache, and from there into the application
buffer. The application then performs a write system call,
which copies data again into the small buffers (perhaps mbufs)
used by network protocol stack. The data may even get copied one last
time into the network interface's on-board memory before it is sent
over the network. Each unnecessary data copy wastes precious system
bus and memory bandwidth. Clearly, the use of these two different
buffer systems, which were designed with different objectives, leads
to the excessive copying and system-call overheads.
Current UNIX systems are
optimized for interactive computing, and do not provide any mechanisms
for applications to request periodic, deadline-constrained data
transfers from storage devices or to network interfaces.
We have been prototyping a new data path within the file system of the
NetBSD operating system that can provide very efficient data transfer
between the disk and network, and can support QoS provisions via
priorities. An important component of this data path is a new buffer
system called
multimedia mbufs (mmbufs). Mmbufs provide a zero-copy data
path between the disk and network. Buffers in this system can be
dynamically allocated and chained, much like traditional mbufs. But
an mmbuf simultaneously looks like a (cluster) mbuf that
network modules can process, and a traditional filesystem buffer that
disk drivers handle.

Figure 5: Modified filesystem I/O with QoS support
As shown in Figure 5, we retain the old buffer-cache-based data path
for applications that use the well-established read and
write interfaces. For those applications requiring QoS
guarantees, however, we provide new system calls with different
semantics to leverage the power of mmbufs. The stream_read
call completely bypasses the old buffer cache, instead reading data
from a file directly into an mmbuf chain. An application can
then send that data to the network interface using the
stream_send call, which transforms the mmbuf chain into
a cluster mbuf chain (without any data copies!) and passes it
to the protocol processing code.
To provide QoS guarantees for data transfers between the disk and network,
applications use RTUs to schedule disk and network I/O.
We have also modified the SCSI disk driver to queue requests from the stream_read
calls into a real-time request queue (RTQ), while ordinary read
calls go into a non-real-time queue (NRTQ). The driver always services
RTQ first, and maintains a separate queue for each application or stream
(early demultiplexing) so that a (prioritized) RTU can do subsequent
processing without any priority inversion or interference from other streams.
Because TCP does congestion control, which leads to a variable data transfer
rate to the network, a flow control mechanism is needed between the network
and file I/O subsystems. This could be implemented by dynamically changing
the RTU parameters in response to, for example, TCP window size.
Our early experiments suggest that this proposed approach can support
multiple applications doing networked disk I/O at high data rates,
each with QoS guarantees.
Our objective is to generalize these network and file I/O subsystems and
demonstrate them in NetBSD, Solaris, Windows NT, and other operating systems
running on uni- and multi-processor hardware platforms. We also hope to integrate
the I/O subsystem with the OO middleware and global resource allocation
system.