Technologies for Dynamically Extensible Networks (techX)


1. Introduction

For much of the last ten years, performance has been a central concern for networking researchers. In the early 1990s, the performance of network routers lagged far behind the capabilities of the underlying technology and the needs of emerging multimedia applications. Through a combination of advances in technology and design innovations at all levels, we are now at a point where performance is no longer such an over-riding concern. Industry can now produce routers with 10 Gb/s links and aggregate capacities of more than 1 Tb/s. These performance levels are sufficient to support ubiquitous and inexpensive network access at speeds of 100 Mb/s, and we can expect cost-performance to improve by another factor of 50 to 100 over the next decade.

The key challenge for networking researchers in the years ahead will be to find ways to exploit continuing technology advances to make networks more flexible, more functional and more robust, with the objective of enhancing the productivity of network users and operators. The purpose of this project is to prototype, demonstrate and evaluate key components for a new generation of extensible data networks, designed to enable rapid deployment of advanced network services. The project is organized around four central themes.

The project addresses a major challenge facing networking researchers. Extensibile network technologies can enable dynamic service creation, paving the way for novel network services that will enhance the productivity of network users and operators. The successful development of these technologies can have a profound impact on the nation’s cyber-infrastructure. The project is structured to enable broad participation by researchers in networking and distributed computing. These aspects of the project will help ensure a broad impact, both within the research community and in the society, at large.

2. Extensible Network Services Platforms

Modern routers support 10 Gb/s links and multi-terabit capacities. To enable advanced network services, routers must become flexible data processing engines, not just bit transporters. We use the term Network Services Platform (NSP) to describe a system that combines communication and processing capabilities within a single, flexible architecture. We are developing a prototype NSP to demonstrate and evaluate key architectural features and technological capabilities. The system will support link speeds of 1 to 10 Gb/s and will be highly modular, allowing configuration of systems with capacities from 10 Gb/s to nearly 1 Tb/s. While the target performance range does not stretch the state-of-the-art, in terms of raw communication performance, the system will offer levels of flexibility and programmability not found in conventional routers. For each 10 Gb/s of link capacity, it will support up to 16 Extended Services Engines (XSE), each comprising an embedded RISC processor and a block of configurable logic resources (hardware), capable of very high performance processing. The XSEs will support tightly coupled hardware/software plugins, in which software on an embedded processor interacts closely with application-specific hardware. All the major system components will be implemented using advanced FPGAs, making it possible to support a variety of protocols and packet formats and to explore a wide range of architectural options, using the same physical platform.

2.1. Overall System Architecture

The overall organization of the system is shown in Figure 1. Each port has two major components. The Packet Processor (PP) performs route lookup, packet classification and queue management. The PP will have the logic and memory resources to support advanced packet scheduling mechanisms, including per-flow weighted fair queueing for up to 100,000 flows. The PP will also implement supporting functions for the switch fabric, including packet resequencing, distributed flow control and multicast relaying.

Extended Services Module (XSM) provides the processing resources needed for advanced network services. The PP directs packets to the XSM, based on results of the packet classification operation. Packets are distributed to one of several Extended Services Engines ( XSE) within the XSM. We expect to do packet distribution on a per flow basis (all packets in a flow are sent to the same XSE), but also plan to study packet-by-packet distribution and flow-based distribution with adaptive rebalancing.

The ports are linked by a robust, high performance switch fabric. The Switch Elements ( SE) communicate with each other and the PPs, using serial links with a data rate of 2.5 Gb/s, and each has about 500 KB of internal storage for queueing. The fabric is organized as a set of parallel planes, each of which uses a two level folded topology in which SEs are connected by bidirectional links. The use of parallel switch planes can simplify packaging and provides a high degree of fault tolerance, with a relatively small amount of redundant hardware. The folded topology is well-suited to systems that use serial links because it allows adjacent SEs to exchange flow control information by piggy-backing this information with packets going in the “opposite” direction. The fabric supports a novel multicast switching algorithm with strong nonblocking properties.

2.2.Core Packet Processing

(PP), shown in Figure 2, is divided into six major parts. The Link Classification and Route Lookup (LCRL) component interfaces with the line card and with the XSM. It classifies packets and does lookups for packets received from the external link. Several forms of lookup will be supported. Route lookup will be done using a fast, best matching prefix algorithm. We have identified and are evaluating enhancements to the tree bitmap algorithm [TA02b] to improve performance on IPv6 addresses. The results of these studies will guide the choice of route lookup algorithm. We are considering novel multi-level hash tables for implementing exact match flow filters. In this approach, flows that hash to buckets with “too many” entries spill over to a second hash table, using a different hash function. Extension to more than two levels will be considered based on performance studies. We also plan to support more general packet filters, for performing flow aggregation functions. We are considering both brute force approaches such as ternary CAMs and more sophisticated algorithms, which use commodity memory [PR02]. We also plan to implement MPLS lookup.

The Link Queue Manager (LQM) maintains queues for packets going to the outgoing link or to the XSM. It supports over 100,000 queues and weighted fair scheduling mechanisms. We have developed an implementation technique for virtual-time based scheduling algorithms which requires just constant time to enqueue or dequeue a packet. This is accomplished using an approximate radix-sorting technique, which trades off jitter in the packet scheduling time for a significant reduction in complexity. This technique can be applied directly to packet scheduling algorithms such as Self-Clocked Fair Queueing [GO94] and can be used to implement a variant of Worst-Case Fair, Weighted Fair Queueing [BE97]. We are currently evaluating alternatives to develop a better understanding of the trade-off between queueing performance and implementation cost.

Link Packet Storage Manager (LPSM) manages the storage of packets going to the outgoing link or the XSM, in external DRAM. There are several things that make the design of the LPSM challenging. First, it must have sufficient memory bandwidth to accommodate data coming from both the XSM and the switch. While on-chip storage can be used to handle short peak periods, the overall memory bandwidth required exceeds 50 Gb/s. This requires the use of multiple memory modules, creating the potential for “hot-spot contention” where accesses to individual memory modules become unbalanced for extended periods of time. Simple memory interleaving is not an effective solution to this problem, since efficient use of DRAM bandwidth requires block accesses of at least eight words at a time and many packets are of comparable length. We plan to explore both sophisticated techniques using on-chip caches that offer worst-case performance guarantees [IY01] and randomized distribution strategies, which offer only probabilistic guarantees but are much simpler to implement.

Switch Classification and Route Lookup (SCRL) is similar to the LCRL, but processes packets received from the switch. The SCRL has several unique functions. First, it distributes traffic across the parallel switch planes, to balance the load. Second, it resequences packets received from the switch fabric, since packets may experience different delays in the fabric. Third, it processes multicast packets, which are handled using a two pass copy-then-route algorithm. This is described in detail in subsection 2.4. The Switch Queue Manager (SQM) maintains queues for packets going to the switch. It implements a set of virtual output queues for each output port. The rates at which packets are sent from these queues are dynamically adjusted by a distributed scheduling algorithm, discussed in subsection 3.3. The Switch Packet Storage Manager (SPSM) manages the storage of packets going to the switch, in external DRAM.

2.3. Hosting Software and Hardware Plugins

The Extended Services Module (XSM) provides the hardware resources needed to implement advanced services. As shown in Figure 3, the XSM has two interfaces to allow multiple XSMs to be linked together in a linear array, so that the processing capacity can be adjusted to meet individual system requirements. The bandwidth of these interfaces is matched to the external link bandwidth. The Control and Routing Module (CRM) handles IO and performs dynamic reconfiguration of the two Extended Services Processors (XSP) which are used to implement advanced services.

We plan to implement the XSPs using advanced FPGAs, such as the Xilinx Virtex Pro family of components. These chips support up to four embedded PowerPC processors, 110K flip flops, 110K programmable four input logic blocks, 1 MB of on-chip SRAM and 24 multi-gigabit serial links, capable of transferring data at 2.5 Gb/s. Part of the configurable logic on each XSP will implement a fixed infrastructure, including a high speed ring, interfaces to off-chip SRAM and SDRAM, plus a controller. The remaining logic is reserved for hardware plugin slots [TA02a], regions of reconfigurable logic that can host dynamically configured hardware plugins. Each processor and plugin slot can send and receive packets over an associated Ring Interface (RI). The ring was chosen for its simplicity and its ability to support asymmetric bandwidth use. A processor and its adjacent plugin slot can combine to form a single hybrid plugin.

The CRM can configure individual plugin slots of an XSP, while data is flowing through the XSP. While a given slot is being reconfigured, the XSP controller puts its RI in bypass mode and isolates it from its associated processor and external memory. The configuration of a single slot can be done in less than 1 ms. The CRM has a dedicated SDRAM that can store the configurations for many hardware plugins. This allows a remote Control Processor to pre-load plugins and have them transferred into plugin slots on-demand. Note that appropriately designed plugins can process packets for multiple flows, using either the on-chip or off-chip memory to store any per-flow state that needs to be maintained between packets. The architecture described above uses fixed plugin slots, each with a fixed allocation of resources. We also plan to explore approaches that would allow more flexible assignment of hardware resources to plugins.

The switch fabric used to connect the ports uses parallel planes and a folded topology within each plane. This approach was chosen to take best advantage of advanced multi-gigabit serial link technology. Most switch fabrics transfer data in fixed length cells, even though the external links send and receive variable length packets. While this means that the port hardware must segment packets into cells for transfer through the fabric, the use of fixed length cells simplifies the fabric hardware sufficiently to make the trade-off worthwhile. However, the use of fixed length cells results in a significant performance penalty in the presence of worst-case packet lengths. We propose the use of flexible cells, which vary in length from 64 to 128 bytes to minimize this performance penalty, while retaining most of the implementation advantages of fixed length cells.

The switch fabric is constructed using buffered Switch Elements (SE), as shown in Figure 4. Received cells are held in a central Cell Store (CS) with an expected capacity 4096 cells. The Queue Manager (QM) supports a collection of linked list queues, with multiple queues for each output of the switch fabric, to support multiple classes of service and to allow cells to flow freely to uncongested outputs even when other outputs are congested. In a system with an aggregate capacity of 1 Tb/s, eight separate service classes can be supported using 800 distinct queues. Inter-stage flow control will be provided on a per-queue basis to prevent traffic going to congested outputs from consuming all the cell storage resources.

The system supports robust multicast communication, using a novel two-pass multicast algorithm. In recent years, switch designers have recognized that switching systems need to provide nonblocking performance, even in the presence of traffic conditions that cause some output links to be overloaded for extended periods of time. The use of virtual output queues and sophisticated scheduling mechanisms [KR99, CH99b] have made it possible for crossbar-based switches to match the performance of ideal output-queued switches under arbitrary traffic conditions. Similar methods have recently been developed for multistage switching fabrics [PA03a]. Unfortunately, no one has yet developed practical multicast scheduling methods with similar guarantees. The method proposed here offers stronger guarantees than previous approaches, although it still falls short of ideal performance under certain extreme conditions.

The proposed method processes multicast cells in two passes. In the first pass, a multicast cell with fanout f is sent to f consecutive outputs of the switch fabric. To balance the traffic, successive cells from each input are sent to successive sets of outputs. This ensures that each output receives approximately the same amount of first-pass multicast traffic. When a PP receives a copy of a multicast cell, it performs a lookup to determine the output that copy should be forwarded to. It then places the cell in a virtual output queue for that output. In the second pass, multicast cells are handled just like unicast cells.

The two pass processing means that the switch fabric needs some extra bandwidth to handle multicast traffic. Since multicast traffic is not expected to be a large fraction of the traffic in typical systems, a relatively small amount of extra bandwidth is sufficient. We propose to engineer the prototype with 20% extra bandwidth in the fabric, allowing any traffic configuration in which the total share of the output link bandwidth used by multicast sessions is no more than 20%. Of course, a robust system must handle situations where the arriving multicast traffic exceeds the engineered multicast capacity. While some delay of the multicast traffic is unavoidable under these conditions, the system must ensure that temporary surges in multicast traffic do not interfere with unicast traffic. We plan to extend the techniques developed in [PA03a] to periodically distribute multicast traffic information among the PPs to allow them to regulate the flow of first-pass multicast traffic into the switch fabric, so that the total amount of first-pass multicast traffic flowing through the fabric does not exceed the system bandwidth available to accommodate it. During periods when the total multicast traffic (from all sessions) exceeds the available capacity, multicast packets will accumulate in the queues at the PPs where they first arrive. Under these conditions, even multicast cells destined for uncongested outputs will experience delays. Note that this is the only condition that causes multicast traffic to experience less than ideal performance. Since multicast traffic is treated just like unicast traffic in the second pass, it receives the same performance guarantees that unicast traffic receives, so long as the total multicast traffic in the system does not exceed the engineered system capacity.

2.5. Prototype Implementation Issues

We plan to make extensive use of FPGAs in the experimental NSP. While the performance of FPGAs has historically lagged well behind that of Application Specific Integrated Circuits (ASIC), we feel that FPGAs are the right choice for all parts of the system, not just the XSPs. There are several reasons for this. First, the gap between FPGAs and ASICs has shrunk in recent years, as large fractions of both types of devices are being devoted to high level system components like SRAM blocks and processors, which are equally efficient in both technologies. Second, while there remains a significant performance gap, the absolute performance level of FPGAs has risen to a point where they can be used effectively, even for major system components. Third, in an experimental system, the ability to reprogram FPGAs to implement a wide variety of architectural alternatives is a huge advantage.

The system design has been structured to enable the configuration of modular systems across a wide range of sizes. The smallest system would have a throughput of 10 Gb/s and would consist of a Line Card containing transmission interface components (gigabit Ethernet, 10G Ethernet or OC-192 packet-over-SONET), a Packet Processor Card (PPC) containing a single PP and one or two Extended Services Cards (XSC) each containing one XSM. These cards would all be about 8” by 11” in size. A 40 Gb/s system can be implemented as a single rack-mount Network Services Unit (NSU), approximately 12” in height. An NSU (see Figure 5) contains four line cards, four PPCs, 4-8 XSCs and one Level 1 Switch Card (SC1) containing six SEs. Each PP connects to the SC1 with 12 multi-gigabit per second serial links, each with a bandwidth of 2.5 Gb/s. This provides a raw bandwidth of 30 Gb/s for each 10 Gb/s external link. This speedup provides a factor of 1.25 to account for the overhead of internal cell headers and the effects of fragmentation, a factor of 1.2 for multicast traffic and a factor of 2 to allow for control of congestion while maintaining work-conserving operation under extreme traffic conditions [PA03]. In larger system configurations, each of these SEs forms part of the first level of a two level folded network. Larger systems are configured by connecting multiple NSUs to a separate unit with up to eight Level 2 Switch Cards (SC2), each containing six SEs. Systems with up to 120 Gb/s throughput can be implemented with three NSUs and one SC2. 240 Gb/s systems can be implemented with six NSUs and two SC2s, 480 Gb/s systems with twelve NSUs and four SC2s and 960 Gb/s systems with 24 NSUs and eight SC2s. We propose to prototype a 240 Gb/s system to demonstrate the scalability of the architecture.

3. Robust Network Processing

Data networks have become central components of the national cyber infrastructure. As networks have become more pervasive, they have become subject to a wider range of operating conditions, including conditions caused by individuals actively seeking to disrupt network services. Data networks have traditionally been designed with certain expectations regarding user behavior that are no longer tenable. As networks are used for more mission-critical applications, and are exposed to an ever growing array of challenges, there is a pressing need for a new approach to network design that can meet the highest expectations of users for robust, reliable network services, even under worst-case conditions. We argue that continuing advances in technology are making such a worst-case approach feasible. By combining the latest technology with new system design innovations, it will be possible to build systems that operate reliably and consistently, even under the most extreme operating conditions.

3.1. Packet Scheduling using Approximate Radix Sorting

It has long been recognized that per flow queues in routers can provide traffic isolation and limit the ability of individual users to consume an unfair share of resources. Per flow queues have not been widely implemented in higher performance systems, due to concerns about cost and scalability. Technology advances have largely eliminated the cost concern, since the per flow resources required for each queue are negligible (about ten bytes of SRAM for head and tail pointers and queue length). However, scalability remains an issue, since the most effective packet scheduling algorithms require O(log n) time to enqueue and dequeue a packet, where n is the number of flows [GO94,BE97]. In high speed systems, with tens of thousands of flows, algorithms that take more than constant time are not practical. Algorithms that use approximate radix sorting to implement fair queueing can provide an effective solution to this problem.

Figure 6 shows a data structure that can be used for scheduling a large number of flows and is well-suited to efficient implementation in hardware. The data structure has a set of timing wheels and an output list. Each timing wheel has an ordered set of buckets and each bucket contains a list of queues. A queue is scheduled by placing it in a bucket in one of the timing wheels. The selection of the timing wheel and bucket is based on the queue’s deadline, that is, the time by which its next packet should be sent. When the output list feeding the link is empty, it is refreshed from the non-empty bucket that has the earliest deadline. Queues are serviced from the output list as fast as the link allows and are then rescheduled; that is, they are re-entered into some bucket in some timing wheel.

Each timing wheel w has an associated time increment t(w) and a range r(w), which is simply t(w) times the number of buckets. The first wheel has the smallest increment, and the increments and ranges get successively larger in subsequent timing wheels. When a queue is to be scheduled, we first check to see if the time until its deadline is within the range of the first timing wheel. If so, we place the queue in the bucket that corresponds to the deadline. If the deadline lies outside the range of the first timing wheel, we try successive timing wheels, inserting the queue in the first timing wheel with a large enough range.

The time when a queue is actually served by this mechanism can differ from the ideal target time for the queue. This is inherent in any packet scheduler, since several queues may have coincident times when their next packet should be sent. The coarse granularity of the time increments in the timing wheels also contributes to deviations in the transmission time of a packet. This effect can be limited by selecting the time increment for each wheel to be a small fraction of the range of the previous wheel. This allows one to bound the relative error in the scheduling of the next packet. When the data structure is used to implement virtual-time schedulers, the errors between individual packets do not accumulate, so the only impact is to add a controlled amount of jitter to the flow.

The reason for using multiple timing wheels with different time increments, rather than a single timing wheel, is to increase the range of rates that can be handled, without requiring excessive amounts of memory. An example configuration might have three timing wheels with 1024 buckets each, and time increments of 100 ns, 5 ms and 250 ms. This results in an inter-packet scheduling error of no more than 5% and allows rates that range from less than 10 Kb/s to over 100 Gb/s. A single timing wheel with a similar range would require more than 2.5 million buckets.

To implement work-conserving packet schedulers, we need a way to quickly determine the next non-empty bucket. This can be done using a set of fast forward bits for each timing wheel. These bits are set for non-empty buckets and can be quickly scanned in hardware to find the next non-empty bucket. For a timing wheel with a large number of buckets, the brute-force implementation of this scheme can be expensive. The cost can be greatly by reduced by dividing the bit vector into shorter words and then creating a summary vector, in which each bit is the logical-OR of the bits in the corresponding word. By consulting the summary vector, we can determine the first word that has a bit set, then use that word to determine the first non-empty bucket.

The timing wheel data structure can be directly applied to virtual-time based schedulers like Self Clocked Fair Queueing (SCFQ) [GO94]. More sophisticated algorithms like like Worst-Case Fair, Weighted Fair Queueing (WF2Q+) [BE97] can be implemented using two copies of the data structure, one containing queues that are currently eligible to transmit a packet and the other containing queues that are not eligible. The use of dual data structures creates the possibility that a queue selected for transmission may not be the one specified by the WF2Q+ algorithm, since it's possible for many queues to become eligible at the same time. If we continue to transmit packets on the link while queues are being transferred from one data structure to the other, some scheduling errors can occur. We propose to evaluate the impact of such scheduling errors and also to evaluate the practical differences among various packet scheduling algorithms under a wide range of realistic operating conditions.

3.2. Adaptive Resequencing

Large multistage switch fabrics with buffered switch elements typically route individual cells independently of one another, in order to balance the traffic as evenly as possible across the entire system. This creates the possibility of cells being delivered out of order, making it necessary to implement resequencing at the output ports. There are two fundamental approaches to resequencing. For unicast traffic, sequence numbers can be used, with each input labeling cells to each output with consecutive sequence numbers. Outputs then maintain separate buffers for cells from different inputs and use the sequence numbers to ensure that the cells are forwarded in the proper order. Sequence numbers are difficult to generalize to multicast traffic and impose a significant cost in systems with large numbers of inputs and outputs.

Time-based resequencing [HE92] offers a less complex alternative to sequence numbers, which is intrinsically robust (unaffected by lost cells) and which applies equally well to both multicast and unicast traffic. In time-based resequencing, cells are time-stamped when they enter the fabric and placed in a resequencing buffer at the output ports. Cells are forwarded from the resequencing buffer in timestamp order, after being held long enough to ensure (with high probability) that there are no cells still present in the fabric with smaller timestamps. Typically, cells are held in the resequencing buffer until the difference between the current time and their timestamps exceeds some pre-defined age threshold, which is chosen to be larger than the largest delay that cells are expected to experience when passing through the fabric. The drawback of this approach is that in systems where the switch fabric delay can vary widely, the age threshold must be chosen to be relatively large, leading to a large minimum cell delay. If the fabric is to be robust to widely varying traffic conditions, this is clearly unacceptable.

We propose an alternative approach to time-based resequencing in which the age threshold is dynamically adjusted as the delay through the fabric changes. In this approach, the age threshold at an output is set equal to the maximum delay experienced by cells that have arrived at the output in the recent past, plus an increment, called the delay spread. If contemporaneous cells arriving at the output experience delays that differ by no more than the delay spread, they will be resequenced correctly and will experience a combined delay in the fabric and resequencer that is bounded by the maximum delay experienced by cells that reached the output in the same time period, plus the delay spread. This means that cells going to lightly loaded outputs experience relatively small delays, while cells going to heavily loaded outputs experience delays that are slightly larger than the intrinsic delay caused by the switch fabric. Preliminary studies of a simple adaptive resequencer, which monitors delays in two recent measurement windows and uses a constant delay offset, show that adaptive resequencing can deliver excellent performance across widely varying conditions. We propose to make a more systematic study of adaptive resequencers, to better characterize their performance and to explore alternative designs. The best alternative will be implemented in the experimental NSP.

3.3. Distributed Queueing

High capacity routers must be scalable to hundreds or even thousands of ports. This poses special challenges in networks, which strive to handle even unreasonable traffic situations, such as sustained long-term overloads at individual outputs, while still delivering on all per flow performance guarantees. As a practical matter, terabit router capacities can only be obtained using multistage interconnection networks with a small speed advantage relative to the external links; that is, internal data paths operate at speeds that are faster than the external links by a small constant factor (typically 2). This creates a potential problem in the presence of overloaded outputs, since in this situation the fabric can become congested with packets attempting to reach the overloaded output. To limit the effects of this type of congestion we have devised distributed queueing algorithms that regulate the flow of packets to each output.

Distributed queueing requires the use of Virtual Output Queueing [AO93, MA96, MI96] at each input. That is, each input maintains separate queues for each output. In a typical system based on virtual output queueing, there will be one, or perhaps a few queues going to each output at each input. The queues are implemented as linked lists, so the only per queue overhead is the overhead for the queues’ head and tail pointers. To enable per flow performance guarantees, we need to go further than this. Specifically, each input must have a set of per flow queues for each output it is sending information to. It might seem that this implies an n times increase in the number of queues needed at the inputs, relative to the outputs, but this is not the case, since the total number of input side queues is clearly equal to the total number of output side queues. Assuming that flows are no less evenly distributed over input links than they are over output links, the number of queues per input port will match the number per output.

The regulation of the traffic flow from inputs to outputs is done using a distributed queueing algorithm. Inputs periodically broadcast information about their backlog status to each output, and outputs periodically send back information that the inputs use to set the rates at which they send to the outputs. The information broadcast by the inputs cannot be based on individual flows, as this would drastically limit scalability. The challenge is to distribute just summary information about the backlog status, while still allowing individual flow guarantees to be met. To further improve scalability, the switch fabric can participate in the algorithm, reducing the flow of data to each port, so that each port receives only the information that is really needs to know. Specifically, each output port is only sent information about its own backlogs, while each input port is only sent information about the rates it must use when sending to the outputs. To accomplish this, the interconnection network repackages the information and sends it to the inputs and outputs in this repackaged form. Using this approach, a system designed to support 1000 links at 10 Gb/s each, can broadcast information every 100 microseconds and consume only about 1% of the interconnection network bandwidth for data transmissions connected with distributed queueing. More detailed discussion of distributed queueing appears in [PA03a].

4. Extending Network Functionality

Extensible networks allow new functionality to be added incrementally. The ability to insert hardware and software plugins into flexible Network Services Platforms is extremely powerful. We propose to develop software and hardware mechanism to make network extension a first class network service that can be exploited as routinely as other services, such as datagram forwarding. First, however, we describe some representative services that might be implemented using network extensions. We plan to implement instances of such services to demonstrate the power and flexibility of network extension.

4.1. Representative Advanced Network Services

Near Video on Demand. Several researchers have proposed “near video on demand” services, which transmit frames of a video program repeatedly over a multicast session [HU97, JA02]. By transmitting frames that occur early in the program with relatively high frequency, and using storage at the receiver to save frames for later playback, it is possible to provide a service that appears like video-on-demand to the user, but with greatly reduced resource requirements at the source and in the network. Unfortunately, these techniques also increase the bandwidth at the receiver by a small constant factor (typically between 2 and 5, with the choice determining the startup delay). Since the bandwidth of access links is typically fairly constrained, this can severely limit the applicability of the technique. We propose to demonstrate a variant of this service in which video storage is provided by the last hop router, which terminates the multicast distribution tree and provides unicast streams to all local receivers. The plugin will provide VCR like controls and control the streaming of video frames from an attached storage subsystem.

Distributed Denial of Service Attack Mitigation. DDoS attacks have become a major problem for commercial information/service providers as well as governments and the military. A wide variety of strategies have been developed to detect and mitigate DDoS attacks [FE03, FU01, PA01]. Detection techniques include monitoring for well-known “attack signatures,” checking plausibility of source addresses, monitoring statistical characteristics of packet header values (e.g. entropy) and detecting attacks through effects on targeted systems. Mitigation techniques include filtering or rate-limiting suspect traffic. Both detection and mitigation can be done most effectively within the network, preferably close to the source of an attack. Successful mitigation requires the ability to respond to ever-changing attack strategies, making extensible network components an essential part of an effective defense. We plan to evaluate DDoS defenses based on source address plausibility checking and filtering.

Lightweight Flow Setup. Because of the inherently unregulated nature of datagram traffic, it is not possible to make any strong performance guarantees in a network, based on the datagram service alone. The increasingly critical nature of networked applications makes it important for applications to have a way of obtaining such guarantees. The conventional way to obtain performance guarantees in networks is to use a signaling protocol to reserve capacity from end-to-end for particular application data flows. Conventional signaling protocols have met with little success in the Internet to date, in part because of the perceived complexities and performance penalties associated with their use. More lightweight methods have also been proposed, both in the ATM [HJ98, BI98] and IP contexts [PA99]. We propose to implement a Lightweight Flow Setup (LFS) protocol [KU03], which is similar to Yessir [PA99] in its overall character and objectives, but which is simple enough to allow the core reservation mechanisms to be implemented by the router hardware. This makes it possible to perform flow setup at wire speeds, allowing reservation bandwidths to be adjusted dynamically on relatively short time scales. LFS is designed so that receivers need not participate in the protocol in order to experience its benefits. This means, for example, that an information provider (e.g. a web site) could use LFS to deliver real time video to a receiver, providing the benefits of a reserved data channel without the need to first deploy the protocol to all potential receivers.

Reliable Multicast. Multicast communication is most often used for real-time applications where reliable transmission is less important than timely delivery. However, in some situations, reliable multicast can be very useful. Many approaches have been proposed for reliable multicast, with most emphasizing end-to-end operation, in which the network plays no direct role [FL95]. The ability to insert plugins into network routers makes it possible to implement more effective reliable multicast protocols, while still allowing applications to select the protocol that best suits their needs [PA03b]. In certain applications (e.g. satellite networks), it may be desirable for routers to actually transmit packets reliably over each link using local retransmission, to avoid the delays associated with end-to-end retransmission. In other contexts, it may be appropriate for end systems to handle retransmission, but for the network to support ack suppression and targeted retransmission [TU97].

Reserved Delivery Subnetworks. Reserved Delivery Subnetworks (RDS) [QI02] provide an alternative approach to providing consistent performance to network users. Information service providers can use an RDS to set aside sufficient network capacity to ensure that their users receive consistent performance. A typical RDS branches out from one or a few sites to a larger number of locations where the provider has a significant user population. The endpoints of the RDS are typically not individual hosts, but routers within the network. Traffic from the provider is routed within the RDS, to the endpoint nearest its destination and from there, using the normal datagram service. Routing within the RDS is based on filters installed at the routers in the RDS. Traffic from users to the service provider can also be routed within the RDS. Specifically, routers outside the RDS can be configured to route traffic to the provider through the nearest router in the RDS. The routing of upstream traffic through the RDS makes it possible for the network to deliver traffic to the service provider in a well-regulated fashion. In particular, the upstream forwarding of requests can be paced, so as not to exceed the processing capacity of the service provider. Separate queues can be maintained for the different segments of the subnetwork, to prevent excess traffic from one location from blocking requests from other locations. This can provide some protection from denial of service attacks, originating from a limited number of locations.

4.2. Making Network Extension a First-Class Capability

Network extensibility represents a fundamental departure from conventional approaches to networking. There are many different approaches one can take to the creation of extensible networks. Active Networking [TE97], and particularly the capsule approach, in which packets are treated as programs to be executed, rather than as data to be forwarded, is undoubtedly the best known. We argue that other approaches to network extension are likely to be both more powerful and easier to deploy and use.

One straightforward way to deploy network extensions is through administrative mechanisms that apply plugins to packets matching specified packet filters. This has the advantage that it does not require any interaction with the hosts participating in the application session, making it relatively easy to deploy. It can be usefully applied in enterprise networks with a unified network management and in more limited ways, in public networks. It does have certain drawbacks. Applications that require per flow state may not work reliably in situations where network routes can change. Also, because it requires that the network determine user requirements through implicit means, rather than explicit directions, it can be difficult to apply to applications for which the required behavior can vary widely from session to session. A more flexible and general network extension capability can be provided using an explicit session configuration protocol and a network-resident database that maintains application specifications, which provide the network with the essential information needed to map application sessions onto available resources, including configuring and initializing plugins at selected NSPs [CH01, ST02].

4.2.1. Session Specification and Configuration

Because the best configuration of an individual application session can vary, depending on both user requirements and network resource availability, application specifications must describe essential characteristics of sessions, while leaving flexibility for users to express their own preferences and requirements and for the network to map sessions onto available resources. When an application session is initiated by an end-system, application-specific parameters can be specified, including typically, the initial set of end-systems participating in the session. Applications may allow certain parameters to be changed during the session and may allow the set of participants to change dynamically.

To implement a new network service, a network application programmer provides a set of plugins in the appropriate “executable” form to the network control software, which maintains copies throughout the network. The programmer specifies how the application can be implemented using these plugins (and perhaps other, previously defined plugins). We propose to develop a simple application specification language that allows an application programmer to express how application sessions can be configured. The language will allow application sessions to be expressed in terms of flexible patterns that can be mapped, by the network control software, onto the available resources, at the time an individual session is being configured. For unicast sessions, a pattern will specify the sequence of processing steps that must be performed for the application, the plugins that implement those processing steps and the processing resources (or possibly hardware resources) needed for the plugin. The pattern will also specify the communications channels used to pass data between consecutive plugins and their bandwidth requirements. An application may have multiple patterns for use in different situations. For a one-to-many multicast application, the pattern will specify the sequence of processing steps and communication channels on any path from the sender to a receiver. For a many-to-many multicast application, the pattern will specify the processing steps and communication channels on any path from a core of the multicast to any participant. The application specification language will support plugin interface declarations that specify the communications channels used by plugins. It will allow the application programmer to specify how session parameters are used to determine arguments to individual plugins and communication channels. For example, session parameters might be used to specify the resolution and frame rate of a video stream, which in turn will determine the bandwidth requirements of the channels and the processing capacity required by plugins that process the video.

While these ideas for defining application sessions are still in a preliminary stage, they suggest some interesting possibilities. A systematic way of providing the network with knowledge of applications makes it possible for the network to configure applications for users, without users needing to understand what goes on behind the scenes. This makes it far easier for users to take advantage of advanced network services, making it more likely that they will do so.

4.2.2. Efficient Session Configuration

Application sessions in extensible networks can involve a variety of steps, each with its own processing requirements and possibly conditions on where or how it is to be performed. Extensible networks will require mechanisms for determining how the available network resources can be used most effectively to satisfy the requirements of such sessions. Consider, for example, a session used to communicate confidential information between two sites. To maintain the privacy of the communicated information, encryption plugins are to be configured at both ends of the communication path. It doesn’t matter exactly where the encryption takes place, so long as it is within the local networks at each end. The network must provide mechanisms to map the session onto the resources available, while respecting the constraints on the allowed locations for the encryption plugins. As another example, consider a multicast video broadcast in which compression is to be used to reduce the cost of long distance transmission. It may be advantageous to decompress the video before it reaches its ultimate destination, so that the cost of decompression can be shared over multiple receivers. The best place for the decompression to occur will depend on the relative cost of transmission bandwidth vs. decompression processing. Ideally, the network’s control mechanisms will insert decompression plugins into the multicast distribution tree at the locations that yield the lowest cost.

The above examples motivate our definition of the session configuration problem for extensible networks. First, we represent the network as a directed graph, in which vertices represent NSPs and edges represent communication links. Edges and vertices both have capacities and costs. In the case of edges, the capacity represents the available link bandwidth, and in the case of vertices, the capacity represents the available processing capacity of an NSP. Costs are the cost of using some standard amount of a link’s bandwidth or an NSP’s processing capacity. A session is also represented by a directed graph, in which vertices correspond to communication endpoints or intermediate processing steps, and edges correspond to information flows. Edges and vertices in the session graph may have requirements representing the bandwidth needed for a given information flow or the amount of processing that a given processing step entails. Vertices in the session graph may have associated placement constraints, which control how the session is mapped onto the network graph.  A feasible solution to the session configuration problem is an embedding of the session graph onto the network graph. An embedding maps session graph vertices onto network graph vertices, and maps session graph edges onto paths in the network graph. The vertex mapping must satisfy all vertex capacity constraints and placement constraints, and the edge mapping must satisfy the capacity requirements of the session graph edges. An optimal solution is a feasible solution of minimum cost.

While the general session configuration problem is NP-hard, the important special case of sessions whose graphs are paths does have an efficient solution, as shown in [CH03]. The methods developed in [CH03] are based on a "layered graph" representation of the network that converts it into a more conventional path problem. The layered graph technique can also be applied to multicast applications. Although, finding the best multicast route remains computationally intractable, standard heuristics for finding efficient, if suboptimal, routes can be applied in the layered graph context, making configuration of multicast sessions with intermediate processing no more difficult than configuring multicast sessions in conventional networks.

5. Open Network Laboratory

Washington University has recently received support to create an Open Network Laboratory (ONL) for use by the networking research community, to enable experimental evaluation of advanced networking concepts in a realistic experimental environment. The laboratory is being built around a set of open-source, gigabit routers that have been developed at Washington University, and which will be made available to remote users through a Remote Laboratory Interface (RLI) to be developed through this program. The interface will allow users to extend, modify or replace the software running in the routers’ embedded processors and to similarly modify parts of the routers’ hardware, implemented using Field Programmable Gate Arrays. The RLI will also allow users to configure the testbed network, run applications and monitor those running applications using the built-in data gathering mechanisms that the routers provide. Support for data visualization and real-time remote display will be provided, allowing users to develop the insights needed to understand the behavior of new capabilities within a realistic experimental environment. The ONL builds on our previous experience of providing Gigabit Network Kits to research users [KI98]. The RLI will automate most of the mundane tasks associated with setting up and running an experiment, making it much easier for research users to get results, and minimizing the learning curve normally required for using complex experimental equipment.

We plan to integrate the extensible NSP to be developed in this project into the ONL, making the new capabilities of this advanced platform available for use by researchers throughout the country and around the world. The new NSP will provide a much richer set of resources and capabilities than the routers in the initial ONL configuration. Specifically, the current routers have much more limited embedded processing resources, lack the ability to host hardware plugins and have a fixed switch fabric, which cannot be reconfigured to explore alternative architectures. The new NSP will also support higher speed links (10 Gb/s) and a range of different system configurations.

References

[AD98] Adiseshu, H., G. Parulkar and R. Yavatkar. "A State Management Protocol for IntServ, DiffServ and Label Switching," in Proceedings of IEEE ICNP, 1998.
[BE97] Bennett, Jon and Hui Zhang. Hierarchical Packet Fair Queueing Algorithms, IEEE/ACM Transactions on Networking, 1997.
[CH97] Chaney, Tom, Andy Fingerhut, Margaret Flucke and Jonathan Turner. "Design of a Gigabit ATM Switch," Proceedings of Infocom, 4/97.
[CH99a] Choi, Sumi, Dan Decasper, John Dehart, Ralph Keller, John Lockwood, Jonathan Turner, Tilman Wolf. "Design of a Flexible Open Platform for High Performance Active Networks," In Proceedings of the Allerton Conference, 9/99.
[CH99b] Chuang, Shang-Tse, Ashish Goel, Nick McKeown and Balaji Prabhakar. "Matching Output Queueing with a Combined Input and Output Queued Switch," Proceedings of Infocom, 1999.
[CH01] Chandra, Prashant, Yang-Hua Chu, Allan Fisher, Jun Gao, Corey Kosak, T.S. Eugene Ng, Peter Steenkiste, Eduardo Takahashi, and Hui Zhang. "Darwin: Customizable Resource Management for Value-Added Network Services," IEEE Network, 1/01.
[CH02] Choi, Sumi, John Dehart, Ralph Keller, Fred Kuhns, John Lockwood, Prashanth Pappu, Jyoti Parwatikar W. David Richard, Ed Spitznagel, David Taylor, Jonathan Turner and Ken Wong. "Design of a High Performance Dynamically Extensible Router,'' Proceedings of the DARPA Active Networks Conference and Exposition, 5/02.
[CH03] Choi, S., Turner, J., Wolf, T., "Configuring Sessions in Programmable Networks," Computer Networks, Vol. 41, No. 2, pp. 269-284, 2/03.
[DE97] Decasper, Dan, Marcel Waldvogel, Zubin Dittia, Hari Adiseshu, Guru Parulkar and Bernard Plattner. "Crossbow - A Toolkit for Integrated Services over Cell Switched IPv6." Proceedings of the IEEE ATM Workshop, 5/97.
[DE98] Decasper, Dan and Bernhard Plattner. "DAN - Distributed Code Caching for Active Networks," In Proceedings of Infocom, 4/98.
[DE99] Decasper, Dan, Guru Parulkar, Sumi Choi, John DeHart, Tilman Wolf, Bernhard Plattner. "A Scalable, High Performance Active Network Node," In IEEE Network, 1/99.
[DE01] DeHart, J., W. D. Richard, E. Spitznagel, D. Taylor. "The Smart Port Card: An Embedded Unix Processor Architecture for Network Management and Active Networking," submitted to IEEE/ACM Transactions on Networking.
[DI97] Dittia, Zubin, Guru Parulkar, Jerry Cox. "The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques" Proceedings of INFOCOM, 1997.
[FE03] Feinstein, L., D. Schnackenberg, R. Balupari, and D. Kindred. "Statistical Approaches to DDoS Attack Detection and Response," in Proceedings of DISCEX III, 4/03.
[FL95] Floyd, S., V. Jacobson, S. McCanne, C. Liu, L. Zhang. "A Reliable Multicast Framework for Light-Weight Sessions and Application Framing," Proceedings of SIGCOMM, 1995.
[FU01] Fulp, Errin, Zhi Fu, Douglas S. Reeves, S. Felix Wu, and Xiaobing Zhang. "Preventing Denial of Service Attacks on Quality of Service," Proceedings of DISCEX II, 6/01.
[GO94] Golestani, S. "A self-clocked fair queuing scheme for high speed applications," Proceedings of Infocom, 1994.
[HE92] Henrion, M. A. "Resequencing System for a Switching Node," United States Patent #5,127,000, 6/92.
[HJ98] Hjalmtisson, Gisli and K. K. Ramakrishnan. "UNITE - An Architecture for Lightweight Signalling in ATM Networks," in the Proceedings of IEEE Infocom'98, pp. 832-840, San Francisco, CA, March 1998.
[HU97] Hua, Kien and Simon Sheu. "Skyscraper Broadcasting: a New Broadcasting Scheme for Metropolitan Video-on-Demand Systems," Proceedings of SIGCOMM, 1997.
[IY01] Iyer, Sundar, Ramana Rao Kompella, and Nick McKeown. "Analysis of a Memory Architecture for Fast Packet Buffers," IEEE - High Performance Switching and Routing, Dallas, Texas, May 2001, pp. 368-373.
[JA02] Janakiraman, Ramaprabhu, Marcel Waldvogel and Lihao Xu. "Fuzzycast: Efficient Video-on-Demand over Multicast," Proceeding of IEEE INFOCOM, 6/02.
[KE00] Keller, Ralph, Sumi Choi, Marcel Dasen, Dan Decasper, George Fankhauser and Bernhard Plattner. "An Active Router Architecture for Multicast Video Distribution," Proceedings of Infocom. 4/00.
[KI98] Gigabit Kits Technology Distribution Program., http://www.arl.wustl.edu/gigabitkits, 1998.
[KR99] Krishna, P., N. Patel, A. Charny and R. Simcoe. "On the speedup required for work-conserving crossbar switches," IEEE J. Selected Areas of Communications, 6/99.
[KU03] Kuhns, Fred, Samphel Norden and Jonathan Turner. "Lightweight Flow Setup for Wirespeed Flow Reservation," Proceedings of the Allerton Conference on Communication, Control and Computing, 10/03.
[PA99] Pan, P. and H. Schulzrinne, "YESSIR: A Simple Reservation Mechanism for the Internet," Computer Communication Review, Vol. 29, No. 2, April 1999.
[PA01] Park, K. and H. Lee. "On the Effectiveness of Route-based Packet Filtering for Distributed DoS Attack Prevention in Power-law Internets," in Proceeding of the ACM SIGCOMM, 2001.
[PA03a] Pappu, Prashanth, Jyoti Parwatikar, Jonathan Turner and Ken Wong. "Distributed Queueing in Scalable High Performance Routers," Proceeding of IEEE Infocom, 4/03.
[PA03b] Papadopoulos, Christos, Guru Parulkar and George Varghese. "LMS: A Router-Assisted Scheme for Reliable Multicast," IEEE/ACM Transactions on Networking, 2003.
[PR02] Prakash, Amit and Adnan Aziz, "A middle ground between CAMs and DAGs for high-speed packet classification," Proceedings of Hot Interconnects, 2002.
[QI02] Qiu, Ruibiao and Jonathan Turner. "Configuration of Reservered Delivery Subnetworks," Proceedings of Service Infrastructure for Virtual Enterprises Symposium, Globecom 2002, Taipei, Taiwan, November 2002.
[ST02] Steenkiste, Peter, Prashant Chandra, Jun Gao, Umair Shah, "An Active Networking Approach to Service Customization," DARPA Active Networks Conference and Exposition, 5/02
[TA02a] Taylor, David E., Jonathan S. Turner, John W. Lockwood. "Dynamic Hardware Plugins (DHP): Exploiting Reconfigurable Hardware for High-Performance Programmable Routers," Computer Networks, 2/02, vol. 38, no. 3, pp. 295-310.
[TA02b] Taylor, David E., John W. Lockwood, Todd Sproull, Jonathan S. Turner and David B. Parlour. "Scalable IP Lookup for Programmable Routers," Proceedings of IEEE Infocom, 6/02.
[TE97] Tennenhouse, D., J. Smith, W. Sincoskie, D. Wetherall, G. Minden. "A Survey of Active Network Research," IEEE Communications, 35:1, 80-86.
[TU97] Turner, Jonathan. "Extending ATM Networks for Efficient Reliable Multicast," Proceedings of Workshop on Communication and Architectural Support for Network-Based Parallel Computing, Springer Verlag, 2/97.

Related Projects

  1. Extreme Networking
  2. Field-programmable Port eXtender (FPX)
  3. Advanced Network Services Platform

Contacts

The techX project is a collaborative effort by the members of Washington University's Applied Research laboratory. For general information on this project contact either Dr. Jonathan Turner, or Dr. John Lockwood.

Acknowledgements

This work is supported by the National Science Foundation. However, any opinions, findings and conclusions or recomendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


Last updated 02/13/04