IP Lookup and Packet Classification PublicationsReconfigurable Network Group in the Applied Research Laboratory
(In Reverse Chronological Order)
(Also available as BIBTeX format)
Abstract: Ternary Content Addressable Memory (TCAM), although widely used for general packet classification, is an expensive and high power-consuming device. Algorithmic solutions which rely on commodity memory chips are relatively inexpensive and power-efficient but have not been able to match the generality and performance of TCAMs. Therefore, the development of fast and power-efficient algorithmic packet classification techniques continues to be a research subject.
In this paper we propose a new approach to packet classification which combines architectural and algorithmic tech- niques. Our starting point is the well-known crossproduct algorithm which is fast but has significant memory overhead due to the extra rules needed to represent the crossproducts. We show how to modify the crossproduct method in a way that drastically reduces the memory requirement without compromising on performance. Unnecessary accesses to the off-chip memory are avoided by filtering them through on- chip Bloom filters. For packets that match p rules in a rule set, our algorithm requires just 4+p+e. independent memory accesses to return all matching rules, where . e << 1 is a small constant that depends on the false positive rate of the Bloom filters. Using two commodity SRAM chips, a throughput of 38 Million packets per second can be achieved. For rule set sizes ranging from a few hundred to several thousand filters, the average rule set expansion factor attributable to the al- gorithm is just 1.2 to 1.4. The average memory consumption per rule is 32 to 45 bytes.
Abstract: Some of the fastest practical algorithms for IP route lookup are based on space-efficient encodings of multi-bit tries. Unfortunately, the time required by these algorithms grows in proportion to the address length, making them less attractive for IPv6. This paper describes and evaluates a new data structure called a shape-shifting trie, in which the data structure nodes correspond to arbitrarily shaped subtrees of the underlying binary trie for a given set of address prefixes. The ability to adapt the node shape to the trie reduces the number of nodes that must be accessed to perform a lookup, especially for tries with large sparse regions. We give a fast algorithm for optimally dividing a trie into nodes so as to minimize the maximum lookup depth. We show that seven data structure accesses are sufficient for route tables with more than 150,000 IPv6 prefixes. This makes it possible to achieve wire-speed processing for OC192 link using a single QDRII SRAM chip.
Abstract: Hash table is used as one of the fundamental modules in several network processing algorithms and applications such as route lookup, packet classification, per-flow state management and network monitoring. These applications, which typically form components of data-path in a high-speed router, must process and forward packets with little or no buffer in order to maintain the wire-speed throughout. A poorly designed hash table can critically affect the worstcase throughput due to multiple memory accesses required for each lookup. Hence, high throughput requirement in turn underscores the need for a hash table having good and more predictable worstcase lookup performance. While most of the existing hash table based packet processing algorithms rely on the assumption that hash table lookup needs constant time, very little discussion is provided on the underlying engineering considerations to achieve this performance.
We present a novel hash table data structure and lookup algorithm which improves the performance of a naive hash table by providing better bounds on the hash collisions and memory accesses per search. Our algorithm extends the multiple-hashing Bloom Filter data structure to support exact matches. We contrive our hash table architecture by coupling our algorithm with the latest advances in embedded memory technology. Through theoretical analysis and simulations we show that our algorithm is significantly faster for practical purposes than the naive hash table using the same amount of memory, hence it can support better throughput for router applications based on hash tables.
Abstract: Intrusion rule processing in reconfigurable hardware enables intrusion detection and prevention services to run at multi Gigabit/second rates. High-level intrusion rules mapped directly into hardware separate malicious content from benign content in network traffic. Hardware parallelism allows intrusion systems to scale to support fast network links, such as OC-192 and 10 Gbps Ethernet. In this paper, a Snort Intrusion Filter for TCP (SIFT) is presented that operates as a preprocessor to prevent benign traffic from being inspected by an intrusion monitor running Snort. Snort is a popular open-source rule-processing intrusion system. SIFT selectively forwards IP packets that contain questionable headers or defined signatures to a PC where complete rule processing is performed. SIFT alleviates the need for most network traffic from being inspected by software. Statistics, like how many packets match rules, are used to optimize rule processing systems. SIFT has been implemented and tested in FPGA hardware and used to process Internet traffic from a campus Internet backbone with live data.
Abstract: High-performance rule processing systems are needed by network administrators in order to protect Internet systems from attack. Researchers have been working to implement components of intrusion detection systems (IDS), such as the highly popular Snort system, in reconfigurable hardware. While considerable progress has been made in the areas of string matching and header processing, complete systems have not yet been demonstrated that effectively combine all of the functionality necessary to perform rule processing for network systems. In this paper, a framework for implementing a rule processing system in reconfigurable hardware is presented. The framework integrates the functionality to scan data flows for regular expressions, fixed strings, and header values. It also allows modules to be added to perform extended functionality to support all features found in Snort rules. Reconfigurability and flexibility are key components of the framework that enable it to adapt to protect Internet systems from threats including malicious worms, computer viruses, and network intruders. To prove the framework viable, a system has been built that scans all bytes of Transmission Control Protocol/Internet Protocol (TCP/IP) traffic entering and leaving a network's gateway at multi-gigabit rates. Using Xilinx FPGA hardware on the Field programmable Port eXtender (FPX) platform, the framework can process 32,768 complex rules at data rates of 2.5 Gbps. Systems to handle data at 10 Gbps rates can be built today using the same framework in the latest reconfigurable hardware devices such as the Virtex 4.
Abstract: FPGA technology has become widely used for real-time network intrusion detection. In this paper, a novel packet classification architecture called BV-TCAM is presented, which is implemented for an FPGA-based Network Intrusion Detection System (NIDS). The classifier can report multiple matches at gigabit per second network link rates. The BV-TCAM architecture combines the Ternary Content Addressable Memory (TCAM) and the Bit Vector (BV) algorithm to effectively compress the data representations and boost throughput. A tree-bitmap implementation of the BV algorithm is used for source and destination port lookup while a TCAM performs the lookup of the other header fields, which can be represented as a pre/x or exact value. The architecture eliminates the requirement for prefix expansion of port ranges. With the aid of a small embedded TCAM, packet classification can be implemented in a relatively small part of the available logic of an FPGA. The design is prototyped and evaluated in a Xilinx FPGA XCV2000E on the FPX platform. Even with the most difficult set of rules and packet inputs, the circuit is fast enough to sustain OC48 tra1c throughput. Using larger and faster FPGAs, the system can work at speeds greater than OC192.
Abstract We introduce the first algorithm that we are aware of to employ Bloom filters for Longest Prefix Matching (LPM). The algorithm performs parallel queries on Bloom filters, an efficient data structure for membership queries, in order to determine address prefix membership in sets of prefixes sorted by prefix length. We show that use of this algorithm for Internet Protocol (IP) routing lookups results in a search engine providing better performance and scalability than TCAM-based approaches. The key feature of our technique is that the performance, as determined by the number of dependent memory accesses per lookup, can be held constant for longer address lengths or additional unique address prefix lengths in the forwarding table given that memory resources scale linearly with the number of prefixes in the forwarding table. Our approach is equally attractive for Internet Protocol Version 6 (IPv6) which uses 128-bit destination addresses, four times longer than IPv4. We present a basic version of our approach along with optimizations leveraging previous advances in LPM algorithms. We also report results of performance simulations of our sys snapshots of IPv4 BGP tables and extend the results to IPv6. Using less than 2Mb of embedded RAM and a commodity SRAM device, our technique achieves average performance of one hash probe per lo and a worst case of two hash probes and one array access per lookup.
Abstract An extensible .rewall has been implemented that performs packet .ltering, content scanning, and per-.ow queuing of Internet packets at Gigabit/second rates. The .rewall uses layered protocol wrappers to parse the content of Internet data. Packet payloads are scanned for keywords using parallel regular expression matching circuits. Packet headers are compared to rules speci.ed in Ternary Content Addressable Memories (TCAMs). Per-.ow queuing is performed to mitigate the effect of Denial of Service attacks. All packet processing operations were implemented with recon.gurable hardware and .t within a single Xilinx Virtex XCV2000E Field Programmable Gate Array (FPGA). The singlechip .rewall has been used to .lter Internet SPAM and to guard against several types of network intrusion. Additional features were implemented in extensible hardware modules deployed using run-time recon.guration.
Abstract Internet protocol (IP) address lookup is a central processing function of Internet routers. While a wide range of solutions to this problem have been devised, very few simultaneously achieve high lookup rates, good update performance, high memory efficiency, and low hardware cost. High performance solutions using content addressable memory devices are a popular but high-cost solution, particularly when applied to large databases. We present an efficient hardware implementation of a previously unpublished IP address lookup architecture, invented by Eatherton and Dittia. Our experimental implementation uses a single commodity synchronous random access memory chip and less than 10% of the logic resources of a commercial configurable logic device, operating at 100 MHz. With these quite modest resources, it can perform over 9 million lookups/s, while simultaneously processing thousands of updates/s, on databases with over 100000 entries. The lookup structure requires 6.3 bytes per address prefix: less than half that required by other methods. The architecture allows performance to be scaled up by using parallel fast IP lookup (FIPL) engines, which interleave accesses to a common memory interface. This architecture allows performance to scale up directly with available memory bandwidth. We describe the Tree Bitmap algorithm, our implementation of it in a dynamically extensible gigabit router being developed at Washington University in Saint Louis, and the results of performance experiments designed to assess its performance under realistic operating conditions.
Abstract A module has been implemented in Field Programmable Gate Array (FPGA) hardware that scans the content of Internet packets at Gigabit/second rates. All of the packet processing operations are performed using recon/gurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content match- ing server automatically generates the Finite State Machines (FSMs) to search for regular expressions. The complete system is operated on the Field-programmable Port Extender (FPX) platform.
Abstract A network platform called the Field-programmable Port Extender (FPX) streamlines and simplifies network transmission processing directly in hardware.
Abstract A prototype platform has been developed that allows processing of packets at the edge of a multi-gigabit-per-second network switch. This system, the Field Programmable Port Extender (FPX), enables packet processing functions to be implemented as modular components in reprogrammable hardware. All logic on the on the FPX is implemented in two Field Programmable Gate Arrays (FPGAs). Packet processing functions in the system are implemented as dynamically-loadable modules.
Core functionality of the FPX is implemented on an FPGA called the Networking Interface Device (NID). The NID contains the logic to transmit and receive packets over a network, dynamically reprogram hardware modules, and route individual traffic flows. A full, non-blocking, switch is implemented on the NID to route packets between the networking interfaces and the modular components. Modular components of the FPX are implemented on a second FPGA called the Reprogrammable Application Device (RAD). Modules are loaded onto the RAD via reconfiguration and/or partial reconfiguration of the FPGA.
Through the combination of the NID and the RAD, the FPX can individually reconfigure the packet processing functionality for one set of traffic flows, while the rest of the system continues to operate. The platform simplifies the development and deployment of new hardware-accelerated packet processing circuits. The modular nature of the system allows an active router to migrate functionality from softare plugins to hardware modules.
Abstract: This paper describes the design, implementation and performance of an open, highperformance, dynamically reconfigurable Multi-Service Router (MSR) being developed at Washington University in St. Louis. This router provides an experimental platform for research on protocols, router software and hardware design, network management, quality of service and advanced applications. The MSR has been designed to be flexible, without sacrificing performance. It supports gigabit links and uses a scalable architecture suitable for supporting hundreds or even thousands of links. The MSRs flexibility makes it an ideal platform for experimental research on dynamically extensible networks that implement higher level functions in direct support of individual application sessions.