CSE 770 Paper Review

Reviewer: Jon Turner
Date: 2-16-2006

How would you rate this paper, relative to others we have read? bottom 50%

How would you rate your knowledge of the topic of this paper? expert

What problem or issue does the paper address? Why is it important?

This paper addresses the problem of high performance IP lookup, targeting link rates of 160 Gb/s. In spite of the authors' assertions, it is not clear this important. Very few links have rates greater than 10 Gb/s and this is not likely to change, because installed fiber is poorly suited to link speeds greater than 10 Gb/s. Links can deliver higher total throughput using WDM, but different WDM channels are generally handled by different router interfaces. There is not a big advantage to having a single high speed lookup engine handle these different interfaces.

What are the main contributions of the paper and why are they important?

The observation that there are better ways of mapping nodes to pipeline stages than the standard method.

How significant are these contributions relative to previous work?

They're actually a nice starting point, but the authors missed some fairly obvious opportunities to go much further. However, there is also a major error in their analysis, which invalidates most of their results. Most of the results can probably be salvaged, in a modified form, but the O(1) update claim is just wrong.

Give detailed comments justifying your view of the paper.

The big error in the paper is the authors' claim that prefixes need never be replicated. Consider the set of prefixes:p1=*, p2=000*, p3=010*, p4=100* p5=110*. The authors scheme requires applying leaf-pushing to p1, but in this case we get four copies of p1. In the worst-case, we could get 2^31 copies of p1. It seems likely that the total number of "extra" nodes created in this way is bounded by the number of prefixes, so the impact on total memory size may not be too bad. On the other hand, the O(1) update claim goes out the window. In fact, their data structure takes O(n) time to update in the worst-case, where n is the number of prefixes. (Sailesh Kumar and Manfred Georg first identified this flaw, in our discussion of the paper)

The authors map trie nodes to pipeline stages based on their height in the tree, rather than the more common technique of mapping them based on their depth. They argue that this provides better balancing of the storage across the different memory banks. However both approaches unnecessarily constrain the mapping of nodes to memory banks. Here is a simple alternative that I would expect to provide much better balance in almost any real database.

Let k be the number of memory banks and let B be the number of trie nodes that can fit in each bank. Now, do a breadth first traversal, of the trie assigning each node a "color" in 0,1,2,... The assignment of colors should satisfy two constraints:

The algorithm assigns each node the smallest color that satisfies both constraints, as it does the breadth-first search. When all nodes have been colored, each node with color c is assigned to bank c mod k.

Now the only performance question is what is the maximum number of times any memory bank can be accessed for a packet. This is determined by the largest color used. In particular, if Cmax is the largest color then ceiling(Cmax/k) is the worst-case number of accesses per memory module. Ideally, this is also equal to ceiling(W/k), where W is the maximum number of nodes on a path from the root to a leaf. If it is, then the coloring is optimal. While the algorithm can't guarantee an optimal coloring, I have no doubt that it works much better than the scheme in the paper.

The other big drawback of the paper is the authors' use of binary tries. This is silly, since it increases the number of memory accesses needed to process a packet. The authors overcome this by using more memory modules, but they could have done better by simply recognizing that the banking technique can be applied to other trie-like data structures with a longer stride (including tree bitmap). A tree bitmap data structure with a stride of four, will use one fourth the memory bandwidth needed by a binary trie, hence will need one fourth the memory modules to match any given level of performance for a binary trie.

The authors also seem to be naive in their technology assumptions. They are assuming a dedicated IP lookup chip with on-chip memory. This is an expensive option (more expensive than TCAM). A more economical solution uses commodity off-chip memory banks with a small on-chip lookup engine. If their method is applied in this environment, there is a signficant mismatch in assumption. With off-chip memories, we would normally have equal size memory modules, and the minimum access tends to be fairly large (64 bits or more) making binary trie nodes too small to use the space to best advantage.

The authors' assertions about lookup time are a bit absurd. While supporting dynamic updates is important, there is no realistic system context in which updates need to be done as fast as lookups. Real systems may need to do 100 updates per second. A conservative designer would engineer for 1000 updates per second. Almost any reasonable trie-based scheme can meet that objective without any difficulty.

The authors also appear to be unaware of techniques that use just a single off-chip memory access, by using on-chip Bloom filters to determine which of W hash tables to access. Indeed, their references to the literature are highly selective. Given the huge volume of work that has been done in this area, this is a strong negative.

In my view, this paper should never have been published.