Discussion report on: Gigabit Rate Packet Pattern-Matching Using TCAM
By Fang Yu,
Randy H. Katz, T. V. Lakshman, ICNP 2004
Topic:
A TCAM based, multiple-pattern matching algorithm for layer 7 (content) pattern matching problems at gigabit rate. Since signatures are more complex for intrusion detection than some of the other layer 7 problems such as HTTP load balancing, email SPAM filtering etc., the paper is geared towards that.
Discussion:
NOTE: Since there are many
limitations to point out in this paper, I bring out points at relevant places
rather than summarizing the paper first.
The paper starts out well. Authors briefly point out problems with existing solutions but some in the class raised concerns that they do not cover important results from recent research but specifics were not discussed.
Memory access is often a limiting
factor in intrusion detection systems. The proposed solution is based on TCAM but
TCAMs are known for being expensive and power
consuming, their use of it becomes one of the things to analyze first. The TCAM
size that they used is 240KB. But since TCAMs of 2MB
are common nowadays, 240KB is not bad. These large(r) TCAMs
cost ~$400-500 though. While this may not be much compared to the cost of good
intrusion detection systems, content matching is not the only component of such
systems. But the thing to note here is that the 240KB size that they mention is
only for their simulation of CalmAV virus database
which only contains simple patterns.
Along
with TCAMs, their solution involves SRAM access. For
every TCAM hit, they access the combined pattern table stored in memory once
and matching table as many times as there are entries in the PHL (Partial Hit
List).
As far as SRAM requirements, for
their rather limited simulation using only ClamAV
patterns, we are left to infer the SRAM size to be < 10MB. Unfortunately,
they do not discuss this for SNORT database that involves more complex
correlated patterns. For long correlated patterns, the PHL is grows as the
entries can not be removed after 4 bytes. That means many more memory accesses
because every TCAM hit involves a memory access to the combined-pattern table
and j number of accesses to matching table where j is the number of entries in
the PHL. To address this, authors recommend restricting the inter sub-pattern
distance in correlated patters. Though it sounds too restrictive, 128B for
pattern length is pretty large and so their recommendation may not be too bad.
But what is not good about their memory requirements is their high per byte
memory requirement.
Unfortunately, the large
sub-pattern mentioned above is not the only cause of increased memory access.
Negation patterns can cause them too – suppose we are looking for “user”
followed by say “login” after 50 bytes, a malicious attacker can say “user”
repeatedly, sequentially, thus forcing us to keep the PHL entries for 50 bytes
and we already saw the bad effect of more entries in PHL.
Although this and the paper makes
us believe that their system is vulnerable to malicious attacks, simulation on
SNORT database might have shown a different result but unfortunately the
authors don’t report that. But to be fair to the authors, currently, all
intrusion detection systems all vulnerable to this and solution using bloom
filter is even more vulnerable.
Finally, in the result for their
simulation on SNORT database, they do not mention the window size. It looks
like they used the smallest size of 20 and if that is the case,
the result would be bad if we used a bigger size of 200 mentioned in table 7.
In addition to all these, the
authors have completed missed another important contributor to memory: authors say that their work is applicable to
layer 7 but they never discuss flow inspection while layer 7 inspection
requires treating flow as the entity and not packets. Flow inspection involves
storing flow state information which will only add to memory requirement.
Worse, matching patterns across packets, leave alone across flows, will further
increate number of entries in PHL, increasing memory access more.
Authors highlight their ability to handle:
· Simple patterns
o Deterministic
o Non-deterministic: case insensitive alphabets and wildcard byte
· Complex/ composite patterns that include
o Arbitrarily long patterns
o Correlated patterns and
o Negation
· Regular expression like patterns.
Unfortunately, their discussion of correlated patterns is very limited. An eg. of a correlated pattern is P3 = P1 * P2 but authors don’t discuss long content between P1 and P2.
Since payload processing is often
done along with header processing, the paper could have discussed
considerations for integration with other such functionalities.
The good things about the paper
are: their analysis of effect of TCAM width on real and synthetic traces for
two pattern databases and the pipelining of TCAM and SRAM accesses.
As far as quality of the writing, they start out well but when it gets to subject matter, unfortunately, the writing is not so good. The worst part is when they explain the matching algorithm – there are many tables used but the use of their indices is not at all explained when that is fundamental to understand their matching algorithm. Similarly, there is also a problem with consistency of the names being used – for instance, they start out with the term ‘matching table’ but somewhere down the paper they switch to ‘mapping table’ without making any connection between them. Finally, they never explain “window size” though we can infer what it means.
In all, the idea presented in the
paper is good but not fundamentally novel or groundbreaking. Not surprisingly,
the class ranked it in the middle third.
Shobana