The Parallel Sieve Method for a Virus Scanning Engine
Hiroki Nakahara †, Tsutomu Sasao †, Munehiro Matsuura †, and Yoshifumi Kawamura ††
†Kyushu Institute of Technology, Japan
††Renesas Technology Corp., Japan
Abstract
This paper shows a new architecture for a virus scanning system, which is different from that of an intrusion detection system. The proposed method uses two-stage matching: in the first stage, a hardware filter quickly scans the text to find partial matches, and in the second stage, the MPU scans the text to find a total match in the ClamAV set of 514,287 virus patterns. To make the hardware filter simple, we use a finite-input memory machine (FIMM). To reduce the memory size of the FIMM, we introduce the parallel sieve method. The proposed method is memory-based, so it is quickly reconfigurable and dissipates less power than a TCAM-based method. The system is implemented on a Stratix III FPGA with three off-chip SRAMs and an SDRAM, where all 514,287 ClamAV virus patterns are stored. Compared with existing methods, our method achieves a 1.41-31.36 times more efficient area-throughput ratio.
1 Introduction
Malware (a composite word from malicious software) is intended to damage computer systems. With the wide use of the Internet, users can easily access and download dangerous data, so the risk of infection by malware is increasing. Malware secretly installs a bot virus, a back door, or a keylogger. As a result, password exploitation, information theft, and illegal remote operation can harm computer users. Although a software-based virus scanning system can clean and isolate the malware, the throughput of software-based scanning is at most tens of megabits per second (Mbps) [16]. Thus, the software-based approach cannot keep up with modern Internet throughput, which is more than one gigabit per second (Gbps). Malware is becoming more prevalent and more complex, so virus scanning on computer systems will be a bottleneck in the future. Recently, hardware-based virus scanning systems have been attached to the gateway between the Internet and the Intranet [22]. Fig. 1 shows an example of a virus scanning system. To detect malware, first the packet receiver assembles the data from the incoming packets, and decompresses any compressed data. Then, the virus scanning engine inspects the data to see if it contains malware. Finally, the packet sender divides the data into packets and sends them to the Intranet. The most important part of the virus scanning system is the virus scanning engine. The other parts can be realized by conventional techniques. So, in this paper, we focus on the virus scanning engine.

[Figure 1. Virus Scanning System. Blocks shown: Internet, Router, Firewall, Virus Scanning System (PHY/MAC Ports, Packet Receiver, Virus Scanning Engine, Packet Sender, PHY/MAC Ports), Intranet.]
In a typical commercial virus scanning system [22], the throughput is about 1.2 Gbps, the power dissipation is 450 W, and the price is around $10,000. Here, we consider a virus scanning engine with the following features:

High throughput: It has a throughput of more than one Gbps.

Low power: It is SRAM-based rather than TCAM-based [3, 24]. The power dissipation of a TCAM is higher than that of an SRAM. Also, the number of transistors per bit of a TCAM is larger than that of an SRAM [14]. Table 1 [9] compares the SRAM with the TCAM.

High-speed reconfigurable: It uses a memory-based realization rather than a hard-wired realization. Some virus scanning software, e.g., Kaspersky [10], updates the virus data every hour. Although a random logic implementation of the virus scanning circuit on an FPGA [6] is fast and compact, the time for place-and-route is longer than the period between virus pattern updates. Thus, a hard-wired system is unsuitable for quick updates.

Memory efficient: Various memory-based methods have been proposed. For example, in [5], the patterns are embedded into the memory of a sequential circuit for an Aho-Corasick automaton. It requires 46 bytes per character. For the current version (v.0.94.2) of ClamAV [7] (the most popular open source antivirus software), the number of patterns is 514,287. To store all the patterns, tens of high-end FPGAs are required. Thus, the method of [5] is too expensive for virus scanning.
Table 1. Comparison of TCAM with SRAM (18 Mbit chip) [9].
                              TCAM      SRAM
Max. Freq. [MHz]              266       400
Power Dissipation [W]         12-15     ≈ 0.1
# of transistors per bit      16        6
Since the number of virus patterns is large, conventional methods cannot be used. In this paper, we show a new architecture for the virus scanning system, which is different from that of the intrusion detection system. The proposed method uses two-stage matching [5, 24], which is area-throughput efficient. That is, in the first stage, the parallel hardware filter quickly scans the text to find partial matches, and in the second stage, the MPU scans the text to find the total match. To make the hardware filter simple, we use a finite-input memory machine (FIMM). To reduce the memory size of the FIMM, we introduce the parallel sieve method. The proposed method is memory-based, so its power consumption is lower than that of a TCAM-based method.
The rest of the paper is organized as follows: Section 2 introduces the virus scanning system; Section 3 describes the finite-input memory machine (FIMM); Section 4 shows the parallel sieve method that reduces the memory for the FIMM; Section 5 shows implementation results on an Altera FPGA; and Section 6 concludes the paper.
2 Virus Scanning System
2.1 Virus Scanning Problem
A virus scanner detects malware in executable code or data. Text denotes a string of characters in which there is a possible virus. The malware is specified by a pattern (see footnote 1) that consists of subpatterns. Virus scanning corresponds to pattern matching that detects variable-length patterns in the text.
2.2 Virus Pattern in ClamAV
The current version of ClamAV (v.0.94.2) [7] contains three types of patterns: the MD5 checksum pattern, the basic pattern, and the restricted regular expression pattern. The MD5 checksum pattern and the basic pattern are represented by characters, while the restricted regular expression pattern is represented by characters and meta characters. A character is represented by 8 bits, or a pair of hexadecimal digits. The length of a pattern is the number of characters in the pattern. In this paper, k denotes the number of patterns, and c denotes the length of a pattern. Table 2 shows the meta characters used in ClamAV, and Table 3 shows examples of virus patterns in ClamAV.
Footnote 1: A pattern is also called a signature.

Table 2. Meta Characters Used in ClamAV.
Meta character   Meaning
??               An arbitrary character
*                Repetition of zero or more characters (Kleene closure)
()               Specify the priority of the operation
|                Logical OR
{n,m}            Repetition (more than n and less than m)
Table 3. Virus Patterns in ClamAV.
Virus Name            Pattern
Trojan.Bat.DelY3      64656c74726565{1}2f(59|79)20633a5c2a2e2a
Trojan.Bat.DelY       44454c54524545202f(59|79)20633a5c2a2e2a
Trojan.Bat.MkDir.B    406d64202572616e646f6d25????676f746f20486f6f
W32.Gop               736d74702e796561682e6e65*2d20474554204f494351
Worm.Bagle67          6840484048688d5b0090eb01ebeb0a5ba9ed46
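
To make the pattern notation concrete, the following is a minimal Python sketch that translates a hexadecimal pattern into a byte-level regular expression. It handles only the `??` and `*` meta characters from Table 2; the function name `signature_to_regex` and the sample text are illustrative assumptions, not part of ClamAV or of the proposed engine.

```python
import re

def signature_to_regex(sig: str) -> re.Pattern:
    """Translate a hex pattern with '??' and '*' into a bytes regex.

    Hypothetical helper for illustration; '()', '|' and '{n,m}' are not handled.
    """
    parts = []
    i = 0
    while i < len(sig):
        if sig[i] == "*":
            parts.append(b".*?")                 # any run of bytes
            i += 1
        elif sig[i:i + 2] == "??":
            parts.append(b".")                   # exactly one arbitrary byte
            i += 2
        else:
            literal = bytes.fromhex(sig[i:i + 2])  # one character = two hex digits
            parts.append(re.escape(literal))
            i += 2
    return re.compile(b"".join(parts), re.DOTALL)

# Pattern of Trojan.Bat.MkDir.B from Table 3.
MKDIR_B = "406d64202572616e646f6d25????676f746f20486f6f"
pattern = signature_to_regex(MKDIR_B)

# Sample text in the spirit of Fig. 2: the two '??' positions hold 3f 20.
sample = bytes.fromhex("0ced406d64202572616e646f6d253f20676f746f20486f6f00")
print(pattern.search(sample) is not None)
```

The pattern length in the sense of Section 2.2 is the number of two-digit groups, so the pattern above has 22 characters.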
2.3 Virus Patterns in ClamAV
Table 4 compares the current version of ClamAV (v.0.94.2) with that of SNORT (v.2.8.3.2) [20]. It shows that the regular expressions for ClamAV are simpler than those for SNORT. However, the number of patterns in ClamAV is much larger than that in SNORT. This implies that a virus scanning system requires a memory-efficient architecture.
Table 4. Comparison of ClamAV with SNORT.
                               ClamAV      SNORT
# of patterns                  514,287     3,533
Average pattern length         32.9        193.7
Average # of meta characters   0.081       46.7
2.4 Two-stage Matching for the Virus Scanning
ClamAV uses two-stage matching to achieve a high-speed and memory-efficient system. The first stage searches for subpatterns using an Aho-Corasick (AC) automaton [1] to find partial matches [5]. To obtain the AC automaton, first, the given patterns are represented by a text tree (trie). Next, the failure paths that indicate the transitions for mismatches are attached to the text tree. Since the AC automaton stores failure paths, no backtracking is required. By scanning the text only once, the AC automaton can detect all the fixed-length patterns. The AC automaton can be realized by a state machine consisting of a memory and a register. The memory stores the state transition functions and the output functions, while the register stores the state variables (see footnote 2). Since the state transitions of the AC automaton are complex, the size of the memory tends to be large. To realize a compact automaton, ClamAV uses an AC automaton with the length c = 3. The second stage scans patterns using PCRE (Perl Compatible Regular Expressions, a library for the C language) [15] to find total matches, while the first stage detects the partial matches. To perform the second stage, additional off-chip memory and an MPU are used.

[Figure 2. An Example of the Two-stage Matching. The first stage matches the subpattern 40 6d in the text; the second stage matches the whole pattern of Trojan.Bat.MkDir.B.]

Footnote 2: The number of bits for the register is much smaller than that for the memory, so we ignore the bits for the register. In this paper, the amount of memory denotes the memory that stores the state transition functions and the output functions (in bits).
Example 2.1 Fig. 2 illustrates two-stage matching that detects Trojan.Bat.MkDir.B in Table 3. First, it detects the subpattern 406d; then, it detects the whole pattern of Trojan.Bat.MkDir.B.
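
The two stages of Example 2.1 can be sketched in software as follows. This is an illustrative model only: the first stage is a plain substring search standing in for the AC automaton (or hardware filter), the second stage uses Python's `re` module in place of PCRE, and the helper names `first_stage`, `second_stage`, and `scan` are assumptions.

```python
import re

NAME = "Trojan.Bat.MkDir.B"
HEAD = bytes.fromhex("406d64202572616e646f6d25")   # characters before "????"
TAIL = bytes.fromhex("676f746f20486f6f")           # characters after "????"
# Full pattern: HEAD, two arbitrary characters, TAIL.
FULL = re.compile(re.escape(HEAD) + b".." + re.escape(TAIL), re.DOTALL)
SUBPATTERN = HEAD[:2]                              # 40 6d, as in Fig. 2

def first_stage(text: bytes):
    """Yield every offset where the short subpattern occurs (partial match)."""
    pos = text.find(SUBPATTERN)
    while pos != -1:
        yield pos
        pos = text.find(SUBPATTERN, pos + 1)

def second_stage(text: bytes, pos: int) -> bool:
    """Check the full pattern at an offset reported by the first stage."""
    return FULL.match(text, pos) is not None

def scan(text: bytes):
    """Return (offset, virus name) for every total match."""
    return [(pos, NAME) for pos in first_stage(text) if second_stage(text, pos)]

text = bytes.fromhex("0ced406d64202572616e646f6d253f20676f746f20486f6f")
decoy = b"@m only a partial match"

print(scan(text))                           # total match at offset 2
print(list(first_stage(decoy)), scan(decoy))  # partial match, no total match
```

The decoy input shows why the second stage is needed: a short subpattern produces partial matches that are not viruses, and each of them costs work in the second stage.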
2.5 Profile Analysis for the Virus Scanning
To analyze the profile of two-stage matching on a PC, we selected 512 patterns from ClamAV, and performed two-stage matching on 10 executable codes. The profile analysis shows that the first stage spends 83% of the CPU time, while the second stage spends 17% of the CPU time. Thus, to improve throughput, we should consider the following:

High-speed automaton to improve the first stage: We use a hardware filter instead of software. For virus scanning, since each packet can be inspected independently, parallel processing using hardware filters is applied.

Reduction of the partial matches in the first stage to reduce the load in the second stage: Increasing the length c reduces the partial matches. This reduces the work for the MPU. However, this increases the memory size in the first stage.

From an experiment on a PC, the interval between partial match occurrences (in number of characters) is about 100 when c = 3. For our virus engine, an embedded MPU on an FPGA is used in the second stage. However, the embedded MPU is slower than a PC. When an interrupt happens during the matching operation on the embedded MPU, the system must suspend the hardware filter in the first stage. To avoid suspension, we increase the length of patterns to c = 4 to reduce the number of partial matches (see footnote 3).
3 Virus Scanning Engine Using a Finite-Input Memory Machine and an MPU
In two-stage matching for virus scanning, it is necessary to reduce the size of the hardware filter. To make the hardware filter compact, first we introduce the finite-input memory machine (FIMM). Also, in this section, we describe two-stage matching using an FIMM and an MPU.

[Figure 3. Finite-Input Memory Machine (FIMM). A shift register of c 8-bit registers holds the input text; its 8c bits address a memory whose output is the match number of $\lceil \log_2(k+1) \rceil$ bits.]

Footnote 3: We can increase the length beyond c = 4. However, this increases the amount of memory. Details are shown in Section 4.4.
3.1 Finite-Input Memory Machine
Since the state transitions of the AC automaton are complex, the size of the memory tends to be large. For intrusion detection systems, bit partitioning is used to reduce the size of the circuit [21]. However, for the virus scanning system, the circuit would be too large even if the bit partitioning method is used. By restricting the transitions, we have a finite-input memory machine (FIMM) [11]. Fig. 3 shows the architecture of an FIMM that accepts k patterns of length at most c. In Fig. 3, Reg. denotes an 8-bit parallel-in parallel-out register. The FIMM stores the past c inputs in a shift register. The memory produces the match number. The FIMM is smaller and faster than the state machine for the AC automaton, since no circuit for the state transition functions is required. The FIMM has $2^{8c}$ states. Let $M_{FIMM}$ be the size of the memory that realizes the output function of the FIMM, k be the number of patterns, and c be the length of the patterns. Then, we have
$M_{FIMM} = 2^{8c} \lceil \log_2(k+1) \rceil$.
Since the amount of memory for the state variables realized by the shift register is much smaller than that for the output functions, it is ignored.
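
The behavior of the FIMM can be modeled in a few lines of Python. This is a behavioral sketch under simplifying assumptions: every pattern has exactly c characters, and a dictionary stands in for the $2^{8c}$-word memory. The class name `FIMM` and its methods are illustrative.

```python
from collections import deque

class FIMM:
    """Behavioral model of a finite-input memory machine (illustrative)."""

    def __init__(self, patterns, c):
        self.c = c
        # The memory maps the last c input characters to a match number.
        # A dict replaces the 2^(8c)-word memory; absent entries mean 0.
        self.memory = {}
        for index, p in enumerate(patterns, start=1):
            assert len(p) == c, "this sketch assumes patterns of exactly c characters"
            self.memory[bytes(p)] = index
        # Shift register of c 8-bit registers, initially all zero.
        self.shift = deque([0] * c, maxlen=c)

    def step(self, char: int) -> int:
        """Shift in one character and return the match number (0 = no match)."""
        self.shift.append(char)
        return self.memory.get(bytes(self.shift), 0)

fimm = FIMM([b"@md ", b"smtp"], c=4)
outputs = [fimm.step(ch) for ch in b"xx@md smtp"]
print(outputs)
```

There is no state transition logic: the next state is simply the shifted input history, which is why only the output function needs memory.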
3.2 Virus Scanning Engine
Fig. 4 shows the virus scanning engine consisting of an FIMM and an MPU. The FIFO stores the indices of partial matches and the positions of the detected subpatterns. When a partial match is detected, the FIFO sends an interrupt signal (IRQ) to the MPU. When the MPU accepts an IRQ, it scans the full text to check whether it is a total match or not.
4 Realization of the FIMM using the Parallel Sieve Method

[Figure 4. Virus Scanning Engine. First stage: the text enters the FIMM, whose match number goes to the FIFO. Second stage: the FIFO raises an IRQ to the MPU (NiosII/f), which accesses the DDR2 SDRAM over the system bus.]

Table 5. Example of an Index Generation Function.
x1  x2  x3  x4  x5  x6    f
0   0   0   0   1   0     1
0   1   0   0   1   0     2
0   0   1   0   1   0     3
0   0   1   1   1   0     4
0   0   0   0   0   1     5
1   1   1   0   1   1     6
0   1   0   1   1   1     7

For the virus scanning engine using two-stage matching, in the first stage, the FIMM scans the text of length c to find partial matches, while in the second stage, the MPU scans the full-length text to find the total match. When the output function of the FIMM is implemented by a single memory, the necessary memory size $M_{FIMM}$ is
$M_{FIMM} = 2^{8 \times c} \lceil \log_2(k+1) \rceil \simeq 2^{51}$ [bits], (1)
when k = 514,287 and c = 4. So, a single-memory implementation is impractical. To reduce the memory size, we use the parallel sieve method, which uses multiple index generation units (IGUs).
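
The size estimate in (1) can be checked numerically with the short sketch below (the function name `fimm_memory_bits` is an assumption). Evaluated literally, the product in (1) is $2^{32}$ words of 19 bits each, about $2^{36.2}$ bits; the figure of $2^{51}$ quoted in the text corresponds to the product $2^{4 \times 8} \times 514{,}287$ that appears in Section 4.4. Either way, the size is far beyond a practical single memory.

```python
from math import ceil, log2

def fimm_memory_bits(k: int, c: int) -> int:
    """Memory size of a single-memory FIMM: 2^(8c) words of ceil(log2(k+1)) bits."""
    words = 1 << (8 * c)
    width = ceil(log2(k + 1))
    return words * width

k, c = 514_287, 4
width = ceil(log2(k + 1))
print(width)                                # bits per match number
print(log2(fimm_memory_bits(k, c)))         # exponent of the literal formula
print(log2((1 << (8 * c)) * k))             # exponent of 2^(8c) * k
```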
4.1 Index Generation Function
Definition 4.1 A mapping $F(\vec{X}): B^n \to \{0, 1, \ldots, k\}$ is an index generation function with weight k if $F(\vec{a}_i) = i$ $(i = 1, 2, \ldots, k)$ for k different registered vectors, and $F = 0$ for the other $(2^n - k)$ input vectors, where $\vec{a}_i \in B^n$ $(i = 1, 2, \ldots, k)$. In other words, an index generation function produces indices ranging from 1 to k for k different registered vectors, and produces 0 for other vectors.
In virus scanning, a registered vector corresponds to a virus pattern, while an index corresponds to the unique number for each virus pattern.
Example 4.2 Table 5 shows an example of an index generation function f, where n = 6 and k = 7.
An FIMM realizes an index generation function as shown in Fig. 3. A content addressable memory (CAM) implements the index generation function directly. However, an FPGA realization of a CAM requires a large number of elements [8, 13]. Also, the CAM dissipates more power than the SRAM, as shown in Table 1. Thus, SRAMs are used to implement the index generation function.
4.2 Index Generation Unit
Let $f(X_1, X_2)$ be an index generation function of n variables with weight k. A single-memory realization requires $2^n \lceil \log_2(k+1) \rceil$ bits.
Table 6. Decomposition Chart for $f(X_1, X_2)$.
          x3    0 0 0 0 1 1 1 1
          x2    0 0 1 1 0 0 1 1
          x1    0 1 0 1 0 1 0 1
x6x5x4
000             0 0 0 0 0 0 0 0
001             0 0 0 0 0 0 0 0
010             1 0 2 0 3 0 0 0
011             0 0 0 0 4 0 0 0
100             5 0 0 0 0 0 0 0
101             0 0 0 0 0 0 0 0
110             0 0 0 0 0 0 0 6
111             0 0 7 0 0 0 0 0

Table 7. Decomposition Chart for $\hat{f}(Y_1, X_2)$.
          y3    0 0 0 0 1 1 1 1
          y2    0 0 1 1 0 0 1 1
          y1    0 1 0 1 0 1 0 1
x6x5x4
000             0 0 0 0 0 0 0 0
001             0 0 0 0 0 0 0 0
010             2 0 1 0 0 0 3 0
011             0 0 4 0 0 0 0 0
100             0 5 0 0 0 0 0 0
101             0 0 0 0 0 0 0 0
110             0 0 0 0 6 0 0 0
111             0 0 0 0 0 7 0 0

Table 8. Decomposition Chart for $\hat{f}_1(Y_1, X_2)$.
          y3    0 0 0 0 1 1 1 1
          y2    0 0 1 1 0 0 1 1
          y1    0 1 0 1 0 1 0 1
x6x5x4
000             0 0 0 0 0 0 0 0
001             0 0 0 0 0 0 0 0
010             2 0 1 0 0 0 3 0
011             0 0 0 0 0 0 0 0
100             0 5 0 0 0 0 0 0
101             0 0 0 0 0 0 0 0
110             0 0 0 0 6 0 0 0
111             0 0 0 0 0 7 0 0
Let $\hat{f}(Y_1, X_2)$ be the function whose variables $X_1 = (x_1, x_2, \ldots, x_p)$ are replaced by $Y_1 = (y_1, y_2, \ldots, y_p)$, where $y_i = x_i \oplus x_j$, $x_j \in \{X_2\}$, and $p \ge \lceil \log_2(k+1) \rceil$.
Example 4.3 Table 6 shows a decomposition chart for the index generation function shown in Example 4.2. The column labels $X_1 = (x_1, x_2, x_3)$ denote the bound variables, and the row labels $X_2 = (x_4, x_5, x_6)$ denote the free variables. The corresponding chart entry denotes the function value. Table 7 shows the decomposition chart for $\hat{f}(Y_1, X_2)$, where $Y_1 = (x_1 \oplus x_6, x_2 \oplus x_5, x_3 \oplus x_4)$. In Table 7, the column labels denote $Y_1 = (y_1, y_2, y_3)$, and the row labels denote $X_2 = (x_4, x_5, x_6)$. If a column of a decomposition chart has two or more non-zero elements, then the column has a collision. The number of collisions is three in Table 6, while it is only one in Table 7.
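
The collision counts of Example 4.3 can be reproduced with a short Python sketch. The registered vectors are transcribed from Table 5; the helper names `bound_x1`, `bound_y1`, and `collisions` are illustrative assumptions.

```python
from collections import Counter

# Registered vectors (x1, ..., x6) and their indices, transcribed from Table 5.
TABLE5 = {
    (0, 0, 0, 0, 1, 0): 1,
    (0, 1, 0, 0, 1, 0): 2,
    (0, 0, 1, 0, 1, 0): 3,
    (0, 0, 1, 1, 1, 0): 4,
    (0, 0, 0, 0, 0, 1): 5,
    (1, 1, 1, 0, 1, 1): 6,
    (0, 1, 0, 1, 1, 1): 7,
}

def bound_x1(v):
    """Bound variables of Table 6: X1 = (x1, x2, x3)."""
    return v[:3]

def bound_y1(v):
    """Bound variables of Table 7: Y1 = (x1^x6, x2^x5, x3^x4)."""
    x1, x2, x3, x4, x5, x6 = v
    return (x1 ^ x6, x2 ^ x5, x3 ^ x4)

def collisions(bound) -> int:
    """Number of chart columns that hold two or more non-zero entries."""
    columns = Counter(bound(v) for v in TABLE5)
    return sum(1 for count in columns.values() if count >= 2)

print(collisions(bound_x1), collisions(bound_y1))
```

The XOR with free variables spreads the registered vectors over more columns, which is the role of the hash circuit in the IGU.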
Table 9. Main Memory $\hat{h}(Y_1)$ for $\hat{f}_1(Y_1)$.
y3          0 0 0 0 1 1 1 1
y2          0 0 1 1 0 0 1 1
y1          0 1 0 1 0 1 0 1
$\hat{h}$   2 5 1 0 6 7 3 0
In Table 7, assume that the element '4' in the column (0,1,0) is realized by another circuit. By removing '4' from $\hat{f}$, we have $\hat{f}_1$, whose decomposition chart is shown in Table 8, where no collision occurs. Note that we can represent the non-zero elements of $\hat{f}_1$ by a main memory $\hat{h}$ whose input is $Y_1$. Table 9 shows the function $\hat{f}_1(Y_1)$ of the main memory. The main memory realizes a mapping from a set of $2^p$ elements to a set of at most $2^p$ elements. The output of the main memory does not always represent f, since $\hat{f}_1$ ignores $X_2$. Thus, we must check whether $\hat{f}_1$ is equal to f or not by using an auxiliary memory. To do this, we compare the input $X_2$ with the output of the auxiliary memory by a comparator. The auxiliary memory stores the values of $X_2$ when the output of $\hat{f}_1(Y_1, X_2)$ is non-zero. Fig. 5 shows the index generation unit (IGU) [18]. First, the programmable hash circuit shown in Fig. 6 generates the hashed inputs $Y_1$ from the primary inputs $(X_1, X_2)$, where $|X_1| = |Y_1|$. Second, the main memory finds the possible index corresponding to $Y_1$. Third, the auxiliary memory produces the corresponding inputs $X'_2$. Fourth, the comparator checks whether $X'_2$ is equal to $X_2$ or not. And, finally, the AND gates produce the correct value $\hat{f}(Y_1, X_2)$.

[Figure 5. Index Generation Unit (IGU). Blocks: programmable hash circuit, main memory, AUX memory, comparator, and AND gates.]

[Figure 6. Programmable Hash Circuit.]
Example 4.4 Fig. 7 shows the circuit for the 6-variable function in Table 6. The element '4' is realized by the AND gate, while the other elements are realized by the IGU. The binary representation of '4' is (1,0,0). The output of the function is realized by ORing the most significant bit of the output of the IGU with the output of the AND gate.
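
The five steps of the IGU and the arrangement of Example 4.4 can be modeled in Python as follows. This is a behavioral sketch, not the circuit itself: dictionaries stand in for the main and auxiliary memories, the auxiliary memory is assumed to be addressed by the candidate index, and the names `hash_y1`, `free_x2`, `igu`, and `f` are illustrative.

```python
from itertools import product

# Registered vectors (x1, ..., x6) and their indices, transcribed from Table 5.
TABLE5 = {
    (0, 0, 0, 0, 1, 0): 1,
    (0, 1, 0, 0, 1, 0): 2,
    (0, 0, 1, 0, 1, 0): 3,
    (0, 0, 1, 1, 1, 0): 4,
    (0, 0, 0, 0, 0, 1): 5,
    (1, 1, 1, 0, 1, 1): 6,
    (0, 1, 0, 1, 1, 1): 7,
}

def hash_y1(v):
    """Hash circuit: Y1 = (x1^x6, x2^x5, x3^x4), as in Example 4.3."""
    return (v[0] ^ v[5], v[1] ^ v[4], v[2] ^ v[3])

def free_x2(v):
    """Free variables X2 = (x4, x5, x6)."""
    return v[3:]

main_memory = {}   # Y1 -> candidate index (contents of Table 9)
aux_memory = {}    # candidate index -> X2 stored for that index
leftover = {}      # vectors that collide in the main memory
for vec, idx in TABLE5.items():
    y1 = hash_y1(vec)
    if y1 in main_memory:
        leftover[vec] = idx              # for Table 5 this is only index 4
    else:
        main_memory[y1] = idx
        aux_memory[idx] = free_x2(vec)

def igu(v) -> int:
    """Main memory lookup, then comparator and AND gates on X2."""
    candidate = main_memory.get(hash_y1(v), 0)
    if candidate and aux_memory[candidate] == free_x2(v):
        return candidate
    return 0

def f(v) -> int:
    """IGU output ORed with the separate AND gate that detects index 4."""
    and_gate = 4 if v in leftover else 0
    return igu(v) | and_gate

print(sorted(leftover.values()))
print(all(f(v) == TABLE5.get(v, 0) for v in product((0, 1), repeat=6)))
```

Because $Y_1$ is derived from $X_1$ by XOR with bits of $X_2$, agreement on both $Y_1$ and $X_2$ implies agreement on the whole input vector, so the comparator on $X_2$ alone is sufficient.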
4.3 Capability of the Index Generation Unit
Theorem 4.1 [17] Let f be an n-variable index generation function with weight k. Let the non-zero elements of f be uniformly distributed in the decomposition chart of f. Then, the fraction of registered vectors realized by the index generation unit (IGU) in Fig. 5 is given by
$\delta \simeq 1 - \frac{1}{2}\left(\frac{k}{2^p}\right) + \frac{1}{6}\left(\frac{k}{2^p}\right)^2$,
where $p = |Y_1|$ denotes the number of bound variables in the decomposition chart for $f(Y_1, X_2)$, and $k \le 2^p$.
[Figure 7. A realization of the 6-variable function shown in Table 6. Blocks: hash circuit, main memory, AUX memory, and comparator.]
Example 4.5 When $\frac{k}{2^p} = 1$, we have $\delta = \frac{2}{3} \simeq 0.666$. In this case, the main memory realizes 66.6% of the registered vectors. Note that the programmable hash circuit is used to make f uniformly distributed.
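
The estimate of Theorem 4.1 is easy to evaluate for other load factors. The sketch below is a direct transcription of the formula; the function name `delta` is an assumption.

```python
def delta(k: int, p: int) -> float:
    """Fraction of registered vectors realized by one IGU (Theorem 4.1).

    k is the number of registered vectors, p the number of main memory inputs.
    """
    x = k / 2**p
    return 1 - x / 2 + x * x / 6

print(delta(2**19, 19))      # load factor k / 2^p = 1, as in Example 4.5
print(delta(514_287, 19))    # load factor slightly below 1
print(delta(514_287, 20))    # doubling the main memory raises the fraction
```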
Experimental results show that, by increasing the number of inputs of the main memory, we can store virtually all vectors.
Conjecture 4.1 [19] Consider a set of uniformly distributed index generation functions with weight k. In most cases, an index generation function can be represented by an IGU with the main memory having at most
$p = 2 \lceil \log_2(k+1) \rceil - 1$
inputs.
4.4 The Parallel Sieve Method
From Theorem 4.1, given p and k, we can estimate the number of vectors realized by the IGU. Consider virus scanning where the total number of registered vectors k is 514,287, and the length of the patterns c is 4. As shown in (1), the single-memory realization requires $2^{4 \times 8} \times 514{,}287 \simeq 2^{51}$ [bits], which is impractical. Suppose that all the registered vectors are stored in a single IGU. From Conjecture 4.1, the number of inputs for the main memory is $p = 2 \lceil \log_2(514{,}287 + 1) \rceil - 1 = 37$, which is also too large, since the memory size is $2^{37}$ [bits]. Thus, we implement the function by applying Theorem 4.1 many times. Consider the case where $\frac{514{,}287}{2^p} \simeq 1$. The number of inputs p is 19, which is practical for modern SRAMs. In this case, we have $\delta = \frac{2}{3} \simeq 0.666$. By storing registered vectors separately in the main memories of the IGUs shown in Fig. 8, we can implement most of the vectors in a series of main memories of IGUs. A small number of remaining vectors can be implemented by an additional IGU by using Conjecture 4.1.
Example 4.6 When the number of remaining vectors is 200, by Conjecture 4.1, an IGU having a main memory with 15 inputs can implement all the remaining vectors.
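
The bound of Conjecture 4.1 and the numbers used in this section can be checked with a small helper. The function name `main_memory_inputs` is an assumption; the identity between the ceiling of the logarithm and Python's `int.bit_length` holds for positive integers.

```python
def main_memory_inputs(k: int) -> int:
    """Upper bound on main memory inputs from Conjecture 4.1: 2*ceil(log2(k+1)) - 1.

    For k >= 1, ceil(log2(k+1)) equals k.bit_length(), which avoids floating point.
    """
    return 2 * k.bit_length() - 1

print(main_memory_inputs(514_287))   # all patterns in a single IGU
print(main_memory_inputs(200))       # the remaining vectors of Example 4.6
```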
[Figure 8. Parallel Sieve Method. IGU1, IGU2, ..., IGUt each consist of a programmable hash circuit, a main memory, an AUX memory, and a comparator; their outputs are combined into a single index of $\lceil \log_2(k+1) \rceil$ bits.]
Definition 4.2 The parallel sieve method is an implementation of an index generation function using a circuit consisting of multiple IGUs, as shown in Fig. 8. $IGU_{i+1}$ is used to realize a part of the registered vectors not realized by $IGU_1$, $IGU_2$, ..., or $IGU_i$. The OR gate at the output combines the indices to form a single output.
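
Definition 4.2 can be illustrated on the small function of Table 5. The sketch below builds a chain of IGU models in which each stage stores the vectors that collided in the previous one, and ORs the stage outputs. It is a simplification: all stages reuse the same hash, whereas each IGU in Fig. 8 has its own programmable hash circuit and its own main memory size; the names `build_sieve` and `sieve_lookup` are illustrative.

```python
from itertools import product

# Registered vectors (x1, ..., x6) and their indices, transcribed from Table 5.
TABLE5 = {
    (0, 0, 0, 0, 1, 0): 1,
    (0, 1, 0, 0, 1, 0): 2,
    (0, 0, 1, 0, 1, 0): 3,
    (0, 0, 1, 1, 1, 0): 4,
    (0, 0, 0, 0, 0, 1): 5,
    (1, 1, 1, 0, 1, 1): 6,
    (0, 1, 0, 1, 1, 1): 7,
}

def hash_y1(v):
    """Hash shared by all stages in this sketch: (x1^x6, x2^x5, x3^x4)."""
    return (v[0] ^ v[5], v[1] ^ v[4], v[2] ^ v[3])

def free_x2(v):
    return v[3:]

def build_sieve(vectors):
    """Return a list of (main_memory, aux_memory) pairs, one per IGU."""
    stages = []
    remaining = dict(vectors)
    while remaining:
        main, aux, rest = {}, {}, {}
        for vec, idx in remaining.items():
            y1 = hash_y1(vec)
            if y1 in main:
                rest[vec] = idx          # sieved out: passed to the next IGU
            else:
                main[y1] = idx
                aux[idx] = free_x2(vec)
        stages.append((main, aux))
        remaining = rest
    return stages

def sieve_lookup(stages, v) -> int:
    """OR together the outputs of all IGUs; at most one of them is non-zero."""
    out = 0
    for main, aux in stages:
        candidate = main.get(hash_y1(v), 0)
        if candidate and aux[candidate] == free_x2(v):
            out |= candidate
    return out

stages = build_sieve(TABLE5)
print([len(main) for main, _ in stages])   # vectors stored per IGU
print(all(sieve_lookup(stages, v) == TABLE5.get(v, 0)
          for v in product((0, 1), repeat=6)))
```

Each registered vector is stored in exactly one IGU and every IGU verifies $X_2$, so a plain OR is enough to merge the outputs.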
4.5 Number of IGUs in the Parallel Sieve Method
Let k be the number of registered vectors, and p be the number of inputs for the first main memory. From Theorem 4.1, when $\frac{k}{2^p} = 1$, the main memory stores $\delta = \frac{2}{3}$ of the registered vectors. In this case, the fraction for the remaining vectors is $\gamma = 1 - \delta = \frac{1}{3}$. Next, choose $p'$ so that the second main memory stores $\delta = \frac{2}{3}$ of the remaining vectors. Then, the fraction of vectors realized by the second main memory is $\gamma\delta = \frac{1}{3} \times \frac{2}{3} = \frac{2}{9}$. Therefore, the fraction for the vectors realized by two main memories is $\frac{2}{3} + \frac{2}{9} = \frac{6+2}{9} = \frac{8}{9}$. For each step, we choose the smallest integer $p_j$ such that $2^{p_j}$ is greater than the number of remaining registered vectors. By generalizing this, we have the following:
Theorem 4.2 Let k be the total number of registered vectors, t be the number of index generation units (IGUs), and r be the number of vectors not realized by these t IGUs. Then,
$t = \left\lceil \log_\gamma\left(\frac{r}{k}\right) \right\rceil$, (2)
where $\delta$ is given by Theorem 4.1, $\gamma + \delta = 1$, and the number of inputs for each main memory $p_j$ satisfies the relation: (the number of remaining registered vectors) $\le 2^{p_j}$.
(Proof) Let $\delta$ be the fraction of the vectors realized by the i-th IGU, and let $\gamma$ be the fraction for the remaining vectors. Then, the fraction of vectors realized by t IGUs is
$\delta + \gamma\delta + \gamma^2\delta + \cdots + \gamma^{t-1}\delta = \delta \frac{1 - \gamma^t}{1 - \gamma} = 1 - \gamma^t$.
Since we try to realize k vectors, and r is the number of vectors not realized by t IGUs, we have
$r = k - k(1 - \gamma^t) = k\gamma^t$.
By solving the above expression for t, we have
$t = \log_\gamma\left(\frac{r}{k}\right)$.
Since t is an integer, we have (2). (Q.E.D.)
Given the number of remaining vectors r, Theorem 4.2 shows the necessary number of IGUs. To store most vectors in IGUs, the number of IGUs should be large. This makes the circuit complicated. In this paper, we store vectors in the IGUs using off-chip SRAMs until the remaining vectors can be stored in the embedded RAMs of the FPGA. Let the number of remaining registered vectors be $k_{t+1}$. By Conjecture 4.1, the number of inputs $p_{t+1}$ for the (t+1)-th main memory must satisfy the relation $p_{t+1} \le 2 \lceil \log_2(k_{t+1} + 1) \rceil - 1$. Note that the size of the embedded memory for the Altera FPGA is 9 kbit. When $k_{t+1} \le 255$, we have $p_{t+1} \le 15$, which is a practical value for the embedded memory. In this way, we can store all the remaining vectors in an embedded memory of an FPGA.
Example 4.7 Let $k_j$ be the number of registered vectors to be implemented by $IGU_j$, $IGU_{j+1}$, ..., and $IGU_{t+1}$, $p_j$ be the number of inputs for the j-th main memory, t be the number of index generation units (IGUs), and r be the number of vectors not realized by these t IGUs. Consider virus scanning with k = 514,287. When $2^{p_j} = k_j$, we have $\gamma = \frac{1}{3}$. From Theorem 4.2, when r = 255, the number of necessary IGUs is
$t = \left\lceil \log_{\frac{1}{3}} \frac{255}{514{,}287} \right\rceil = 7$.
Finally, we use an additional IGU to store the remaining 255 vectors. Thus, eight IGUs are used to store all the patterns.
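
The count in Example 4.7 follows directly from (2). The sketch below evaluates it and the geometric sum used in the proof of Theorem 4.2; the function names `num_igus` and `fraction_realized` are assumptions.

```python
from math import ceil, log

def num_igus(k: int, r: int, gamma: float = 1 / 3) -> int:
    """Number of IGUs from Theorem 4.2: t = ceil(log_gamma(r / k))."""
    return ceil(log(r / k) / log(gamma))

def fraction_realized(t: int, gamma: float = 1 / 3) -> float:
    """Fraction of vectors realized by t IGUs: 1 - gamma^t."""
    return 1 - gamma**t

t = num_igus(514_287, 255)
print(t)                          # IGUs needed before at most 255 vectors remain
print(t + 1)                      # plus one IGU for the remaining vectors
print(fraction_realized(2))       # two main memories realize 8/9 of the vectors
```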
Example 4.8 Table 10 compares the estimated values of stored vectors with the experimental values. In Table 10, IGU j denotes the j-th index generation unit; $\hat{k}_j$ denotes the number of vectors implemented by $IGU_j$, where each vector consists of four characters rather than the original registered vector, so that $\hat{k} < k$; and $r_j$ denotes the number of remaining vectors. We can see that the necessary number of IGUs is eight, as derived in Example 4.7.
In the parallel sieve method, we consider the patterns consisting of four characters. Thus, the number of unique patterns is $\hat{k} = 497{,}172$, rather than k = 514,287.
Table 10. Comparison of the Estimated Value with the Experimental Value.
             Estimated Value                     Experimental Value
IGU j    $p_j$   $\hat{k}_j$   $r_j$         $p_j$   $\hat{k}_j$   $r_j$
IGU 1    19      331,414       165,757       19      321,659       175,513
IGU 2    18      110,493       55,263        18      128,398       47,115
IGU 3    16      36,838        18,424        16      33,791        13,324
IGU 4    15      12,281        6,142         14      9,267         4,057
IGU 5    13      4,094         2,047         12      2,641         1,416
IGU 6    11      1,364         682           11      1,055         361
IGU 7    10      454           227           9       277           84
IGU 8    15      227           0             9       84            0

5 Experimental Results
5.1 Implementation of a Virus Scanning Engine
We implemented a virus scanning engine using the parallel sieve method on an Altera FPGA. In our implementation, we used a StratixIII EP3SL340H1152C3NE5 (containing 270,400 ALUTs and 1,040 M9ks). To perform the total match, we used the embedded processor NiosII/f. We stored the executable code and the full length of the k = 514,287 patterns in the 1 GByte DDR2 SDRAM. Note that we also stored $\sum_{j=1}^{8} \hat{k}_j = 497{,}172$ patterns consisting of four characters in SRAMs for the parallel sieve method. For the FPGA synthesis tool, we used QuartusII (v.8.0).
Table 11 shows the memories used in the parallel sieve method. In Table 11, IGU j denotes the j-th index generation unit; $\hat{k}_j$ denotes the number of stored vectors in $IGU_j$; INOUT denotes the number of inputs (i) and outputs (o); and Memory denotes the type of the memory, where SRAM denotes the 8 MB off-chip SRAM, and M144k and M9k denote embedded memories on the Altera FPGA. In our implementation, the parallel sieve method uses three off-chip SRAMs, 3,790 ALUTs, 31 M9ks, and 13 M144ks (see footnote 4). Note that these values do not include the resources for the NiosII/f. The FPGA operates at 370.1 MHz. However, due to a limitation on the clock frequency by the off-chip SRAM, we set the system clock to 200 MHz. Our virus engine scans one character in every clock. Thus, the throughput is 0.200 × 8 = 1.600 Gbps. The system stores 514,287 patterns, while the parallel sieve method stores 497,172 unique patterns consisting of four characters. From Table 11, the total amount of memory is 3,500,880 Bytes. Let the memory utilization coefficient (MUC) be the necessary amount of memory per character. Then, the MUC for the parallel sieve method is
$\frac{3{,}500{,}880}{497{,}172 \times 4} = 1.7$ Bytes/Char.
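
The throughput and MUC figures above follow from simple arithmetic, reproduced in the sketch below with the numbers stated in this section (the variable names are illustrative). The quotient evaluates to roughly 1.76, which the text reports as 1.7 Bytes/Char.

```python
# Figures taken from Section 5.1.
clock_mhz = 200                  # system clock limited by the off-chip SRAM
bits_per_char = 8                # one character is scanned per clock
total_memory_bytes = 3_500_880   # total amount of memory from Table 11
unique_patterns = 497_172        # unique four-character patterns
chars_per_pattern = 4

throughput_gbps = clock_mhz * 1e6 * bits_per_char / 1e9
muc = total_memory_bytes / (unique_patterns * chars_per_pattern)

print(f"throughput = {throughput_gbps:.3f} Gbps")
print(f"MUC = {muc:.2f} Bytes/Char")
```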
5.2Comparison with Existing Methods
Table 12 compares existing regular expression match
ing methods. They use various methods in different tech
nologies: FPGAs and ASICs. To make a fair compari
son, we use the normalized throughput Th/MUC. In
Table 12, Th denotes the throughput for a pattern match
ing engine [Gbps]; # of patterns denotes the number of
stored patterns; MUC denotes the memory utilization co
efficient [Bytes/Char]; and Th/MUC denotes the normal
ized throughput.
Table 12 shows that only our method can store 497,172 virus patterns on a single FPGA and off-chip SRAMs. Also, as for Th/MUC, our method is about 470.5 times better than the AC method [23], and is 1.4 to 31.3 times better than the other methods. The reason is that the MUC for our method is quite small, since the parallel sieve method stores patterns in compact memories. As for Th/MUC, Yu et al.'s method [24] is second to our method. However, they use a TCAM, which is quite expensive and dissipates much power. If we consider the cost of the TCAM in Table 1, our method is much better than that of Yu et al. [24].
⁴ Excludes the MD5 checksum hardware.
⁵ Since the off-chip SRAM has 72 outputs, three main memories are stored in a single SRAM.
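As a rough check of the comparison in Table 12, the normalized throughput Th/MUC can be recomputed from the Th and MUC columns. This is a sketch only: the data are the rounded table entries, the names are ours, and the resulting ratios can therefore differ slightly from the figures quoted in the text.

```python
# (method, Th [Gbps], MUC [Bytes/char]) as listed in Table 12.
TABLE_12 = [
    ("Aho-Corasick (AC) [23]", 6.0, 2896.2),
    ("Aldwairi et al. [2]", 14.0, 126.0),
    ("Bitmap compressed AC [23]", 8.0, 154.0),
    ("Path compressed AC [23]", 8.0, 60.0),
    ("Alicherry et al. [3]", 20.0, 48.0),
    ("Yu et al. [24]", 2.0, 3.0),
    ("USC RegExpController [5]", 1.4, 46.0),
    ("Hardware Bloom Filter [4]", 0.5, 1.5),
    ("Parallel sieve method", 1.6, 1.7),
]


def normalized_throughput(th_gbps, muc_bytes_per_char):
    """Throughput divided by the memory needed per stored character."""
    return th_gbps / muc_bytes_per_char


scores = {name: normalized_throughput(th, muc) for name, th, muc in TABLE_12}
ours = scores["Parallel sieve method"]

# Rank the methods from best to worst normalized throughput.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    ratio = ours / scores[name]
    print(f"{name:28s} Th/MUC = {scores[name]:.4f}   ours/this = {ratio:6.1f}x")
```

With these inputs the parallel sieve method ranks first and Yu et al.'s method second, in agreement with the discussion above.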
6 Conclusion and Comments
In this paper, we showed a new architecture for a virus scanning system, which is different from that of an intrusion detection system. The proposed method uses two-stage matching: in the first stage, the hardware filter quickly scans the text to find partial matches, and in the second stage, the MPU scans the full text to find the total match. To make the hardware filter simple, we used a finite-input memory machine (FIMM). To reduce the memory size for the FIMM, we introduced the parallel sieve method. The proposed method is memory-based, so it is quickly reconfigurable and dissipates lower power than a TCAM-based method. The system for 514,287 virus patterns was implemented on the Stratix III FPGA with three off-chip SRAMs and an SDRAM. Compared with existing methods, our method achieved a 1.41 to 31.36 times more efficient area-throughput ratio.
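The two-stage idea summarized above can be illustrated in software. The sketch below is not the hardware design: a Python dictionary stands in for the FIMM and the parallel sieve memories, and we assume, purely for illustration, that the four characters stored for each pattern are its first four. All names and sample patterns are hypothetical.

```python
WINDOW = 4  # the filter stores four-character subpatterns


def build_filter(patterns):
    """Map each four-byte prefix to the full patterns that share it.

    Using the prefix is an assumption made for this sketch only.
    """
    table = {}
    for pattern in patterns:
        if len(pattern) < WINDOW:
            raise ValueError("pattern shorter than the filter window")
        table.setdefault(pattern[:WINDOW], []).append(pattern)
    return table


def scan(text, table):
    """Return (total matches, number of partial matches raised by stage 1)."""
    matches = []
    partial_hits = 0
    for pos in range(len(text) - WINDOW + 1):
        # Stage 1: cheap lookup of the current four-byte window.
        candidates = table.get(text[pos:pos + WINDOW])
        if candidates is None:
            continue
        partial_hits += 1
        # Stage 2: full comparison, done by the MPU in the real system.
        for pattern in candidates:
            if text.startswith(pattern, pos):
                matches.append((pos, pattern))
    return matches, partial_hits


patterns = [b"EVILCODE", b"EVILDOER", b"MALWARE!"]
text = b"xxEVILCOxxEVILCODExxMALWARE!"
matches, partial_hits = scan(text, build_filter(patterns))
print(matches)       # total matches found by stage 2
print(partial_hits)  # how often stage 1 would interrupt the MPU
```

In this toy input the filter fires three times but only two windows survive the full comparison, which shows why the second stage is needed and why every partial match costs MPU time.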
Our virus scanning engine has a vulnerability. When an attacker sends a sequence of subpatterns stored in our engine (a performance attack), the engine generates an IRQ in every clock cycle and overloads the MPU. Kumar et al. [12] have proposed a method to protect against performance attacks. It attaches a flow counter to the FIFO in Fig. 4. When the value of the counter exceeds the threshold, the circuit detects the performance attack. Kumar et al.'s method can be applied to our virus scanning engine.
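A flow counter of this kind could behave roughly as follows. This is a minimal software sketch under our own assumptions: the counter tracks partial matches inside a sliding window of clock cycles, and the window length and threshold are arbitrary illustrative values. It is not the circuit of Kumar et al. [12], whose details are not given here.

```python
from collections import deque


class FlowCounter:
    """Counts partial matches seen during the last `window_cycles` cycles."""

    def __init__(self, window_cycles, threshold):
        self.window_cycles = window_cycles  # hypothetical window length
        self.threshold = threshold          # hypothetical alarm threshold
        self._hits = deque()                # cycle numbers of recent partial matches

    def tick(self, cycle, partial_match):
        """Advance one clock cycle; return True if an attack is suspected."""
        if partial_match:
            self._hits.append(cycle)
        # Forget partial matches that have left the sliding window.
        while self._hits and self._hits[0] <= cycle - self.window_cycles:
            self._hits.popleft()
        return len(self._hits) > self.threshold


# Benign traffic: a partial match only once every 50 cycles.
benign_counter = FlowCounter(window_cycles=100, threshold=8)
benign_alarms = [benign_counter.tick(c, c % 50 == 0) for c in range(1000)]

# Performance attack: a stored subpattern arrives in every cycle.
attack_counter = FlowCounter(window_cycles=100, threshold=8)
attack_alarms = [attack_counter.tick(c, True) for c in range(20)]

print(any(benign_alarms))         # no alarm for the benign stream
print(attack_alarms.index(True))  # cycle at which the attack is flagged
```

Once the alarm is raised, the system could stop forwarding IRQs for the offending flow. What to do after detection is a policy choice that this sketch leaves open.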
7 Acknowledgments
This research is supported in part by the Grants in Aid for Scientific Research of JSPS, and the grant of Innovative Cluster Project of MEXT (the second stage). Discussions with Prof. Jon T. Butler and Mr. Hisashi Kajiwara were quite useful.

Table 11. Memories used in the Sieve Method.

| IGU j | $\hat{k}_j$ | Main Memory IN/OUT | Main Memory type | AUX Memory IN/OUT | AUX Memory type |
|---|---|---|---|---|---|
| IGU 1 | 321,659 | i=19, o=19 | SRAM⁵ | i=19, o=13 | SRAM |
| IGU 2 | 128,398 | i=18, o=18 | SRAM⁵ | i=18, o=14 | SRAM |
| IGU 3 | 33,791 | i=16, o=16 | SRAM⁵ | i=16, o=16 | 8 M144k |
| IGU 4 | 9,267 | i=14, o=14 | 2 M144k | i=14, o=18 | 3 M144k |
| IGU 5 | 2,641 | i=12, o=12 | 6 M9k | i=12, o=20 | 10 M9k |
| IGU 6 | 1,055 | i=11, o=11 | 3 M9k | i=11, o=21 | 6 M9k |
| IGU 7 | 277 | i=9, o=9 | 1 M9k | i=9, o=23 | 2 M9k |
| IGU 8 | 84 | i=9, o=7 | 1 M9k | i=9, o=23 | 2 M9k |

Table 12. Comparison with Existing Methods.

| Method | Th [Gbps] | # of Patterns | MUC [Bytes/char] | Th/MUC | Comment |
|---|---|---|---|---|---|
| Aho-Corasick Method (AC method) [23] | 6.0 | 1,533 | 2,896.2 | 0.0020 | ASIC implementation |
| Aldwairi et al. [2] | 14.0 | 1,542 | 126.0 | 0.1111 | FPGA+SRAM |
| Bitmap compressed Aho-Corasick [23] | 8.0 | 1,533 | 154.0 | 0.0519 | ASIC implementation |
| Path compressed Aho-Corasick [23] | 8.0 | 1,533 | 60.0 | 0.1333 | ASIC implementation |
| Alicherry et al. [3] | 20.0 | 100 | 48.0 | 0.4166 | with TCAM |
| Yu et al. [24] | 2.0 | 1,768 | 3.0 | 0.6666 | FPGA+TCAM+MPU |
| USC RegExpController [5] | 1.4 | 1,316 | 46.0 | 0.0304 | FPGA+MPU |
| Hardware Bloom Filter [4] | 0.5 | 35,475 | 1.5 | 0.3333 | FPGA+SDRAM |
| The parallel sieve Method | 1.6 | 497,172 | 1.7 | 0.9417 | FPGA+MPU+SRAM+SDRAM |
References
[1] A. V. Aho and M. J. Corasick, “Efficient string matching: an aid to bibliographic search,” Communications of the ACM, 18(6):333-340, 1975.
[2] M. Aldwairi, T. Conte, and P. Franzon, “Configurable string matching hardware for speeding up intrusion detection,” SIGARCH Comput. Archit. News, vol. 33, no. 1, pp. 99-107, 2005.
[3] M. Alicherry, M. Muthuprasanna, and V. Kumar, “High speed pattern matching for network IDS/IPS,” IEEE Int. Conf. on Network Protocols (ICNP’06), pp. 187-196, 2006.
[4] M. Attig, S. Dharmapurikar, and J. Lockwood, “Implementation results of bloom filters for string matching,” IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’04), pp. 322-323, 2004.
[5] Z. K. Baker, H. Jung, and V. K. Prasanna, “Regular expression software deceleration for intrusion detection systems,” 16th Int. Conf. on Field Programmable Logic and Applications (FPL’06), pp. 28-30, 2006.
[6] J. Bispo, I. Sourdis, J. M. P. Cardoso, and S. Vassiliadis, “Regular expression matching for reconfigurable packet inspection,” 16th Int. Conf. on Field Programmable Logic and Applications (FPL’06), pp. 119-126, 2006.
[7] Clam AntiVirus, http://www.clamav.net/
[8] J. Ditmar, K. Torkelsson, and A. Jantsch, “A dynamically reconfigurable FPGA-based content addressable memory for internet protocol,” International Conference on Field Programmable Logic and Applications 2000 (FPL2000), pp. 19-28.
[9] W. Jiang, Q. Wang, and V. K. Prasanna, “Beyond TCAMs: An SRAM-based parallel multi-pipeline architecture for terabit IP lookup,” 27th IEEE Int. Conf. on Computer Communications (INFOCOM 2008), pp. 1786-1794, 2008.
[10] Kaspersky, http://www.kaspersky.com/
[11] Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill Inc., 1979.
[12] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese, “Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia,” 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS’07), pp. 155-164, 2007.
[13] K. McLaughlin, N. O’Connor, and S. Sezer, “Exploring CAM design for network processing using FPGA technology,” Proceedings of the Advanced Int’l Conference on Telecommunications and Int’l Conference on Internet and Web Applications and Services (AICT/ICIW 2006), p. 84.
[14] K. Pagiamtzis and A. Sheikholeslami, “A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sept. 2004.
[15] PCRE: Perl Compatible Regular Expressions, http://www.pcre.org/
[16] H. C. Roan, W. J. Hwang, and C. T. Dan Lo, “Shift-or circuit for efficient network intrusion detection pattern matching,” Proc. Int. Conf. on Field Programmable Logic and Applications (FPL’06), pp. 785-790, 2006.
[17] T. Sasao, “A design method of address generators using hash memories,” IWLS 2006, Vail, Colorado, U.S.A., June 7-9, 2006, pp. 102-109.
[18] T. Sasao and M. Matsuura, “An implementation of an address generator using hash memories,” DSD 2007, 10th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, Aug. 27-31, 2007, Lubeck, Germany, pp. 69-76.
[19] T. Sasao, “On the number of variables to represent sparse logic functions,” ICCAD 2008, San Jose, California, USA, Nov. 10-13, 2008, pp. 45-51.
[20] SNORT, http://www.snort.org/
[21] L. Tan and T. Sherwood, “A high throughput string matching architecture for intrusion detection and prevention,” Proceedings of the 32nd Int. Symp. on Computer Architecture (ISCA’05), pp. 112-122, 2005.
[22] TrendMicro, Network VirusWall Enforcer, http://us.trendmicro.com/.
[23] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, “Deterministic memory-efficient string matching algorithms for intrusion detection,” 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’04), pp. 333-340, 2004.
[24] F. Yu, R. H. Katz, and T. V. Lakshman, “Gigabit rate packet pattern matching using TCAM,” IEEE Int. Conf. on Network Protocols (ICNP’04), pp. 174-183, 2004.