A Large-Scale Hardware Timer Manager

Silvio Dragone, Andreas Döring
IBM Research GmbH
Zurich Research Laboratory
CH-8803 Rüschlikon, Switzerland
{sid,ado}@zurich.ibm.com

Rainer Hagenau
Institute of Computer Engineering
University of Lübeck
D-23538 Lübeck, Germany
hagenau@iti.uni-luebeck.de
Abstract
Timers are used throughout network protocols, particularly for packet loss detection and for connection management. Thus, at least one timer is used per connection. Internet servers and gateways need to serve several thousand simultaneously open connections, so a multiplicity of timers has to be managed simultaneously. To achieve scalable timer management, we present a large-scale hardware timer manager that can be implemented as a coprocessor in any network processing unit. This coprocessor uses on- and off-chip memory to handle the timers. The on-chip memory functions like a processor cache to reduce the number of external memory accesses and, therefore, to decrease operation latency. To sort the timers according to their expiration time, the data structure of the timer manager is based on the d-heap structure. We have simulated the model in SystemC to measure the performance of the timer operations: start, stop and expire. In this paper we present a hardware concept for a large-scale timer manager and discuss the simulation results to show its efficiency.
1. Introduction
Many protocols use "timers" to recover from all kinds of errors or failures, such as packet loss. When a packet is received, an acknowledgment is sent back to the sender. If the sender does not receive this acknowledgment within a certain time limit, the packet is assumed to be lost and the appropriate action, for example packet retransmission, is taken. A retransmission timer [6] is used to measure this time limit: it is started upon the packet transmission and stopped when an acknowledgment has been received for the packet; if it expires, the packet is assumed to be lost. Most communication methods also set up and break down connections using timers [6]. Such protocols rely on timer-based connection management.
Certain network nodes, such as Internet servers and gateways, serve up to several thousand simultaneously open connections, so the number of timers needed simultaneously is a multiple of the number of active open connections.
The trend in network engineering is increasingly to use programmable units, such as the so-called network processors [2], instead of Application-Specific Integrated Circuits (ASICs) to integrate the protocols. Therefore, the question arises whether the timer facilities should be handled in software or by hardware.
Because most network processors do not have a large-scale hardware timer manager, the timer manager is usually implemented in software. Varghese presented an efficient data structure for a software implementation in [7]. The data structure is based on a hashed and hierarchical list (timing wheel). However, this approach has the intrinsic problem of bad data locality, because operations on the same timer instance are temporally unrelated. With an increasing discrepancy between processor cycle time and memory latency, the resulting high processor cache miss rate will result in low performance. Our experience has shown that for one given application an entire processor thread of a network processor had to be dedicated solely to managing the timers in order to achieve the required performance.
As timers do not require much flexibility for the different kinds of network protocols, we decided to investigate the possibility of a large-scale hardware timer manager. This has the advantage that the hardware unit can handle the timer-related memory operations in parallel to the programmable processor's operation, at a much smaller cost than a separate processor core.
In this paper, we first explain the basic functionality of a timer and its implementation in current hardware solutions. Then we introduce the d-heap data structure we use in our hardware approach, and show what a possible hardware architecture for this data structure could look like. Finally, we report and discuss our measured results.
1.1. Basic Timer Definitions
Although used informally in the literature, the notions for the timer facilities are terms that are not generally used in protocol specifications. Therefore, we will use the notation introduced in [4] throughout this paper.

A timeout denotes an event. Such an event is produced by a timer, and can be consumed by a protocol machine. A timer is a device that generates timeouts. The point in time at which a timer has to produce the timeout is specified in terms of an absolute or relative time quantity. In conjunction with a timer, this quantity is called the time value.

A timer manager is a device that contains multiple timers. Every timer can be accessed over a single interface, and the timer manager autonomously serves all the timers.
The definition of the timer in terms of operation
primitives exchanged over the entity boundary of the
protocol machine [4] is as follows:
Primitive   Parameter
start       time value, timer identity
stop        timer identity
expire      timer identity

where the time value is the specified duration of the timer, and the timer identity is the name or label of the related timer.
The first two primitives are commands to initiate and terminate the operation of a single timer. In the protocol code, they appear as normal procedure/instruction calls. The third primitive is the notification of the actual timeout, and is communicated to the protocol entity as an interface event. From the protocol entity's point of view, the timer manager can be characterized as an abstract data type "timer". The first two operations create and delete instances of timers. The third operation is the "expired" message produced by a timer upon a timeout.
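To make this abstract data type concrete, the following is a minimal sketch of such a timer-manager interface in C++. All names (TimerManager, TimerId, TimeValue) are illustrative assumptions, not taken from the paper.

```cpp
#include <cstdint>

using TimerId   = std::uint32_t;  // "timer identity": name/label of a timer
using TimeValue = std::uint32_t;  // "time value": duration of the timer

// Abstract data type "timer" as seen by the protocol entity.
class TimerManager {
public:
    virtual ~TimerManager() = default;

    // start: initiate a timer; in the architecture of Section 3 the manager
    // itself generates the timer identity and returns it to the caller.
    virtual TimerId start(TimeValue duration) = 0;

    // stop: terminate a running timer before it expires.
    virtual void stop(TimerId id) = 0;

    // expire is not called by the protocol machine: the manager produces it
    // as an event (e.g., an interrupt carrying the timer identity).
};
```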
1.2. Related Work
Today's General-Purpose Processors (GPPs) have some hardware decrementers on-chip. A decrementer is a register of width w that decrements at the same rate as the time base increments. The decrementer is accessed using start and stop instructions. When a non-zero time value is written to the decrementer with the start instruction, it starts to decrement with the next time-base clock. A timeout is signaled when the value of 0 is reached. In terms of silicon area, this is a very expensive implementation for a large-scale timer manager. Therefore, it is not a solution for network processing units.
Heddes described a large-scale hardware timer manager in his PhD thesis [3]. His timer manager stores the time values of every timer in a data structure, on which the primitives start, stop and expire operate. The start primitive inserts a time value into the data structure, and the stop primitive removes it again. If a timer expires, a timeout event is generated and the corresponding time value is removed. The data structure used in [3] is based on a hierarchical list. The hierarchy is established by rounding the time value of each newly started timer to a fixed duration. In an on-chip Content-Addressable Memory (CAM), each entry then represents the head of the list of the timers with the same duration. Thus, every new timer merely has to be linked to the end of such a duration list. The entries of the CAM are continuously compared with the time base, and a match marks the timeout of a specific timer.
The disadvantage of this approach for many System-on-Chip (SoC) designers is that it assumes that CAMs are always available. If the CAM is replaced by standard logic, such as registers and comparators, the silicon area needed would be large. Furthermore, to achieve a constant stop operation time, the duration lists have to be doubly linked lists. To do so, two pointers have to be stored per list entry, which point to the previous and the next entry of the same list. Thus, the pointers produce a memory overhead for each entry.

The rounding of the time value of each initiated timer causes two further disadvantages. First, if the time values for a certain application are all about the same, silicon area is wasted because of the unused CAM entries, independent of the number of allocated timers. Second, the lost time-value precision causes problems for certain network applications such as voice data transmission and traffic shaping. If the time values are not rounded, the entire list would have to be searched during a timer initialization to find the appropriate place for a new time value.
2. Data Structure
Our timer manager is also based on a data structure. To choose such a data structure for a timer manager, two aspects must be combined:

1. the sorting of the timers according to their time values, and
2. the lookup of an arbitrary timer at a stop operation.

Therefore, the data structure has to represent both aspects to allow an efficient algorithm. To cover the sorting aspect, our timer manager is based on a d-heap structure. This data structure is derived from the efficient heapsort algorithm [1]. A binary heap is defined as a binary tree with a key in each node/leaf such that:

1. All leaves are on, at most, two adjacent levels.
2. All leaves on the lowest level occur to the left, and all levels except the lowest one are completely filled.
3. The key in the root node is smaller than or equal to the keys of all its children, and the left and right subtrees are again binary heaps.
For any integer $d$, the d-heap structure is similar to the binary heap structure, except that each node has at most $d$ children. The tree has a depth of $\lceil \log_d n \rceil$, where $n$ represents the number of elements of the tree. The $n$ heap elements are stored in array elements $0, \ldots, n-1$; the root element is in position $0$, and the elements have the following relationships:

$$\mathrm{Parent}(i) = \lceil i/d \rceil - 1$$
$$\mathrm{Children}(i) = \{\, di+1,\ di+2,\ \ldots,\ di+d \,\}$$

In contrast to a linear list, no pointers are needed, because of these relationships between parent and child nodes. Therefore, memory space is saved.
The complexity of the elementary operations in a software solution is the following:

siftup(t): $O(\log_d n)$. This takes an element $t$ whose key has been decreased and moves it up the tree to the appropriate place.

siftdown(t): $O(d \log_d n)$. This takes an element $t$ whose key has been increased and moves it down the tree to the appropriate place.
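As an illustration, a compact software version of these primitives on a plain d-heap stored in a single array might look as follows. This is a sketch of the textbook operations, not of the hardware design; note that siftdown scans all $d$ children per level, which is what makes it $O(d \log_d n)$.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Plain d-heap in one array, root at index 0:
// Parent(i) = ceil(i/d) - 1, Children(i) = {d*i + 1, ..., d*i + d}.
struct DHeap {
    std::size_t d;
    std::vector<unsigned> key;  // time values

    std::size_t parent(std::size_t i) const { return (i - 1) / d; }

    void siftup(std::size_t i) {                  // O(log_d n)
        while (i > 0 && key[i] < key[parent(i)]) {
            std::swap(key[i], key[parent(i)]);
            i = parent(i);
        }
    }

    void siftdown(std::size_t i) {                // O(d * log_d n)
        for (;;) {
            std::size_t smallest = i;
            for (std::size_t c = d * i + 1; c <= d * i + d && c < key.size(); ++c)
                if (key[c] < key[smallest]) smallest = c;  // scan all d children
            if (smallest == i) return;
            std::swap(key[i], key[smallest]);
            i = smallest;
        }
    }
};
```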
In our approach we do not store all $n$ heap elements in a single array but in $\lceil \log_d n \rceil$ arrays, i.e. each tree level $m$ is stored within its own array (see Figure 1).
Figure 1. d-heap split up into level arrays (each level holds blocks of d elements; the minimum of each block is copied to the level below, down to the single "absolute minimum" root element).
The level arrays are divided into blocks $B_{m,i}$ of $d$ elements each. There are $d^{m-1}$ blocks within a level $m$. The root element of the d-heap is not stored in any of the blocks, because it is a single element. In Figure 1 the root element is indicated as the "absolute minimum". The $d$ children of the root element are placed in the 1st level. Because the first level only has $d$ entries, it is equal to the block $B_{1,0}$. Again, the children of the first level are placed within the next higher level, the 2nd level. The following relationship exists between the block in which a parent element is placed, the parent block, and the blocks in which the children are placed, the child blocks:

$$\mathrm{Parent}(B_{m,i}) \in B_{m-1,\lfloor i/d \rfloor}$$
$$\mathrm{Children}(B_{m,i}) \in B_{m+1,di},\ \ldots,\ B_{m+1,di+d-1}$$

Thus, a single entry of a parent block is allocated to a single child block. The minimum of a block $B_{m,i}$ is copied into the specific entry of the parent block $B_{m-1,\lfloor i/d \rfloor}$. To determine the minimum of a block $B_{m,i}$, the block elements in our approach are structured as a binary heap (see Figure 2).
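A sketch of how these block relations can be computed; the helper names are illustrative, and block indices are 0-based within each level, as in the formulas above.

```cpp
#include <cstddef>

// Block B(m, i): the i-th block of level m; each block holds d elements.
// Entry j of a parent block (j = 0..d-1) holds the copied minimum of
// exactly one child block on the next level.
struct BlockRef { std::size_t level, index; };

// Parent block of B(m, i) is B(m-1, floor(i/d)), for m >= 1.
inline BlockRef parent_block(std::size_t d, BlockRef b) {
    return { b.level - 1, b.index / d };
}

// Child block fed by entry j of B(m, i) is B(m+1, d*i + j).
inline BlockRef child_block(std::size_t d, BlockRef b, std::size_t j) {
    return { b.level + 1, b.index * d + j };
}
```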
The complexity of the elementary operations for such an approach in software is the following:

siftup(x): $O(\log_2 d \cdot \log_d n)$
siftdown(x): $O(\log_2 d \cdot \log_d n)$
Thus, the complexity of sorting the timers is small and satisfies the first requirement posed at the beginning of this section. The second requirement, that the data structure allow a fast arbitrary lookup, can be satisfied by the following modification.
In contrast to the traditional heapsort operation, the elements of the tree are not moved but copied between levels.
Figure 2. Binary heap structure within the level blocks (the d elements of each block are organized as a binary heap).
That means that, e.g., the root element of the d-heap is represented in all levels: it is the root of every binary heap of the child blocks in which it is stored. Every element is also represented as a leaf of the d-heap; the leaves are placed in the highest-level array, e.g. the third-level array in Figure 1.

This has the advantage that if we have to look up an arbitrary element of the heap, we know that it is stored within the highest level. Therefore, the block address in the highest-level array remains fixed for every element until it is removed from the d-heap.
The operations for our modified d-heap are performed in the following way. We suppose that the block $B_{m,i}$ of the highest level has $r$ elements in it:

insert(t): This places the element $t$ in the $(r+1)$-th position in the binary heap of the block $B_{m,i}$ and performs a siftup(t) within the block. If the new element $t$ becomes the root of the block heap, the copy of the old root in the parent block $B_{m-1,\lfloor i/d \rfloor}$ has to be found and replaced by the new root of the block $B_{m,i}$. Then, within the parent block, either a siftup(t) or a siftdown(t) has to be performed, according to the key of the element $t$.
delete_min(t): This removes the root element $t$ of the d-heap. Thereby, the roots of all corresponding child blocks $B_{m,i}$ on the levels $m = 1, \ldots, L$ are removed, because they are copies of the same element $t$. The $r$-th element $s$ of the highest-level block $B_{m,i}$ is moved to the root position of the block $B_{m,i}$, and siftdown(s) is performed. Then the new block root of $B_{m,i}$ is copied to the root position of the parent block $B_{m-1,\lfloor i/d \rfloor}$, where again a siftdown(s) is performed. This is done recursively until a new root element for the d-heap is determined.
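To illustrate the copy semantics, the following sketch propagates a changed block minimum toward the root. The level-array layout and names are our illustrative assumptions; std::min_element stands in for the binary-heap order inside a block, and this models the data structure only, not the hardware controllers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// entries[j] of a block is the copy of the minimum of child block j on the
// next level (binary-heap order inside a block omitted for brevity).
struct Block { std::vector<unsigned> entries; };

// levels[m][i] = block B(m, i); level 0 is a single pseudo-block whose only
// entry is the "absolute minimum" of the whole d-heap.
using Levels = std::vector<std::vector<Block>>;

// After block B(m, i) has changed, re-copy its minimum into its slot in the
// parent block B(m-1, floor(i/d)) and repeat toward the root.
void propagate_min(Levels& levels, std::size_t d, std::size_t m, std::size_t i) {
    while (m > 0) {
        const Block& b = levels[m][i];
        unsigned mn = *std::min_element(b.entries.begin(), b.entries.end());
        levels[m - 1][i / d].entries[i % d] = mn;  // copy, do not move
        i /= d;
        --m;
    }
}
```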
The modifications introduced for the d-heap do not perform well in a software solution, but, as we will see in the following section, they do in hardware.
3. Hardware Architecture
Our timer manager is intended to be implemented as a coprocessor in a SoC design (see Figure 3). The protocol machine is implemented in a programmable unit, such as a processor, that initiates and terminates timers and consumes timeout events. In contrast to most timer implementations, our hardware timer manager generates the timer identity when initiating a timer and returns the timer identity to the protocol stack. For the other timer primitives, however, the timer manager works as described in the Introduction.
Figure 3. Block diagram of the hardware architecture (SoC with CPU core, instruction and data caches, on-chip interconnect, and the timer-manager coprocessor with per-level controllers, level SRAMs, a 3rd-level SRAM cache, the "abs. min. T" register, and the time base; the 3rd-level array resides in external memory).
As shown in Figure 3, the timer manager has a fixed number of levels $L$. Every level array of the d-heap is stored in a private memory (SRAM) within the timer manager, except for the highest-level array. The highest-level array, the 3rd level in Figure 3, is placed in an external memory at a static address. The timer manager can access this external memory via an on-chip interconnect. Therefore, all the time values of the active timers are stored in the external memory. To cache part of the highest-level array in the timer manager itself, a memory array of size $d$ is implemented. Owing to the fixed number of implemented d-heap levels $L$ and the fixed cache size $d$, the maximum number of simultaneous timers in the timer manager is also fixed, to $d^L$.

For the lower-level arrays (1st and 2nd level in Figure 3) we provide on-chip memories that are customized for the timer purpose, i.e. the word width of each memory is customized, and they can only be accessed by the timer coprocessor itself.
In Figure 3, the 3rd-level cache contains a copy of a single block from the external memory. An appropriate cache controller manages the content of the cache. If a specific block of the external memory has to be accessed by the timer manager and this block is currently not in the cache, the cache controller first stores the old content back into the original block and then copies the content of the requested block into the cache.
The content of the cache does not have to be exchanged for a timer initiated by start if there is an empty entry available in the currently loaded block. Thus, the d-heap is not balanced. However, before the new time value is inserted into the current block, it is added to the current base time: the timers use absolute time values. The time base is a 32-bit register that increments once during each period of the source clock, and provides the absolute time reference for the timer coprocessor.
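A sketch of the conversion from a relative time value to the stored absolute expiration time, and of the expiry check against the free-running time base. The wrap-around-safe signed comparison is our assumption; the paper does not detail how wrap-around is handled.

```cpp
#include <cstdint>

// Free-running 32-bit time base, incremented once per source-clock period.
static std::uint32_t time_base = 0;

// On start, the relative time value is added to the current base time, so
// the heap stores absolute expiration times (32-bit unsigned arithmetic
// wraps around implicitly).
inline std::uint32_t to_absolute(std::uint32_t relative_value) {
    return time_base + relative_value;
}

// Continuous comparison of the "absolute minimum time" register against the
// time base; the signed difference keeps the check correct across 32-bit
// wrap-around (an assumption on our part).
inline bool expired(std::uint32_t absolute_min) {
    return static_cast<std::int32_t>(time_base - absolute_min) >= 0;
}
```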
In contrast to the d-heap structure introduced above, the elements of the blocks in the external memory are not sorted as a binary heap, but placed randomly. When loading a specific block into the cache, the entries are compared with each other to determine which entry contains the lowest time value. This element is then marked by the cache controller. Thus, within the timer coprocessor the time values are structured as introduced in the previous section.

The advantage of not sorting the elements of the blocks in the external memory is that each element is uniquely identified by its external memory address. Therefore, the timer identity can be set to the external memory address of its time value. This allows a fast lookup of any arbitrary timer.
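Since a timer identity is just the external memory address of its time value, locating the block that must be loaded on a stop is simple address arithmetic. A sketch, assuming a hypothetical static base address and d-entry blocks of 4-byte entries (both assumptions; the paper does not specify them):

```cpp
#include <cstdint>

constexpr std::uint32_t kBase = 0x80000000u;  // hypothetical static address
constexpr std::uint32_t kEntryBytes = 4;      // assumed entry size

// Index of the external-memory block to load into the cache.
inline std::uint32_t block_of(std::uint32_t timer_id, std::uint32_t d) {
    return (timer_id - kBase) / (d * kEntryBytes);
}

// Position of the timer within that block.
inline std::uint32_t entry_of(std::uint32_t timer_id, std::uint32_t d) {
    return ((timer_id - kBase) / kEntryBytes) % d;
}
```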
If an operation primitive such as start, stop or expire changes the content of the cache, the controller sends the new minimum of the current block to the next lower-level controller (see Figure 3).
In all lower-level arrays the blocks are structured as a binary heap. The appropriate controllers have their private memory in which all timers of the specific level are stored. Not only the time value of a timer is stored within the level array but also the timer identity, because a timer is no longer identified by its address. Every time the content of a block changes, the level controller sends an update of the new root element of the block to the next lower-level controller, which then has to update the corresponding parent block.
The first-level array contains the "absolute minimum time" of the current timers in the timer manager. The time value of this timer is stored in a register (see Figure 3), which is continuously compared with the time base. If this timer expires, a timeout event is initiated for the processor that contains the protocol machine. In addition, the next higher-level controller is asked for a new minimum of the child blocks of the expired timer. This, of course, triggers a recursive chain of actions up to the highest-level controller, the cache controller, which has to load from the external memory a new minimum of the leaf block in which the old "absolute minimum time" was stored; simultaneously, the timer with the "absolute minimum time" has to be deleted from its block.
At a stop instruction, the cache controller has to load the concerned block from the external memory according to the timer identity, delete the timer, and check whether the deleted timer had the minimum time value of the block. Simultaneously, all other level controllers can check the parent blocks according to the timer identity to verify whether the timer is stored in the appropriate array. As the block has to be loaded from the external memory anyway, we do not waste time by linearly searching for the timer identity in the block. Alternatively, the timer identity could be stored at a fixed position within a block, with a pointer to the appropriate time value sorted as a binary heap. Then the search for a timer within a block would take only one read cycle and not, as in our case, at most d cycles.
3.1. Multi-Core Operation
Many network processors employ a number of processor cores to achieve the required performance with low chip-area requirements and low power consumption. If the timer coprocessor is used on such a chip, each processor core should have access to the timer-manager functions. There are many options to provide this; two simple ones are the use of one single, central timer coprocessor with multiple interfaces, or the use of independent coprocessors for each processor core. While the first option is hard to scale with the processor performance, the second one has the disadvantage that each processor only "sees" its own timers. Therefore, either the processors have to communicate on a software level or the load distribution has to assign arriving tasks consistently.
In Table 1 we discuss several options. The middle column marks coherency, i.e., whether all processors see the same set of timers. Which of the options is best for a given application domain depends on its characteristics and performance goals, e.g. throughput or accuracy of expiration. Furthermore, technological aspects such as the granularity of the available memory arrays have to be considered.
3.2. Optimization Approach

The longer a timer's duration is, the less important its least significant bits are.
Table 1. Options for distributed timer operation

1) Central coprocessor with multiple interfaces (coherent: yes)
   Performance can be scaled in a limited way by multithreading, interleaving
   of memories, and queuing and reordering of operations. Best use of caches;
   the averaging effect compensates unbalanced use of the timer functions by
   the processor cores.

2) Multiple coprocessors with a cache-coherency protocol (coherent: yes)
   Software-implemented functional extensions are easily possible. The system
   can be optimized for low latency by controlling locality.

3) Distributed coprocessors accessing a central on-chip memory (coherent: yes)
   Similar to the first option; the memory latency can be critical for the
   performance of the heap-update operations.

4) Distributed coprocessors with private memory (coherent: no)
   Lowest design effort.
If the entire 32-bit timer range is split into overlapping periods (see the example in Figure 4), a timer value loses accuracy when fitted into one of these periods. However, by overlapping the periods, the maximum error for a single period can be reduced.

Figure 4. Example of rounding periods: T(0) = [2^0 ... 2^11−1]*t, T(1) = [2^6 ... 2^17−1]*t, T(2) = [2^13 ... 2^24−1]*t, T(3) = [2^21 ... 2^32−1]*t.
The maximum error of a certain timer period is incurred at its lower boundary, i.e. when switching from one period to the next higher one. For the example of Figure 4, switching from $T(0)$ to $T(1)$ happens at $2^{11} \cdot t$, where the granularity is $2^{6} \cdot t$; the maximum relative errors are therefore

$$E_1 = 2^{6}/2^{11} = 2^{-5}, \quad E_2 = 2^{13}/2^{17} = 2^{-4}, \quad E_3 = 2^{21}/2^{24} = 2^{-3}.$$

Any incoming 32-bit timer value is first checked to determine into which period it fits. Additional bits that identify the period are added to the new value. For the example in Figure 4 this means that $11 + 2 = 13$ bits are actually stored for a timer.
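The following sketch encodes a 32-bit time value into the 13 bits (2 period bits + 11 mantissa bits) of the Figure 4 example. The exact rounding mode is our assumption; rounding up ensures a timer never expires early.

```cpp
#include <cstdint>

// Figure 4 example: period k keeps 11 mantissa bits at shift kShift[k], i.e.
// T(0) = [2^0 .. 2^11-1]*t, T(1) = [2^6 .. 2^17-1]*t,
// T(2) = [2^13 .. 2^24-1]*t, T(3) = [2^21 .. 2^32-1]*t.
static const unsigned kShift[4] = {0, 6, 13, 21};

// Pick the smallest period that can represent the value, round the value up
// to the period's granularity (assumption), and pack 2 + 11 bits.
inline std::uint16_t encode(std::uint32_t v) {
    unsigned k = 0;
    while (k < 3 && (v >> kShift[k]) >= (1u << 11)) ++k;
    std::uint32_t granularity = 1u << kShift[k];
    std::uint32_t mantissa = (v + granularity - 1) >> kShift[k];  // round up
    if (mantissa > 0x7FF) mantissa = 0x7FF;  // clamp at the top of the period
    return static_cast<std::uint16_t>((k << 11) | mantissa);
}

inline std::uint32_t decode(std::uint16_t e) {
    return static_cast<std::uint32_t>(e & 0x7FF) << kShift[e >> 11];
}
```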
The rounding of the timers has two significant advantages. First, it reduces the on-chip memory space, so less silicon area is used. Second, the performance of the timer manager increases because of the decrease in the data volume to be stored to and loaded from the external memory.

The rounding periods have to be fixed during operation, but before the timer manager is started, the position of the periods can be configured. This allows the user of the timer manager to optimize the rounding for a specific application.
It is not necessary to store the timer identity in full length in the on-chip memories. Each block corresponds one-to-one to a part of the memory address space of the highest-level memory: a block $B_{m,i}$ of d-heap level $m$ covers the address space $[\,i \cdot d^{L-m}, \ldots, (i+1) \cdot d^{L-m} - 1\,]$. For a heap of the maximum depth $L = 3$, the timer identity stored at level $m$ can therefore be truncated to the $\lceil \log_2 d^{L-m} \rceil$ bits that address an entry within this space. Thus, the area needed for the on-chip memory can be further reduced.
Of course, the operation performance of each level can be increased by implementing a d-ported SRAM or a CAM. The lowest-level array can easily be replaced by a CAM; its entries then do not have to be heap-sorted but can be compared with the time base continuously. For the higher-level arrays, a d-ported SRAM can be implemented to avoid the heap-sort operation within a block $B_{m,i}$ or the cache. A d-wide comparator is then needed in each level controller to determine the minimum of a block. Note that this is an area-expensive approach and that it assumes nonstandard macros.
4. Simulation Results
The timer manager model is implemented in SystemC [5]. The model has been simulated as a standalone module: a testbench drove the coprocessor model with stimuli and compared its output with the expected results. The stimuli were synthetically generated with a Poisson distribution of start and stop events. As the network application is not known at this stage of the timer-coprocessor design, no traces or more appropriate stimuli models are available.
The following results show the cycles used per operation. The integer $d$ has been varied between 32 and 50, and the number of levels is fixed to three. Thus, the maximum number of timers is $32^3 = 32{,}768$ for $d = 32$ and $50^3 = 125{,}000$ for $d = 50$.
Figure 5 shows the average number of cycles needed per start operation. It is interesting that once the timer manager is filled to more than 75-80%, the duration of the operation no longer increases logarithmically but exponentially. This is due to an implementation trade-off we made in our model. To decide which block of the third level has to be loaded when the cache is full and a new timer has to be started, we keep the number of free entries of each block in a separate list. To find a third-level block with a free entry, this list has to be searched linearly. After a certain filling level has been reached, the chance of a cache miss is quite high (over 20% when 80% filled). Thus, the search affects the average cycle time of the start operation.
Figure 5. Start operation: average execution cycles versus number of active timers, for block sizes d = 32, 40 and 50.
This effect can be deferred to higher filling levels, but not completely avoided, by sorting the free list in advance so that a block with free entries is almost always known. However, this requires more memory space and combinatorial logic.
The graphs in Figure 6 show a fairly constant number of cycles for the stop operation across the filling levels of the timer manager. In contrast to the start operation, the operation time slightly decreases, but this effect is caused by the stimuli themselves: to simulate higher filling levels of the timer manager, the average expiration time of the timers was increased. Thus, the chance that a timer that has to be stopped is on-chip decreases slightly for higher expiration times, and therefore the measured operation time decreased.
Figure 6. Stop operation: average execution cycles versus number of active timers, for block sizes d = 32, 40 and 50.
It is remarkable that almost all stop operations access the external memory. In the simulation model, a constant bus latency is added to the d read and write cycles for the cache to account for the external memory access. Depending on d, 89-125 cycles are added constantly in the graph in Figure 6. Thus, 87-95% of the cycles are used for the external memory access. This is of course not optimal and has to be reduced. One optimization might be to make use of the entire bus width: today's on-chip busses have widths of 128 bits and more, so in our example a reduction of the fixed access time by a factor of two or more can be expected. Furthermore, if we round the timer accuracy as proposed in Section 3.2, the access time for the external memory will decrease again. In the example with block size 32, the access time would decrease from 89 to 33 cycles by using the full bus width and inaccurate timers.
The expire operation is the most expensive operation time-wise (see Figure 7). Like the stop operation, the expire operation suffers from the fact that almost every operation accesses the external memory, and it would benefit from the same optimization as the stop operation. Moreover, this operation would benefit from the alternative block organization discussed above, because the heap-sort effort within the block would be reduced significantly.
Figure 7. Expire operation: average execution cycles versus number of active timers, for block sizes d = 32, 40 and 50.
5. Conclusion
In this paper a large-scale hardware timer manager based on a d-heap structure has been introduced. Because the timer manager is designed with standard memory parts, the theoretical d-heap operation performance cannot be achieved; this is because our method sorts the children of a parent node as a binary heap to determine which child key is the minimum. However, the simulation results show that the worst bottleneck in our approach is still the external memory access. The access time for the external memory has to be optimized first.

We showed how the access time for the external memory can be reduced significantly without changing the main architecture of the coprocessor. In addition, we also showed how the sort operation of the d-heap can be optimized by adapting the architecture. Our results clearly show that the architecture introduced here is a simple but efficient one for a large-scale hardware timer manager.
6. Acknowledgments
We express our thanks to Prof. Dr.-Ing. Maehle and the members of his Institute of Computer Engineering for the cooperation during the development of the SystemC model. In particular we thank Philipp Roß, who did the implementation work on the SystemC model during his student project.
References
[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[2] L. Gwennap and B. Wheeler. A Guide to Network Processors. The Linley Group, 2001.
[3] M. Heddes. A Hardware/Software Codesign Strategy for the Implementation of High-Speed Protocols. PhD thesis, Technical University of Eindhoven, The Netherlands, 1995.
[4] E. Mumprecht, D. Gantenbein, and R. F. Hauser. "Timers in OSI Protocols: Specification versus Implementation". In Proc. Int'l Zurich Seminar on Digital Communications, pages 93-98, Mar. 1988.
[5] SystemC. SystemC User's Guide, Version 2.0.
[6] A. S. Tanenbaum. Computer Networks. Prentice-Hall International, Inc., 1996.
[7] G. Varghese and A. Lauck. "Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility". IEEE/ACM Transactions on Networking, 5(6):824-834, Dec. 1997.