Dynamically Maintaining Duplicate-Insensitive and Time-Decayed Sum Using Time-Decaying Bloom Filter.
ABSTRACT The duplicate-insensitive and time-decayed sum of an arbitrary subset in a stream is an important aggregation for various
analyses in many distributed stream scenarios. In general, computing this sum exactly over an unbounded, high-rate stream
is infeasible. We therefore target this problem and introduce a sketch, namely the time-decaying Bloom filter (TDBF). The
TDBF can detect duplicates in a stream while dynamically maintaining the decayed weight of every distinct element in the stream
according to a user-specified decay function. Given a query for the current decayed sum of a subset of the stream, the TDBF provides
an effective estimate. Our theoretical analysis gives a provable approximation guarantee for the error of this
estimate. In addition, experimental results on a synthetic stream validate our theoretical analysis.
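
The abstract does not spell out the internals of the TDBF, so the following Python sketch is only an illustration of the general idea and should not be read as the authors' construction. It assumes an exponential decay function, cells that store arrival timestamps, non-decreasing timestamps, and that only the first arrival of an element counts toward the sum. The class name, method names, and parameters are all hypothetical.

```python
import hashlib
import math


class TimeDecayingBloomFilter:
    """Illustrative sketch only; names and design choices are assumptions."""

    def __init__(self, num_cells, num_hashes, decay_rate):
        self.num_cells = num_cells      # number of cells in the filter
        self.num_hashes = num_hashes    # number of cells probed per element
        self.decay_rate = decay_rate    # lambda in exp(-lambda * age), an assumed decay function
        self.cells = [None] * num_cells # None = empty, otherwise latest timestamp written

    def _positions(self, item):
        # Double hashing derived from a single SHA-256 digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_cells for i in range(self.num_hashes)]

    def insert(self, item, timestamp):
        """Record an arrival; return False if it is reported as a duplicate."""
        positions = self._positions(item)
        if all(self.cells[p] is not None for p in positions):
            return False  # every probed cell is occupied: treat as duplicate
        for p in positions:
            old = self.cells[p]
            self.cells[p] = timestamp if old is None else max(old, timestamp)
        return True

    def weight(self, item, now):
        """Estimate the current decayed weight of one element."""
        positions = self._positions(item)
        if any(self.cells[p] is None for p in positions):
            return 0.0  # at least one empty cell: element was never inserted
        oldest = min(self.cells[p] for p in positions)
        return math.exp(-self.decay_rate * (now - oldest))

    def decayed_sum(self, items, now):
        """Estimate the duplicate-insensitive decayed sum over a subset."""
        return sum(self.weight(x, now) for x in set(items))
```

In this sketch a collision can only move a cell's timestamp forward, so the weight of an inserted element is never underestimated; the remaining error comes from overestimates and from new elements wrongly reported as duplicates, as in any Bloom filter.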
ABSTRACT: Detecting duplicates in click data streams is an important task in the fight against click fraud, which is the act of generating false clicks in internet advertising. Revenue-generating advertising models that charge advertisers for each click leave room for individuals or rival companies to generate false clicks. The extent of click fraud's damage to online advertising has grown tremendously over the years. In this paper, we consider the problem of detecting duplicates in click data streams. Our solution uses a modified version of the counting Bloom filter. The temporal stateful Bloom filter (TSBF) extends the standard counting Bloom filter by replacing the bit-vector with an array of counters of states. These counters are dynamic and decay with time. We conducted a comprehensive set of experiments using synthetic and real-world data. Results are compared with buffering techniques used in NetMosaics, a click fraud detection and prevention solution. Our results show that the TSBF approach achieves 99% accuracy in duplicate detection while keeping its space requirement constant.
International Journal of Data Analysis Techniques and Strategies 11/2012; 4(4):340-377.