Probabilistic Lossy Counting: An efficient algorithm for finding heavy hitters

Xenofontas Dimitropoulos, Paul Hurley, and Andreas Kind
Appears in: 
CCR January 2008

Knowledge of the largest traffic flows in a network is important for many network management applications. The problem of finding these flows is known as the heavy-hitter problem and has been the subject of many studies in the past years. One of the most efficient and well-known algorithms for finding heavy hitters is lossy counting [29].

In this work we introduce probabilistic lossy counting (PLC), which enhances lossy counting in computing network traffic heavy hitters. PLC uses on a tighter error bound on the estimated sizes of traffic flows and provides probabilistic rather than deterministic guarantees on its accuracy. The probabilistic-based error bound substantially improves the memory consumption of the algorithm. In addition, PLC reduces the rate of false positives of lossy counting and achieves a low estimation error, although slightly higher than that of lossy counting.

We compare PLC with state-of-the-art algorithms for finding heavy hitters. Our experiments using real traffic traces find that PLC has 1) between 34.4% and 74% lower memory consumption, 2) between 37.9% and 40.5% fewer false positives than lossy counting, and 3) a small estimation error.

Public Review By: 
Pablo Rodriguez

Imagine that you see a large number of individual transactions (such as Amazon book sales), and you want to calculate what are the top sellers today. Or imagine that you are monitoring network traffic and you want to know which hosts/subnets are responsible for most of the traffic. This is a problem of finding heavy hitters given a stream of elements.
A straight forward method to tackle this problem is to store each element identifier with a corresponding counter monitoring the number of occurrences of that element. Then, you sort the elements accordingly to their counters and you can easily get the most frequent elements. However, in many real scenarios, this simple solution is not efficient of computationally feasible. For instance, consider the case of tracking the pairs of IP address that generate the most traffic over some time period. You need 16,384 PBytes of memory and a lot of time to sort and scan that memory array, which often makes the problem non-computationally feasible.
For these reasons, during recent years, techniques to computing heavy hitters using limited memory resources have been investigated. A characteristic of these is that they cannot find exact heavy hitters, but instead approximate the heavy hitters of a data stream. The approximation typically lies in that the computed heavy hitters may include false negatives.
Lossy counting is a well-known algorithm for finding heavy hitters using limited memory. In lossy counting, an important parameter for each distinct element identifier in the table is its error bound. Such error bound reflects the potential error on the estimated frequency of an element. Its importance resides in the fact that elements with small error bounds are more likely to be removed from the lossy counting process than equal-frequency elements having a larger error bound.
This paper proposes to make the error bound substantially smaller than the deterministic error bound in existing lossy counting algorithms. By lowering the error bound, elements will stay in the lossy counting table for less time and the memory requirements will be lower. The authors call such system: Probabilistic Lossy Counting. The reason it works is because there is often a large number of small flows with large error bounds, which require a lot of table entries. Decreasing the error bound of such small flows can drastically reduce the table size.
The authors show how for the case where data streams exhibit Zipfian distribution, one can find a tighter probabilistic error bound on the estimated frequency of an element. For such scenarios, the probabilistic lossy counting algorithm provides good results. However, if the distribution cannot be approximated by a Zipf model, then, the error bound does not hold. The paper focuses on analyzing the system complexity in terms of the required memory size. However, more work is required to derive the required processing time, i.e., the number of memory accesses per each packet, and evaluate this. Overall, the evaluation experiments of the paper show that the probabilistic lossy counting algorithm exhibits substantially better memory consumption than lossy counting and multistage filters.