Toward scalable internet traffic measurement and analysis with Hadoop

By: 
Yeonhee Lee, Youngseok Lee
Appears in: 
CCR January 2013

Internet traffic measurement and analysis has long been used to characterize network usage and user behaviors, but it faces a scalability problem under the explosive growth of Internet traffic and high-speed access. Scalable Internet traffic measurement and analysis is difficult because a large data set requires matching computing and storage resources. Hadoop, an open-source computing platform that combines MapReduce with a distributed file system, has become a popular infrastructure for massive data analytics because it facilitates scalable data processing and storage services on a distributed computing system consisting of commodity hardware. In this paper, we present a Hadoop-based traffic monitoring system that performs IP, TCP, HTTP, and NetFlow analysis of multiple terabytes of Internet traffic in a scalable manner. In experiments with a 200-node testbed, we achieved a throughput of 14 Gbps for 5 TB files with IP and HTTP-layer analysis MapReduce jobs. We also explain the performance issues related to traffic analysis MapReduce jobs.
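To put the reported figures in perspective, the following back-of-the-envelope sketch converts the 14 Gbps throughput and 5 TB input into a job duration and an average per-node rate. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and an even spread of work across all 200 nodes. Neither assumption is stated in the abstract, and the function name is hypothetical.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# Assumptions (not stated in the source): decimal units and an even
# distribution of work across the 200 nodes.

TB = 10**12    # bytes per terabyte (decimal, assumed)
GBPS = 10**9   # bits per second per Gbps (decimal, assumed)


def job_seconds(data_bytes: float, throughput_bps: float) -> float:
    """Time needed to process data_bytes at a sustained throughput_bps."""
    return data_bytes * 8 / throughput_bps


total_seconds = job_seconds(5 * TB, 14 * GBPS)

# Average share of the aggregate throughput handled by each node, in Mbps.
per_node_mbps = 14 * GBPS / 200 / 10**6

print(f"job time: {total_seconds:.0f} s ({total_seconds / 60:.1f} min)")
print(f"per-node rate: {per_node_mbps:.0f} Mbps")
```

Under these assumptions, a 5 TB job takes a little under 48 minutes, and each node sustains about 70 Mbps on average.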

Public Review By: 
Sharad Agarwal

Network trace analysis is the bread and butter of the ACM SIGCOMM community. I relied on network traces for my Ph.D. thesis, which was not too long ago, of course. My colleagues and I relied on a carefully constructed SAN, with a hand-tuned filesystem and custom libraries for efficient processing. In contrast, for our bulk processing needs today, such as for online web services, we rely on clusters of several hundred commodity machines with a commodity network and a distributed filesystem. This paper presents a Hadoop-based solution for packet or NetFlow trace analysis. With a 200-node Hadoop testbed, the authors achieve 14 Gbps throughput for 5 TB files. Their solution includes a binary format for accessing traces concurrently, MapReduce algorithms for NetFlow, IP, TCP, and HTTP analysis, and a Hive-based system for simplifying queries. The authors have built several tools on top of this, from standard 5-tuple flow statistics to TCP retransmission statistics to DDoS analysis. The reviewers agree that this is a solid contribution, though not necessarily controversial or discussion-provoking. It is worth informing the rest of the community because trace analysis is often core to what we do. The authors have worked diligently with the reviewers to improve their paper. In particular, they have evaluated their system against the RIPE solution, with the main difference arising from data locality on disk.
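The 5-tuple flow statistics mentioned above fit the MapReduce model naturally: a map step keys each packet by its 5-tuple, and a reduce step sums packets and bytes per key. The sketch below simulates that pattern in memory with plain Python. It is not the authors' implementation and does not use the Hadoop API. The `Packet` record, the function names, and the sample addresses (taken from the documentation ranges) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record; a real job would decode trace files."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    length: int  # bytes


FlowKey = Tuple[str, str, int, int, int]
Counts = Tuple[int, int]  # (packets, bytes)


def map_packet(pkt: Packet) -> Iterator[Tuple[FlowKey, Counts]]:
    """Map step: emit the 5-tuple as key and (1 packet, length) as value."""
    key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)
    yield key, (1, pkt.length)


def reduce_flow(key: FlowKey, values: Iterable[Counts]) -> Tuple[FlowKey, Counts]:
    """Reduce step: sum packet and byte counts for one flow."""
    packets = 0
    total_bytes = 0
    for count, length in values:
        packets += count
        total_bytes += length
    return key, (packets, total_bytes)


def run_job(packets: Iterable[Packet]) -> Dict[FlowKey, Counts]:
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups: Dict[FlowKey, List[Counts]] = defaultdict(list)
    for pkt in packets:
        for key, value in map_packet(pkt):
            groups[key].append(value)
    return dict(reduce_flow(key, values) for key, values in groups.items())


# Illustrative input: two TCP packets of one flow and one UDP packet of another.
sample = [
    Packet("192.0.2.1", "198.51.100.7", 40000, 80, 6, 1500),
    Packet("192.0.2.1", "198.51.100.7", 40000, 80, 6, 40),
    Packet("192.0.2.9", "198.51.100.7", 53000, 53, 17, 76),
]
flows = run_job(sample)
for key, (n_packets, n_bytes) in sorted(flows.items()):
    print(key, n_packets, n_bytes)
```

In a real Hadoop job the grouping is done by the framework's shuffle phase across machines. Because the per-flow sums are associative, the same reduce logic could also serve as a combiner to cut shuffle traffic.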