Pitfalls for Testbed Evaluations of Internet Systems

By: 
David R. Choffnes and Fabian E. Bustamante
Appears in: 
CCR April 2010

Today’s open platforms for network measurement and distributed system research, which we collectively refer to as testbeds in this article, provide opportunities for controllable experimentation and evaluations of systems at the scale of hundreds or thousands of hosts. In this article, we identify several issues with extending results from such platforms to Internet-wide perspectives. Specifically, we try to quantify the level of inaccuracy and incompleteness of testbed results when applied to the context of a large-scale peer-to-peer (P2P) system. Based on our results, we emphasize the importance of measurements in the appropriate environment when evaluating Internet-scale systems.

Public Review By: 
Pablo Rodriguez

This paper discusses the benefits of using end-based measurement systems, as opposed to testbed-based systems, to accurately evaluate Internet applications and capture Internet metrics. In particular, the paper discusses how end-based measurement systems can provide a more accurate picture of the Internet topology, as well as more accurate estimates of properties of the Internet graph such as latency and bandwidth. This is potentially a promising research area that could generate interesting comparisons and discussions.
More precisely, the paper analyzes pings, traceroutes, and pairwise available-bandwidth measurements obtained from several hundred thousand Ono plugins installed in popular BitTorrent clients. This dataset is compared to “testbed-based data” from, e.g., PlanetLab, RouteViews, and commercial speed-test services, which rely on a few hundred measurement points.
Through this point-by-point comparison, the authors show that their end-system dataset appears to report significantly different values across a range of metrics, including delay, path information, and bandwidth. The paper therefore makes a call for caution when generalizing measurements and conclusions obtained from limited-scale testbeds. It also urges the community to put more effort into developing edge measurement platforms and releasing edge-based datasets.
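To make the idea of such a point-by-point comparison concrete, the short Python sketch below shows how one metric, here round-trip delay, might be summarized and contrasted between an edge-based and a testbed-based dataset. The file names, the one-sample-per-row CSV format, and the reported statistics are all hypothetical illustrations, not the authors’ actual analysis code.

import csv
import statistics

def load_rtts(path):
    # Read one round-trip-time sample (in ms) per row from a hypothetical CSV file.
    with open(path) as f:
        return sorted(float(row[0]) for row in csv.reader(f) if row)

def summarize(samples):
    # Summary statistics of the kind such a comparison might report.
    pct = lambda p: samples[int(p * (len(samples) - 1))]
    return {"p10": pct(0.10), "median": statistics.median(samples),
            "p90": pct(0.90), "mean": statistics.fmean(samples)}

# Hypothetical input files: delays observed from edge hosts vs. testbed nodes.
edge = summarize(load_rtts("edge_rtts.csv"))
testbed = summarize(load_rtts("testbed_rtts.csv"))

for key in edge:
    print("%s: edge=%.1f ms, testbed=%.1f ms, ratio=%.2f"
          % (key, edge[key], testbed[key], edge[key] / testbed[key]))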
The limitations of testbed-based experiments have been frequently debated in the research community, and this paper contributes an extensive dataset and a detailed point-by-point analysis. Still, the paper could benefit from a more refined study, including more edge- and testbed-based measurements, and from more clarity about which applications and metrics are best suited to testbed-based versus edge-based measurement. Doing so would help in understanding the tradeoff between the simplicity and controllability of testbeds and the extensiveness of harder-to-gather edge-based measurements.