Longitudinal Study of BGP Monitor Session Failures

By: 
Pei-chun Cheng, Xin Zhao, Beichuan Zhang, and Lixia Zhang
Appears in: 
CCR April 2010

BGP routing data collected by RouteViews and RIPE RIS have become an essential asset to both the network research and operation communities. However, it has long been speculated that the BGP monitoring sessions between operational routers and the data collectors fail from time to time. Such session failures lead to missing update messages as well as duplicate updates during session re-establishment, making analysis results derived from such data inaccurate. Since there is no complete record of these monitoring session failures, data users either have to sanitize the data at their own discretion according to their specific needs or, more commonly, assume that session failures are infrequent enough and simply ignore them. In this paper, we present the first systematic assessment and documentation of BGP session failures of RouteViews and RIPE data collectors over the past eight years. Our results show that monitoring session failures are rather frequent: more than 30% of BGP monitoring sessions experienced at least one failure every month. Furthermore, we observed failures that happen to multiple peer sessions on the same collector around the same time, suggesting that the collector’s local problems are a major factor in the session instability. We also developed a web site as a community resource to publish all session failures detected for RouteViews and RIPE RIS data collectors to help users select and clean up BGP data before performing their analysis.

Public Review By: 
Jitendra Padhye

Many researchers use BGP routing data collected by RouteViews and RIPE servers as a starting point for their research. The data are affected by failures of the BGP sessions between the operational routers and the data collectors, and hence must be sanitized before being used. This sanitization is often done in an ad hoc manner by individual researchers to suit their needs.
To remedy this situation, the authors have systematically catalogued the session failures in the RouteViews and RIPE data gathered over the past eight years. The primary contribution of the paper is the database of these failures, which the authors have made available to the public. Furthermore, the authors plan to keep the failure database updated as new data comes in. This database will be a valuable resource to researchers working in this area.
The authors also draw some basic conclusions from the failure data they gather. They point out that BGP session resets are quite frequent, although the downtime is often less than 10 minutes. Based on correlation between session failures, they conclude that often it is the collector that is at fault. Unfortunately, they are unable to shed any light on why the collectors fail. Some information in this regard would have been useful for improving the collectors’ performance.
The paper makes one wonder whether the flaws in these data sets may have influenced the conclusions of (many!) research studies based on them. The authors (or others) may want to investigate this question as part of their future work.