Matthew Caesar

Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can

Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar
Appears in: 
CCR August 2015

To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them.

Towards Understanding Bugs in Open Source Router Software

Zuoning Yin, Matthew Caesar, and Yuanyuan Zhou
Appears in: 
CCR July 2010

Software errors and vulnerabilities in core Internet routers have led to several high-profile attacks on the Internet infrastructure and numerous outages. Building an understanding of bugs in open-source router software is a first step towards addressing these problems. In this paper, we study router bugs found in two widely-used open-source router implementations. We evaluate the root cause of bugs, ease of diagnosis and detectability, ease of prevention and avoidance, and their effect on network behavior.

Public Review By: 
S. Saroiu

This paper presents a study of bugs found in open-source routers. It characterizes a random sample of bugs present in the bug databases of Quagga and XORP, two routers with open-source implementations, as well as Cisco IOS/security advisories and the Linux IP stack.
The paper presents many results, of which two stand out in my opinion:
1. Despite the huge success of tools that detect copy-and-paste errors in the Linux kernel, these tools were not very successful when applied to router software.
2. 4% of the code contains more than a quarter of the bugs! Lines of code is not a good metric of “bugginess.” In the router software stacks examined in this paper, the code implementing policy-related logic (4% of the codebase) had 28% of the bugs.
I hope I piqued your interest in reading this bugs characterization study. There are many more results described in the paper.
To summarize the reviewers' feedback and criticism: the paper offers little beyond its analysis of the data. Reviewers also wondered whether the bugs found in Quagga's and XORP's codebases are representative of the more popular router software stacks, such as Cisco's and Juniper's. Finally, the reviewers were looking for more insight into why tools like CP-Miner failed to find bugs in routers' codebases. Is it just because finding data races is inherently hard? Is it something special about routers' software stacks? Did the tools find more bugs in certain parts of the codebase?

Dynamic route recomputation considered harmful

Matthew Caesar, Martin Casado, Teemu Koponen, Jennifer Rexford, and Scott Shenker
Appears in: 
CCR April 2010

This paper advocates a different approach to the routing convergence problem: side-stepping it by avoiding it in the first place! Rather than recomputing paths after temporary topology changes, we argue for a separation of timescale between offline computation of multiple diverse paths and online spreading of load over these paths. We believe decoupling failure recovery from path computation leads to networks that are inherently more efficient, more scalable, and easier to manage.

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Changhoon Kim, Matthew Caesar, and Jennifer Rexford
Appears in: 
CCR October 2008

IP networks today require massive effort to configure and manage. Ethernet is vastly simpler to manage, but does not scale beyond small local area networks. This paper describes an alternative network architecture called SEATTLE that achieves the best of both worlds: the scalability of IP combined with the simplicity of Ethernet. SEATTLE provides plug-and-play functionality via flat addressing, while ensuring scalability and efficiency through shortest-path routing and hash-based resolution of host information.

Achieving Convergence-Free Routing using Failure-Carrying Packets

Karthik Lakshminarayanan, Matthew Caesar, Murali Rangan, Tom Anderson, Scott Shenker, and Ion Stoica
Appears in: 
CCR October 2007

Current distributed routing paradigms (such as link-state, distance-vector, and path-vector) involve a convergence process consisting of an iterative exploration of intermediate routes triggered by certain events such as link failures. The convergence process increases router load, introduces outages and transient loops, and slows reaction to failures. We propose a new routing paradigm where the goal is not to reduce the convergence times but rather to eliminate the convergence process completely.
