Ming Zhang

Packet-Level Telemetry in Large Datacenter Networks

By: 
Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming Zhang, Ben Y. Zhao, Haitao Zheng
Appears in: 
CCR August 2015

Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. In this task, datacenter networks (DCNs) present unique challenges with their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets inside a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today can achieve both the specificity and scale required for this task.
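
To make the "identify and track" steps concrete, here is a minimal Python sketch of matching packets of interest by 5-tuple within per-switch capture records and reconstructing their trajectory; the record layout and field names are illustrative assumptions, not the paper's implementation.

    # A minimal sketch (not the paper's system) of matching a flow of interest
    # inside high-volume captures and reconstructing its per-hop trajectory.
    # The record layout ("switch", "ts", "flow") is an illustrative assumption.

    def track_packet(captures, five_tuple):
        """Return the time-ordered capture records for one flow.

        captures: iterable of dicts like
            {"switch": "T0-1", "ts": 1.002,
             "flow": ("10.0.0.1", "10.0.0.2", 31337, 80, "TCP")}
        five_tuple: (src_ip, dst_ip, src_port, dst_port, proto)
        """
        hops = [c for c in captures if c["flow"] == five_tuple]
        return sorted(hops, key=lambda c: c["ts"])

    def find_gap(trajectory, expected_path):
        """Compare observed hops against the expected path; the first
        expected switch with no capture is a drop/fault candidate."""
        seen = {c["switch"] for c in trajectory}
        for switch in expected_path:
            if switch not in seen:
                return switch
        return None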

Congestion Control for Large-Scale RDMA Deployments

By: 
Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, Ming Zhang
Appears in: 
CCR August 2015

Modern datacenter applications demand high throughput (40Gbps) and ultra-low latency (< 10 μs per hop) from the network, with low CPU overhead. Standard TCP/IP stacks cannot meet these requirements, but Remote Direct Memory Access (RDMA) can. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. However, PFC can lead to poor application performance due to problems like head-of-line blocking and unfairness.
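
To illustrate the alternative to relying on PFC alone, here is a simplified Python sketch of DCQCN-style, ECN-driven rate control at the sender; the gain and recovery constants are assumed values, not the paper's tuned parameters, and the real algorithm has additional recovery stages.

    # A simplified sketch of end-to-end, ECN-driven rate control: the sender
    # keeps an EWMA estimate (alpha) of how often its packets are marked and
    # cuts its rate proportionally, relieving congestion before PFC pauses
    # kick in. Constants are illustrative assumptions.

    G = 1.0 / 256          # EWMA gain for the marking estimate (assumed)
    RATE_STEP = 40e6       # additive recovery step in bits/s (assumed)

    class DcqcnSender:
        def __init__(self, line_rate_bps):
            self.line_rate = line_rate_bps
            self.rate = line_rate_bps
            self.alpha = 1.0   # running estimate of the ECN-marked fraction

        def on_cnp(self):
            """Congestion Notification Packet arrived: raise the marking
            estimate and cut the sending rate proportionally to it."""
            self.alpha = (1 - G) * self.alpha + G
            self.rate *= 1 - self.alpha / 2

        def on_quiet_period(self):
            """No marks in the last window: decay the estimate and
            additively recover the rate toward line rate."""
            self.alpha *= 1 - G
            self.rate = min(self.line_rate, self.rate + RATE_STEP)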

Dynamic scheduling of network updates

By: 
Xin Jin, Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, Roger Wattenhofer
Appears in: 
CCR August 2014

We present Dionysus, a system for fast, consistent network updates in software-defined networks. Dionysus encodes as a graph the consistency-related dependencies among updates at individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different switches. This dynamic scheduling is the key to its speed; prior update methods are slow because they pre-determine a schedule, which does not adapt to runtime conditions.
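
A minimal Python sketch of the dynamic-scheduling idea follows: model the updates as a dependency DAG and, rather than fixing an order up front, issue every update whose dependencies are met and release new ones as completions arrive. Dionysus's actual graph also models resources such as link capacity, which this sketch omits.

    # A minimal sketch of dynamic update scheduling over a dependency DAG.
    # Batches are released as switches actually finish, not in a fixed plan.

    def dynamic_schedule(updates, deps, completions):
        """updates: set of update ids.
        deps: dict mapping an update to the set of updates it waits on.
        completions: iterator yielding update ids as switches report done
                     (its order reflects runtime speed, not any plan).
        Yields batches of updates to issue."""
        done, issued = set(), set()
        ready = {u for u in updates if not deps.get(u)}
        issued |= ready
        yield sorted(ready)                      # issue everything ready up front
        for finished in completions:
            done.add(finished)
            newly = {u for u in updates - issued
                     if deps.get(u, set()) <= done}
            if newly:
                issued |= newly
                yield sorted(newly)              # adapt as switches finish

    # Switch B finishes before A at runtime, so C (which needs only B) goes early.
    deps = {"C": {"B"}, "D": {"A", "B"}}
    batches = dynamic_schedule({"A", "B", "C", "D"}, deps, iter(["B", "A"]))
    print(list(batches))   # [['A', 'B'], ['C'], ['D']]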

A network-state management service

By: 
Peng Sun, Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin
Appears in: 
CCR August 2014

We present Statesman, a network-state management service that allows multiple network management applications to operate independently, while maintaining network-wide safety and performance invariants. Network state captures various aspects of the network such as which links are alive and how switches are forwarding traffic. Statesman uses three views of the network state. In observed state, it maintains an up-to-date view of the actual network state. Applications read this state and propose state changes based on their individual goals.
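
The paper's three views are the observed, proposed, and target states. Below is a minimal Python sketch of how a proposal might be merged into the target state only when network-wide invariants hold; the single invariant shown is an illustrative stand-in for real safety and performance checks.

    # A minimal sketch of the three-view model: applications read observed
    # state and submit proposals; a proposal is promoted into the target
    # state only if the merged result keeps the invariants true.

    def merge_proposal(observed, target, proposal, invariants):
        """observed/target/proposal: dicts mapping an entity to its value,
        e.g. {("link", "s1-s2"): "up"}. Returns the new target state, or
        None if the merged result would violate an invariant."""
        candidate = {**target, **proposal}
        full_view = {**observed, **candidate}   # untouched state stays as observed
        if all(check(full_view) for check in invariants):
            return candidate
        return None

    def enough_live_links(state, minimum=2):
        """Illustrative invariant: keep at least `minimum` links up."""
        live = sum(1 for (kind, _), value in state.items()
                   if kind == "link" and value == "up")
        return live >= minimum

    observed = {("link", "s1-s2"): "up", ("link", "s1-s3"): "up"}
    proposal = {("link", "s1-s2"): "down"}   # e.g. an upgrade app draining a link
    print(merge_proposal(observed, {}, proposal, [enough_live_links]))  # None: rejected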

Duet: cloud scale load balancing with hardware and software

By: 
Rohan Gandhi, Hongqiang Harry Liu, Y. Charlie Hu, Guohan Lu, Jitendra Padhye, Lihua Yuan, Ming Zhang
Appears in: 
CCR August 2014

Load balancing is a foundational function of datacenter infrastructures and is critical to the performance of online services hosted in datacenters. As the demand for cloud services grows, expensive and hard-to-scale dedicated hardware load balancers are being replaced with software load balancers that scale using a distributed data plane that runs on commodity servers.
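
The core data-plane primitive such a load balancer implements, whether in software muxes or, as Duet argues, in commodity switch hardware, is a deterministic hash from a connection to a backend. A minimal Python sketch (with assumed addresses) follows.

    # A minimal sketch of the load-balancing primitive: hash a connection's
    # 5-tuple to pick a backend (DIP) for a service address (VIP), so every
    # packet of one connection reaches the same server.

    import hashlib

    def pick_dip(vip, five_tuple, dips):
        """Deterministically map a connection to one of the VIP's backends."""
        key = (vip + "|" + "|".join(map(str, five_tuple))).encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return dips[digest % len(dips)]

    # Example: every packet of this connection hits the same DIP.
    backends = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]
    conn = ("172.16.0.9", "100.64.0.1", 51324, 443, "TCP")
    assert pick_dip("100.64.0.1", conn, backends) == pick_dip("100.64.0.1", conn, backends)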

Traffic engineering with forward fault correction

By: 
Hongqiang Harry Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter
Appears in: 
CCR August 2014

Network faults such as link failures and high switch configuration delays can cause heavy congestion and packet loss. Because it takes time for traffic engineering systems to detect and react to such faults, these conditions can persist for a long time, even tens of seconds. We propose forward fault correction (FFC), a proactive approach for handling faults. FFC spreads network traffic such that freedom from congestion is guaranteed under arbitrary combinations of up to k faults.
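
A brute-force Python sketch of what the FFC guarantee means follows: for every combination of up to k failed links, proportionally rerouting each flow onto its surviving paths must not exceed any link capacity. FFC encodes this requirement as constraints in the TE optimization itself; this exhaustive checker is only illustrative and exponential in k.

    # A brute-force check of congestion freedom under up to k link faults.
    # Traffic of a flow is rebalanced proportionally over its surviving paths.

    from itertools import combinations

    def congestion_free(flows, capacity, k):
        """flows: list of (demand, [(path_links, weight), ...]) with weights
        summing to 1. capacity: dict link -> capacity. Returns True if no
        combination of <= k failed links can congest a surviving link."""
        links = list(capacity)
        for r in range(k + 1):
            for failed in combinations(links, r):
                failed = set(failed)
                load = {l: 0.0 for l in links}
                for demand, paths in flows:
                    alive = [(p, w) for p, w in paths if not (set(p) & failed)]
                    total_w = sum(w for _, w in alive)
                    if total_w == 0:
                        return False      # flow disconnected: treat as violation
                    for path, w in alive:
                        for link in path:
                            load[link] += demand * w / total_w
                if any(load[l] > capacity[l] for l in links if l not in failed):
                    return False
        return True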

Understanding Data Center Traffic Characteristics

By: 
Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang
Appears in: 
CCR January 2010

As data centers become more and more central in Internet communications, both the research and operations communities have begun to explore how to better design and manage them. In this paper, we present a preliminary empirical study of end-to-end traffic patterns in data center networks that can inform and help evaluate research and operational approaches. We analyze SNMP logs collected at 19 data centers to examine temporal and spatial variations in link loads and losses. We find that while links in the core are heavily utilized, the ones closer to the edge observe a greater degree of loss.
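
As an illustration of the kind of SNMP post-processing such a study rests on, the Python sketch below turns two samples of a link's counters into interval utilization and loss rates; the field names are assumptions, and real counters (e.g., ifHCInOctets) wrap and need extra handling.

    # A minimal sketch: derive a link's utilization and loss rate from two
    # samples of its SNMP counters. Field names are illustrative assumptions.

    def link_stats(prev, curr, link_speed_bps):
        """prev/curr: dicts with 'ts' (seconds), 'octets', 'drops', 'packets'."""
        dt = curr["ts"] - prev["ts"]
        bits = (curr["octets"] - prev["octets"]) * 8
        utilization = bits / (dt * link_speed_bps)
        pkts = curr["packets"] - prev["packets"]
        loss_rate = (curr["drops"] - prev["drops"]) / pkts if pkts else 0.0
        return utilization, loss_rate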

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies

By: 
Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang
Appears in: 
CCR October 2007

Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex, and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults.
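
A minimal Python sketch of inference over such a graph follows: services depend on components, a failed or troubled component only degrades a dependent service with some probability (a noisy-OR combination), and candidate root causes are ranked by how well they explain the observed symptoms. The probabilities and scoring here are illustrative, not the paper's exact model.

    # A minimal sketch of root-cause ranking over a dependency graph
    # using a noisy-OR failure model. All parameters are illustrative.

    def explains(candidate_down, deps, symptoms, p_degrade=0.8):
        """Score how well assuming `candidate_down` components have failed
        explains the observed symptoms.

        deps: dict mapping a service to the set of components it depends on.
        symptoms: dict mapping a service to True if users see a problem.
        """
        score = 0.0
        for service, components in deps.items():
            hit = components & candidate_down
            # noisy-OR: the service fails unless every failed dependency "misses"
            p_fail = 1 - (1 - p_degrade) ** len(hit) if hit else 0.0
            observed = symptoms.get(service, False)
            score += p_fail if observed else (1 - p_fail)
        return score

    # Rank single-component explanations for the observed symptoms.
    deps = {"web": {"dns", "sql", "lan"}, "mail": {"dns", "lan"}}
    symptoms = {"web": True, "mail": True}
    ranked = sorted({"dns", "sql", "lan"},
                    key=lambda c: explains({c}, deps, symptoms), reverse=True)
    print(ranked)   # "dns" and "lan" explain both symptoms; "sql" only one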
