CCR Papers from August 2015

  • Paul Tune, Matthew Roughan

    Traffic matrices describe the volume of traffic between a set of sources and destinations within a network. These matrices are used in a variety of tasks in network planning and traffic engineering, such as the design of network topologies. Traffic matrices naturally possess complex spatiotemporal characteristics, but their proprietary nature means that little data about them is available publicly, and this situation is unlikely to change. Our goal is to develop techniques to synthesize traffic matrices for researchers who wish to test new network applications or protocols. The paucity of available data and the desire to build a general framework for synthesis that could work in various settings require a new look at this problem. We show how the principle of maximum entropy can be used to generate a wide variety of traffic matrices constrained by the needs of a particular task, and the available information, but otherwise avoiding hidden assumptions about the data. We demonstrate how the framework encompasses existing models and measurements, and we apply it in a simple case study to illustrate the value.
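
    As a concrete illustration (a hypothetical sketch, not the authors' tool): when the only constraints are each node's total outbound and inbound volumes, the maximum-entropy traffic matrix factorizes into the classic gravity model; richer constraints would reshape the solution.

        # Maximum-entropy traffic-matrix synthesis for the simplest
        # constraint set: known row sums (egress volumes) and column sums
        # (ingress volumes). With only these marginals, the max-entropy
        # solution is the gravity model T[i][j] = out[i] * in[j] / total.

        def gravity_tm(out_totals, in_totals):
            total = sum(out_totals)
            assert abs(total - sum(in_totals)) < 1e-9, "marginals must balance"
            return [[o * i / total for i in in_totals] for o in out_totals]

        # Example: three PoPs with asymmetric egress/ingress volumes (Gbps).
        for row in gravity_tm([30.0, 50.0, 20.0], [40.0, 40.0, 20.0]):
            print([round(x, 2) for x in row])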

  • Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, Alex C. Snoeren

    Large cloud service providers have invested in increasingly larger datacenters to house the computing infrastructure required to support their services. Accordingly, researchers and industry practitioners alike have devoted a great deal of effort to designing network fabrics that interconnect and manage the traffic within these datacenters in a performant yet efficient fashion. Unfortunately, datacenter operators are generally reticent to share the actual requirements of their applications, making it challenging to evaluate the practicality of any particular design. Moreover, the limited large-scale workload information available in the literature has, for better or worse, heretofore largely been provided by a single datacenter operator whose use cases may not be widespread. In this work, we report upon the network traffic observed in some of Facebook's datacenters. While Facebook operates a number of traditional datacenter services like Hadoop, its core Web service and supporting cache infrastructure exhibit a number of behaviors that contrast with those reported in the literature. We report on the contrasting locality, stability, and predictability of network traffic in Facebook's datacenters, and comment on their implications for network architecture, traffic engineering, and switch design.

  • Liang Zheng, Carlee Joe-Wong, Chee Wei Tan, Mung Chiang, Xinyu Wang

    Amazon's Elastic Compute Cloud (EC2) uses auction-based spot pricing to sell spare capacity, allowing users to bid for cloud resources at a highly reduced rate. Amazon sets the spot price dynamically and accepts user bids above this price. Jobs with lower bids (including those already running) are interrupted and must wait for a lower spot price before resuming. Spot pricing thus raises two basic questions: how might the provider set the price, and what prices should users bid? Computing users' bidding strategies is particularly challenging: higher bid prices reduce the probability of interruptions, and thus the extra time needed to recover from them, but may increase users' cost. We address these questions in three steps: (1) modeling the cloud provider's setting of the spot price and matching the model to historically offered prices, (2) deriving optimal bidding strategies for different job requirements and interruption overheads, and (3) adapting these strategies to MapReduce jobs with master and slave nodes having different interruption overheads. We run our strategies on EC2 for a variety of job sizes and instance types, showing that spot pricing reduces user cost by 90% with a modest increase in completion time compared to on-demand pricing.
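
    The bid-versus-cost tradeoff can be explored with a back-of-envelope simulation over a historical price trace (hypothetical code: the recovery model, trace, and parameters are illustrative, not the authors' strategy):

        # A job runs in any time slot where the spot price is at or below
        # the bid; a price spike interrupts it, and each interruption
        # costs some redone work (the recovery overhead).

        def evaluate_bid(prices, bid, work_slots, recovery_slots):
            done, paid, running = 0, 0.0, False
            for t, price in enumerate(prices):
                if price <= bid:
                    paid += price              # pay the spot price while running
                    done += 1
                    running = True
                elif running:
                    done = max(0, done - recovery_slots)  # interruption: redo work
                    running = False
                if done >= work_slots:
                    return paid, t + 1         # (total cost, finish slot)
            return paid, None                  # job never finished in the trace

        trace = [0.10, 0.12, 0.35, 0.11, 0.50, 0.12, 0.13, 0.11] * 10
        for bid in (0.12, 0.20, 0.60):
            print(bid, evaluate_bid(trace, bid, work_slots=20, recovery_slots=2))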

  • Hirochika Asai, Yasuhiro Ohara

    The Internet of Things is driving routing table explosion, so an inexpensive approach to IP routing table lookup is needed to cope with the ever-growing size of the Internet. We contribute a fast and scalable software routing lookup algorithm based on a multiway trie, called Poptrie. Named after our approach to traversing the tree, it leverages the population count instruction on bit-vector indices for the descendant nodes to compress the data structure within the CPU cache. Poptrie outperforms the state-of-the-art technologies, Tree BitMap, DXR, and SAIL, in all of our evaluations using random and real destination queries on 35 routing tables, including a real global tier-1 ISP's full-route routing table. Poptrie peaks between 174 and over 240 million lookups per second (Mlps) with a single core on tables with 500-800k routes, consistently 4-578% faster than all competing algorithms in all the tests we ran. We provide a comprehensive performance evaluation, including CPU cycle analysis. This paper shows the suitability of Poptrie for the future Internet, including IPv6, where larger routing tables with longer prefixes are expected.
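
    The popcount trick can be sketched in a few lines (a toy with a 6-bit stride; the real Poptrie separates leaf and internal bit-vectors and packs nodes far more tightly):

        # Each node keeps a 64-bit vector marking which of its 64 possible
        # children exist; a child's slot in the dense children array is the
        # popcount of the vector bits below its 6-bit chunk value.

        class Node:
            def __init__(self):
                self.vector = 0        # bit i set => a child exists for chunk i
                self.children = []     # dense array indexed by popcount
                self.leaf = None       # next hop for a prefix ending here

        def child(node, chunk):
            if not (node.vector >> chunk) & 1:
                return None
            return node.children[bin(node.vector & ((1 << chunk) - 1)).count("1")]

        def insert_child(node, chunk, new_child):
            index = bin(node.vector & ((1 << chunk) - 1)).count("1")
            if (node.vector >> chunk) & 1:
                node.children[index] = new_child
            else:
                node.children.insert(index, new_child)
                node.vector |= 1 << chunk

        def lookup(root, addr32):      # longest-prefix match, 6 bits at a time
            node, best = root, None
            for shift in (26, 20, 14, 8, 2):   # last 2 address bits ignored here
                if node.leaf is not None:
                    best = node.leaf
                nxt = child(node, (addr32 >> shift) & 0x3F)
                if nxt is None:
                    return best
                node = nxt
            return node.leaf or best

        root, n = Node(), Node()
        n.leaf = "next-hop-A"          # a toy /6 route: top 6 bits == 10
        insert_child(root, 10, n)
        print(lookup(root, 10 << 26))  # -> "next-hop-A"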

  • Matthew K. Mukerjee, David Naylor, Junchen Jiang, Dongsu Han, Srinivasan Seshan, Hui Zhang
  • Brandon Schlinker, Radhika Niranjan Mysore, Sean Smith, Jeffrey C. Mogul, Amin Vahdat, Minlan Yu, Ethan Katz-Bassett, Michael Rubin

    The design space for large, multipath datacenter networks is large and complex, and no one design fits all purposes. Network architects must trade off many criteria to design cost-effective, reliable, and maintainable networks, and typically cannot explore much of the design space. We present Condor, our approach to enabling a rapid, efficient design cycle. Condor allows architects to express their requirements as constraints via a Topology Description Language (TDL), rather than having to directly specify network structures. Condor then uses constraint-based synthesis to rapidly generate candidate topologies, which can be analyzed against multiple criteria. We show that TDL supports concise descriptions of topologies such as fat-trees, BCube, and DCell; that we can generate known and novel variants of fat-trees with simple changes to a TDL file; and that we can synthesize large topologies in tens of seconds. We also show that Condor supports the daunting task of designing multi-phase network expansions that can be carried out on live networks.

  • Pan Hu, Pengyu Zhang, Deepak Ganesan

    Backscatter provides the dual benefits of energy harvesting and low-power communication, making it attractive to a broad class of wireless sensors. But the design of a protocol that enables extremely power-efficient radios for harvesting-based sensors as well as high-rate data transfer for data-rich sensors presents a conundrum. In this paper, we present a new fully asymmetric backscatter communication protocol where nodes blindly transmit data as and when they sense. This model enables fully flexible node designs, from extraordinarily power-efficient backscatter radios that consume barely a few micro-watts to high-throughput radios that can stream at hundreds of Kbps while consuming a paltry tens of micro-watts. The challenge, however, lies in decoding concurrent streams at the reader, which we achieve using a novel combination of time-domain separation of interleaved signal edges and phase-domain separation of colliding transmissions. We provide an implementation of our protocol, LF-Backscatter, and show that it can achieve an order of magnitude or more improvement in throughput, latency and power over state-of-the-art alternatives.

  • Alok Kumar, Sushant Jain, Uday Naik, Anand Raghuraman, Nikhil Kasinadhuni, Enrique Cauich Zermeno, C. Stephen Gunn, Jing Ai, Björn Carlin, Mihai Amarandei-Stavila, Mathieu Robin, Aspi Siganporia, Stephen Stuart, Amin Vahdat

    WAN bandwidth remains a constrained resource that is economically infeasible to substantially overprovision. Hence, it is important to allocate capacity according to service priority and based on the incremental value of additional allocation. For example, it may be the highest priority for one service to receive 10Gb/s of bandwidth but upon reaching such an allocation, incremental priority may drop sharply favoring allocation to other services. Motivated by the observation that individual flows with fixed priority may not be the ideal basis for bandwidth allocation, we present the design and implementation of Bandwidth Enforcer (BwE), a global, hierarchical bandwidth allocation infrastructure. BwE supports: i) service-level bandwidth allocation following prioritized bandwidth functions where a service can represent an arbitrary collection of flows, ii) independent allocation and delegation policies according to user-defined hierarchy, all accounting for a global view of bandwidth and failure conditions, iii) multi-path forwarding common in traffic-engineered networks, and iv) a central administrative point to override (perhaps faulty) policy during exceptional conditions. BwE has delivered more service-efficient bandwidth utilization and simpler management in production for multiple years.
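
    The flavor of allocation with prioritized bandwidth functions can be sketched as follows (hypothetical code, mirroring the intuition rather than BwE's hierarchy: piecewise-linear functions map an abstract fair-share level to bandwidth, and the allocator binary-searches for the highest level that fits the link):

        def bw_at(level, points):
            """Piecewise-linear bandwidth function from (level, bandwidth) points."""
            prev_l, prev_b = 0.0, 0.0
            for l, b in points:
                if level <= l:
                    return prev_b + (level - prev_l) / (l - prev_l) * (b - prev_b)
                prev_l, prev_b = l, b
            return points[-1][1]                # saturates past the last point

        def allocate(capacity, services, iters=50):
            lo, hi = 0.0, max(l for s in services for l, _ in s)
            for _ in range(iters):              # binary search on the share level
                mid = (lo + hi) / 2
                if sum(bw_at(mid, s) for s in services) <= capacity:
                    lo = mid
                else:
                    hi = mid
            return [round(bw_at(lo, s), 2) for s in services]

        # Service A wants 10G at high priority, then flattens; B ramps up later.
        A = [(1.0, 10.0), (2.0, 11.0)]
        B = [(1.0, 2.0), (2.0, 12.0)]
        print(allocate(15.0, [A, B]))           # A keeps ~10.3G, B gets the rest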

  • Keon Jang, Justine Sherry, Hitesh Ballani, Toby Moncaster

    Many cloud applications can benefit from guaranteed latency for their network messages; however, providing such predictability is hard, especially in multi-tenant datacenters. We identify three key requirements for such predictability: guaranteed network bandwidth, guaranteed packet delay and guaranteed burst allowance. We present Silo, a system that offers these guarantees in multi-tenant datacenters. Silo leverages the tight coupling between bandwidth and delay: controlling tenant bandwidth leads to deterministic bounds on network queuing delay. Silo builds upon network calculus to place tenant VMs with competing requirements such that they can coexist. A novel hypervisor-based policing mechanism achieves packet pacing at sub-microsecond granularity, ensuring tenants do not exceed their allowances. We have implemented a Silo prototype comprising a VM placement manager and a Windows filter driver. Silo does not require any changes to applications, guest OSes or network switches. We show that Silo can ensure predictable message latency for cloud applications while imposing low overhead.
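
    The bandwidth-delay coupling Silo exploits comes straight out of network calculus; a back-of-envelope admission check (illustrative numbers, not Silo's actual placement algorithm) looks like this:

        # If each VM is policed to a token bucket (rate r Gbps, burst b bits),
        # then on a link of capacity C with sum(r) <= C, the queuing delay
        # is bounded by sum(b) / C.

        def admissible(flows, capacity_gbps, delay_budget_us):
            total_rate = sum(r for r, _ in flows)
            total_burst_bits = sum(b for _, b in flows)
            if total_rate > capacity_gbps:
                return False
            worst_delay_us = total_burst_bits / (capacity_gbps * 1e9) * 1e6
            return worst_delay_us <= delay_budget_us

        # Three VMs, each guaranteed 2 Gbps with 15 KB bursts, on a 10 Gbps link:
        flows = [(2.0, 15 * 1024 * 8)] * 3
        print(admissible(flows, capacity_gbps=10.0, delay_budget_us=50.0))  # True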

  • Mosharaf Chowdhury, Ion Stoica

    Inter-coflow scheduling improves application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information and ignore cluster dynamics like pipelining, task failures, and speculative executions, which limit their applicability. Schedulers without prior knowledge compromise on performance to avoid head-of-line blocking. In this paper, we present Aalo that strikes a balance and efficiently schedules coflows without prior knowledge. Aalo employs Discretized Coflow-Aware Least-Attained Service (D-CLAS) to separate coflows into a small number of priority queues based on how much they have already sent across the cluster. By performing prioritization across queues and by scheduling coflows in FIFO order within each queue, Aalo's non-clairvoyant scheduler reduces coflow completion times while guaranteeing starvation freedom. EC2 deployments and trace-driven simulations show that communication stages complete 1.93x faster on average and 3.59x faster at the 95th percentile using Aalo in comparison to per-flow mechanisms. Aalo's performance is comparable to that of solutions using prior knowledge, and Aalo outperforms them in the presence of cluster dynamics.
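
    The D-CLAS core is compact enough to sketch directly (the queue count and thresholds here are illustrative, not Aalo's tuned defaults):

        from collections import deque

        NUM_QUEUES, FIRST_THRESHOLD, FACTOR = 4, 10 * 2**20, 10   # 10 MB, x10 steps
        queues = [deque() for _ in range(NUM_QUEUES)]

        def queue_index(bytes_sent):
            threshold, q = FIRST_THRESHOLD, 0
            while q < NUM_QUEUES - 1 and bytes_sent >= threshold:
                threshold *= FACTOR                # exponentially spaced queues
                q += 1
            return q

        def enqueue(coflow_id, bytes_sent):        # demotion = re-enqueue lower
            queues[queue_index(bytes_sent)].append(coflow_id)

        def next_coflow():                         # strict priority across queues,
            for q in queues:                       # FIFO within each queue
                if q:
                    return q.popleft()
            return None

        enqueue("small-job", 1 * 2**20)            # 1 MB sent -> highest priority
        enqueue("elephant", 5 * 2**30)             # 5 GB sent -> lowest priority
        print(next_coflow())                       # -> "small-job"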

  • Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, Minlan Yu

    As clusters continue to grow in size and complexity, providing scalable and predictable performance is an increasingly important challenge. A crucial roadblock to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. To date, speculative execution has been widely adopted to mitigate the impact of stragglers. However, speculation mechanisms are designed and operated independently of job scheduling when, in fact, scheduling a speculative copy of a task has a direct impact on the resources available for other jobs. In this work, we present Hopper, a job scheduler that is speculation-aware, i.e., that integrates the tradeoffs associated with speculation into job scheduling decisions. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that 50% (66%) improvements over state-of-the-art centralized (decentralized) schedulers and speculation strategies can be achieved through the coordination of scheduling and speculation.

  • David Naylor, Kyle Schomp, Matteo Varvello, Ilias Leontiadis, Jeremy Blackburn, Diego R. López, Konstantina Papagiannaki, Pablo Rodriguez Rodriguez, Peter Steenkiste

    A significant fraction of Internet traffic is now encrypted and HTTPS will likely be the default in HTTP/2. However, Transport Layer Security (TLS), the standard protocol for encryption in the Internet, assumes that all functionality resides at the endpoints, making it impossible to use in-network services that optimize network resource usage, improve user experience, and protect clients and servers from security threats. Re-introducing in-network functionality into TLS sessions today is done through hacks, often weakening overall security. In this paper we introduce multi-context TLS (mcTLS), which extends TLS to support middleboxes. mcTLS breaks the current "all-or-nothing" security model by allowing endpoints and content providers to explicitly introduce middleboxes in secure end-to-end sessions while controlling which parts of the data they can read or write. We evaluate a prototype mcTLS implementation in both controlled and "live" experiments, showing that its benefits come at the cost of minimal overhead. More importantly, we show that mcTLS can be incrementally deployed and requires only small changes to client, server, and middlebox software.

  • Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming Zhang, Ben Y. Zhao, Haitao Zheng

    Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. In this task, datacenter networks (DCNs) present unique challenges with their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets inside large volumes of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today can achieve both the specificity and scale required for this task. We present Everflow, a packet-level network telemetry system for large DCNs. Everflow traces specific packets by implementing a powerful packet filter on top of "match and mirror" functionality of commodity switches. It shuffles captured packets to multiple analysis servers using load balancers built on switch ASICs, and it sends "guided probes" to test or confirm potential faults. We present experiments that demonstrate Everflow's scalability, and share experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.

  • Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, Ming Zhang

    Modern datacenter applications demand high throughput (40Gbps) and ultra-low latency (< 10 µs per hop) from the network, with low CPU overhead. Standard TCP/IP stacks cannot meet these requirements, but Remote Direct Memory Access (RDMA) can. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. However, PFC can lead to poor application performance due to problems like head-of-line blocking and unfairness. To alleviate these problems, we introduce DCQCN, an end-to-end congestion control scheme for RoCEv2. To optimize DCQCN performance, we build a fluid model and provide guidelines for tuning switch buffer thresholds and other protocol parameters. Using a 3-tier Clos network testbed, we show that DCQCN dramatically improves throughput and fairness of RoCEv2 RDMA traffic. DCQCN is implemented in Mellanox NICs, and is being deployed in Microsoft's datacenters.
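
    The sender-side control loop has a simple shape (sketched from the paper's description; the constants are illustrative, and the real algorithm's separate alpha and rate timers are merged into one here):

        class DcqcnRate:
            G, R_AI = 1 / 256, 0.05           # EWMA gain; additive step (Gbps)

            def __init__(self, line_rate):
                self.rc = line_rate           # current sending rate
                self.rt = line_rate           # target rate for recovery
                self.alpha = 1.0              # congestion estimate

            def on_cnp(self):                 # ECN-marked feedback arrived
                self.rt = self.rc
                self.rc *= 1 - self.alpha / 2
                self.alpha = (1 - self.G) * self.alpha + self.G

            def on_timer(self):               # a period passed with no CNP
                self.alpha *= 1 - self.G      # decay the congestion estimate
                self.rt += self.R_AI          # additive increase of the target
                self.rc = (self.rc + self.rt) / 2   # fast recovery toward it

        r = DcqcnRate(line_rate=40.0)
        r.on_cnp(); r.on_timer()
        print(round(r.rc, 2), round(r.alpha, 3))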

  • Sam Burnett, Nick Feamster

    Despite the pervasiveness of Internet censorship, we have scant data on its extent, mechanisms, and evolution. Measuring censorship is challenging: it requires continual measurement of reachability to many target sites from diverse vantage points. Amassing suitable vantage points for longitudinal measurement is difficult; existing systems have achieved only small, short-lived deployments. We observe, however, that most Internet users access content via Web browsers, and the very nature of Web site design allows browsers to make requests to domains with different origins than the main Web page. We present Encore, a system that harnesses cross-origin requests to measure Web filtering from a diverse set of vantage points without requiring users to install custom software, enabling longitudinal measurements from many vantage points. We explain how Encore induces Web clients to perform cross-origin requests that measure Web filtering, design a distributed platform for scheduling and collecting these measurements, show the feasibility of a global-scale deployment with a pilot study and an analysis of potentially censored Web content, identify several cases of filtering in six months of measurements, and discuss ethical concerns that would arise with widespread deployment.

  • Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, Bruno Sinopoli

    User-perceived quality-of-experience (QoE) is critical in Internet video applications as it impacts revenues for content providers and delivery systems. Given that there is little support in the network for optimizing such measures, bottlenecks could occur anywhere in the delivery system. Consequently, a robust bitrate adaptation algorithm in client-side players is critical to ensure good user experience. Previous studies have shown key limitations of state-of-the-art commercial solutions and proposed a range of heuristic fixes. Despite the emergence of several proposals, there is still a distinct lack of consensus on: (1) How best to design this client-side bitrate adaptation logic (e.g., use rate estimates vs. buffer occupancy); (2) How well specific classes of approaches will perform under diverse operating regimes (e.g., high throughput variability); or (3) How they actually balance different QoE objectives (e.g., startup delay vs. rebuffering). To this end, this paper makes three key technical contributions. First, to bring some rigor to this space, we develop a principled control-theoretic model to reason about a broad spectrum of strategies. Second, we propose a novel model predictive control algorithm that can optimally combine throughput and buffer occupancy information to outperform traditional approaches. Third, we present a practical implementation in a reference video player to validate our approach using realistic trace-driven emulations.
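
    A toy version of the model-predictive step conveys the idea (bitrates, horizon, and QoE weights are illustrative, not the paper's tuned algorithm): enumerate bitrate plans over a short horizon, simulate the buffer under a throughput forecast, and commit only the first decision.

        from itertools import product

        BITRATES = [0.3, 0.75, 1.2, 2.85]            # Mbps
        CHUNK_SEC, HORIZON = 4.0, 3
        REBUF_PENALTY, SWITCH_PENALTY = 4.3, 1.0

        def plan(buffer_sec, last_rate, predicted_mbps):
            best = (float("-inf"), last_rate)
            for seq in product(BITRATES, repeat=HORIZON):
                buf, prev, score = buffer_sec, last_rate, 0.0
                for rate in seq:
                    download = rate * CHUNK_SEC / predicted_mbps   # seconds
                    rebuf = max(0.0, download - buf)               # stall time
                    buf = max(0.0, buf - download) + CHUNK_SEC
                    score += (rate - REBUF_PENALTY * rebuf
                              - SWITCH_PENALTY * abs(rate - prev))
                    prev = rate
                best = max(best, (score, seq[0]))
            return best[1]                      # first step of the best plan

        print(plan(buffer_sec=8.0, last_rate=1.2, predicted_mbps=2.0))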

  • Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, Sachin Katti

    This paper presents the design and implementation of SpotFi, an accurate indoor localization system that can be deployed on commodity WiFi infrastructure. SpotFi only uses information that is already exposed by WiFi chips and does not require any hardware or firmware changes, yet achieves the same accuracy as state-of-the-art localization systems. SpotFi makes two key technical contributions. First, SpotFi incorporates super-resolution algorithms that can accurately compute the angle of arrival (AoA) of multipath components even when the access point (AP) has only three antennas. Second, it incorporates novel filtering and estimation techniques to identify the AoA of the direct path between the localization target and the AP, assigning each path a value reflecting how likely it is to be the direct path. Our experiments in a multipath rich indoor environment show that SpotFi achieves a median accuracy of 40 cm and is robust to indoor hindrances such as obstacles and multipath.

  • Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar

    To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input data fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs allows for significantly improving their network locality and freeing up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule which (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache Yarn, and evaluate it on a 210 machine cluster using production workloads. Compared to Yarn's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median completion time by up to 56%, with a 20-90% reduction in data transferred across racks.

  • Sanjit Biswas, John Bicket, Edmund Wong, Raluca Musaloiu-E, Apurv Bhartia, Dan Aguayo

    Meraki is a cloud-based network management system which provides centralized configuration, monitoring, and network troubleshooting tools across hundreds of thousands of sites worldwide. As part of its architecture, the Meraki system has built a database of time-series measurements of wireless link, client, and application behavior for monitoring and debugging purposes. This paper studies an anonymized subset of measurements, containing data from approximately ten thousand radio access points, tens of thousands of links, and 5.6 million clients from one-week periods in January 2014 and January 2015, to provide a deeper understanding of real-world network behavior. This paper observes the following phenomena: wireless network usage continues to grow quickly, driven mostly by growth in the number of devices connecting to each network. Intermediate link delivery rates are common indoors across a wide range of deployment environments. Typical access points share spectrum with dozens of nearby networks, but the presence of a network on a channel does not predict channel utilization. Most access points see 2.4 GHz channel utilization of 20% or more, with the top decile seeing greater than 50%, and the majority of the channel use contains decodable 802.11 headers.

  • Dinesh Bharadia, Kiran Raj Joshi, Manikanta Kotaru, Sachin Katti

    We present BackFi, a novel communication system that enables high throughput, long range communication between very low power backscatter IoT sensors and WiFi APs using ambient WiFi transmissions as the excitation signal. Specifically, we show that it is possible to design IoT sensors and WiFi APs such that the WiFi AP, in the process of transmitting data to normal WiFi clients, can decode backscatter signals which the IoT sensors generate by modulating information onto the ambient WiFi transmission. We show via prototypes and experiments that it is possible to achieve communication rates of up to 5 Mbps at a range of 1 m and 1 Mbps at a range of 5 meters. Such performance is one to three orders of magnitude better than the best known prior WiFi backscatter system [27, 25]. The BackFi design is energy-efficient, as it relies on backscattering alone and needs insignificant power, hence the energy consumed per bit is small.

  • Stevens Le Blond, David Choffnes, William Caldwell, Peter Druschel, Nicholas Merritt

    Effectively anonymizing Voice-over-IP (VoIP) calls requires a scalable anonymity network that is resilient to traffic analysis and has sufficiently low delay for high-quality voice calls. The popular Tor anonymity network, for instance, is not designed for the former and cannot typically achieve the latter. In this paper, we present the design, implementation, and experimental evaluation of Herd, an anonymity network where a set of dedicated, fully interconnected cloud-based proxies yield suitably low-delay circuits, while untrusted superpeers add scalability. Herd provides caller/callee anonymity among the clients within a trust zone (e.g., jurisdiction) and under a strong adversarial model. Simulations based on a trace of 370 million mobile phone calls among 10.8 million users indicate that Herd achieves anonymity among millions of clients with low bandwidth requirements, and that superpeers decrease the bandwidth and CPU requirements of the trusted infrastructure by an order of magnitude. Finally, experiments using a prototype deployment on Amazon EC2 show that Herd has a delay low enough for high-quality calls in most cases.

  • Paolo Costa, Hitesh Ballani, Kaveh Razavi, Ian Kash

    Rack-scale computers, comprising a large number of microservers connected by a direct-connect topology, are expected to replace servers as the building block in data centers. We focus on the problem of routing and congestion control across the rack's network, and find that high path diversity in rack topologies, in combination with workload diversity across it, means that traditional solutions are inadequate. We introduce R2C2, a network stack for rack-scale computers that provides flexible and efficient routing and congestion control. R2C2 leverages the fact that the scale of rack topologies allows for low-overhead broadcasting to ensure that all nodes in the rack are aware of all network flows. We thus achieve rate-based congestion control without any probing; each node independently determines the sending rate for its flows while respecting the provider's allocation policies. For routing, nodes dynamically choose the routing protocol for each flow in order to maximize overall utility. Through a prototype deployed across a rack emulation platform and a packet-level simulator, we show that R2C2 achieves very low queuing and high throughput for diverse and bursty workloads, and that routing flexibility can provide significant throughput gains.

  • Hitesh Ballani, Paolo Costa, Christos Gkantsidis, Matthew P. Grosvenor, Thomas Karagiannis, Lazaros Koromilas, Greg O'Shea

    Many network functions executed in modern datacenters, e.g., load balancing, application-level QoS, and congestion control, exhibit three common properties at the data plane: they need to access and modify state, to perform computations, and to access application semantics -- this is critical since many network functions are best expressed in terms of application-level messages. In this paper, we argue that the end hosts are a natural enforcement point for these functions and we present Eden, an architecture for implementing network functions at end hosts with minimal network support. Eden comprises three components, a centralized controller, an enclave at each end host, and Eden-compliant applications called stages. To implement network functions, the controller configures stages to classify their data into messages and the enclaves to apply action functions based on a packet's class. Our Eden prototype includes enclaves implemented both in the OS kernel and on programmable NICs. Through case studies, we show how application-level classification and the ability to run actual programs on the data-path allows Eden to efficiently support a broad range of network functions at the network's edge.

  • Maria Konte, Roberto Perdisci, Nick Feamster

    Bulletproof hosting Autonomous Systems (ASes)--malicious ASes fully dedicated to supporting cybercrime--provide freedom and resources for a cyber-criminal to operate. Their services include hosting a wide range of illegal content, botnet C&C servers, and other malicious resources. Thousands of new ASes are registered every year, many of which are often used exclusively to facilitate cybercrime. A natural approach to squelching bulletproof hosting ASes is to develop a reputation system that can identify them for takedown by law enforcement and as input to other attack detection systems (e.g., spam filters, botnet detection systems). Unfortunately, current AS reputation systems rely primarily on data-plane monitoring of malicious activity from IP addresses (and thus can only detect malicious ASes after attacks are underway), and are not able to distinguish between malicious and legitimate but abused ASes.

    As a complement to these systems, in this paper, we explore a fundamentally different approach to establishing AS reputation. We present ASwatch, a system that identifies malicious ASes using exclusively the control-plane (i.e., routing) behavior of ASes. ASwatch's design is based on the intuition that, in an attempt to evade possible detection and remediation efforts, malicious ASes exhibit "agile" control plane behavior (e.g., short-lived routes, aggressive re-wiring). We evaluate our system on known malicious ASes; our results show that ASwatch detects up to 93% of malicious ASes with a 5% false positive rate, which is low enough to effectively complement existing defense systems.

  • Renaud Hartert, Stefano Vissicchio, Pierre Schaus, Olivier Bonaventure, Clarence Filsfils, Thomas Telkamp, Pierre Francois

    SDN simplifies network management by relying on declarativity (high-level interface) and expressiveness (network flexibility). We propose a solution to support those features while preserving the high robustness and scalability needed in carrier-grade networks. Our solution is based on (i) a two-layer architecture separating connectivity and optimization tasks; and (ii) a centralized optimizer called DEFO, which translates high-level goals expressed almost in natural language into compliant network configurations. Our evaluation on real and synthetic topologies shows that DEFO improves the state of the art by (i) achieving better trade-offs for classic goals covered by previous works, (ii) supporting a larger set of goals (refined traffic engineering and service chaining), and (iii) optimizing large ISP networks in a few seconds. We also quantify the gains of our implementation, running Segment Routing on top of IS-IS, over possible alternatives (RSVP-TE and OpenFlow).

  • Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, Varugis Kurien

    Can we get network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling if an application perceived latency issue is caused by the network or not, defining and tracking network service level agreement (SLA), and automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used by not only network software developers and engineers, but also application and service developers and operators.
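
    The measurement primitive behind such a mesh is no more exotic than a timed TCP connect; a minimal stand-in (placeholder addresses, not Pingmesh's implementation) might look like this:

        import socket, time

        def tcp_rtt_ms(host, port=80, timeout=1.0):
            start = time.monotonic()
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return (time.monotonic() - start) * 1000.0
            except OSError:
                return None                     # treat as probe loss

        # Each server probes its assigned peers; the rows collected across
        # all servers form the full latency mesh used for SLA tracking.
        peers = ["10.0.0.2", "10.0.0.3"]
        print({p: tcp_rtt_ms(p) for p in peers})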

  • Stefano Vissicchio, Olivier Tilmans, Laurent Vanbever, Jennifer Rexford

    Centralizing routing decisions offers tremendous flexibility, but sacrifices the robustness of distributed protocols. In this paper, we present Fibbing, an architecture that achieves both flexibility and robustness through central control over distributed routing. Fibbing introduces fake nodes and links into an underlying link-state routing protocol, so that routers compute their own forwarding tables based on the augmented topology. Fibbing is expressive, and readily supports flexible load balancing, traffic engineering, and backup routes. Based on high-level forwarding requirements, the Fibbing controller computes a compact augmented topology and injects the fake components through standard routing-protocol messages. Fibbing works with any unmodified routers speaking OSPF. Our experiments also show that it can scale to large networks with many forwarding requirements, introduces minimal overhead, and quickly reacts to network and controller failures.
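
    The effect of a lie on shortest-path routing is easy to reproduce with networkx standing in for the link-state computation (node names and weights are made up):

        import networkx as nx

        g = nx.Graph()
        g.add_weighted_edges_from([("A", "B", 1), ("B", "D", 1),
                                   ("A", "C", 1), ("C", "D", 3)])
        print(nx.shortest_path(g, "A", "D", weight="weight"))  # ['A', 'B', 'D']

        # Inject a fake node attached to C that advertises destination D at
        # low cost. Routers recompute over the augmented topology and now
        # forward A -> D via C; the fake hop never exists on the data path.
        g.add_weighted_edges_from([("C", "fake", 0.2), ("fake", "D", 0.2)])
        print(nx.shortest_path(g, "A", "D", weight="weight"))  # via C and "fake"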

  • Yasir Zaki, Thomas Pötsch, Jay Chen, Lakshminarayanan Subramanian, Carmelita Görg

    Legacy congestion controls including TCP and its variants are known to perform poorly over cellular networks due to highly variable capacities over short time scales, self-inflicted packet delays, and packet losses unrelated to congestion. To cope with these challenges, we present Verus, an end-to-end congestion control protocol that uses delay measurements to react quickly to the capacity changes in cellular networks without explicitly attempting to predict the cellular channel dynamics. The key idea of Verus is to continuously learn a delay profile that captures the relationship between end-to-end packet delay and outstanding window size over short epochs, and to use this relationship to increment or decrement the window size based on the observed short-term packet delay variations. While the delay-based control is primarily for congestion avoidance, Verus uses standard TCP features including multiplicative decrease upon packet loss and slow start. Through a combination of simulations, empirical evaluations using cellular network traces, and real-world evaluations against standard TCP flavors and state-of-the-art protocols like Sprout, we show that Verus outperforms these protocols in cellular channels. In comparison to TCP Cubic, Verus achieves an order of magnitude (> 10x) reduction in delay over 3G and LTE networks while achieving comparable throughput (sometimes marginally higher). In comparison to Sprout, Verus achieves up to 30% higher throughput in rapidly changing cellular networks.
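
    The delay-profile mechanism can be caricatured in a few lines (a simplified guess at the structure, not the Verus protocol: epoching, loss handling, and slow start are omitted, and the parameters are invented):

        import bisect

        class DelayProfile:
            def __init__(self, target_ms=60.0):
                self.samples = []              # sorted (delay_ms, window) pairs
                self.target = target_ms

            def observe(self, window, delay_ms):
                bisect.insort(self.samples, (delay_ms, window))

            def next_window(self, last_delay_ms, d_max_ms):
                if last_delay_ms > 0.75 * d_max_ms:
                    self.target *= 0.9         # back off after a delay spike
                else:
                    self.target += 2.0         # otherwise probe for more
                # Largest window whose observed delay stayed under the target.
                i = bisect.bisect_right(self.samples, (self.target, float("inf")))
                return self.samples[i - 1][1] if i else 1

        p = DelayProfile()
        for w, d in [(10, 40), (20, 55), (30, 75), (40, 110)]:
            p.observe(w, d)
        print(p.next_window(last_delay_ms=55, d_max_ms=110))   # -> 20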

  • Ramakrishnan Durairajan, Paul Barford, Joel Sommers, Walter Willinger

    The complexity and enormous costs of installing new long-haul fiber-optic infrastructure have led to a significant amount of infrastructure sharing in previously installed conduits. In this paper, we study the characteristics and implications of infrastructure sharing by analyzing the long-haul fiber-optic network in the US. We start by using fiber maps provided by tier-1 ISPs and major cable providers to construct a map of the long-haul US fiber-optic infrastructure. We also rely on previously underutilized data sources in the form of public records from federal, state, and municipal agencies to improve the fidelity of our map. We quantify the resulting map's connectivity characteristics and confirm a clear correspondence between long-haul fiber-optic, roadway, and railway infrastructures. Next, we examine the prevalence of high-risk links by mapping end-to-end paths resulting from large-scale traceroute campaigns onto our fiber-optic infrastructure map. We show how both risk and latency (i.e., propagation delay) can be reduced by deploying new links along previously unused transportation corridors and rights-of-way. In particular, focusing on a subset of high-risk links is sufficient to improve the overall robustness of the network to failures. Finally, we discuss the implications of our findings on issues related to performance, net neutrality, and policy decision-making.

  • Fangfei Chen, Ramesh K. Sitaraman, Marcelo Torres

    Content Delivery Networks (CDNs) deliver much of the world's web, video, and application content on the Internet today. A key component of a CDN is the mapping system that uses the DNS protocol to route each client's request to a "proximal" server that serves the requested content. While traditional mapping systems identify a client using the IP of its name server, we describe our experience in building and rolling out a novel system called end-user mapping that identifies the client directly by using a prefix of the client's IP address. Using measurements from Akamai's production network during the roll-out, we show that end-user mapping provides significant performance benefits for clients who use public resolvers, including an eight-fold decrease in mapping distance, a two-fold decrease in RTT and content download time, and a 30% improvement in the time-to-first-byte. We also quantify the scaling challenges in implementing end-user mapping, such as the eight-fold increase in DNS queries. Finally, we show that a CDN with a larger number of deployment locations is likely to benefit more from end-user mapping than a CDN with a smaller number of deployments.

  • Justine Sherry, Peter Xiang Gao, Soumya Basu, Aurojit Panda, Arvind Krishnamurthy, Christian Maciocco, Maziar Manesh, João Martins, Sylvia Ratnasamy, Luigi Rizzo, Scott Shenker

    Network middleboxes must offer high availability, with automatic failover when a device fails. Achieving high availability is challenging because failover must correctly restore lost state (e.g., activity logs, port mappings) but must do so quickly (e.g., in less than typical transport timeout values to minimize disruption to applications) and with little overhead to failure-free operation (e.g., additional per-packet latencies of 10-100s of microseconds). No existing middlebox design provides failover that is correct, fast to recover, and imposes little increased latency on failure-free operations. We present a new design for fault-tolerance in middleboxes that achieves these three goals. Our system, FTMB (for Fault-Tolerant MiddleBox), adopts the classical approach of "rollback recovery" in which a system uses information logged during normal operation to correctly reconstruct state after a failure. However, traditional rollback recovery cannot maintain high throughput given the frequent output rate of middleboxes. Hence, we design a novel solution to record middlebox state which relies on two mechanisms: (1) 'ordered logging', which provides lightweight logging of the information needed after recovery, and (2) a 'parallel release' algorithm which, when coupled with ordered logging, ensures that recovery is always correct. We implement ordered logging and parallel release in Click and show that for our test applications our design adds only 30 µs to median per-packet latencies. Our system introduces moderate throughput overheads (5-30%) and can reconstruct lost state in 40-275ms for practical systems.

  • Justine Sherry, Chang Lan, Raluca Ada Popa, Sylvia Ratnasamy

    Many network middleboxes perform deep packet inspection (DPI), a set of useful tasks which examine packet payloads. These tasks include intrusion detection (IDS), exfiltration detection, and parental filtering. However, a long-standing issue is that once packets are sent over HTTPS, middleboxes can no longer accomplish their tasks because the payloads are encrypted. Hence, one is faced with a choice between two desirable properties: the functionality of middleboxes and the privacy of encryption. We propose BlindBox, the first system that simultaneously provides both of these properties. The approach of BlindBox is to perform the deep-packet inspection directly on the encrypted traffic. BlindBox realizes this approach through a new protocol and new encryption schemes. We demonstrate that BlindBox enables applications such as IDS, exfiltration detection and parental filtering, and supports real rulesets from both open-source and industrial DPI systems. We implemented BlindBox and showed that it is practical for settings with long-lived HTTPS connections. Moreover, its core encryption scheme is 3-6 orders of magnitude faster than existing relevant cryptographic schemes.
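
    A toy conveys the detection idea, though emphatically not BlindBox's actual cryptography (which rests on garbled circuits and a custom searchable encryption scheme): slide a window over the payload, emit keyed tokens, and let the middlebox match rule tokens by equality without being able to invert them.

        import hashlib, hmac

        WINDOW = 8

        def tokens(data: bytes, key: bytes):
            # Keyed token per sliding window of the payload.
            return {hmac.new(key, data[i:i + WINDOW], hashlib.sha256).digest()
                    for i in range(len(data) - WINDOW + 1)}

        key = b"per-session-secret"
        rule = tokens(b"EVILWARE", key)        # prepared for the middlebox
        payload = tokens(b"GET /download?f=EVILWARE.exe", key)
        print(bool(rule & payload))            # True: rule matched blindly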

  • Dong Zhou, Bin Fan, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, Michael Mitzenmacher, Ren Wang, Ajaypal Singh

    This paper presents ScaleBricks, a new design for building scalable, clustered network appliances that must "pin" flow state to a specific handling node without being able to choose which node that should be. ScaleBricks applies a new, compact lookup structure to route packets directly to the appropriate handling node, without incurring the cost of multiple hops across the internal interconnect. Its lookup structure is many times smaller than the alternative approach of fully replicating a forwarding table onto all nodes. As a result, ScaleBricks is able to improve throughput and latency while simultaneously increasing the total number of flows that can be handled by such a cluster. This architecture is effective in practice: Used to optimize packet forwarding in an existing commercial LTE-to-Internet gateway, it increases the throughput of a four-node cluster by 23%, reduces latency by up to 10%, saves memory, and stores up to 5.7x more entries in the forwarding table.

  • Omid Abari, Deepak Vasisht, Dina Katabi, Anantha Chandrakasan

    Electronic toll collection transponders, e.g., E-ZPass, are a widely-used wireless technology. About 70% to 89% of the cars in the US have these devices, and some states plan to make them mandatory. As wireless devices, however, they lack a basic function: a MAC protocol that prevents collisions. Hence, today, they can be queried only with directional antennas in isolated spots. However, if one could interact with e-toll transponders anywhere in the city despite collisions, it would enable many smart applications. For example, the city can query the transponders to estimate the vehicle flow at every intersection. It can also localize the cars using their wireless signals, and detect those that run a red light. The same infrastructure can also deliver smart street parking, where a user parks anywhere on the street, the city localizes his car, and automatically charges his account. This paper presents Caraoke, a networked system for delivering smart services using e-toll transponders. Our design operates with existing unmodified transponders, allowing for applications that communicate with, localize, and count transponders, despite wireless collisions. To do so, Caraoke exploits the structure of the transponders' signal and its properties in the frequency domain. We built the Caraoke reader into a small PCB that harvests solar energy and can be easily deployed on street lamps. We also evaluated Caraoke on four streets on our campus and demonstrated its capabilities.

  • Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, Ion Stoica

    Low latency analytics on geographically distributed datasets (across datacenters, edge clusters) is an upcoming and increasingly important challenge. The dominant approach of aggregating all the data to a single datacenter significantly inflates the timeliness of analytics. At the same time, running queries over geo-distributed inputs using the current intra-DC analytics frameworks also leads to high query response times because these frameworks cannot cope with the relatively low and variable capacity of WAN links. We present Iridium, a system for low latency geo-distributed analytics. Iridium achieves low query response times by optimizing placement of both data and tasks of the queries. The joint data and task placement optimization, however, is intractable. Therefore, Iridium uses an online heuristic to redistribute datasets among the sites prior to queries' arrivals, and places the tasks to reduce network bottlenecks during the query's execution. Finally, it also contains a knob to budget WAN usage. Evaluation across eight worldwide EC2 regions using production queries shows that Iridium speeds up queries by 3x-19x and lowers WAN usage by 15%-64% compared to existing baselines.

  • Chaithan Prakash, Jeongkeun Lee, Yoshio Turner, Joon-Myung Kang, Aditya Akella, Sujata Banerjee, Charles Clark, Yadi Ma, Puneet Sharma, Ying Zhang

    Software Defined Networking (SDN) and cloud automation enable a large number of diverse parties (network operators, application admins, tenants/end-users) and control programs (SDN Apps, network services) to generate network policies independently and dynamically. Yet existing policy abstractions and frameworks do not support natural expression and automatic composition of high-level policies from diverse sources. We tackle the open problem of automatic, correct and fast composition of multiple independently specified network policies. We first develop a high-level Policy Graph Abstraction (PGA) that allows network policies to be expressed simply and independently, and leverage the graph structure to detect and resolve policy conflicts efficiently. Besides supporting ACL policies, PGA also models and composes service chaining policies, i.e., the sequence of middleboxes to be traversed, by merging multiple service chain requirements into conflict-free composed chains. Our system validation using a large enterprise network policy dataset demonstrates practical composition times even for very large inputs, with only sub-millisecond runtime latencies.

  • Keqiang He, Eric Rozner, Kanak Agarwal, Wes Felter, John Carter, Aditya Akella

    Datacenter networks deal with a variety of workloads, ranging from latency-sensitive small flows to bandwidth-hungry large flows. Load balancing schemes based on flow hashing, e.g., ECMP, cause congestion when hash collisions occur and can perform poorly in asymmetric topologies. Recent proposals to load balance the network require centralized traffic engineering, multipath-aware transport, or expensive specialized hardware. We propose a mechanism that avoids these limitations by (i) pushing load-balancing functionality into the soft network edge (e.g., virtual switches) such that no changes are required in the transport layer, customer VMs, or networking hardware, and (ii) load balancing on fine-grained, near-uniform units of data (flowcells) that fit within end-host segment offload optimizations used to support fast networking speeds. We design and implement such a soft-edge load balancing scheme, called Presto, and evaluate it on a 10 Gbps physical testbed. We demonstrate the computational impact of packet reordering on receivers and propose a mechanism to handle reordering in the TCP receive offload functionality. Presto's performance closely tracks that of a single, non-blocking switch over many workloads and is adaptive to failures and topology asymmetry.
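
    Flowcell spraying itself is simple (a sketch; the path IDs are placeholders for Presto's precomputed spanning-tree paths):

        FLOWCELL_BYTES = 64 * 1024             # one TSO segment's worth

        def flowcells(stream_len):
            """Yield (offset, length) cells covering a byte stream."""
            for off in range(0, stream_len, FLOWCELL_BYTES):
                yield off, min(FLOWCELL_BYTES, stream_len - off)

        def spray(stream_len, paths):
            """Assign successive flowcells to paths in round-robin order."""
            return [(off, length, paths[i % len(paths)])
                    for i, (off, length) in enumerate(flowcells(stream_len))]

        for off, length, path in spray(200 * 1024, ["p0", "p1", "p2"]):
            print(f"bytes {off}-{off + length - 1} -> {path}")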

  • Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, Amin Vahdat

    We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over ten years to more than 1Pbps of bisection bandwidth.

  • Dave Levin, Youndo Lee, Luke Valenta, Zhihao Li, Victoria Lai, Cristian Lumezanu, Neil Spring, Bobby Bhattacharjee

    There are several mechanisms by which users can gain insight into where their packets have gone, but no mechanisms give users undeniable proof that their packets did not traverse certain parts of the world while on their way to or from another host. This paper introduces the problem of finding "proofs of avoidance": evidence that the paths taken by a packet and its response avoided a user-specified set of "forbidden" geographic regions. Proving that something did not happen is often intractable, but we demonstrate a low-overhead proof structure built around the idea of what we call "alibis": relays with particular timing constraints that, when upheld, would make it impossible to traverse both the relay and the forbidden regions. We present Alibi Routing, a peer-to-peer overlay routing system for finding alibis securely and efficiently. One of the primary distinguishing characteristics of Alibi Routing is that it does not require knowledge of--or modifications to--the Internet's routing hardware or policies. Rather, Alibi Routing is able to derive its proofs of avoidance from user-provided GPS coordinates and speed of light propagation delays. Using a PlanetLab deployment and larger-scale simulations, we evaluate Alibi Routing to demonstrate that many source-destination pairs can avoid countries of their choosing with little latency inflation. We also identify when Alibi Routing does not work: it has difficulty avoiding regions that users are very close to (or, of course, inside of).
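
    The heart of the proof is a speed-of-light argument that fits in a few lines (a crude triangle-inequality bound for illustration; the paper's actual condition is tighter and accounts for the relay's position):

        from math import asin, cos, radians, sin, sqrt

        C_FIBER_KM_PER_MS = 200.0              # ~2/3 the speed of light

        def km(a, b):
            """Great-circle distance between (lat, lon) points, in km."""
            lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
            h = (sin((lat2 - lat1) / 2) ** 2
                 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
            return 2 * 6371.0 * asin(sqrt(h))

        def is_alibi(src, dst, forbidden, measured_rtt_ms, delta=0.05):
            # Any round trip that also touched the forbidden region must
            # cover at least the (src, forbidden, dst) triangle perimeter.
            perimeter = km(src, forbidden) + km(forbidden, dst) + km(dst, src)
            return measured_rtt_ms * (1 + delta) < perimeter / C_FIBER_KM_PER_MS

        src, dst = (40.7, -74.0), (48.9, 2.4)          # New York, Paris
        forbidden = (-33.9, 18.4)                      # around Cape Town
        # A 90 ms RTT measured via a Reykjavik relay easily clears the bound:
        print(is_alibi(src, dst, forbidden, measured_rtt_ms=90.0))  # True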

  • Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, David Zats

    Datacenter transports aim to deliver low latency messaging together with high throughput. We show that simple packet delay, measured as round-trip times at hosts, is an effective congestion signal without the need for switch feedback. First, we show that advances in NIC hardware have made RTT measurement possible with microsecond accuracy, and that these RTTs are sufficient to estimate switch queueing. Then we describe how TIMELY can adjust transmission rates using RTT gradients to keep packet latency low while delivering high bandwidth. We implement our design in host software running over NICs with OS-bypass capabilities. We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99th-percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X. To the best of our knowledge, TIMELY is the first delay-based congestion control protocol for use in the datacenter, and it achieves its results despite having an order of magnitude fewer RTT signals (due to NIC offload) than earlier delay-based schemes such as Vegas.
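
    The gradient-based update at TIMELY's core is short (sketched from the paper's description; thresholds and gains here are illustrative, and the hyperactive-increase mode is omitted):

        class TimelyRate:
            DELTA, BETA = 0.5, 0.8            # additive step (Gbps); decrease gain
            T_LOW, T_HIGH = 50.0, 500.0       # us: noise floor and hard clamp
            MIN_RTT = 20.0                    # us, normalizes the gradient

            def __init__(self, rate_gbps):
                self.rate, self.prev_rtt = rate_gbps, None

            def update(self, rtt_us):
                if self.prev_rtt is None:
                    self.prev_rtt = rtt_us
                    return self.rate
                gradient = (rtt_us - self.prev_rtt) / self.MIN_RTT
                self.prev_rtt = rtt_us
                if rtt_us < self.T_LOW:                       # queues empty
                    self.rate += self.DELTA
                elif rtt_us > self.T_HIGH:                    # queues far too deep
                    self.rate *= 1 - self.BETA * (1 - self.T_HIGH / rtt_us)
                elif gradient <= 0:                           # RTT flat or falling
                    self.rate += self.DELTA
                else:                                         # RTT rising
                    self.rate = max(self.DELTA,
                                    self.rate * (1 - self.BETA * gradient))
                return self.rate

        r = TimelyRate(rate_gbps=10.0)
        print([round(r.update(rtt), 2) for rtt in (60, 62, 66, 64, 63)])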
