Detailed Diagnosis in Enterprise Networks

Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl
Appears in: 
CCR October 2009

By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It should also identify culprits at a fine granularity such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of themimpacting one another in the present.We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as themost likely culprit in 80% of the cases and is almost always in the list of top five culprits.