Luca Salgarelli

GT: picking up the truth from the ground for internet traffic

By: 
Francesco Gringoli, Luca Salgarelli, Maurizio Dusi, Niccolò Cascarano, Fulvio Risso, and kc claffy
Appears in: 
CCR October 2009

Much of Internet traffic modeling, firewall, and intrusion detection research requires traces where some ground truth regarding application and protocol is associated with each packet or flow. This paper presents the design, development, and experimental evaluation of gt, an open source software toolset for associating ground truth information with Internet traffic traces. By probing the monitored host’s kernel to obtain information on active Internet sessions, gt gathers ground truth at the application level.
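As a rough illustration of the kernel-probing idea (not gt's actual implementation, which the paper describes), the sketch below uses the cross-platform psutil library to snapshot the kernel's TCP connection table and label each endpoint pair with the name of the owning process; the polling loop and all names are illustrative assumptions.

```python
# Illustrative sketch of kernel-level ground-truth gathering (NOT gt itself):
# snapshot the kernel's TCP connection table via psutil and label each
# endpoint pair with the name of the process that owns the socket.

import time
import psutil

def snapshot_flow_labels():
    """Map (local_ip, local_port, remote_ip, remote_port) -> application name."""
    labels = {}
    for conn in psutil.net_connections(kind="tcp"):
        if not conn.raddr or conn.pid is None:
            continue  # skip listening sockets and sockets we cannot attribute
        try:
            app = psutil.Process(conn.pid).name()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited, or we lack privileges to inspect it
        labels[(conn.laddr.ip, conn.laddr.port,
                conn.raddr.ip, conn.raddr.port)] = app
    return labels

if __name__ == "__main__":
    while True:  # a real tool would hook session open/close events, not poll
        for flow, app in snapshot_flow_labels().items():
            print(flow, "->", app)
        time.sleep(1)
```

Labels collected this way could then be joined offline with a packet trace captured on the same host to tag each flow with its generating application.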

Public Review By: 
Pablo Rodriguez

Traffic classification has received widespread attention in the last few years. This can be explained by the continuous tussle between network operators, who sometimes try to ‘peek’ into their clients’ application usage, and network services and applications, which add layers of evasion to escape such eavesdropping. Accurately assigning applications to observed flows can also help with the management, security, and provisioning of IP networks. A plethora of traffic classification techniques have consequently been developed to address each of the layers of evasion added by applications. All such techniques need reliable inputs to quantify their effectiveness. Such input comes in the form of previously labeled traffic traces and is usually referred to as ground truth.
Two main techniques have been used so far to produce traffic that provides such ground truth. The first manually or programmatically triggers applications on different machines and labels the corresponding generated flows. This has limitations: the traffic traces can still contain background traffic, and the generated workload does not resemble one produced by human users. The second technique employs Deep Packet Inspection and tries to match signatures inside each packet. However, multiple signatures might match the same packet, and the approach breaks down when dealing with encrypted traffic.
This paper presents a client tool called gt that helps provide ground-truth information for evaluating different traffic classification methods by monitoring a host's kernel. This is extremely valuable for validation purposes. The authors show that gt addresses some of the above limitations: it seamlessly integrates with a user’s normal computer usage, keeps CPU load low (less than 5%), and achieves close to 100% completeness in flow tagging on all operating systems. The gt tool can also help augment existing classification techniques such as DPI to give better results. In fact, gt can be used to address the limitations of existing Deep Packet Inspection techniques both by reducing the number of signatures that need to be matched and by enhancing the accuracy of the matches. One potential avenue for further research that the authors could explore is to evaluate and characterize existing traffic classification methods such as BLINC using the ground truth generated with gt, thus proving invaluable in helping fine-tune such approaches.

On the Stability of the Information Carried by Traffic Flow Features at the Packet Level

By: 
Alice Este, Francesco Gringoli, and Luca Salgarelli
Appears in: 
CCR July 2009

This paper presents a statistical analysis of the amount of information that the features of traffic flows observed at the packet level carry with respect to the protocol that generated them. We show that the amount of information of the majority of such features remains constant irrespective of the point of observation (Internet core vs. Internet edge) and of the capture time (years 2000/01 vs. year 2008). We also describe a comparative analysis of how four statistical classifiers fare using the features we studied.
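For readers unfamiliar with how the information carried by a feature can be quantified, the following minimal sketch computes an empirical (plug-in) estimate of the mutual information I(feature; protocol) from labeled samples; the sample values are invented for illustration, and the estimator is not necessarily the one used in the paper.

```python
# Sketch: plug-in estimate of the mutual information I(feature; protocol),
# in bits, from labeled samples. The (packet_size, protocol) pairs below
# are made up purely for illustration.

import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts of (x, y)
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

samples = [(64, "SMTP"), (1500, "HTTP"), (64, "SMTP"),
           (1500, "HTTP"), (590, "POP3"), (64, "POP3")]
print(f"I(size; protocol) = {mutual_information(samples):.3f} bits")
```

Comparing such estimates across traces captured at different vantage points and times is, in spirit, the kind of stability analysis the paper performs.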

Public Review By: 
Konstantina Papagiannaki

Traffic classification in the Internet has been one of the most challenging research topics in the recent past. Our inability to confidently map flow information to the generating application complicates the management of IP networks due to the fact that network operators have limited visibility into what the network is used for. The heavy use of peer-to-peer applications that further try to obfuscate their nature, using random port numbers, makes traffic classification an interesting research domain.
This paper studies the information content of the different fields in an IP flow that could assist in the identification of the application. More importantly, the authors test how the information content of those fields varies depending on the location where the measurements are collected and across time. This is the first work, to the best of my knowledge, that asks this question, and it demonstrates not only that packet size is a highly discriminating field, as shown before, but also that its information content appears constant regardless of the point of observation and time. One of the things I would have liked to see in this paper is a rationale as to why one should have expected such a variation. One interesting question to ask, then, is: given that packet size is probably the easiest flow property one can change (say, for evasion purposes), are there other fields that could provide similar discriminative power while being robust across time and space?

Comparing Traffic Classifiers

By: 
Luca Salgarelli, Francesco Gringoli, and Thomas Karagiannis
Appears in: 
CCR July 2007

Many reputable research groups have published several interesting papers on traffic classification, proposing mechanisms of different natures. However, it is our opinion that this community should now find an objective, scientific way of comparing results coming out of different groups. We see at least two hurdles before this can happen. A major issue is that we need to find ways to share full-payload data sets or, if that does not prove to be feasible, at least anonymized traces with complete application-layer meta-data.

Traffic Classification through Simple Statistical Fingerprinting

By: 
Manuel Crotti, Maurizio Dusi, Francesco Gringoli, and Luca Salgarelli
Appears in: 
CCR January 2007

The classification of IP flows according to the application that generated them is the basis of any modern network management platform. However, classical techniques such as those based on the analysis of transport-layer or application-layer information are rapidly becoming ineffective. In this paper we present a flow classification mechanism based on three simple properties of the captured IP packets: their size, inter-arrival time, and arrival order.
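A minimal sketch of what such a per-flow feature vector could look like, assuming a flow is represented as a time-ordered list of (timestamp, size) pairs; this illustrates the three properties named in the abstract and is not the paper's actual code.

```python
# Sketch: the per-flow features named above -- size, inter-arrival time, and
# arrival order -- extracted from the first N packets of a flow.

def flow_features(packets, n_first=5):
    """Return [(size, inter_arrival_time), ...]; list position = arrival order."""
    feats = []
    prev_ts = None
    for ts, size in packets[:n_first]:
        iat = 0.0 if prev_ts is None else ts - prev_ts
        feats.append((size, iat))
        prev_ts = ts
    return feats

# Hypothetical flow: the packet sizes and timings below are made up.
print(flow_features([(0.000, 74), (0.012, 66), (0.013, 230), (0.150, 1514)]))
```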

Public Review By: 
Chadi Barakat

The main contribution of this paper is a statistical method for the classification of Internet flows that does not require looking at packet headers or parsing payload data. Indeed, classifying flows using packet headers and payload data is less and less effective nowadays, for the simple reason that applications increasingly use encryption and try to avoid standard port numbers that can be easily recognized. To counter this trend, the authors propose a classification method based on the sizes of the first packets of a flow and the times between their capture at the monitoring device. They show over real measurement data that monitoring the first few packets is enough to classify flows generated by three typical applications: HTTP, SMTP, and POP3. The consideration of other applications is left for future work.
As in any classification work, there are two phases. In the first, called the training phase, known flows from the applications to be classified are analyzed to establish classification rules. In this paper, the rules (called fingerprints) are simple histograms of the observed variables that can be easily stored. Based on these histograms, the authors propose an anomaly score between 0 and 1 that indicates how poorly the assumption that a flow belongs to a given application fits the observations; the flow is then assigned to the application with the best score. Once established, the classification rules are exported to the monitoring device and applied to collected traffic in real time.
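As a loose illustration of this training/classification split (simplified from the paper, whose fingerprints model packet sizes and inter-arrival times jointly), the sketch below builds per-packet-position size histograms and scores a flow as one minus its mean fingerprint probability; the binning and the exact score definition are assumptions made for brevity.

```python
# Simplified sketch of histogram fingerprints and an anomaly-style score.
# Real fingerprints in the paper also use inter-arrival times; this version
# bins only packet sizes, per position in the flow.

from collections import Counter, defaultdict

BIN = 100  # bytes per histogram bin (an arbitrary choice for this sketch)

def train(flows):
    """flows: packet-size sequences from one known application -> fingerprint."""
    hists = defaultdict(Counter)
    for sizes in flows:
        for pos, size in enumerate(sizes):
            hists[pos][size // BIN] += 1
    # normalize counts into per-position probability histograms
    return {pos: {b: c / sum(h.values()) for b, c in h.items()}
            for pos, h in hists.items()}

def anomaly(fingerprint, sizes):
    """Score in [0, 1]; lower means a better match with the fingerprint."""
    probs = [fingerprint.get(pos, {}).get(size // BIN, 0.0)
             for pos, size in enumerate(sizes)]
    return 1.0 - sum(probs) / len(probs)

fp_http = train([[300, 1500, 1500], [280, 1500, 1400]])  # made-up training flows
print(anomaly(fp_http, [310, 1500, 1500]))  # prints ~0.33: a plausible HTTP match
```

In use, one such fingerprint would be trained per application, and a new flow would be assigned to the application whose fingerprint yields the lowest score.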
This work is clearly not complete, and one sign of this is that only three applications are considered, which are among the easiest to classify. Its strength, however, lies in the novelty of the idea and in the many open issues it raises. All reviewers agreed on the timeliness and soundness of the contribution and were enthusiastic to see more work done in this direction to prove the generality of the solution. Among the points still to be studied are the consideration of more applications, the transportability of fingerprints from one network point to another, and the sensitivity to network perturbations such as packet reordering and oscillations in the round-trip time. CCR asks for papers that address hot topics, present solid contributions, open the door to more research, and trigger interesting discussions. This paper is exactly of this type!
Enjoy the read.
