This paper presents a statistical analysis of the amount of information that the features of traffic flows observed at the packet level carry with respect to the protocol that generated them. We show that the amount of information carried by the majority of such features remains constant irrespective of the point of observation (Internet core vs. Internet edge) and of the capture time (year 2000/01 vs. year 2008). We also describe a comparative analysis of how four statistical classifiers fare using the features we studied.
Traffic classification in the Internet has been one of the most challenging research topics in the recent past. Our inability to confidently map flow information to the generating application complicates the management of IP networks, because network operators have limited visibility into what the network is used for. The heavy use of peer-to-peer applications, which further try to obfuscate their nature by using random port numbers, makes traffic classification an interesting research domain.
This paper studies the information content of the different fields in an IP flow that could assist in identifying the application. More importantly, the authors test how the information content of those fields varies with the location where the measurements are collected and across time. This is the first work, to the best of my knowledge, that asks this question and demonstrates that packet size is not only a highly discriminating field, as shown before, but also that its information content appears to remain constant regardless of the point of observation and time. One thing I would have liked to see in this paper is a rationale for why one should have expected such a variation. One interesting question to ask, then, is: given that packet size is probably the easiest flow property to change (say, for evasion purposes), are there other fields that could provide similar discriminative power while being robust across time and space?
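To make the notion of "information content of a field" concrete, the sketch below estimates the empirical mutual information, in bits, between a discretized flow feature and the protocol label. This is only an illustration under stated assumptions: the text above does not say which estimator the authors use, and the function names (`mutual_information`, `bin_packet_size`), the 100-byte bin width, and the toy flows are all hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(features, labels):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences.

    Assumption: a plain plug-in estimator over observed counts; the paper's
    actual measure may differ.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    if n == 0:
        return 0.0
    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    count_x = Counter(features)             # marginal counts of x
    count_y = Counter(labels)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def bin_packet_size(size_bytes, width=100):
    """Hypothetical discretization: map a packet size to a fixed-width bin."""
    return size_bytes // width


# Toy, made-up flows: (size of first packet in bytes, protocol label).
toy_flows = [
    (64, "dns"), (80, "dns"),
    (1460, "http"), (1400, "http"),
    (120, "smtp"), (150, "smtp"),
]
sizes = [bin_packet_size(size) for size, _ in toy_flows]
protocols = [proto for _, proto in toy_flows]
toy_mi = mutual_information(sizes, protocols)
```

In this toy data the binned size fully determines the protocol, so `toy_mi` equals the entropy of the label, log2(3) bits. Computing the same quantity for one feature on traces from different vantage points or years would be one way to check the stability the paper reports.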