Automating Pcap Parsing: Case Study

Published 01-Aug-2020 | 20 min

Preamble

In Internet measurement studies, we often need to deal with raw packets. There are numerous options for capturing packets, but amongst researchers the dominant ones are tcpdump and Wireshark. Both tools are powerful and relatively easy to use. Choosing one over the other mostly comes down to prior experience and the scope of the problem. For example, Wireshark comes with a GUI and a powerful analytics DSL, while tcpdump is lightweight, can be run without an X server, and can be set up with a single command. One might think that with so many options for packet capturing, packet analysis and parsing would also be well explored and offer multiple solutions to pick from. Well, if you thought so, I have bad news for you: when it comes to (automated) packet parsing, things are not as easy as choosing your favorite "off-the-shelf" solution and proceeding with it. If you do not trust me, ask anyone who has done such packet parsing (good luck finding these people first) and chances are they will all suggest different solutions, most of them custom-made by the scientist.

In this post, rather than trying to convince you that the solution I came up with is the best (it really is not), I will merely reflect on my most recent experience with packet parsing. I will present a few viable options, along with why I chose these ones in particular, and will then report their performance.

Side note: Here I was working with large files that, even compressed, were sometimes over 10 GB. For small files that fit into memory and can be inspected manually, I still think Wireshark's packet filtering and analysis features are superior, and I recommend them.

Coming up with solutions

I will use one packet parsing example and stick with it until the end of the article to make my point clearer. Let us say we have a large pcap file and we would like to know how many distinct TCP connections were made that used port 443.
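To make the task concrete, here is a minimal sketch of it using only the Python standard library. It is not the code used for the measurements in this post. It assumes a classic (non-pcapng) pcap file with untagged Ethernet frames carrying IPv4, and it treats a "connection" as an unordered pair of (IP, port) endpoints, so a 4-tuple that is reused later in the capture is counted once. The function names `iter_packets`, `tcp_endpoints` and `count_connections` are made up for this illustration.

```python
import struct
from typing import BinaryIO, Iterator, Optional, Tuple

Endpoint = Tuple[bytes, int]  # (raw IPv4 address, TCP port)


def iter_packets(f: BinaryIO) -> Iterator[bytes]:
    """Yield raw frames from a classic pcap stream one at a time."""
    header = f.read(24)  # pcap global header is 24 bytes
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = header[:4]
    # Microsecond and nanosecond magic numbers, in both byte orders.
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    (linktype,) = struct.unpack(endian + "I", header[20:24])
    if linktype != 1:  # 1 = Ethernet; other link types are out of scope here
        raise ValueError("this sketch only handles Ethernet captures")
    while True:
        record = f.read(16)  # per-packet record header
        if len(record) < 16:
            return
        _sec, _frac, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        frame = f.read(incl_len)
        if len(frame) < incl_len:
            return  # capture was cut off mid-packet
        yield frame


def tcp_endpoints(frame: bytes) -> Optional[Tuple[Endpoint, Endpoint]]:
    """Return (source, destination) endpoints for IPv4/TCP frames, else None."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":  # EtherType IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
    if frame[14] >> 4 != 4 or ihl < 20 or frame[23] != 6:  # protocol 6 = TCP
        return None
    (flags_frag,) = struct.unpack("!H", frame[20:22])
    if flags_frag & 0x1FFF:  # non-first fragment: no TCP header to read
        return None
    tcp = 14 + ihl
    if len(frame) < tcp + 4:
        return None
    sport, dport = struct.unpack("!HH", frame[tcp:tcp + 4])
    return (frame[26:30], sport), (frame[30:34], dport)


def count_connections(f: BinaryIO, port: int = 443) -> int:
    """Count distinct endpoint pairs where either side uses `port`."""
    seen = set()
    for frame in iter_packets(f):
        ends = tcp_endpoints(frame)
        if ends and port in (ends[0][1], ends[1][1]):
            seen.add(tuple(sorted(ends)))  # same key for both directions
    return len(seen)
```

To try it, open a capture in binary mode and pass the file object in, for example `count_connections(open("capture.pcap", "rb"))`, where `capture.pcap` is a placeholder name. Only one packet is held in memory at a time, so memory use grows with the number of distinct connections rather than with the file size.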

After a bit of research I found that the following options might prove useful for our problem:

  1. Terminal tools (tshark, tcpdump, tcptrace, etc).
  2. Python with scapy.
  3. C++ with PcapPlusPlus.

Faced with this issue (my files were still small and I was a fool), I thought: "I am neither a dinosaur nor a terminal ninja, so let us try to steer away from the terminal solutions. Also, it is 2020 after all, and data science and analytics are now 'cool' to do in Python, so there must be a module that does that." So I chose scapy.

Initially scapy worked great. My files were less than a gigabyte and Python managed to crunch through those in 100-150 seconds. That was OK: I would start the parsing process, look at the results from the previous batch, and by the time I was done the task would be complete.

Later the files got larger (around 2.5 GB) and parsing took 968.2424 seconds on average, or about 16m 8s. That wasn't so great but was workable. I would find something productive to do for about 15 minutes and the results would be almost done. Later... the files got even larger: 10 GB+. At this point scapy was taking over 6 hours (around 8 if memory serves). Needless to say, this was not good... It was bad enough that I would only get my results the next day, but what was worse was that if I ever made a mistake in the data, or something happened to the script while it was running, everything was ruined. For me that was the clear red light that showed it was time to try to replace scapy. Honestly, I did not think that such a task should take 8 hours, even with hundreds of millions of packets to be analysed per file.

Eventually I found that I was right, but not before I wasted some time trying out other non-optimal solutions (a waste of time that ironically became a good base for this blogpost, so even more time spent on non-research). Being as convinced of the power of PYTHON as any good data science fanatic (or, frankly, because I would rather have avoided writing proper code if I had the chance), I decided to check whether there were any experimental, more performant Python modules for the job. I was in luck: I found a module called TODO: which was advertised as more performant than scapy. Great, just what I was looking for.

... Bottom line: it did in fact perform better. Sadly, not by much...

At this point I realised: "Wait a minute, I am a networks researcher, not a data scientist. I do not need to throw Python at every problem. In fact, I am not scared to write in those 'dark-art' programming languages that use dreaded concepts like pointers, references, borrowing, etc." So I stepped it up a bit and tried a "real engineers'" language: C++. No surprises there: C++, being the Swiss Army knife of programming languages, provided an option for packet analysis. So I was ready to work with the solution, after a short search to find it and a much longer and more painful experience of cursing at the monitor and crying into the void while compiling it and making sure it was properly set up. Also unsurprisingly, the library I found was open source, badly documented and seems to have been last updated a few months ago... Anyway, with a library at our disposal, we were ready to write some C++ code to do our task and check its performance.

Drumroll, please... Once again, C++ has proven quite a useful friend, even for this task. It seems C++ can be really friendly when it comes down to performance. The tricky part is befriending C++ initially. To make a fair comparison, I will report the results for the files that took Python around 16 minutes: C++ managed to crunch through those in 1.578 s. (To some extent I was guilty of about 80% of that time, as the script was very verbose.)
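To put the two numbers side by side, here is a small worked calculation in Python. It uses only the two timings reported in this post (the scapy average for the roughly 2.5 GB files and the C++ run on the same files). The variable names are made up for the illustration, and the result is a rough ratio between two measurements, not a benchmark.

```python
# Timings as reported in this post, for the same ~2.5 GB files.
scapy_seconds = 968.2424  # Python with scapy, average
cpp_seconds = 1.578       # C++ with PcapPlusPlus

# Convert the scapy time to minutes and seconds.
minutes, seconds = divmod(scapy_seconds, 60)

# Ratio of the two run times.
speedup = scapy_seconds / cpp_seconds

print(f"scapy: {int(minutes)}m {seconds:.0f}s")  # scapy: 16m 8s
print(f"speedup: about {speedup:.0f}x")          # speedup: about 614x
```

In other words, on these files the C++ version was roughly 600 times faster, even with the verbose output included in its time.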

Final words

So, what we have learnt from this (quite opinionated) blogpost is that Python can only get us so far when it comes to data analysis. Unfortunately, in network systems we cannot put on that "data scientist" cap while analysing our data; we have to leave that mostly to the AI and ML people for now... We also saw that scalability can be a big struggle for some tools. If you are faced with a similar issue, I would recommend PcapPlusPlus if performance is what you are after. In future I may try to provide a Python port for PcapPlusPlus using FFI, so that hopefully one day we (network engineers) too will be able to pick the easy route for data analytics.