SANS 360 Lightning Talk: Factory Forensics

Posted on December 14th, 2011 in Blog, HomePage.

Last night’s SANS 360 session was a blast. It was much more intense than a normal lightning/work-in-progress talk, and the speakers were great. Big props to Rob Lee and the SANS crew for organizing it.

For those who couldn’t make it, the rest of this blogpost is a recreation of my talk, “Factory Forensics.” Geoff and I will be speaking about the technology in a bit more depth at the 2012 DoD Cybercrime Conference in Atlanta, on Wednesday, January 25, at 1100 in the Learning Center room (“Forensic Clusters”). We’re up against Jesse Kornblum in the same time slot, but, hey, he always puts his slides up online, right?

Factory Forensics

Chris Pogue has talked a lot this past year about his approach to casework, which he calls “Sniper Forensics.” I haven’t had the pleasure of meeting Chris in person, but his blogposts about Sniper Forensics are great (Parts One, Two, Three, Four, and Five). The basic idea behind Sniper Forensics is that you keep in mind what question you’re trying to answer in your case, and you work backwards from it. [Ed.: Admittedly a gross simplification, but, hey, it was a 6-minute talk. Read Chris's posts if you haven't already.] That way, you stay on target and don’t get overwhelmed by the evidence or fall down a rabbit hole.

But

The problem with snipers is that they can’t deal with every situation. They need backup. We’re dealing with more raw data as input than ever before, and, as the public has grown aware of computer forensics, we’re now more in demand than ever before. Working smarter (a la Sniper Forensics) and doing less (“triage”) will only take you so far: we have to increase our productivity and get more done. The tool to do that is the assembly line.

According to Henry Ford, there are three main principles of the assembly line. Kind of wordy, but they are:

  1. Place the tools and the men in the sequence of the operation so that each component part shall travel the least possible distance while in the process of finishing.
  2. Use work slides or some other form of carrier so that when a workman completes his operation, he drops the part always in the same place–which place must always be the most convenient place to his hand–and if possible have gravity carry the part to the next workman for his operation.
  3. Use sliding assembling lines by which the parts to be assembled are delivered at convenient distances.

And here’s how I interpret them:
  1. Make a sequence of operations, in the right order
  2. The output from one stage is the input to the next
  3. The flow between stages should be automatic
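As a rough illustration of those three points (not code from the talk), here is a minimal Python sketch of an assembly line: an ordered list of stages, where each stage's output is handed to the next one automatically. The stage names `extract_text`, `tokenize`, and `count_tokens` are hypothetical stand-ins for real forensic processing steps.

```python
from typing import Callable, List

Stage = Callable[[object], object]


def assembly_line(stages: List[Stage]) -> Stage:
    """Chain stages so each one's output becomes the next one's input."""
    def run(item):
        for stage in stages:  # principle 1: a fixed sequence of operations
            item = stage(item)  # principles 2 and 3: output flows on automatically
        return item
    return run


# Hypothetical stages, standing in for real processing steps.
def extract_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


def tokenize(text: str) -> list:
    return text.lower().split()


def count_tokens(tokens: list) -> int:
    return len(tokens)


line = assembly_line([extract_text, tokenize, count_tokens])
print(line(b"Follow the money"))
```

Adding a new processing step is then just a matter of adding another function to the list, without touching the stages around it.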

Ford really wasn’t the inventor of the assembly line, but its refinement at Ford Motor Company early in the twentieth century allowed him to cut the price of the Model T by a third, produce over a thousand of ‘em a day, and put competitors out of business.

So, I went looking around for a way to build an assembly line for forensics, something that would let me process lots of evidence reliably. I looked into how other tech companies were dealing with large data sets, and the most popular solution they use for storing and processing large unstructured data sets is an open source framework, Apache Hadoop.

Hadoop is software that you can run on a cluster of 1U or 2U servers, your typical entry-level servers with a few disks and a few cores, nothing fancy. It scales up to thousands of machines storing petabytes of data, but clusters can be built incrementally without much hassle.

Hadoop has two main components. The first is HDFS, a distributed filesystem. It breaks files up into blocks and stores the blocks on different machines. The blocks are automatically replicated and checksummed, so it’s fault-tolerant. [These images come from Cloudera's Hadoop Overview, which is a great place to start learning about Hadoop.]
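To make the block idea concrete, here is a toy Python model of splitting a file into blocks, checksumming each block, and assigning replicas to nodes. It is only a sketch: the round-robin placement, the tiny block size, and the node names are assumptions chosen for illustration, and real HDFS uses its own placement policy and much larger blocks.

```python
import zlib


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, nodes: list, replicas: int = 3) -> list:
    """Assign each block to `replicas` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not HDFS's actual policy.
    """
    replicas = min(replicas, len(nodes))
    table = []
    for index, block in enumerate(blocks):
        holders = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
        table.append({
            "index": index,
            "checksum": zlib.crc32(block),  # stored so corruption can be detected
            "nodes": holders,
        })
    return table


def verify(block: bytes, expected_checksum: int) -> bool:
    """Recompute the checksum and compare it to the stored value."""
    return zlib.crc32(block) == expected_checksum


blocks = split_into_blocks(b"0123456789", block_size=4)
table = place_blocks(blocks, ["node-a", "node-b", "node-c", "node-d"])
for entry in table:
    print(entry["index"], entry["nodes"])
```

If a node dies, every block it held still exists on other nodes, and a replica whose checksum no longer matches can be discarded in favor of a good copy.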

The second component is called MapReduce, which is a batch processing service. You write your program to a specific API, and then MapReduce sends it to all the nodes in the cluster and starts running it. The nodes only process the blocks of data they have locally, so the disk accesses remain local, and then the system collates all the output automatically. Network traffic is thus conserved, and Hadoop generally does large streaming reads and writes to disk instead of seeking all over the place. The important thing to remember is that it’s faster to send the program to the data than it is to send the data to the program.
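The map/collate/reduce flow can be sketched in a few lines of plain Python. This is a single-process toy, not the Hadoop API: the `blocks_by_node` layout, the sample text, and the keyword list are made up for illustration, and in a real cluster each node would run the map step itself against the blocks it stores.

```python
from collections import defaultdict


def map_block(block: str, keywords: set):
    """Map step: scan one locally stored block and emit (keyword, 1) pairs."""
    for word in block.lower().split():
        if word in keywords:
            yield word, 1


def shuffle(pairs) -> dict:
    """Collate step: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: dict) -> dict:
    """Reduce step: sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical data: which text blocks live on which node.
blocks_by_node = {
    "node-a": ["invoice sent to offshore account", "meeting at noon"],
    "node-b": ["offshore transfer confirmed", "second invoice attached"],
}
keywords = {"invoice", "offshore"}

emitted = []
for node, local_blocks in blocks_by_node.items():
    # On a real cluster this loop body runs on the node that holds the blocks.
    for block in local_blocks:
        emitted.extend(map_block(block, keywords))

totals = reduce_counts(shuffle(emitted))
print(totals)
```

Only the small `(keyword, 1)` pairs ever need to cross the network; the bulky blocks stay where they are, which is the "send the program to the data" point in miniature.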

So, earlier this year we worked on creating a system for doing forensics on top of Hadoop. There are three main steps. First, we have to ingest the data into the system, which means handling all the filesystem parsing. This is kind of complicated, but using The Sleuth Kit was a big help. Second, we have all our processing tasks, like text extraction, keyword searching, and some other cool things. Finally, the output from the system is a set of static HTML reports that contain all the results, the idea being that we could produce useful information automatically for use as a starting point without having to learn another GUI.
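For the reporting step, a static HTML page needs nothing more than string formatting. The sketch below is an assumption about what a minimal report generator could look like, not the system's actual report code; `render_report` and the sample hit are hypothetical.

```python
from html import escape


def render_report(case_name: str, hits: list) -> str:
    """Build a self-contained HTML page listing (keyword, file path) hits."""
    rows = "\n".join(
        # Escape everything that came from evidence so it cannot break the page.
        f"<tr><td>{escape(keyword)}</td><td>{escape(path)}</td></tr>"
        for keyword, path in hits
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(case_name)}</title></head><body>\n"
        f"<h1>{escape(case_name)}</h1>\n"
        "<table>\n<tr><th>Keyword</th><th>File</th></tr>\n"
        f"{rows}\n</table>\n</body></html>\n"
    )


report = render_report("Case <001>", [("offshore", "/docs/a&b.txt")])
print(report)
```

The resulting string can be written to a file and opened in any browser, so whoever reads the report needs no special tooling.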

Here are some screenshots of the output reports. The first one gives an overview of the evidence and the search results.

The second one shows you some results from our image clustering routine, where we arrange similar images into groups.

The third shows you key frames that we’ve extracted from a video file, so you can just look over the gallery view of key frames without watching the whole video.

How’d it end up? Well, the good news is that it basically works. We ran it with 5 nodes, 10 nodes, and 20 nodes and all the processing tasks worked. Performance was pretty good and it scaled. The bad news is that we had to learn a lot about Hadoop and use some complicated techniques to get good performance: you can write pretty simple code with Hadoop, but it won’t perform as well as it should. The ugly part was that Hadoop is all Java and, while Java is very fast, resolving the dependencies between all the different open source libraries, working with their build systems, and configuring it all on a cluster was a big pain. We ended up contributing a patch to Apache to make this easier.

The other big thing we realized after we created the initial prototype was that it’s not really enough to produce static output for you to use. It’s a good start, but if we’re going to incorporate factory processes into forensics, what we need to output are tools for you to use to help you zero in on your cases (like the snipers you are). Initial results always spur further questions, and if you have to go back to the drawing board to answer them, then we haven’t helped you as much as we could. So, we’re working on that.

This is the machine shop at the first assembly line at Ford. It looks pretty primitive to us now. But it changed history. Many people think that the golden age of forensics is behind us, what with full disk encryption, cellphones, the cloud, and a host of other challenges. But we think we’re just getting started in forensics, and that we have a very bright future in front of us.

Machine shop at Highland Park