Pavel Gladyshev discusses his research at DFRWS EU 2018.
Pavel: Thank you. Good evening, good afternoon, ladies and gentlemen. Today, I’d like to briefly tell you about an open-source tool and the general approach behind it, which we invented about a year ago. I’m going to talk about what we call a decision-theoretic file carver; it’s a collaboration between myself and Joshua James.
The original idea was to speed up file carving. To give you an idea of what we did, let me ask you: do you know this game? Have you ever played Battleship? You put ships on the map, and then you try to hit different places to discover where the ships are. We realized that file carving is like a game of Battleship on the Suez Canal: all the files lie along the canal, and the file carver “bombs” particular blocks to test them.
Now, the strategy that is guaranteed to find everything is to start at block #1 and methodically bomb every square of the canal. But that’s not how humans play it, and there is a good reason for that. Humans are more random, and once they hit a ship, they bomb the neighboring squares, trying to discover where the rest of the ship is. What we tried to build is a specialized file carver that looks for a particular kind of file. Because when we do file carving, we don’t analyze all of the files that are carved out. Typically, we are only interested in, say, child abuse material – pictures or videos – or something else. And there are situations – for example, when a parole officer needs to monitor the computer of an offender released early from jail, to check that he didn’t produce any fresh child abuse material – in which he has a limited period of time to do the test, and he doesn’t really have the opportunity to take a disk image of the suspect’s computer and do proper forensics on it. So, ideally, it should be a simple program that you plug into the computer and let run for 40 minutes, and if there is something there, it quickly discovers it.
The approach for building such programs is what we call decision-theoretic forensic algorithms. The aim of such an algorithm, of which file carving is just one example, is to find relevant data as quickly as possible. Maybe you don’t have time to process everything, but you still want to get the relevant data as fast as you can. So, how do we do this? We propose a greedy approach. The input is a collection of data items to be processed – say, all of the disk blocks on the drive. At each step, we pick for processing the item with the maximal expected utility of the information it probably contains. And we keep doing this, processing items one by one, until there are no more items to process.
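As a rough sketch of that greedy loop (in Python, with illustrative names – the actual tool is written in C, and this is not its code):

```python
import heapq

def greedy_process(items, expected_utility, process):
    """Process data items in decreasing order of expected utility.

    A sketch of the greedy approach described above: at each step, take
    the unprocessed item with the maximal expected utility.  Function
    names and signatures here are illustrative, not from the tool."""
    # Build a max-heap keyed on expected utility (negated, since heapq
    # implements a min-heap); the index breaks ties deterministically.
    heap = [(-expected_utility(item), i, item) for i, item in enumerate(items)]
    heapq.heapify(heap)
    results = []
    while heap:
        _, _, item = heapq.heappop(heap)  # item with maximal expected utility
        results.append(process(item))
    return results
```

Because the highest-utility items are processed first, interrupting the loop at any point still leaves you with the best results achievable in the time spent.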
The beauty is that if we always go after the most likely place where the relevant data is, we can stop at any moment, and we will be certain that we have used up whatever time we had to the best possible extent.
Then the question arises: how do we actually determine the expected utility of, say, a block we have not yet processed? If we now focus specifically on carving JPEG files from a suspect machine, then to understand how we can determine utility in this particular case, let’s look at a graph of the total amount of recovered data over time.
For a sequential carver that processes the drive space from start to finish, the graph looks like this: there are some steep climbs when it finds relevant data, and there are long plateaus when it goes through areas of the drive that contain no relevant data. Clearly, if we want to improve the situation, we want the graph to climb all the way to the top at the beginning, and then just stay flat until the very end.
For us, when we built this carver, the utility was the slope of that graph. Now, think about it: to process a particular data block, we need some time. The time required to process the block consists of the time needed to reposition the read/write head of the drive, read the cluster, and process it in RAM. And there are two options – either the block contains no relevant data, in which case its utility is zero, or it contains a full disk block of relevant data. So, the utility function that we use is basically zero if there is no relevant data, and one divided by the processing time if there is some relevant data. Basically, the longer it takes to process, the lower the utility of the data.
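That utility function is simple enough to state directly. A minimal sketch, assuming the three time components named above (parameter names are mine, not the tool’s):

```python
def utility(contains_relevant_data, seek_time, read_time, processing_time):
    """Utility of examining one disk block, as described above:
    zero if the block holds no relevant data, otherwise one divided by
    the total time spent repositioning the head, reading the cluster,
    and processing it in RAM."""
    total_time = seek_time + read_time + processing_time
    if not contains_relevant_data:
        return 0.0
    return 1.0 / total_time
```

So a relevant block that takes 4 time units in total to examine contributes a utility of 0.25, while any irrelevant block contributes nothing regardless of how cheap it is to read.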
Now, when we consider specifically the problem of finding raw JPEG images, like those produced by a digital camera, we look at the distribution of file sizes. In my personal collection of raw images – 5,000 images – this is what I have: as you can see, the file sizes are spread somewhere between 1MB and 6MB, each spike corresponding to a different digital camera that I owned over the past 10 years.
But it’s clear that there is a certain minimal size – let’s call it l, for length – above which all of the raw JPEG images lie. Then we did some small-scale simulations to try to understand which data blocks on the drive are most likely to contain relevant data. We made the following simplifying assumptions: first, there is no fragmentation on the drive; all files that we are looking for are at least l disk blocks in size; files are uniformly randomly distributed through the disk space; and, very importantly, we have a detection algorithm which can accurately and quickly determine whether a particular data block contains relevant information.
When we did the simulations, this is the probability distribution we obtained. To me, it looks a little bit like a cat’s head. And the position of the ear is exactly l blocks from, say, the start of the drive, or from the position of the last examined block. Now, different drive technologies require different processing time to get to a particular data block. But in most cases, if our l is not very big – say, a thousand blocks or so – it doesn’t really matter whether we have a hard disk drive, a solid state drive, or a RAM drive.
So, the algorithm that we came up with works like this. First, we examine the block which is exactly l blocks from the start of the drive. Then, we examine another block, which is l blocks from the last examined position. We keep doing this until we find a block with some relevant data. At that point, we step back, and we use normal header and footer signature detection to identify the boundaries of the file. Once we reach the end of the file and there are no further JPEGs there, we revert back to jumping ahead with the step length l. And l is given as a parameter to our carver.
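The jump-then-step-back order can be sketched as follows. This toy model represents the drive as a list of booleans (True = block contains relevant data) instead of doing real signature detection, and the function name is mine; under the stated assumptions – no fragmentation, every file at least l contiguous blocks – probing every l-th block cannot skip over a file:

```python
def deca_scan(blocks, l):
    """Sketch of the jump-scanning order described above.

    `blocks` is a list of booleans (True = relevant data).  Returns the
    (start, end) block ranges of the files found.  A real carver would
    use header/footer signatures rather than booleans, so adjacent
    files would not merge into one range as they can here."""
    files = []
    n = len(blocks)
    i = l - 1                      # first probe: exactly l blocks in
    while i < n:
        if blocks[i]:
            # Hit: step back to the first relevant block (the header)...
            start = i
            while start > 0 and blocks[start - 1]:
                start -= 1
            # ...and scan forward to the last relevant block (the footer).
            end = i
            while end + 1 < n and blocks[end + 1]:
                end += 1
            files.append((start, end))
            i = end + l            # resume jumping l blocks past the file
        else:
            i += l                 # miss: jump another l blocks ahead
    return files
```

For example, with l = 4 and a single 5-block file occupying blocks 6–10 of a 20-block drive, the probe at block 7 hits the file and the back/forward scan recovers the full (6, 10) range after only a handful of block reads.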
It’s an open-source project on Bitbucket; I’ll show the URL at the end of the presentation. Currently, it runs on Mac or Linux, but there should be no problem compiling it with MinGW (Minimalist GNU for Windows) or, really, any C compiler. Our tool relies on libtsk, The Sleuth Kit library, to access disk images, so you need to have libtsk available to your compiler to be able to build it.
We did some experiments using different types of drives – hard disk drives, solid-state drives, RAM drives – and different allocations of the files. The worst case is an entire drive 100% filled with JPEGs; the best case is when only the last 5% is filled with JPEGs; and there are different combinations in between.
If we consider a drive which is 100% filled with target data, then a sequential carver is obviously optimal, because it starts at the beginning and just carves everything out. And as you can see, our DECA carver performs on par with PhotoRec and with Scalpel. Now, those green and yellow graphs, which look like happy worms standing on their tails, are actually Scalpel. The reason they look like this is that Scalpel first identifies the locations of the different files, and then extracts them all in one go at the end. So, that’s the worst case, and in the worst case, our carver performs just as well as the sequential carvers.
In the best case, more happy worms. You can see that the first group of worms on the left is our DECA carver. We managed to discover about 80% of all files on the drive in about one fifth of the time it takes PhotoRec and Scalpel to recover the same amount of files. The three last ones are Scalpel 2.1, PhotoRec, and Scalpel 1.6, and Foremost is the pink one. Everything on the left is our DECA carver with different step lengths, with different values of l: 64, 200, 300, and so on.
We think that this approach – treating the forensic search for relevant data as a greedy algorithm driven by the principle of maximal expected utility – is a good one, and we welcome you to have a look at the code repository, play with it, and let us know what you think. Thank you. [applause]
End of transcript