Harald Baier presents his research at DFRWS EU 2019.

Host: Hello, welcome to the last session of the day on forensic analysis techniques in general. We have a mix of three techniques here. Two of them span a couple of abstraction layers, so we’ve got application-level analysis and file-system-level analysis, and also some general techniques around lookup strategies in digital forensics.
We’ve got a bus to catch at 16:45, so we have to be quite strict on time. So, without further ado, can I introduce Harald Baier, who is going to talk about the efficiency of artifact lookup strategies in digital forensics.
Harald: So welcome to our joint paper. It was mainly coordinated by Laurence Liebler, who gave a talk in the previous session about APX bin, and the scope is the same, so he is the main author. Patrick Schmitt was a Master’s student of Laurence who hails from the Technical University of Darmstadt. My affiliation is Hochschule Darmstadt; I lead the IT security research group and I am also involved in the security research centre CRISP Darmstadt. And finally, this is also joint work with Frank Breitinger, who is also very familiar with this topic.
The agenda of my talk: I will give a short motivation, although the scope was already mentioned a few times in the previous session. We investigated three candidates for the lookup problem; I will present them in the second part of my talk. Then I will give you some insight into the requirements for the candidates and their capabilities, before and after our extensions. After that I will give you some information about the extensions we provided in our work, sketch the evaluation, and then conclude.
So, the motivation. The invited talk tomorrow is about ‘data density’, with this exorbitant number of [XR bytes per DNA gram]; we call this data overload. And what you see in the two pictures: the picture on the left side is from some years ago, and the other one is also from some years ago, because a hard drive is not that common today.
But the motivation for our work is that, in an ordinary forensic investigation, you have some storage media, and you have to deal with several terabytes of data – not yet petabytes, but maybe terabytes; this is our use case. And if you think about going through that and finding relevant artifacts, I think the following picture is very helpful: take 1TB of data, think of it as printed text, and print it on pages where one page holds 5,000 characters – then you have about two hundred million pages. And if you look at the weight of this paper, then you have [indecipherable]. So what you should keep in mind is that you cannot work with that by hand.
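As a rough check of that figure, assuming $1\,\text{TB} \approx 10^{12}$ bytes and one byte per character:

\[
\frac{10^{12}\ \text{bytes}}{5{,}000\ \text{characters per page}} = 2 \times 10^{8}\ \text{pages} \approx 200\ \text{million pages}.
\]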
So you have the forensic investigator at work – for instance this person – and when you start you have bulk data; this is the haystack. Then your task is to find the relevant data, which means you have to find the needle. This is what you are searching for.
For instance, the relevant question might be: you have several terabytes of data – do you find any artifacts of child abuse pictures? Then a child abuse picture, or a deleted fragment of one, would be the needle you are looking for.
And then you need automated pre-filtering, and the community has so far discussed two approaches to find the needle in the haystack. The first approach is to filter out what is not relevant – to decrease the haystack, which means reduce the haystack. The other, which in my perception is more relevant, is to increase the needle, which means filtering in artifacts which are candidates for those artifacts that you are actually looking for.
In this use case, approximate matching is a topic which is now, let’s say, 15 years old. In the previous session you had malware analysis as one use case, and the general pipeline is as you see now: first, you have a dataset of the needles, of the data which you are looking for. Because we are looking for needles, this dataset is a blacklist.
And the first phase is the construction phase, which in the case of approximate matching means that you take an input file, you extract features or blocks – typically characteristic blocks – and for space reasons you hash them – you cannot store the whole data as it is – and then you put this into the “database”. And “database” is in quotation marks because you should not think of a classical database. In the malware analysis in the last session it was, I think, SQLite, but in the general case you cannot expect that it is really a database format.
And we are talking about the fuzzy nature of the input data. Fuzzy nature means that you cannot sort it in a straightforward way; this is what you should keep in mind. And therefore approximate matching: we are looking for needles which are similar to the ones we know.
So this is how you construct your database – the dataset which you will use later on, during the lookup phase, the second phase. There you have a device which is maybe relevant in your case, and then you proceed similarly: you take an input data stream, for instance a file, a deleted file or a fragment; you extract the blocks again, you hash them, and then you look in your database whether you find something which is similar or exactly the same as what you are looking for. And this is the automated process.
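A minimal sketch of these two phases, assuming fixed-size blocks and a plain in-memory set as the “database” (the actual candidates use sliding-window block building and their own storage formats):

```python
import hashlib

def extract_blocks(data: bytes, block_size: int = 4096):
    """Placeholder block extraction: fixed-size, non-overlapping blocks.
    The schemes discussed in the talk use sliding windows or
    content-defined boundaries instead."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def block_hash(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def build_database(blacklist_files):
    """Construction phase: hash the blocks of every blacklist file."""
    db = set()
    for data in blacklist_files:
        for block in extract_blocks(data):
            db.add(block_hash(block))
    return db

def lookup(seized_data: bytes, db) -> bool:
    """Lookup phase: does any block of the seized data appear in the database?"""
    return any(block_hash(b) in db for b in extract_blocks(seized_data))
```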
But once again, we have the fuzzy nature of input data, and this means it is not that obvious how to decide about membership. If you take the naive, general approach, you have to compare something from your seized device against all elements of the database from the construction phase, and this does not scale very well. So this is what we call the database lookup problem: it is about membership, but being better than a brute-force comparison.
So, to summarise again: in this paper we are looking only for efficiency – we don’t care about accuracy or any classification metrics – efficiency in terms of time performance, and the use case is white/blacklisting; one of our candidates comes from a carving use case. And we generally assume that we are dealing with a large corpus.
And then we looked for candidates which are already somehow related to the digital forensics community, or which are interesting enough that we should take them into account.
A further goal – this was also mentioned in the previous session – is that we are looking for blocks which are somehow characteristic of an artifact. If you think of the identification of a malware portable executable, you have some sections with static file information inside, which is not relevant for a specific file, and this is what is meant by the deduplication phase in our work.
And then, once the construction phase is finished, we can also ask how static this database is – or, to put it the other way around, is it possible to add or remove elements after the construction phase is finished?
So. The candidates.
Three widespread candidates, I said. The first one is from Simson Garfinkel, coming from the DFRWS workshop four years ago, and this is also used in the well-known bulk_extractor tool. And one main reason we chose it is that it goes through the whole pipeline I just mentioned.
The second candidate is hierarchical Bloom filter trees. This is a concept which is now about five years old and which we use in our software.
And the final one, which comes from the database community and has the best lookup complexity – a constant one – is a hash table. Last year at the C++Now conference, Malte Skarupke presented what he claims is the best-performing hash map, which he calls the flat hash map.
So this was our preselection.
So we investigated these three candidates. Some more details on them: the hashdb, or the hash-based carving approach from Garfinkel, actually uses a database format which is memory-mapped – the LMDB format. In the paper four years ago, they used two datasets, a small one and a large one which contains about one million files. It features multi-threading by design and it is read-optimised. And it has built-in deduplication, which means that Garfinkel emphasised that he is interested in characteristic artifacts of a specific file. So he wants to prevent what we call a multi-hit, or what you can call a common block: something which is shared between several files.
In the hashdb you can add files after the construction phase; and for the block building, which I will explain in a minute, they use a fixed sliding window.
In our work, we often make use of Bloom filters. This image, if you are not aware of Bloom filters, should give you a small example of how a Bloom filter works. A Bloom filter is very space-efficient, and this is the reason why it is used in some approximate matching algorithms. For example, Vassil Roussev used it in sdhash, for storing the sdhash fingerprint.
And what you can see is that you have three elements, X, Y, Z, and these are represented by 18 bits. 18 bits – less than three bytes. The procedure is very simple: you insert the first element X into the Bloom filter by computing three values. Typically these are hash values, and each gives you an index within your Bloom filter. So X sets a one at this position, at this position, and at this position. And you proceed with Y in exactly the same way.
What you also see in this image is that this position is switched. The Bloom filter originally contains only zeroes, and if you insert an element you switch a position to one only if it is reached. So what you see here is that this one is actually reached by two different elements, and this is something the Bloom filter is unaware of: the Bloom filter itself does not know how often a certain position has been switched to one. It knows that at least one element set it, but not how many.
Later, in the lookup phase, if you want to decide whether the element W is in the set – in this set here – you simply take the same approach. You compute the three positions, and then you look: if W is in the set, then all three positions must be set to one. This is OK here; this is OK here; but this position here is not set to one, so we know that W cannot be part of the set.
However, if W mapped to, let’s say, this position, that one could also have been set by the Y element. So the Bloom filter has false positives – it has no false negatives, but it has false positives.
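A minimal sketch of that insert/lookup logic, using 18 bits and three index positions as in the picture (deriving the positions from SHA-256 is only an illustration here, not the scheme of the actual implementations):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: m bits and k index positions per element."""
    def __init__(self, m: int = 18, k: int = 3):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, element: bytes):
        # Derive k positions from consecutive 4-byte slices of a SHA-256 digest.
        digest = hashlib.sha256(element).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, element: bytes):
        for pos in self._positions(element):
            self.bits[pos] = 1              # switch the position to one

    def maybe_contains(self, element: bytes) -> bool:
        # All k positions must be one; a single zero rules membership out.
        # A positive answer can still be a false positive.
        return all(self.bits[pos] for pos in self._positions(element))

bf = BloomFilter()
for element in (b"X", b"Y", b"Z"):
    bf.insert(element)
print(bf.maybe_contains(b"X"))   # True
print(bf.maybe_contains(b"W"))   # usually False; with such a tiny filter a
                                 # false positive is quite possible
```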
So when we came up with Bloom filters for deciding about membership – this is now, I think, four or five years ago – the idea was to take a large Bloom filter, the root Bloom filter, which decides about the actual membership. But then you do not know which individual file it is, and to find the individual file you have to traverse the tree of Bloom filters; in the leaf nodes you have a small set of files, which you can compare by hand. So this is the main idea of the hierarchical Bloom filter tree.
And this concept was implemented two years ago by people from University College Dublin as a sort of proof of concept. This works.
Interesting for the lookup complexity is that, instead of O(n) – n being the number of elements in the dataset – it is decreased to O(log n), which is OK, but you have false positives, as I mentioned already.
And one real problem with Bloom filters is inserting or deleting elements. Inserting is somehow possible, but deleting is not, because you do not know whether a one in the Bloom filter was also set by some other element, so you cannot really delete elements. And this is what you should keep in mind when we later talk about the evaluation.
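A rough sketch of that tree traversal: a toy binary tree of Bloom filters over per-file block hashes, where only subtrees whose filter answers “maybe” are descended into (filter size, splitting and the example data are arbitrary, not the parameters of the real HBFT implementation):

```python
import hashlib

def positions(element: bytes, m: int, k: int = 3):
    digest = hashlib.sha256(element).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]

class Node:
    """One Bloom filter node covering a set of files; children split that set."""
    def __init__(self, files, m=1 << 16):
        self.m, self.bits, self.files = m, bytearray(m), files
        for blocks in files.values():
            for block in blocks:
                for pos in positions(block, m):
                    self.bits[pos] = 1
        if len(files) > 1:
            names = list(files)
            half = len(names) // 2
            self.children = [Node({n: files[n] for n in names[:half]}, m),
                             Node({n: files[n] for n in names[half:]}, m)]
        else:
            self.children = []

    def contains(self, block: bytes) -> bool:
        return all(self.bits[pos] for pos in positions(block, self.m))

def lookup(node: Node, block: bytes):
    """Return candidate files for a block by descending only into subtrees
    whose Bloom filter answers 'maybe'; leaves yield a small candidate set."""
    if not node.contains(block):
        return []
    if not node.children:
        return list(node.files)             # compare against these directly
    return [f for child in node.children for f in lookup(child, block)]

# Hypothetical toy corpus; in practice the blocks come from the block building.
corpus = {"a.bin": [b"block1", b"block2"], "b.bin": [b"block3"], "c.bin": [b"block4"]}
tree = Node(corpus)
print(lookup(tree, b"block3"))   # ['b.bin']
print(lookup(tree, b"nope"))     # [] (barring a false positive)
```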
The next slide gives some information about the flat hash table. Once again, the hash table itself has constant lookup complexity, so at first glance you would say you must make use of a hash table; but you do not yet know how long it takes to construct it, what the memory consumption is, things like that.
The hash table of Malte Skarupke has several features which make it very attractive for us. One is that, due to the Robin Hood hashing, you can ensure that an element is actually close to its ideal position. And it has no false positives.
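A toy sketch of the Robin Hood idea: on a collision, the entry that is further from its ideal slot keeps the slot and the other one moves on, which keeps probe distances short. This only illustrates the probing policy under simplified assumptions (no resizing, Python’s built-in hash), not the ska::flat_hash_map implementation:

```python
class RobinHoodSet:
    """Toy open-addressing hash set with Robin Hood probing (no resizing)."""
    def __init__(self, capacity: int = 16):
        self.cap = capacity
        self.slots = [None] * capacity       # each entry: (key, probe_distance)

    def _ideal(self, key):
        return hash(key) % self.cap

    def insert(self, key):
        pos, entry = self._ideal(key), (key, 0)
        while True:
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return
            other_key, other_dist = self.slots[pos]
            if other_key == entry[0]:
                return                       # already present
            if other_dist < entry[1]:
                # "Steal from the rich": the resident closer to its ideal slot
                # gives way and is carried further on instead.
                self.slots[pos], entry = entry, (other_key, other_dist)
            pos = (pos + 1) % self.cap
            entry = (entry[0], entry[1] + 1)

    def contains(self, key) -> bool:
        pos, dist = self._ideal(key), 0
        while self.slots[pos] is not None:
            other_key, other_dist = self.slots[pos]
            if other_key == key:
                return True
            if other_dist < dist:
                return False                 # the key would have displaced it
            pos, dist = (pos + 1) % self.cap, dist + 1
        return False

s = RobinHoodSet()
for block_hash in ("h1", "h2", "h3"):        # hypothetical block hashes
    s.insert(block_hash)
print(s.contains("h2"), s.contains("h4"))    # True False
```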
OK. Now you have a rough understanding of the three candidates. They are either very attractive, like the flat hash map, or they have already been discussed in the forensics community.
I will show you only a very short part of what we did in the requirements and capabilities section, but what is important, in our perception, is that you actually have discriminative blocks. Or, to put it another way, a multi-hit is a block which appears in different files: for instance, parts of a file header, or a popular software library which is statically linked into several executables. Then you will find this code, or this information, in different files, and it is not really discriminative.
So what we really want to avoid are multi-hits, and this is also supported by the hashdb approach of Garfinkel.
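A small sketch of that multi-hit filtering, dropping every block hash that occurs in more than one file of the corpus (the input mapping and file names are hypothetical; the real tools handle this in or after their construction phase):

```python
from collections import Counter

def deduplicate(file_blocks):
    """Keep only discriminative block hashes, i.e. hashes occurring in exactly
    one file. 'file_blocks' maps a file name to its set of block hashes."""
    occurrences = Counter()
    for hashes in file_blocks.values():
        occurrences.update(set(hashes))      # count files, not repetitions
    multi_hits = {h for h, n in occurrences.items() if n > 1}
    return {name: set(hashes) - multi_hits
            for name, hashes in file_blocks.items()}

corpus = {"a.exe": {"h1", "h2", "hdr"}, "b.exe": {"h3", "hdr"}}
print(deduplicate(corpus))   # {'a.exe': {'h1', 'h2'}, 'b.exe': {'h3'}}
```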
So this is the core result of our capability and requirements analysis. What you see in the left column are the main requirements and some further information – you see the storing technique, for instance. Block building means how to extract the characteristic blocks, all the blocks, from a given file, a given artifact. And the semantics are as follows: if a cell is green, this was originally not part of the concept, but we provided the concept and successfully implemented it. For the block building this was not really complicated, because we had used it for our different approximate matching approaches.
The block hashing, you see, is in this row. An important aspect, in our perception, is that at least in the lookup phase it is possible to perform multi-threading, simply for performance reasons. You do not want to wait hours or days for the results; the real time should be decreased by some parallelisation.
We have some multi-hit handling. What you see here should actually be red; the colour has somehow changed. As I already said, for hierarchical Bloom filter trees it is hard to insert elements, and it is practically impossible to introduce deletion in the Bloom filter after the construction phase without losing the good properties of the hierarchical Bloom filter tree.
And you have some more information; if you’re interested, you can have a look at the paper.
So, which extensions? I will cover them in one slide and give you more information on only one part of them.
Multi-hit prevention was not possible for the hierarchical Bloom filter tree or the flat hash map, so we provided the concept and also an implementation for that. What you can see is that for the hierarchical Bloom filter tree we evaluated the options, and the global filter in phase one was the better one.
For the parallelisation of the block building, I will give you a short insight into the actual problem and the concept. I talked about characteristic blocks which you want to extract. So, think of a file which has eight characteristic blocks. Then you have some algorithm which finds C1 – the first characteristic block – then finds its boundary, then you go to C2, and so on. Without parallelisation, you start at the beginning of the file and end here. The problem is, if you now want to parallelise this with four threads, you take four segments of the file. If a thread starts at this position it is no problem, but if it starts at this position here, then it does not know where the next boundary of a characteristic block is. But that is not really hard, because you then go to the next boundary, start there, and continue as long as you are within your own segment. So from a conceptual point of view this is not hard, and therefore we integrated it into our implementation.
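A sketch of that segment-wise parallelisation, under an assumed content-defined boundary predicate (the predicate, the thread pool and the segment handling here are placeholders to illustrate the resynchronisation idea; in Python the GIL would limit the actual speed-up):

```python
from concurrent.futures import ThreadPoolExecutor

def is_boundary(data: bytes, i: int) -> bool:
    """Hypothetical content-defined boundary predicate (stands in for the
    rolling-hash condition of a real block-building algorithm)."""
    return i == 0 or data[i] == 0x00

def blocks_in_segment(data: bytes, seg_start: int, seg_end: int):
    """Extract every block whose start lies inside [seg_start, seg_end).
    A thread starting mid-file first scans forward to the next boundary,
    so each block is produced exactly once by exactly one thread."""
    i = seg_start
    while i < seg_end and not is_boundary(data, i):
        i += 1                               # resynchronise on a boundary
    blocks = []
    while i < seg_end:
        j = i + 1
        while j < len(data) and not is_boundary(data, j):
            j += 1                           # a block may run past seg_end
        blocks.append(data[i:j])
        i = j
    return blocks

def parallel_block_building(data: bytes, threads: int = 4):
    size = len(data)
    segments = [(t * size // threads, (t + 1) * size // threads)
                for t in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda seg: blocks_in_segment(data, *seg), segments)
    return [block for segment_blocks in results for block in segment_blocks]
```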
The runtime for that: we have the real time in the first row and the processor time in the bottom row. And what you see here: the wall-clock time, if you only use one thread for a dataset of about 2GB, is roughly 44 seconds – the CPU time is a bit lower. But what is of interest to us is that the real time decreases to one third if we make use of multi-threading. So this actually works, and it decreases the overall time.
Then, finally, we provide some evaluation with respect to different aspects. The first one is the memory consumption: the larger the dataset, the larger the memory consumption will be. So this is one important aspect, and it shows that the flat hash map and the hierarchical Bloom filter trees are approximately the same; the Bloom filters have a slightly smaller footprint.
Then we evaluated the runtime of the construction phase in both modes, with a single thread and with multiple threads – eight in our evaluation. We did the same for the deduplication and also for the lookup phase.
And I will show you only some results for the runtime performance of the lookup phase. What you see here is only runtime measurement: the runtime in seconds is on the vertical axis. And the question is: how does the runtime depend on the matching of the artifacts?
What we did is compare identical files against each other, and then you get this picture. You see that the hashdb is the slowest one if you compare fully identical files. This number here means that if, by design, the seized information is only 75% of the original, then the runtime of the hashdb decreases. This is due to pre-filtering. And so on.
What is interesting here is the actually relevant picture, meaning the multi-threaded use case: you see that the gap between the hierarchical Bloom filter trees and the flat hash map is not as big as you might expect, even though the theoretical lookup complexity of the Bloom filter tree is logarithmic and that of the flat hash map is constant. And the hashdb is much slower.
And this is the overall picture, where we assessed the different aspects. If you go through the columns, you see that the hashdb had all the features before; we implemented them for the HBFT and the flat hash map. But if you look at the property you are actually interested in later, when finding the needle – the lookup – the flat hash map is the best one, the hierarchical Bloom filter tree is the second one, and the hashdb suffers from some difficulties.
This is the conclusion: the flat hash map outperforms both other candidates. The hierarchical Bloom filter tree is by design as it is – we cannot improve it without losing its original conceptual advantages – and therefore we will make use of the flat hash map in our memory carving engine.
Get in contact if you’re interested in this or similar things, if you’re still at a junior level and you want to do an internship at CRISP, you’re welcome to contact us. Thanks.