Nugget: A Digital Forensics Language

Vassil Roussev discusses Christopher Stelly’s research at DFRWS EU 2018.

Transcript

Vassil: Thank you, everyone. We’re behind schedule, so we’ll try to rush through it … maybe you won’t notice the details. And before we start, since we mentioned that … first of all, I want to thank Daryl and everybody on the host team for putting up this event. I know it’s a lot of work. And finally, we’ve reached a point where I can just come and enjoy it, and just present. In fact, the way I had it set up, I would have my student present, so I can just enjoy [00:41].

On a more serious note, actually, Chris, he was set up to come and he was registered, and he had a family emergency, and it’s a very unpleasant situation for him. So, I’m filling in, so I can … in terms of credit, I will talk about the concept behind this, because I think that’s the most important part that I can explain. In terms of technical implementation, it’s all Chris. Chris is my PhD student, and apart from that, he’s a professional [plane] tester for Lockheed Martin, so he’s a very impressive technical guy.

Since we’re in Florence, if you need a metaphor, I gave him kind of a sketch on the back of an envelope, and he’s painting like a three-panel masterpiece. The first panel was presented [01:41] in August, in Austin. So, this is the second piece, and there’s one more piece, and then we’ll call in Dr. Stelly.

Alright, so what is Nugget and why do we need it?

Instead of me walking you step by step … it’s a language, every language has users, and here I see about a hundred potential users. So, since user studies are very difficult to set up, I’m going to use the opportunity, I’m going to run a one-minute user study. What I’m going to do is I’m going to show you five lines of Nugget code, and I’ll give you one minute to try to make sense out of it, and then we’ll have a quiz. Alright.

[silence]

Alright, time’s up, and this is really an informative experiment for me. How many of you thought that this was an instance of known file filtering? Okay. I’ll take that in one minute.

How many of you expected to see hashes come out of the output? Alright, bridge too far.

This is actual [executable] code, if we have the time, I will run it for you, or at least I’ll show you … I’ll tell you how to run it later. But you can see that … this is a domain-specific language, so what we’re trying to do is we’re trying to describe the domain of forensic computation. So, my [personal relationship with it] started maybe five, six years ago, when I promised to write a chapter on digital forensics, and the first thing I started with is sort of the definitions, and I started looking at these definitions, and none of them was satisfying.

So, they were indirectly describing what digital forensics is. So, it’s kind of the application of these methods for forensic purposes, which … if I told you that a car is something that does transportation, it really doesn’t tell you much.

I’m a computer scientist, so I’m looking at this as a computational problem. So, long story short – we’ve worked with this … so, this is a little position paper that nobody read three years ago, which is fine. And here’s what we’re trying to accomplish. Of course, this is by Martin Fowler, so you can’t argue with that.

DSL is a limited programming language which is focused on a particular domain. You’re all familiar with SQL, HTML, all of these are examples of such languages. According to Fowler, there are four characteristics that you want. The first one is formality. Formality here means that this is executable – so we can actually run that, and it will give results. So, you have grammar, you have semantics, and so on. Second one is fluency – this is what we were doing in that minute. Fluency is natural to the domain expert. I should not really have to give you even a one-hour tutorial. Things should start jumping out of the page and speaking to you almost immediately. It is limited – so we’re not trying to do everything. This is not Python, it’s not C, this is a high-level description. So, as you add more things and you try to make it more general purpose, you’re starting to lose some of the benefits. And of course, ease of use – this should be much easier. As you saw, if you try to describe this in any programming language that you know of … you [can] describe this in five lines.

Here’s a couple more examples. Here’s something we can do with traffic. And by the way, the actual filtering and the actual processing, we don’t do any of that. This is a software integration project, if you will. We’re not going to redo what Sleuth Kit does, we’re not going to do what Volatility is doing, we’re not going to redo [tshark]. All of this is outsourced. And here’s a memory dump example. These are from the M57 corpus. So, essentially, here’s your target, [parse] this memory, and then extract the [process] list.

Here are some examples. Essentially, we’re trying to create the SQL of forensics, if you will – although it’s a different kind of language. If you’re familiar with Apache Pig … Apache Pig is the immediate inspiration for this effort. We’re trying to do a few things, and we try to do them well. And again, effortless … things need to be effortlessly obvious in terms of meaning. Of course, you can … I was reading this. You can [quibble whether LaTeX] or regular expressions are effortlessly obvious, right? Especially regular expressions of this size.

So, why are we trying to do this? There are a number of reasons. One of them, by the way, is that we can. It is … there are probably about three programming languages per person on earth, which should tell you that we’ve come a long way from the [dragon book]. Back in the ’70s, we have to remember that everything related to compilers was almost literally black magic. People didn’t know how to do it. 40 years [later], you can hack together a little language in one afternoon, right? So, we can do it.

But the real motivation is that what we want to do is we want to strike a balance … I’ll show you a little diagram. We’re trying to define this formally. We want to describe the entire computation. Because it is a computation. And once we describe it, there’s a whole bunch of benefits. And the most obvious one is that if I have that description, then it becomes automatically reproducible and verifiable. It is not the investigator’s notes, it does not depend on the tool. Different tools can implement … so think of this as an interface, so different tools can implement it, just like we have different implementations of the SQL standard. So, it self-documents, so, “This is what I did.” And then you can go back and reproduce it.

Another important concern here is that we’re separating the specification. We’re separating what needs to be done from how it needs to be done. And right now, these things are bundled together. And the performance concerns, for example, become the investigator’s concern. Investigators are not coders, and they should not be coders. They should understand what kind of evidence is there, what its utility is and what it can prove and what it cannot prove, but they’re not coders. And other benefits here is that if you have two tools, you can actually test them. One of the main questions in forensics is how do we test these commercial tools? They’re black boxes, right? In the States, what they do is they basically send an expert and say, “Yeah, it works.” And it’s sort of a reputation competition. [And they say], “Well, we’re known, we’re listed on the stock exchange, so we must be right.”

This is a principal problem, but if they were to implement an interface like this, we could test them, without actually looking at their code. So, you can design your tests. You can also measure the performance. You can look at these and have a best-of-breed competition. The other part is users and … in digital forensics, we have users and we have software engineers, and they don’t really have … so, users have a dysfunctional relationship with vendors. Vendors come and say, “Hey, do you want these new features?” And users say, “Sure.” Who will say no to new features? But users have no mechanism to specify “These are the kind of computations I’m interested in. This is what is running slow. This is what I actually want you to do better.” And by having this, we say, “Okay, this is my caseload, these are the kind of things that really don’t work on your tool. And now I can have a proper conversation.”

If you’ve written anything related to databases, you don’t optimize your SQL code. You don’t optimize any code, because there’s the compiler, and we have all that 40 years of technology, we know how to do this on the backend. Quite separately, you can schedule this in a cluster, which is a different component to this. So, you want that to be somebody else’s problem. If you [de-investigate] it, that has to be somebody else’s problem. And this allows us to separate these two.

So, if you want a bigger context as to where this fits in, we’re kind of in the middle. If you look at, on the left side, we have these abstract models that kind of stem from the cognitive. One of the models that I like to use is the Pirolli and Card model, which, if you’re familiar with information foraging theory … so, it kind of describes how we search for information. And we do exactly what it sounds like – we forage for information, that’s kind of the essence. They have a more formal model, which they did based on observing how intelligence analysts work.

So, it’s very useful if you’re trying to design the user interface of a tool. But it is not … what happened? Okay. [13:21]. Then we have best practices – I don’t have to explain this. Every police … every digital forensic organization will have best practices. So, these are more specific, and say this is how you look for contraband. So, there are some procedures, and a human can interpret them, but they’re definitely not executable. If you look at the other end, you have models such as Brian Carrier’s computer history model – this is starting with the computer, very low-level stuff. [13:48] another one of these … yes.

And they’re very appropriate, I find nothing wrong with them, but they are very low level. And there’s … Simson Garfinkel kind of formalized the notion of differential analysis, which we do all the time in [all] places. So, we basically look at the before and after, and try to figure out what happened in between. And that’s kind of an incremental technique, but it doesn’t describe the whole thing. You can’t just say that an investigation is a differential [thing].
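The before-and-after idea behind differential analysis can be illustrated with a minimal Python sketch. It is not taken from the talk or from any forensic tool; the function name `diff_snapshots`, the snapshot layout (a mapping from path to content hash), and the sample values are all assumptions made for illustration.

```python
def diff_snapshots(before, after):
    """Compare two {path: content_hash} snapshots of the same target.

    Hypothetical helper: reports what was created, deleted, or modified
    between the two points in time.
    """
    created = sorted(p for p in after if p not in before)
    deleted = sorted(p for p in before if p not in after)
    modified = sorted(p for p in before if p in after and before[p] != after[p])
    return {"created": created, "deleted": deleted, "modified": modified}


# Made-up snapshots; the short strings stand in for real content hashes.
before = {"/etc/passwd": "a1", "/tmp/notes.txt": "b2", "/home/u/report.doc": "c3"}
after = {"/etc/passwd": "a1", "/home/u/report.doc": "d4", "/home/u/tool.exe": "e5"}

changes = diff_snapshots(before, after)
print(changes)
```

The sketch only answers "what changed between two states", which is why, as the speaker notes, it is an incremental technique rather than a description of a whole investigation.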

So, we tried to be kind of in the middle and provide a high enough level of description so it’s meaningful to a human but it’s actually executable. Alright. I will run out of time …

So, it’s a declarative language. Another thing we think is clever is who’s going to design what’s in the language? Well, what we’ve worked on is basically, you can build it from the ground up, and say, “Okay, these are the tools that I have. Sleuth Kit … these are the different operations that I want to be included in the language,” and you can actually generate the grammar. And we use ANTLR, which is a fairly sophisticated tool, and will generate the grammar, and we can parse your code after that. So, you say, [15:12] 256, maybe you want to [15:15] three to include it, so you just … here’s the executable, you provide a little JSON description, you write a little bit of boilerplate wrapper code in Go, and it’s ready to be included as part of your investigation. This is the runtime, which is sort of the backend, which Chris built, and that’s kind of complementary work.

So, we use ANTLR, we use Go, we use Docker, and of course we use JSON, right? Everything has to have JSON.

So, we have four kinds of things, generally. If you’re familiar with functional programming, all of these things will look very familiar. We have extractors, so this is anything that takes a raw input and generates internal representation. So, this is [ingest], essentially. It could be a raw image, it could be evidence stored in some sort of container, and so on.

Then, we have transformers – hashing would be a primary example, and anything like that. So, you have the input, and you produce a completely new output. And we’ll have filters, so you have each … so we’re working with collections, and every element in the collection will have a number of attributes. So, any selection that you make … projection, and selections, that will … those will be filters. And of course, you have serializers, which will persist the results. And [in serializing, even showing this, the user will call it serializing].
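The four kinds of operations just described (extractors, filters, transformers, serializers) can be sketched in plain Python over a collection of attribute records. This is not Nugget syntax and not the Nugget runtime; the function names, the record fields, and the in-memory "raw input" are hypothetical stand-ins chosen only to show how the categories compose.

```python
import hashlib
import json


def extract(raw_files):
    """Extractor: turn raw input into a collection of attribute records."""
    return [{"filename": name, "size": len(data), "content": data}
            for name, data in raw_files.items()]


def filter_by(records, predicate):
    """Filter: select the elements of the collection that match a predicate."""
    return [r for r in records if predicate(r)]


def transform_sha256(records):
    """Transformer: produce a completely new output (here, one hash per record)."""
    return [{"filename": r["filename"],
             "sha256": hashlib.sha256(r["content"]).hexdigest()}
            for r in records]


def serialize(records):
    """Serializer: persist (here, just render) the results as JSON text."""
    return json.dumps(records, indent=2, sort_keys=True)


# Made-up raw input standing in for a disk image or evidence container.
raw = {"a.jpg": b"jpegdata", "b.txt": b"hello", "c.jpg": b""}

records = extract(raw)
jpgs = filter_by(records, lambda r: r["filename"].endswith(".jpg") and r["size"] > 0)
hashes = transform_sha256(jpgs)
report = serialize(hashes)
print(report)
```

Each step takes a whole collection and returns a whole collection, so the pipeline itself never loops over individual elements.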

Carving would be in the extractor category. So, it looks roughly like this, so we’ll have the compiler and the interpreter, the API, and this is what we provide, is the language runtime, and there’s the resource manager, which is kind of a separate component of this project. And then, on the left-hand side, we’ll have all these tools, which we did not write. We did not write a single tool. We did not want to do that. There are better people to do it. And you have your targets, and that’s what we feed it. And kind of a separate, side project – this is something that Chris had to do, and had to wrestle with the vagaries of [17:38] for several months to get it to work. We think we’re going to go the [17:42] route, so …

But if somebody else wants to write … you know, like the [NFI] guys, they have a system of being … [wanting] to look at for a long time. If they want to take it over …

So, this is a completely different problem. So, this is given, presumably, a cluster. So, these are the resources, these are the tasks. You’ve run the optimizations on the queries, and now you have to schedule this against the resources, perhaps subject to some priorities and other constraints. So, that’s a resource management problem, and you basically want to base that on something … it’s a hard problem, so you want somebody like Google to implement it for you, really.
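To make the shape of that scheduling problem concrete, here is a deliberately small Python sketch of greedy scheduling: tasks are taken in priority order and each goes to the currently least-loaded worker. It is a toy under stated assumptions, not the resource manager from the project; the task names, priorities, cost estimates, and worker names are invented.

```python
import heapq


def schedule(tasks, workers):
    """Greedy list scheduling.

    tasks:   list of (name, priority, estimated_cost); higher priority runs first.
    workers: list of worker names.
    Returns {worker: [task names in assignment order]}.
    """
    load = [(0, w) for w in workers]      # (accumulated cost, worker)
    heapq.heapify(load)
    plan = {w: [] for w in workers}
    # Highest priority first; ties broken by task name for determinism.
    for name, _priority, cost in sorted(tasks, key=lambda t: (-t[1], t[0])):
        busy, worker = heapq.heappop(load)  # least-loaded worker so far
        plan[worker].append(name)
        heapq.heappush(load, (busy + cost, worker))
    return plan


# Hypothetical tasks: (name, priority, estimated cost).
tasks = [
    ("hash-disk", 2, 8),
    ("parse-memory", 3, 5),
    ("filter-pcap", 1, 2),
    ("carve", 2, 4),
]
plan = schedule(tasks, ["node-a", "node-b"])
print(plan)
```

A real scheduler also has to deal with dependencies between tasks, failures, and heterogeneous machines, which is the part the speaker suggests delegating to existing cluster infrastructure.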

I don’t have time to show this, but it will basically show that it runs in … I’ll show you how you can run it.

The way to extend it … and again, the example, I wanted to show you the example. The irony is that I told Chris, “You know what? You should spend half your time on a demo.” And then I ended up having to do the demo and he could not quite finish, because, as I said, he had a family emergency, so he had to leave it as is, and it’s incomplete.

But you have a JSON descriptor, you run a generation script that will generate the grammar for you, you don’t really have to touch the grammar at all. And then you do some fairly standard stuff, which we can automate further. One additional benefit is that you can … because we use ANTLR, we can actually include IDE support. So, this will have completion in your editor, so if you’re using Eclipse or IntelliJ … in IntelliJ, you can just … it will have completion, so it will help you write these things.
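The descriptor-to-grammar step can be sketched as follows. The talk does not show the descriptor schema or the generated grammar, so everything here is an assumption: the JSON fields (`keyword`, `kind`, `executable`), the tool entries, and the rule layout are hypothetical, and the output is merely ANTLR-style rule text, not Nugget’s actual grammar.

```python
import json

# Hypothetical descriptor: one entry per external tool to expose in the language.
DESCRIPTOR = """
[
  {"keyword": "sha256", "kind": "transformer", "executable": "sha256sum"},
  {"keyword": "md5",    "kind": "transformer", "executable": "md5sum"},
  {"keyword": "pslist", "kind": "extractor",   "executable": "volatility"}
]
"""


def generate_rules(descriptor_text):
    """Group tool keywords by operation kind and emit one rule per kind."""
    tools = json.loads(descriptor_text)
    by_kind = {}
    for tool in tools:
        by_kind.setdefault(tool["kind"], []).append(tool["keyword"])
    rules = []
    for kind in sorted(by_kind):
        alternatives = " | ".join("'%s'" % k for k in sorted(by_kind[kind]))
        rules.append("%s : %s ;" % (kind, alternatives))
    return "\n".join(rules)


grammar = generate_rules(DESCRIPTOR)
print(grammar)
```

The point of the sketch is only that adding a tool means adding a descriptor entry and regenerating, rather than editing the grammar by hand.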

These are kind of the key points here. It’s a new approach to describing the computation, and that’s the whole point. Others have looked at DSLs from … they exist in some form … I think there was an effort to describe file formats using domain-specific language. You can look at Volatility and Rekall, and you can see elements of DSL in there. It’s an open architecture, because we wanted to be able to start small and grow. And we’re splitting the concerns, we think that’s a very important point, splitting the concerns of what needs to be done from how it needs to be done.

It also allows you – for me personally, I think that’s a very important thing – it provides you a bridge to the future, and I would say the near future. So, if you think about where data is accumulating, it’s accumulating in the cloud. So, in a few years, at most 10, it’s easy to predict that forensics is going to look very different. We’re not going to ask what happened. What we’re going to have is we’re going to have gigantic logs, and going to try to figure out what happened, how do we get semantics out of it. This will plug right in. So, nothing will change.

And then, we can do the performance optimization where it belongs, so it’s the runtime system’s job, it knows the resources, it knows how to optimize it. It should not be the investigator. And then, of course you can have testing and best-of-breed competition, which should be great.

So, Chris has put kind of a skeleton out. This is the command to run. So, if you have Docker, this will allow you to get in there, and you’ll see in the Nugget folder the Nugget executable itself, and you can just run it with the examples that are there.

This is the short version. Thank you.

Host: Thank you very much, Vassil. And the floor is open for questions. Wait. One there? Yes.

Audience member: Hi. Thanks for the talk, it was very interesting. If I understand correctly, each process should start or usually starts with an ingestion, like an extract command, and then data is somehow … somehow ends up in a common format or common structure, as you mentioned. I was wondering if you could tell more about this internal structure, how common it really is, how a network recording is in a similar structure to a memory dump or …

Vassil: Right. So, at this point, we use … this is an excellent point, and I’m thinking that, oh, and we’ll solve that problem with all of [work on case]. So, this is sort of the data representation problem. But the internal representation is very simple. It’s explicitly, but you can think about the JSON object. So, you have a collection of objects that have attributes, and that’s kind of what you’re working with. Another thing to kind of point out is that you’re working with collections. So, it’s a high level of abstraction. So, there’s no [for loop] here. There’s no [for loop]. You’re not working with individual elements. And this also tells you that you’re working at a higher … so, in functional programming, that’s what we do. But the internal representation, right now what we do is we launch these tools, basically UNIX tools, and we just kind of feed them the input parameters, and we have a Go wrapper that will process the output that comes out of it. So, at this stage, it’s fairly rudimentary. And of course, it would be improved upon, but it’s kind of … it’s an engineering problem, it’s not a conceptual problem.
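The wrapper idea described in this answer (launch an external tool, then turn its text output into a collection of attribute objects) can be sketched like this. The talk says the real wrappers are written in Go; this Python version is only an illustration, and the stand-in "tool", its two-column output format, and the record field names are assumptions.

```python
import subprocess
import sys

# Stand-in for an external command-line tool. It prints "<digest>  <name>"
# lines, roughly the shape a hashing utility would produce. Using the Python
# interpreter itself keeps the sketch runnable on any platform.
FAKE_TOOL = [
    sys.executable, "-c",
    "print('aa11  report.doc'); print('bb22  photo.jpg')",
]


def run_tool(command):
    """Run a command and parse its stdout into a list of attribute records."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    records = []
    for line in completed.stdout.splitlines():
        if not line.strip():
            continue  # skip blank lines
        digest, name = line.split(None, 1)
        records.append({"filename": name.strip(), "digest": digest})
    return records


records = run_tool(FAKE_TOOL)
print(records)
```

Once every tool’s output is normalized into records like these, later steps can operate on the collection as a whole without caring which tool produced it.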

Host: Okay. Down? There is a question down there. Two.

Audience member: Thank you, Vassil, for the excellent talk. We are also working on a domain-specific language for forensics, based on [logical] programming, and one of the issues we came across is the need to persist these collections between operations. Because each of these lines can take hours. So, what if you have to stop in the middle and then come back? How do you handle persistence?

Vassil: So, that is a problem that is … okay, let me show … this is a resource scheduling issue. So, this is not part of this talk, but yes, you absolutely need to do that. You need to have … our concept is that you have … the default is that you’re going to have some sort of cluster or cloud environment, so you’re running this on as many machines as you have to. So, you are going to have to be able to keep a persistent track of all the operations that need to be done, and you have to make sure that they are done, because at scale, you will have failures, so you have to go back and restart … so you have to keep track of the dependencies. But the short answer is that you’re going to need a persistent log. So, maybe if you use something like [Kafka], that will give you that reliability to keep track of the jobs. Yes.
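The persistent-log idea can be illustrated with a minimal Python sketch: every state change is appended to a log, and after a failure the log is replayed to work out which operations still need to run. A local JSON-lines file stands in for a reliable log service here; the event fields, task names, and dependency table are hypothetical.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Append one event to the log; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")


def runnable_tasks(log_path, dependencies):
    """Replay the log and return tasks that are not done but whose
    dependencies are all done (so they can be started or restarted)."""
    done = set()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            if event["status"] == "done":
                done.add(event["task"])
    return sorted(task for task, deps in dependencies.items()
                  if task not in done and all(d in done for d in deps))


# Hypothetical three-step job: each task lists the tasks it depends on.
dependencies = {"extract": [], "filter": ["extract"], "hash": ["filter"]}

log_path = os.path.join(tempfile.mkdtemp(), "jobs.log")
append_event(log_path, {"task": "extract", "status": "started"})
append_event(log_path, {"task": "extract", "status": "done"})
append_event(log_path, {"task": "filter", "status": "started"})
# Suppose the worker fails here, before "filter" is marked done.

pending = runnable_tasks(log_path, dependencies)
print(pending)
```

Because "filter" was started but never completed, replaying the log marks it for a restart, while "hash" stays blocked until its dependency finishes.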

Host: Okay.

Audience member: Have you got a feeling for how you might handle more complex operations? So, for example … well, sorry, how you’re going to balance readability and/or practicality that you have in your hashing example with actually preserving exactly what was done. So, for example, if you’re doing a keyword search, say, you’ve got a whole bunch of different options there – the word you want to find, the encoding, [do you] decompress objects within it and so on. That’s going to get really long really quickly, and it’s going to start to just look like a bash history. So, how might you balance those things in the language?

Vassil: Well, the short answer is that … you’re issuing the queries, and that’s kind of the history of what the investigator performed. Probably not all of it is relevant, so you may need to edit some of it out. But in terms of results, I expect that you will store everything. You will get a huge amount of data. But that’s the world we live in. A lot of the old designs and a lot of what … we kind of grew up being taught databases, relational databases, kind of updates in place. That’s out the window. Everything is a log, you will remember everything, you just need the means to process that. I think we’re going to end up with probably employing things like blockchains – not Bitcoin, but blockchains of sorts … that can help you with some of that. So, you can save intermediate results, say this has been reviewed, this hasn’t been touched, and so on. But I don’t know if that’s … Am I answering your question or am I getting …

Okay. I’m trying my best to interpret, but … yeah, there’s still going to be a lot of … if you have a million results for your keyword search, there’s no easy answer there. But what you might recall … you’ll have the order in which the results came in, you can automatically … which version of the tool was run, and so on. So, you have complete auditability now. You will probably need an automated means to do that audit, just because it’s going to be so much data.

Right.
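The "everything is a log" and "blockchains of sorts" remarks can be made concrete with a small hash-chained audit log in Python. This is one possible reading of the idea, not something shown in the talk: each entry records the query, the tool version, and the hash of the previous entry, so an automated check can detect after-the-fact edits. The field names and sample entries are assumptions.

```python
import hashlib
import json


def _digest(body):
    """Stable SHA-256 over an entry's fields (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def add_entry(chain, query, tool_version):
    """Append an entry linked to the previous one by its hash."""
    previous = chain[-1]["entry_hash"] if chain else "0" * 64
    body = {"seq": len(chain), "query": query,
            "tool_version": tool_version, "previous_hash": previous}
    chain.append(dict(body, entry_hash=_digest(body)))


def verify(chain):
    """Recompute every hash and link; False if anything was altered."""
    previous = "0" * 64
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["previous_hash"] != previous or entry["entry_hash"] != _digest(body):
            return False
        previous = entry["entry_hash"]
    return True


# Hypothetical investigation history.
chain = []
add_entry(chain, "extract files from image", "toolA 1.0")
add_entry(chain, "filter by extension", "toolA 1.0")

ok_before = verify(chain)
chain[0]["query"] = "something else"   # simulate tampering with history
ok_after = verify(chain)
print(ok_before, ok_after)
```

The check is fully automatic, which matters for the point made above: with this much recorded data, the audit itself has to be done by a program.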

End of Transcript