Do Digital Investigators Have To Program? A Controlled Experiment In Digital Investigation

Felix Freiling discusses whether digital forensic investigators need to also be programmers.

Transcript

This paper is written together with Christian Zoubek, my former student, who is now at the Technische Hochschule in Nürnberg. He cannot come, unfortunately, so I have the pleasure to give the talk.

The outset of this paper, of this research was that we would want to understand how digital investigators work. So the question is what is a digital investigator and what [his work mean]? So we started off … [off-mic talk]

We started off with the question: How do classical crime investigators work? And there’s lots of literature on that. Next slide please. Lots of literature which people from the police actually know or have actually looked at partly. So there’s a set of established methods and evidence types. So when you look at these classical crime investigation handbooks … and there’s also a very well-established separation of duties between the investigator – so the police officer who’s actually doing the investigation – and the scientists in the lab doing evidence analysis. And there’s a lot of documented experience that is not only in these books but it’s also used in education of police officers and people who are doing criminalistics work.

So you can go two slides in advance.

So the question we had at the outset of this work was how do digital investigators work? And there’s not so much literature on that. If you go to the next slide … I mean, there’s popular evidence in CSI and so on, which is very crazy, and I wanted to show you this video, but it doesn’t work … you know this double hacking excerpt from I don’t know which TV series, where there’s a woman forensic scientist being attacked, and she’s hacking away on her keyboard, and the hacks get more intense, and there’s a colleague coming up and hacking on the same keyboard to fight the attacker.

[laughter]

So this is what we know about how digital investigators work. Well, not exactly, but at least partly. The next slide, please.

So it’s unclear. And it’s still unclear what actually is a digital forensics scientist, how does he differ from the investigator role – it’s changing in both directions all the time. And there’s hardly any, at least peer-reviewed, literature, on how digital investigators work. And because we also want to teach digital investigators, we don’t only want to teach computer scientists … we know how to teach computer scientists for a couple of years, but how do we actually educate and teach digital investigators?

This brings me to the outline of the talk, on the next slide. So start with research questions, and we did an experiment to actually find out a little bit more. And then I would like to show you some results.

Let’s start with the research question and some terminology on the next slide. We had a couple of cases, and ‘case’ we define as basically a collection of evidence, mostly disk images, plus case description. And a case description is something that an investigator would get in reality, saying that “This is the crime, we are suspecting this person, and here’s the evidence. Answer the following questions that are interesting from a law enforcement perspective.”

And we had in this experiment a set of participants who also worked in groups. And we wanted to measure the effort that they used for different tasks. And so we measured the effort in minutes spent on solving a specific task, and of course we had group and individual effort. And of course, we wanted to measure quality. And quality, in a university setting, is always the grade that the students get. But basically, quality was defined … if you know what the outcome should be, what evidence they should find, we basically look at the percentage of correctly interpreted digital evidence.
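The two measures described here can be written down as a minimal sketch. The function names and the example numbers are hypothetical and do not come from the study; the only assumptions taken from the talk are that effort is counted in minutes and that quality is the percentage of correctly interpreted digital evidence.

```python
def group_effort_minutes(individual_minutes):
    """Group effort as the sum of the minutes logged by each group member."""
    return sum(individual_minutes)


def quality_percent(correctly_interpreted, expected_total):
    """Quality as the percentage of expected evidence items interpreted correctly."""
    if expected_total <= 0:
        raise ValueError("expected_total must be positive")
    return 100.0 * correctly_interpreted / expected_total


# Hypothetical example: three members log minutes, and the group
# correctly interprets 9 of 10 expected evidence items.
effort = group_effort_minutes([120, 45, 30])
quality = quality_percent(9, 10)
```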

And we looked at different task types on the next slide. We wanted to see how these different task types are distributed within the work. So basically, task type one was conceptual work with pen and paper, but also writing the documentation. Then of course, you have group work, you have group meetings and group discussion. You have technical work, like programming new tools or scripting stuff, interfacing with tools, correcting errors in tools, and so on. And task type four was actually doing the actual investigation, applying tools to the case, and doing the investigation.

After all of this terminology and task types, we come to the research questions. So we wanted to have a look to distinguish maybe different cases. Can you characterize cases and derive efforts or predict effort for particular cases? What are the strategies that are used for different cases by different groups? What is the distribution of the task types I’ve just mentioned? To learn a little bit more on how they work. And more statistical issues – so if you want to look at … if you have cases where there is a high effort required, what factors correlate with this high effort? Or can you actually predict effort in a certain way? You have evidence before, and then you predict the effort, [for practical purposes, good].

But also, of course, you want to have good quality results. So what factors of the investigation correlate with good quality? What do we have to do so the quality is good? And on the other hand, how can you predict quality? If you know that a certain setting is there, and a case is investigated, how can you ensure that the quality will be good in the future? We wanted to find this out, and we did an experiment, which starts on the next slide.

And we can go to the setting of the experiment. In the winter semester 2015-16, my Advanced Forensics course had 40 students, which is quite a large number. And I thought we have to use this somehow, we have to exploit this somehow, we have to do an experiment. Statistical data, we have to collect statistical data …

And we split these 40 students up in 10 groups and we prepared three, arguably, realistic cases. We had some examples from the police in [07:03], which these cases were based on, and we gave the students a pre-study questionnaire to ask about the motivation, about the previous grades and so on. And we got from them the final investigative report, which we used to grade. And at the end, we had data of 34 participants, which is quite good. Of course, it’s not enough to have statistical significance in the results, but it gives an indication of what a correlation might be.

Now, on the next slide, there are the three cases we had. Just to give you an idea – there is more in the paper about them. The ARP spoofing case, an administrator manipulating the system. The terror case, which was kind of … we didn’t want to phrase it as real terror, in a realistic setting – it was set in a Star Trek movie setting, so it was an assassination of the ambassador of the Earth on the planet Vulcan. So if we talk about it, nobody says, “Oh, there’s a terrorist attack going on here.” And the classical malware case, where we had an infection through an infected website.

We shaped these cases [so they] were a little bit similar. So there were three disk images which you found incrementally, so it was like a Capture the Flag type of operation. First you analyze the first, then you came to the second, and then to the third. And there was at least one false trail in each case description. So one case description said this guy is supposedly … he has child pornography on his computer. But there was no child pornography on the computer. Everybody was looking for it, but they found something else, and this was the real trail.

So the experimental design was we had the participants, we randomly assigned them to groups – so to not have a bias in the groups. The groups were randomly assigned to the cases, so to have no bias in the selection of cases. And we had a timeline for the experiment, and the idea was that they get … on the next slide, at the beginning of semester, they get the case description and the first exhibit, the first disk image. And then, on a regular basis, they have a meeting with the public prosecutor which commands the investigation, which was me. So I was sitting there and saying, “What did you find out? What are your conclusions? What should we do next?” They took some notes. And then, interestingly, they found an IP address, and they said, “Oh! By chance, I have this. I have a copy of the disk, of the server, of this IP address by chance.” And here, this is your next exhibit.
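The double randomization described here (participants to groups, then groups to cases) could be sketched as follows. This is an illustration only, not the procedure the authors actually used; the function name, the seed parameter, and the case labels are assumptions made for the example.

```python
import random


def assign(participants, n_groups, cases, seed=None):
    """Randomly split participants into groups, then randomly map groups to cases."""
    rng = random.Random(seed)

    # Shuffle participants, then deal them out round-robin into groups.
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = [shuffled[i::n_groups] for i in range(n_groups)]

    # Shuffle the group order, then cycle through the cases so that
    # the cases receive (almost) equally many groups.
    order = list(range(n_groups))
    rng.shuffle(order)
    case_of_group = {g: cases[pos % len(cases)] for pos, g in enumerate(order)}
    return groups, case_of_group


# Hypothetical labels: 40 participants, 10 groups, 3 cases.
participants = [f"student_{i}" for i in range(40)]
groups, case_of_group = assign(participants, 10, ["arp", "terror", "malware"], seed=1)
```

With 10 groups and 3 cases, cycling through the cases gives one case four groups and the other two three groups each, which matches the "at least three groups on every case" mentioned later in the talk.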

So they went on, and on the next slide, it’s the final … so they had to deliver the report at the end of semester, and then we looked at the report and there was a debriefing afterwards, where they were given the real, the realistic story, or the real evidence.

So that’s the experiment, and now we come to the experiment results. The first set of results was basically to just see how the task types differed for the different cases. And this is basically the sum of the groups that did the malware case. You cannot see much here, except that at the beginning you have more investigative work and at the end you have the documentation, which is usual probably in all programming or in all student projects. What you can see here pretty clearly – and if it’s not so clear, on the next two slides, maybe you can go through to the next slide again, and back two slides – you see that there’s a kind of deadline-driven behavior. There’s a bulk of work, and there’s basically no work. Then there’s another bulk of work.

And this is kind of a deadline-driven behavior which is typical for students. I’m not sure if that’s typical for investigators. We always say that’s deadline-driven behavior versus quality-driven behavior. [chuckles] And I would actually hope that in practice the quality-driven behavior would be prevalent, and not deadline-driven behavior. Although I expect that in practice there’s also deadline-driven behavior.

But you can see there’s always the shift from more the technical analysis work to the documentation work. Well, this might be expected.

So this points to the different strategies that people use. But now, how do the cases differ? This slide shows the total effort per case, over the groups. So there were at least three groups on every case, so we have an average here with a standard deviation, where you see that the terror case required by far the most effort, but also with a higher standard deviation. So the terror case was basically also the case which had the most unclear investigative goals. They said … so this guy has child pornography on his computer, and so they didn’t find anything, so they searched and searched [laughs] for evidence, but they found something else that had to do with a bombing and so on, and then they came to the public prosecutor and said, “Ah, yeah, I heard about this in the news. Investigate.”

The others had different … ARP spoofing and malware, they had lower effort and also much lower standard deviations. So clearer specification goals of investigation help, obviously, to narrow the standard deviation.
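The per-case comparison described here comes down to a mean and a standard deviation over the group efforts of each case. A minimal sketch, assuming total effort per group in minutes; the case labels and numbers below are made up for illustration and are not the study's data.

```python
from statistics import mean, stdev


def effort_summary(effort_by_case):
    """Return {case: (mean effort, sample standard deviation)} per case.

    Each case needs at least two groups, otherwise stdev() raises an error.
    """
    return {case: (mean(v), stdev(v)) for case, v in effort_by_case.items()}


# Hypothetical group efforts in minutes (not the published data).
example = {
    "case_a": [100, 200, 300],  # widely spread efforts
    "case_b": [190, 200, 210],  # tightly clustered efforts
}
summary = effort_summary(example)
```

Both hypothetical cases have the same mean, but the spread differs by a factor of ten, which is the kind of difference the speaker attributes to unclear versus clear investigative goals.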

Next slide is the effort as distributed per task type. This was interesting insofar as … well, not so interesting is that task type four dominates, the technical investigation dominates. But the conceptual work, task type T1, is almost half or more than half of the effort used for the actual technical analysis. So think about what you want to do, and then you do it – which of course we always teach the students, and partly they do it. The question is how this relates, for example, to practical cases.

Most surprising for me at least was that they did … nobody programmed. It’s interesting insofar that we thought that they are computer science students, they should automate stuff. But obviously, it didn’t pay off for them, so they didn’t automate in this case. This might also be interesting to look at in a future experiment – when do you program, when do you automate?

The next slide is a correlation between the effort that the group had and the motivation of the group. And the motivation was captured in the pre-study questionnaire. So “How motivated are you?” we asked the students. And interestingly, the motivation on the left-hand side – one is high, and five, further up, is low – and the effort correlates negatively with the motivation. So the higher … the planned effort correlates inversely … the more highly they were motivated, the less effort they spent. So higher motivation resulted in lower effort. It’s not so clear what this means. So there’s no clear correlation here, let’s say.

The next slide is of course interesting … the effort the people spent versus the grade they got. The quality of the result. Fortunately, [laughs] the effort correlates with grade. The more they invested, minutes, the better the quality of the result was. At least there was a slight positive correlation. And I think that’s sort of comforting. Or at least good if you want to motivate students to spend a lot of effort.

So it’s slightly counterintuitive, the quality is good, the higher it gets – 90% is the best. And the effort is to the right – so positive correlation between effort and quality.

The next slide is grade versus motivation. The grade does not correlate with the motivation, interestingly. The motivation is to the left is good and to the right is bad, and to the top it’s good grades. So the less they were motivated or [told] they were motivated, the better grade they had. This is also counterintuitive. So don’t ask the people how motivated they are. This will not predict the grade that they will have.

But on the other hand, a good predictor for the quality was the grade that they had in the course before. So they all had an introductory forensics course, and this is the last results slide. So we asked them, “What grade did you have in the basic course in forensics?” – which they did the semester before – and how does this correlate with the quality of the result that they had in the experiment? And for the grade of the course, to the left is better, one is the best grade that you can get, and the quality goes up. So there’s a slight positive correlation between the grade in the previous course and the quality in the current course. So previous grades seem to be a slightly good predictor of the quality of the results.

Some conclusions. The first is how can we interpret the results? There were no real hard hypotheses, there’s not much literature on this. So the work was mainly meant to be able to, after this work, formulate concrete hypotheses. So statistical significance was not so important for us. We wanted to just get a feeling for what types of experiments do you have to do. And one insight … starting of an insight was that well-specified investigative goals might reduce the effort. This is something which is also important for practice, so if law people ask questions to the technical people, they should phrase them in a way that is clearly understandable and easy to answer.

Maybe one insight is that the effort is more important than motivation. Okay, this is counterintuitive, at least – you always want to have highly motivated people because they spend a lot of effort, but it’s more important to spend a lot of effort than to have high motivation. Don’t tell this to my students.

And probably it’s better to use the quality of previous work to predict the quality of new work that’s going on.

And for future studies – it’s on the last slide – lessons learnt, lessons that we learnt from doing this experiment was that … we did group work, but this kind of made it harder to interpret the results, because we had the questionnaire which was done individually, and we had to, from this, calculate group measures. So if we were to do this experiment again, we would use individual work and not group work. Now of course it’s easier to formulate precise hypotheses, and of course if we were to do this experiment again we would use 100+ students. But hard to get – you have to do this over several years, which obscures the results again.

So if you at any time have the chance to do this experiment with 100+ people, I will gladly join in the course you’re doing. And one open question which I would also be glad to get input is the three cases were only comparable regarding the step-by-step manner. It’s hard to really compare cases, and this is a really critical variable in assessing effort and result quality.

So that’s what interests me.

And if you want to maybe do your own calculations on the data, the data is available online, anonymous – so no personal data inside. So you are … so I’m happy to share these insights with you on this data.

So that’s what I wanted to talk to you about. Thanks for your attention.

[applause]

Attendee: [21:18] I have a question. Could you go back to the slide number I think 31 … 31 I think. Yes. Exactly. I was wondering – for the correlation, you probably took what we call in mathematics the [L2] [21:42], which is pretty sensitive to outliers. I was wondering, for example, the one that you have on the right – my view on outliers – if you remove this, you see that the line will really go down.

Freiling: Yeah.

[laughter]

Attendee: Because usually, actually, good students work best. [laughs] So if you … it could be interesting to review the calculation taking a more robust … like for example the L1 or something like this. This would be less sensitive to outliers. The correlation sometimes is not completely straightforward.

Freiling: Yeah. Yeah, there’s surely a lot of things you can still do with statistics, and that’s why we published the data. So you can also continue the analysis there. And we were always very careful not to throw away data because we had so little. And of course, you can do better. That’s right.
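The attendee's point about outliers can be illustrated with a small sketch. It compares an ordinary least-squares slope with the Theil-Sen estimator (the median of all pairwise slopes). Theil-Sen is swapped in here as one simple robust alternative; it is not the L1 fit the attendee mentions, and the data points are invented for the illustration, not taken from the published data set.

```python
from itertools import combinations
from statistics import mean, median


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line (sensitive to outliers)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Median of the slopes between all pairs of points (robust to outliers)."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    return median(slopes)


# Invented data: four points on a falling line plus one outlier far to the right.
xs = [1, 2, 3, 4, 10]
ys = [5, 4, 3, 2, 10]

ols = least_squares_slope(xs, ys)
robust = theil_sen_slope(xs, ys)
```

In this invented example the single outlier is enough to make the least-squares slope positive, while the median-based slope still follows the falling trend of the other four points.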

Attendee: [unclear]

Freiling: We can get a more differentiated view, yeah?

Attendee: Did you have any way to assess the skill level of the students going into it? Because it seems like that person or that group with the [22:57] least amount of effort got 80 per cent. And I was just wondering, well, why is that such a … why did they get such a good grade when people who put in much more effort got a lesser grade? Is that just correlated to the fact that they were given the more open-ended problems?

Freiling: It probably has something to do with the tasks they got. And we could have a look at the data tied to the individual cases – to eliminate the influence of the case type. And the basic variable that we used to assess the skill level was the grade of the previous year course. But there are probably also better ways to assess the skill level.

I think in the questionnaire, there was a different experiment. You could ask, for example, how many images have you analyzed before or something, and use this as a variable. But we used, basically, just the grade of the previous year course as a …

Attendee: Another … I guess if you ever do the experiment again, another question might be what [24:03] tools did you use versus [GUI] tools. Because that can also be an indicator of their comfort level with using these tools.

Freiling: Yeah. In the paper, there’s a list of the tools that they stated that they had used. But we didn’t actually control when they used the tools. So for this, you would have to do an experiment observing, basically, the investigators all the time. Yeah. Video-taping sessions and so on. But that’s of course huge work to analyze later.

Attendee: Basically, we can measure the efforts of the students by looking at the tools they are using and the kind of logs they are generating and the kind of evidence they are trying to gather from all sorts of [24:52]. So that’s the way you can measure the effort. How do you measure the motivation?

Freiling: We measured the motivation basically by asking them.

[laughter]

Freiling: How motivated are you? How would you otherwise measure the motivation? The glow – measure the glow in their eyes.

[laughter]

Host: Any more questions?

[silence]

Felix, thanks.

Freiling: Thank you.

End of Transcript

This video was recorded at DFRWS EU, in collaboration with Forensic Focus. Find out more about DFRWS and register for next year’s conference on the official website.