Barbara Guttman And Jim Lyle On Confidence In Digital Forensic Results

Christa: How confident are you in the results of your digital forensics tools and techniques? Can you measure that confidence or defend it in a court of law?

Welcome to the Forensic Focus podcast, where monthly we interview experts from the digital forensics and incident response community on topics ranging from technical aspects to career soft skills. I’m your host, Christa Miller. 

In the United States, a legal precedent known as the Daubert Standard — named for the 1993 Supreme Court case it came from, Daubert versus Merrell Dow Pharmaceuticals — is used in federal and some state courts to assess the admissibility of forensic evidence. 

The standard relies on five criteria: whether a given method is first, generally accepted; second, empirically tested; third, interpreted according to established standards; fourth, peer reviewed and published; and finally, its error rate: the measure of confidence in its results. 

With me to talk about the intricacies of error rates and digital forensics today are two experts from the National Institute of Standards and Technology.

Barbara Guttman is NIST’s Program Manager of Digital Forensics and leader of the Software Quality Group. And James Lyle is Senior Scientist with NIST’s computer forensics tool testing project. Barbara and Jim, it’s an honor to have you both here. Welcome. 

Barbara: Thank you. 

Jim: Thank you. 

Christa: Let’s start with a document published in 2018 by the Scientific Working Group on Digital Evidence. ‘Establishing confidence in digital and multimedia evidence forensic results by error mitigation analysis’ is a document that points out that in digital forensics tools, errors can be both random, as a result of any given underlying algorithm; and systematic, as a result of how the algorithm is implemented. Please explain more and give us an example if you would please.

Jim: Well, the randomness of error is related to uncertainty in how you measure something, usually. This frequently shows up in a statistical situation where you’ve got an average measurement — when you… say, you take a ruler and you measure how wide is this object. If your ruler is calibrated in feet, then your measurement will be in feet. But if you take a different ruler and you measure things in inches, you’ll have a more precise measurement. And if you have multiple people take the measurement, you’ll have slight differences. Barbara will say it’s three feet and four inches. And I’ll say it’s three feet and three and a half inches. Neither one of us are right or wrong. It’s just, that’s the uncertainty because of the measuring stick that we’re using, literally. 

Now, for the systematic errors, that’s more like if you have a computer program. If you give it the same input, you should usually get the same output out there. There’s some conditions that arise when things get a little squirrely, but in general, you do that. But if your algorithm has a flaw in it, then you’ll consistently have the same error over and over again. So they’re slightly different things. They come up in different contexts. 

Now, suppose what you want to do is, you have a block of digital data and you want to make sure it hasn’t changed. So what you can do is… the simple way is, make a copy of the block of data, and then you’ve got the original data and you’ve got your copy. And if you want to see if the copy has changed, you just compare it bit for bit with the original data. 

Christa: Right. 

Jim: But that’s a huge pain to keep track of all these copies of all the crap you’ve collected. So what you can do is, you can compute a checksum. And the wonderful things about checksums is, they’re small and they’re finite. So I can compute a checksum of my block data. And then later on, if I want to be certain that it hasn’t changed, I compute the checksum again. And if I get the same answer, hey, I’ve still got the same data. 
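A minimal sketch of the record-then-recompute check Jim describes, assuming Python's standard hashlib module and SHA-256 as the checksum. The sample data and the function name are hypothetical, chosen only for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a short, fixed-size fingerprint of a block of data."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical block of collected data.
original = b"contents of an acquired disk image"

# Record the checksum at collection time instead of keeping a second copy.
recorded = checksum(original)

# Later: recompute and compare against the recorded value.
unchanged_copy = bytes(original)
altered_copy = original[:-1] + b"!"

print(checksum(unchanged_copy) == recorded)  # same data, same checksum
print(checksum(altered_copy) == recorded)    # one byte changed, checksum differs
```

The stored value stays 64 hex characters regardless of how large the block is, which is the "small and finite" property that makes it easier to keep than a full copy.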

Now, my algorithm for computing the checksum has a random error component to it. When you compute the checksum, there’s the possibility of what’s called a collision. In other words, two different blocks of data may actually have the same checksum. And the size of your checksum will tell you what’s the chance of having a collision. 

For example, an old traditional way in communications theory of doing checksums is called the CRC, the cyclic redundancy check. A 16-bit CRC only has… well, what’s two to the 16? 65,536 possible values. Once you get one more block of data than that, you’re guaranteed to have a collision. And there’s some other problems with CRCs that you get. You get clumps of values that come up very frequently. So that’s not the best algorithm for verifying integrity of data.
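The guaranteed-collision point can be sketched in a few lines. A 16-bit checksum has 2^16 = 65,536 possible values, so among 65,537 distinct blocks at least two must share a checksum. This sketch assumes the CRC-CCITT variant exposed by Python's binascii.crc_hqx; the block contents are hypothetical, and in practice a collision usually turns up long before the guaranteed limit.

```python
import binascii


def crc16(data: bytes) -> int:
    """16-bit CRC (CRC-CCITT) from the standard library, starting value 0."""
    return binascii.crc_hqx(data, 0)


def first_collision():
    """Checksum distinct blocks until two of them share a 16-bit CRC."""
    seen = {}
    # 2**16 + 1 distinct blocks cannot all have different 16-bit checksums.
    for i in range(2**16 + 1):
        block = f"block-{i}".encode()
        value = crc16(block)
        if value in seen:
            return seen[value], block, value
        seen[value] = block
    raise AssertionError("unreachable: the pigeonhole principle forces a collision")


first, second, shared = first_collision()
print(first, second, hex(shared))
```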

On the other hand, the cryptography folks have developed a variety of algorithms called cryptographic hash algorithms. MD5 is one of the popular ones. We did a calculation where we considered a teaspoon of water: freeze that into a little ice ball, and do that for all the water in the solar system. Okay? So you’ve got all these little ice balls. How many have you got? You’ve got a whole bunch of them. And the MD5 chance of the collision is the chance of marking two of these ice balls independently, and then finding them on a draw. So that’s not going to happen. That’s vanishingly close to zero. 

So that gives you an error rate for an MD5 collision: one in more numbers than you can think of. SHA-1 is like thousands and thousands and thousands, millions of times, smaller chance of collision. And for the truly paranoid, there’s also SHA-256.
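For a rough sense of scale, the chance of a random (not engineered) collision can be estimated with the standard birthday approximation, p ≈ 1 − exp(−n(n−1)/2 ÷ 2^bits), which assumes an ideal hash with uniformly distributed outputs. The file count below is a hypothetical figure, and the digest sizes are 16 bits for a 16-bit CRC, 128 for MD5, 160 for SHA-1 and 256 for SHA-256.

```python
import math


def random_collision_probability(n_items: int, digest_bits: int) -> float:
    """Birthday approximation for at least one collision among n_items,
    assuming an ideal hash with uniformly distributed outputs."""
    pairs = n_items * (n_items - 1) // 2
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-pairs / 2**digest_bits)


N_FILES = 10**9  # hypothetical: one billion distinct files

for name, bits in [("CRC-16", 16), ("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name:8s} {random_collision_probability(N_FILES, bits):.3e}")
```

Under this approximation each extra bit of digest halves the probability, so the 32 additional bits in SHA-1 make a random collision about 2^32 (roughly four billion) times less likely than with MD5.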

Christa: That’s interesting, because I know that in some courts anyway, at times, attorneys will bring up the theoretical or the controlled lab environment, examples of SHA collisions or MD5 collisions as a way of trying to disprove the method. But that’s not necessarily…

Barbara: There’s actually another paper put out by SWGDE that addresses the use of MD5 and the SHA algorithms because the MD5 algorithm has been broken from a security application point of view, but it does not mean it’s been broken for an integrity verification point of view. So there is a separate SWGDE paper addressing that. In general, people should be using the modern SHA algorithms, since the international cryptographic community put a lot of effort into getting good algorithms out there. And the SHA algorithms are the recommended choice. But from the point of view of Jim’s analogy, he’s going to go on to explain the difference between the random part, the chance of a collision. So setting aside for now the possibility of the engineered collision, the chance of a random collision happening is very, very small. But I’m going to set up Jim, he’s going to tell you that there’s other factors at play here, when you look at factors beyond the algorithm itself. Take it away, Jim.

Jim: Yeah. Barbara wants me to confess that when I try to write programs, I’m frequently a moron. When the NSRL was getting started, I was the local programmer liable to get something to run the quickest. So they gave me some code from our security division, which computes various cryptographic hash algorithms we were interested in. And so I put a wrapper around that and handed it back to them, and they had a program that would compute MD5s, CRCs and SHA-1s. I don’t think SHA-2 had come out yet.

Barbara: This was before that. Yes. 

Jim: Yeah. So this was, like, 20 years ago. So I did, and gave this to my buddy, Tim. And then Tim came back and said, Hey, why doesn’t this work sometimes? What do you mean, why doesn’t it work sometimes? It works every time I run it. And I ran the test data, it was fine. And then Tim showed me his results and I said, Hey, you’re getting the wrong answer. 

Turns out… it occurred to me, Tim, did you move this to a Windows machine instead of a Unix machine? Like, where I wrote it? Turns out, Windows has a helpful feature that when you read a line of text, it adds an extra character at the end of the line. As soon as you get that extra character in there, that’s one more character for the hash algorithm to chew on. And so you get a radically different answer. So I made a modest, literally one character, change to the program and then everything worked fine. 

But we looked at when it would fail, and it failed on text files. Binary files were fine, but we had a systematic error whenever we tried to compute the hash for a text file.
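A minimal sketch of how one extra character per line produces this kind of systematic error. The transcript does not show Jim's program, so the newline translation here is simulated with a hypothetical helper rather than reproduced from it, and MD5 from Python's hashlib stands in for the hash being computed.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def simulate_text_mode(data: bytes) -> bytes:
    """Hypothetical stand-in for a text-mode newline translation:
    every line feed becomes carriage return plus line feed."""
    return data.replace(b"\n", b"\r\n")


text_file = b"first line\nsecond line\n"
# Sample binary content that happens to contain no line-feed byte (0x0A).
binary_file = bytes([0x00, 0xFF, 0x10, 0x80])

# Text: one extra byte per line, so the digest is completely different.
print(md5_hex(text_file))
print(md5_hex(simulate_text_mode(text_file)))

# Binary sample: nothing to translate, so the digest is unchanged.
print(md5_hex(binary_file) == md5_hex(simulate_text_mode(binary_file)))
```

The hash algorithm behaves identically in both cases. Only the bytes handed to it differ, which is why the failure shows up for one class of input every time rather than at random.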

Christa: Hmm. 

Barbara: And this is a pretty good example of the difference between a random and a systematic error, in that the error rate of the algorithm didn’t change. It was the implementation. And it wouldn’t be natural to describe that as an error rate, like you wouldn’t count up all the binary files versus all the text files, because that would be a local answer for you. You would describe it as a systematic error. Like, this code only works for this situation. That’s the more powerful way to describe it. The way that communicates to somebody what’s likely to go wrong. 

Jim: And that allows you to mitigate the error because, Oh, if I have this situation, I know things are going to go south. So I avoid that situation.

Barbara: Or in this case we wrote more code, so that it wouldn’t be broken. And that’s a way where you look at… it’s really important to know what statistical technique you’re using. Does it correspond to the problem at hand?  

Christa: OK. 

Barbara: It will be a theme we’ll probably come back to more than once. 

Christa: Is it possible for a random error rate to be applied to any given digital forensic method, even if the tools are more likely to result in systematic error, or are the two types of error inextricable in digital forensics? 

Barbara: It is sometimes possible to have an error rate, but sometimes it’s not useful to have an error rate. The error rates are useful when you have random errors. 

Christa: Right. 

Barbara: Or some kind of randomness in your process somewhere, that you want to characterize. So you have to look at what the question is you’re asking and what the predominant sources of error are that you expect, to know if an error rate is even a reasonable statistical technique to use in this situation. Sometimes it might actually be a good technique, but you can’t get the data to come up with an error rate; but sometimes you can look and say, well, that actually, if I had been… even if I put in all the effort to get the error rate, it wouldn’t be useful because an error rate helps… like an error rate is for a sort of situation in general. And then you look at it, apply it then to a specific situation, you wanted it to be predictive of your situation or to future situations. And if it wouldn’t be predictive, then it’s once again, not a useful technique. So people have sort of grabbed hold of this idea of an error rate, which is a very, very powerful statistical technique, but it just doesn’t mean it’s the right one for every situation.

Jim: For error rates, these usually are associated with the algorithm, as opposed to the implementation. Implementation is fertile ground for systematic errors. But the algorithm is where error rates can usually be created. 

But one example is, if you have some sort of parameter that can vary, like size of the hard drive, if you have a lot of small hard drives, you’ll tend to have a lot of fragmentation in files. They’ll be broken up into pieces. If you have three- or four-terabyte drives, even though you’re working with lots of different files over a period of time, they tend to be all together in one chunk. And some of the techniques you might use are sensitive to something like that.

So if you have a small drive, you may have a high error rate being driven by fragmentation that you wouldn’t have in taking data off of a large drive. 

Christa: I see. 

Jim: And if you compute an error rate that’s sort of an average of the two, you meet in the middle where nobody lives. So you’ll have an error rate for somewhere in the middle size, but there isn’t any real data there. 

Barbara: Right. You’re more wrong more of the time. For Jim’s case, you end up being wrong more often. You’re wrong… like if you gave it for the big drives, you’d at least be right when you’re talking about big drives, and if you gave an error rate for the little drives, you’d at least be right when you’re talking about the little drives, but you’re never right when you average them together.
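The averaging problem can be made concrete with a little arithmetic. All counts below are hypothetical, invented only to show the shape of the issue, and do not come from any NIST test.

```python
def error_rate(errors: int, trials: int) -> float:
    return errors / trials


# Hypothetical results: (errors, trials) for each drive population.
small_drives = (30, 100)  # heavy fragmentation, many failures
large_drives = (1, 100)   # little fragmentation, few failures

small_rate = error_rate(*small_drives)
large_rate = error_rate(*large_drives)
pooled_rate = error_rate(
    small_drives[0] + large_drives[0],
    small_drives[1] + large_drives[1],
)

print(f"small drives: {small_rate:.1%}")
print(f"large drives: {large_rate:.1%}")
print(f"pooled:       {pooled_rate:.1%}")
```

With these made-up numbers the pooled figure is about half the real rate for small drives and many times the real rate for large drives, so it predicts neither population.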

Jim: The trick is to know when there’s a parameter that will have that behavior. And for this case, it’s easy to see it, but you probably don’t know what the parameters are driving your error rate.

Christa: So as the SWGDE document outlined, practitioners mitigate errors through the other Daubert elements. So testing, validation, peer review, and standards, including training and documentation. So whether accrediting a lab and its processes to something like ISO 17025 is the best path to consistency in these areas is a point of some contention in the community. But additionally, some research indicates that error rates themselves are not just necessary, but they’re also subject to cognitive bias. So with NIST having recently launched its restructured Organization of Scientific Area Committees for forensic science, what’s the goal in terms of adoption or strengthening reliance on standards in American forensic labs, and are the Daubert elements enough to mitigate the risk of errors? 

Barbara: The elements you described actually go well beyond the Daubert criteria. They are in fact generally well known factors for how to improve quality in the quality assurance and quality management communities.

NIST is very interested in having high quality standards out there, standards that describe scientifically based tested methods out there for the community to use. We’re really interested in having some kind of evidence for what works, both from a theoretical point of view, and can be practically applied in the lab, right? So that people can use good techniques, using good processes, to produce what are actually reasonably easy to understand results. So you have to look at all of these pieces more holistically if you’re going to achieve the goal of forensic science that we want for this country and for the world.

Christa: So given that this is a part of the Daubert standard in courts, what is the best way to communicate that to attorneys and judges that might be expecting the error rate? Is there an alternative that can be posed or is that going to end up being a matter of case law in and of itself?

Barbara: Well, that is the reason that SWGDE wrote that document: it proposed that instead of looking only at error rate, you should look much more holistically and look at uncertainty analysis, so look at error mitigation. So you should realize that that’s just one tool in the bucket of uncertainty analysis.

Christa: OK.

Barbara: And so don’t limit yourself: expand. And then in the document, it gives a whole list of places where errors can occur and what reasonable steps are to address these kinds of errors. One of the important questions, I mean, when you look at error rates and likelihood ratios is, you actually need to have the data to be able to calculate them. You have to have enough data. Likelihood ratios have become powerful in the DNA because they actually did these huge population studies. But if you don’t have a population study, or your population study is too small or too skewed to one set of factors, you’ll end up with data that’s not actually predictive of the future. And once you’ve understood it, you might, you should really go the next step and mitigate it, right?

Christa: And I think that’s something that the SWGDE document alludes to when it talks about the rapid rate of change in information technology and computing in general, where you know, you do end up, I think, right, with these sample sizes that are too small because the technology is changing so rapidly. Is that accurate?

Barbara: Yeah. And that can happen when a new technology comes out that does things differently. So like, you know, a lot of our understanding of things was based on when we were using conventional hard drives and then, you know, we switched over to solid state drives. They actually work quite differently and it changes things. And sometimes people… you don’t realize what a change made at an operating system level… you know, they’re constantly trying to optimize and they change their underlying algorithms. And it makes a difference.

Christa: Bearing that in mind, practitioners and attorneys tell me the Daubert standard is used mainly to admit evidence identified through novel methods. Digital forensics methods that rely on new exploits or APIs, for example, might be considered novel. How can a truly novel method be distinguished from one that appears to be new, but really relies on existing standard practice?

Barbara: So within the SWGDE document that is addressed, which is… so digital forensic practitioners are often faced with receiving a piece of equipment for analysis that they’ve never seen before. Or it’s a new version of the operating system, it’s a new version of the app. There’s something new about it. And sometimes you have to write more technique, write more code to handle it. You have to develop something.

And so the SWGDE document is very clear that when you’re doing something new, you need to have a technical review. You know, it’s, it’s pretty, you know, people can make mistakes, especially when you’re generating a new technique, depending on what the technique is, would depend on what kind of technical review you would need to have, because it’s, you know, it’s not just the courts that care about quality. Forensic labs care about quality, too, right? They want their answers to be correct. So when you’re doing something new, it’s pretty good to have a technical peer review it.

Jim: One of our favorite examples of this type of situation that will arise is credit card skimmers. The investigator will have a seized credit card skimmer and somewhere out in the net, possibly the dark net, he will find the software that will interact with this particular item, which is probably of limited production, maybe three or four, whoever the vendor of this is sells them. But then what he’s got: he’s got some software, he’s got this tool, and he has to run the software on the tool to see what he can obtain from it, from the software. So he doesn’t get to test this very much. The testing comes in the sense that he’s got some data off the tool. It probably ties to, say, a restaurant where this was used and a list of credit card numbers. And so he checks those credit cards against people: Were you a customer at this restaurant? Yes, I was. Have you had some unusual charges? Yes, I have. Well, we’ve found the problem and we’ll need your testimony… and maybe not even need the testimony.

But you’re not going to be able to do a really thorough scientific test of how the tool works and how the software works. You just have to go with: well, get the standard of: I’m getting reasonable data out of this. And it’s consistent with what I can determine out in the real world with normal investigative techniques.

Christa: Okay. And I think that speaks to the need for fairly extensive testing in labs. On the flip side of that, there’s a fair amount of opportunity cost of testing, as opposed to casework. Is there any kind of guidelines in helping forensic practitioners mitigate that opportunity cost?

Barbara: From a testing point of view, I am eager to see people share the burden of testing, because generally the work that’s done in lab A testing something versus lab B versus lab C, to the extent we can pool our testing knowledge, that’s very valuable, because there’s so many different types of tests that could be done. 

Another part of this error mitigation document tries to look at how the forensic examiner needs to understand when further peer or management review is needed. When, like, you have tried to look critically at the work, right, with a sort of understanding of computer science to have an understanding when things would go wrong. With the more complex items that they face.

Christa: Which seems to go to a fair amount of self awareness, I guess, right? Just to know that there’s the potential for cognitive bias and those kinds of instances.

Barbara: There’s the potential for an assumption that things work like they did in the past, to give it a sort of easier to digest name.

Christa: On a related note, digital forensics tools are arguably becoming more opaque as vendors implement technology like artificial intelligence. They obviously are marketing legal defensibility of the evidence their tools help to capture. What is an appropriate level of transparency that they can offer their users around systematic errors? Going back to the point you just made, especially in such a competitive market where they may not want to disclose problem areas.

Barbara: So I mostly know of AI based tools and the pattern recognition, like face ID recognition or detecting like deep fakes and altered video. That, sort of?

Christa: Yeah, it’s an example. I’m also thinking in terms of vendors that are implementing, say, new exploits on mobile devices or just new methods in general. Maybe even ones that we were not even fully aware of.

Barbara: So I would put those in… I would separate those entirely, because they have very different types of things that could go wrong associated with them. So when you’re talking about an AI based tool, it learns based on what it’s seen in the past. And sometimes they can learn odd things because they didn’t have a rich enough data set to begin with. So anything, any result that comes back that’s based on an AI based system, it needs to be well labeled that it came from an AI based system, so that the human who has to interpret it knows this, and knows they have to do the human side of putting it in context for what it means. Right? 

And certainly you see that with, you know, when something comes back from a face ID algorithm, you know, they make mistakes, but that’s why there’s a human added into the process. Or there’s supposed to be a human added into the process. So you have the power of computing, plus the smartness of humans. 

And the same thing for a deep fake analysis program, you know, what comes back then the analyst looks at to once again add the human level of intelligence, which is very different from artificial intelligence. 

That’s quite different from people who’ve used an exploit to break some aspect of IT, like to bypass encryption on a phone, or something like that. Those ones are, from a quality point of view, a little easier to deal with because you either get something back or you don’t. Right?

But it depends, like if the exploit is used to do something else in the computer, you just have to have an actual analyst at this point, with an understanding of computer science, to know what this means in the context it was used in. They’re both interesting and important problems, but I would put them in very different bins. 

Let’s put it this way. The easiest programs to test are password crackers. Because you know when it got it right.

Jim: The good news about all the tools we’ve tested is, they never manufacture evidence. The place where things go wrong is, well, maybe they missed this.

Barbara: Well, I would say, when people are reconstructing deleted things, things will get put together that don’t belong together.

Jim: And even when things that don’t belong together are put together… say like, it’s a Word file, some text. You can see where the stitches are. And if you see that, wait a minute. If I read from here, this line that I’m reading, the next line is a jump. It doesn’t follow it, doesn’t fit. This is not put together properly. Same thing with images that are put together, there’ll be a jump, a discontinuity. It tells you to watch out.

Barbara: They had that problem with the Casey Anthony one where the jump was harder to detect. 

Jim: Yeah. 

Barbara: So it is possible for digital forensics to bring back results that can be misinterpreted. 

Jim: Yes. 

Christa: Well, I was going to say, I think I saw a presentation at the HTCIA conference last week that was on video forensics specifically. And some of the frame by frame analysis, or some of the frames that come back, can be missing. And so there’s a literal visual jump in that video that can show something quite different than what actually happens. So, I mean, that’s maybe a little bit more concrete than what you’re talking about.  

Barbara: That’s a little bit easier to understand, yes. You don’t have anything that’s wrong, but it might be… 

Jim: Misleading. 

Barbara: Misleading. And for that, like when you’re doing video, you have to have somebody who actually understands video and human perception to do the human part of that. So there’s a need for expertise. 

Christa: So I want to go back to users’ responsibility to test their methods and their tools and whatnot. You obviously have the computer forensic tool testing project. Is there anything that you want to call out specifically about that; about either the federated project or any other aspects of what you’re doing there? 

Barbara: Yes. So we’ve made available the NIST level of testing in a form — it’s called federated testing — that people can use themselves in the comfort of their own labs. Since we put a lot of effort into it, I suppose it’s not a surprise that we think it’s a pretty reasonable balance between rigorous testing and testing you can actually finish in a reasonable amount of time. That is always a challenge. Jim and I argue about that daily. 

But to do a pretty good test of your tool… so it doesn’t… it only covers, I can’t remember how many functionalities we have in there now, but one of the big ones is mobile, which is very widely done in the labs. So we recommend people use something like federated testing so that when you look at testing, we try to look at it very much from a computer science point of view, and the intersection between what is useful to forensic practitioners and what are the kinds of things that are more likely to go wrong, based on an understanding of what’s going on one, two or even three levels below what the user sees. 

So a lot of times people will test a tool [indecipherable] their approach to testing as, well, can it work, which is actually a very useful kind of testing, right? You don’t want to choose the product that actually can’t work, but you also want to go the next level to understand, well, when does it fail? What’s appropriate to use the tool for, and what’s not appropriate to use the tool for?

Jim: I tend to think in terms of examples and sometimes when you’re testing stuff, it’s hard to decide, well, is this behavior wrong, or does this behavior matter? And an example came up in string search testing. Some of the tools have a button that says go find social security numbers. So, okay. We were testing a particular tool that had two different ways to search for strings.

So we hit the button on method one, which is called indexed search, and that’s where the entire digital data is scanned and an index is built. And so to find strings, you just look in the index and see what’s there.

The other method is a live search where every time you say ‘search,’ it searches from the beginning of the data to the end of the data. We got different results for social security numbers. 

I had three different social security numbers in there, or maybe four, but it turns out that the Social Security Administration does not issue numbers from the entire space of possible numbers. No social security numbers issued by the Social Security Administration begin with a number higher than seven. Okay? So one of my numbers was 123456789 and another one of my numbers was 98765. Okay? So one of the search methods found both numbers, and one of the search methods only found one. So a little research into social security numbers disclosed the seven thing. So, is it wrong to report both numbers, or should you report just the valid social security number?

Barbara: Oh, and there’s a twist on that, by the way.

Jim: There’s always a twist. The IRS messes things up, because there are a number of… we don’t have to even go to illegal methods. There are a number of legal situations where you are liable for income tax, but you’re not liable to pay social security, or receive social security. So you can’t get a social security number. So what do you do? You go to the IRS and say, I really want to pay my taxes, but I don’t have a social security number and I’m supposed… “Oh, don’t worry. Don’t worry. We hand out tax ID numbers.” Really? “Yeah. They look just like social security numbers, but they start with a nine.” So, what’s the correct behavior for the tool?
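A minimal sketch of how two search behaviors can both be defensible yet return different hits. The function names, the sample text and the test value 987654321 are hypothetical. The "first digit no higher than seven" filter encodes only the rule of thumb as stated in the interview, not the full Social Security Administration numbering rules, and this is not how any tested tool is known to work.

```python
import re

# Nine consecutive digits that are not part of a longer run of digits.
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def search_any_nine_digits(text: str) -> list[str]:
    """Report every nine-digit string and leave interpretation to the examiner."""
    return NINE_DIGITS.findall(text)


def search_plausible_ssn(text: str) -> list[str]:
    """Report only candidates whose first digit is 7 or lower
    (the interview's rule of thumb, an assumption of this sketch)."""
    return [hit for hit in NINE_DIGITS.findall(text) if hit[0] <= "7"]


sample = "ids: 123456789 and 987654321, phone 5551234567"

print(search_any_nine_digits(sample))
print(search_plausible_ssn(sample))
```

The permissive search would also surface nine-prefixed tax ID numbers of the kind Jim mentions, while the filtered search would silently drop them. Neither is wrong. The examiner simply needs to know which behavior the tool implements.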

Christa: Right, right, right. 

Jim: So this is where we get our: We report just the facts, ma’am, you figure it out.

Christa: And again, going back to the importance of testing. Right? 

Barbara: Right. The importance of testing. So people understand what their tool does and doesn’t do. 

Christa: Right, right. 

Barbara: It might be desirable to you to have a tool that finds only legit social security numbers, but you might also want to find the IRS-issued numbers.

Jim: Here’s a great example of: we’ve got a tool that will give you two different answers. 

Christa: Right. Yeah. Well, this has been a fascinating discussion. Barbara and Jim, thank you again for joining us on the Forensic Focus podcast. Thanks also to our listeners. You’ll be able to find this recording and transcription along with more articles, information and forums at www.forensicfocus.com. If there are any topics you would like us to cover, or you’d like to suggest someone for us to interview, please let us know.

1 thought on “Barbara Guttman And Jim Lyle On Confidence In Digital Forensic Results”

  1. Thank you for addressing the especially important topic of error rates and uncertainty regarding digital tools.

    There are two points raised in the podcast that I would like to add something to, as they seem to be linked to common misconceptions about probabilities and recur frequently in discussions around uncertainty in digital evidence.

    First, whilst it is indeed easier to talk about error rates with random events, it can also be useful to describe systematic errors through probabilities, if the systems are sufficiently complex or not all entry parameters are known. In fact, this is exactly what we do if we talk about coin flips, likely to be one of the prime examples for probability. If we knew exactly how the coin was flipped, we’d be able to perfectly predict what side it would land on. I’d argue that modern computers are in some situations already sufficiently complex systems that there may be cases where probabilities are well adapted to resolve issues surrounding systematic errors. In other words, we may find ourselves in a situation where given initial conditions always have an erroneous outcome, but these conditions are of such a high complexity that it is far simpler to use error rates in discussing the problem.

    Second, Barbara states a common misconception around probabilities, namely that data is needed to obtain them. There is no such need, as probabilities are not calculated but assigned. This was most recently raised in Biedermann and Kotsoglou (2020): Digital evidence exceptionalism? A review and discussion of conceptual hurdles in digital evidence transformation. As is very well explained in this publication, whilst data is the best basis to draw a probability from, it is not necessary. In the absence of the latter, it is possible to base a probabilistic assessment on personal experiments or even experience and training. The raised issue of small samples is not too much of a problem here, as probability theory allows one to take that into account as well.
