Bit Errors As A Source Of Forensic Information In Nand Flash MemoryBit Errors

Jan Peter van Zandwijk from the Netherlands Forensic Institute presents research into bit-errors.

Transcript

Zandwijk: I’m Jan Peter van Zandwijk, I work at Netherlands Forensic Institute, and the past day, we’ve already heard a lot about NAND-flash memories and also… so it shouldn’t be very surprising that everybody should know by now that it’s the dominant storage medium for [00:29] memory. I also did a very good job in explaining a lot about these errors.

So the purpose of this research that was carried out is to investigate to what extent you could use these errors to your own advantage. So obviously, they are a nuisance, because they might prevent you getting access to the data, but on the other side, it might also convey some noisy information about the data itself. So it can actually be beneficial, [if they have thought of this].

In order to do that, I will go quickly over some issues of reliability and error sources in NAND-flash. It will be very quick, because most of them were already explained by [01:12], and then, something about the data process and some… [01:15] actually in a real life situation, access to the error information which is present on this NAND-flash, and then I will describe a number of experiments that we have carried out in order to verify to what extent it… statistical information about a number of bit errors is really… has some forensic potential.

First, very quickly, something about the NAND layout, so we all know about whatever we are talking about. A NAND-flash consists of a controller, [01:55] controller part, which communicates with the outside world, and the NAND-flash itself, it’s [hierarchically] organized, so it’s divided into blocks, which is the smallest unit which can be erased. And each block is sub-divided into a number of pages, which is the smallest unit that can be either read or written.

The data itself is already also [simplified] in a number of smaller units. I usually call these data chunks, and each of these data chunks has its own extra [part of] storage, which is only available to the controller, which is usually called the spare area, which is used to store additional information about the NAND-flash. [So obviously,] [02:40] could contain some meta information about [multiple] block numbers, but also, this is the information is stored in that area.

So this is a short overview of how NAND-flash organized. But as I already explained, we have different types of NAND-flash memory. We have cells which contain one bit of information, single-letter cells, or you have cells which contain two bits of information, called multi-letter cells. And you also have cells that contain three bits of information, triple-letter cells. But these [results are unreliable], so the content of the memory cell may change over time unintentionally.

Now, a number of processes – the previous presentation already mentioned retention times, so if you keep data for some time in a memory cell, then the charge [may leak away], which leads a modification of the value stored inside the cell. And besides that, reading or writing information might also modify the content of memory cells. In this case, it might modify the content of adjacent cells, not the cell themselves. If you read from a certain [page, then] information in adjacent pages can be modified.

And it was also mentioned in the previous presentation – the rate at which errors develop also depends on the quality of the [04:01]. If it’s repeatedly erased and programmed, then the cell deteriorates, which might lose that [touch] more quickly.

This is one of the pictures from the literature. This is a graph of … so on the X-axis you have the number of P/E cycles, the number of times it has been erased, and on the Y-axis you see the bit error rate for different sources of information, different types of errors. The important things that you have to notice is that the retention errors are the largest in magnitude. They could contribute most to the bit errors, and also the rate at which errors develop also increases with the number of P/E cycles. So that’s important to notice.

The NAND-flash memory itself is an unreliable storage medium. In order to make good use of it, [05:08] have taken some measures in order to increase reliability. We also have heard something about that … in their workshop yesterday. [The most famous tools] they use – randomization, which means that that data is [05:22] before it’s written to NAND-flash, and in that way, you can minimize the number of [inter-cell interference], so the number of read and write errors can be less. And besides that, of course, error correcting codes [are stored] [05:37] [are computed], which enable you to determine the number of errors which are present in the memory and also correct them. And besides that, because it’s sensitive to the number of P/E cycles, so the number of times it has been erased and reprogrammed, it’s important to use all the memory pages in the whole memory, so that [they are all … they go out at the same rate]. So if you would always use the first [page], so it will go out very quickly, but if you spread it across the NAND-flash, then the lifetime of the memory will increase.

These are all things that are implemented in the controller. So when the controller writes data, then it does these three operations – it does wear leveling, error correction, and randomization. And that means that actual data which is stored in the NAND-flash is different from what is sent to a [06:32]. So if you look at this, so if you have the [06:35], you have the controller, and at the right hand side, you have the actual storage medium. So if you want to put this picture on the NAND-flash, what happens is that the controller randomizes it so it becomes [read, noisy], and it adds the yellow bar, which is the extra ECC information. So those are the two operations which are performed by the controller when storing data on NAND-flash.

And if it goes the other way, [the inverse process], so the randomization is removed and the ECC information is [07:15]. So essentially, that is what happens when you store data on NAND-flash. But if you only look at the NAND-flash itself, you don’t have any access to the controller, so the question is from … you also heard something about it yesterday. What you see if you read out – you take a NAND-flash, you see data, [including] randomization, and ECC parity-bits. So it still contains bit errors.

So the question is – if you only have this information, can you go back?

Well, there are several possibilities, so yesterday you also heard some approach to remove randomization and to find ECC [parameters]. We do that ourselves also in a different way. The first time I experienced this, that most of this randomization is often produced by [08:13] of the LFSR, linear feedback shift register. And there is a nice mathematical algorithm which is able to detect an LFSR, so able to compute the feedback taps from LFSR from specific data which doesn’t contain any errors.

This algorithm comes from coding theory, and it’s called the Berlekamp–Massey algorithm, and we use that, we just apply that to our data and see whether it is able to detect an LFSR. [08:44] additional steps, and then we can find an LFSR. It’s a structure of a number of cells, and each time the cells are [08:51], then data is shifted in it. So we have two parameters here – the feedback taps and, besides that, you also have the initial [08:58].

So if you do some further research, once you have found an LFSR, you do some further research, then you can find out how the [register is seeded]. And for the reconstructing of the ECC, the main problem is that if you want to recover code parameters, then you have to get access to a number of code words, of which you are certain they don’t contain any errors. And if you only have a [09:25] to work on, then you can never be sure which of the codes works, which of the data you get from the NAND-fresh is without errors.

In this case, you make the assumption that we use the cyclic code, which is not a very strict assumption, because most of the codes which are used in NAND-flash are cyclic codes, the BCH codes are cyclic codes. And then, we search stepwise for some [technical] parameters of that code, on the basis of our data. And we have a method which can figure out itself which of the data it doesn’t contain any errors and [10:02]. So it’s tolerance for the use of incorrect data.

And once we’ve found that, we determine the [parameters of the code], so how long the , how many of them are [… the number of] errors codes are able to correct. And after we find that, we also look whether that [code parameters we found] belongs to a BCH [10:26] very quick, dedicated methods [would be probably] BCH codes. So that’s in short, the possibility.

So this question we can ask, we can [say it’s affirmative]. In certain cases, when we only have the data from NAND-flash, then we can [recover all the] randomization and ECC parameters, and then we can recover the picture again. With thanks to these three people.

So I will point out – the first two are [11:01] and the last one is [Eukleides], he is a mathematician from around 100 BC, and we use his algorithm in order to compute the code words. So he had invented an algorithm to compute the greatest common [divisor of two words], and that’s essentially the method that we use in order to compute some parameters of the cloud.

So if you are interested in that, I have written a paper some years ago about it. So if you really want to know all the details, you can ask me afterwards.

Back to our research, questions on if you have access to these bit-errors, can they be used for forensic purposes? The first thing you notice is that these error sources are related to potentially forensically interesting processes, such as retention times or just the time that some information has been present on a certain location within the memory, and besides that, also, the number of times it has been read or [12:08] also. So this is potentially forensically interesting, but the question is in a forensic setting, are the sources large enough to [change over the periods] that we are interested in. Typically, in a forensic investigation, we are interested in changes which occur over periods of weeks or months. So the question is whether any changes in bit-errors occur on both those timescales, and also, in a more realistic use scenario, to what extent are these activities … can bit-errors be related to the activities you have actually performed on the NAND-flash?

In the first case, we performed an experiment on raw NAND-flash memories which were written directly, so we didn’t use any controller to write. We wrote directly to the NAND-flash memory itself, and we didn’t use any ECC information. And we did… we prepared memory checks, so we created different regions in the [chip], which underwent different number of P/E cycles. And afterwards, after preparation of this memory, it was programmed with fresh random data and then it was stored at either room temperature or a somewhat higher temperature, and then we watched the rate at which the errors [13:30].

In this case, it’s very easy to compute the number of bit-errors because you know what information is written to the memory itself, and if [you get compared] with the information that you get back when you read the memory, you get directly the number of bit-errors.

This is one of the pictures from the paper. So we know, I’ve indicated the different regions which have been created in the NAND-flash memory, and number of P/E cycles to which each of these regions have been subjected. [According to manufacturer] specifications, the number of P/E cycles for this NAND-flash memory was 5,000. So these numbers are relatively low with respect to the manufacturer-specified number.

So this is the results for NAND-flash which was stored at room temperature and this is the result of a NAND-flash which was stored at 72 degrees centigrade. And the different [14:26] different times. So what you see is … the important thing is that you see it over time. So if you go up, the number of bit-errors increased, both for the higher temperature and also for the lower temperature. But the error rate at the lower temperature is much lower, as you see from this axis, on the Y-axis.

The second thing that you have to notice is that it also depends very strongly on the number of P/E cycles. So in the regions outside, which were not [stress cycled], you see that the rate at which errors developed is much smaller than at this region with only 2500 P/E cycles.

And the timescales is indicated above, so it’s in the order of weeks to months. So actually, what you see is the detectable differences in bit-errors over periods of weeks to months, both at room temperature and also at higher temperature. But the rate [slowly] depends on the temperature, as you saw in the previous slide, and there’s also a rather big effect of the number of P/E cycles on the regions. For forensic, it’s also important, because sometimes if you experience [15:38], so you used a limited amount of time, and if your method works for that, if you apply it, for instance, to a case, then you might have the files which have been used for unspecified number of times, so the effect of, for instance, heat or other things might be much larger for that other device. Because it underwent a sizable number of P/E cycles.

Now, for the second experiment we used a [16:05] [set-up], which is shown here. It’s a heating stove in which [a glass container is and there] we keep some USB sticks. In this case, we performed experiments by using controller, so we copied files on to the USB device, and did a certain protocol, which I will explain afterwards. And after that, we removed the NAND-flash from the USB device and we ran it at the end of the experiment, and then we processed the dump, which means that we de-randomized it and we decoded it and then we [16:41]. So basically, it’s from the number of bit-errors. And after that, we checked for a relation between these statistics of bit-errors and the actions that were performed on the device itself.

Okay, so the protocol consists of copying different files at different times to the device. In between, we stored at either room temperature and 70 degrees. And we did two things – first we did a data retention. [I only copied at different times] files, and then we also tried some stress cycling, which means we first copied a number of large files on to the device, and after that, we repeatedly copied the same file a number of times to the remaining area. So we assumed that was stress cycle … and afterwards we copied a number of smaller files on to the device.

And after descrambling, then we were able to allocate memory pages to a specific file which has been copied on to the file because the computer [17:54] hashes of your data [and compared it to] the hashes of the descrambled memory pages.

This is a typical example – this is not from [18:05] but from a slightly different one. So here you have page number, and on this side you have a number of bit-errors. And [in color code] you see different files which are copied on to the pen drive. So all [18:17] first the blue one was copied, then simultaneously the green and the red one, and after that this device was stress cycled, and after stress cycling, two other files were copied, the light blue one, and the last one copied on to the file was the purple one.

So that’s the order, and the important things to notice is that … you see the wear leveling algorithm, because this large file was spread over two different locations within the NAND-flash memory. You see some effect of retention times – so the blue one is the first one to be copied, and it has slightly more bit-errors than these were copied afterwards, and the last one, this one here, is [evidently] less bit-errors.

So there is some effect of retention time. So they’re not totally random. And after the black dots, this is [noisy segments], so [19:14] the size of a number of pages, for which we have no documentation on this chip, so we [19:22] size of 256 pages, which we assume would be the size of a block. And due to this averaging, we see more [detail], so noisy segment becomes less.

And also, you see some effect of this stress cycling protocol that we did. Because if you only copy files, then they are [19:41] on to a fresh [dump], [19:44] more or less sequentially, but those files copied after stress cycling are more interlaced. So there is some effect of this stress cycling protocol.

Now, to the actual data. We did four USB sticks, two were [stored] for the data retention protocol, both at room temperature, at 70 degrees temperature, so this same picture. But the important thing that you must notice is our large differences between [20:18] same batch. So the [base error rate] therefore for this one is larger than for the other one, and they are both stored at the same temperature.

Also, if you look at data from room temperature, you can also see, if you go back … so that sometimes, even the [area which they choose for] data storage can change. So the upper one uses a smaller area than the other one, but in this case you see more or less the same. So by stress cycling it’s more … the files copied after stress cycling are more interlaced, while the files copied without stress cycling are stored sequentially.

So the conclusion is that there are large differences between USB sticks from the same batch, for both are of data storage and also the baseline error level also varies. And there appears to be a slight relation between the time at which files are copied on to the drive and the stress cycling does something, because of the areas which are used.

Now, for the safety, you can also this information to form some [histogram], so at this time, at the X-axis, you have the number of bit-errors, and at the Y-axis, the number of times it was found. And the [21:43] is the same [for files]. And as you see, for instance, for this time … so on the basis of the number of errors, then you can group the files, the memory pages [21:53] to one file together.

That’s still one minute.

So the thing I want to tell you is if you do low-level acquisition and off-line analysis, then you can get access to an independent side channel of a normally functioning device, so a device that’s not broken, so it works normally. A possible use of this is that it could be used as an independent side channel, and maybe you can use it to perform some relative dating, so say something about some memory pages have been longer on the NAND-flash than the other one. And also, maybe, as I showed in the previous slide, it can be used to group memory pages.

But we are the Forensic Institute, so we should be very careful. At the moment, it’s just academic research, not fit for uses cases. So there are many factors affecting this bit-errors statistics, which means that I only looked, at the moment, at retention errors, but there are money other ones which might occur simultaneously or concurrently. And also, we have this effect of either retention time and P/E cycles. So if you have a file which has been on a bad piece of the memory for a short time, it might contain the same amount of errors as a file of a piece of data which has been a bad piece of the NAND-flash for a longer period of time.

I think that this was my last slide. Okay.

This is where we stand at the moment. We did some initial research and, in principle, we would be interested in any cooperation or any additional ideas that you might have in this respect. Thank you for your attention.

[applause]

[u]End of Transcript[/u]

[i]This video was recorded at DFRWS EU, in collaboration with Forensic Focus. Find out more about DFRWS and register for next year's conference on the [url]official website[/url].[/i]