Ahmed Bahjat shares his research at DFRWS EU 2018.
Ahmed: Thank you. My name is Ahmed Bahjat, I am here from George Mason University – shoutout to Jessica Hyde. I’ll be talking about file fragments dating; this is probably your first time hearing about dating on this forensics theme.
It’s actually adopted from the archaeology field. They have two types of dating: relative dating and absolute dating. Relative dating is where you place an artifact’s age relative to other artifacts, and absolute dating is where they put a timestamp on it. So I’ll be talking about this, and our efforts at George Mason to realise this.
So the goal of our research, basically, is to put a timestamp on found artifacts on a hard drive, loose artifacts. More often than not, you find deleted files that actually do not have any reference, in which case you’ve lost all the metadata, all the dates you have. And some of these dates are actually very rich. Some file systems, like the NTFS, store eight dates for each file: four in the MFT record and four in the file name record. And this will be lost if the MFT record itself is overwritten, which is very common.
So there’s a new field called digital stratigraphy; I think it was proposed last year by Eoghan Casey. In sync with our work outside this, more than two years ago… digital stratigraphy is basically studying the hard drive as layers, stratums, where the upper layer is always newer than the lower layer.
The challenge here is, this is not very obvious in many file systems. Some file systems are actually very random, you can’t get any layers out of them. You’ll see this when you compare NTFS file systems with FAT drives.
Also, another thing you might actually think about – I was a little bit intimidated when I was placed in the IoT segment – but I think my broader audience actually are in the IoT segment, because you guys are the ones who are finding these loose file fragments, the GPS systems, and I’ve seen actually this done quite a lot in research, where you have a TomTom GPS system and you’re trying to find out when the guy actually visited that place; you know, trying to put a time stamp on a deleted log file in an embedded drive. So in these cases, this is your best fit. Naturally, you’re lucky enough because most IoT systems use FAT drives, probably, rather than NTFS.
So our approach is to use pretty much a reverse moving average of the neighbouring files of a given sector, and we can pinpoint a probable date for our file.
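As a rough illustration of the idea (not the exact method from the paper), the sketch below estimates a date for a fragment by averaging the timestamps of the allocated files whose starting clusters are closest to it. The function name `estimate_date`, the window size `k`, and the sample data are all assumptions made for this example.

```python
from datetime import datetime, timezone


def estimate_date(files, fragment_cluster, k=3):
    """Average the timestamps of the k files nearest to fragment_cluster.

    files: list of (start_cluster, datetime) pairs for neighbouring files.
    This is a plain windowed average, used as a stand-in for the
    'reverse moving average' mentioned in the talk.
    """
    if not files:
        raise ValueError("need at least one neighbouring file")
    # Pick the k files whose starting cluster is closest to the fragment.
    nearest = sorted(files, key=lambda f: abs(f[0] - fragment_cluster))[:k]
    mean_ts = sum(f[1].timestamp() for f in nearest) / len(nearest)
    return datetime.fromtimestamp(mean_ts, tz=timezone.utc)


# Hypothetical neighbours: (starting cluster, creation date).
neighbours = [
    (100, datetime(2018, 1, 1, tzinfo=timezone.utc)),
    (110, datetime(2018, 1, 3, tzinfo=timezone.utc)),
    (120, datetime(2018, 1, 5, tzinfo=timezone.utc)),
    (900, datetime(2018, 3, 1, tzinfo=timezone.utc)),
]
print(estimate_date(neighbours, fragment_cluster=112))
```

The far-away file at cluster 900 is ignored here because only the three nearest neighbours are averaged; how wide the window should be is a tuning choice this sketch does not settle.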
A little bit about the file systems: So in NTFS, we have eight dates grouped into four dates in the MFT record: the typical MACB dates, and we have MACB dates also in the file name. In the file name record, I mean.
And there’s two types of overwriting the file in the hard drive: there is content overwriting and there is metadata overwriting. So you can either replace the pointer, the index in the MFT table, in which case you lose the metadata only but the file is still intact in the hard drive, and in many cases you can recover the entire file, and if it’s something you can deal with then you can get the actual dates from it, like if it’s a Microsoft document file or PowerPoint slides, or a PDF file. In many cases you will find internal dates in these files, in which case you don’t need our scheme to date these files.
In many other cases, the content itself is actually overwritten, and you have [indecipherable] a pointer on the table. The last strange case, which happens also a lot, quite often, which is you have both cases happening in your evidence. So you have partial deletion on the content – specifically in the header, the initial contents of the file – as well as the pointer to the file. In which case, you’re out of luck unless you try to date these files and give the probable date.
And file fragments can be found in either file slack, file slacks are getting bigger now – in the new file systems, the file slack is about 64 KB, versus 4 KB. And it can be resident files within the MFT, or it can be non-resident; it can be allocated or unallocated. So allocated means there’s an overwriting file on the file, generating the slack where your fragment is found. Or you can find your fragment in the [indecipherable] volume slack.
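To make the slack sizes concrete, here is a minimal sketch of how much slack a file leaves in its last cluster. It is a simplified model that treats slack as the unused tail of the final cluster; the function name `file_slack` and the example sizes are assumptions for illustration.

```python
def file_slack(file_size, cluster_size):
    """Bytes left unused in the last cluster allocated to a file."""
    if file_size < 0 or cluster_size <= 0:
        raise ValueError("file_size must be >= 0 and cluster_size > 0")
    remainder = file_size % cluster_size
    # A file that ends exactly on a cluster boundary leaves no slack.
    return 0 if remainder == 0 else cluster_size - remainder


# The same 10,000-byte file on 4 KB and 64 KB clusters.
print(file_slack(10_000, 4 * 1024))
print(file_slack(10_000, 64 * 1024))
```

With larger clusters the same file leaves far more room for remnants of whatever occupied that cluster before.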
Alright. And our work does not address finding these fragments. Jim Jones, my supervisor, is actually working on finding evidence of uninstalled software on a hard drive, in which case he uses MD5 hashing and partial hashing of files to find evidence of fragments. And then the following step is to put a date on these files.
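One common way to do this kind of partial hashing is to hash a known file block by block and look for the same block hashes in a disk image. The sketch below assumes 512-byte blocks and in-memory byte strings; the function names `block_hashes` and `find_fragments` are hypothetical, and this is not a description of Jim Jones's actual tooling.

```python
import hashlib


def block_hashes(data, block_size=512):
    """Map the MD5 of each full block in data to its block index.

    If two blocks are identical, the later index wins.
    """
    return {
        hashlib.md5(data[i:i + block_size]).hexdigest(): i // block_size
        for i in range(0, len(data) - block_size + 1, block_size)
    }


def find_fragments(image, known_file, block_size=512):
    """Return (image_offset, known_block_index) for each matching block."""
    known = block_hashes(known_file, block_size)
    hits = []
    for offset in range(0, len(image) - block_size + 1, block_size):
        digest = hashlib.md5(image[offset:offset + block_size]).hexdigest()
        if digest in known:
            hits.append((offset, known[digest]))
    return hits


# Toy data: the second block of the known file survives inside the image.
known_file = bytes(range(256)) * 2 + b"\x01" * 512
image = b"\x00" * 512 + b"\x01" * 512 + b"\xff" * 512
print(find_fragments(image, known_file))
```

Each hit gives the position of a surviving fragment, which is the input the dating step needs.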
Of course, related work can be grouped into time in forensics, reconstruction of events is also related, file classification and recovery is also related; I want to skip those two.
Alright, so this is a snapshot of a FAT drive. The first experiment I did was trying to see the patterns of writing for a file. As you can see, in the beginning, the files are sequential in the cluster number, until the hard drive is pretty much full, and we’re reaching the last sector on the hard drive where it starts looking for empty clusters from the beginning of the hard drive.
So this is called a ‘next available’, versus first available, scheme in the hard drive. So this is a next available, where it’s actually sequential until you hit a growth block, and then at the end of the file system, you start looking for empty [indecipherable] at the beginning.
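The difference between the two schemes can be sketched with a toy allocator. This is a simplified model with one cluster per request; the function names and the eight-cluster volume are assumptions for illustration.

```python
def first_available(used, total):
    """Return the lowest-numbered free cluster, or None if the volume is full."""
    for cluster in range(total):
        if cluster not in used:
            return cluster
    return None


def next_available(used, total, last):
    """Return the first free cluster after `last`, wrapping at the end."""
    for step in range(1, total + 1):
        cluster = (last + step) % total
        if cluster not in used:
            return cluster
    return None


# Clusters 0-5 were written in order, then cluster 2 was freed.
used = {0, 1, 3, 4, 5}
print(first_available(used, total=8))         # reuses the hole at cluster 2
print(next_available(used, total=8, last=5))  # keeps moving forward to 6

# Once the end of the volume is reached, the search wraps around.
used = {0, 1, 3, 4, 5, 6, 7}
print(next_available(used, total=8, last=7))  # wraps back to cluster 2
```

Under next available, cluster order tracks write order until the first wrap, which is what makes the layers readable on such a drive.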
And it’s not trivial work, because you see a lot of anomalies. For example, this screenshot at the top was bugging me for quite some time before I figured it out. This was an updated software. Because if you try to read these dates, you see the file at the beginning, it’s in the same starting cluster number; the size has changed; but the dates are still the same for the accessed and created; but the modified is changed. So how can the file be changed, while the access date is still intact?
And this is obvious for all of us, because you can disable updates to the access dates for performance reasons, and also there is a lag on the access date caching, so it’s cached for, I think, about an hour or so before it’s getting overwritten. So if you collect the hard drive before all the accessed files have been committed to the MFT, then you find anomalies like this. But this, I know for sure that this has been collected a while after the file had been updated, so it was still bugging me until I realised that this is an updated software, so they actually replaced the entire file from the animate, with an updated modified date, so it has not been accessed within the file system, the created date is the same.
Another anomaly, in the lower side of the slides, is you can see the file’s actually moved in the hard drive, the cluster number’s changed, but the dates are intact. It could be all the dates, if you actually disable that last access date, but if you do enable the last access date and you perform, for example, defragmentation, you will see after a while the access date has changed.
Another example, as well, where you can see the date created has also changed, and the cluster number has also changed, because of a defragmentation.
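The anomalies described above can be told apart by comparing the same file's record in two snapshots. The sketch below is one possible way to label them; the `Record` fields, the `classify` function, and the label strings are assumptions for this example, not the tooling used in the research.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    start_cluster: int
    size: int
    created: date
    modified: date
    accessed: date


def classify(old, new):
    """Label the change between two snapshot records of the same file."""
    moved = old.start_cluster != new.start_cluster
    created_changed = old.created != new.created
    modified_changed = old.modified != new.modified
    accessed_changed = old.accessed != new.accessed
    if not moved:
        # Same cluster, new size and modified date, other dates untouched.
        if (old.size != new.size and modified_changed
                and not created_changed and not accessed_changed):
            return "replaced in place"
        return "other"
    if created_changed:
        return "moved, created date changed"
    if accessed_changed:
        return "moved, access date updated"
    if not modified_changed:
        return "moved, dates intact"
    return "other"


day = date(2017, 1, 1)
before = Record(500, 1000, day, day, day)
after_update = Record(500, 1200, day, date(2017, 6, 1), day)
after_defrag = Record(900, 1000, day, day, day)
print(classify(before, after_update))
print(classify(before, after_defrag))
```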
Another interesting thing is, when you actually delete a file, none of the dates get updated unless you actually delete it to the recycle bin. In this case you will find evidence that the MFT record has changed, and the MFT record date will be changing. But then, if you empty the recycle bin, none of the dates will actually change, so you cannot pinpoint a date when the file is moved from the recycle bin to unallocated space.
So if you look at a typical hard drive, if we try to list out the sequence of files in a timeline, we can call the files that occupied the space before our fragment pre-files, and post-files for the ones after it. And then you’ll find evidence, probably [indecipherable] or MD5 hashing, looking for a specific fragment, or you know you’re looking for a binary string, for example, and you find that hit within another file, occupying that slack. You know the upper boundary of your file is your file type, file system. So the slack owner is the new file occupying this place, where the file fragment [indecipherable]. So you can consider that as an upper bound.
Of course, the natural upper bound for all the files is the collection date or the acquisition date, so this is the first culling criterion you should apply, where you exclude any dates that happen after the acquisition.
And then you have a natural lower boundary, which is the file formatting date, so the date you actually created the volume.
We narrow this down again with the slack owner date, so that the slack owner is created after your fragment has been deleted, so that’s a natural assumption also. But the trick is: do I use the created date or the modified date of the slack owner, versus the deletion of the previous artifacts?
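The bounds described so far can be combined into a simple window. This is a minimal sketch under stated assumptions: the function name `deletion_window` and the sample dates are hypothetical, and which slack owner timestamp to pass in (created or modified) is left to the caller, since that is the open question raised above.

```python
from datetime import datetime


def deletion_window(format_date, acquisition_date, slack_owner_date=None):
    """Return (lower, upper) bounds for when a fragment's file was deleted.

    lower: the volume format date.
    upper: the acquisition date, tightened by the slack owner date when
           one is available and not later than the acquisition.
    """
    lower = format_date
    upper = acquisition_date
    # Cull dates after acquisition: they cannot be trusted as a bound.
    if slack_owner_date is not None and slack_owner_date <= acquisition_date:
        upper = min(upper, slack_owner_date)
    if lower > upper:
        raise ValueError("inconsistent dates: lower bound is after upper bound")
    return lower, upper


window = deletion_window(
    format_date=datetime(2017, 1, 1),
    acquisition_date=datetime(2018, 3, 1),
    slack_owner_date=datetime(2017, 6, 15),
)
print(window)
```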
So we collected the data, where our ground truth was a snapshot of the 57 hard drives, for [indecipherable] drives specifically. And we look into the files in each snapshot, trying to get a set of deleted files first. And we have the ground truth dates by looking into the snapshots, the day-to-day snapshots. So if a file is missing in the next snapshot, we assume the file was deleted on that day.
We actually tried to optimise our deletion ground truth, by looking into records from the USN journal. So if we do find evidence of the actual deletion event from the USN journal, we update our ground truth to the date reflected in the USN journal. But as you can see also, the USN journal is not very generous in terms of space, it keeps getting overwritten. So if you’re doing a lot of deletion, you’ll not find your date on the USN journal as well.
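The ground-truth procedure just described can be sketched as follows. The data layout (a date-sorted list of snapshots, each with a set of file paths, plus an optional mapping of deletion dates recovered from the USN journal) and the function name `deletion_ground_truth` are assumptions for this example.

```python
from datetime import date


def deletion_ground_truth(snapshots, usn_deletions=None):
    """Derive a deletion date for every file that disappears between snapshots.

    snapshots: list of (date, set_of_paths), sorted by date.
    usn_deletions: optional dict of path -> deletion date from the USN
                   journal; when present it overrides the snapshot date.
    """
    usn_deletions = usn_deletions or {}
    truth = {}
    for (_, before), (day, after) in zip(snapshots, snapshots[1:]):
        for path in before - after:
            # Default: assume deletion on the day the file first goes missing.
            truth[path] = usn_deletions.get(path, day)
    return truth


snapshots = [
    (date(2018, 1, 1), {"a.log", "b.log"}),
    (date(2018, 1, 2), {"a.log"}),
    (date(2018, 1, 3), set()),
]
usn = {"a.log": date(2018, 1, 2)}
print(deletion_ground_truth(snapshots, usn))
```

Here `b.log` takes its date from the snapshot delta, while `a.log` is corrected by the journal record.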
Alright. So we did our analysis for this dataset, we tried to find the upper boundary and the lower boundary for each file, and then we calculated the residual error for our probable date of deletion, versus the actual deletion date either from the USN journal or from the delta between the snapshots. And we were able to actually get a pretty good shape, at least for the upper bound, where at least 80% of the data were… we have an 80% accuracy for our dataset on the upper bound prediction. And this is because we’re dealing with a slack owner, so the slack owner file is mostly driving the upper bound deletion of the evidence file.
The lower boundary is a little bit shaky, because we’re taking the moving average. You’ll get a better shape if you’re using the FAT system, it would be a lot more accurate, and actually I have done that for the FAT and I’ve reached more than 98% accuracy in pinpointing the actual deletion date of the file with FAT. But with MFT, because of this best-fit allocation scheme, you get a semi-random behaviour on the writing pattern.
So this is actually – this graph, and the prior graph – the way you can view this is the accuracy, given your time limits factor. So if you say, OK, how far are we with… if I need my deletion accuracy to be within, for example, five days; I don’t care if it’s actually more or less; then we actually reach 80% of accuracy. If your deadline is a little more loose, like 10 days or so, we can achieve more predictions. So 90% of our data fit within the 10-day period.
But in the lower bound, you actually have to give me more time. You have to give me plus or minus ten days to get at least 80% accuracy.
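The way to read these graphs, accuracy as a function of the tolerance you allow, can be expressed in a few lines. The residuals below are made-up numbers for illustration only, not data from the study, and the function name `accuracy_within` is an assumption.

```python
def accuracy_within(residual_days, tolerance_days):
    """Fraction of predictions whose error is within +/- tolerance_days."""
    if not residual_days:
        raise ValueError("need at least one residual")
    hits = sum(1 for r in residual_days if abs(r) <= tolerance_days)
    return hits / len(residual_days)


# Hypothetical residuals: predicted deletion date minus actual, in days.
residuals = [-12, -4, -1, 0, 2, 3, 5, 7, 9, 30]
for tolerance in (5, 10):
    print(tolerance, accuracy_within(residuals, tolerance))
```

Loosening the tolerance can only keep the accuracy the same or raise it, which is why the curves climb as the allowed window grows.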
So in summary, this is [indecipherable] about dating file fragments, both in NTFS file system, which is this published paper, and in FAT hard drive, which I’m assuming will be a lot easier, which we have done research on. But with this, you can at least have your evidence brought to court with a probable cause. Because this is really critical, if you have evidence without a date, it could be just useless.
Thank you for your time, and if you have any questions.
[applause]
Host: Thanks, Ahmed. I really enjoyed that talk, because it’s interesting to see that an area of forensics that we think is very well-established, that we’re still finding and learning new things. I think that’s great.
I think we have time to take maybe one question. Anyone? No questions. OK. Thanks a lot, Ahmed.
[applause]