Shining A Light On Spotlight: Leveraging Apple’s Desktop Search Utility To Recover Deleted File Metadata On MacOS

Mark Scanlon discusses his research at DFRWS EU 2019.

OK, good morning, how are you doing? My name is Mark Scanlon, I’ve been speaking to some of you guys already. I’m from University College Dublin. This work was primarily conducted by Taj Atwal, who had intended to come and present it to you, but at the last minute had to cancel his plans.

Taj was a student of ours in UCD on the Master’s program who graduated last year, and this paper was produced as a result of the dissertation research for his Master’s. It’s co-authored by myself and my colleague from UCD, An Le-Khac, who supervised the project.

So the main points that I’m going to cover here are: to go through exactly what Spotlight is, if you’re not familiar with it (those of you who are using Macs, and I see a lot of Macs in the room, will be very familiar with it, I imagine); talk through the experiment that we conducted and what we found; go through some of the results we had; and then I’m going to tell you what I told you again.

First of all, Spotlight looks like this. It’s a Mac desktop search utility. It was released by Apple in 2004, and it was purely desktop search. So when it was released initially, it indexed the files on your local disk and allowed you to search those files relatively easily. It’s evolved over the years with the releases of Mac, so now it does things like online searches as well. So you can see here it’s done an online search for a dictionary lookup, for related websites, or apps in the App Store, stuff like that.

But the thing that we’re interested in is that it indexes the files on your local disk, and it allows you to search through those files. So on the left-hand side you get the results of whatever you search for; on the right-hand side here you can get a thumbnail, perhaps, of what that image looks like, what that document might look like.

The thing about when this was released — it was released in 2004 for the first time — internet search was already popular, so that was started by Altavista back in 1995 or so. If you think about web search, while it’s not trivial, it’s easier than looking at proprietary formats on a desktop machine. So web search is a text-based search as standard, the HTML standard is there. We know what we’re looking for when we look at a formatted document. So web search and indexing is relatively straightforward. Trying to build an index and actually look inside proprietary formats on a desktop is more complex.

So it was created in 2004, it allows you to search for information using keywords; and of course, Spotlight creates databases. It stores everything from the file system attributes; metadata; and it actually indexes any of the textual content that it discovers in the file.

For those of you who are not Mac users, if you’ve ever shared data with somebody via USB key and you get all those extra .spotlight files that annoy everybody, they’re the files that are used for indexing each device.

I’m speaking on Taj’s behalf, as I mentioned. He had this interest in this topic since 2011. The reason he focused on this topic in his Master’s was that the number of Macs he was using in his lab was increasing; the popularity of Macs is continuing to grow. So in 2011 he started out looking at keyword hits on Spotlight directories. There were no tools available to parse the data; the format of the databases was proprietary and not documented; there was no published research from Apple or anyone else.

In 2013, a company called 504ensics — you see what they did there — announced a tool called Spotlight Inspector, and they had built a tool and they had reverse engineered the format. But the tool disappeared almost immediately as it was released, and the structure was not revealed; it was not made open-source, and it’s not available any longer. Basically the documentation that we could find about that tool, it was not obvious if it was able to handle deleted records in any way.

So in 2016, when he started his Master’s, the examining of Spotlight had not really progressed. The current best practice for looking at this stuff is to use a lab machine, a VM or something: you mount the device that you’re looking for, so every single storage medium in the root of that device will have these Spotlight data, and you’ll use the official Apple tools to interrogate that database.

So you’re kind of manually looking through these records, and none of these options that are there so far offer a solution for recovering deleted records.

So to give you an idea of what this might look like, this is just a small sample bash script. I should say, this script and more is available on the Github which I’ll have a link to on the last slide, if you want it.

What the script is doing is looking through the content for any of these database files, and the line that really matters is this one here, 104. On line 104 you’re calling mdls, which is the metadata lookup service, that’s what that stands for. If you pass it some parameters, it’ll look up whether there are any records for this file already in the store, and it’ll try to get those metadata records back. That only works for active files, currently existing files, not for deleted files.
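As a rough illustration of that lookup, here is a minimal Python sketch rather than the bash script from the slide. It assumes a macOS machine with the mdls command on the path, and it assumes the output is made up of lines of the form name = value; the function names are made up for this example, and values that span several lines are simply skipped.

```python
import subprocess


def parse_mdls_output(text):
    """Parse single-line `name = value` pairs from mdls-style output.

    Values printed across several lines (arrays) are skipped to keep
    the sketch short.
    """
    attributes = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # continuation line of a multi-line value
        name, value = name.strip(), value.strip()
        if not name or value == "(":
            continue  # start of a multi-line array
        attributes[name] = value.strip('"')
    return attributes


def lookup_metadata(path):
    """Run mdls on an existing file (macOS only) and parse what comes back."""
    result = subprocess.run(
        ["mdls", path], capture_output=True, text=True, check=True
    )
    return parse_mdls_output(result.stdout)
```

On a Mac, lookup_metadata called on a file that currently exists would return a dictionary of attribute names and values. Once the file is deleted there is nothing for the lookup to work on, which is the limitation described above.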

Metadata is obviously a hot topic within forensics: the more metadata we can get our hands on, the better. To make the most of the metadata, we should be able to ideally understand any file data structure that that metadata is stored in. You need to test to make sure that your metadata is accurate. And obviously, it needs to be shown to be reliable.

So parsing the data, then. If you have ever tried to parse a file that you are not familiar with — I’m sure most people have, at some point — it’s very difficult. You’re kind of going in blind; you open up hex, you look at it; it’s a very arduous process. You can also make incorrect assumptions about offsets and what various things mean: just because something works this time doesn’t mean it’ll work next time you look at the same type of file. So it’s often misunderstood. And this can lead to an incorrect conclusion being drawn.

In terms of Spotlight itself, this is how it works. The core of the Spotlight system is the orange box there, the metadata server: that’s the software that actually runs the entire service. It hears about every single file system event that takes place: FSEvents is a daemon that runs on Mac machines and constantly monitors the file system for any changes. So all reads, writes, deletes, modifications.

When that happens, it updates the metadata server to say that there’s a new file. It will then pass it over to the metadata importer, which is what Apple call their tool for analysing proprietary content. So metadata importer, pretty simply, for a JPEG photo, would automatically look for EXIF data and add that to the store.

Metadata importer would also process a Word document. So it’ll index the data in the document to make that text searchable. So you can use Spotlight to look at the content within the documents on your machine.

All the stuff from metadata importer is sent back to the server, and all this is then stored inside the Spotlight databases in the root of every storage medium, in a folder called .Spotlight-V100. That’s a dot directory, so it’s hidden on Unix systems by default.

In terms of the events that trigger an update to the database: FSEvents is a daemon that runs, that monitors the file system for any changes. So for any of these events that are triggered: creating a file, deleting a file, changing it, renaming it, modifying it, exchanging a file, so if you’re moving data between two files; creating a directory, you’re changing the attributes… pretty much anything you do triggers an FS event, and that triggers an update to the metadata store.

So the metadata importers that I mentioned: these are a collection of tools that are able to consume proprietary file formats. Obviously when Apple started off at first, they knew all of their own file formats, so they created metadata importers for each of those; but over the years they’ve added more and more of these importers, so now they can handle more files.

So the thumbnail that you get — the text search that you get — has become far more useful as the versions of Mac have increased. So there’s a worker process, then, that is kicked off every time the metadata database starts, and a list of words is passed back to the metadata store.

The store itself, then, this directory (if you’re on Mac, you can look into this yourself, it’s in the root directory): the stores are created on volumes, obviously, where the operating system has read/write permissions. Again, if you’ve ever shared data with anybody with a Mac, and you’re on a Windows machine, it looks like they fill your USB key with all these Spotlight files.

Its presence on a disk indicates that that disk has been indexed by Spotlight at some point. There’s a plist file on Mac which shows the location of all of the Spotlight stores that are being held on that local system.
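A quick way to check a mounted volume for that directory could look like the sketch below. It assumes the folder at the volume root is named .Spotlight-V100 and that the database files are called store.db and .store.db somewhere beneath it; the function name is made up for this example, and on a live Mac reading that folder will normally need elevated permissions.

```python
from pathlib import Path

# Assumed name of the hidden Spotlight folder at the root of a volume.
SPOTLIGHT_DIR = ".Spotlight-V100"
# Assumed names of the two copies of the metadata database.
STORE_NAMES = {"store.db", ".store.db"}


def find_store_files(volume_root):
    """Return the store.db / .store.db files found under a volume's Spotlight folder.

    An empty list means the folder is missing, i.e. nothing suggests the
    volume was ever indexed by Spotlight.
    """
    spotlight = Path(volume_root) / SPOTLIGHT_DIR
    if not spotlight.is_dir():
        return []
    return sorted(
        p for p in spotlight.rglob("*") if p.is_file() and p.name in STORE_NAMES
    )
```

Pointing find_store_files at the mount point of a USB key that has been plugged into a Mac would be the typical use.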

This research is mainly focused on the metadata database — that’s the thing that’s interesting. So that is where all of the metadata updates are stored. That is a proprietary database structure; Mac’s documentation, Apple’s documentation, does not provide any information about how that is formatted. So as part of the research, we reverse engineered how it works.

First of all: what’s already out there? I mentioned already the tool by 504ensics which is no longer available. There was no documentation to get through the store.db file. Current methods of extraction and interrogating Spotlight use Spotlight itself. So the lookup service that I showed you, the example bash script: that’s using the right tools to interrogate the database. And there’s no research out there to prove that the records are recoverable either directly or from unallocated space.

So the approach: this work was first of all to try to figure out the structure of the database, to work out the reference file, and trying to establish then what happens to deleted records. So are the records that are deleted recoverable from the Spotlight database? How long were they recoverable after the file has been deleted? What happens if the user intentionally destroys the Spotlight index? Can deleted versions of the database, or database pages, be recovered within unallocated space?

The experiments that I’m going to talk you through will follow this same principle. We used virtual machines, we created virtual Mac OS installations. Various versions of Mac OSX, Mac OS. And the reason we used virtual machines was that it was fast and quick and convenient, but also it allowed us to easily create snapshots of the running machine, and the versions of Spotlight after these particular events.

We then populated the file system in the Spotlight database with known metadata, so we knew what we were putting, where we were putting it and what we were doing; we had a script to follow for the interactions with the disk. The reason we did that was because it obviously made it a lot easier to reverse engineer the database. So all of the entries associated with the actions that we had performed, we knew that we could try and find them in the database.

We then created scripts that enabled the processing of the structures to try and automate it, and to carve out the structures, so all of the pages, the headers, the records that are in that database file. We needed to process compressed content and encrypted content; and we needed to identify all of the offsets and relative offsets for the records and files. All of these scripts, as I mentioned, are on the Github.

Then what we did was exploit that database structure that we figured out to be able to locate deleted records within the database; parse whatever information we could from them; and then we were able to use that to search through deleted databases in unallocated clusters.

We had nine different experiments that we performed. The first experiment was looking at the persistence of the metadata records within the structure, and unused space on the system. The second looked at the records on mounted volume; the third, persistence of records on mounted volumes that were shared across two different operating systems, specifically Mac and Windows. Number four: persistence of records when Spotlight indices are deleted using appropriate commands on the command line in Mac. Persistence of records when Spotlight indices were deleted via the GUI interface; persistence of records when indices were deleted using again the terminal command, and repopulated with a very large number of files, so you would think that that would completely overwrite the content.

Creation of metadata records for the purposes of reverse engineering the metadata store: you’ll note that that was experiment number seven that we designed, so as I go through the experimentation, I’m skipping number seven because everything else we’ve learned I’m showing you during number seven.

Number eight: we have the persistence of records when the operating systems were updated, so both major and minor releases of MacOS upgrades; and position of the records within unused space: for that we used ten forensic images from casework.

What we discovered: there are three main types of database pages within the store.db files. So you have the header page; the map page; and the data page. They’re each identifiable by the four byte signature: you’ll see the four byte signature there, which is always located at the beginning of the page.
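Telling the three page types apart from that signature can be sketched in a few lines of Python. The signature values below are placeholders invented for this example, not the real ones from the slide; the actual values are in the paper and would be substituted in.

```python
# Placeholder signatures, NOT the real values: substitute the four-byte
# signatures given in the paper for each page type.
EXAMPLE_SIGNATURES = {
    "header": b"HDR0",
    "map": b"MAP0",
    "data": b"DAT0",
}


def classify_page(page, signatures=EXAMPLE_SIGNATURES):
    """Return the page type whose signature matches the first four bytes."""
    magic = bytes(page[:4])
    for page_type, signature in signatures.items():
        if magic == signature:
            return page_type
    return "unknown"
```

Anything that does not start with one of the known signatures comes back as "unknown", which is a useful signal when walking through a file whose layout is only partly understood.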

The header page looks something like this. This is an example. In our experimentation, this page has always been 4096 bytes in length. That’s not to say it always is, but in our experimentation it always was.

You can see here at the beginning of the file, you have the header string which refers to the header page. You have the page size, again, which has always been kb long; and then you have a variable length store.db file because it depends where that file is being stored, so it’s of a variable length.

The second one, then, the map page. The map page is the database map, so it says where everything is on the disk. It describes each data page encountered within the database. The first 22 bytes describe the information contained within the page; and starting on line 32, which is the blue highlighted piece of the hex that you’re looking at, each data page is described by 16 bytes. The first four bytes declare the size of each data page, and we’re not actually sure what the remaining twelve bytes show, we weren’t able to work that out.
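Walking those 16-byte entries might look like the sketch below. Several things here are assumptions made for the example rather than findings: that the entries start at byte offset 32, that the four-byte size is a little-endian unsigned integer, and that an entry with a size of zero marks the end of the list. The twelve bytes that could not be decoded are kept as raw bytes.

```python
import struct


def read_map_entries(map_page, first_entry_offset=32, entry_size=16):
    """Return (page_size, undecoded_bytes) for each entry in a map page.

    Assumes a little-endian 4-byte size field at the start of every
    16-byte entry, and stops at the first entry whose size is zero.
    """
    entries = []
    last_start = len(map_page) - entry_size
    for offset in range(first_entry_offset, last_start + 1, entry_size):
        (page_size,) = struct.unpack_from("<I", map_page, offset)
        if page_size == 0:
            break  # assumed end-of-list marker
        undecoded = bytes(map_page[offset + 4 : offset + entry_size])
        entries.append((page_size, undecoded))
    return entries
```

Keeping the undecoded bytes alongside each size makes it easy to compare them across many databases later, which is one way the remaining twelve bytes might eventually be worked out.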

The store.db data page, then. These data pages contain different types of data described further in the paper, but we’re focusing on the metadata stuff for this talk. The header of the data pages is identical, and the records for parsing the data pages were always found within the first 20 bytes of these pages.

The compression library: you’ll see here that the content in red looks like junk. It’s compressed content. Apple stores the metadata here using the zlib library, and on the next slide you have that decompressed, so you can see that it contains useful information. The metadata that’s contained here, which we’ve highlighted and which is described in detail in the paper, includes the creation dates, modification dates, the file attributes, the owner, etc. The metadata that you might expect to be there. It contains the file names and the index as well, which is useful later.
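Decompressing that kind of content is straightforward with Python’s standard zlib module. The sketch below assumes you have already located the start of the compressed data within the page, which is not covered here; the function name is made up for this example. It returns nothing for data that is not a complete zlib stream, and it tolerates trailing bytes after the end of the stream.

```python
import zlib


def try_decompress(blob):
    """Return the decompressed bytes, or None if blob is not a complete zlib stream."""
    decompressor = zlib.decompressobj()
    try:
        data = decompressor.decompress(blob)
    except zlib.error:
        return None  # not zlib data, or corrupted
    # eof is only True when the end of the compressed stream was reached.
    return data if decompressor.eof else None
```

Returning None rather than raising makes it convenient to try this against many candidate offsets in a page and keep only the ones that decode cleanly.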

The interesting metadata out of this data page is the catalogue node ID, which is in purple here, at offset four; and the parent catalogue node ID, which is in pink there at offset 10.

It’s similar here to the structure of a network database. Each record maintains a relationship with its parent and allows a hierarchical arrangement of records to be built.
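To show how that parent relationship gives you a hierarchy, here is a small sketch that rebuilds a path by following parent IDs upwards. The record layout, a dictionary mapping each ID to its parent ID and name, is a simplification invented for this example and is not the on-disk format.

```python
def build_path(node_id, records):
    """Rebuild a path by walking parent IDs until no parent record is known.

    records maps node ID -> (parent node ID, name). The walk stops when the
    parent is not in records, or when a loop is detected.
    """
    parts = []
    seen = set()
    while node_id in records and node_id not in seen:
        seen.add(node_id)
        parent_id, name = records[node_id]
        parts.append(name)
        node_id = parent_id
    return "/" + "/".join(reversed(parts))
```

If a parent record is missing, the result is only the part of the path that could be recovered, which is worth flagging in any report built on top of it.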

The catalogue node ID is used by HFS+ to uniquely identify each file or folder on the system. An important feature of the catalogue node ID is that they’re not reused until they’re exhausted, and then it cycles around. New files created on the HFS file system are given the next CNID even if an earlier one had been made available by deletion. So it’s a sequential allocation of files and then it loops back around.

Why is that important? We can use that to identify deleted files: files that no longer exist within the catalogue main index file, within HFS+. And it enables us to create directory files. So we know within this hierarchical structure it is the mapping of the file system and the folders and directories.
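The comparison itself can be sketched very simply: any ID that appears in a Spotlight record but not in the live catalogue points at a deleted file. The second function below goes one step further and is an inference for this example rather than a result from the paper: since IDs are handed out sequentially, the nearest live IDs on either side of a deleted one suggest roughly when in the sequence the deleted file was created, provided the counter has not wrapped around.

```python
import bisect


def find_deleted_ids(spotlight_ids, catalogue_ids):
    """IDs that Spotlight has a record for but the live catalogue no longer lists."""
    return sorted(set(spotlight_ids) - set(catalogue_ids))


def bracketing_live_ids(deleted_id, catalogue_ids):
    """Nearest live IDs below and above a deleted ID (None where there is none).

    Only meaningful while IDs are still being allocated sequentially,
    i.e. before the counter has wrapped around.
    """
    live = sorted(set(catalogue_ids))
    i = bisect.bisect_left(live, deleted_id)
    before = live[i - 1] if i > 0 else None
    after = live[i] if i < len(live) else None
    return before, after
```

For example, with Spotlight records for IDs 20 to 23 and a catalogue that only lists 20, 23 and 24, the deleted IDs are 21 and 22, and both sit between live files 20 and 23.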

Our first experiment, we added some files onto the system; we then deleted them. The store.db file, there are two versions of it: store.db and .store.db. The dot file is obviously a hidden file, again, on Mac and Unix: it’s hidden by default. We believe, in our testing, the dot file is actually the main database; and the store without the dot is the version that’s the last known good one. So if anything goes wrong, it can revert back to that previous store and then build the index again. This is what we’ve seen.

The metadata records within the database persist for an amount of time after the corresponding file on the file system is deleted. If you do a large deletion event — for example, deleting a folder, a working file, or whatever — entire database pages are dropped. And then, if that’s the case, they’re not recoverable on the file system.

So although Spotlight reported that indexing was complete, Spotlight is continually monitoring the file system using FSEvents, and if you plug in an external hard drive that has hundreds of thousands of files, it takes time to index that. The same thing happens for each new device that you plug in.

So although Spotlight reported that the indexing was complete, not every file was actually indexed, because it takes several minutes for the files that are discovered to be passed to the metadata importers, to be passed back to the servers, to be stored back out. So indexing is completed when all the files are there, but it hasn’t yet processed getting all the metadata out of those files.

After a short period of time, the databases do catch up with all of the outstanding notifications from the metadata parser.

Each record, as I mentioned, is of variable size, and it sits back to back with the next record. So the data stored within these records are compressed, and when the records are deleted, the remaining records show a fluidity and collapse into the newly made space in the page.

The consequence of this is that it overwrites any deleted records, so once a deleted record is removed from the space, it is no longer recoverable.
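A toy model makes the effect easier to see. This is not the real page format: records are just byte strings laid end to end, and padding the freed tail of the page with zeros is an assumption made for the illustration.

```python
def compact_after_delete(records, index, page_size):
    """Toy model: drop one record and let the following ones slide up.

    records is a list of byte strings stored back to back; the freed
    space at the end of the page is zero-filled (an assumption).
    """
    remaining = records[:index] + records[index + 1 :]
    body = b"".join(remaining)
    if len(body) > page_size:
        raise ValueError("records do not fit in the page")
    return body + bytes(page_size - len(body))
```

After deleting the middle record from a page holding three, the third record now starts where the second used to be, and the bytes of the deleted record are nowhere in the page any more.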

There’s a small asterisk there: if you did have access to the system and were able to move quickly enough, you could recover it, because if FS Events — again, that’s the thing that’s monitoring the file system — if that’s still trying to keep up with the events that are happening on the disk — if a large deletion event has taken place, for example — you have time there until the event has a chance to catch up. But that only works if you’re analysing a live system.

So it’s possible that you could still recover the deleted metadata files if you look at the last good version, which is the store.db file. So the last good configuration will still have those deleted records, until that periodically updates; and if you examine the unallocated clusters, you can recover deleted pages from that database in those clusters, and you can still process them. So you can still find those records in other places, they’re just not in the live version: the .store.db.

The second experiment: the databases were populated again; we added in files, we deleted a lot of files. Once a record was deleted from the store databases, it was no longer recoverable. One notable exception that we encountered was on the FAT32 formatted volume. Every metadata record remained intact and is actually recoverable on a FAT32.

It didn’t matter that the files were deleted; and the last known good configuration, the store.db file, that wasn’t updated after 30 minutes.

The next experiment: we had two USB devices, one formatted exFAT, the other FAT32. We moved them between MacOS and Windows 10. At snapshot one, there was an expectation that the databases would contain two records, but only about half of them were actually indexed. So that did skew the results, but that is because the metadata importer is still working.

The next experiment involved reindexing the Spotlight metadata; you can use the GUI or the terminal commands. When you reindex the metadata store it results in the metadata being deleted. And they are actually available for recovery from the file system directly. Each time the data were deleted, the databases were created again, as confirmed by a new catalogue node ID being issued.

In experiment eight we were looking at the different major and minor releases of MacOS. Deleted pages appeared within the unused space of the file system across upgrades. Examination of that file shows that the database itself is not deleted, but remains in the same location pre-upgrade and post-upgrade. A check of the physical locations of the pages showed them to be in different locations from the site of the original databases. So it’s suspected, then, that a copy of the metadata store is created when the system is undergoing a major update, and that copy then gets subsequently deleted. And during a minor update, that behaviour does not take place.

The last experiment, then. We were looking to identify if deleted database pages could be recovered from actual casework. So there were fourteen different cases here. Unallocated records are recoverable from the unused space in file systems. Every database page was found to be the size of 16384. They were always found at a sector boundary on a disk. The examined images made use of a different version of the database, store D1; this depends what version of Spotlight you had when you first used Mac, and if you are Mac users you will be familiar with how when you buy a new Mac, magic things happen and everything comes over to your new Mac, so it moves all of those Spotlight databases over with it.
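Those two observations, a fixed page size and sector alignment, are enough to sketch a simple carver in Python. The signature is passed in rather than hard-coded because the real four-byte values are in the paper and not reproduced here; the 512-byte sector size is an assumption, and the function name is made up for this example.

```python
def carve_pages(image, signature, page_size=16384, sector_size=512):
    """Return (offset, page_bytes) for each sector-aligned signature hit.

    image is any bytes-like object (for a large disk image, an mmap of the
    file would avoid reading it all into memory). Every sector boundary is
    tested, and a hit yields page_size bytes starting at that offset.
    """
    pages = []
    last_start = len(image) - page_size
    for offset in range(0, last_start + 1, sector_size):
        if image[offset : offset + len(signature)] == signature:
            pages.append((offset, bytes(image[offset : offset + page_size])))
    return pages
```

Only checking sector boundaries keeps the scan fast and cuts down on false positives compared with searching for the signature at every byte offset.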

Although all of these are still recoverable, you can’t process them because the structure has changed. If you look at this last one here, I think this is of particular interest, this last case. This is an Apple hard disk — sorry, a hard disk that was originally shipped with an Apple machine, verified with the label and the machine — and it was subsequently used in Windows 10 as a secondary storage device, formatted as NTFS. It was being used on the Mac machine as storage as well, so it was NTFS formatted on the Mac machine.

Sorry, I tell a lie. It was HFS+ to begin with, on the Mac machine, as it shipped with a Mac computer, and then on the Windows machine it was formatted as NTFS. Now I’ll continue my story.

What was interesting is that we were still able to recover the pages of the Spotlight database store, and we were able to recover records from the Mac machine: over a quarter of a million records were still recoverable from the NTFS volume after it had been reformatted. That’s better.

In summary, then: the structure of the metadata store database was analysed and partially decoded; there was one thing that we couldn’t figure out. Experiments were used to reveal that records persist for a period of time within one of the copies of the database. Once a record is deleted, it’s no longer recoverable from within the database but it can be recovered from a copy, or from unallocated clusters due to that fluidity.

Deleted pages from the database are recoverable from unused space on the file system. If the operating system undergoes a major update, it appears that a copy of the metadata store is created before it is deleted. Database pages can be recovered from the unused space on the file system. And if the Spotlight index is reset, reindexed or recreated, the whole metadata store is deleted; and then you can get those database pages still from the unused space in that file system.

If you’re interested in learning more, obviously your first port of call would be our paper that’s just been published in this conference. Your second port of call: completely coincidentally, in parallel and without any communication, there was a paper published just last month in Digital Investigation called ‘Investigating Spotlight internals to extract metadata’. This was by Yogesh Khatri, I hope I’m pronouncing that correctly. What’s interesting is that the scripts that we’re providing are all bash scripts, while Yogesh’s are primarily in Python; again, it’s on Github, you can access the Github links there. The difference between our research and his published research is that we were able to recover deleted files. So once a file was gone, we were able to recover that information from the metadata store. Both approaches were trying to reverse engineer the proprietary store.db file and the associated pages; that’s what’s highlighted in that Digital Investigation paper that was just published, as well.