A Standardized Corpus For SQLite Database Forensics

Sven Schmitt discusses his work at DFRWS EU 2018.

Sven: Thank you very much, ladies and gentlemen. Thank you for having me. I will today talk about SQLite database forensics. My name is Sven Schmitt, as Bruce introduced correctly. I am a PhD student with Felix Freiling at the FAU University in Erlangen-Nuremberg, but actually, I only pursue my PhD during night-times and on the weekends, because I mainly work for law enforcement in Germany.

The topic that I’ve brought to Florence this time is ‘A Standardized Corpus for SQLite Database Forensics’. Basically, we will talk about forensic corpora in general, at the beginning. Then, we will introduce the forensic corpus that we created during this work, and I’ll who a little bit details about the tests and the results that we have, because we tested some tools against the corpus, and finally, we derived some points, some requirements that we think are useful for forensic tools in general.

How did that work start? We basically were trying to find a useful research question, and we came up with ‘How can we transparently test and validate analysis tools for SQLite?’ And we did not find something very useful that existed, so we had to come up with our own answer, and basically, it is to create a corpus dedicated to SQLite databases, because we think SQLite is already as important that it’s useful to create an own corpus comprising SQLite database files. And the corpus will allow to improve tests, to compare test results, and to also improve new algorithms and tools that deal with SQLite.

The database system is very dominant also in forensic investigations. All of you know very well that SQLite is very widespread. We find it on all sorts of mobile devices, but only on regular computer systems. It’s a very popular storage engine also for all the mobile apps and stuff. So, we find a lot of personal data in SQLite database files. For example, calendar data, call lists, messenger data, browser histories, all of that is today stored, in one or another form, in SQLite database files, which is, in the end, coming into forensic investigation. So, in a forensic investigation, we have to deal with SQLite files very regularly.

Our contributions in this work – first of all, we elaborated on the SQLite file format, so we had a look at the inner workings of the database files and came up with specifics, pitfalls, and corner cases as well, defined a set of database files concluding all of these. And we grouped them together in five categories. In the end, we have 77 database files that make up different scenarios, so they are all different but they are all very small. And we group them into a SQLite forensic corpus, so to our knowledge, it is the first corpus dedicated to SQLite.

We donated into the public domain, and you can get your copy, since this afternoon, at the following URL. And the third contribution is … well, we did some tests. We chose a selection of SQLite tools and ran them against the corpus to see how they deal with all the pitfalls and corner cases that we built in.

When we talk about forensic corpora, of course we need to talk about Simson Garfinkel, because he has done, in the recent years, significant work on forensic corpora. He also released a taxonomy on corpus sensitivity comprising five different types of data that can be comprised by a corpus, namely test data, sample data, realistic data, real and restricted data, and real but unrestricted data.

So, how does our corpus behave regarding these points? Well, we specifically created SQLite database files with our own data, so this is not something that is real. It is only test data. But the advantage is that it does come with no privacy restrictions and with no copyright restrictions. So, we can share the corpus freely all over the world, and this is what we are doing. It contains, also, aspects of sample data, because over the years, we collected between 1,000 and 2,000 database files out of the internet. However, we don’t know what contents are in that databases, so we currently do not share them. What we have done though is we looked at the sample data, analyzed it, and drew some specifics and pitfalls out of that, which we included in turn in our corpus.
So, it is test data, it contains aspects of sample data, but it is not real, and it is not realistic data. This is kind of a shortcoming of that corpus, and maybe in the future we will be able to extend the corpus, by sample data, by real sample data, or by real data.

Garfinkel also defined seven criteria to measure the usefulness or the importance of a corpus, which are namely: a corpus should be representative, complex, heterogenous, annotated, available, distributed, and maintained. So, how does the SQLite forensic corpus behave regarding these points?

Well, our databases that we created differ in settings, in number of elements (meaning number of tables, number of [06:54]), and in the contents. But it could be more representative. As I said, it’s test data, and we deliberately kept the corpus very small. So, the databases do not have … they are not very large, they do not have an immense amount of data. This is still missing.

The corpus has a low complexity, I’d say. We have 77 different files, meaning 77 different scenarios, which is good, but the complexity could be higher. This is something to improve by extensions to corpus that probably the future will come up with.

It should be heterogenous. Garfinkel says … yeah, should be created on a range of computer systems, using different patterns and stuff like that, which is helpful. But in the case of SQLite, the file format is strictly specified. So, it is well documented on the one side and on the other side it stays the same, regardless of the hardware SQLite runs on or regardless of the operating system. So, you can even interchange SQLite databases. So, if you take a database from an iOS system and bring it to an Android, you should be able to … I have not tested it, but you should be able to run it right away. So, the file format stays the same.

It should be annotated, which is also why we release extensive metadata along with the corpus. I’ll come to that later on. Basically, we deliver, for each scenario, the database itself, the SQL CREATE statements that have been used to create the database, and an XML file that has additional descriptive information about what is in the database actually.

A corpus should be available. Well, it’s in the public domain. You can download it, it is available.

It should be distributed, meaning using open formats and having available tools for manipulation of the corpus. We do think this is the case for that corpus, because well, it’s SQLite3, the format is open, the format is documented, and the metadata that we provide is in SQL and XML format, all of these formats are very well known and there are tools out there to manipulate these formats and to deal with it.

A corpus should also be maintained. This is an important point, because if the corpus is not maintained, probably in a few years it will be basically useless.
So, this brings me to another conference that will be held in Hamburg in Germany in May, the IMF conference. And we have done some work in the past to present a first extension to the corpus, where we will also cover anti-forensic aspects.

Let’s talk a bit about the metadata that accompanies the corpus. Each scenario has the database file, a SQL file that you can run, with SQLite itself, for example, in order to create the database, and a descriptive XML file containing further information, for example, like how many tables are in the database, what are the contents of the database, how many entries are there, etc. Basically, it’s the same information in the XML and in the SQL file, but we think the XML file can be more easily used to automate whatever you want when you deal with the corpus, for example, to automate tool testing or to automate comparison of results, and stuff like that. That is why we deliver the XML files.

How does it look like? Well, it’s just normal SQL statements. We set different pragmas at the beginning, for example, the page size, the encoding that we want the database to have, and secure-delete flags. With secure-delete … we activate the secure-delete flag at the beginning, when we fill the databases, because we say we do not want deleted artefact … we do not want artefacts that are in unallocated space of pages, for example, in the database, when we build it. And many databases only store logical contents. We have about 50 databases that hold logical contents only, and no deleted artefacts. And the remaining 27 databases also comprise deleted artefacts where we dropped some tables or deleted some entries in order for the tools to be able to recover this data.

The XML file describes the overall contents of the database, so this is an example with one table and two entries. Giving some meta-information, like the title, subject and … yeah, there are different fields, almost 15 fields about meta-information, also the category name, like we have table names, because we group the databases into categories. I’ll come to that right now. And the actual database contents as well. So, you can extract from the XML files what columns are stored in a certain table and what entries the table holds.

Here are the five categories of the corpus – the first one is keywords and identifiers. We played around with weird table names, I brought some examples on the next slide, I think. We encapsulated column definitions in a way that is not so common, but perfectly acceptable, meaning all the databases in the corpus fully comply with the file format specification. There are no tricks that we did or no definitions that we violate. The databases are all fully intact, fully correct, and comply to the file format.

In category three, we played around with keywords and constraints. We used all of the UTF character sets that SQLite offers in category four. Introduced further database elements in category five, like triggers, views, indices. Most of the tables in all of the other categories, except category five, will contain a few tables only. Tree and page structures is another category where we experimented with fragmented contents, meaning introducing larger contents into the database, so they cannot be stored consecutively anymore, they will be fragmented within the database, like in a file system for example – if you store a file in a file system and it gets larger, then chances become higher that it will be fragmented. It’s the same in the database.

We introduced reserved bytes at the end of a page – so, basically, all the SQLite files are paged, meaning you can think of like clusters for a file system. Cluster for a file system is something what a page is for SQLite, so we have several pages, consecutively, from the beginning to the end of the file. And SQLite tries to store all the contents on one page, but if the content gets larger, then it has to be split on further pages, and this is the point where it gets fragmented. And for every page, you can introduce a reserved space at the end of the page, holding arbitrary data. So, this is for folder eight.

We introduced special pages, like pointer-map pages, in some folders, and categories A to E. Play around with deleted content, so we deleted tables, entire tables. We deleted also records from tables, and we also try to overwrite certain contents. The same is true for records – so basically, when I talk about a record, I mean database entry. It’s just a row in a table, for example. We deleted records, we try to overwrite records, and we also deleted some overflow pages or, in general, some pages, from the database files.

How does it look like? This is an example of the first category, where we played around with weird table names, encapsulated column definitions, keywords, and constraints. It’s just to give you an example; if you’re interested in what we did in detail, then please read the paper. We describe all the folders and categories in detail there. But it’s just to give you an idea what we did.

So, at table name can just be double quotes. Normally, you would have the name of the table between the double quotes, so we just omitted the name and left the quotes or brackets, or something with single quotes in between, a space, or [format strings] as table names. And most of these already posed problems to the tools that we ran against the corpus. Then, we did the same for encapsulation, meaning we sometimes have incorrect encapsulation regarding the occurrences of columns of quotes, of double quotes, of brackets, and stuff like that.

Folder four covers, as I said, all the UTF character sets, so we made sure that we have UTF-8, UTF-16, little endian, big endian, all of that is covered. Also, including, sometimes, German umlauts, for example. Or Chinese Unicode characters, or Latin and Chinese Unicode characters, because we wanted to cover that aspect as well. The following folders, five through seven, have different pitfalls. I will not go through all of the details now. You have them in the paper. One interesting point regarding the deleted contents – we described for folders A through E what we actually did, in a very short form. So, if you are interested how the database C-10 has been created, then you can go to that table, look at C-10, and you will see that we first created two tables, with [floats] and texts. We inserted 10 entries in each of the tables, and then we deleted 10 entries from the tables.

That brings me to the evaluation. There are quite some tools out there that allow the analysis of SQLite. Some of them are commercial, some of them are open-source. Some of them were developed with forensics in mind, some not. Some do only extract logical data, some do leverage also the recovery of deleted contents. Some offer both functionalities. Basically, there were too many tools out there, so we had to make a selection and restrict our tests to a few tools, and we selected six tools and made sure that we covered different categories.

We chose two open-source tools and four commercial tools, for example. We chose tools that only extract logical contents, but we also chose quite some tools that are able to recover deleted data. And we have tested the SQLite engine, but we did not include it in the evaluation, because of one reason – all the database conform to the file format, so SQLite, the official SQLite engine, is able to deal with these databases. Only for the logically present contents, of course.

These are the tools that we chose. One of them is Undark – it’s an open-source tool written in C. Another one is SQLite Deleted Records Parser. It’s an open source Python script that we found on the internet. SQLiteDoctor, which is a proprietary software, as well as Stellar Phoenix Repair for SQLite. SQLite Database Recovery and Forensic Browser for SQLite.

And let’s have a look at how these tools are promoted. You find phrases like “SQLite deleted and corrupted data recovery”, “recover deleted entries”, “repair and restore corrupted databases”, “recover corrupted databases, easily recovers all deleted records”, which is a strong argument I think, “repair and export corrupt SQLite files”, this is not doing deleted records, by the way, and “display all present data and restore deleted records”. So, these are strong phrases, and we were keen to see how these tools behave with our corpus.
How did we run the tests? Basically, manually. We took a tool, ran it on all of the 77 database files, and had a look at the results. Of course, when you run a tool on 77 database files, a lot can happen. [laughs] So, we had to group the results somehow. This is a bit … that’s a pity I think, but it’s necessary, because otherwise, you don’t understand really what the results are.

So, we have different categories, starting with a simple checkmark, meaning all elements are correctly processed by the tools. We have a checkmark with a star, meaning some elements were correctly processed, but something else happened, like some error was thrown or some warning was given, something was wrong during the analysis. But we still got good results for the elements. And a red cross, meaning no element was correctly processed. And I want to be very clear – that does not mean that no content at all had been recovered. That just means that no content had correctly been processed – there may be parts of records that we saw, but something was wrong. And yeah, just another sign for not supported by the tool.

And yeah, you see that with different categories … I will not go through the detailed results, because they are just too many. But when we look at category one, you see that Forensic Browser maybe has some weaknesses there, while the other tools behave quite good. In the other categories, we have other tools that fail and tools that behave better. And basically, the overall evaluation showed that none of the tools was able to correctly parse the files in the corpus, and they all showed strengths in some categories and weaknesses in others. This is not surprising, and I think this is also important why we should continue to build on the SQLite corpus, because this will allow for improvements on the tools tested, on tools that will be tested in the future, and on new algorithms that will be developed in the future. So, this is, for the first time, the possibility to have something we can compare when testing different tools over the world.

I will leave the detailed evaluation results for the interested reader of the paper, and come to the conclusion.

What we have done is we introduced a corpus for SQLite forensics. The first version launched today, with 77 databases on board. This is not what we think is the final corpus, so we invite you to participate and to contribute to the corpus. We tested the first 77 databases against six chosen tools, and we showed that they are quite sensitive to contents that are not so common, that are probably not part of many Android database files, but that are perfectly valid, and that can occur whenever you parse a database file. And keep in mind that we extracted many pitfalls out of sampled data, so that means we also found these corner cases and these pitfalls in real databases on the internet. So, this is not just something that we came up in the lab and said, “Oh, that would be fun to test.”

From our test results, we basically derive three requirements that will be nice to have, at least for forensic tools. The first one is: make sure that traces from underlying evidence are correctly shown. We discovered errors and wrong conversions when presenting results to the user, be it on the GUI or be it written to another file, or included in a report or something. Especially with integer values, with [floats], with Unicode characters, there have been many problems.

The second requirement is erroneous analysis on a little part of the evidence should not cause the elimination of the rest of the entire evidence. What does that mean for a database? A database contains different elements – probably many tables, some indices, some views, some triggers, and if you encounter a problem when parsing the database with one of the tables, then try to continue, because that does not mean you are not able to parse the other tables, or the other elements. What we have discovered is that some tools are stopping when they are encountering an error, or cannot analyses something. And the worst scenario is when they stop and they don’t tell the user that there was a problem or something occurred that they didn’t deal with.

And the third requirement – the analysis of existing, logically present data shall not be degraded when you activate the functionality for data recovery. This is also something that we have discovered. When we ran tools without data recovery options activated, we got a certain set of results, and when we activated the data recovery options, the results from the logically present data differed. So, we had less results from the logically present data, for example. This is something that should not be.

Again, you are first of all invited to read the paper, if you are interested. Get in touch. And we would be happy if you make use of the corpus whenever you develop tools or algorithms for SQLite analysis. Test them when you release new tools and algorithms. Feel free to contribute and extend the corpus, and help us, in the end, keeping the corpus up to date, because this is important.

And we will ourselves make another step, because we have been working on an anti-forensic extension to the corpus, where we introduce another, I don’t know, 50, 60 databases, covering, for example, loops or other pitfalls that we think are anti-forensic. Because these databases no longer necessarily conform to the file format, whereas the first version of the corpus holds databases that fully comply with the file format, the anti-forensic extension will no longer comply with the file format. So, this is where we will see crashes from the tools, where we will bring the tools to run in endless loops, and stuff like that. If you are interested in that, join us at IMF 2018 in Hamburg.
That’s it for the talk. Thank you very much for your attention.

[applause]

End of transcript

A Standardized Corpus For SQLite Database Forensics

Get The Latest DFIR News

Join the Forensic Focus newsletter for the best DFIR articles in your inbox every month.

Leave a Comment Cancel reply

David Shipley: Investigating The Darkest Corners Of Digital Evidence

Deepfake Forensics: How To Analyze Suspected AI-Generated Images

Cellebrite Genesis: Get Case Insights In Minutes

Digital Forensics Round-Up, June 24 2026

RAID Forensics: Handling Unknown Configurations In DFIR Investigations

The New Age Of Investigations: Cellebrite’s Journey To Genesis