Patrick Mullan shares his research at DFRWS EU 2019.

Hello, I am Patrick. I would like to introduce you to our recent research on forensic source identification using JPEG image headers. So the idea is to identify the source of the image, so which device took the image, with a focus on smartphones, particularly Apple devices.
So why is this interesting? It’s one goal of multimedia security; it’s often called reconstruction of provenance, where provenance means to find something out about the background of the image: the history of the image. Like, was it processed with some software; which path did the picture take, or was it distributed over, until it reached a certain point; and finally, and that’s what we are looking into today, to identify the source of the multimedia content.
So in this scene, the question is: which of the devices took the picture on the right?
One way of doing this is to validate a fingerprint of the image file header, so the header as it comes with the image file. And that allows for detection of manipulation: if the file header was manipulated, it doesn't match anymore. And it's a relatively simple or easy first step, because you don't need any deep learning methods to look into the file content, the image content. You always have the file header at hand and can look into that.
We are not the first ones who did this: there are works from Kee in 2011 and Gloe in 2012, who did some studies on that already. They, however, focused mainly on traditional devices like DSLRs and point-and-shoot cameras, so the traditional cameras people used a few years ago, and they investigated the file headers from those devices already.
However, people have started to use their smartphones more for taking pictures: you probably know that from yourself, that for a quick picture you just take your smartphone instead of getting out a DSLR or so.
And here, that's a plot derived from our data, from 2008 until 2018, so over the course of eleven years; and you see in blue that Apple pictures got a much bigger share. So that just shows that Apple smartphones in particular get more prominent use for taking pictures.
The thing is that, on those traditional devices, you bought the device and it was running on the same configuration all the time. Now with smartphones, you have a much more complex ecosystem of hardware and software playing together, and the software can be changed. So the question is, what role does software play when doing source identification of images?
To put it into one scheme: on the top row, you see an identification granularity scheme for the hardware. So you have a picture and you want to know what type of picture it is; in our case, a digital camera picture. Then which make, which brand, the camera was from: Canon, Apple, Huawei? And then finally, once we have an Apple picture, which specific model it was: iPhone 5, iPhone 6, iPhone X, and so on.
So that is, for the model identification, how it was done up until now. But we suggest here to add a second dimension downwards, which also takes software into account, because there is a lot of software running on a smartphone: for example the operating system, so iOS 9, iOS 10 or iOS 11; and the imaging app of the operating system, the native app, so the pre-installed app that you already have on your device.
And you can also download an app from your app store, like Instagram and Flickr and so on. And of course any of those lists can be expanded, and that just shows how complex this source identification is in general.
So we say that the software also plays a significant role now, especially on mobile devices. That software can be updated, whereas the hardware was not updated; and you can also now choose the app you want to take the picture with. So it gets more complex.
And that also means that the signature of the file header can potentially change, or look different, depending on the configuration of the software that the image was taken with. So let's take this as a new goal: let's try to fingerprint the software, instead of only fingerprinting a concrete model, or hardware.
OK. Let's look at how pictures are actually stored. If you say you have a JPEG image, technically you have a JPEG-encoded image which most likely comes in a file container — either the JPEG File Interchange Format (JFIF) or the Exchangeable Image File Format (Exif) — those are the ones which are widely used.
And those file containers then, on the one hand, embed the encoded picture — we will look at how the encoding works on the next slide — and also you can add additional metadata, that’s data describing the picture.
So the JPEG algorithm is a lossy compression algorithm, where ‘lossy’ means that some content is removed, which is hopefully not too bad for the human eye, so it still looks good. And that is mostly done through quantization.
Quantization is a division by a fixed 8×8 matrix, a table of coefficients which is preselected for the image being encoded and then stays a constant matrix. And that enables you to make a trade-off between file size and image quality: the bigger the values in that quantization table are, the smaller the picture gets, but also the coarser it looks.
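To make that trade-off concrete, here is a minimal sketch in Python, with toy 2×2 blocks standing in for JPEG's real 8×8 blocks and made-up table values, of how quantization divides and rounds the transform coefficients:

```python
# Sketch of JPEG-style quantization (toy 2x2 blocks and made-up
# values, not a real encoder): each block of DCT coefficients is
# divided element-wise by a quantization table and rounded.

def quantize(block, table):
    """Element-wise divide-and-round of a coefficient block."""
    return [[round(c / q) for c, q in zip(brow, qrow)]
            for brow, qrow in zip(block, table)]

def dequantize(block, table):
    """Approximate reconstruction: multiply back by the table."""
    return [[c * q for c, q in zip(brow, qrow)]
            for brow, qrow in zip(block, table)]

coeffs = [[120, 35], [18, 4]]
fine   = [[2, 2], [2, 2]]      # small table values: little loss
coarse = [[50, 50], [50, 50]]  # big table values: heavy loss

print(quantize(coeffs, fine))    # most detail survives
print(quantize(coeffs, coarse))  # small coefficients collapse to 0
```

With the coarse table, the small coefficients collapse to zero, which is exactly what shrinks the file and coarsens the image.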
So this is an example picture from Wikipedia: on the left you see a lot of artifacts, because that part of the picture was encoded with very coarse quantization matrices; and over here on the right it looks fine, because different tables were used there.
And the thing is that that trade-off between image size and picture quality is in the hands of the programmer who implements that very particular version of the JPEG encoder. And so you could get here already the first possibility of getting some leads to fingerprint the encoder, or find some characteristic traces here.
And now that we have the encoded image, we save it to the JPEG file, which is split into subparts; so some subparts are used to save the compressed image, as we've just discussed, and other subparts can save further metadata, which usually comes in key-value pairs if it is human-readable.
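As a rough illustration of that container structure, the following sketch walks the segment markers of a JPEG byte stream. It is a simplified parser, not a full implementation, and the minimal header at the end is constructed by hand:

```python
import struct

# Simplified walk over JPEG segments: after the SOI marker (0xFFD8),
# the file is a sequence of "0xFF <type> <length> <payload>" segments,
# where the 2-byte big-endian length includes itself. Metadata such as
# Exif lives in APP1 (0xFFE1) segments; quantization tables in DQT
# (0xFFDB) segments.

def list_segments(data):
    """Return (marker byte, payload) pairs until SOS or EOI."""
    assert data[0:2] == b'\xff\xd8', "not a JPEG (missing SOI)"
    pos, segments = 2, []
    while pos + 4 <= len(data):
        marker = data[pos:pos + 2]
        if marker == b'\xff\xd9':        # EOI: end of image
            break
        (length,) = struct.unpack('>H', data[pos + 2:pos + 4])
        segments.append((marker[1], data[pos + 4:pos + 2 + length]))
        if marker == b'\xff\xda':        # SOS: compressed data follows
            break
        pos += 2 + length
    return segments

# Hand-built minimal header: SOI plus one APP0 segment carrying "JFIF".
blob = b'\xff\xd8' + b'\xff\xe0' + struct.pack('>H', 7) + b'JFIF\x00'
print(list_segments(blob))
```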
So for example, a key would be Make, value Apple; key Model, value iPhone 6; key ISO, value 400. Or it can be some encoded proprietary data that you cannot read or decode without some additional knowledge.
But the important thing is that it is not mandatory to set those key-value pairs. So each vendor can set or deploy a different number of those key-value pairs, and that already can give you some leads on what the source was.
Further, key-value pairs can be grouped into image file directories — abbreviated IFDs — which are logical groups, also suggested by the Exif standard. So IFD0, for example, could contain image height and image width, that's information concerning the main image. The same for IFD1, but for the thumbnail image, where the thumbnail is a small preview image that can be embedded.
Then ExifIFD is for photometric information like the shutter speed or aperture the picture was taken at. GPS holds geolocation coordinates: where you were standing when you took the picture. And finally MakerNotes, that's often proprietary information, not so easy to read and not specified in the standard.
Now that we know how the JPEG file works, let's look at the data we investigated. We downloaded pictures from Flickr, that's an image-sharing website, and we only downloaded pictures tagged as 'original', because Flickr says then that they do not process the image any further, so they provide you the picture as it was uploaded.
We downloaded pictures over the last eleven years, and we did some pre-filtering. So for example, if we found the value ‘Photoshop’ in that EXIF data, we discarded the picture, because then most likely it was manipulated. We wanted to fingerprint the original source, so pictures like that were discarded.
And further, we also saved the user ID uploading the image, so that enables us to sort the pictures by user. And we only allowed one picture per user in our dataset, to keep the data clean. Because if some software could bypass our filter rules, and one user would upload many images that bypassed our filter rules, he would pollute our data in some way and make it more noisy; so we only allowed one picture per user, or per device.
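The pre-filtering just described could be sketched like this; the record format and field names are illustrative, not the actual pipeline:

```python
# Sketch of the dataset pre-filtering (hypothetical record format):
# drop pictures whose Exif 'Software' field suggests post-processing,
# and keep only one picture per uploading user.

def filter_dataset(records):
    seen_users, kept = set(), []
    for rec in records:
        if 'Photoshop' in rec.get('Software', ''):  # likely manipulated
            continue
        if rec['user_id'] in seen_users:            # one picture per user
            continue
        seen_users.add(rec['user_id'])
        kept.append(rec)
    return kept

records = [
    {'user_id': 'a', 'Software': 'iOS 11.2'},
    {'user_id': 'a', 'Software': 'iOS 11.2'},         # duplicate user
    {'user_id': 'b', 'Software': 'Adobe Photoshop'},  # processed
    {'user_id': 'c', 'Software': ''},
]
print(len(filter_dataset(records)))  # 2 pictures survive
```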
So the goal of the study is to link the header information to the make, model or software, to fingerprint it. And for that we read out the key-value pairs of make, model, and software, and considered them the ground truth for the evaluation.
How did the evaluation work? Let’s start with a traditional device. Here you see a Canon EOS 450D, that’s a DSLR camera. And now you see five plots, which are those logic groups of the EXIF metadata: ExifIFD, IFD0, IFD1, GPS, and MakerNotes. And each of the plots goes from, in this case, 2011 to 2018. And you see how many key value pairs were set in each group.
So for example here, about 70% of all pictures from that camera had exactly 31 key-value pairs set in ExifIFD. And this number is constant over time. So you have one constant bar; you may find different values in the different groups, but you find such a constant bar everywhere in the plots.
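The feature behind these plots, the number of key-value pairs set per IFD group, could be computed along these lines; the tag names and values here are illustrative:

```python
from collections import Counter

# Sketch of the per-group counting feature: for each picture, count
# how many key-value pairs are set in each IFD group; the distribution
# of these counts over a camera's pictures forms the fingerprint.

def pairs_per_group(exif):
    """exif: {group_name: {key: value}} -> {group_name: count}."""
    return {group: len(tags) for group, tags in exif.items()}

pictures = [
    {'ExifIFD': {'ISO': 400, 'ShutterSpeed': '1/60'},
     'IFD0': {'Make': 'Canon'}},
    {'ExifIFD': {'ISO': 200, 'ShutterSpeed': '1/125'},
     'IFD0': {'Make': 'Canon'}},
]

# Distribution of ExifIFD pair counts over this toy dataset:
dist = Counter(pairs_per_group(p)['ExifIFD'] for p in pictures)
print(dist)  # all pictures have the same count: one "constant bar"
```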
Let's compare this now to the smartphones. So for an Apple device, for example, that's the same plot; we again have five plots, grouped by the IFD groups. And now you see, for example, that ExifIFD in the beginning, in 2011, had 25 key-value pairs set, the same in 2012; but then only 24; then it jumps up to 31, then 34, and so on. So it is not as constant as it was for the DSLR camera, for the traditional cameras.
And you find those jumps at many places, for example here Apple decided to add more GPS information; and also in the beginning, Apple had no proprietary information in the MakerNotes, but then started to deploy more and more information in that MakerNotes as well. So in contrast to DSLRs, those numbers are not constant over time.
Further, let's look into the quantization tables: those were the eight-by-eight matrices with the constants. We basically hashed them to get a unique ID, and then sorted them by frequency. So this is now per iOS version — not per year, but per iOS version — and you see that for iOS 5, 6 and 7, you find one out of two matrices in the beginning. But suddenly, from iOS 7 to iOS 8, you do not find those matrices at all anymore.
And the interesting thing is, that was for an iPhone 4; for an iPhone 5, you have exactly the same jump between iOS 7 and 8. So that means that Apple changed the quantization matrices, and they did it with the operating system updates, independent of the actual model used. So it doesn't matter if you had an iPhone 4S, or 5, or any other one: if you changed from iOS 7 to 8, your quantization matrices were changed with that.
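The hashing step mentioned above can be sketched as follows; toy-sized tables stand in here for the real 8×8 matrices:

```python
import hashlib
from collections import Counter

# Sketch of table hashing: serialize a quantization table to bytes,
# hash it to a compact ID, then count how often each ID appears
# (per iOS version, in the actual study).

def table_id(table):
    """Stable short ID for a quantization table."""
    raw = bytes(v for row in table for v in row)
    return hashlib.sha256(raw).hexdigest()[:8]

tables_seen = [
    [[2, 3], [3, 5]],    # toy tables standing in for 8x8 matrices
    [[2, 3], [3, 5]],
    [[8, 12], [12, 20]],
]
freq = Counter(table_id(t) for t in tables_seen)
print(freq.most_common())  # most frequent table ID first
```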
So as I showed you, for traditional cameras the header information is much more constant, but for smartphones it changes over time, and the changes occur with the version updates. And the metadata characterises the operating system, or the software, more than it characterises the actual hardware: the iPhone 4S and the iPhone 5 both had the same changes.
That makes associating the header information with a concrete model more difficult, because now the software changes, and with it the file header information changes. But this leads us to a new experiment: can we determine the software version, and not only the model, from that information?
For that we trained a random forest classifier — that’s a basic machine learning algorithm — and we studied the confusion between operating system versions.
So here you see on the left the hypothesis of the random forest, which iOS version is predicted; and on the bottom, the true iOS version according to the metadata. For example, in about 90% of the cases where it predicted the picture was taken with operating system 4.0, it was correct; and overall the main diagonal is highly populated, so it predicts the operating system of the picture more or less reliably. And if there are confusions, they are usually only between neighbouring operating systems; there are never jumps like from iOS 4 to iOS 11 or so.
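To illustrate how such a confusion matrix is read, here is a small sketch with made-up numbers (not the paper's results): columns are the true iOS version, rows the predicted one, and the per-class accuracy is the diagonal share of each column.

```python
# Reading a confusion matrix (illustrative numbers only):
# confusion[pred][true] counts how often version `true` was
# predicted as version `pred`.

versions = ['iOS 4', 'iOS 5', 'iOS 6']
confusion = [
    [90,  8,  0],
    [10, 85,  5],
    [ 0,  7, 95],
]

def per_class_accuracy(conf):
    """Diagonal share of each column: correct / total per true class."""
    accs = []
    for true in range(len(conf)):
        col_total = sum(conf[pred][true] for pred in range(len(conf)))
        accs.append(conf[true][true] / col_total)
    return accs

for v, a in zip(versions, per_class_accuracy(confusion)):
    print(f'{v}: {a:.0%} correctly attributed')
```

Note how the off-diagonal mass sits next to the diagonal: confusions stay between neighbouring versions, as in the talk.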
And in the paper, we also did the same for model identification, so the hardware version, which was only about 60%. So the software version prediction is better than the hardware version prediction; still, the hardware prediction is not completely bad. And that confirms that the header information changes with the software version.
And finally, the last experiment: can we also determine, or estimate, which app the picture was taken with?
For that, we studied the uniqueness of the quantization matrices. So we looked at one very specific quantization table — or pair of tables — and looked at what was written in the Exif field 'Software'. And in the top plot, you see that for one given quantization table we found about 204 times the value 'Snapseed' — that's an imaging app you can download from the app store — and only 17 times did we find some different value in that field. So that means, if you find this quantization matrix, it is about ten times more likely that the picture was taken with Snapseed than with any other app, or with an app that removed the data.
That worked pretty well for Snapseed. For Camera+ on the bottom, that's a different app, it did not work so well: here we found, for one specific quantization table, 49 times the value 'Camera+', but 67 times a different one. So it can work, but it does not always work.
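The ratios behind these two examples, using the counts from the talk, are simply:

```python
# App-attribution odds for a given quantization table: how often the
# Exif 'Software' field named the app versus anything else.
# Counts are the ones reported in the talk.

def odds(app_count, other_count):
    """How many times more likely the app is than all alternatives."""
    return app_count / other_count

snapseed = odds(204, 17)   # strongly in favour of Snapseed
cameraplus = odds(49, 67)  # below 1: this table is not discriminative
print(round(snapseed, 1), round(cameraplus, 2))
```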
OK, to wrap up my talk. I showed you that smartphones offer a much larger variability of the information in the header, and that it is not constant over time. We investigated the possibilities of associating the JPEG header information from Apple devices — mainly iPhones — with their source, and looked into linking the header information to the hardware, to the operating system, or to the imaging software.
And the general observations are that the software components are changed with the updates, or at least can be changed; that the software stack is not fixed, but the user can now use any arbitrary app, which he could not do on his old DSLR camera; and that much of the metadata, and the compression parameters, depend on the concrete software version, the concrete software stack used at the point when the picture was taken.
Nevertheless, that allows us now to fingerprint software instead of only models. And in this scheme, you see that we suggest now to use this second axis to also uncover information about software when doing source identification, where software can be the operating system or the app that was used for taking the picture.
Yeah, that is my talk. Thank you, and I’d be happy to answer any questions.