Wust: Hi everyone. My name is Karl Wust, and I’m a security researcher from ETH Zurich, and the work that I present here is joint work with Petar Tsankov, Sasa Radomirovic, and Mohammad Torabi Dashti.
Usually, when we open a file – if we open an image file, for example – we expect something like this. The file viewer pops up and shows us the image that we want to see, in this case, a nice photo of Zurich. But sometimes, files get corrupted, for different reasons – transmission errors, degraded storage, or some other reasons. And in some cases, we can still open the image, but it will look something like this. So we get some artefacts here. In this case, part of the image that is on the bottom right should actually be on the bottom left. But you can still open the image and you can still see that this is a picture of Zurich, and if we compare this picture to the previous one, we can clearly see that this is originally the same image. So we can still get some information out of this.
But we can have different kinds of corruptions. And these corruptions lead to something like this. The file viewer pops up and tells us, “Could not load image ‘zurich.png’ because this image is not a PNG file.” In this case, the file header of the PNG is broken; that’s why the image viewer doesn’t recognize it as a PNG file, and it produces an error, and you don’t get anything. These are the kinds of corruptions that we are concerned with and that we work on. So what I’ll present to you in this talk is about repairing these kinds of files, files that cannot be opened by a file viewer.
This means that, ideally, we get from producing an error to, at some point, opening the file. However, even if we are only able to open the file with some artefacts showing, we can still get some information out of it – so we’re still somewhat happy.
Our goals are that we have an invalid input, a file that we cannot open, and we want to get the file viewer, at some later time, to produce an output that is, for pictures, visually as similar as possible to the original file; for other files, the measure of similarity depends on what we have. For example, if we have a text-based PDF, you might want the text to be as close to the original as possible.
Our approach differs a bit from previous approaches, in that we don’t actually change the data of the file, but instead we modify the execution of the file viewer when we want to open a corrupted file. And we learn the behavior that we want to enforce by opening a lot of valid files, and seeing how the file viewer behaves when opening these valid files.
So our tool works in two phases, basically. We first have the training phase, in which we learn the behavior; then we have the force open phase, in which we force the file viewer to actually open the corrupt file.
In a file viewer execution, we usually have some checks. In this case – this is an example for a PNG file viewer – we have a check that says, “Does the file start with the bytes 137 80 78 …?” and so on. This is the PNG file signature, so this is one of the first checks that is done, to see whether it’s a valid PNG file. If this condition is false, then we get this pop-up with the error, telling us it couldn’t open the file. If this check succeeds, then we continue the execution. Then at a later point in time, we might have another check that checks whether the CRC checksum that is stored in the file is actually equal to the CRC checksum that we compute from the data of the file. And there, again, if this check succeeds, we continue the execution; otherwise, we will produce an error, this time telling us that the CRC check failed.
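The two checks described here can be sketched in Python. This is a simplified illustration, not the actual viewer's code; the signature bytes and the CRC-over-chunk-type-and-data convention follow the PNG specification:

```python
import zlib

# The fixed 8-byte PNG file signature: 137 80 78 71 13 10 26 10.
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

def has_png_signature(data: bytes) -> bool:
    """First check: does the file start with the PNG signature?"""
    return data[:8] == PNG_SIGNATURE

def chunk_crc_ok(chunk_type: bytes, chunk_data: bytes, stored_crc: int) -> bool:
    """Later check: does the CRC stored in the file match the CRC
    computed over the chunk's type and data fields?"""
    return (zlib.crc32(chunk_type + chunk_data) & 0xFFFFFFFF) == stored_crc
```

A viewer that fails either predicate takes the error branch instead of continuing; these are exactly the conditionals the approach later forces.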
Now, the behavior that we do want is basically this path down here. What we don’t want is the program producing these errors, right? Because at the end, we want to open the file. So if we land here, because this condition is false, we don’t reach our goal, and the program doesn’t open the file. The same for this one here.
So what we can do is just force the program to continue down the left path, in this case, and force the checks to always evaluate to true. This way, we won’t get these errors. So if you look at the bigger picture, we get some execution tree. Each of these bubbles here is a conditional, and we have one valid file here that, for the first conditional, takes the left branch, and for the next conditional on its path takes the right branch, and so on, until at the bottom here, the file viewer opens the file. For example, for a PNG, it would display the image here. A different file – say an invalid file given as input – might take a different path. So let’s say at the first conditional, we continue on the right path, and then we have another check, and there we again continue on the right path, and then here, the image viewer will display an error message and the file will not open.
The valid files are files that satisfy the file format specification, and the invalid files do not satisfy the file format specification. This format specification can actually be implicit in the program. So this doesn’t need to be a specification that is written down somewhere and standardized by some organization. It can be a specification implicit in the actual file viewer that we look at, and the constraints that are given by this format specification manifest in the execution of the program through these conditional statements and through these checks.
What we can do is record the branches that are taken when we execute the file viewer with a valid file as input. So in this case, we have the file from before, with this execution path here. We can have an additional file where the execution always takes the left branches, and we can record multiple such paths, and we get this collection of branches that are taken. And we see here that for the first check, for example, some valid files take the left branch and some valid files take the right branch. So there, we don’t get any information about whether a file is valid or not. But if we look at this check here on the right, we see that all of the valid files actually take the left branch. So if we are at this conditional, then we know that if we want to open a file, we have to take the left branch; otherwise we will get an error at some point.
So we can basically throw away all of the recorded branches where some files take the left branch and some files take the right branch, and just keep the ones for the conditionals where the file viewer behaves the same way for all valid inputs. So we’re left with some smaller collection of branches, and we know that these branches are always taken if a valid file is the input and we reach this conditional. So now, if we go back to the execution path of the invalid file that we had before: at the first check, we don’t have any information yet about the validity of the file, because some valid files take the left branch and some take the right branch.
So we just continue the execution to this check here. But at this point, we actually know that all of the valid files take the left branch, and so instead of going down the right path, and being 100 per cent sure that we will reach an error at some point, we can just force the program to take the left branch instead. And we will continue the execution and go to this next conditional. And for this one, we again know that all of the valid files take the right path, so we will again enforce this path, and reach this check. And here, we again don’t know which way the file should go if we want to open it. So the check is up to the file viewer, and in this example, let’s say it takes the left branch, and then hopefully we will be able to see the file here. So hopefully, the file viewer will open the file when we reach this point.
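The learning step and the forcing step just described can be sketched as follows. This is a simplified model, not the actual implementation: a branch site is just a label, and an execution trace is a list of (site, direction) pairs:

```python
def learn_forced_branches(traces):
    """traces: one list of (site, direction) pairs per valid file, where
    direction is 'L' or 'R'. Keep only the sites where every valid file
    that reached the site took the same direction."""
    observed = {}  # site -> set of directions seen across all valid files
    for trace in traces:
        for site, direction in trace:
            observed.setdefault(site, set()).add(direction)
    return {site: next(iter(dirs))
            for site, dirs in observed.items() if len(dirs) == 1}

def replay(natural_path, forced):
    """Force-open phase: wherever the learned table has an entry, override
    the direction the corrupt file would naturally take; everywhere else,
    leave the decision to the file viewer."""
    return [(site, forced.get(site, direction))
            for site, direction in natural_path]
```

Note that a site visited with both directions across valid runs (the first check in the example, or a loop's exit branch) is dropped from the table and never forced.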
The nice thing about this approach is that we can implement it as a black box approach, meaning that we don’t need any information about the file format, and we don’t need access to the source code. All we need is a binary of the file viewer, and some valid files that we can trace. And we use binary instrumentation and execution hijacking: when recording the program executions for valid files, we inject a function after each jump statement that records from which location the jump goes to which destination. Then, after we collect all of these traces for the different program executions, we can combine them, and then again use instrumentation to hijack the execution of the file viewer when we want to open a corrupted file, and insert unconditional jumps before the jumps that we want to force, so that the program will always take the same path at these conditions.
This makes the approach generic – we inherently don’t have any format-specific heuristics; instead, we learn how the file viewer should behave in order to open these broken files.
We implemented this in Python and C++: in C++, we use the Intel Pin framework for instrumentation, and then we have a wrapper around that in Python that does the merging of the different traces and gives us the branches that we actually want to force. And for the evaluation, we used random file corruptions. Say we want to corrupt one single byte: then we corrupt a byte in the file and try to open it with the file viewer; if it succeeds, we throw this file away and try again with a new corrupted byte, and do this until we get a file that cannot be opened. So we get files that are invalid, and we get a uniform distribution over the corrupted files of this type.
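The corruption procedure for the evaluation can be sketched like this; `can_open` is an assumption of the sketch, standing in for actually running the file viewer on the candidate file:

```python
import random

def corrupt_until_invalid(data: bytes, can_open, rng=None):
    """Pick a uniformly random byte position and a new random value; if the
    file viewer can still open the result, discard it and try again, until
    we get a single-byte corruption that the viewer rejects."""
    rng = rng or random.Random()
    while True:
        pos = rng.randrange(len(data))
        new_byte = rng.randrange(256)
        if data[pos] == new_byte:
            continue  # same value as before - not actually a corruption
        candidate = data[:pos] + bytes([new_byte]) + data[pos + 1:]
        if not can_open(candidate):
            return candidate
```

Rejection-sampling like this is what gives the uniform distribution over rejected single-byte corruptions mentioned in the talk.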
Now, to the results. We compared our results to two reference tools, and we evaluated this for PNG files, JPG files, and PDF files. And for the image files, we compared it to PixRecovery, which is a commercial image recovery tool, an image repair tool. And for PDFs we compared it to PDFtk, which is a PDF editing toolkit that is also able to repair corrupted files.
One thing that is very surprising is that the overlap between the files that our tool is able to repair and those that the reference tools are able to repair is quite small. You see here that for PDFs, actually, less than one per cent of all files were repaired by both tools, but a significant number were repaired by either of the tools. So our approach actually increases the total success rate if it’s used in combination with existing tools.
For PNG files, our approach is actually more successful than the reference tool, and again, the total success rate is increased significantly. The same goes for JPG files. So for PNG files, the overall success rate is increased by roughly 85 per cent, for JPGs about 35 per cent, and for PDFs about 30 per cent.
These are some more detailed results for PNGs. PNG has a very nicely structured file format, which makes it nice to analyze. And these are files with one-byte corruptions, so we know exactly in which part of the file the corruptions are. And we see that some corruptions are easier to repair than others – which makes sense, because if we have a file header that basically always looks the same, we should be able to repair more of these files than for some other corruptions. And we see that we can in fact repair a lot of these files.
On the other hand, if we look at this line here … IHDR is the image header of the PNG, which stores information such as the image dimensions. And most of the corruptions that we have here under “chunk data (other)” are actually in the image dimensions. So we see that for these kinds of corruptions, we are only able to repair very, very few files; the one file here that we could repair was actually not corrupted in the image dimensions. For the image dimensions, we weren’t able to repair any of them, because this is highly data-dependent, and there is no simple if-then-else statement where you could force the direction of the branch.
So this is one type of corruption that we cannot handle well. But for things like CRC checks, the approach works reasonably well. For things that are similar or the same in all images, the approach works quite well, and also for these ancillary chunks – these are chunks that are not required by the PNG standard, so they are additional chunks that don’t have to be present in a PNG for it to be valid – and for these, we are also quite good at repairing them, because they can essentially be ignored.
Some of the limitations. As I already mentioned, the approach is not suitable for highly data-dependent corruptions. It also requires if-then-else statements for the conditions that it works with. So if we have some jump table, where the program jumps to some location that depends on some data, this will not work, because we don’t have a simple branch. And one of the most important things is that using this, we can get unwanted side effects, and it should be sandboxed in an adversarial environment: if you force the program to behave as if all input files were valid, then we also don’t do any checks any more for buffer overflows, for example. So this will very likely make the program vulnerable, and thus, it should be sandboxed if you expect an adversary to be present.
We are already at the conclusion. The approach I have presented is lightweight, and it’s a black box approach that modifies the execution of the file viewer instead of modifying the file itself. We applied the approach to PNG, JPG, and PDF files, but it is applicable to any file format. It works very well as a complement to existing tools and is able to increase the overall success rate significantly. And if you want to look at the tools, they’re available online for you to download.
Alright, I’m open to questions now.
Host: Yeah, so we’ve got a few minutes for questions.
Audience member: I have a question about the execution tree that you showed. For the first switch there, you said you don’t modify the decision, because on both sides you have valid files. What I was thinking when I was looking at this is that in a hypothetical decoder, the first check might be something like, “Which version of the file format do I have?” So if the error in the file is directly in that small part – so the file now says it’s version one instead of version two – that might be the only modification that stops this file from being correctly interpreted. And by just accepting that decision, you might go directly along the path that leads to a non-valid file, whereas if you changed exactly that spot, you might have a valid file. Did you consider that, or is it prohibited because of the number of decisions?
Wust: I think that’s actually possible, but if we consider specific corruptions like that, then we already need to take the file format itself into account, and we need more in-depth knowledge about the file format. So this sort of goes against the black box approach, where you don’t want it to be necessary to have this detailed information. What we can do in this case is that if you know that the file format is version two, then you can just train the tool with files of version two, and then it’s likely – although not guaranteed – that it will have the same behavior for all files.
Audience member: I understand that you measured the success rate. But I imagine that the outcome of the reference tools and your outcome might also differ in quality. Did you measure the difference in quality?
Wust: Well, it’s difficult to measure the difference in quality. What I can say is that of the images repaired with our approach, most actually looked identical to the original; some have small artefacts. With PixRecovery, the results were overall a bit worse. What we did to compare our outputs with the originals for PNGs and JPGs is we used this keyhash library, which is a library that produces a visual hash. It has different metrics that compare the similarity of images, so they are robust to some changes. For example, if you changed an image from color to black and white, it would still recognize it as the same image. So we used this, and we also did a visual inspection, and in the visual inspection, the results that our approach produced were in general of better quality than those of the reference tools. But of course this is not an objective metric.
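The idea of a visual hash can be illustrated with a hand-rolled average hash; this is not the library mentioned in the talk, just a minimal sketch of the technique, assuming the image is already a small 2D grid of grayscale values:

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (e.g. an 8x8 thumbnail).
    Bit i is 1 if pixel i is brighter than the mean brightness, so small
    pixel-level changes leave most of the hash intact."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming_distance(h1, h2):
    """Number of differing bits; a small distance means the two images
    are visually similar."""
    return sum(a != b for a, b in zip(h1, h2))
```

Comparing the hash of a repaired image against the hash of the original then gives a rough, change-tolerant similarity score.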
Audience member: You have these if-then-else statements, but programs have more than if-then-else statements – so what happens with loops?
Wust: This approach works on the binaries, and in the binaries, we don’t actually have loops or if-then-else statements; we have jumps. And there, it doesn’t really matter that much. For loops, we usually have two jumps – one jump that goes back to the beginning, and one jump to exit the loop. And since usually, or hopefully, with valid files we will exit the loop at some point, at these jump statements both directions will be observed. So for loops, in most cases, although not necessarily all cases, this should not affect the behavior of the loop.
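The point about loops can be made concrete with a toy model (again a simplification, not the actual Pin tool): the conditional jump of a loop is observed in both directions across valid runs, so a unanimity filter over recorded branches would leave it untouched:

```python
def loop_jump_directions(chunk_count):
    """A viewer loop like 'while more chunks: process chunk' compiles down
    to one conditional jump: taken back to the loop head while chunks
    remain, and falling through to the exit at the end."""
    return ['back'] * chunk_count + ['exit']

def directions_seen(chunk_counts):
    """Collect every direction observed at the loop's jump across valid
    files with different numbers of chunks."""
    seen = set()
    for n in chunk_counts:
        seen.update(loop_jump_directions(n))
    return seen
```

As soon as two valid files iterate a different (nonzero) number of times, both `'back'` and `'exit'` appear at the jump, so the branch is never forced and the loop keeps its normal behavior.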
Audience member: So basically, you want to shortcut any test that leads to an error?
Wust: Yes, more or less.
Audience member: Yeah, so I was wondering, why would you need to train the program at all? Instead of looking at the tests, you could look at the possible places where you see that you have error messages or things like this, and then go backward and see which tests could lead to these messages. This way you wouldn’t have to train your system, because you would have found all possible errors.
Wust: Yes, for that you first need to identify what an error looks like. This depends on the file viewer. In a lot of the image viewers, an error doesn’t really produce an error in the sense that we get an output we can programmatically recognize as an error; we just get a GUI that displays, hey, there was an error. So we don’t get the error as a return value, because the program hasn’t finished yet. So it is actually somewhat difficult to identify all of the points where you get an error. But there is an approach that is somewhat related to what we did, called Docovery, where they used symbolic execution to repair files. And that sort of goes in that direction, because you replace some parts of the file with symbols, and then basically want to solve for these symbols in a way that they don’t produce an error. The drawback of that is that it takes a lot more time: to repair one file, you need to do the whole symbolic execution, which is computationally very expensive. But it’s also an approach that works surprisingly well. So these are independent approaches, I’d say.

Host: Any other questions? Maybe from this side of the room? Or are you too busy enjoying the view? [laughter]

Audience member: One more question. You showed statistics that were grouped by file type, and your approach always used one program. Did you try to use different PNG viewing libraries?

Wust: No, this is always regarding one program. Because this is actually something that we found when we did the research – when I started out, I used ImageMagick as the file viewer, and I generated a lot of corrupted files with ImageMagick, and it turned out a few weeks later that a lot of the files that ImageMagick wasn’t able to open were easily opened by other file viewers. So we then switched to another file viewer that was more robust, in order not to get false positives in the results. So this is always regarding one file viewer, where the files are corrupt with respect to the specification that is implicit in this file viewer. Even though we have a PNG specification, not all PNG file viewers behave the same way when opening a PNG file.
While they all open files that adhere to this specification, some may be more lenient, and some less. And so a file that is corrupt with regard to one file viewer might not be corrupt with regard to another file viewer. That’s why we concentrated on one file viewer. But in practice, I think you could probably increase the overall success rate even more if you used different libraries for these files.

Host: Alright, I think we’ll have to stop there. Thank you very much, Karl. [applause]

End of Transcript