Oisin Boydell discusses his research at DFRWS US 2018.
Oisin: Good morning, everyone. Can you hear me? My name is Oisin Boydell. I’m from CeADAR, which is the Centre for Applied Data Analytics Research, and that’s at University [00:15]. I’m going to talk about our paper, which we call ‘Deep Learning at the Shallow End: Malware Classification for Non-Domain Experts’.
To give it a context – malware analysis and detection and classification face a number of challenges, and these are related to the huge volume and variation of malware and data that is present. This is very dynamic. Malware is constantly changing, it’s constantly adapting. There’s definitely a cat-and-mouse between malware creators and developers, and tools and approaches to prevent that and analyze that. Also, it requires very deep domain expertise. So, the techniques, the tools, the expertise that is required in order to analyze it and to understand what it’s doing, and the different methodologies for measuring what it does and how it acts is very specialized. This is also very time-consuming.
For example, running malware in a sandbox environment, measuring its different impact, and [culling] out different features and behaviors of what that’s actually doing, that takes a long time. And this makes it very hard to scale. As there’s more and more different types of malware, the issue is becoming more difficult to scale [01:44] volumes.
These traditional approaches require very specialist tools, computational resources – so, we’re talking things like virtual machines, specialized sandbox environments that need to be configured, isolated networks where the monitoring of network communications can be done, which is completely separated from [broader] connected networks. And the time as well. Malware often needs to be executed in real time for analysis. When you want to determine its impact on the underlying hardware and network traffic and so on. And it needs to be run in real time, as it would be executed in a real environment. And then the expertise around it is … it takes a huge amount of expertise. This diagram on the right shows a number of different approaches.
Might be a bit hard to read at this resolution. There are many different approaches and techniques that are commonly used. There’s things like dynamic analysis, you’re building a virtual machine, you’re testing using a malware sandbox, you’re monitoring the activity, you’re monitoring the processes, you’re detecting DNS activity, and you’re looking at things like [03:03] strings from starting up malware, you’re analyzing the link libraries, and [disassembly] and so on. All this takes a huge amount of time, it’s hard to do, there’s a lot of expertise and computational resources [03:18].
Machine learning in malware analysis is a newish area. Currently, there’s a lot of interest and ongoing research around using machine learning for malware analysis. There’s a number of good surveys on this. For example, in 2017, this survey on usage of machine learning techniques for malware analysis, and there’s another from 2014. And in our paper, we’ve provided a sort of high-level review of machine learning techniques in this area.
Machine learning has been used to automate and also improve the [efficacy] of many malware analysis tasks. And particularly, the last speaker was talking about an interesting approach in this field. It’s particularly been applied to malware classification – that is, once you have a number of different types or classes of malware, you automatically infer for new examples what class this particular piece of malware belongs to.
And just to give an overview of the machine learning and a bit of context for our work – this is a typical workflow and how a machine learning task operates. You start with a labelled training set, so in this case, these could be different examples of malware, which are each labelled with a different malware class. These then … the machine learning algorithm then trains over these examples, produces a trained model. This trained model then has learned the patterns and the different expressions of these different classes of malware.
So, then, when you get a new example, of an unknown class, you can query the model, and it will give a prediction of which is the correct malware class for that example. This obviously depends on the scope of your training dataset and the classes that it contains, and the capabilities of your machine learning algorithm and so on.
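The train-then-query workflow described here can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the byte-histogram feature, the nearest-centroid "model", and the two made-up classes are not the method from the talk, only a stand-in that shows a labelled training set producing a model that is then queried with a new example.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Normalised frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(value, 0) / total for value in range(256)]

def train(labelled: list) -> dict:
    """'Training' here just averages the histograms of each class."""
    sums = {}
    sizes = Counter()
    for data, label in labelled:
        acc = sums.setdefault(label, [0.0] * 256)
        for i, v in enumerate(byte_histogram(data)):
            acc[i] += v
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def predict(model: dict, data: bytes) -> str:
    """Return the class whose average histogram is closest to the new example."""
    hist = byte_histogram(data)
    return min(model, key=lambda label: math.dist(model[label], hist))

# Hypothetical labelled training set: (raw bytes, class label).
training_set = [
    (bytes([0x90] * 60 + [0xCC] * 4), "class_a"),
    (bytes([0x90] * 50 + [0xCC] * 14), "class_a"),
    (bytes(range(64)), "class_b"),
    (bytes(range(32, 96)), "class_b"),
]
model = train(training_set)
prediction = predict(model, bytes([0x90] * 55 + [0xCC] * 9))
print(prediction)
```

As in the talk, the prediction can only ever be one of the classes present in the training set.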
This diagram [… this is] a crucial step here, and it’s a simplified view. In the majority of traditional machine learning algorithms, what’s required is feature engineering, [an] extraction step. Traditional machine learning algorithms, they can’t operate directly on the raw data – so, in this case, the raw binary malware files. They need to operate on higher level features. But the extraction of these features is very manually intensive work. It requires a lot of expertise and understanding of what that data means and what kind of features are useful for the classification task.
So, now we run into a scenario where we need both the domain expertise to understand malware, to disassemble it, to run it in different sandbox environments, to measure the activity of it, to [06:23] these higher-level features. In addition to that, we need the machine learning expertise as well. So, there’s a lot more complexity to do machine learning for [this type] of malware analysis.
And as I mentioned, the generation of these features is still very manual. And some of these typical kinds of features that are used in malware, machine learning classification is [06:51] features such as [processor instructions], strings [and other static] resources that are contained in the code, library [imports] and [06:59]. And this will be done from statically viewing the malware files. So, for example, using maybe [disassembly tools, etc.]
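One of the static features mentioned here, embedded strings, can be pulled out with a few lines of code. This is a minimal sketch in the spirit of the Unix `strings` tool; the function name, the four-character minimum, and the sample bytes are assumptions for illustration, and real feature pipelines involve far more than this.

```python
import re

def extract_strings(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII that are at least min_length bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]

# Hypothetical fragment of a binary: short runs such as "MZ" and "ab" are skipped.
sample = b"\x00\x01MZ\x90\x00kernel32.dll\x00\xff\xfeLoadLibraryA\x00ab\x00"
found = extract_strings(sample)
print(found)
```

Deciding which of the extracted strings matter for classification is the part that still needs domain expertise.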
Then there’s also dynamic features. These can be dynamic system API calls, also interactions with other resources, such as memory and storage, as the malware is actually running. The previous speaker, their approach was very much focusing on this interaction with the memory. Then there’s the network communications. But these are still the same kind of features and the same kind of tasks that traditional malware classification relies on, which, as I said, needs a huge amount of domain expertise.
To introduce deep learning – what deep learning is, it’s a type of machine learning that’s based on artificial neural networks. It’s called deep learning as there’s many levels and layers of these artificial neural networks, which allows these models to simulate very complex relationships between the data and the output. So, they’re able to model very complex functions.
A very key feature of deep learning is its ability to operate on very low-level raw data representations. This is a diagram here [08:23] to represent this. For example, with the traditional machine learning approach, in the top, we see in this example … this could be, for example, to classify whether an image contains a car or not. So, we have the input, which is an image, for example.
Then, there’s this stage of feature extraction, which requires manual knowledge, manual effort, to extract the various features that may be in that image that the algorithm can then operate on. This goes into the classification model, to learn that model. And then, this can provide predictions as to whether or not an image contains a car or not, in this example.
However, if we contrast this to deep learning, the deep learning model is able to do this feature extraction part and the classification part all as a single operation, and it learns the features automatically from the data that corresponds to the output and the function that it’s trying to learn.
We can look into that in a little bit more detail here. We see on the left-hand side typical rule-based systems where you have hand-designed a program, domain experts are hand-designing the various rules in the system. And going to classic machine learning, where we still have hand-designed features, but in this case, the model learns [the mapping] from these features to the output. However, on the right-hand side here, we have representational learning, and this includes deep learning on the right-hand side, where the input … we can give the models the raw input, they can infer the simple features, they can build additional layers of more abstract features from the simple features, and also learn [the mapping] from these to the [output].
And deep learning, because of this ability and also its ability to map and to learn very complex associations, it’s rapidly achieved state-of-the-art performance across a broad range of application areas. There’s quite a lot of publicity, [this is] quite well known, so for example, in image analytics, [the machine vision], we have huge successes in object detection and image classification. It’s completely transformed that field. We see it in things like playing human games, so for example, Go. A number of years ago, Go was considered [a game that might be] quite a number of years off before computers would be able to beat the top human players, but this has happened in the last year, recently, with [11:00]. And we’ve also seen applications in the medical domain. For example, there’s an image there analyzing cancer from medical imaging. So, this is [based back on the] image analysis [angle]. So, Google [translation] services are very much based on this. So, deep learning has a lot of different applications.
We’ve applied this deep learning approach over the raw data [11:23] classification. So, we’re looking at building these models on the static, raw malware executable data – just the raw, static data. And what we have here is … this is just a visualization of a couple of examples of malware files, where we’ve visualized [in] the raw malware bytecode in terms of [11:46], and we’ve put these into a 2D representation. So, this just shows that between different malware examples, there is a lot of patterns, a lot of information in that, even looking at it from this context. So, we can see there’s a lot of information here, and this is the kind of raw data that machine learning and particularly deep learning algorithms can really harness and take advantage of. And this is a data-driven approach, so the idea is to allow the model itself to [deep learn and model,] to learn the features from the raw data itself.
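The kind of 2D view described here can be approximated by laying the bytes of a file out row by row and treating each byte as one greyscale pixel. The sketch below is an assumption about how such a picture might be produced, not the code behind the slide; the row width of 16 and the ASCII rendering are chosen only to keep the example small.

```python
def bytes_to_image(data: bytes, width: int = 16) -> list:
    """Arrange bytes row by row; each byte (0-255) becomes one greyscale pixel."""
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # zero-pad the final partial row
        rows.append(row)
    return rows

def render_ascii(image: list) -> str:
    """Map each pixel to one of ten characters, from dark to bright."""
    shades = " .:-=+*#%@"
    return "\n".join(
        "".join(shades[pixel * len(shades) // 256] for pixel in row)
        for row in image
    )

# Hypothetical 84-byte "file": a ramp of values followed by a run of 0xFF.
sample = bytes(range(0, 256, 4)) + b"\xff" * 20
image = bytes_to_image(sample, width=16)
print(render_ascii(image))
```

Regions with different structure, such as the ramp and the constant run in this sample, show up as visibly different textures.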
Just back to the motivations – what are we doing this for? Well, there’s no deep malware domain expertise required. So, we’re running the model directly on the raw bytecode of the malware files. There’s no sandbox environments, no disassembly, no need to manually identify and extract the features requiring that expertise, to know what to look for. It’s easily adaptable to new types of malware and class types, so it just requires [labelled training data]. And the speed of it as well – there’s no need to actually run or disassemble the code, this whole feature extraction process isn’t required. And the classification is purely based on the [13:00] bytecode.
But the question is, well, how can an approach that’s based purely on this [13:06] bytecode and essentially ignores a lot of this human malware domain knowledge, and work that’s been done in determining what features work well for machine learning-based malware [classification], how could this be any good? Well, in our evaluation, we tried three different model architectures. Convolutional neural networks (CNNs), which are commonly used for machine vision and image analysis, so these work on very small, localized regions in the data, to find localized patterns. We also evaluated a model which had convolutional neural network layers followed by a unidirectional long short-term memory (LSTM) layer. What this is able to do is to find longer distance associations between these underlying patterns [that the CNN architecture can infer], and you find longer distance associations across the malware file. And we also tried a third model, which used a bi-directional [LSTM layer] on top of the CNN.
The [data that we used] was from Kaggle – there was the Kaggle Microsoft Malware Classification Challenge back in 2015. The URL was there, it’s a public dataset. This was over 400 GB when it was uncompressed. It contains malware examples in nine different label classes and has over 10,000 malware files as raw bytecode, which have the labels [14:40]. And then there’s a test set without labels. So, for our work, we were focusing on the training set, as the original challenge closed in April 2015, so we had to use the examples that had the labels provided.
There was some required data preprocessing. So, although our approach, it’s very much designed [to work] on the raw static malware bytecode, we did need to do some preprocessing. But this is easily automated, and it doesn’t require a particular analysis [per] different types of classes, for example. What you see here is a histogram of the file size of the malware examples in the dataset, and as we can see, there’s a range of sizes, range of quite small [15:31] to quite large files [in the scenario]. One of the restrictions of the types of deep learning architectures that we’re using is that all the samples for training and for classification need to be a common size. So, what we did is we used OpenCV, which is an image analytics library, to compress each sample down to a length of 10,000 bytes. This is a lossy compression in itself, so it’s again losing quite a lot of information from the raw original file, the raw data.
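The idea that convolutional layers look at small, localized regions can be shown without any deep learning framework. The sketch below is only an illustration of that one building block: a single hand-set filter slid along a short sequence, followed by a ReLU and max pooling. The signal values and the filter weights are assumptions; in a real network there are many filters, their weights are learned from the data, and recurrent layers such as the LSTM mentioned above would sit on top.

```python
def conv1d(signal: list, kernel: list) -> list:
    """Slide the kernel along the signal, taking a dot product at each position."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]

def relu(values: list) -> list:
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]

def max_pool(values: list, size: int) -> list:
    """Keep the strongest response in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Hypothetical byte values scaled to 0..1, with a low-to-high step at position 3.
signal = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1]
edge_kernel = [-1.0, 1.0]  # hand-set for illustration; a CNN learns its weights
response = relu(conv1d(signal, edge_kernel))
pooled = max_pool(response, size=2)
print(response)
print(pooled)
```

The filter responds only where the local pattern it encodes (a rising step) occurs, and pooling keeps that response while shortening the sequence.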
But we used an image, this image compression technique because this type of compression is designed to preserve patterns, is designed to preserve the information in the data, as it was [… as they’re used] on image data.
Another advantage of this as well is that to train the deep learning model on the original size of data would be very intensive in terms of computational resources. So, by reducing the size of the training examples, it just makes it more feasible for training.
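The fixed-length step can be sketched as a one-dimensional resize. The talk names an image library for this but not the exact call or interpolation mode, so the function below is a pure-Python stand-in under that assumption: it averages each bucket of input bytes when shrinking and repeats bytes when stretching. The function name and bucket scheme are illustrative only.

```python
def resample_to_length(data: bytes, target: int = 10_000) -> bytes:
    """Lossy resize of a byte sequence to exactly `target` bytes.

    Shrinking averages each bucket of input bytes; stretching repeats bytes.
    """
    if not data:
        return bytes(target)  # nothing to resample, return all zero bytes
    n = len(data)
    out = bytearray(target)
    for i in range(target):
        start = i * n // target
        end = max(start + 1, (i + 1) * n // target)  # bucket is never empty
        bucket = data[start:end]
        out[i] = sum(bucket) // len(bucket)
    return bytes(out)

# Hypothetical 51,200-byte file reduced to the 10,000 bytes used in the talk.
big_file = bytes(range(256)) * 200
fixed = resample_to_length(big_file)
print(len(big_file), "->", len(fixed))
```

Because every output byte summarises a neighbourhood of the input, coarse patterns survive while fine detail is lost, which is the trade-off described above.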
There’s also a [class imbalance] in the dataset. As you can see here, there are nine different classes of malware, and there’s a huge variation in the number of examples, where some of those have a very small number. And this can cause problems [in the machine learning] context. So, we applied two approaches to this. The first approach, we simply preserved this class imbalance, so we took the data as it was [17:08] in our training. And the second approach, we used resampling and oversampling to balance this representation in the training data.
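The second approach, rebalancing by oversampling, can be sketched as random duplication of minority-class examples. The exact resampling scheme used in the work is not given in the talk, so the function below is an assumption: it duplicates randomly chosen examples until every class is as large as the biggest one. The class names and file names are made up.

```python
import random
from collections import Counter, defaultdict

def oversample(samples: list, labels: list, seed: int = 0):
    """Randomly duplicate minority-class examples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced dataset: six, two and one examples per class.
labels = ["A"] * 6 + ["B"] * 2 + ["C"] * 1
samples = [f"file_{i}" for i in range(len(labels))]
balanced_samples, balanced_labels = oversample(samples, labels)
counts = Counter(balanced_labels)
print(counts)
```

When this is combined with cross-validation, duplicating only within the training portion keeps copies of the same file out of the held-out portion.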
To show the results – we used five-fold cross-validation on the dataset, and you can see here the different [accuracies] for the different architectures that we used and the two sampling approaches for each architecture. The best performing [accuracy] [17:36], which is the measure of [accuracy] that’s balanced by this class imbalance, to give a more accurate [prediction] of that.
Well, we can see the top-performing model that we had was the [CNN] with the [17:51] and using the rebalanced dataset for training, where we got 98.2% accuracy. So, this sounds pretty good. But what we needed to know was what’s the context of these results? Or how do these results compare to other approaches?
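Two pieces of this evaluation setup can be sketched briefly: splitting the data into five folds that each keep the class proportions, and scoring with an accuracy that accounts for class imbalance. The transcript does not name the exact metric, so the sketch assumes one common choice, the mean of per-class recall; the fold assignment scheme and the toy labels are also assumptions.

```python
from collections import defaultdict

def stratified_folds(labels: list, k: int = 5) -> list:
    """Assign example indices to k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def balanced_accuracy(y_true: list, y_pred: list) -> float:
    """Mean of per-class recall, so small classes count as much as large ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical labels: ten examples of class A and five of class B.
labels = ["A"] * 10 + ["B"] * 5
folds = stratified_folds(labels, k=5)
# A model that always answers "A" is right 10 times out of 15,
# but its balanced accuracy is only (1.0 + 0.0) / 2.
score = balanced_accuracy(labels, ["A"] * 15)
print(folds)
print(score)
```

In each round of cross-validation, one fold is held out for testing and the other four are used for training.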
Two similar approaches that are related are … from this paper … first one [we found], Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. In this work, they used traditional machine learning approaches and algorithms, and they also used features engineered from the disassembled binaries, and from other more traditional feature extraction processes. But they also combined this with classic visual image analysis. So, those visualizations of the data that I showed previously, they did a similar thing on the data and did classic image analytics, image classification approaches, and features from that using [18:52]. And they were able to obtain 95.5% accuracy using the exact same five-fold cross-validation approach that we did. So, we were able to beat that at least, with 98.2%, using our approach.
Another related approach was using convolutional neural networks, similar to our approach. So, what they did is they just used the basic convolutional neural network on that raw binary data, they didn’t use the compression. They used a much more complex model, with many more variables in it as well, so it was less of [… practical from a training and … with real-world practical use].
So, they measured their accuracy using Log Loss, using the [19:44] Kaggle dataset. There’s more information in the paper about [these particular] metrics. So, they got a public score of Log Loss of [0.11] and a private score of [0.13], whereas when we evaluated our approaches in the exact same methodology, we were able to get Log Loss much less for both the public and private segments of the data. So, there’s more information [about that] in the paper. Our approach was able to better these similar approaches [20:14] in the context of how our results compare.
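For readers unfamiliar with Log Loss, a minimal sketch of the multi-class form follows: the average of the negative log of the probability the model assigned to the true class, so lower is better. The clipping constant and the example probabilities are assumptions for illustration, not values from the paper or the challenge.

```python
import math

def multiclass_log_loss(y_true: list, probabilities: list, eps: float = 1e-15) -> float:
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for true_index, row in zip(y_true, probabilities):
        p = min(max(row[true_index], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Two hypothetical examples with three classes; true classes are 0 and 1.
confident = multiclass_log_loss([0, 1], [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
uniform = multiclass_log_loss([0, 1], [[1 / 3, 1 / 3, 1 / 3]] * 2)
print(confident, uniform)
```

A model that spreads its probability evenly scores ln(number of classes), which for nine classes is about 2.197, so scores near 0.1 indicate confident and mostly correct predictions.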
Also, I was interested [20:21] the practical runtime considerations, because we were very much trying to simplify and enable the malware classification and malware analysis to be a lot more practically used. This shows the training time – so, our most complex model, we have a training time of just under two hours, and this is on some quite specialized hardware [20:45], which can speed up the deep learning training time. But to put this in context, this is an offline process, this training only needs to be done once, it can be easily updated as new examples arrive, which doesn’t take the full training time. And this is done offline, so this can be done … it’s not required when you’re classifying new malware examples. Whereas if we’re classifying a single new binary file, it only takes an average of 20 milliseconds, and that’s on a pretty standard machine using CPU.
So, this shows that this kind of approach can be used in live, online, [near real-time] contexts where you might be classifying malware or a [stream of files], as is seen for example in a real live context.
To summarize our approach: Our approach, it doesn’t require deep domain knowledge of malware. It doesn’t require time, tools, and resources for the complex feature extraction, which is the common methodology for malware analysis, and particularly machine learning for malware analysis. Classifying new instances is very fast, so it’s practical in online, live, and near real-time applications. And it’s also [scalable to] identifying new malware types. So, you just need the labelled examples and the model can be incrementally trained on those. And we’ve shown it can achieve very high accuracy as well.
Just in conclusion – we’d like to evaluate [22:25] datasets, and we’re particularly interested in binary malicious/benign classification tasks. So, this will be used in the context of a malware filter and detection application that’s running in real time. We’d like to explore the capability to identify and record the similarity between malware classes and variants. So, to use this trained model as an analysis approach to analyze how different types of malware relate to each other. And we think this [has some interest] there.
And we’d like to apply to the task of determining the type of binary [packing use, so the packing approach]. We’ve had some discussions with the National Cybersecurity Centre in Ireland, and they actually proposed this as a task which they found took quite a lot of manual work to do, and [slowed] their workflows. So, to provide that.
And my background isn’t in forensics, I’m coming from a machine learning background, so I’d be very interested to talk with people about potentially other applications that [23:33] we are not aware of.
Finally, if anyone has any questions. […], so people can rerun these examples and the data [23:47] available from Kaggle. Thanks.
[applause]
Oisin: Sorry, [could you say that again]?
Host: The question was: In your opinion, why do you believe malware classification has problems [24:14]?
Oisin: Yeah, I think it would work in a number of contexts. One is for the analysis of malware, so understanding how different classes maybe relate to each other. And I can also see that it could be used in terms of tools to filter and [block] malware, so in determining which different types of malware are being encountered. And to do that [24:42] in terms of more the binary malicious versus benign, [having that] as a prediction on [24:51].
Audience member: [Inaudible]
Oisin: Yeah, well, I’m not sure of the details of how that dataset was generated, and I’m sure that dataset doesn’t … it represents quite a small view on malware. There’s only nine labels there. I presume that’s [created] from Microsoft Malware Analysis teams. Yeah, I do agree it’s a very restricted dataset. We’d like to try the approach on a much more expansive raw dataset.
Audience member: [Inaudible]
Oisin: Yeah, well, I suppose, firstly, the results show that a lot of the information about which class a particular piece of malware belongs to, at least in this dataset, is in that raw [binary, the static raw binary], and that can be learned by these particular types of deep learning model. So, that information is there to an extent, and that could be leveraged. That seems to be what the results show for us. And that’s including this kind of compression, which loses a bit of the information. I presume to achieve higher accuracy [where that bit’s] missing, to reach higher accuracy … I presume one interpretation could be [27:44] you’re going to [lose] some information by doing that. So, maybe that missing information is … I mean, I have to say that we were quite surprised by the results that we were able to get using this raw data.
That’s something for us to do, actually, to interpret a bit more. Because there’s ways to look into the deep learning model to analyze what those features that [it can learn] are, and to maybe relate [that to the original] files.
So, that’s [28:16] that scenario, to understand. And that’s something maybe that would be good to work with [someone who has] more in-depth expertise of how malware works, because that’s not our background and we don’t have [28:33] to do that.
Audience member: I’m curious of how or why this approach would work for [packed] malware, because I would assume that malware [28:43] encrypted, most of the information about the features, I would think it would be [completely obfuscated].
Oisin: Yeah, so we’re not sure in the dataset … I don’t know [whether] the samples are [… were all packed or maybe …] we’re assuming they were all [29:04] the approach [29:08] task of predicting or inferring which [29:13] various binaries. So, the model [and learning] the features of the different [29:23] approaches to be able to tell [if they’re] encrypted from [29:26] the file to saying it’s likely to be packed, [if you do this].
Audience member: I would assume that that would be successful, just based off the unpacking stuff, [29:33] if the data is [completely random], there still needs to be some but that’s not going to give you information about whether this is [29:43] or ransomware or …
Oisin: Yeah, that might be a two-step process [if] we’re dealing with [29:50] packing mechanism, unpack that, and then work [29:59].
Audience member: That makes a lot of sense.
Host: [Could you transfer correlation to the file size] [30:07]?
Oisin: No, we haven’t done that. We were working with the data that was compressed down to this specific length, in our case 10,000 bytes. There was a couple of reasons that we did that – firstly, the deep learning models and architecture we’re using required [30:26], now there’s ways around that, but we didn’t want to add another complexity. And secondly, it just enabled our approach to be more scalable, because that specific size and number of bytes that we chose was something that we could feasibly build models on. So, [yeah], we didn’t look into analysis further of those sizes. We just used … kind of went from [this standardized item].
Host: Because I think this is really cool work. I think this is the first time I’ve seen deep learning applied. But I’m concerned about those two [30:59] if each of these malware classes used a different [packer], then your results would be [due to the packer]. Similarly, if it’s correlated to file size … I’m not familiar with [31:15], but that may be part of what the success is due to.
Oisin: Yeah, no, I understand that, those concerns.
Host: [31:27] since so many of your samples … and it’s hard to say, because your diagrams start from zero. So, surely [none of them are zero bytes]. So, it’s kind of unclear, 10,000 [31:41] but it’d be also interesting to look at what if you just looked at the first 10,000 bytes of each of those? Because you’d still get most of your sample there.
[crosstalk]
Oisin: Yeah. We did actually try an approach where we just truncated the file to the first 10,000 bytes. And …
Host: [Did it work]?
Oisin: And yeah, it wasn’t as successful as this approach. And then, what we did for the files that were smaller – because there was a number that were smaller than that size – [we put zeroes on the rest]. And it wasn’t as successful as when we used this approach.
But I agree, there’s lots of scope for further testing of this approach on different kinds of datasets and further analysis.
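The alternative described in this exchange, keeping only the first 10,000 bytes and filling shorter files with zeroes, is simple enough to write down. The function name is hypothetical and the sketch assumes the zeroes are appended at the end of the file, which the answer implies but does not state in detail.

```python
def truncate_or_pad(data: bytes, target: int = 10_000) -> bytes:
    """Keep the first `target` bytes; right-pad shorter inputs with zero bytes."""
    return data[:target].ljust(target, b"\x00")

# Hypothetical inputs, one shorter and one longer than the target length.
short = truncate_or_pad(b"abc", target=5)
long_ = truncate_or_pad(b"abcdefg", target=5)
print(short, long_)
```

Unlike a resize over the whole file, this discards everything after the cut-off, which may be one reason it was reported as less successful.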
Host: Any further questions? Alright, thank you.
End of transcript