ElMouatez Billah Karbab discusses his work at DFRWS EU 2018.
ElMouatez: Hello, everyone. My name is Karbab ElMouatez. The title has been changed: it is not [00:16] detection, according to one of our reviewers; it is just Detection.
So, my name is Karbab ElMouatez, from Concordia University, Montreal, Canada, and I will be presenting a framework for Android malware detection. Specifically, I will be applying deep learning techniques to the API calls to detect malware.
The outline: we start with the introduction, then we define our problem, the objectives, and some gaps in the related work; we present our work, evaluate it in different settings, and see whether this evaluation meets the requirements and our objectives. Finally, the conclusion.
So, mobile devices are everywhere: we all have smartphones and smart applications, and they are at the core of our lives. But this comes at a cost: a lot of malware targets our mobiles. On top of that, the Android operating system takes a market share of 55% of all the devices out there, because it is open source and many companies have embraced it. But this, too, comes at the expense of malware: let's say 90% of mobile malware targets the Android operating system.
What is our problem? We have a lot of malware targeting not only devices but also markets, app stores like Google Play Store, and we are looking for an Android malware solution that could detect this malware. So, let's put the context: we are looking for tools that leverage static analysis, where we don't have any execution and use only the actual APK to extract features and detect malware.
So, we are looking for an accurate solution; in forensics we cannot afford a lot of false positives, that is a burden. We are also looking for automatic feature engineering: the features of last year will not necessarily be the features of this year, so we want some automatic feature engineering, and this will give us time resiliency. Also, in the context of forensics, let's say we know that this application is malware, but we need more information: what type of malware is it? Is it ransomware, is it a botnet? Knowing the kind of malware lets us prioritize our actions towards these threats.
Also, we need it to be efficient, so we can deploy it in markets, app stores, and on mobiles, with different deployment strategies for mobile, on the device or off it.
Let's look at the related work [03:55]. I just selected two papers from top-tier conferences in security. Drebin, in 2014, collects features like used permissions, suspicious APIs, and network addresses, and feeds these features to a classifier, an SVM classifier, to get the decision. But if you look, the suspicious APIs change over time: today it is suspicious, next week it is not. As an example, power consumption was not considered suspicious in Android applications, but in 2016, I remember, there was a paper that used power consumption to find your location. So it changes over time.
Let's summarize. It is not that accurate by today's standards, but in 2014 it was very accurate; it was very good work at that time. It does not have automatic feature engineering: we have to explicitly select the features and feed them to our system, and the next year, if things change, we need to select new features. Time resiliency is not provided, nor is threat attribution, but it is an efficient solution and could be deployed everywhere, relatively speaking.
Another work, from last year, NDSS 2017. It is very nice work: it leverages the call graph of the actual malware. They get the call graph and build sequences; they do sequence extraction, as you see here. The nice feature is that they leverage Markov chains: they build a Markov chain model from these sequences. And you see that they use only the packages, like java.lang and android.util.
So, they take these numbers as a vector, as you see, feed it to a random forest, and do the classification. The nice thing here is that they have automatic feature engineering using the Markov chain. It is accurate, very accurate. Time resiliency: they provide some detailed evaluation of it in their paper. But it does not provide attribution: we cannot see which type of malware it is. As for efficiency, in their paper they mention that it takes about 20 minutes for one application at the maximum, but on average 10 minutes, I think. And [06:55].
So, in our case [06:58], first we need to formulate our problem: we have malware, and we give it to our model. We formulate an Android application as a sequence of API calls, but with more granular features: instead of having android.net connectivity at the package level, we take a more granular view, the actual method, which can give us more information about the malicious behavior. This is in contrast with MaMaDroid, from 2017, which uses only the package, android.net.
So, how can we collect this sequence, and in what order? The first thing we use is the basic blocks [07:48]: we have multiple basic blocks, and from each block we get a sub-sequence. The order is important, as you see here. In this image, each bar is one function, one API call; we give each API a unique number and plot it as a bar. Each color is the API call sequence within one basic block. What is important here is the order within the basic block: the malware executes this, then this, then this. But the order between basic blocks is not important; I mean, we believe it is not important. This is why we don't need the control-flow graph: we just scan the whole application from beginning to end and focus on the order within each basic block.
At the end we will have this thing, which is a sequence where each bar is one API call, and the order is kept only within a basic block; we don't keep the order between one basic block and another. So, we can formulate this problem as a time series, and we leverage a 1D convolutional neural network to do the classification. As you see here, we use raw features: we don't do any explicit feature engineering, we just give it the sequence as it is, and all the burden of feature extraction and selection is on the 1D convolutional neural network.
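A minimal sketch of this sequence construction in Python; the iter_basic_blocks helper is hypothetical, standing in for whatever disassembler enumerates the DEX basic blocks in scan order:

    # Sketch: build the API-ID sequence. Order is kept only inside each
    # basic block, and blocks are taken in plain scan order (no CFG needed).
    def build_sequence(apk_path, api_ids, iter_basic_blocks):
        sequence = []
        for block in iter_basic_blocks(apk_path):   # hypothetical helper
            for api in block:   # e.g. "android.net.ConnectivityManager.getActiveNetworkInfo"
                if api not in api_ids:
                    api_ids[api] = len(api_ids) + 1  # 0 stays reserved for padding
                sequence.append(api_ids[api])
        return sequence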
Let's look at the overview of MalDozer. We get an APK, which is the Android package. We extract the DEX, decompile it, scan it from beginning to end, and find all the API calls; we keep these sequences of the API method calls. We train on the GPU, get the model, and put it on any of the devices you see here; we have put it on servers, which is the market-store case, and on phones. We mention IoT here because we use a Raspberry Pi, which is a grey area between IoT and mobile; this is why we mention IoT.
This is a graphical representation of our model, which is just a generic model. It was inspired by work from 2014 by Kim from New York University, who adapted convolutional neural networks used for text classification and sentiment analysis, and we are using that model here.
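As a rough illustration, not the authors' exact architecture, a Kim-style 1D CNN over such sequences could be sketched in Keras as follows; the vocabulary size, sequence length, and all layer sizes are assumptions:

    # Sketch of a Kim-2014-style 1D CNN over API-call sequences.
    # VOCAB_SIZE and all layer sizes are assumed, not the paper's values.
    from tensorflow.keras import layers, models

    VOCAB_SIZE = 50000  # number of distinct API methods (assumed)

    model = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE + 1, output_dim=64),  # 0 = padding
        layers.Conv1D(filters=512, kernel_size=3, activation='relu'),
        layers.GlobalMaxPooling1D(),
        layers.Dense(256, activation='relu'),
        layers.Dense(1, activation='sigmoid'),  # detection head: malware vs. benign
        # For the attribution head mentioned next, a softmax over families
        # would replace the sigmoid: layers.Dense(num_families, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])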
Let's put up our checklist of what we are going to verify. Is it accurate? We will see. It has automatic feature engineering, because we give it only the raw sequences: we do no selection, and we do not tell it whether an API is suspicious or not, we don't care; we give it everything, and it is up to the network to decide. Time resiliency: we will see [11:16]. Then let's see whether it gives us threat attribution, whether it is efficient or not, and multi-scale deployment.
Just one thing to mention here: we have the detection head at the end, there you see the detection, and we also have another head for attribution, which is family attribution, or threat attribution.

Let's start with the evaluation. We have a dataset of about 200 gigs, I think more than 200 gigs, between malware and benign. The first dataset is Malgenome; it has about 1,000 samples. We use this dataset because it is one of the reference sets: it was used in a 2012 paper, a well-known paper at IEEE S&P. We also use the dataset from the Drebin paper. And we propose our own dataset, which is newer and bigger, in addition to the benign samples.
The advantage of these three datasets is that they provide the family attribution: you can find the family of each sample. You can find plenty of malware out there, but you don't have the family labels; in these datasets, you have the family. And our dataset is open, so just send us an email and you'll have it.
Let's start with the effectiveness of the solution. We use these datasets in different cross-validation settings: 2-fold, 3-fold, 5-fold, 10-fold. A fold just means taking the dataset and dividing it. Say 2-fold: divide the dataset into two parts, train on one part and test on the other, then do the same the other way around, and the result is the average. And so on and so forth for 3-fold and 10-fold.
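A minimal sketch of that k-fold protocol with scikit-learn, assuming X holds the padded API-ID sequences, y the 0/1 labels, and build_model returns a fresh compiled model:

    # Sketch: k-fold cross-validation as described (k = 2, 3, 5, 10 in the talk).
    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validate(build_model, X, y, k=10):
        scores = []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
            model = build_model()  # fresh model per fold
            model.fit(X[train_idx], y[train_idx], epochs=5, verbose=0)
            scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
        return float(np.mean(scores))  # reported value: the average over folds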
So, we see it gives us very good results, on all the datasets and in all the settings, and the input was only the raw features. Next, can it detect unknown families? We train the model without the samples of a given family, and we check whether the system can detect that family even though we did not train on it. At zero samples, it detects most of them. Then we add some samples from this family to the training set and see how the accuracy increases as we add more. And we see it increasing: the more samples we add to our dataset, the more accurate it becomes. And even with none, we can still detect the family.
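That experiment, sketched under the same assumptions, with a hypothetical per-sample family label array; the sample counts in steps are made up:

    # Sketch: hold out one family entirely, then reinject n of its samples
    # into training and measure detection accuracy on the remainder.
    import numpy as np

    def unknown_family_curve(build_model, X, y, family, target,
                             steps=(0, 10, 50, 100)):
        held = np.where(family == target)[0]  # samples of the "new" family
        rest = np.where(family != target)[0]  # everything else, incl. benign
        curve = {}
        for n in steps:
            model = build_model()
            train_idx = np.concatenate([rest, held[:n]])
            model.fit(X[train_idx], y[train_idx], epochs=5, verbose=0)
            curve[n] = model.evaluate(X[held[n:]], y[held[n:]], verbose=0)[1]
        return curve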
Now, about the time resiliency: we divide our dataset into parts by year, 2013, '14, '15, '16. We train on only one year and test on the other years. So, let's say we train it on 2013, and we look at the results on 2014, '15, '16. We see that we keep the detection performance, and these results are better than the results provided by MaMaDroid in 2017, if you make the comparison.
The most noticeable thing is that if we train it on 2016 samples only, we don't need to train it on the others; it keeps the same performance, at least in our experiments.
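The per-year protocol, sketched with an assumed per-sample year array alongside X and y:

    # Sketch: time-resiliency protocol. Train on one year, test on the others.
    import numpy as np

    def time_resiliency(build_model, X, y, year, train_year=2013):
        model = build_model()
        tr = year == train_year
        model.fit(X[tr], y[tr], epochs=5, verbose=0)
        return {yy: model.evaluate(X[year == yy], y[year == yy], verbose=0)[1]
                for yy in (2013, 2014, 2015, 2016) if yy != train_year}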
The other thing in the detection – let’s say code transformation, where you have malware, the [15:04] you want to change something there in order to deceive you, deceive your system. But what is the input of our system? It is just sequence, the sequence of the API. And what he could do, it change the order. We did this. We are doing here the shuffling. Random shuffling of the actual input of the system. And we see the accuracy of our detection, starting from 10 shuffling – so getting the sequence divided into ten parts and randomly shuffling it – to the 10 power four, which is one API and randomly split … I mean, shuffle it. So, we see that we keep good accuracy during most of the shuffling, including if we shuffle one by one. Which is … yeah.
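The shuffling test can be sketched as splitting the ID sequence into k chunks and permuting them:

    # Sketch: robustness test. Split the API-ID sequence into k chunks and
    # randomly permute the chunks (k ranged from 10 to 10**4 in the talk).
    import random
    import numpy as np

    def shuffle_sequence(sequence, k):
        chunks = np.array_split(np.asarray(sequence), k)
        random.shuffle(chunks)  # permute chunk order in place
        return np.concatenate(chunks)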
And here is another result: the model accuracy as a function of the complexity of the model. The more parameters we have, from one million to six million, the more accuracy we get. But the question is: do we keep the same efficiency as the model grows? We will see later.
Here we see the attribution: given that this is malware, tell me which family it is, and we see how accurately our system detects the family of a sample. We do very well on the Drebin and Malgenome datasets, and relatively well on the MalDozer dataset. But the issue, as you see here, is that some labels are not good. For example, for 'agent' we get 51%, which is almost random, because 'agent' is a generic name used for malware [16:57].
These are the confusion matrices. So, we have talked about effectiveness: it detects very well, but how efficient is it? We studied different hardware, from a normal server with a GPU, to a high-end server without a GPU, to a laptop from 2009, I think [17:25], and a Raspberry Pi. As I mentioned, the Raspberry Pi is just a small, small computer; it sits in the grey area between IoT and mobile, but it is very small. It is a Raspberry Pi 2, not a 3, because there is a difference.
You see here that we divided the efficiency into two parts. The pre-processing takes only about four seconds on the Raspberry Pi, which is tiny, and the detection takes almost one second, also on the Raspberry Pi. For the prediction, it is almost negligible on the GPU server.
Here are detailed results for the Raspberry Pi, the laptop, and the server. We see here, as we asked before, the effect of the model complexity on the efficiency of the system, and we see that it does not affect the efficiency that much. On the GPU it is negligible, about 0.30 seconds, which is very small. This is very good for large-scale detection. And on very small devices, on the Raspberry Pi, the six-million-parameter model [19:00] gives us about one second.
So, let's compare to the other solutions. We have good accuracy, like MaMaDroid (NDSS 2017) and better than Drebin. We have automatic feature engineering. It has time resiliency, as we showed in our evaluation. We have threat attribution, where we can see the family and the details of each family. It is efficient: we can run it anywhere, even on small devices, and deploy it everywhere. So, we check everything off our checklist.
As a conclusion, MalDozer is an automatic framework for Android malware detection. We applied it here to Android malware, but we believe the idea can be applied to different kinds of malware; we are applying it to Win32 malware. The main idea is to extract the API call sequences and give them to a convolutional neural network. We get good accuracy, and it is also efficient: only about five seconds on a Raspberry Pi 2. And it can be deployed on multiple systems.
Finally, thank you to the anonymous reviewers, who gave us very insightful comments. Thanks to DFRWS Europe 2018, which gave me a scholarship to be here, thank you so much. And thank you to NCFTA Canada, who provided us with servers and resources. That's it.
And finally, remember the two rules: no free lunch and the ugly duckling theorem. Remember these. That's it. Thank you.
[applause] [silence]

Audience member: How do you handle reflective invocations?
ElMouatez: Can you repeat that? I didn't hear.
Audience member: How do you handle reflective method call invocations?
ElMouatez: Yeah, we don't handle that. What we have as resiliency [21:58] is code transformation, where you change the order of things. But reflection, no, I don't think we handle it. I mean, let's be accurate: I didn't test this. When I test it, I will tell you.
Host: Any other questions?
[silence]

Audience member: I have a question; I'll use this microphone.
ElMouatez: Yeah.
Audience member: So, you mentioned that you monitor the malware API calls. So, if you have a brand-new family of malware that you haven't trained on, that you haven't seen before, using your API approach, that's like behavior analysis, right? So, would you be able to detect a new piece of malware based on suspicious API activity?
ElMouatez: Yeah. This comes back to our evaluation here. Let's say you have new malware: as you see here, we train on a dataset without any sample from that family, and we see here the accuracy of detecting that family. So, based on our evaluation, we can say that we could detect unknown malware. That is your [23:11]. What is the intuition behind it? The intuition is that these sequences, calling this API, then this API, then this one, this behavior is used by many malwares, even new ones. We are looking for the sequence, not for one specific thing. This is why we can detect, this is our intuition for detecting, new malware. The behavior stays the same: sending SMS to get money, or stealing data, or something like that. These types of behavior are the same.
[applause]

End of transcript