Behavioral Service Graphs

Mark Scanlon from University College Dublin presents a formal data-driven approach for prompt investigation of enterprise and internet-wide infections.

Transcript

Mark Scanlon: Okay, so this paper is about behavioral service graphs. The purpose here is to look at how we can analyze internet-wide infections and internet-wide scanning activity. This work is conducted by my friend and colleague Elias Bou-Harb in Florida Atlantic University and myself.

To give you a quick outline of the talk: first and foremost, introduction, to talk about how this stuff looks like and how it works.

[interruption]

Mark Scanlon: You can hear me better now, yeah? I’m not really loud enough.

So yeah, we’re going to start with the introduction – our motivation behind this work, to talk about exactly what the contribution that we’ve outlined in this paper… Talk about some related work. We go into far more depth in the paper itself, so not going to spend too much time on it here, we’ll just give you an idea. We’ll talk about the approach that we’ve taken and show you an evaluation of it. We have evaluated against two different types of scan. Talk about some of the limitations of this approach and some of the improvements that we might be able to come up with, and then about some future work.

So I just want to start by showing you this small video.

This video is made by CAIDA in California. What this is showing is a stealth scan by a bot net trying to analyze a range of IP addresses slash zero for open ports and seeing what software is running there. It’s testing for IP telephony, port scans, so the SIP is the protocol that they’re looking at.

And just to get an idea, I get into some flashy circles here that are moving very fast. So each frame that you’re looking at here represents five minutes, twenty seconds. What this shows is the source of the scan, the bots that are conducting this orchestrated, coordinated, distributed scan. Each circle obviously represents a geolocation, so looking up the geo-IP databases to find out where that IP address is located. The size of the circle represents the number of active nodes that are scanning from that particular location. And the color of the circles that you’re looking at represents the volume of traffic.

I get that the keys here on the bottom of the video is pretty small. I’m happy to share this link with you afterwards if you want to have a look at it closer yourself.

What’s interesting looking at this is that the graph that is here in the bottom left of this video represents the number of nodes that are partaking in this investigation over a period of about 12 days. So it’s a prolonged, orchestrated scan. And you can see here on the bottom right, it demonstrates how it actually processed through the entire address [space] that it was scanning. So you can kind of see the method that it followed, you could see the orchestration behind the entire attack.

So yeah, the motivation behind this – that’s the type of attack that we’re trying to detect, we’re trying to identify through behavioral analysis. Recently, the idea of DDoS attacks is in the news. Every day, every week, there’s a new type of attack. Most recent and maybe most famous of this is the Mirai botnet, which you’ve all probably heard about or read articles about online. So Mirai actually managed to take down some of Amazon’s AWS services and Twitter. And so it’s a kind of famous example.

The motivation, again, behind this is that internet-scale infections and orchestrated events are continuing to escalate, far more commonplace. So there’s a need for a prompt, formal, accurate solution to be able to operate against this kind of big-data [origin] from internet attacks. So basically what we’re trying to do here is have an approach that is formal and will actually exploit some data analytics techniques.

Some of the challenges that you have trying to conduct this type of investigation or detection: there’s a lot of information. So when you’re dealing with network traffic analysis, you have an awful lot of information and you have an awful lot of low-quality evidence, to be perfectly honest. So you have an awful lot of false positives, false negatives. There’s a lot of data that you need to churn through to identify the pertinent information that might lead you to draw some conclusions.

Network forensic approaches are either passive or reactive. So they generally employ manual, ad-hoc methods and they’re generally very time-consuming. And most current network forensic practices do not support a distributed inference. So the idea of having multiple nodes recording information, and then you use [unclear] that you get from all your detections, and to build that into an inference, to deduct what’s going on. So the first analysts go through another problem – process of correlating the unstructured evidence to try and figure out what the security incident is.

Just to briefly talk about some related work … the related work in this area typically falls into two main categories. So you have anomaly detection – network anomaly detection used in graphs to try to identify that way, or you have a big data forensic approach, so you have a lot of data, you’re trying to use some data analytics techniques, you’re trying to infer what’s happened after the fact.

What we’re trying to do in this paper is to fuse those two approaches together to be able to hopefully get the benefit of both of those approaches. This is kind of our contribution [unclear]. To infer internet-wide infections; to leverage probing activities using a set of behavioral analytics to try and infer the infections; employ a new concept of similarity service graphs to try and infer campaigns of infected machines; and we’re exploiting graph theory to infer the niche of the infected campaign, which I’ll about [in a little segment].

So our approach works for … it’s designed for a security operations center. And we’re looking at darknet data, so we’re looking at internet-scale data that targets routable, allocated yet unused IP addresses. It attempts to infer infected bots by characterizing all of the probing activities, which are the very first signs of an infection. And then constructs graphs and manipulates them to be able to infer an orchestrated campaign. So you’re trying to try to analyze all of the data that you’re gathering, you’re trying to group them together to infer what that campaign is trying to do and what that infection is that’s causing that information.

The first step there – infer and characterize the probing activities – so we’re looking at darknet data, and there’s a range of resources and approaches out there to gather and analyze darknet information. For this particular piece of work, the cyber threat intelligence emerging, Florida Atlantic University had a previous approach, so we used that.

We try to characterize the behavior, and looking at the probing strategy, the randomness there is in the traffic that’s been generated, based on statistics and heuristics.

To give you an idea about how this [feature] extraction works, the behavioral identification works across the different types of behaviors that we looked at. The first thing is randomness – randomness in the campaign in terms of exactly what is happening, what ports are being scanned, where is the traffic coming from. The method that we used for that is the Wald-Wolfowitz method. You have to excuse my pronunciations throughout this entire talk when I talk about the methods, but bear with me.

And then, what we get out of that is that we end up with the characteristic being identified as either random or it has a pattern – very straightforward.

We then look at the probing strategy. The methods that we employ for that are both the Mann-Kendall method and the Chi-square goodness-of-fit method. All of these methods are referenced in the paper, for completeness. And then what we end up with that here is we either can identify that there is sequential probing taking place or there’s a permutation of probing taking place.

We then look at the nature of the probing source. So what we do for that is we analyze the randomness and the actual probing strategy together, and then we can infer whether it is likely to be a probing tool or whether it’s likely to be a probing bot.

We look at the target that’s being analyzed then, so the method that we use for that is we look at the concept of relative uncertainty and we look at the theoretic metric. And then we can determine whether the target is dispersed or not.

And then we have some other miscellaneous inferences that we look at. So we look at the rate, which means how many packets, how quick is the information coming in, how fast is the attack occurring. And we look at the port numbers that are being scanned. What we can determine from that is the rate (so the number of packets that are being executed or sent per second), the destination overlap of that (so is there a commonality between the destination of those packets), and the port numbers that are being looked at (so you can infer probably what type of vulnerability are they probing for).

Okay, so I mentioned that we were looking at behavioral service graphs. So what we do here is we create a vector for each particular piece of information that we have. So when we try and infer what a particular bot is, we look at the randomness, the probing strategy, the target, the rate, the destinations overlap metric, and the port numbers that are being looked at.

Then, once we have that vector created for each bot that we’ve identified, we can create a behavioral service graph. So the nodes on the graph are the actual bots themselves, the scanning nodes in the network. And the edges that are the weights related to their similarity, the similarity in terms of their behavior. So each graph then will cluster a number of bots targeting the same port, and that will define an orchestrated campaign, like the video I showed there with the [unclear].

Okay, so if we look at these two bots as an example – bot one, we’ve identified the behavior to be random, sequential probing, dispersed probing, 60 packets per second, destination overlap was 100, and the port number that they were looking at was port 80.

Bot two, you see here you have … that is actually a pattern in the traffic. You have sequential probing, it’s targeted, at 55 packets per second, the destinations overlap for [unclear] 200 different hosts, and the port number was 80.

So what we’re looking at here is trying to … my rectangles have moved slightly, but you can see what’s going on. So what we’re looking at here is trying to come up with a score to sense the similarity between these two bots. So we look at each of their characteristics, and what we have [unclear] for the rate, the packets per second, because that’s kind of variable in a lot of different things. We allow a threshold of 15%, plus or minus 15% in the rate, to say that they are the same. So the behavioral similarity between these two bots is 50%. So you can see that there are three of the behaviors, the characteristics that we’ve identified, are the same.

Then, what we’re trying to do is we’re trying to allow the prompt inference of bot-infected machines, we’re trying to detect what machines are infected; automate the amalgamation of the evidence from the distributed entities; and provide some sort of insight into the behaviors of those infected machines.

Okay, so I mentioned earlier that the niche of the botnet… The niche of a campaign defines the nodes that aggressively infect other nodes or that are heavily used in the command and control communication for the botnet. If you’re familiar with peer-to-peer networks, it’s something similar to a super-node or master-node on those networks.

Then what we do is we apply a spanning tree algorithm to create an Erdos-Reyni random subgraph. Any objections to pronunciation? No? Okay.

So the nodes then with the maximum similarity are determined to be the niche nodes of that [botnet].

Again, the unique characteristics of the campaign – so we have the population of the bots, which will have several orders of magnitude, it targeted the entire IP address space, and bots that are using orchestrated strategies to maximize the coverage for the target host that they’re trying to scan.

If you wanted to identify the niche of the botnet, the super-nodes in the network, if you will, and you were able to eradicate those in some way, shape or form, to simply block their traffic or whatever you’re trying to do, you can limit the propagation of the campaign. It’s not, probably, going to stop the campaign, but it is going to hurt it quite significantly. [unclear]

Okay, so talk about the evaluation of this approach – there are two different deployment scenarios for this evaluation. The first is kind of an enterprise-scale scenario, so you have a median and a large enterprise, and you are identifying that there’s some suspicious traffic on your network and you’re trying to infer what is exactly going on within your own network. And the second then is trying to scale that approach up to an internet-scale attack.

First and foremost, we need to have a ground truth here – to evaluate this technique. What we use here is a 15 GB network traffic dataset, which is [consuming] the Security Experimentation EnviRonment (SEER). The ground truth here is that this data has already been analyzed to be an orchestrated probing campaign from the Carna botnet. This is considered one of the largest and most comprehensive probing census that has been conducted on IP before.

The outcome of that approach, when we analyze our network – we see that there are, we infer that there are 10 infected machines, [there are a cluster] of those machines to be either based on the behavior of the individual nodes. We were able to identify that two of those nodes are the niche of the campaign. So again, those two nodes are responsible for infecting the largest number of other nodes, and those nodes are probably part of the command and control of that botnet.

To move that outcome to the internet-scale approach, again, we needed to have data to work on, and we needed to have a ground truth to assess the methodology. What we used here was darknet data. We operated the approach here [unclear] a security operation center, and the ground truth here is a probing campaign from October 2012, and this was reported ISC to be targeting internet-scale SQL servers.

The outcome of this then is that we were able to infer and cluster close to 800 unique SQL infection bots, and again, using the similarity of these particular infected nodes, we were able to identify that 84 of these bots were part of the niche of that campaign. So again, the eradication of those 84 nodes would greatly hinder the progression or the influence of this campaign on [your network].

Some limitations to this approach, and some improvements that we could make. First and foremost, there’s a need to fortify the infection evidence. So we’re currently working on [correlating] malware with its probing traffic to try and accomplish that fortification.

There’s a need to find a formal, mathematical computation to infer the niche of the campaign. Currently, what we’re relying on is a threshold related to those subgraphs that we create. But we need to come up with a mathematical method to verify that.

And this is experimental, right? So we’re currently addressing the scalability issues of the approach, and the idea is that eventually, this would work in near real time and on darknet data.

Okay. So we conclude here. This approach views data analytics with formal methods. That approach has rarely been investigated for analyzing botnet behavior. We are leveraging that fusion to be able to infer orchestrated, internet-wide campaigns and to identify the nature of those attacks. This is a step towards leveraging big data analytics with formal methods to be applied in cyber security.

Preliminary results in an SOC model are promising, [unclear]. In terms of future work, we’re looking to address the aforementioned limitations. And we’d like to be able to verify the soundness of the approach in corporate networks using two-way traffic. This is [under the plan].

And I just have acknowledgements here – this research was funded by the NSF in Florida Atlantic University, and there was also some collaboration with the Florida Center for Cybersecurity.

And yeah, that’s all I have. If you have any questions, I’d be happy to attempt to answer them.

[applause]

End of Transcript

This video was recorded at DFRWS EU, in collaboration with Forensic Focus. Find out more about DFRWS and register for next year’s conference on the official website.