*Dinil Mon Divakaran discusses his research at DFRWS EU 2017.*

Divakaran: Good morning, everyone. My name is Dinil Mon Divakaran, and this work is with three other authors – Kar Wai, Ido, and Vrizlynn. We are from the A Star Institute for Infocomm Research. And as has been introduced, [00:25] Evidence Gathering for Network Security and Forensics.

So I think [00:30] takeaways from yesterday’s dinner talk was that [00:37] evidence, right? [00:42] trying to figure out if they can say something with more number of evidences. Before I start, I also would like to say that [00:57] [active borders between security and forensic attributes], kind of similar in many network security and forensic work that at least I am involved in. In the sense that the solution can be used for forensic analysis, can also be used for security analysis, offline, online, and real-time [01:19].

I’ll start with the context and the problem, [starting] the objective. And give you an overview of the evidence-gather framework. The core of the framework is a modeling and analysis component, and we used regression for [01:48] that part. Besides the modeling and analysis, or beyond that, what we do is correlation, and that entire decision-making system, we develop an entire decision-making system. And [02:06] [evaluation] on real data set that we obtain from [… that’s openly available].

The context of the project is that there’s always a need to detect anomalies in the network, and anomalies can come not only due to security-related incidents but also due to faults in the network. So you can have anomalies due to [mismanagements, configuration] as well. But what we listed here is this set of anomalies that are related to attacks.

When you talk to [operators] or when [02:48] those who are used as [02:51] security incident and even management system, one of the problems they have is that they get a lot of [02:56], they get a lot of [02:58], and they are quite huge, depending on the size of the enterprise network, and many of them are false. So you need to analyze them in detail to understand whether they are true or false.

So we look into that context of saying, “Okay, there are anomalies, but can you say more that some of the anomalies are related, they are more important and they are more relevant, and others are not, or others are irrelevant, and you can see that they don’t make much of sense?” So that’s pretty much what we are trying to do here. We want to detect patterns that can be related to anomalies.

[That said,] anomaly detection in itself is a very [old] problem in the network community, and people have tried, applied different knowledge systems or scientific tools, such as expert system, information theory, data mining, signal processing, statistical analysis, and so on. However, most solutions have been targeted to detect a particular kind of attack or particular set of anomalies, which means they are kind of dedicated.

So you can’t … for a user of an enterprise network, it’s difficult to have [all such solutions] [04:22] if you want something which is more generic, that can detect patterns. [Let’s say] you want to detect a particular attack. That’s what we are looking into. And the other is to bring the context. So if you see a [network scan or a port scan], it itself is not a problem. It might be a problem wherein you have a related attempt to do, for example, a [04:47] attack.

The system we developed is to detect what we call fundamental patterns and … I call this as evidences, though I think it’s not the most perfect term to be used. But it’s basically patterns that you feel are suspicious, but in itself, you might not want to look into it and say that “Hey, there’s some problem with this pattern and we need to dig further, further, further,” as in do it manually. So you want to collect [05:20] view it, analyze it, and then come up with a conclusion.

This is the objective of the work – [we want to detect] evidences and patterns related to attacks and anomalies, analyze and correlate them, and the aim is to detect anomaly, without the need to learn [05:44] from normal traffic. So again, this is a distinction from [any works], that we are not doing traditional machine learning, [05:54] we learn models. Machine models from normal data.

So one of the challenges is that … the community is facing is that you can’t really get very clean real data, and even if you get it for one time period, you need to keep renewing it, because the traffic changes dynamically over time. So that’s kind of challenging, to always have models that learn from normal traffic, that learn from attack traffic. So you need to always update your models, you need to have real data corresponding to these two sets.

[To overcome] that, we are not having a model based on normal traffic. We are looking into patterns that might be common for many attacks. So this is the overview of the evidence-gathering framework. The first stage – in the first stage we develop a model, we do modeling of patterns, analyze and detect [such] patterns. In the second stage, we detect scans and illegitimate TCP sequences. So [07:08] illegitimate TCP sequences. So this is kind of deterministic or … this is really deterministic, there is no probabilistic [07:16]. Our model is probabilistic. And the third one, [compile] these two to find correlations and make decisions.

Now, before I proceed, I also would like to tell you that [you may] feel free to interrupt and ask questions if it’s not clear. So you don’t have to wait till the end of the talk.

So the first stage is on modeling and analyzing. Before we go into that, we need to know what are flows and sessions. These are commonly used terminologies in networks. So a flow is a set of packets, localized in time, with the same set of, usually five tuple of source and destination IP addresses, source and destination ports, and protocol. Session is basically again a group of flows that you want to analyze; together, maybe it’s like a number of SSH items, to [a port] with multiple flows but with a single session. So the session definition is kind of [variable]. You might want to look at a destination-based session, which is like you have flows coming to your server, and that you might want to segregate based on session. So indirectly, it’s [probably this port] or the port and [SO’s] address. So you have a session between two end hosts, and [differentiations] between different end host pairs.

The traffic is represented by features, and there are numerous features that can be used on network traffic. For example, you can use flow [08:58] etc. In fact, you can come up with a few [09:01] and even hundreds of features from packets and flows. For the purpose of demonstration, we’ll look into only these three features.

One is the inter-arrival time between flows in a session. So you take the session, and look at the inter-arrival times of flows. So flow, it roughly translates to [connection]. So if you look at a bot, the bot might be trying to communicate to different server at regular intervals. So the inter-arrival time might not be so dragged out as opposed to a human behavior, browsing behavior. This is not just applicable to HTTP traffic, but could also be applicable to DNS traffic, if the [09:49] has a [DG], the domain generation algorithms, which will try to communicate, try to resolve different domain names, then might have some interesting patterns in the inter-arrival time.

The other features are the size of flows – size of flows can be analyzed in packets or in bytes – and finally, degree of an end-host, that is in terms of distinct machines, a particular number of distinct machines an end-host communicates to. So that will tell you whether it’s in the normal range or not. Again, this is not an absolute measurement, but it’s related, because if you have a network of ten machines, you would have a degree going in the range of maybe 10 or 50 every few minutes. But if you have a large network, it can increase by an order of magnitude.

So as I said, we use regression for modeling and analysis. These are the four example patterns and these are the four types of patterns that we detect using our system. So we’ll explain. These patterns were obtained from real traffic from [MAVI] dataset. So [MAVI] is a consortium from Japan, a consortium of different ISPs, so they keep uploading daily traces, and these are some of the anomalies in the traffic. So this is an example – the first one is an example of an outlier [11:27] one point stands out. So this count here is [11:33] number of IP addresses here communicating, for example. So this is an outlier, example of an outlier.

This is an example of a perfect fit, as in your values at different times – so these are the time, the X-axis is the time – so different time instances, you have almost the same count coming. So this count again can be the number of [12:01] you are communicating to it could also be the number of packets in your flow.

And the third example is of a linear increase. A linear increase can come if an attacker is trying to beat a system that uses a linear model to detect. So it’s a linear model that sees if the increase is linear with respect to any feature, and if a system is built on that, then of course this can be detected, but a quadratic one or an [exponential] one cannot be detected. And a linear relationship can also come if you have something like a [TCP SYN flooding] attack where for each [SYN] you get the number of [SYN-ACK] that goes out during a [TCP SYN flooding] attack is always increasing because the [SYN] comes from a spoofed IP address. So because the [SYN] comes from spoofed IP addresses, the [ACKs] do not reach the destination, so the victim keeps sending the [ACK]. So [re-transmission] of [ACKs] happens, and therefore, if you take the difference of this, you would see an increase.

The entire modeling here is mostly based on linear regression, that is, we assume a first order linear model, basically a model of a line. So you have intercept; Y is your observed value, which could be the number of packets in your flow, for example; and X is the discrete time points; so beta zero is the intercept and beta one is the slope; and epsilon is the error in modeling; and usually, in linear models, it follows the normal distribution, which means it’s a zero – normal distribution with a zero mean; and a standard variation of sigma.

Now, what you do with the model is you try to estimate what your response variable is, that is before you get the actual data, you try to figure out what might be the next value that you’re expecting for the feature, if it’s the number of [end-points] you’re communicating to or the number of packets in a flow, whatever it is. So you model it, you try to estimate it. So when you estimate, you have to find the values of … you have to estimate beta zero and beta one, so you have the gap [14:34] estimate.

And in least square method, which is the most commonly used method for line [fitting], the coefficients are obtained by solve the simple equation, which is you minimize the sum of the squares of the difference between the estimate and the observed value. So Y-hat is your predicted value and Y is the observed value. So it tells you the error, the error in what your model predicted, and what is the actual value. And the estimates for the slope and the intercept are made by minimizing this sum of square errors.

[With this now,] you can look into four different methods for detecting the four different patterns [15:26] earlier. The first one is your defined outliers. Unfortunately, the [lead square] regression method is very sensitive to outliers. It has a breakdown point of 1/n, which means that you just need one out of n samples to be an outlier for the estimator to move away arbitrarily from the observed value. Which means your estimator can really, really be out of the optimal region or the correct region if there is at least one outlier. And theoretically, [16:03] it’s even zero, so you don’t need an outlier.

In linear … in the [16:09] there are a number of robust regression methods, and Theil-Sen estimator is one of them, which has a high breakdown point. And the point is close to 30 per cent, which means even if you have 30 per cent outliers, you can still fit the line almost correctly. And in Theil-Sen method, the slope is estimated as the median of all slopes. So if you [have these x-y pairs] for a line, let’s say you have 20 points, so you take the slope of all of them, and take the median, and that’s how you get the Theil-Sen’s estimation for these parameters.

So given this our hypothesis for detecting an outlier is as such, which means it’s an outlier as long as the difference in the estimated value – so you remember r-i – is the difference between the observed and the predicted estimated value. So if the difference is less than a particular [17:12] value, and if you take this … so I know it’s around 30 per cent is the maximum outliers the Theil-Sen estimator can tolerate. So if anything is less than 70 per cent, then you can say it’s an inlier value, and it’s greater than 70 per cent then you say it’s an outlier. That’s a safe bet, but of course you can have a parameter [17:32] to say if you want to play with it, right?

The second one is the goodness of fit. So you saw the straight line and you want to know whether such straight lines are there in observed network traffic. If we have a perfect fit, it actually gives you a functional relationship. Now, functional relationship is not common in network, because there is a human often behind a computer, and therefore what is expected is a statistical relationship, there’s no … it’s not a functional relationship. So the way you can detect that is by [testing for it]. The sum of … you can actually check whether the slope is zero.

The third one to recall is to find if there is a slope which is suspicious. So if it is having an angle of very high slope or steep rise, then you are going to call it as an anomaly. So you have a threshold for that. You say that okay, if the slope is less than theta, then it’s not an anomaly; otherwise it’s an anomaly. So that’s the hypothesis testing for that. However, we don’t have beta one, because this is the case when you have the exact value of beta one. But we are estimating it. And estimation of beta one depends on the error. So you need to correct for the error. So we say that the null hypothesis – we reject the null hypothesis, which is it’s not an anomaly – we reject it if the estimate of beta one is between minus theta and plus theta, corrected by the error in the estimate. So this [key] distribution is used to correct that error.

Finally, we go to the fourth one, where we assume that not everything is a linear increase, so if it’s not a linear [19:34], then your linear model will [say it’s failed]. Now, if it’s failed, it doesn’t mean it’s not an anomaly. It only means that the linear model didn’t fit right. So what happens if the model is … you have an exponential [19:48] or you need to fit a polynomial. So we can’t have models for each and every one of these, so what we do is we have a quadratic model, so we take the polynomial of [degree two] and [fit] the data.

Next, we see if the linear model fits better or the quadratic model. So for that we use the coefficient of determination. Coefficient of determination simply looks to … it quantifies [the point] that your estimator, how much of data has your estimator grasped. So the higher the value, the better the fit. So if the quadratic coefficient of regression is greater than that obtained by [20:31] [which is for a linear] model, then we say it’s an anomaly.

So those are the four models. Now we go to stage two, and in stage two we want to detect scans and illegitimate TCP sequences. I guess everyone knows about scans. It’s very common these days to find out what services are running on which machines, and there are also ways to scan machines behind firewalls to know what is the amount of buffer the machines can tolerate, for example. So scans are quite useful to learn the systems. But we go a bit beyond scans, and we want to know whether our flows that do not conform to TCP’s finite-state machine. So TCP has its finite-state machine, based on which it responds. So if it gets a SYN, it responds to the SYN-ACK, and the client responds with an ACK, and then there is data exchanged, then there is FIN, and we close with a FIN-ACK.

So that’s a simplified version of what the FSM is. What we are trying to do here is you have a small FSM running at the system, and see if the flow is conforming to the FSM or not. And the third stage is evidence correlation and decision-making. So correlation can be made either in time or in space. So in time, if you want to know a correlation, you are trying to say whether what you saw today, the anomaly that you saw today, let’s say in the morning, is related to yesterday night’s event or an event on the [back] and so on. So correlation’s time is not really [22:23], it’s kind of complicated, because you need to have database as large and as efficient to compare with.

The other possibility is you do correlation [in space], which is you find out if the IP address that is related to the particular evidence has some or other … is occurring in some other events or evidence. So it can be the IP address in itself, it can be the network address of the machine. So we choose to do the [space-based correlation] here and not time-based. Time will be viewed, however, but over just a small interval of time. I think in 24 hours.

So if you look at all the four techniques we mentioned, there are thresholds for each of them. To make it easy for anyone to use the system, we normalize the threshold to be in zero and one, and then put a conservative value to say what is low, medium, or … you define what is low, medium, and a high score range. For example, anything beyond 0.75 is high; below 0.25 is low; and so on. The detection is [basically] based on the number of evidences and scores. So it’s not based on one single evidence but multiple evidences, as well as the scores.

Now, [23:59] recap of the framework. As we said, you first model the traffic, and we use these three features – inter-arrival time of flows, size of flows, and degree – and we detect anomalous patterns. These are the examples. We detect scans in these three sequences, and finally correlate them to make a decision on whether a session is anomalous or not. Finally, we go to the performance evaluation. For the data we took both benign and malicious traffic – close to thousand benign sessions and 1400 malware sessions. But the sessions are actually huge, because it depends on the time window that you are looking into. And therefore, there are a few millions of flows [24:52].

The benign traffic was obtained from the [ISCX] database, dataset, that’s from a university in Canada; the LBNL dataset, that’s quite old however; and the third one is two of the co-authors, so it’s a little private, but anonymized. Anyway, we are not doing the [pack] inspections, so we are looking only at the headers and time between packets.

Malicious traffic was generated by malware, and this is obtained from the Stratosphere project, and the traffic is quite recent. There are a number of datasets, [but one of the] datasets we chose had traffic from 11 different botnets.

So we set conservative values for the threshold to test the system. [We say] if the output score is greater than 0.7, it’s high, and the meta decision-maker classifies a session as anomalous if at least three patterns or evidences are found with respect to a particular session, and at least two of them must have high scores. So it means the third evidence need not have a high score, it just needs to stand out.

These histograms are for the different traffic sessions. On the left is a normal traffic session, and what we’ve got here on the X-axis is the number of evidences. So for the normal sessions, the left blue bar is for the number of sessions and the right one is for the percentage. So if you look at this first one, you can see that the normal traffic sessions had no evidence for 60 sessions. But there were evidences that [26:42] and there were 20 of traffic sessions which had two evidences and so on. If you look at malware traffic, almost more than 80 per cent of them had three or more evidences. So [26:56] 84 per cent [and over].

A couple of examples of the sessions detected. Here we see that the flow size in bytes at different time points is almost the same, so you can see a straight line fitting here. This is [model] fit and these are the data points. These are outliers. And this is detected using the linear fit model. The second example is the [27:29], and here you can see that the linear fit is probably not so visible. It’s not clear for our eye, but with the model, the linear fit is not as good as the quadratic fit. You can see the points are quite dispersed, but you would look at it suspiciously, so it’s one simple way of knowing whether it’s a linear model or not.

In this particular scenario that I’m presenting, we obtained an accuracy of 82.6 with a false positive close to 8 per cent. For some of … a number of bots we had high detection [28:14] few with lower detection rates. They still had evidences, and they had at least two evidences, so it’s a matter of configuring them to … if you want to detect evidences for the bots in lower category of detection. I’ll come to that later anyway.

These are the results for the effectiveness of features. So there’s flow size in packets, so almost all detected sessions had evidences on flow size in packets, flow size in bites, inter-arrival time, and degree, but only half of them had related scans, for example. I guess most of them are scans and very few of them are not scans, but [still] [29:04] [50 sessions]. For the techniques, the outlier and the goodness of fit was useful in most of the evidences, and our linear or quadratic models were useful in close to eight per cent. Again, that doesn’t mean that the model is not good. It just detected some of them which were left out by the others.

So if you want to change the decision criteria … for example, we tested with the criteria where we need only two or more evidences instead of three or more, and then the detection accuracy went up to close to 94, but of course the false positives also went down. We performed the computations on an [29:51] machine, with 12 GB RAM, and we could achieve close to 3,000 flows processed per second. However, it’s still only an online solution. Of course, it works [offline as a forensic but as I said] in the beginning, in many of the network problems, we [30:09] security and we would prefer to do it online and in real time, and real time is still a work in progress.

So to conclude, we developed this framework for gathering evidences to make sure that when you say there is an anomaly, there are a few evidences related to it. We actually have a prototype developed where the anomalies are shown and then when you select them you’ll see all these evidences popping up as graphs. So an analyst would see these graphs coming up on screen, so it’s kind of easier than going to the raw data, the log file, and finding out what happened.

So we don’t do any learning from the normal traffic. I think to be more correct, we also do not learn from attack traffic. We only observe … as in there is no machine learning of attack traffic pattern. It’s mostly that we observe and know that these are the patterns that are most commonly seen, and therefore you model the patterns. In particular, if we know that these bots – for example, let’s say the bot has this set of patterns, [31:25] actually attach [prior] probability to those, and have better results. But we didn’t do that for this work.

So it’s still a [conservative] kind of system, I would say, because you could go a bit further and put a more learning of the attack traffic – for example if it’s bots – but that’s because we are not really focused on detecting bots. It’s just a case study that we took for this work.

The next steps – we want to enhance the solution on live real-time traffic, and this is challenging because you need to look at traffic at the level of packets, flows, sessions, and compare it with those in the database.

Also, we just used three features, and there are numerous more features that we can test with. This is again a continuation.

Yeah, with this I come to the end of the talk, and would be happy to take questions.

[applause]

Host: I think we’ve got time for maybe two quick questions.

[silence]

Audience member: Yeah. I see that you’ve used the machine learning for [32:45] [data type patterns]. Did you consider using just a [progression] or real network? Because it seems the [32:52] don’t exactly match a linear progression.

Divakaran: So yeah, so we have another class of study where we used explicitly machine learning [33:03], not regression, but a classification probably, [33:06] classification area. But that’s a work that is ongoing, but for that we need huge amounts of data. So here we are not relying on data. We are relying on data only to [33:19], we are not relying on data to learn. So that is the context that we looked into. If we did not have that much of data to learn, what is the best we can do?

Audience member: Okay. Could you go back to your slide where you have the robust [33:45]?

[silence]

Divakaran: This one?

Audience member: No. Yes, here. Okay, so you take, for the slope, it’s estimated as the median of all slopes. So taking the median brings [34:05]. But [you may need] to get something which you might try, because I think you would calculate … when you are doing this, basically this is a quadratic problem, because … so the complexity [34:19], and you might introduce a lot of noise of [ill-conditioned] slopes, because points that are very close … basically, the different slopes that are [ill-conditioned]. And [34:32] you need … you might have more trust in slopes where points are further away. So basically, you could take kind of windows that move around the sets, and take slopes with [pair of] points, [one on] each side of this window. And this would reduce the time … if you go real time, you might have a [time] issue. And you would eliminate all the slopes where the points are too close, which [35:01] information, because the slope is very unstable.

Divakaran: Yes, that’s a valid point. However, it’s only the [ninth] algorithm which takes quadratic. There are other algorithms for [35:11]. It’s faster than quadratic, so it’s still not linear. So it’s n … I think order of [n log l] or [n root] or something.

Host: Alright. Thank you very much, Dinil. [35:31]

[applause]

__End of Transcript__