Timothy Bollé discusses his research at DFRWS EU 2018.
Timothy: Hello. I am Timothy Bollé. I am currently doing my PhD at the School of Criminal Justice at the University of Lausanne, with Professor Eoghan Casey. And I’m going to discuss the use of computed similarity of distinctive digital traces to evaluate non-obvious links and repetitions in cyber-investigations.
Just as a quick introduction: Online crimes are repetitive by nature, because there are a lot of victims to reach and because there is a low risk of being identified or apprehended. This repetition can be found as crime series, where the same offender or the same group of offenders commits multiple crimes, or as crime patterns, like hotspots or repeated victimization. To find this repetition, we can use forensic data, like traces, and also more situational information, like modus operandi descriptions or spatiotemporal information.
To help find those repetitions, we can use this crime analysis process. With repetitive crimes, we will acquire information, like forensic data and situational information, that we will integrate into a structured memory that we will analyze. The aim of this analysis is to find repetitions and links, and at this stage, these are just hypotheses. The work of the investigator or the analyst is to evaluate the hypothesized links and patterns that were found and, based on this evaluation, to take decisions at a strategic or operational level to try to impact the crimes.
One limitation of this process is the fact that when capturing the information, there can be errors. And also, a huge limitation of the digital world is that it’s easy for people and criminals to change their identities and the tools they use. So, if we stick to exact matches between those pieces of information, we may miss important and relevant links. That’s why we thought that it was necessary to have automated ways to compute near similarities.
Those near similarities can be computed at a technical level, so between traces, or at a more behavioral level, so taking into consideration modus operandi, situational information, tools, methods, and so on. And depending on the level we integrate, we may be able to set different hypotheses.
So, at a trace level, we may be able to set the hypothesis that it’s the same offender. But at a more behavioral level, we may just be able to say that it’s the same phenomenon and the same type of criminals.
To answer this question, we will need additional and contextual information. Just to highlight some key points in this process: When we [compute a] similarity, basically, we will have a score or a match, and the investigator or the analyst will see on their computer something like, “Hey, those two cases have something in common, they present a similarity.” And at this stage, it’s just a hypothesis that they are linked. Then, you will need to check that it’s not a false positive. Because it’s an automated computation, maybe the method has some false positives, [right?] So, we check that it’s not a false positive.
And then, the next step, and I think it’s the most important step, is to check if it’s a real link. It means that you will have to evaluate, regarding the [user observation], whether it supports the hypothesis that those two cases are linked. And this step will require contextual and additional information.
We can take, as an example, IP addresses. We can say okay, let’s check if in our cases some IP addresses come from the same network. For example, those two IP addresses differ by just one bit. So, if we do exact matches, we will not see anything. But if we look at the network, we will see that they come from the same network, so it’s a true positive. And then, now, if we … so, maybe the cases where those two IP addresses came from are linked.
Now, if we try to evaluate this hypothesis, regarding the fact that they are [not linked], we will see that those two IP addresses come from a popular ISP in the United States and that they are attributed to two different states, California and Massachusetts. So, maybe in this case, we will not say that the cases are related, and we will just move on.
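As an illustration of the network comparison described above, here is a minimal sketch using Python's standard `ipaddress` module. It is not the code used in the research: the function names, the /24 prefix length, and the example addresses (taken from the RFC 5737 documentation range, not from any case) are all assumptions made for the example.

```python
import ipaddress


def differing_bits(ip_a: str, ip_b: str) -> int:
    """Count the bits that differ between two IPv4 addresses."""
    xor = int(ipaddress.ip_address(ip_a)) ^ int(ipaddress.ip_address(ip_b))
    return bin(xor).count("1")


def same_network(ip_a: str, ip_b: str, prefix_len: int = 24) -> bool:
    """Return True if both addresses fall in the same network.

    The prefix length is an assumed parameter; a real analysis would
    rather use the network actually allocated to the provider.
    """
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


# Hypothetical addresses from the documentation range.
a, b = "192.0.2.10", "192.0.2.11"
print(a == b)                # False: an exact match sees nothing
print(differing_bits(a, b))  # 1
print(same_network(a, b))    # True: same /24 network
```

A positive result here only says that the two traces are technically close. As in the example of the two US states, whether the cases are really linked still has to be evaluated with contextual information.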
Another example is the case of email addresses. What we observed is that in some cases, we had [close] email addresses that vary only by a few characters. So, maybe it’s because it’s easier for a criminal to manage, or because they can use the same identity multiple times. Anyway, we decided to look at the username part of the address and do some string similarity computation.
We needed to select an algorithm to do this, so we created a test dataset with … we took names, and from those names, we generated some email addresses. Then, we computed the string similarity between email addresses generated from the same name, and that constituted the similar set, and the similarity between email addresses generated from two different names, and that constituted the non-similar set. We tried multiple algorithms and the best one was the Levenshtein algorithm, that’s [08:47] distance.
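To make the comparison of usernames concrete, here is a minimal sketch of a Levenshtein distance computed on the part of the address before the `@`. The normalization to a score between 0 and 1, the lowercasing, the function names, and the example addresses are assumptions made for illustration; the talk does not specify how the score was scaled.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def username_similarity(email_a: str, email_b: str) -> float:
    """Similarity in [0, 1] between the username parts of two addresses.

    Assumed normalization: 1 - distance / length of the longer username.
    """
    user_a = email_a.rpartition("@")[0].lower()
    user_b = email_b.rpartition("@")[0].lower()
    longest = max(len(user_a), len(user_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(user_a, user_b) / longest


# Hypothetical addresses, not taken from the cases.
score = username_similarity("john.smith01@example.com",
                            "john.smith02@example.org")
print(round(score, 3))  # 0.917
```

Comparing only the username part means that two addresses at different providers can still be flagged as close, which matches the idea of an offender reusing the same identity across services.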
Then, we selected a threshold. The threshold was selected in order to minimize the false negative rate, because here, we are in an investigative context, so we don’t want to miss relevant links. So, maybe we will have more false positives, but anyway, we didn’t want to miss links.
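One way to picture this threshold selection is the sketch below. It assumes that similarity scores are available for the similar set and the non-similar set, and it picks the highest threshold whose false negative rate stays within a chosen limit, so that false positives are reduced only as far as that limit allows. The function names, the `max_fnr` parameter, and the scores are invented for the example; they are not the values from the study.

```python
def rates(similar, non_similar, threshold):
    """Return (false negative rate, false positive rate) when pairs
    scoring at or above the threshold are reported as matches."""
    false_negatives = sum(score < threshold for score in similar)
    false_positives = sum(score >= threshold for score in non_similar)
    return false_negatives / len(similar), false_positives / len(non_similar)


def pick_threshold(similar, non_similar, max_fnr=0.0):
    """Highest candidate threshold whose false negative rate
    does not exceed max_fnr."""
    candidates = sorted(set(similar) | set(non_similar), reverse=True)
    for threshold in candidates:
        fnr, _ = rates(similar, non_similar, threshold)
        if fnr <= max_fnr:
            return threshold
    raise ValueError("no threshold satisfies max_fnr")


# Invented scores, for illustration only.
similar_scores = [0.92, 0.85, 0.78, 0.60]
non_similar_scores = [0.70, 0.55, 0.40, 0.30]

threshold = pick_threshold(similar_scores, non_similar_scores)
print(threshold)                                            # 0.6
print(rates(similar_scores, non_similar_scores, threshold))  # (0.0, 0.25)
```

With these invented scores, missing no similar pair costs one false positive out of four non-similar pairs, which is the kind of trade-off accepted in an investigative context.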
To evaluate those techniques, we had 207 cases from the Geneva state police. They were mainly online frauds. They were quite [various]. Here you can see some statistics. The most costly was romance fraud, that is, dating scams. But we had some different ones, like advance fee fraud and inheritance. So, these statistics are built on quite similar … quite a small set. So, maybe with more data, or over a wider area, they may change. In those data, we didn’t have many IP addresses, but we had a lot of email addresses, so we were only able to test the methods on email addresses.
When we computed the similarity between the email addresses, we had a lot of false positives, and this difference can be explained by the fact that the test dataset we created did not perfectly fit reality. So, there is a little discrepancy here, but we adjusted the threshold a little bit, and it worked well.
And what it confirmed is that there are, concretely, in real cases, some similarities between email addresses. There is a match. And then, we had to follow the process to confirm it was a true positive, and we had true positives. And then, to evaluate the links it highlighted.
We were in two different situations. The first one was when it highlighted links that were already…
I’ll do it again. It showed similarity between [traces on cases] that were already linked. For instance, here we have … the cases are in red here and the email addresses are in yellow. So, those two cases were already linked by one victim. And in each case, there were email addresses that presented some similarity with those in the other one.
And here, the same in some cases. We had multiple email addresses that presented some similarities, and this shows that … it confirms that in some cases, there are email addresses that are close, and that if we only do exact matches here, we will miss potentially relevant links.
The next situation was when we had cases that were not already linked. And using the near similarity approach, we were able to link them. For example, those two cases are … have similarities on two email addresses that vary only by one letter. In this situation, because it’s a name, when we have to evaluate the hypothesis, we may be confident in the fact that those two cases are in the same crime series. But in some other cases, like here, the two email addresses are airbnb, rent4bnb, so in those cases, when we have to evaluate the hypothesis, we … airbnb is a more common [word], so maybe here we will not be able to say that it is the same crime series. But anyway, we can still maybe identify the same phenomenon and gain some more knowledge about this phenomenon. Like [frauds] using airbnb.
This shows that it’s really important, when computing some similarities, to take the time to evaluate what we have. This is the job of the investigator or the analyst. And based on this evaluation, you may want to look at other elements in both cases, to see if we can be more confident in the hypothesis or see if maybe there is an alternative hypothesis that explains everything.
And this is why it’s important to have this other information, this context information. In the paper, we presented some formalized guidelines on what kind of information can be collected during an investigation, to help with this evaluation step.
To conclude, there are, in real-world cases, some of these near similarities between technical traces, and it’s important to explore further how to use them, how to [apprehend] them, and how to evaluate them. For now, we just … we limited the work to technical characteristics and email addresses. This was limited by the data we had. There is a project in Switzerland to have a centralized platform to collect information from multiple states. So, we hope it will give us more data and more characteristics to examine.
We also … we are currently looking at more behavioral similarities, and to do this, we are looking at network and graph analysis. And we are currently doing it, so … the results will follow in the future.
And just a last diagram, to sum up the whole thing: even if we are able to find similarities at a technical level or at a behavioral level, the main challenge is to evaluate those potential links, those hypotheses, and to see if there are also alternative hypotheses to take into consideration. So, this is the main challenge, and we will also work on this in the future.
End of transcript