We know that programming plays a role in investigations, but how much of a role does statistics play? Does anyone have any stories where standard deviation, z-score, correlation, probability, etc. played a role in analysis?
I just started learning statistics, but here is an example from a brute force attack on a honeypot, where I quantified the correlation between sessions and snort alerts, which was 0.98 (very strongly related).
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Basic qw(:all);    # CPAN module; :all exports correlation()

# Session counts from the honeypot.
my @session = qw(2 3 26 2 7 7 27 16 5 7 16 16 9 21 10 19 14 8 8 30 20 3 10 0 12
4 6 7 7 6 112 236 237 605 224 13 7 14 1 6 21 111 7 32 19 13 20 16 38 12);
# Snort alert counts, in the same order as the session counts.
my @snort = qw(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 46 100 99 187 63 0 0 0 0 0 0 34 0 2 1 0 0 0 1 0);

# Correlation coefficient between the two series.
my $cor = correlation( \@session, \@snort );
print "Correlation $cor\n";
Correlation 0.98
Very pretty :-)
I've no stories in particular, but I really, really, really think that statistics and metrics have a role to play. (See me spouting here.) You might also like to have a look at the following
and Andrew's mailing-list/website at
Thanks for the links! I've read Security Data Visualization and plan on reading Security Metrics since I've heard good things about it. :)
I've actually been using basic descriptive statistics like min, max, sum, average, range, etc. without thinking about it. Now I've been reading a little more on statistics and am encountering equations that I know have uses in investigations but am having trouble thinking of them.
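For anyone following along, here is a minimal sketch of those descriptive statistics plus a z-score, using only the core List::Util module. The session counts are made-up illustrative values and z_score is a hypothetical helper name, not something from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical session counts (illustrative values only).
my @sessions = ( 2, 4, 4, 4, 5, 5, 7, 9 );

my $n     = scalar @sessions;
my $sum   = sum(@sessions);
my $mean  = $sum / $n;
my $min   = min(@sessions);
my $max   = max(@sessions);
my $range = $max - $min;

# Population standard deviation: square root of the mean squared deviation.
my $sd = sqrt( sum( map { ( $_ - $mean )**2 } @sessions ) / $n );

# z-score: how many standard deviations a value lies from the mean.
sub z_score {
    my ( $x, $mean, $sd ) = @_;
    return ( $x - $mean ) / $sd;
}

printf "n=%d sum=%d mean=%.2f min=%d max=%d range=%d sd=%.2f\n",
    $n, $sum, $mean, $min, $max, $range, $sd;
printf "z-score of %d: %.2f\n", $max, z_score( $max, $mean, $sd );
```

A value with a large z-score (say, well above 2 or 3) is one way to flag an interval as unusual compared with the rest of the log, though the threshold is a judgment call.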
Various correlation equations let you quantify relationships. The correlation of 0.98 between sessions and alerts strongly links the two, and I would think a correlation of -0.98 on data contained in a binary would suggest it was encrypted.
There have to be a lot of examples of how these kinds of equations can help someone investigate an incident, but I'm having a brain block. :(
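Remember
"There are three kinds of lies: lies, damned lies, and statistics."
The trouble is that a correlation is not evidence of causation - don't forget that the decline in pirate numbers is directly responsible for global warming.
And this is where forensics and statistics usually part ways, I imagine! However, there is, I think, room for correlations between data sets where other evidence supports them, e.g.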
Suspect "A" has logged on to computer "C" at times "X", "Y" and "Z".
The websites "W1", "W2" and "W3" were defaced at times "X+n", "Y+n" and "Z+n".
There are files that are on computer "C" that showed up in the defacing.
Although no other evidence exists that Suspect "A" committed the crime, there is a correlation between the times that might lead us to believe that there may be more than coincidence in the events.
This is, however, an opinion, and isn't reliant on fact - so unless asked directly what you "might infer" from the above, it's beyond the realm of a statement of fact.
Of course, your point about randomness in files and encryption is a valid practical application!
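A rough sketch of how the timing pattern in the example could be quantified: pair each login with the defacement that followed it and look at how consistent the delay "n" is. The timestamps below are hypothetical epoch seconds invented for illustration, not data from any real case.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical epoch timestamps in seconds (illustrative only).
my @logins      = ( 1000, 5000, 9000 );    # times X, Y, Z
my @defacements = ( 1300, 5290, 9310 );    # times X+n, Y+n, Z+n

# Delay between each login and the defacement paired with it.
my @delays = map { $defacements[$_] - $logins[$_] } 0 .. $#logins;

my $count = scalar @delays;
my $mean  = sum(@delays) / $count;
my $sd    = sqrt( sum( map { ( $_ - $mean )**2 } @delays ) / $count );

printf "delays: %s\n", join( ', ', @delays );
printf "mean delay %.1f s, standard deviation %.1f s\n", $mean, $sd;
```

A small spread relative to the mean delay is consistent with the events being linked, but on its own it remains an inference rather than a statement of fact.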
Another good use for statistics - not per case but on an aggregated basis - is for describing general information security trends.
Within our team we use the
There is also a presentation from one of my colleagues about this subject on the
The correlation of 0.98 in sessions and alerts strongly links the two, or a correlation of -0.98 on data contained in a binary I would think would suggest it was encrypted.
Just wanted to check: I assume your correlation here is using a standard correlation coefficient, such as the Product-Moment Correlation Coefficient? If so, a strong negative correlation would not necessarily imply encryption. Encryption methods typically work by introducing as much entropy (chaos) as possible into the system involved, so (unless I am getting completely the wrong end of the stick) a result of zero would suggest encryption, or simply that there is actually no correlation. A result of -0.98 would suggest that there is a strong negative correlation between the two entities compared. In your example of Snort alerts and sessions, I'm struggling to see how a strong negative correlation would occur, and what its significance would be.
Simplifying these further, a correlation of +1 is fundamentally a graph of y=x (showing perfect coincidence), whereas a correlation of -1 is fundamentally a graph of y=-x (showing that coincidence definitely does not occur, in fact the complete opposite!).
Just to clarify: a negative correlation does not mean that there is no correlation; that is what zero represents.
How you determine whether a zero result represents encryption or simply nothing of interest would in itself be an interesting topic, but one that I would expect to be beyond the scope of an honours project.
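To make the +1 and -1 cases concrete, here is a minimal hand-rolled sketch of the product-moment coefficient in plain Perl. The pearson function is written for illustration and is not the Statistics::Basic implementation. The third case shows that any increasing straight line, not only y=x, gives +1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson product-moment correlation coefficient of two equal-length lists.
sub pearson {
    my ( $x, $y ) = @_;
    my $n = scalar @$x;
    die "lists must be the same non-zero length\n"
        unless $n && $n == scalar @$y;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;
    my ( $sxy, $sxx, $syy ) = ( 0, 0, 0 );
    for my $i ( 0 .. $n - 1 ) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    die "correlation is undefined when a list is constant\n"
        unless $sxx && $syy;
    return $sxy / sqrt( $sxx * $syy );
}

my @x = ( 1 .. 10 );
printf "y = x     : %.2f\n", pearson( \@x, [@x] );
printf "y = -x    : %.2f\n", pearson( \@x, [ map { -$_ } @x ] );
printf "y = 2x + 3: %.2f\n", pearson( \@x, [ map { 2 * $_ + 3 } @x ] );
```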
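Since entropy came up, here is a small sketch of measuring it directly instead of going through a correlation coefficient. It computes the Shannon entropy of a byte string in bits per byte; byte_entropy is a hypothetical helper name and the sample inputs are made up. Values near 8 are what encrypted data tends to look like, but compressed data scores high too, so a high value is a hint rather than proof.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shannon entropy of a byte string, in bits per byte (0 to 8).
sub byte_entropy {
    my ($data) = @_;
    my $len = length $data;
    return 0 unless $len;

    # Count how often each byte value occurs.
    my %count;
    $count{$_}++ for unpack 'C*', $data;

    # Sum -p * log2(p) over the byte values that occur.
    my $h = 0;
    for my $c ( values %count ) {
        my $p = $c / $len;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

my $uniform  = pack 'C*', 0 .. 255;    # every byte value exactly once
my $constant = 'A' x 256;              # one byte value repeated
my $text     = 'the quick brown fox jumps over the lazy dog';

printf "uniform : %.2f bits/byte\n", byte_entropy($uniform);
printf "constant: %.2f bits/byte\n", byte_entropy($constant);
printf "text    : %.2f bits/byte\n", byte_entropy($text);
```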
Check out this paper by Geoff Black
http//
Statistical Validation and Data Analytics in eDiscovery
And check out his blog too. It's cool, 'cause he links to mine :)
Remember
"There are three kinds of lies: lies, damned lies, and statistics."
The trouble is that a correlation is not evidence of causation - don't forget that the decline in pirate numbers is directly responsible for global warming. And this is where forensics and statistics usually part ways, I imagine! However, there is, I think, room for correlations between data sets where other evidence supports them, e.g.
Suspect "A" has logged on to computer "C" at times "X", "Y" and "Z".
The websites "W1", "W2" and "W3" were defaced at times "X+n", "Y+n" and "Z+n".
There are files that are on computer "C" that showed up in the defacing.
Although no other evidence exists that Suspect "A" committed the crime, there is a correlation between the times that might lead us to believe that there may be more than coincidence in the events.
This is, however, an opinion, and isn't reliant on fact - so unless asked directly what you "might infer" from the above, it's beyond the realm of a statement of fact.
Thanks for the warning! It sounds like statistics might be similar to forensics in that where you can't jump to conclusions in statistics by saying correlation implies causation, you can't jump to conclusions in forensics by saying a suspect read a file because an artifact like the a-time changed. You have to test your hypothesis which is something I haven't yet learned about.
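One simple way to test whether a correlation could plausibly be chance is a permutation test: shuffle one series many times and see how often the shuffled data correlates at least as strongly as the real data. This is only a sketch under assumptions; the data, the number of rounds, and the function names are all hypothetical, and a small p-value still says nothing about causation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum shuffle);

# Pearson product-moment correlation coefficient of two equal-length lists.
sub pearson {
    my ( $x, $y ) = @_;
    my $n      = scalar @$x;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;
    my ( $sxy, $sxx, $syy ) = ( 0, 0, 0 );
    for my $i ( 0 .. $n - 1 ) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    die "correlation is undefined when a list is constant\n"
        unless $sxx && $syy;
    return $sxy / sqrt( $sxx * $syy );
}

# Fraction of shuffles whose |r| is at least as large as the observed |r|.
# The +1 terms count the observed arrangement itself as one of the rounds.
sub permutation_p_value {
    my ( $x, $y, $rounds ) = @_;
    my $observed = abs( pearson( $x, $y ) );
    my $hits     = 0;
    for ( 1 .. $rounds ) {
        my @shuffled = shuffle @$y;
        $hits++ if abs( pearson( $x, \@shuffled ) ) >= $observed;
    }
    return ( $hits + 1 ) / ( $rounds + 1 );
}

# Hypothetical paired counts (illustrative only).
my @a = ( 1, 2, 3, 4, 5, 6, 7, 8 );
my @b = ( 2, 4, 6, 8, 10, 12, 14, 16 );

printf "r = %.2f, permutation p-value = %.4f\n",
    pearson( \@a, \@b ), permutation_p_value( \@a, \@b, 999 );
```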
Thanks for setting me straight. I guess I need more than one weekend of statistics :) As for the sessions and alerts, I found the relationship first without the correlation coefficient. The Snort alerts about a password guessing attack started at the same time the sessions suddenly increased due to the password guessing attack. I showed they were related first by looking at the Snort and session logs, and then went and quantified how strongly they were related.
Is there a problem with doing it that way, and do you know of any good intro to statistics books that you can recommend?