Table of contents
1 Introduction
2 Sampling basics
2.1 The necessity for sampling
2.2 Determine sample size
2.2.1 The level of precision
2.2.2 The confidence level
2.2.3 Degree of variability
2.3 Using formulas to calculate a sample size
2.4 Why random sampling
3 Application of random sampling in different cases
3.1 Pornographic cases
3.2 Detect fraudulent correspondence
4 Example program
5 Conclusion
5.1 Acknowledgements
1 Introduction
In this paper we address a few problems encountered in the digital forensic field that will probably get worse if our methods do not become smarter soon. Among the problems the digital forensic community has to deal with are:
· The amount of data that needs to be investigated in cases increases every year;
· Forensic software is unstable when processing large quantities of data;
· Law enforcement has a huge backlog of cases that cannot be processed in time;
· More and more pressure is placed on digital forensic investigators to produce reliable results in a small amount of time.
So what can we do to be more effective and investigate the right data at the right time? In this paper we propose a solution based on the technique of random sampling, applied to the field of digital forensics. The goals of this paper are to:
· explain when and why random sampling might be useful in a digital forensic investigation;
· present the reader with background information on relatively straightforward random sampling techniques;
· describe a number of cases where random sampling might be used to drastically reduce the amount of work required in a digital forensic investigation, without a significant negative impact on the reliability of the investigation.
To the authors' knowledge, random sampling is rarely applied in digital forensics[1], even though it is often used in other forensic fields. One example is the use of statistics to estimate the quantity and quality of illegal drugs: if a large number of amphetamine tablets is found, how can we determine whether every tablet contains amphetamine without testing every tablet? By using statistical sampling techniques to select an appropriate sample for testing, we can make a reliable estimate of the quantity and quality of the total population of tablets. Statistical sampling techniques are also used in financial auditing and in fraud investigations to detect fraud in large populations.
One of the reasons for writing this paper was a news article in the Dutch media about a child pornography case. A suspect had 49,500 pictures containing child pornography on his computer.[2] There are many techniques to detect known child pornography, for example known hash sets or skin-tone detection, but a lot of unknown material still has to be reviewed by a certified investigator to see whether it fits the criteria for child pornography. After reading the article we had the following questions:
· With what degree of certainty was it determined that exactly 49,500 child pornographic pictures were found on the suspect's computer?
· How long did the investigation take before this could be determined, and at what cost?
· Does the exact amount of child pornographic material found on a suspect's hard drives directly influence the length of their sentence?
Since the number of files on hard drives is increasing every year, a smarter method of investigating these populations has to be established. This paper presents a novel solution based on random sampling methods applied in digital forensic investigations. The paper is organized as follows:
· Section 2 covers sampling basics, sample size and sampling techniques;
· Section 3 addresses the application of sampling in different forensic investigations;
· Section 4 shows an algorithm for a simple random sampling application based on file extensions;
· Section 5 contains the conclusion.
2 Sampling basics
In digital forensics we collect data from suspects and analyze the data acquired from their computers or cellular devices. In many investigations we need to examine large volumes of data of a specific kind. In an ordinary child pornography investigation the examiner has to review tens of thousands of pictures or movies and assess whether they fit the criteria of child pornography. Take the investigation from the introduction of this paper, in which 49,500 pictures contained child pornography, and assume that the total number of pictures was 100,000. In statistics the total collection of elements from which we can take a sample is called the population. One picture in the example above is called a case or element. If the examiner reviews every case or element (here, all 100,000 pictures), this is called a census. If instead we select only some of the cases or elements using a specific method, the selection is called a sample. There are many sampling techniques. In this paper we will only discuss the random sampling technique.
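The terms above can be made concrete with a minimal sketch of a simple random sample. The file names, the function name `draw_simple_random_sample` and the sample size of 400 are assumptions made purely for illustration; they do not come from the case described in the text.

```python
import random


def draw_simple_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct elements, each chosen with equal probability."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(population, sample_size)


# Hypothetical population: identifiers for 100,000 pictures (a census would review all of them).
population = [f"picture_{i:06d}.jpg" for i in range(100_000)]

# Hypothetical sample size of 400, chosen only for illustration.
sample = draw_simple_random_sample(population, 400, seed=1)
print(len(population), len(sample))
```

Recording the seed alongside the results is one possible way to let another examiner reproduce exactly the same selection.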
2.1 The necessity for sampling
In the introduction we argued that, with the ever-growing population of files and the demand for quicker results, a complete review of all material is not feasible in some digital forensic investigations. Sampling is a good alternative to a complete census if [3]:
· researching the entire population is not possible in practice;
· budget limitations make it impossible to examine the total population;
· time limits make it impossible to research the entire population;
· all data has been collected, but you need to produce results quickly.
With sampling you can use observations about the sample to make a reliable statement about the entire population. We believe that the problems we currently face in many digital forensic investigations are a good reason to explore the use of sampling in these cases.
2.2 Determine sample size
Before we can take a sample in a digital forensic investigation, we need to determine the sample size. The sample size depends on a number of factors, including the purpose of the study, the population size, the risk of selecting a bad sample and the allowable sampling error. The examples and definitions in this section are based on a paper about determining the sample size. [4]
2.2.1 The level of precision
The level of precision is sometimes called the sampling error. It is the range in which the true value of the population is estimated to lie. This value is usually expressed as a percentage (for example +/- 5%) and needs to be determined by the investigator before sampling, because at a given confidence level the level of precision has a significant effect on the sample size.
So if a digital forensic examiner finds that 85% of the JPEG files in the sample are classified as pornography, and has set the level of precision at 5%, then the examiner can conclude, with a certain confidence level, that between 80% and 90% of the entire population of JPEG files contains pornography.
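The arithmetic in this example can be sketched as follows. The function name `precision_interval` is hypothetical, and clamping the bounds to the range 0 to 1 is an added assumption so that the result always remains a valid proportion.

```python
def precision_interval(sample_proportion, precision):
    """Return the (lower, upper) range implied by a sample proportion and a level of precision."""
    lower = max(0.0, sample_proportion - precision)  # a proportion cannot drop below 0
    upper = min(1.0, sample_proportion + precision)  # or rise above 1
    return lower, upper


# Values from the example in the text: 85% of sampled JPEG files, +/- 5% precision.
low, high = precision_interval(0.85, 0.05)
print(f"{low:.0%} to {high:.0%}")  # prints "80% to 90%"
```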
2.2.2 The confidence level
“The confidence level or risk level is based on the ideas encompassed under the Central Limit Theorem. The key idea encompassed in the Central Limit Theorem is that when a population is repeatedly sampled, the average value of the attribute obtained by those samples is equal to the true population value.”[5]
This means that if 95% is the selected confidence level, 95 out of 100 samples will have the true population value within the range of precision specified earlier. In practice a 95% confidence level with a +/- 5% level of precision is generally considered reliable.
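The idea of repeated sampling can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population reuses the running example of 49,500 flagged pictures out of 100,000, while the sample size of 400, the 1,000 repetitions and the function name `coverage_rate` are chosen purely for illustration and do not come from the text.

```python
import random


def coverage_rate(population, sample_size, precision, repetitions, seed=0):
    """Fraction of repeated random samples whose estimated proportion lies
    within +/- precision of the true population proportion."""
    rng = random.Random(seed)
    true_proportion = sum(population) / len(population)
    hits = 0
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        estimate = sum(sample) / sample_size
        if abs(estimate - true_proportion) <= precision:
            hits += 1
    return hits / repetitions


# Hypothetical population: True marks a picture that fits the criteria.
population = [True] * 49_500 + [False] * 50_500

# Assumed sample size (400) and number of repetitions (1,000).
rate = coverage_rate(population, sample_size=400, precision=0.05, repetitions=1000)
print(f"{rate:.1%} of the samples fell within +/- 5% of the true value")
```

Under these assumptions the reported fraction should come out close to the 95% confidence level discussed above; the exact figure depends on the seed and on the number of repetitions.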
2.2.3 Degree of variability
The degree of variability in the attributes being measured refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain the given level of precision. The less variable a population, the smaller the sample size. The level of variability is expressed using the 'proportion' or 'P'. A proportion of 0.5 (or 50%) indicates the greatest level of variability, more than either 0.2 or 0.8. This is because 0.2 or 0.8 indicate that a large majority do not or do, respectively, have the attribute of interest. Because a proportion of 0.5 indicates the maximum variability in a population, it is often used to determine a more conservative sample size; that is, the sample size may be larger than if the true variability of the population attribute were used. In this paper we will use formulas that assume a proportion of 0.5, so we can ignore the level of variability without choosing overly optimistic sample sizes.
2.3 Using formulas to calculate a sample size
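Before turning to the formulas, the claim from Section 2.2.3 that a proportion of 0.5 represents the greatest variability can be checked numerically. The sketch below assumes that variability is measured by the term P(1 - P), the variance of a single yes/no observation; the text itself does not state this expression, so it is an assumption of the example, as is the function name `variability`.

```python
def variability(p):
    """Variance of a single yes/no observation with proportion p (assumed measure)."""
    return p * (1 - p)


# Compare the proportions mentioned in the text.
for p in (0.2, 0.5, 0.8):
    print(f"P = {p:.1f}: variability = {variability(p):.2f}")

# Search a grid of proportions for the one with the highest variability.
grid = [i / 100 for i in range(101)]
most_variable = max(grid, key=variability)
print(f"Most variable proportion on the grid: {most_variable}")
```

Because the term is symmetric around 0.5, proportions of 0.2 and 0.8 yield the same variability, which matches the observation that both describe a population with a clear majority.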