Forensic Focus - Computer Forensics, Computer Forensic Training, Digital Forensics
Digital forensic sampling

Page: 1/3

The application of statistical sampling in digital forensics

Authors: Robert-Jan Mora and Bas Kloet
Company: Hoffmann Investigations, Almere, The Netherlands
URL: http://en.hoffmannbv.nl
Date: 27th March 2010
Version: 1.0

Table of contents

1 Introduction
2 Sampling basics
2.1 The necessity for sampling
2.2 Determine sample size
2.2.1 The level of precision
2.2.2 The confidence level
2.2.3 Degree of variability
2.3 Using formulas to calculate a sample size
2.4 Why random sampling
3 Application of random sampling in different cases
3.1 Pornographic cases
3.2 Detect fraudulent correspondence
4 Example program
5 Conclusion
5.1 Acknowledgements


1 Introduction

In this paper we address a few problems in the digital forensic field that will probably get worse unless our methods become smarter soon. The digital forensic community has to deal with the following problems:

· The amount of data that needs to be investigated in cases increases every year;
· Forensic software is unstable when processing large quantities of data;
· Law enforcement has a huge backlog of cases that cannot be processed in time;
· Digital forensic investigators are under increasing pressure to produce reliable results in a short amount of time.

So what can we do to be more effective and investigate the right data at the right time? In this paper we propose a solution based on random sampling, applied to the field of digital forensics. The goals of this paper are to:

· explain when and why random sampling might be useful in a digital forensic investigation;
· provide the reader with background information on relatively straightforward random sampling techniques;
· describe a number of cases where random sampling might be used to drastically reduce the amount of work required in a digital forensic investigation, without a significant negative impact on the reliability of the investigation.

To the authors' knowledge, random sampling is rarely applied in digital forensics[1], even though it is often used in other forensic fields. One example is the use of statistics to estimate the quantity and quality of illegal drugs: if a large number of amphetamine tablets is found, how can we determine whether every tablet contains amphetamine without testing every tablet? By using statistical sampling techniques to select an appropriate sample for testing, we can make a reliable estimate of the quantity and quality of the total population of tablets. Statistical sampling techniques are also used in financial auditing and in fraud investigations to detect fraud in large populations.

One of the reasons for writing this paper was a news article in the Dutch media about a child pornography case. A suspect had 49,500 pictures with child pornography on his computer.[2] There are many techniques for detecting known child pornography, for example hash sets of known material or skin-tone detection, but a lot of unknown material still has to be reviewed by a certified investigator to see whether it fits the criteria for child pornography. After reading the article we had the following questions:

· With what degree of certainty was it determined that exactly 49,500 child pornographic pictures were found on the suspect's computer?
· How long did the investigation take before this could be determined, and at what cost?
· Does the exact amount of child pornographic material found on a suspect's hard drives directly influence the length of their sentence?

Since the number of files on hard drives is increasing every year, a smarter method of investigating these populations has to be established. This paper presents a novel solution based on random sampling methods applied in digital forensic investigations. The paper is organized as follows:

· Section 2 covers sampling basics, sampling size and sampling techniques;
· Section 3 addresses the application of sampling in different forensic investigations;
· Section 4 presents the algorithm of a simple random sampling application based on file extensions;
· Section 5 contains the conclusion.


2 Sampling basics

In digital forensics we collect data from suspects and analyze the data acquired from their computers or cellular devices. Many investigations require us to examine large volumes of data of a specific kind. In a typical child pornography investigation the examiner has to review tens of thousands of pictures or movies and assess whether they fit the criteria for child pornography. Take the investigation from the introduction of this paper, in which 49,500 pictures contained child pornography, and assume that the total number of pictures was 100,000. In statistics, the total collection of elements from which we can take a sample is called the population. One picture in the example above is called a case or element. If the examiner reviews every case or element (all 100,000 pictures), this is called a census. If we instead select some of the cases or elements using a specific method, this is called a sample. There are many sampling techniques; in this paper we discuss only random sampling.
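The terms above can be made concrete with a minimal Python sketch of simple random sampling. It is only an illustration, not the example program of section 4: the file names, the sample size of 400 and the function name draw_simple_random_sample are hypothetical and chosen for this example.

```python
import random


def draw_simple_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct elements, each equally likely to be chosen."""
    if sample_size > len(population):
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(population, sample_size)


# Hypothetical population: 100,000 picture file names (a census would review all of them).
population = [f"picture_{i:06d}.jpg" for i in range(100_000)]

# The sample: 400 pictures drawn without replacement (the size is an assumption).
sample = draw_simple_random_sample(population, 400, seed=2010)
print(len(sample), "of", len(population), "pictures selected for review")
```

Recording the seed alongside the case notes would let another examiner reproduce exactly the same selection, which may help when the method has to be explained later.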

2.1 The necessity for sampling

In the introduction we gave some arguments why, with the ever-growing population of files and the demand for quicker results, a complete review of all material is not feasible in some digital forensic investigations. Sampling is a good alternative to a complete census if [3]:

· examining the entire population is not possible in practice;
· budget limitations make it impossible to examine the entire population;
· time limits make it impossible to examine the entire population;
· all data has been collected, but you need to produce results quickly.

With sampling you can reliably use observations about the sample to make a statement about the entire population. We believe that the problems we currently face in many digital forensic investigations are a good reason to look at the possibilities of using sampling in these cases.

2.2 Determine sample size

Before we can take a sample in a digital forensic investigation, we need to determine the sample size. The sample size depends on a number of factors, including the purpose of the study, the population size, the risk of selecting a bad sample and the allowable sampling error. The examples and definitions in this section are based on a paper about determining the sample size.[4]

2.2.1 The level of precision

The level of precision is sometimes called the sampling error. It is the range in which the true value of the population is estimated to lie. This range is usually expressed in percentage points (for example +/- 5%) and has to be determined by the investigator before sampling, because at a given confidence level the level of precision has a significant effect on the sample size.

So if a digital forensic examiner finds that 85% of the JPEG files in the sample are classified as pornography, and the level of precision was set at 5%, then the examiner can conclude, at the chosen confidence level, that between 80% and 90% of the entire population of JPEG files contains pornography.
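The arithmetic of this example can be sketched in a few lines of Python. The function name precision_interval is hypothetical; the sketch simply adds and subtracts the chosen precision from the sample proportion and clamps the result to the range 0 to 1.

```python
def precision_interval(sample_proportion, precision):
    """Return the (lower, upper) range implied by a sample proportion and a precision."""
    lower = max(0.0, sample_proportion - precision)  # a proportion cannot drop below 0
    upper = min(1.0, sample_proportion + precision)  # or rise above 1
    return lower, upper


# Values from the example: 85% of the sampled JPEG files, precision of +/- 5%.
low, high = precision_interval(0.85, 0.05)
print(f"Estimated share of pornographic JPEG files: {low:.0%} to {high:.0%}")
```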

2.2.2 The confidence level

“The confidence level or risk level is based on the ideas encompassed under the Central Limit Theorem. The key idea encompassed in the Central Limit Theorem is that when a population is repeatedly sampled, the average value of the attribute obtained by those samples is equal to the true population value.”[5]

This means that if 95% is the selected confidence level, 95 out of 100 samples will have the true population value within the range of precision specified earlier. In practice, a 95% confidence level with a precision of +/- 5% is generally considered reliable.
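A small simulation can make the "95 out of 100 samples" reading tangible. The sketch below is illustrative only: it reuses the hypothetical population of 100,000 pictures with 49,500 flagged ones, and it assumes a sample size of 385, a value commonly associated with a 95% confidence level and +/- 5% precision. The function name coverage_rate is also an assumption.

```python
import random


def coverage_rate(population, sample_size, precision, trials, seed=0):
    """Fraction of repeated random samples whose estimate lies within the precision."""
    rng = random.Random(seed)
    true_proportion = sum(population) / len(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # draw without replacement
        estimate = sum(sample) / sample_size
        if abs(estimate - true_proportion) <= precision:
            hits += 1
    return hits / trials


# Hypothetical population: True marks a picture that fits the criteria.
population = [True] * 49_500 + [False] * 50_500
rate = coverage_rate(population, sample_size=385, precision=0.05, trials=1000)
print(f"{rate:.1%} of the samples fell within +/- 5% of the true value")
```

Under these assumptions the printed rate should be close to 95%, although the exact figure depends on the seed and the number of trials.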

2.2.3 Degree of variability

The degree of variability in the attributes being measured refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision. The less variable a population, the smaller the sample size. The level of variability is expressed using the 'proportion' or 'P'. A proportion of 0.5 (or 50%) indicates the greatest level of variability, more than either 0.2 or 0.8. This is because 0.2 and 0.8 indicate that a large majority do not or do, respectively, have the attribute of interest. Because a proportion of 0.5 indicates the maximum variability in a population, it is often used to determine a more conservative sample size; that is, the sample size may be larger than if the true variability of the population attribute were used. In this paper we will use formulas that assume a proportion of 0.5, so we can ignore the level of variability without choosing overly optimistic sample sizes.

2.3 Using formulas to calculate a sample size
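As a rough illustration of the variability argument from section 2.2.3, the sketch below uses Cochran's well-known formula for large populations, n = z^2 * P * (1 - P) / e^2. This formula is an assumption made for the example and is not necessarily the one used in the remainder of this paper; z = 1.96 corresponds to a 95% confidence level and e is the level of precision.

```python
import math


def cochran_sample_size(p, precision=0.05, z=1.96):
    """Cochran's large-population sample size: n = z^2 * p * (1 - p) / e^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / precision ** 2)


# The product P * (1 - P) peaks at P = 0.5, so that choice yields the largest sample.
for p in (0.2, 0.5, 0.8):
    variability = p * (1 - p)
    print(f"P = {p}: variability {variability:.2f}, sample size {cochran_sample_size(p)}")
```

Under these assumptions P = 0.5 requires the largest sample of the three values, which is why assuming 0.5 is the conservative choice when the true proportion is unknown.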






