±Forensic Focus Partners

Become an advertising partner

±Your Account


Username
Password

Forgotten password/username?

Site Members:

New Today: 3 Overall: 35876
New Yesterday: 2 Visitors: 144

±Latest Articles

±Follow Forensic Focus

Forensic Focus Facebook PageForensic Focus on TwitterForensic Focus LinkedIn GroupForensic Focus YouTube Channel

RSS feeds: News Forums Articles

±Latest Videos

±Latest Jobs

Unique File Identification in the National Software Reference Library

Unique File Identification in the National Software Reference Library



Page: 1/8

Steve Mead
National Institute of Standards & Technology
100 Bureau Drive, Stop 8970
Gaithersburg, MD 20899
[email protected]

Abstract:

The National Software Reference Library (NSRL) provides a repository of known software, file profiles, and file signatures for use by law enforcement and other organizations involved with computer forensic investigations. The NSRL is comprised of three major elements:

  1. A physical library of commercial software packages.

  2. A database of information about each file within each software package.

  3. A smaller database of the most widely used information that is updated and released quarterly. This database is called the NSRL Reference Data Set (RDS) and is NIST Special Database #28 [18].

During a forensic investigation, hundreds of thousands of files may be encountered. The NSRL is used to identify known files. This can reduce the amount of time spent examining a computer. Matches for common operating systems and applications do not need to be searched, either manually or electronically, for evidence. Additionally, the NSRL is used to determine which software applications are present on a system. This may suggest how the computer was being used and provide information on how and where to search for evidence.

This paper examines whether the techniques used to create file signatures in the NSRL produce unique results—a core characteristic that the NSRL depends on for the majority of its uses. The uniqueness of the file identification is analyzed via two methods: an empirical analysis of the file signatures within the NSRL and research into the recent attacks on the hash algorithms used to generate the file signatures within the NSRL.

The research addresses the following questions:

  1. Are the file signatures in the NSRL unique? The NSRL was examined for distinct files that generated the same signature (i.e., a collision).

    1. How likely is it that collisions will occur in the future? The probability of future collisions depends directly on the randomness of the file signatures. We ran statistical tests to answer the following questions:

      • Do file signatures appear to be random?

      • Do files bias the randomness of the file signatures in any detectable way?

    1. Do the recent attacks on MD5 and SHA-1 pose any specific threats to the NSRL? The conclusions of this paper are:

      • There are no file signature collisions in the NSRL for either MD5 or SHA-1.

      • There was no detectable bias introduced by hashing files, and so the probability of future collisions is negligible.1

      • Although there are methods to attack the underlying hash algorithms, they are not relevant to the NSRL.

1 The probability of a collision between hashes in either MD5 or SHA1 is so small that it is effectively zero. Even if the size of the NSRL doubles each year in size, it would take more than 50 years before there would be enough SHA-1 file signatures to encounter a collision just by chance.

1.0 Introduction

The National Software Reference Library (NSRL) provides a repository of known software, file profiles, and file signatures for use by law enforcement and other organizations with computer forensic investigations. The NSRL is comprised of three major elements:

  1. A physical library of commercial software packages.

  2. A database of information about each file within each software package.

  3. A smaller database of the most widely used information that is updated and released quarterly. This database is called the NSRL Reference Data Set (RDS) and is NIST Special Database #28 [18].

The NSRL project was initiated in 1999 at the request of the FBI, the DoD Cyber Crime Center and the National Institute of Justice. The project is a part of the forensics sciences program at NIST’s Office of Law Enforcement Standards.2 The first release of the NSRL RDS3 was in 2001.

As of February 2006, the NSRL consists of over 7083 software application packages that include over 34 million files. Since many of the files are used within multiple applications, there are many duplicate files within the NSRL. Currently, there are over 10 million unique files. 4

During a forensic investigation, hundreds of thousands of files may be encountered. The NSRL is used to identify known files. This can reduce the amount of time spent examining a computer. Matches for common operating system files or applications do not need to be searched, either manually or electronically, for evidence. For example, if the forensic examiner was searching images on a Microsoft Windows™ 2000 system, a comparison against the NSRL would identify over 4000 image files that come with the standard Windows™ installation5. These specific files could be excluded from further examination, as the content is known, and would not contain evidence.

Additionally, some NSRL matches are used to determine what software applications were used on a system. This may provide information for the investigator to determine how and where to search for evidence. For example, if a computer system contains applications to crack passwords, keyboard loggers, or rootkit6 packages, this may lead to

further investigation to determine if the system was used to hack into other computer systems. This type of matching could also be used to resolve an intellectual property question if a system contained proprietary software for which the system owner had no license.

2 Additional information on the Office of Law Enforcement Standards (OLES) is at http://www.eeel.nist.gov/oles/index.html. 3 The RDS is available at http://www.nsrl.nist.gov. 4 The most recent version of the NSRL, 2.11 contains 34,994,666 files, and 10,793,831 unique SHA-1 file signatures. 5 Certain commercial software, equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose. 6 A rootkit commonly refers to software installed after an attacker has gained access to a system to allow continued access, and to actively hide traces of the attacker’s activity.






Next Page (2/8) Next Page