Unique File Identification in the National Software Reference Library
2.0 NSRL File Signatures
The NSRL uses a mathematical technique called hashing to produce file signatures. Currently, the NSRL uses three methods to create signatures: two are cryptographic hash algorithms and one is an error-checking technique called a cyclic redundancy check (CRC). The CRC has known limitations as a means of generating file signatures: the CRC signatures within the NSRL are not unique, because the same signature may be associated with more than one file.
Hash algorithms work by taking an input, in this case a file, and using advanced mathematics to compress it to a fixed-length string of zeros and ones. The specific cryptographic hash algorithms used to generate file signatures are one-way functions: a given input always produces the same output, and the process cannot be reversed. The output is called a hash and, in the context of the NSRL, is referred to as a file signature.
One important property exhibited by these types of hash algorithms is that their output is randomly distributed across the entire range of possible outputs (see footnote 7). This property explains why one-way hash functions are frequently used as building blocks in random number generators [17,27].
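As a rough illustration of this property, the sketch below hashes a set of highly structured inputs with SHA-1 and measures the fraction of output bits set to 1, which should be close to one half if the output is evenly distributed. The inputs (the decimal strings "0" through "9999") and the helper name `ones_fraction` are assumptions chosen for illustration, and a single bit-balance figure is a sanity check only, not a full statistical test of randomness.

```python
import hashlib


def ones_fraction(digests):
    """Return the fraction of bits set to 1 across a collection of digests."""
    total_bits = 0
    ones = 0
    for digest in digests:
        total_bits += 8 * len(digest)
        ones += bin(int.from_bytes(digest, "big")).count("1")
    return ones / total_bits


# Hypothetical inputs: short, highly structured strings rather than real files.
digests = [hashlib.sha1(str(i).encode()).digest() for i in range(10_000)]
fraction = ones_fraction(digests)
print(f"fraction of 1 bits: {fraction:.4f}")
```

Even though the inputs differ by only a character or two, the bit balance of the digests should sit very near 0.5.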
The specific hash algorithms used by the NSRL to generate the file signatures are MD5 and SHA-1, both of which are cryptographic hash algorithms. MD5 is the older of the two and is defined in Internet Engineering Task Force (IETF) Request for Comments (RFC) 1321. SHA-1 is a Federal Information Processing Standard (FIPS) promulgated by NIST as FIPS PUB 180-2. The third technique is a 32-bit version of the CRC error-checking method, as defined by [8,32].
NIST plans to add file signatures generated by other hash algorithms in the future, including those identified in FIPS PUB 180-2 (SHA-256, SHA-384, SHA-512).
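A minimal sketch of how the three signatures could be computed for a single file is shown below, using Python's standard `hashlib` and `zlib` modules. The function name `file_signatures`, the chunk size, and the uppercase hexadecimal formatting are illustrative assumptions, and it is assumed (not confirmed by the text) that the CRC-32 variant implemented by `zlib.crc32` matches the one the NSRL uses.

```python
import hashlib
import zlib


def file_signatures(path, chunk_size=1 << 20):
    """Return MD5, SHA-1, and CRC-32 signatures of a file as hex strings."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)  # running CRC over all chunks so far
    return {
        "MD5": md5.hexdigest().upper(),
        "SHA-1": sha1.hexdigest().upper(),
        "CRC32": format(crc & 0xFFFFFFFF, "08X"),
    }
```

The output lengths make the difference between the methods concrete: 32 hex digits (128 bits) for MD5, 40 (160 bits) for SHA-1, and only 8 (32 bits) for the CRC.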
An important issue is whether the methods used to create file signatures in the NSRL produce unique results. This core characteristic is the basis for the use of the NSRL within the forensic community. Because the NSRL is used for file identification, this analysis examines the uniqueness of that identification through both an empirical analysis of the file signatures within the NSRL and a review of current research on the underlying algorithms used to generate the file signatures.
7 For a detailed look at hash algorithms, see Applied Cryptography by Bruce Schneier.
Specifically, the research addresses the following three questions:
1. Do collisions occur within the NSRL?
A collision occurs if two different files generate the same file signature. Since the NSRL uses file signatures to identify known files and applications, a collision could cause files to be incorrectly identified. For each hash algorithm used within the NSRL, the file signatures were examined to identify any collisions.
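One way such an examination might be carried out is sketched below: records are grouped by one signature, and any signature value that maps to more than one distinct file is reported. This is an assumption about the approach, not a description of the procedure actually used. In particular, the sketch treats a second, independent signature (here SHA-1) as a stand-in for file identity, and the field names and record values are made up for illustration.

```python
from collections import defaultdict


def find_collisions(records, signature_field, identity_field):
    """Report signature values that are shared by more than one distinct file.

    `records` is an iterable of dicts. `identity_field` names a value assumed
    to distinguish file contents (here, a second, independent signature).
    """
    seen = defaultdict(set)
    for record in records:
        seen[record[signature_field]].add(record[identity_field])
    return {sig: ids for sig, ids in seen.items() if len(ids) > 1}


# Hypothetical records; the signature values are placeholders, not real hashes.
records = [
    {"name": "a.dll", "sha1": "S1", "crc32": "0000AAAA"},
    {"name": "b.dll", "sha1": "S2", "crc32": "0000AAAA"},  # same CRC, new SHA-1
    {"name": "copy_of_a.dll", "sha1": "S1", "crc32": "0000AAAA"},  # duplicate
    {"name": "c.exe", "sha1": "S3", "crc32": "0000BBBB"},
]
crc_collisions = find_collisions(records, "crc32", "sha1")
print(crc_collisions)
```

Note that the duplicate copy of `a.dll` is not counted as a collision: identical files are expected to share a signature, and only different files with the same signature are of interest.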
2. How likely is it that a collision will occur in the future?
To address this question, it is necessary to analyze aspects of randomness in relation to the hash algorithms used to generate the file signatures. Two key properties of cryptographic hash algorithms are the randomness of their output and the large number of potential outputs that can be generated. If the file signatures conform to these properties, the standard methods for evaluating the chance of a collision are well known. We examine the overall randomness of the file signatures and whether hashing files biases the resulting signatures.
• Are the file signatures within the NSRL random?
The algorithms used to generate the NSRL file signatures have been thoroughly tested throughout the cryptographic community. However, an empirical analysis of such a large set of hashes generated from files has not been publicly disseminated. If the NSRL file signatures adhere to the properties of the underlying hash algorithms, they should appear to be random (see footnote 8).
• Do files used as input bias the randomness of hash signatures within the NSRL?
The NSRL file signatures are generated directly from files, which range in size from a few bytes to gigabytes. Files and applications typically have some form of structure, repetition, and other modulated patterns, which may subtly affect the distribution of values produced by the hash algorithm.
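If the signatures do behave as uniformly random values, the standard estimate referred to under question 2 is the birthday bound: for n distinct files and a b-bit signature, the probability of at least one collision is approximately 1 - exp(-n(n-1)/2^(b+1)). The sketch below evaluates that approximation for the three signature lengths. The file count is a hypothetical round number, not an NSRL figure, and the calculation assumes exactly the uniformity that the two sub-questions above set out to check.

```python
import math


def collision_probability(num_files, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs = num_files * (num_files - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


n = 10_000_000  # hypothetical number of distinct files, not an NSRL figure
p_crc32 = collision_probability(n, 32)
p_md5 = collision_probability(n, 128)
p_sha1 = collision_probability(n, 160)
print(f"CRC-32: {p_crc32:.6f}  MD5: {p_md5:.3e}  SHA-1: {p_sha1:.3e}")
```

Under these assumptions, a 32-bit signature is essentially certain to collide at this scale, which is consistent with the CRC signatures not being unique, while accidental collisions in the 128-bit and 160-bit signatures remain extremely improbable.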
3. Are the hash algorithms used to generate the file signatures able to be manipulated to create collisions relevant to the NSRL?
Recently, methods for finding collisions in MD5 and SHA-1, two hash functions included in the NSRL, have been discovered. Currently, there are multiple examples of MD5 collisions [14,30], but as yet no SHA-1 collision has been documented. These specific types of attacks reduce the usefulness of MD5 and SHA-1 for some, but not all, applications. They were examined here to see whether they could have any impact on the NSRL within typical forensic settings.
8 Cryptographic hash algorithms such as MD5 and SHA-1 produce output that should be indistinguishable from a random sequence of numbers, given that the input is not known.
The three primary questions, although related, require different approaches. The first two, examining the NSRL for collisions and assessing the likelihood of future collisions, were addressed through an empirical analysis of the file signatures within the NSRL. The last question, evaluating the impact of attacks on the hash algorithms used within the NSRL, relies on research into the specific attacks and an analysis of them in the context of how the NSRL is used in computer forensics.