±Forensic Focus Partners
|New Today: 0||Overall: 36125|
|New Yesterday: 1||Visitors: 168|
Unique File Identification in the National Software Reference LibraryBack to top Back to main Skip to menu
Unique File Identification in the National Software Reference Library
4.1 Examining the NSRL for Collisions
A collision is defined as two different streams of input that, when hashed, produce the same output. In terms of the NSRL, a collision would occur if two different files produce the same file signature. If the content between two files is different even by a single bit, the files should generate distinct file signatures. It is important to keep in mind that only the file content is used to calculate the file signature—not file metadata such as file name.
Identifying collisions is a relatively straightforward process. The NSRL is searched for all file signatures that appear more than once. If those signatures are generated by different files, then a collision has occurred.
Many of the files within the NSRL have been included multiple times. For example, certain graphic files used in the Windows™ operating system are included in multiple releases of Windows™. Files that had the same file size and the same file signature for each CRC-32, MD5, and SHA-1 were considered duplicates and removed from further consideration9. When looking for collisions, it is important to differentiate between an actual collision (where the file content is different) and cases where the file signatures are the same because the files are duplicates.
After prescreening the data, we compared all of the hashes in search of any collisions. Each set of file signatures generated from a specific algorithm (CRC-32, MD5, SHA-1) was examined and any collisions recorded.
In addition to identifying collisions, we sorted the signatures and recorded information about the spacing between successive hashes. Each file signature is essentially a very large number, and the maximum, minimum, and average distances between sorted hash pairs can be measured. The distance should be related to the specific hash algorithm, as well as the total number of file signatures. If there is substantial grouping or clustering, this could affect the probabilities of a collision, and would need to be investigated further.
9 Due to the logistics and time involved to compare each file byte for byte for duplicates, a combination of all file hashes and file size was used to identify files with identical content.
4.2 Likelihood for Future Collisions
The possibility of future collisions in the NSRL is more difficult to determine given that the NSRL grows in size. However, there are primarily two factors to consider: the total number of hashes an algorithm can generate, and how those values are distributed.
For any hash algorithm, the total number of unique values that can be generated is a function of the specific algorithm. As the total number of unique values increase for an algorithm, the chances that there will be a collision decrease. Currently, the NSRL uses three primary algorithms to generate file signatures, and each of them can generate a different maximum number of hashes. Since the quantity of hashes can vary greatly between the algorithms, in table 1 we use drops of water as an analogy to understand the relative difference between the algorithms.
|File Signature Algorithm Capacity Comparison|
|Algorithm||Max # of Hashes (Base 2)||Max # of Hashes (Base 10)||Relative Size Analogy10 (Drops of Water)|
|CRC-32||232||4.29 X 109||All the water in a small pond|
|MD5||2128||3.40 X 1038||All the water in our galaxy|
|SHA-1||2160||1.46 X 1048||All the water in our universe|
Table 1: File Signature Algorithm Capacity Comparison
The number of potential values a given algorithm can generate is only part of calculating the probability for collision. The other critical aspect to this calculation is how likely any particular hash value is to be used or generated. If all values for a given hash algorithm have an equal probability11 of being used, the odds of a collision occurring can be calculated. The NSRL uses two cryptographic algorithms to generate file signatures, MD5 and the SHA-1. Both of these algorithms generate output that appears to be random. In fact, both of these algorithms have been used as building blocks commonly used in random number generators.
One area of research that has not been extensively studied is the impact of files as input to hash algorithms. Files can vary greatly in size, ranging from a couple of bytes to gigabytes. Additionally, files tend to have internal structures and repeated patterns, which may not be present in other forms of data. We examined if files as input to hash algorithms could bias how the algorithms operate, and change the probability of generating a collision.
10 These values are for comparisons of relative size and assume the following: there are approximately 20 drops of water per ml, the volume of water on earth is 1.3x109 km3, and that each star in our galaxy contains an earth-like volume of water. Additionally, that there are approximately 4x1011 stars in our galaxy, and 8x1010 total galaxies in our universe. 11 This would be described as uniformly random—where all values have an equal chance of being generated.
In relation to future collisions within the NSRL, if files do not bias the generation of file signatures, then the probabilities of a collision are known. However, if we are able to detect bias in file signatures, then the probabilities for a collision may not fit the current models and would need to be reexamined.
In order to identify bias in the NSRL file signatures, our approach focused on two general areas: what tests would need to be performed to identify significant bias and how to organize and structure the NSRL data for analysis.
Randomness is a property that is straightforward to conceptualize, but difficult to define exactly and empirically measure. There is no single statistical or deterministic test that can conclusively identify if a given sequence of numbers is random. Furthermore, there are different aspects to randomness that cannot be captured by one type of test. To have the greatest amount of confidence in the results from any randomness testing, multiple tests need to be performed to detect various types of deviation.
There are two well known sets of tests, the Diehard tests by Marsaglia  and those defined by Knuth in his Seminumerical Algorithms . Each identify a variety of tests that cover a range of random behavior from the simple distribution of 0’s and 1’s (binary frequency tests) to more complex methods that look at various interactions between bits in a sequence.
NIST developed a set of statistical tests specifically to address the need to evaluate random number generators for cryptographic applications. They include variations of those by Marsaglia  and Knuth . However, new tests were introduced such as the Approximate Entropy Test , Berlekamp-Massey Linear Complexity Test , and the Random Excursion Test . This suite was released as the NIST Statistical Test Suite (STS) and the latest release is version 1.8 .