As I see it, the hash computed while imaging is calculated over the "stream" of data read from the source and written to the target disk.
As such, when you verify the hash of the written image by comparing it to the one obtained during imaging, you are ONLY confirming that there were no "write errors" on your target, not necessarily that there were no "read errors" on the source.
If you prefer, the hashing is a way to know for sure that the image you examined, and of which you provided a copy to the other party in the trial, has not been tampered with, and that it represents an exact "snapshot" of what was read from the device on a given date.
With perfectly working disks and perfectly working equipment/software, in theory what you read is actually what is on the source.
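Just to make it concrete, here is a minimal sketch (in Python, with hypothetical paths) of "hashing while imaging" and of the later verification of the target:
[code]
import hashlib

CHUNK = 1024 * 1024  # read/write in 1 MiB chunks

def image_and_hash(source_path, target_path):
    """Copy source to target, hashing the data stream as it passes through."""
    h = hashlib.sha1()
    with open(source_path, "rb") as src, open(target_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            h.update(chunk)   # hash what was just read from the source...
            dst.write(chunk)  # ...and write the same bytes to the target
    return h.hexdigest()

def rehash(path):
    """Re-hash a file (e.g. the written image) for verification."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()

# acquisition_hash = image_and_hash("/dev/sdb", "evidence.dd")  # hypothetical paths
# verify_hash = rehash("evidence.dd")
# a match proves the target holds what was READ, not necessarily what is ON the source
[/code]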
But then we need to set aside "pure forensics" for a moment and get to "data recovery".
Hard disks do develop "bad sectors" and do have "malfunctions".
Still, in theory any modern hard disk is intelligent enough to re-map a "weak sector" to a spare one "transparently", and normally the way the disk's internal firmware works is (more or less, see also the sketch after this list):
- let me read (because the OS told me to do so) sector 123456
- hmmm, the ECC sector checksum (or whatever) does not match at first attempt
- let me try to correct the data read through my internal (and not documented) recovery algorithm
- hmmm, nope, it still does not work
- let me try to apply the parity check algorithm (another undocumented feature)
- pheeew, now it matches, good
- to be on the safe side, let me remap sector 123456 to spare sector 999001 (without telling the OS, nor the filesystem) and let me jot down this new translation in my G-list (or P-list or *whatever*)
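Purely as a conceptual model (the real firmware logic is, as said, undocumented), the flow above could be sketched in Python like this:
[code]
class ToyDisk:
    """Toy model only: real firmware internals are undocumented."""

    def __init__(self, sectors):
        self.sectors = dict(sectors)   # sector number -> data
        self.weak = set()              # sectors that fail the first ECC check
        self.g_list = {}               # remap table: logical sector -> spare sector
        self.next_spare = 999001       # hypothetical spare-sector numbering

    def read(self, lba):
        if lba in self.g_list:         # already remapped: use the spare, transparently
            return self.sectors[self.g_list[lba]]
        data = self.sectors[lba]
        if lba not in self.weak:       # ECC matches at the first attempt
            return data
        recovered = data               # pretend the (undocumented) recovery attempts succeed
        spare = self.next_spare        # to be on the safe side, remap to a spare...
        self.next_spare += 1
        self.sectors[spare] = recovered
        self.g_list[lba] = spare       # ...and jot the translation down in the G-list
        return recovered

# disk = ToyDisk({123456: b"some data"}); disk.weak.add(123456)
# disk.read(123456)   # returns the data and silently remaps 123456 -> 999001
[/code]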
It is perfectly possible (in theory and in practice) that in the exact moment you are reading a sector, it "becomes" bad.
What happens then?
In the best case the sector was "weak" but was *somehow* read correctly, it became "bad" a fraction of a nanosecond after having been read, and the disk managed the issue fine.
But what if a given sector passes from "good" to "bad" immediately after you have read it, and the disk does not manage it so gracefully?
The disk, at the next occasion, finds it bad, attempts to recover it and fails (or succeeds but, for *whatever* reason, fails in copying it to the spare sector or in updating the list).
When you then try to re-hash the source drive, you will get either read errors or a different hash.
On the other hand, I believe it is not "common practice" to write the image from the source to several targets at the same time.
So, for a given period of time, you have only a "source" and a "target". The same malfunction may happen to the "target" instead of the source (and you find out only because a new hashing of the target, or of a copy of it, comes out different), in which case I think what is done is to re-image from the original.
In other words, the hashing process is an important part of the procedures but it is not the "only" solution.
A better approach could be a more granular form of hashing, the smallest "atomic" component being a sector or "block".
So you could hash each sector by itself and create a list of hashes, one for each sector, or decide to group 10/100/1000/10000/100000 sectors into a "blocklist" and hash these blocklists.
This would bring, IMHO, two advantages (see also the sketch after this list):
- you know for sure that ONLY a given "blocklist" is affected (and ALL the other ones are fine)
- if more than one blocklist (or many, or all of them) does not hash correctly, then something (be it OS instability, hardware issues or *whatever*) is causing a "generalized" problem before the "whole" image is even completed
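As a minimal sketch (Python again, with an arbitrary choice of sectors per blocklist) of what such "blocklist" hashing could look like:
[code]
import hashlib

SECTOR = 512

def blocklist_hashes(image_path, sectors_per_blocklist=1000):
    """Hash an image in fixed-size groups of sectors, one hash per "blocklist"."""
    block_size = SECTOR * sectors_per_blocklist
    hashes = []
    with open(image_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            hashes.append(hashlib.md5(block).hexdigest())
    return hashes

def differing_blocklists(hashes_a, hashes_b):
    """Indices of the blocklists whose hashes do not match."""
    return [i for i, (a, b) in enumerate(zip(hashes_a, hashes_b)) if a != b]

# one mismatch   -> ONLY that blocklist is affected, all the others are fine
# many mismatches -> something more "generalized" (OS, hardware, ...) is going on
[/code]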
jaclaz
P.S. It seems that not only is the previous idea nothing new, but it has also been taken to the "next" level:
Distinct Sector Hashes for Target File Detection
Joel Young, Kristina Foster, and Simson Garfinkel, Naval Postgraduate School
Kevin Fairbanks, Johns Hopkins University
http//
On further checking, the idea hinted at in the P.S. above is being implemented with the support of digitalcorpora, see here:
http//
And there is an article (of course behind the usual paywall, but the abstract is enough)
http//
which provides an interesting empirical analysis
This article reports the results of a case study in which the hashes for over 528 million sectors extracted from over 433,000 files of different types were analyzed. The hashes were computed using SHA1, MD5, CRC64, and CRC32 algorithms and hash collisions of sectors from JPEG and WAV files to other sectors were recorded. The analysis of the results shows that although MD5 and SHA1 produce no false-positive indications, the occurrence of false positives is relatively low for CRC32 and especially CRC64. Furthermore, the CRC-based algorithms produce considerably smaller hashes than SHA1 and MD5, thereby requiring smaller storage capacities. CRC64 provides a good compromise between number of collisions and storage capacity required for practical implementations of sector-scanning forensic tools.
It confirms my initial thought that one could use a much simpler algorithm than MD5 for block hashing (thus saving computational resources, i.e. time and space for the hash database).
So, one could "keep" the current MD5 or SHA-1 hashes for the "whole image" but use a much simpler CRC algorithm for "block hashing"; since the purpose here is only verification, and not comparison against a database of known hashes, a simple CRC32 would be enough.
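A minimal sketch of such a mixed approach in Python (whole-image MD5 plus per-block CRC32, with an arbitrary block size):
[code]
import hashlib
import zlib

BLOCK_SIZE = 65536  # 64 KiB blocks, one of the sizes discussed just below

def whole_and_block_hashes(image_path):
    """Return (whole-image MD5, list of per-block CRC32 values)."""
    md5 = hashlib.md5()
    crcs = []
    with open(image_path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), b""):
            md5.update(block)
            crcs.append(zlib.crc32(block) & 0xFFFFFFFF)  # 4 bytes per block
    return md5.hexdigest(), crcs
[/code]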
Putting this together with the considerations about block size in other articles on the same subject, particularly this one:
Using purpose-built functions and block hashes to enable small block and sub-file forensics. Simson Garfinkel, Alex Nelson, Douglas White, and Vassil Roussev.
http//
http//
it would make sense to hash with CRC32 blocks of 16,384 bytes, or maybe 32,768 or even 65,536 bytes.
The "overhead" of the hash database would be anything (for an "average" 500 Mib to 1 Tib disk image) between 30 and 200 Mb, IMHO not trifling, but not really preoccupying.
jaclaz