
Flaw in evidence verification process?

ThePM
(@thepm)
Active Member

Hey guys, I would appreciate your input on a discussion that we had at the office regarding the verification process of evidence files used by most forensic software/hardware.

For years, we were under the impression that after verifying an evidence file/drive, if the MD5/SHA1 hashes match, it was confirmation that the source and the destination data were exactly the same and that we had a "forensic copy" of the source drive.

However, what was pointed out is that if there is an error in the bitstream of data read from the source drive, the erroneous data will be written to the destination file/drive and the cumulative "source" MD5 will be calculated from this erroneous data. When the target MD5 is calculated for verification, it is calculated from the same erroneous data, so the verification MD5 will match the cumulative "source" MD5.
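To make the mechanism concrete, here is a minimal sketch (in Python, not taken from any particular imaging tool; the names are purely illustrative) of a typical acquire-then-verify loop, which shows why a corrupted read slips through:

```python
import hashlib

def acquire_and_verify(source, image_path, block_size=65536):
    """Minimal acquire-and-verify sketch: the 'source' hash is computed from
    whatever bytes the read returned, so a corrupted read is invisible."""
    acq_hash = hashlib.md5()
    with open(source, 'rb') as src, open(image_path, 'wb') as dst:
        while True:
            block = src.read(block_size)   # if this read is corrupted...
            if not block:
                break
            acq_hash.update(block)         # ...the corruption goes into the "source" hash
            dst.write(block)               # ...and into the image
    # "Verification" re-hashes the image that was written from the same bytes,
    # so it matches the acquisition hash even though neither matches the disk.
    ver_hash = hashlib.md5()
    with open(image_path, 'rb') as img:
        for chunk in iter(lambda: img.read(block_size), b''):
            ver_hash.update(chunk)
    return acq_hash.hexdigest(), ver_hash.hexdigest()
```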

But this "verification" absolutely does not mean that we have an exact copy of the source drive, since it has been calculated on erroneous data. The only way to be absolutely certain that the evidence data is a forensic copy of the source data would then be to hash the source drive (aside from SSD drives that bring additional challenges).

Am I missing something here? Is there some data validation during the data transfer that I'm not aware of? Because now, from my standpoint, I can't testify that I'm using a forensic copy of a drive just because the hashes match.

thanks.

Quote
Topic starter Posted : 02/05/2014 8:23 pm
Techie1
(@techie1)
New Member

Yes, although highly unlikely, I think erroneous data from the source media could cause this scenario. You may want to introduce into your procedures a second, separate hash of the source media, preferably using a different tool, to rule out errors in the tool as well.
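As a rough illustration of that second, independent check (assuming the source is still attached, e.g. behind a write blocker; the device path below is just an example), re-reading and hashing the source with a separate script and comparing against the tool's reported hash would expose such an error:

```python
import hashlib

def hash_source(path, block_size=1024 * 1024):
    """Independently re-read and hash the source media."""
    h = hashlib.sha1()
    with open(path, 'rb') as dev:
        for chunk in iter(lambda: dev.read(block_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare against the hash reported by the imaging tool.
# if hash_source('/dev/sdb') != tool_reported_source_hash:
#     print("Source hash mismatch - investigate before relying on the image")
```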

IIRC a read command to IDE/SATA only has the facility to return data - or just not return data. There is no facility or separate channel to indicate errors. This is why, when reading bad sectors, a lot of computers seem to hang: they have a long timeout while waiting for the data to be presented.

Probably need a post from an expert in HDD controllers etc to chip in on error detection on HDD reads.

ReplyQuote
Posted : 03/05/2014 2:36 am
mscotgrove
(@mscotgrove)
Senior Member

I agree with Techie.

I think that an error in reading is very rare. Also, if there is an error, it is normally going to be a repeated block, a block of rubbish, or maybe just a single bit error. These errors will change the hash values.

However, the important point is whether any such very rare error would change the evidence. Again, the chance that it could change a 'no' to a 'yes' is almost zero.

A much bigger concern is how one images a failing disk where one knows that each read of the disk may produce different data.

Overall, a hash value is just one section of the overall 'control' system. If a hash difference is detected, the next stage will be to track down the reason for the difference and then decide if it is significant.

ReplyQuote
Posted : 03/05/2014 1:59 pm
athulin
(@athulin)
Community Legend

For years, we were under the impression that after verifying an evidence file/drive, if the MD5/SHA1 hashes match, it was confirmation that the source and the destination data were exactly the same and that we had a "forensic copy" of the source drive.

You may want to try to trace where that idea comes from. It may apply to some particular piece of software (or even hardware) used in particular circumstances, but it seems unsafe to generalize it beyond that.

The only way to be absolutely certain that the evidence data is a forensic copy of the source data would then be to hash the source drive (aside from SSD drives that bring additional challenges).

A hash can never give you absolute certainty of identity, only absolute certainty of non-identity. Of course, this depends on how you define 'absolute' – my interpretation is obviously 'absolute = with no error at all'.

Besides, hashing is not 'the only way'. You can also compare images bit by bit, without involving any hashing at all. It may be less practical, but it may be more useful, as it also tells you where and how extensive the discrepancies are, which is a basis for a more informed decision about whether the discrepancies affect important evidence or not.
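A rough sketch of what such a comparison might look like (the file paths are placeholders); unlike a single hash, it tells you where, and how much, the two images differ:

```python
def compare_images(path_a, path_b, block_size=65536):
    """Compare two images block by block and report the byte ranges that differ."""
    diffs = []
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        offset = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                diffs.append((offset, offset + max(len(block_a), len(block_b))))
            offset += block_size
    return diffs  # an empty list means the images are identical (including length)
```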

You also seem to assume that a hard disk will give you the same image the next time you image it. While it is probable, under normal conditions, it cannot be taken for granted. If the disk is stored away somewhere, and not actually used, the information on it decays. The next time you image it, you may get additional bad sectors, or changes in a known bad sector, and thus get a different hash. At that point, if the hash is all you go by, you're probably stuck.

Am I missing something here? … Because now, from my standpoint, I can't testify that I'm using a forensic copy of a drive just because the hashes match.

If the hash logged on acquiry matches a repeated image hash, it tells you that the image is unlikely (depending on what hash algorithm is being used) to have been changed between the time of acquiry and the time you perform the hash the second time.

But you should also know how the image was performed: what the source is, what tool was used, how it was configured, whether external conditions affected the operation, and whether the acquiry report from it can be trusted or omits any information, and if it does, how you obtain that information by other means. If there were bad sectors on the source disk, you should have a record of them, and you should know how they were treated (kept, or replaced with zeroes, say).

You should, I think, be able to testify that, within those limits, the image corresponds to the original hard drive.

There are additional issues if an image is taken on an unstable platform: that instability may affect the data acquired. I remember an acquiry I made on a system with bad memory – I was unable to get a solid image until I had identified and removed the bad memory, but I did not get any error indications from the acquiry software. If the power supply is overloaded, or if a laptop has bad batteries (even if it is connected to mains power), you can get some weird behaviour, which may also affect the behaviour of the imaging software. And if you're booting a Live CD for imaging, you may have to inspect – and perhaps even save – any system logs both before and after the acquiry to be able to say that there were no detected problems. (After you've ascertained that logging hasn't been turned off completely, of course.)

ReplyQuote
Posted : 03/05/2014 2:09 pm
PaulSanderson
(@paulsanderson)
Senior Member

Here's an old post of mine from 2004 that bears on the OP's question:

http://osdir.com/ml/security.forensics/2004-04/msg00016.html

In this case, one bit of the 16-bit IDE channel was held at 1 for every single read. The drive was read successfully (and could be re-read by any tool), but the data was corrupt.

Food for thought…

ReplyQuote
Posted : 03/05/2014 6:22 pm
jhup
(@jhup)
Community Legend

First, if we want to be purists, you never, ever have an exact copy. You lack the sync, alignment, gap, ECC, bad blocks, various tables, the controller program, and other nuances from the device. But, I digress…

Does the error originate from the device? Is the part generating the error part of the original evidence?

I think if, for example, there is a bit error generated by an IDE interface on a drive, then that error is part of the data - and should be part of it. Of course tracing it back and identifying it is important. Also, if the error is introduced by the forensic process, it must be removed if possible, or a mitigating solution found for whatever is generating the error.

Example - Would malware in evidence data be part of the evidence? This may sound circular, and contain the answer in itself. Yet I have talked to "forensicators" whose first reaction is to clean the malware.

This goes back to a pet peeve of mine.

We do not need exact "bit-by-bit" copies for forensics. Think about it. Does fingerprint analysis use 100% of an (already partial copy of a) fingerprint? Does DNA analysis use 100% of the DNA?

Here is something that should blow your mind, if you are stuck on "bit-by-bit". In most other forensics fields the evidence is, at least partially, destroyed… 😯

Remember, beyond reasonable doubt.

ReplyQuote
Posted : 04/05/2014 9:11 am
MDCR
(@mdcr)
Active Member

It would probably be better if there were some sort of multi-hash signature that could detect corruption; even if corruption occurred, you would still have a signature saying that 99.99% of the media was intact, and you could probably ignore the whole signature problem as it exists today.

Example:

A+B+C+D+E+F+G+H+I+J
vs
Q+B+C+D+E+F+G+H+I+J

= 90% still intact.
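A toy sketch of that idea, assuming the "signature" is simply a list of per-segment hashes recorded at acquisition time (the names and segment size are illustrative only):

```python
import hashlib

def segment_hashes(path, segment_size=1024 * 1024):
    """Hash fixed-size segments so later corruption can be localized and quantified."""
    hashes = []
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(segment_size), b''):
            hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes

def percent_intact(reference, current):
    """Percentage of segments whose hashes still match the reference signature."""
    matches = sum(1 for r, c in zip(reference, current) if r == c)
    return 100.0 * matches / max(len(reference), 1)
```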

ReplyQuote
Posted : 04/05/2014 9:39 am
jaclaz
(@jaclaz)
Community Legend

This goes back to a pet peeve of mine.

We do not need exact "bit-by-bit" copies for forensics. Think about it. Does fingerprint analysis use 100% of an (already partial copy of a) fingerprint? Does DNA analysis use 100% of the DNA?

Here is something that should blow your mind, if you are stuck on "bit-by-bit". In most other forensics fields the evidence is, at least partially, destroyed… 😯

Remember, beyond reasonable doubt.

Yep :), I would spend a few words on the exact nature of the effects of data corruption (if any) in the imaging process.

One of the most common "issues" (among others) raised by the good guys "obsessed" with the bit-by-bit copy approach (and the use of write blockers, etc.) is that the very moment you connect a disk to a Windows NT system (before the WinFE approach), its signature may be altered.
This happens in two cases:
1) the disk was never connected to a Windows NT system (and thus has a 00000000 disk signature)
2) there is a collision with the disk signature of another disk connected at the same time (probabilistically very rare)
See also here:
http://mistype.reboot.pro/documents/WinFE/winfe.htm#signatures
and more specifically here:
http://reboot.pro/topic/18953-is-winfe-forensically-sound/
http://reboot.pro/topic/18953-is-winfe-forensically-sound/?p=177532

Think of a car accident: you take photos, mark on the road where the vehicles are, take measurements and sketches, then move the vehicles to allow the road to be reopened.

The day after you may decide to re-close the road for a few hours, put back the vehicles exactly where you found them to better understand the dynamics of the crash.

As long as the procedure is adequately documented, it is perfectly "forensically sound".

What I think are "common" false equations are:
"forensically sound"="untouched"
"forensically sound"="identical"
"forensically sound"="unmodified"

I see "forensic sound" also something that has been "touched", "modified" or "moved", as long as this has been done along a procedure and of course a "proper" and repeatable procedure.

So, we attach a disk to a running Windows NT OS (with no automount).
In some cases it may change the disk signature (as seen above, there are several ways to avoid this, or to make a "snapshot" of it beforehand).

How will this affect the presence on the disk of a compromising exchange of e-mails, or of a folder containing tens or hundreds of CP images?
Will a disk signature change be able to create by sheer magic the above incriminating evidence?
More "widely", would a disk signature change produce any change of any kind to other data (except the specific 4 bytes)?
Like altering timestamps, or deleting data or making it unrecoverable?

Taking it a step further, will more serious (wrong) manipulations or changes to the filesystem ever produce those artifacts?
Of course not; in the very worst case, a change to the filesystem will delete (or make inaccessible) some data.

If you think a bit about it, when you carve unallocated space and recover partially overwritten data, what you get is not "really sound" data, but rather fragments (or bits and pieces).
The "sound", "original" data, let's say as an example a Word document has already been altered (by beingn first deleted from the OS and then partially overwritten by another file), yet the parts that you manage to recover and re-assemble can be part of the accusatory or exculpatory evidence.
What if the same Word document becomes corrupt because of a malfunction (or instability, or whatever) of the system while you were imaging it?
Are not the bits and pieces you recover from it "as good as" the bits and pieces you recover from the .doc carved in free space?

jaclaz

ReplyQuote
Posted : 04/05/2014 7:40 pm
ThePM
(@thepm)
Active Member

Thanks everyone for the input.

A couple of remarks about your comments:

I think that an error in reading is very rare. Also, if there is an error, it is normally going to be a repeated block, a block of rubbish, or maybe just a single bit error. These errors will change the hash values.

The assumptions of my colleagues were that, when using USB to create the image (or transfer any large amount of data), there was more risk of errors during the transfer. I cannot say that I share those assumptions, as I have not seen more errors when transferring data via USB than through other connection modes. As for the statement "These errors will change the hash values", this is true if you generate the hash value of the source drive, not if you rely only on the "verification" process of most forensic software/hardware solutions.

First, if we want to be purists, you never, ever have an exact copy. You lack the sync, alignment, gap, ECC, bad blocks, various tables, the controller program, and other nuances from the device. But, I digress…

Of course, I was only considering "user data", not the data that is in the system area or the servo metadata.

Does the error originate from the device? Is the part generating the error part of the original evidence?

I think if, for example, there is a bit error generated by an IDE interface on a drive, then that error is part of the data - and should be part of it. Of course tracing it back and identifying it is important

In the scenario I was debating with my colleagues, the error was introduced during the data transfer, so the error is not present on the source drive. As far as tracing it back goes, I'm not sure how this can be done if the verification process does not indicate an error in the first place.

We do not need exact "bit-by-bit" copies for forensics. Think about it. Does fingerprint analysis use 100% of an (already partial copy of a) fingerprint? Does DNA analysis use 100% of the DNA?

Here is something that should blow your mind, if you are stuck on "bit-by-bit". In most other forensics fields the evidence is, at least partially, destroyed…

I agree that we do not need exact "bit-by-bit". However, if you have the opportunity to use an entire, exact copy of the data, why shouldn't you? I'm not a specialist in other fields of forensics, but I guess that if fingerprint analysts had the choice of working from partial fingerprints or full fingerprints, they would choose full fingerprints.

Again, I agree that we do not need exact copies. With proper documentation, we can definitely explain in court why a copy might not be an exact copy of the source drive. However, I would like to know when my copy is not exact, and that's the issue I'm raising with the verification process. Because until now, when I created an image (or a clone) of a drive using forensic software with the "verify" option and got a "verified successfully" or "hashes matching" result, I assumed it meant that, at the time of capture, the user data from my source drive and my destination drive were identical and that I could testify to that. But, right now, I believe that I cannot swear that the drives are identical, despite the results given by the imaging software/device. A defence attorney who knows his stuff, or who saw this thread, might contradict me on this, and he could be right.

A hash can never give you absolute certainty of identity, only absolute certainty of non-identity. Of course, this depends on how you define 'absolute' – my interpretation is obviously 'absolute = with no error at all'.

I'm not sure I'm following you here… If there is the slightest difference between 2 files (even 1 bit), the hash values of the 2 files will be completely different. So, if the hash values of 2 files match, then it should indicate that they are identical, thus no error at all.

How will this affect the presence on the disk of a compromising exchange of e-mails, or of a folder containing tens or hundreds of CP images?
Will a disk signature change be able to create by sheer magic the above incriminating evidence?
More "widely", would a disk signature change produce any change of any kind to other data (except the specific 4 bytes)?
Like altering timestamps, or deleting data or making it unrecoverable?

I totally agree with you on the fact that an error during the data transfer will not make incriminating evidence appear. I believe, as you said, that the worst thing that might happen is that evidence disappears or timestamps are altered. And that you look bad in court if you testified that the copies were identical…

ReplyQuote
Topic starter Posted : 05/05/2014 8:04 pm
athulin
(@athulin)
Community Legend

A hash can never give you absolute certainty of identity, only absolute certainty of non-identity. Of course, this depends on how you define 'absolute' – my interpretation is obviously 'absolute = with no error at all'.

I'm not sure I'm following you here… If there is the slightest difference between 2 files (even 1 bit), the hash values of the 2 files will be completely different. So, if the hash values of 2 files match, then it should indicate that they are identical, thus no error at all.

I'm afraid that isn't correct – at least not from a strict point of view. (That's why I explained my take on 'absolute'.)

If two files (of unknown contents) are hashed, and the hashes are different, then the files are also different. The only way the same file can produce two different hash sums is if the implementation is bad, or the hash function isn't repeatable … and I ignore those possibilities as uninteresting.

However, if the hashes are the same, there is a small probability that the files may be different. (After all, a hash of fixed width can only distinguish so many files, say N. Now add one extra file to that collection of N files that each hash to a unique hash value. What hash does that additional file get – it must be one of those already calculated, hence a collision. This just demonstrates the fact that there will be such collisions, not that they are likely. In some special cases, we can already generate two files of different contents that have the same hash – while this is quite artificial, it's still a sign of unwanted weakness in the hash function.)

The probability of such a collision has never been well estimated – the closest anyone comes is by assuming the files have random contents, and that any imperfections of the hash function can be ignored, ending up with an estimate of once in 2^(number of bits in the hash) cases. However, as files are extremely unlikely to be random, and as hash functions are known to be less than perfect – though not very much – as regards bit distribution, all we can say just now is that 2^(whatever) is an optimistic estimate. But at present no-one seems to know what the error term is.
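For what it's worth, the usual back-of-the-envelope figures (under the standard, and admittedly idealized, assumption that the hash behaves like a random function) are:

```latex
% Chance that two specific, different inputs get the same b-bit hash:
P_{\mathrm{pair}} \approx 2^{-b} \qquad (\text{e.g. } 2^{-128} \text{ for MD5})

% Chance of any collision among n hashed items (birthday bound):
P_{\mathrm{coll}} \approx 1 - e^{-n(n-1)/2^{b+1}} \approx \frac{n(n-1)}{2^{b+1}}
```

As the post says, these are optimistic estimates: real inputs are not random, real hash functions are not perfect, and nobody has a good handle on the error term.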

While just about everyone seems to prefer the notion that the probability of a collision can be ignored in practice, it seems foolish to insist that it is ignorable absolutely.

That's why I say – different hashes, definitely different files; same hashes, probably same file.

But then I don't have to deal with juries …

ReplyQuote
Posted : 05/05/2014 8:58 pm
jaclaz
(@jaclaz)
Community Legend

As I see it, the hash done while imaging takes into account the "flux" of data read from the source and written to the target disk.

As such, when you verify the hash of the written image by comparing it to the one you obtained during imaging, you are ONLY saying that there were no "write errors" on your target, not necessarily that there were no "read errors" on the source.

If you prefer, the hashing is a way to know for sure that the image you examined, and of which you provided a copy to the other party in the trial, has not been tampered with, and represents an exact "snapshot" of what was read from the device on a given date.

With perfectly working disks and perfectly working equipment/software, and in theory, what you read is actually what is on the source.

But then we need to set "pure forensics" aside for a moment and turn to "data recovery".

Hard disks do develop "bad sectors" and do have "malfunctions".
Still, in theory any modern hard disk is intelligent enough to re-map a "weak sector" to a spare one "transparently", and normally the way the disk's internal OS works is (more or less):

  • let me read (because the OS told me to do so) sector 123456
  • hmmm, the ECC sector checksum (or whatever) does not match at first attempt
  • let me try to correct the data read through my internal (and not documented) recovery algorithm
  • hmmm, nope, it still does not work
  • let me try to implement the parity check algorithm (another not documented feature)
  • pheeew, now it matches, good
  • to be on the safe side, let me remap sector 123456 to spare sector 999001 (without telling the OS, nor the filesystem) and let me jot down this new translation in my G-list (or P-list or *whatever*)

It is perfectly possible (in theory and practice) that in the exact moment you are reading a sector this "becomes" bad.

What happens then?
The sector was "weak" but was *somehow* read correctly; it became "bad" a fraction of a nanosecond after having been read, and the disk managed the issue fine.

But what if a given sector passes from "good" to "bad" immediately after you have read it?
The disk, at the next occasion, finds it bad, attempts to recover it and fails (or succeeds but, for *whatever* reason, fails in copying it to the spare sector or in updating the list).

When you try to re-hash the source drive, you will get either errors or a different hash.

On the other hand, I believe it is not "common practice" to write the image from the source to several targets at the same time.

So, for a given period of time, you have only a "source" and a "target"; the same malfunction may happen to the "target" instead of the source (and you find out only because a new hash of the target, or of a copy of it, comes out different), in which case I think what is done is to re-image from the original.

In other words, the hashing process is an important part of the procedures but it is not the "only" solution.

A better approach could be a more granular form of hashing, the smallest "atomic" component being a sector or "block".
So you could hash each sector by itself and create a list of hashes, one for each sector, or decide to group 10/100/1000/10000/100000 sectors into a "blocklist" and hash these blocklists.
This would bring, IMHO, two advantages (a rough sketch follows the list):

  1. you know for sure that ONLY a given "blocklist" is affected (and ALL the other ones are fine)
  2. if more than one blocklist (or many, or all of them) does not hash correctly, then something (be it OS instability, hardware issues or *whatever*) is causing it in a "generalized" way before the "whole" image is completed
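Here is the rough sketch referred to above: per-block MD5 hashes recorded alongside the whole-image hash, so that a later mismatch can be narrowed down to specific blocks (the block size and names are illustrative only):

```python
import hashlib

def image_with_block_hashes(source, image_path, block_size=64 * 1024):
    """Image a source while recording a whole-image hash plus one hash per block,
    so a later mismatch can be narrowed down to the block(s) that changed."""
    whole = hashlib.md5()
    block_hashes = []
    with open(source, 'rb') as src, open(image_path, 'wb') as dst:
        for block in iter(lambda: src.read(block_size), b''):
            whole.update(block)
            block_hashes.append(hashlib.md5(block).hexdigest())
            dst.write(block)
    return whole.hexdigest(), block_hashes

def find_bad_blocks(reference_hashes, path, block_size=64 * 1024):
    """Return the indices of blocks whose current hash no longer matches.
    (Blocks missing from a truncated file are not reported by this sketch.)"""
    bad = []
    with open(path, 'rb') as f:
        for i, block in enumerate(iter(lambda: f.read(block_size), b'')):
            if i >= len(reference_hashes) or \
               hashlib.md5(block).hexdigest() != reference_hashes[i]:
                bad.append(i)
    return bad
```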

jaclaz

P.S. It seems that not only is the previous idea nothing new, it has also been taken to a "next" level:
Distinct Sector Hashes for Target File Detection
Joel Young, Kristina Foster, and Simson Garfinkel, Naval Postgraduate School
Kevin Fairbanks, Johns Hopkins University
http://www.computer.org/csdl/mags/co/2012/12/mco2012120028.pdf

ReplyQuote
Posted : 05/05/2014 9:26 pm
jaclaz
(@jaclaz)
Community Legend

On further checking, the idea hinted at in the P.S. above is being implemented within/with the support of digitalcorpora; see here:
http://digitalcorpora.org/archives/391

And there is an article (of course behind the usual paywall, but the abstract is enough):
http://www.tandfonline.com/doi/abs/10.1080/15567280802050436

which provides an interesting empirical analysis:

This article reports the results of a case study in which the hashes for over 528 million sectors extracted from over 433,000 files of different types were analyzed. The hashes were computed using SHA1, MD5, CRC64, and CRC32 algorithms and hash collisions of sectors from JPEG and WAV files to other sectors were recorded. The analysis of the results shows that although MD5 and SHA1 produce no false-positive indications, the occurrence of false positives is relatively low for CRC32 and especially CRC64. Furthermore, the CRC-based algorithms produce considerably smaller hashes than SHA1 and MD5, thereby requiring smaller storage capacities. CRC64 provides a good compromise between number of collisions and storage capacity required for practical implementations of sector-scanning forensic tools.

which confirms my initial thought that one could use a much simpler algorithm than MD5 for block hashing (thus saving computational resources, i.e. time and space for the hash database).

So, one could "keep" the current MD5 or SHA-1 hashes for the "whole image" but use a much simpler CRC algorithm for "block hashing"; since the scope here is only verification, and not comparison with a database of known hashes, a simple CRC32 would be enough.

Combining this with the considerations about block size in other articles on the same subject, particularly the one here:

Using purpose-built functions and block hashes to enable small block and sub-file forensics. Simson Garfinkel, Alex Nelson, Douglas White, and Vassil Roussev.

http://www.dfrws.org/2010/program.shtml
http://www.dfrws.org/2010/proceedings/2010-302.pdf
It would make sense to hash with CRC32 blocks of 16,384 bytes, or maybe 32,768 or even 65,536 bytes.
The "overhead" of the hash database would be anything (for an "average" 500 GiB to 1 TiB disk image) between 30 and 200 MB; IMHO not trifling, but not really worrying.

jaclaz

ReplyQuote
Posted : 09/05/2014 10:01 pm