
Email deduplication

 D4CS
(@d4cs)
Posts: 9
Active Member
Topic starter
 

Hello all,

I am new to this site and have been in the digital forensics field for a few months. I recently found out that FTK does not hash emails. I have an email set coming from PST files of an Exchange server and from a laptop, and it contains a huge number of duplicate emails. Any suggestions on how to dedupe them? Thanks in advance. Much appreciated.

 
Posted : 17/01/2013 5:36 am
 D4CS
(@d4cs)
Posts: 9
Active Member
Topic starter
 

I have done some research and now understand why it is forensically unsound to hash email to begin with. My question now is how you go about dealing with the huge number of "duplicate" emails on a server. It's my understanding that the same email sent to multiple people will result in multiple files, but how do you deal with duplicates across separate email sets, one from an email server and one from a personal computer that goes through that server?

 
Posted : 18/01/2013 8:41 am
(@c-r-s)
Posts: 170
Estimable Member
 

Hi,
The forensically sound way is to perform the same tasks and analysis on both sources. It's not too inconvenient, as email analysis allows heavily automated processes.
But never merge artifacts! In my opinion, the idea is so prone to errors that it isn't even suitable for e-discovery. Consider that your server and client sources serve completely different purposes and are subject to different human interference. Usually, duplicates should be eliminated only when matching individual results in common timelines, link charts, etc. Before that point, the fact that a communication left traces on two or more systems is information in itself.

-Richard

 
Posted : 18/01/2013 8:28 pm
 jhup
(@jhup)
Posts: 1442
Noble Member
 

Deduplication of e-mail is a touchy subject.

What are you going to deduplicate on?

In my experience, deduplication across multiple mailboxes using to, from, subject, date&time, and sometimes unique ID works, but is still fraught with many issues.

For example, date&time - which one? What if there are automagic timezone adjustments by client software? to - is it the verified source, the SMTP "to" field? What about alias, or "sent in name of"?
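To illustrate the date&time problem, here is a minimal sketch in Python (standard library only) that converts a message's Date header to UTC before comparing. The helper name `to_utc` and the sample header values are made up for illustration, and it only addresses timezone offsets in the header, not which of several timestamps (sent, received, client-adjusted) should be used.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_utc(date_header: str):
    """Parse an RFC 5322 Date header and return a timezone-aware UTC datetime."""
    dt = parsedate_to_datetime(date_header)
    if dt.tzinfo is None:
        # A "-0000" offset means the header carries no zone information.
        # Treating it as UTC is an assumption; flag such messages for review.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# The same instant, as written by two clients in different time zones.
a = to_utc("Thu, 17 Jan 2013 05:36:00 -0500")
b = to_utc("Thu, 17 Jan 2013 10:36:00 +0000")
print(a.isoformat(), b.isoformat(), a == b)
```

Comparing the raw header strings would treat these two as different messages; comparing the normalized values does not.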

I experimented with using a percentage of the content as part of the deduplication, but a simple version change or an automatic conversion from HTML to rich text to plain text would mess the whole thing up. The process requires normalizing all messages to a single format, then deduplicating, then marking the matching originals.
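A rough sketch of that normalize, deduplicate, mark sequence, in Python with the standard library only. Everything here is hypothetical: the dictionary layout (`source`, `from`, `to`, `subject`, `date_utc`, `body`), the function names, and the choice of fields. The tag stripping is deliberately crude, and the actual fields should be whatever was agreed for the matter. Note that nothing is deleted; the originals are only grouped.

```python
import hashlib
import re
from collections import defaultdict
from html import unescape


def normalize_body(body: str) -> str:
    """Reduce HTML or plain-text bodies to one lowercase, whitespace-collapsed form."""
    text = re.sub(r"<[^>]+>", " ", body)  # crude tag strip, not a real HTML parser
    text = unescape(text)                 # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_key(msg: dict) -> str:
    """Hash the normalized fields; the field choice is an assumption."""
    parts = [
        msg["from"].strip().lower(),
        ",".join(sorted(addr.strip().lower() for addr in msg["to"])),
        re.sub(r"\s+", " ", msg["subject"]).strip().lower(),
        msg["date_utc"],  # assumed to be normalized to UTC already
        normalize_body(msg["body"]),
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def group_duplicates(messages):
    """Map each key to the sources of every original that produced it."""
    groups = defaultdict(list)
    for msg in messages:
        groups[dedupe_key(msg)].append(msg["source"])
    return dict(groups)
```

Any group with more than one source is a candidate duplicate set, and the list of sources keeps a record of which systems the message was found on.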

All deduplication methods should be agreed at the meet & confer - and you better be there, or you will end up with a mess on your hands - like an agreement to deduplicate a single mailbox . . .

 
Posted : 18/01/2013 11:57 pm
 D4CS
(@d4cs)
Posts: 9
Active Member
Topic starter
 

Thank you for the replies. Email deduplication seems rather blurred from case to case and an extremely touchy subject. Learned a lot in the process though. Thanks again.

 
Posted : 20/01/2013 10:54 pm
(@armresl)
Posts: 1011
Noble Member
 

I can describe what one CF person did in a matter where it was stated that de-dupe was needed.

They de-duped by message number; I'll elaborate a bit.

Many times messages are listed as Message 01 or Message 001. This person thought that they could just take the first Message 001 and delete all other Message 001s. What difference would it make? There were numerous email addresses, each one having its own Message 001. PSTs, AOL, POP3, it was all there.

It is a sad day for one side when a CF person has dug their heels in on their statements, and it turns out there are up to 10x more emails than they have, missed because of a poorly thought-out de-dupe, and those emails spell the case out in black and white.

 
Posted : 21/01/2013 8:42 am