
Compare multiple files for duplicate data

Post Posted: Tue May 06, 2014 4:09 am

I have been posed a query which I think will require some 'out of the box' thinking, so I thought I'd come here for some help.

I don't have a lot of details yet but should have more when I return to the office.

There is a thought that someone at my client's may have been double-invoicing contractors and billing twice for the same work. It's been identified that some invoices have exactly the same description of the work, but different invoice dates, numbers, etc.

I have been asked if I can go through all the invoices and attempt to locate ones with duplicated descriptions. At this point I don't know if they are Word, Excel or non-searchable PDFs; that was my second issue.

But I'm trying to get my head around a possible way to use software to automate this rather than a hard manual search through thousands of invoices.

I have considered using X-Ways or Intella and searching for the full description. However, as there are multiple invoices and descriptions, it's not a case of a single description being used over and over; there are potentially hundreds of different descriptions, each of which may have been used only two or three times.

UltraCompare is a great little tool for comparing two or three files at a time, but that doesn't really save me much time when I have 10,000 invoices.

Is anyone aware of software that can scan and compare numerous documents, with the ability to filter the results based on, say, the number of matching words, or proximity matches?
I'm thinking of some way to identify files that have, say, more than 20 matching words all within 100 characters (or something like that).

I would have to change the parameters depending on how templated the invoices are, but you get the idea.
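For what it's worth, the matching-words idea above can be sketched in a few lines of Python. This is only a rough illustration, assuming the invoice text has already been extracted; the file names, text and threshold below are made up:

```python
import itertools
import re

def word_set(text):
    """Lowercase and split on non-alphanumerics to normalise each invoice."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similar_pairs(docs, min_common=20):
    """Yield (file_a, file_b, shared_word_count) for every pair of
    documents sharing at least min_common distinct words.
    docs maps filename -> extracted text."""
    sets = {name: word_set(text) for name, text in docs.items()}
    for (a, sa), (b, sb) in itertools.combinations(sets.items(), 2):
        common = len(sa & sb)
        if common >= min_common:
            yield a, b, common

# Made-up invoice text, just to show the shape of the output:
docs = {
    "inv001.txt": "consulting services for site survey and report week one",
    "inv002.txt": "consulting services for site survey and report week two",
    "inv003.txt": "catering supplies",
}
for a, b, n in similar_pairs(docs, min_common=7):
    print(a, b, n)  # inv001.txt inv002.txt 8
```

Note that the pairwise comparison is O(n²), so with 10,000 invoices that is around 50 million set intersections; still workable on one machine, but something to bear in mind.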

Any thoughts?  

Senior Member

Re: Compare multiple files for duplicate data

Post Posted: Tue May 06, 2014 6:56 am

If I understand your problem correctly, you are looking for near-duplicates among a very large number of files in multiple formats.

I think that software designed to help filter out documents before examination may help. One such example might be orcatec.com/. I am fairly certain that Orcatec can accept documents in many formats.
Michael Cotgrove

Senior Member

Re: Compare multiple files for duplicate data

Post Posted: Tue May 06, 2014 10:24 am

Nuix Investigator has many of the features which you describe, but it is not cheap. It is very powerful for this type of work, though, through its use of shingle lists etc.
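For anyone unfamiliar with shingle lists, the basic idea is roughly this (a minimal Python sketch of word shingling and Jaccard resemblance, not Nuix's actual implementation):

```python
def shingles(text, k=5):
    """Return the set of overlapping k-word shingles (word n-grams)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=5):
    """Jaccard similarity of two shingle sets: 1.0 means identical word
    sequences, 0.0 means no k-word run in common."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two invoices with the same description but different dates and numbers still share most of their shingles, so they score close to 1.0 even though they are not byte-identical.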




Senior Member

Re: Compare multiple files for duplicate data

Post Posted: Wed May 07, 2014 1:12 am

I would convert everything to text first, then extract the descriptions, part numbers and service numbers and dump them to a file with a reference back to the source file: one row per description, or some other reference.

I would try to massage the text into a "fixed" format: trim white space, convert to lower case, etc.
Then grep for the content. If the normalisation turns out well, maybe even pivot tables could be used.
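A minimal sketch of that normalise-and-dump step in Python (assuming the text has already been extracted from the invoices; the CSV layout is just one possible choice):

```python
import csv
import re

def normalise(line):
    """Collapse runs of whitespace and lowercase, so that visually
    identical descriptions compare equal."""
    return re.sub(r"\s+", " ", line).strip().lower()

def dump_rows(texts, out_path):
    """Write one CSV row per non-empty line: (source file, normalised line).
    texts maps filename -> extracted text."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, text in texts.items():
            for line in text.splitlines():
                line = normalise(line)
                if line:
                    writer.writerow([name, line])
```

Sort the resulting CSV on the second column and duplicated descriptions end up on adjacent rows; `sort` plus `uniq -d`, Notepad++, or a spreadsheet pivot table can take it from there.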

10,000 invoices, presuming 4 to 10 line items each? 100,000 rows? Cakewalk.

Have you thought of writing a quick Python script for this?

Once the text file is constructed you could use something as simple as Notepad++.

I have used it on much more massive text files for such filtering, sorting, grepping, etc.

Senior Member

Re: Compare multiple files for duplicate data

Post Posted: Wed May 07, 2014 3:13 am

Here is how I would approach this.

If the data is in a database-friendly format such as Excel, CSV or QuickBooks, the entire process becomes much easier, but let's assume it's not (or that you haven't determined this yet) for this case.

First, make sure that the data is in text format. If you are dealing with images or non-text PDFs, you will likely need to OCR them. FTK now has an OCR tool built in.

Take a small data set and identify patterns in the data:

Customer names
Customer numbers
Total invoice amounts
Email addresses
Creation dates
Metadata

Once you can identify a pattern, this will help you build a search that produces similar results.

Next, write regular expressions that cull the data and help you refine similarities. Yes, they are a bit difficult to write, but they are consistently the most accurate and fundamental way to search text data.

If you are not comfortable writing your own regular expressions, you can freelance this work; just verify that they work correctly.

I would recommend starting with general search parameters and narrowing down from there.

For example: find all files with .doc, .pdf or .docx extensions where the customer ID is XXX-XXX and the creation date is within +/- 30 days.
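To illustrate, a few example patterns in Python. The field formats here are pure assumptions (the original post only gives the placeholder XXX-XXX); adjust them to the real invoice layout:

```python
import re

# Assumed field formats -- adjust to match the real invoices.
CUSTOMER_ID = re.compile(r"\b\w{3}-\w{3}\b")         # e.g. ABC-123
AMOUNT = re.compile(r"\$\d{1,3}(?:,\d{3})*\.\d{2}")  # e.g. $1,250.00
DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")          # e.g. 05/06/2014

sample = "Invoice ABC-123 dated 05/06/2014 total $1,250.00"
print(CUSTOMER_ID.findall(sample))  # ['ABC-123']
print(AMOUNT.findall(sample))       # ['$1,250.00']
print(DATE.findall(sample))         # ['05/06/2014']
```

Run each pattern across the extracted text, dump the hits per file, and matching customer IDs or amounts across differently numbered invoices become easy to group.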



Re: Compare multiple files for duplicate data

Post Posted: Wed May 07, 2014 2:07 pm

I would also maybe look into a program called UltraCompare; it has worked pretty well for something similar I had to do.

