
Compare multiple files for duplicate data

6 Posts
6 Users
0 Likes
643 Views
Adam10541
(@adam10541)
Posts: 550
Honorable Member
Topic starter
 

I have been posed a query which I think will require some 'out of the box' thinking, so I thought I'd come here for some help.

I don't have a lot of details yet but should have more when I return to the office.

There is a thought that someone at my client's may have been double-invoicing contractors and billing twice for the same work. It's been identified that some invoices have the exact same description of the work, but different invoice dates, numbers, etc.

I have been asked if I can go through all the invoices and attempt to locate ones with duplicated descriptions. At this point I don't know whether they are Word, Excel, or non-searchable PDFs; that was my second issue.

But I'm trying to get my head around a possible way to use software to automate this rather than a hard manual search through thousands of invoices.

I have considered using X-Ways or Intella and searching for the full description. However, as there are multiple invoices and descriptions, it's not a case of a single description being used over and over; there are potentially hundreds of different descriptions, each of which may have been used only 2 or 3 times.

UltraCompare is a great little tool for comparing 2 or 3 files at a time, but that doesn't really save me much time if I have 10,000 invoices.

Is anyone aware of software that can scan and compare numerous documents, with the ability to filter the results based on, say, number of matching words or proximity matches?
I'm thinking of finding some way to identify files that have, say, more than 20 matching words all within 100 characters (or something like that).

I would have to change the parameters depending on how templated the invoices are, but you get the idea.
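Something along these lines is what I'm imagining, as a very rough Python sketch (purely illustrative: it assumes the invoices have already been exported to plain text into a folder I'm calling invoices_txt, and the 20-word / 100-character thresholds are just the example numbers above):

```python
# Rough sketch only: flag any pair of plain-text invoices that share more
# than MIN_WORDS distinct words inside some WINDOW-character stretch.
# "invoices_txt" and both thresholds are placeholders to be tuned.
import itertools
import re
from pathlib import Path

WINDOW = 100      # character window to compare
MIN_WORDS = 20    # shared words needed before a pair is flagged

def window_word_sets(text, size=WINDOW):
    """Yield the set of words found in each overlapping character window."""
    text = text.lower()
    step = size // 2
    for start in range(0, max(len(text) - step, 1), step):
        yield set(re.findall(r"[a-z0-9]+", text[start:start + size]))

def looks_duplicated(path_a, path_b):
    """True if any window of A shares more than MIN_WORDS words with a window of B."""
    windows_b = list(window_word_sets(path_b.read_text(errors="ignore")))
    for words_a in window_word_sets(path_a.read_text(errors="ignore")):
        if any(len(words_a & words_b) > MIN_WORDS for words_b in windows_b):
            return True
    return False

files = sorted(Path("invoices_txt").glob("*.txt"))
for a, b in itertools.combinations(files, 2):
    if looks_duplicated(a, b):
        print(f"possible duplicate descriptions: {a.name} <-> {b.name}")
```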

Any thoughts?

 
Posted : 06/05/2014 10:09 am
(@mscotgrove)
Posts: 938
Prominent Member
 

If I understand your problem, you are looking for near-duplicates among a very large number of files in multiple formats.

I think that software designed to help filter out documents before examination may help. One such example might be http://orcatec.com/. I am fairly certain that Orcatec can accept documents in many formats.

 
Posted : 06/05/2014 12:56 pm
(@dan0841)
Posts: 91
Trusted Member
 

Nuix Investigator has many of the features which you describe, but it is not cheap. It is very powerful for this type of work, though, through its use of shingle lists etc.

https://www.nuix.com/How-word-shingles-can-increase-search-relevance
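For anyone who wants to see the idea outside a commercial tool, here is a minimal sketch of word shingling in Python (not Nuix's implementation, just the general technique): break each description into overlapping word "shingles" and score pairs of documents by how many shingles they share.

```python
# Minimal illustration of word shingling + Jaccard similarity.
# The two example descriptions below are invented.
import re

def shingles(text, k=3):
    """Return the set of k-word shingles from a piece of text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: 0.0 = no overlap, 1.0 = identical."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

desc1 = "supply and installation of ducting to level 3 plant room"
desc2 = "supply and installation of ducting to level 3 plant room, stage two"
print(jaccard(shingles(desc1), shingles(desc2)))   # high score -> near duplicate
```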

Cheers

Dan

 
Posted : 06/05/2014 4:24 pm
jhup
(@jhup)
Posts: 1442
Noble Member
 

I would convert everything to text first, then extract the descriptions, part numbers, and service numbers, and dump them to a file with a reference back to the source files: one row per description, or some other reference scheme.

I would try to massage the text into a "fixed" format: trim whitespace, lower-case it, etc.
Thereafter, grep for the content. If the normalization turns out well, maybe even pivot tables could be used.

10,000 invoices, presuming 4 to 10 line items each. 100,000 rows? Cakewalk mrgreen

Have you thought of writing a quick Python script for this?

Once the text file is constructed, you could use something as simple as Notepad++.

I have used it on much larger text files for this kind of filtering, sorting, grepping, etc.
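For what it's worth, here is a very rough Python sketch of the "normalize then group" step (the invoices_txt folder name and the idea of treating each longer line as a candidate description are just assumptions for illustration):

```python
# Rough sketch: lower-case and collapse whitespace, then group identical
# normalized description lines across all the text dumps.
import re
from collections import defaultdict
from pathlib import Path

def normalize(line):
    """Lower-case and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", line.strip().lower())

seen = defaultdict(set)   # normalized description -> invoice files containing it

for txt in Path("invoices_txt").glob("*.txt"):        # hypothetical folder of text dumps
    for line in txt.read_text(errors="ignore").splitlines():
        desc = normalize(line)
        if len(desc) > 20:                            # skip short / boilerplate lines
            seen[desc].add(txt.name)

# Report any description line that turns up in more than one invoice
for desc, files in sorted(seen.items()):
    if len(files) > 1:
        print(f"{len(files)} invoices share: {desc[:80]}")
        print("    " + ", ".join(sorted(files)))
```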

 
Posted : 07/05/2014 7:12 am
(@steve_linn)
Posts: 2
New Member
 

Here is how I would approach this.

If the data is in a database format such as Excel, CSV, or QuickBooks, this makes the entire process much easier, but let's assume it's not (or you haven't determined this yet) just for this case.

First, make sure that the data is in text format. If you are dealing with images or non-text PDFs, you will likely need to OCR them. FTK now has an OCR tool built in.

Take a small data set and identify patterns of data.

Customer names
Customer numbers
Total invoice amounts
email addresses
Creation dates
Metadata
…etc

Once you can identify a pattern, this will help you build a search that can produce similar results.

Next, write regular expressions that cull the data and help you refine similarities. Yes, they are a bit difficult to write but they are consistently the most accurate and fundamental way to search text data.

If you are not comfortable writing your own regular expressions, you can freelance this work; then verify they work correctly.

I would recommend starting with general search parameters and narrowing down from there.

For example: find all files with .doc, .pdf, or .docx extensions where the customer ID = XXX-XXX and the creation date is within +/- 30 days.
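To make that concrete, here is a small illustration in Python. The patterns below are invented for the example (an XXX-XXX style customer ID, a dd/mm/yyyy date, and a money amount); real invoices would need patterns built from the sample set described above.

```python
# Illustrative regexes for pulling key fields out of OCR'd invoice text.
import re

CUSTOMER_ID = re.compile(r"\b[A-Z]{3}-\d{3}\b")            # e.g. ABC-123
INVOICE_DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")        # e.g. 06/05/2014
AMOUNT = re.compile(r"[$£]\s?\d{1,3}(?:,\d{3})*\.\d{2}")   # e.g. $4,250.00

sample = """Invoice 10442   Date: 06/05/2014
Customer: ABC-123
Total due: $4,250.00"""

print(CUSTOMER_ID.findall(sample))    # ['ABC-123']
print(INVOICE_DATE.findall(sample))   # ['06/05/2014']
print(AMOUNT.findall(sample))         # ['$4,250.00']
```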


 
Posted : 07/05/2014 9:13 am
(@cs1337)
Posts: 83
Trusted Member
 

I would also maybe look into a program called "UltraCompare"; it has worked pretty well for something similar I had to do.

 
Posted : 07/05/2014 8:07 pm