
OCR options

pmow
(@pmow)
Active Member
Joined: 13 years ago
Posts: 12
Topic starter  

FTK is fast. It will use 32 cores if I let it, on all my machines. One step that isn't fast, however, is OCR. It sits there using a single core, which can cause OCR to lag behind the rest of processing for weeks.

I've tried FTK's Leadtools module, which is an alternative OCR library, with no success. EnCase doesn't have OCR unless you get EnCase eDiscovery. X-Ways is the same, although files needing OCR could be exported and processed separately. The problem is that exporting makes a mess of the hierarchy, and long-path issues abound. Also, X-Ways doesn't seem to be as good for my cases (5M+ items).

Does anyone have a solution for OCR processing?

I'm thinking a workaround might be to insert the full text directly into the database. That doesn't appear to be possible with FTK, since dtSearch is a roadblock and there isn't an API, but maybe with an EnScript in EnCase? Rather than reinvent the wheel, I thought I'd ask.


   
Adam10541
(@adam10541)
Honorable Member
Joined: 13 years ago
Posts: 550
 

"FTK is fast"

Sorry, couldn't resist; that's not a statement usually associated with FTK 😉

Intella from Vound Software has a good approach to OCR. During indexing it identifies any "empty documents" (non-searchable PDFs). You can then isolate those documents and export them for OCR in external software (I use Adobe). After they have been OCR'd, you can import them back into the case while maintaining structure and location.

Intella does this by naming the PDF files after their MD5 hash as it exports them. Provided the OCR'd docs keep the same names after the OCR process, Intella puts the newly OCR'd documents back in the correct locations when you import them.
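
For anyone wanting to see the idea outside of Intella, here is a rough Python sketch of hash-based naming. It is not Intella's code or API; the function names (`export_by_hash`, `missing_after_ocr`) and the folder layout are made up for illustration. It copies each unique file to `<md5>.pdf`, remembers every original location for that hash, and reports which hashes have no same-named file in the OCR output folder yet.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_by_hash(sources, export_dir: Path) -> dict:
    """Copy each unique file to <md5>.pdf (hypothetical layout).

    Returns a mapping of hash -> list of original paths, so duplicates
    are exported once but every original location is remembered.
    """
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)
    locations = {}
    for source in sources:
        source = Path(source)
        file_hash = md5_of_file(source)
        if file_hash not in locations:
            shutil.copyfile(source, export_dir / f"{file_hash}.pdf")
        locations.setdefault(file_hash, []).append(source)
    return locations


def missing_after_ocr(locations: dict, ocr_dir: Path) -> list:
    """List hashes that have no same-named PDF in the OCR output folder."""
    present = {p.stem for p in Path(ocr_dir).glob("*.pdf")}
    return sorted(h for h in locations if h not in present)
```

Running `missing_after_ocr` before the re-import is a cheap way to catch an OCR tool that renamed or skipped files, since the name is the only link back to the original location.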

I use this process on most of my cases and it works well.


   
pmow
(@pmow)
Active Member
Joined: 13 years ago
Posts: 12
Topic starter  

Yes, it *can* be fast, if you figure out all the unpublished tweaks and best practices over the years. Also it helps to throw tons of cores at it lol

Intella looks interesting. Initially from the matrix it looks like 250GB is the max, but I'll check out Pro which doesn't have that limit. Thanks!


   
Adam10541
(@adam10541)
Honorable Member
Joined: 13 years ago
Posts: 550
 

They have various levels, 10, 100, 250 and TEAM which is unlimited and comes with some other goodies.


   
pmow
(@pmow)
Active Member
Joined: 13 years ago
Posts: 12
Topic starter  

"They have various levels, 10, 100, 250 and TEAM which is unlimited and comes with some other goodies."

Thanks Adam. Just wanted to add to the thread an update.

We implemented Intella Connect (which is a license of PRO along with the web-based review product). In a pinch, it was able to process an entire disk image, which I found incredibly impressive.

When you OCR, Intella exports using the file hash, so duplicates collapse: 10k PDFs might turn into 6k unique files. I settled on Aquaforest Autobahn DX (with the multicore module) for high-volume OCR. I have a dedicated job that picks files up from an input folder and drops the results into an output folder under the same names. Once imported, the files are indexed and searchable. The only caveat is that to locate the text within, say, a 100-page file, you must run another text search using the search button in the preview.
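
The input/output folder job can be sketched roughly like this in Python. This is only an illustration of the pattern, not how Autobahn DX is configured or invoked: `run_ocr` is a hypothetical placeholder that just copies the file, and a real job would call the OCR engine at that point. Threads are used on the assumption that the heavy lifting happens in an external OCR process, so several files can be in flight across cores at once.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_ocr(source: Path, target: Path) -> None:
    """Placeholder for the real OCR step (hypothetical).

    A real job would invoke an OCR engine here, for example as an
    external process. This stand-in only copies the file.
    """
    shutil.copyfile(source, target)


def process_folder(input_dir: Path, output_dir: Path, workers=None) -> list:
    """OCR every PDF in input_dir into output_dir under the same name.

    Files that already exist in output_dir are skipped, so the job can
    be re-run safely. Returns the names processed in this run.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    pending = [
        p for p in sorted(input_dir.glob("*.pdf"))
        if not (output_dir / p.name).exists()
    ]

    def handle(source: Path) -> str:
        # Same file name in and out keeps the hash-based link intact.
        run_ocr(source, output_dir / source.name)
        return source.name

    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(handle, pending))
```

Keeping the output name identical to the input name is the part that matters for the re-import; everything else is up to whichever OCR tool does the work.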

Current versions of FTK actually do use all cores now for OCR. The result is pretty great, thanks for the advice!


   
(@hanzelmans)
New Member
Joined: 17 years ago
Posts: 3
 

Hello,

Don't forget to look at the quality of the OCR tool. We use Aquaforest to OCR all relevant data exported from Intella. Text recognition is very good, and much better than what FTK can produce. The way Intella exports and imports the files is great. Because of deduplication, there are no duplicate files with the same data. Analysing imported OCR files is very simple because Intella combines the original and the OCR'd files in the same preview. When you find a hit in the OCR'd file, you have direct access to the original file and the parent e-mail (if there is one).

Regards,

Hans Heins
Sr. Forensic investigator @ Hoffmannbv.nl


   
(@nizmon)
Eminent Member
Joined: 16 years ago
Posts: 35
 

Have you tried Nuix? It uses the ABBYY OCR engine and is very quick and accurate, handles multiple languages, and scales with more cores/RAM.


   
pmow
(@pmow)
Active Member
Joined: 13 years ago
Posts: 12
Topic starter  

"Have you tried Nuix? It uses the ABBYY OCR engine and is very quick and accurate, handles multiple languages, and scales with more cores/RAM."

Tried it, it's great from an efficiency perspective for OCR and processing. Unfortunately for my 32-core machines it would be like $60k+.


   
Adam10541
(@adam10541)
Honorable Member
Joined: 13 years ago
Posts: 550
 

The biggest problem with NUIX is its pricing model; the software is fantastic.

I was a NUIX user when I was LE. When I went corporate, the price to keep NUIX was incredibly expensive, and at that time there was no SMS, just full price paid annually.

I made the switch to Intella for a fraction of the price and never looked back. Intella is developing rapidly, and there is very little NUIX does that Intella doesn't. Also, at the time I switched over, NUIX was slow and unstable (not sure how it is now), so that was just a bonus.


   