OCR options

pmow

(@pmow)

Active Member

Joined: 13 years ago

Posts: 12

Topic starter 19/06/2014 3:45 am

FTK is fast. It will use 32 cores if I let it, on all my machines. One step that isn't fast, however, is OCR. It will sit there using a single core for OCR, and this can cause OCR to lag behind the rest of processing for weeks.

I've tried FTK's LEADTOOLS module, which is an alternative library (no success). EnCase doesn't have it unless you get EnCase eDiscovery. X-Ways is the same, although files needing OCR could be exported and processed separately. The issue is that exporting makes a mess of the hierarchy, and long-path issues abound. Also, it seems to not be as good for my cases (5M+ items).

Does anyone have a solution for OCR processing?

I'm thinking a workaround might be inserting the full text into the database. It doesn't appear to be possible with FTK, since dtSearch is a roadblock and there isn't an API, but maybe an EnScript with EnCase? Rather than reinvent the wheel, I thought I'd ask.

Adam10541

(@adam10541)

Honorable Member

Joined: 13 years ago

Posts: 550

19/06/2014 10:45 am

"FTK is fast"

Sorry, couldn't resist, but that's not a statement usually associated with FTK 😉

Intella from Vound Software has a good approach to OCR. During indexing it identifies any "empty documents" (non-searchable PDFs). You can then isolate those documents and export them for OCR via external software (I use Adobe). After they have been OCR'd, you can import them back into the case and maintain structure and location.

Intella does this by naming the PDF files after their MD5 hash as it exports them. Provided you make sure the OCR'd docs keep the same names after the OCR process, Intella inserts the newly OCR'd documents back into the correct locations when you import them.

I use this process on most of my cases and it works well.
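For anyone who wants to see the idea behind hash-based naming, here is a minimal Python sketch. It is not Intella's implementation, just an illustration of the principle described above: each file is exported under its MD5 hash, and a manifest remembers every original location for that hash. The names `md5_of_file`, `export_by_hash` and `locations_for` are hypothetical helpers made up for this example.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_by_hash(sources, export_dir: Path) -> dict:
    """Copy each source file to export_dir/<md5>.pdf (hypothetical layout).

    Returns a manifest mapping each hash to every original path that
    shares it, so identical files are exported (and OCR'd) only once.
    """
    export_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for source in sources:
        source = Path(source)
        file_hash = md5_of_file(source)
        if file_hash not in manifest:
            # First time this content is seen: export one copy of it.
            shutil.copyfile(source, export_dir / f"{file_hash}.pdf")
        manifest.setdefault(file_hash, []).append(source)
    return manifest


def locations_for(ocr_file: Path, manifest: dict) -> list:
    """Map an OCR'd file back to its original locations via its file name."""
    return manifest.get(ocr_file.stem, [])
```

As long as the OCR tool writes its output under the same file name, the name alone is enough to find where the document belongs, and duplicates only need to be OCR'd once.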

pmow

(@pmow)

Topic starter 19/06/2014 5:58 pm

Yes, it *can* be fast, if you figure out all the unpublished tweaks and best practices over the years. Also it helps to throw tons of cores at it lol

Intella looks interesting. Initially from the matrix it looks like 250GB is the max, but I'll check out Pro which doesn't have that limit. Thanks!

Adam10541

(@adam10541)

20/06/2014 6:12 am

They have various levels: 10, 100, 250 and TEAM, which is unlimited and comes with some other goodies.

pmow

(@pmow)

Topic starter 09/09/2015 2:30 am

"They have various levels, 10, 100, 250 and TEAM which is unlimited and comes with some other goodies."

Thanks Adam. Just wanted to add an update to the thread.

We implemented Intella Connect (which is a license of PRO along with the web-based review product). In a pinch, it was able to do an entire disk image, which I found incredibly impressive.

When you OCR, Intella exports using the file hash, so 10k PDFs might turn into 6k unique files. I settled on Aquaforest Autobahn DX (with the multicore module) for high-volume OCR. I have a dedicated job that picks files up from an input folder and drops them into an output folder under the same name. Once imported, the files are indexed and searchable. The only caveat is that to identify the text within, say, a 100-page file, you must do another text search using the preview button for search.

Current versions of FTK actually do use all cores now for OCR. The result is pretty great, thanks for the advice!
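A rough sketch of what such an input-folder/output-folder job could look like in Python is below. This is not how Autobahn DX works internally. `ocr_one` is a hypothetical placeholder that just copies the file, and `run_ocr_job` is a made-up name; swap the copy for a call to whatever OCR engine you use. The sketch uses a thread pool on the assumption that the real OCR step is an external program, so each worker mostly waits on its own process.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_one(job):
    """Placeholder OCR step (assumption): copies the file unchanged.

    Replace the copy with a call to a real OCR engine that reads
    `source` and writes a searchable PDF to `target`.
    """
    source, target = job
    shutil.copyfile(source, target)
    return target.name


def run_ocr_job(input_dir: Path, output_dir: Path, workers: int = 4) -> list:
    """Process every PDF in input_dir and write the result to output_dir
    under the SAME file name, several files at a time."""
    output_dir.mkdir(parents=True, exist_ok=True)
    jobs = [
        (source, output_dir / source.name)
        for source in sorted(input_dir.glob("*.pdf"))
        # Skip files already processed, so the job can be re-run safely.
        if not (output_dir / source.name).exists()
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the jobs list.
        return list(pool.map(ocr_one, jobs))
```

Keeping the output name identical to the input name is the important part, since the hash-based file name is what lets the OCR'd files be matched up again on import.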

Hanzelmans

(@hanzelmans)

New Member

Joined: 17 years ago

Posts: 3

05/10/2015 11:49 pm

Hello,

Don't forget to look at the quality of the OCR tool. We use Aquaforest to OCR all relevant exported Intella data. Text recognition is very good, and much better than what FTK can produce. The way Intella exports and imports the files is great. Because of deduplication, there will be no duplicate files with the same data. Analysing imported OCR files is very simple because Intella combines the original and the OCR'd files in the same preview. When you find a hit in the OCR'd file, you have direct access to the original file and the parent e-mail (if there is one).

Regards,

Hans Heins
Sr. Forensic investigator @ Hoffmannbv.nl

Nizmon

(@nizmon)

Eminent Member

Joined: 16 years ago

Posts: 35

06/10/2015 2:58 am

Have you tried Nuix? It uses the ABBYY OCR engine, is very quick and accurate, handles multiple languages, and scales with more cores/RAM.

pmow

(@pmow)

Topic starter 09/10/2015 4:28 am

"Have you tried Nuix, it uses the Abbyy OCR engine and is very quick and accurate, does multiple languages and scales with more cores/RAM?"

Tried it; it's great from an efficiency perspective for OCR and processing. Unfortunately, for my 32-core machines it would be like $60k+.

Adam10541

(@adam10541)

12/10/2015 6:40 am

The biggest problem with NUIX is its pricing model; the software is fantastic.

I was a NUIX user when I was in LE. When I went corporate, keeping NUIX was incredibly expensive, and at that time there was no SMS, just full price paid annually.

Made the switch to Intella for a fraction of the price and never looked back. Intella is developing rapidly and there is very little NUIX does that Intella doesn't do. Also, at the time I switched over, NUIX was slow and unstable (not sure how it is now), so that was just a bonus.
