Hello everyone,
I'd like to start a discussion about OCR challenges.
Over the past 10 years, I have seen a steady increase in artefacts that cannot be indexed or searched adequately because they are image files, specifically PNGs and JPGs originating from mobile devices. In a recent case I had over 100K case-relevant photo images to process and analyse.
What I do now to process and analyse them is export the image files, OCR them with a commercial non-DFIR OCR tool, and then import the results back into the case as a logical image file. This is, however, a very time- and resource-consuming process, sometimes taking over a week, which is too long. Cost-wise it is also becoming more and more of an issue, as OCR companies have developed a tendency to charge by the core instead of by the machine. When contacting these companies, the first question they ask is how many files you need to OCR on a yearly basis. From a DFIR perspective that is a difficult question to answer.
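Just to make that workflow concrete: the scripted equivalent of the export-and-OCR step can be quite small. A minimal sketch using the open-source Tesseract engine via pytesseract as a stand-in for whatever commercial engine is used (the folder paths are placeholders, not my actual setup):

```python
# Minimal sketch: batch-OCR a folder of exported images and write .txt
# sidecar files that can be indexed or re-imported alongside the originals.
# Assumes Tesseract plus the pytesseract and Pillow packages are installed.
from pathlib import Path

from PIL import Image
import pytesseract

EXPORT_DIR = Path("exported_images")   # placeholder path
OUTPUT_DIR = Path("ocr_text")          # placeholder path
OUTPUT_DIR.mkdir(exist_ok=True)

for image_path in sorted(EXPORT_DIR.iterdir()):
    if image_path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
        continue
    try:
        with Image.open(image_path) as img:
            text = pytesseract.image_to_string(img, lang="eng")
    except Exception as exc:           # keep going if one image is corrupt
        print(f"skipped {image_path.name}: {exc}")
        continue
    # One sidecar text file per image, named after the original.
    (OUTPUT_DIR / f"{image_path.name}.txt").write_text(text, encoding="utf-8")
```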
I have tried the OCR capabilities of several of the "standard" commercial DFIR tools, and also various commercial non-DFIR OCR tools. My personal opinion is that commercial non-DFIR OCR tools are becoming too expensive for small DFIR companies (I received a quote for 60K). The DFIR tools that have OCR integrated (a lot still don't!) are not really good at it, causing loss of information that could be crucial to the case.
I am curious how other small DFIR firms / consultants are currently addressing OCR challenges.
Your feedback is highly appreciated.
Cheers!
Lex
I've been using OCR for several years outside of forensics, so I have a reasonable idea of what goes wrong in general. (Tools: Calera WordScan, Caere OmniPage, Xerox TextBridge, ABBYY FineReader. Somewhat dated, as you see.) However, I have worked mainly on good-resolution, good-contrast scans of high-quality paper documents, not on random-quality images from random devices, some of which may not even produce correct images, or produce images that need to be massaged before being used as input (TIFF images with inverted black/white values is one example). That is, my knowledge and experience are not necessarily on point: I won't use OCR on poor material, where accuracy is likely to be less than 97% over the relevant corpus. (And even that requires rather extensive proofreading if 100% is the goal.)
If I were faced with taking on a job that would likely involve a significant amount of OCR, I'd ask for sample images with known textual content, to decide if they were suitable in the first place (low-resolution, low-contrast images of car registration plates, from a bad angle or unexpected rotation, in rain or snow, would almost certainly be a quick no), and then evaluate the images that passed with the OCR tools I could use. If the success rate was too low, again, it would be a no, though not quite as quick as before. And of course, OCR tends to require proofreading to catch various errors, and that has known error rates as well.
OCR works best if it can be trained on the relevant corpus of images. If training is not a possibility, and glyph location/orientation is not predictable, accuracy is likely to be low.
It might be possible with some products to get both a best guess at content and an assessment of input quality, perhaps even output quality (like weird Unicode characters found in text that should just be registration plates): if quality is too low, the best-guess text is no longer interesting.
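Where a tool exposes per-word confidence, that kind of triage can be scripted. A rough sketch using the open-source Tesseract engine via pytesseract (the engine choice and the 60% threshold are my own assumptions for illustration, not something from the products mentioned above):

```python
# Rough sketch: use Tesseract's per-word confidence as a crude quality gate,
# so low-quality images get flagged for manual review instead of the
# best-guess text being trusted blindly. The threshold is arbitrary.
from PIL import Image
import pytesseract

def ocr_with_quality(path, min_mean_conf=60.0):
    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        if word.strip() and conf >= 0:   # -1 means "no confidence reported"
            words.append(word)
            confs.append(conf)
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return " ".join(words), mean_conf, mean_conf >= min_mean_conf

text, confidence, usable = ocr_with_quality("sample.jpg")   # placeholder file
print(f"mean confidence {confidence:.1f} -> {'keep' if usable else 'review manually'}")
```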
An expert lab (i.e. expert in OCR application) could probably make faster decisions based on their experience, and probably has its own tools, but such labs are targeted at homogeneous corpora (more like processing of uniform printed pages) rather than random images.
I suspect that third-world typing sweatshops still lead in terms of raw accuracy, though. There used to be a testing company that published OCR tests (mainly office documents), and in at least one of those they included some non-OCR solutions just to have a baseline to compare with. I wish I had saved that report ...
OCR is not something I would expect a small forensic lab to be particularly adept at themselves, and I would expect them to stay away from anything involving post-OCR processing, such as proofreading.
At the very least I would expect a lab to ask for the wanted/expected accuracy and relate that to the performance of their tools/processes (actual characters misread or omitted, spurious characters injected, etc.).
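One concrete way to relate expected accuracy to actual performance is to run the candidate tools over a few sample images with known text (as suggested above) and compute a character error rate from the edit distance. A small self-contained sketch; the plate strings are placeholders:

```python
# Sketch: character error rate (CER) between OCR output and known ground
# truth, counting substitutions, insertions and deletions via edit distance.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution or match
            ))
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Placeholder example: known plate text vs what an OCR engine returned.
truth = "AB-123-CD"
ocr = "A8-123-CO"
print(f"CER: {character_error_rate(ocr, truth):.1%}")   # 2 errors / 9 chars = 22.2%
```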
I still use an old version of NUIX (up to the version we're licensed for) because the OCR (which uses an ABBYY FineReader plugin) is much better than our standard forensic tools, so I can find more data.
It's not perfect, but it actually surprises me how much text it gets out of pictures, often in small sections of a large picture, at an imperfect angle. We'd likely still pay for NUIX if they hadn't made the cost so bonkers!
The open-source OCR library Tesseract performs very well; it features OSD (orientation and script detection) functionality and, in my experience, it has been very successful at correctly parsing low-quality and rotated images.
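For anyone wanting to try that, a minimal sketch of the OSD step via the pytesseract wrapper (the file name is a placeholder): it asks Tesseract for the detected rotation, straightens the image, and then extracts the text.

```python
# Minimal sketch: use Tesseract's OSD to detect page rotation, correct it,
# then run normal text recognition on the straightened image.
import re

from PIL import Image
import pytesseract

img = Image.open("rotated_photo.jpg")   # placeholder file name

# OSD output is plain text containing lines like "Rotate: 90".
osd = pytesseract.image_to_osd(img)
rotation = int(re.search(r"Rotate: (\d+)", osd).group(1))

if rotation:
    # Tesseract reports the clockwise rotation needed to make the text
    # upright; PIL's rotate() is counter-clockwise, hence the minus sign.
    img = img.rotate(-rotation, expand=True)

print(pytesseract.image_to_string(img))
```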
@athulin Thank you for your reply; I fully agree with most of what you are saying. However, local circumstances usually don't allow for pre-selecting or proofreading image files prior to starting work on a case. They are simply part of it. Also, outsourcing to an expert lab is complex, specifically when the case has been passed to me by a court/judge. There are always NDAs in place. On top of that, the costs of outsourcing would put tremendous stress on the budget of the case. Maybe the circumstances in Europe differ a lot, but I am literally on a little rock in the ocean and that unfortunately comes with a lot of restrictions. Again, thank you for your reply, it is highly appreciated.
@rich2005 Thank you for taking the time to reply. I tested ABBYY FineReader but found the results not that great. This might also be because I used a trial license that only allowed me to convert and OCR 100 images/pages. Maybe I should give it another try. Concerning the hugely increased costs of DFIR suites over the past years, it is indeed a great concern for me as well, as a small firm/consultant. I was recently quoted EUR 14K for 1 year of non-perpetual use of Axiom. In 2022 I paid USD 2.2K for 1 year of SMS for my perpetual license. Pricing might be another interesting topic for a discussion.
@tic-tac Thanks for mentioning Tesseract; I was going to give that a try, but I wanted to embed it in a piece of my own software. It is definitely on the list of projects. How are you using it exactly, if I may ask? Are you using it as a stand-alone product, or did you use the API and build your own tool?
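For context on the two options being asked about: Tesseract can be driven either as a stand-alone command-line tool or from your own code via a wrapper such as pytesseract. A minimal illustration of both (the file names are placeholders):

```python
# Two ways to drive Tesseract from your own software (illustrative only):
# 1) shell out to the stand-alone command-line binary, or
# 2) call it through the pytesseract wrapper inside your own pipeline.
import subprocess

from PIL import Image
import pytesseract

# Option 1: stand-alone CLI. "tesseract input.png outbase" writes outbase.txt.
subprocess.run(["tesseract", "evidence_photo.png", "evidence_photo"], check=True)

# Option 2: library call, so the text stays inside your own tool.
text = pytesseract.image_to_string(Image.open("evidence_photo.png"))
print(text)
```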
The only thing I can think to mention regarding the OCR results is that I use the “text extraction - accuracy” option, just in case you’d left it on the default; maybe that would help. I’m not in front of my machine right now, but I’m guessing Nuix defaults to speed, not accuracy.
I’d also have the auto deskew and rotate options on.
Or maybe I just had lower expectations for the results and so was less disappointed! 🤣
This open source forensic tool uses Tesseract for OCR and it is my absolute favorite 🙂