raw scan for a string

General (Technical, Procedural, Software, Hardware etc.)

Last Post by NeghVar 6 years ago

NeghVar

(@neghvar)

Active Member

Joined: 11 years ago

Posts: 9

Topic starter 07/01/2021 11:18 pm [#18842]

I need to search for a document that contained a particular address. By now it has probably been corrupted or overwritten, but I still want to try a raw scan for this address as a string. The original file was a PDF. Is there a way to do this?

Anonymous 6593

(@Anonymous 6593)

Joined: 18 years ago

Posts: 1158

08/01/2021 8:30 am

Not necessarily. PDF is a pre-press document format, and is only intended to write/show a document on a printing device, such as a digital typesetter. That it can be used for a screen is just a special case of that use case. It allows use of compression as well as encryption; and PDF is basically a programming language with focus on output. So there are lots of possibilities to obfuscate a text, deliberately or not.

Some PDF-creating applications rely on 'natural' character-to-font mapping, and don't bother about kerning or other niceties. In a PDF file created by such an application you may find the address directly as plain text. If it is compressed, you have to undo the compression first. If all you have are recovered document fragments, you also need a large amount of luck.
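As a rough illustration of that simplest case, here is a minimal Python sketch. It assumes the image or carved fragment fits in memory, that the address contains only Latin-1 characters, and that any compression is Flate (zlib), which is common but not the only filter PDF allows. The function names are made up for this example.

```python
import re
import zlib

# Body between a 'stream' keyword (not the tail of 'endstream') and the next 'endstream'.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

def find_in_raw(data: bytes, needle: str) -> list[int]:
    """Return offsets where needle occurs as Latin-1 or UTF-16BE bytes."""
    hits = []
    for encoded in (needle.encode("latin-1"), needle.encode("utf-16-be")):
        start = 0
        while (pos := data.find(encoded, start)) != -1:
            hits.append(pos)
            start = pos + 1
    return sorted(hits)

def find_in_flate_streams(data: bytes, needle: str) -> list[int]:
    """Return offsets of 'stream' keywords whose inflated body contains needle."""
    target = needle.encode("latin-1")
    hits = []
    for match in STREAM_RE.finditer(data):
        try:
            # A decompress object ignores trailing bytes such as the EOL
            # before 'endstream', and returns what it can from truncated data.
            body = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or damaged from the first bytes
        if target in body:
            hits.append(match.start())
    return hits
```

Run `find_in_raw` first; only if it finds nothing is the stream pass worth the time. Neither function helps if the stream was overwritten before the address, or if the text was written in one of the ways described below.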

Other PDF-writing software does take kerning into account. It writes text in chunks, and breaks off wherever a reposition is necessary (where two glyphs need to be set closer together or wider apart than the default font parameters say). In those documents text gets broken up into what, to a normal viewer, looks like unpredictable fragments. There's no easy way to scan for text in these unless you know whether the text you are looking for contains kerning points and where they are: you have to implement large parts of an OCR application.
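For one common form of this, where the writer puts the chunks and the kerning adjustments into a single array for the TJ operator (for example `[(12 Ma) -20 (in Street)] TJ`), a search pattern can be made to tolerate a break between any two characters. The sketch below is only that: it assumes uncompressed (or already inflated) content, Latin-1 text in literal strings, and at most one kerning number per gap. It will not match chunks that are written with separate operators and explicit repositioning. The function name is hypothetical.

```python
import re

# Optional gap inside a TJ array: close the string, one kerning number, reopen the string.
KERN_GAP = rb"(?:\)\s*-?\d*\.?\d+\s*\()?"

def kerning_tolerant_pattern(needle: str) -> re.Pattern:
    """Build a bytes regex matching needle with an optional TJ gap between characters."""
    parts = []
    for value in needle.encode("latin-1"):
        byte = bytes([value])
        if byte in b"()\\":
            # PDF literal strings escape parentheses and backslashes with a backslash
            parts.append(re.escape(b"\\" + byte))
        else:
            parts.append(re.escape(byte))
    return re.compile(KERN_GAP.join(parts))
```

Usage would be along the lines of `kerning_tolerant_pattern("12 Main Street").search(content)`, where `content` is the raw or inflated content stream.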

Additionally, some PDF-writers minimize embedded fonts: they take the original font, recode it to contain only the glyphs that are required to print the document, and then use that. This is usually done to reduce document size. These documents usually do not have legible strings at all, and you need to know the reverse mapping to interpret the result.
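If the fragment with the font's ToUnicode CMap happens to survive, the reverse mapping can sometimes be read from it. The sketch below makes several assumptions that are not from the post above: that such a CMap exists at all (it is optional), that it uses `beginbfchar` entries (`beginbfrange` is not handled), that every code has the same length, and that the page text is written as hex strings such as `<000100020002>`. The helper names are hypothetical.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap: bytes) -> dict[bytes, str]:
    """Map glyph codes to text, using only the bfchar entries of a ToUnicode CMap."""
    mapping = {}
    for block in BFCHAR_BLOCK.finditer(cmap):
        for code, uni in BFCHAR_PAIR.findall(block.group(1)):
            glyph_code = bytes.fromhex(code.decode("ascii"))
            # Destination values are UTF-16BE code units
            mapping[glyph_code] = bytes.fromhex(uni.decode("ascii")).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_text: str, mapping: dict[bytes, str], code_len: int = 2) -> str:
    """Decode the hex digits of a PDF hex string; unknown codes become '?'."""
    raw = bytes.fromhex(hex_text)
    return "".join(mapping.get(raw[i:i + code_len], "?")
                   for i in range(0, len(raw), code_len))

def encode_needle(needle: str, mapping: dict[bytes, str]) -> str:
    """Turn the search text into the hex digits it would have under this mapping."""
    reverse = {text: code for code, text in mapping.items()}
    # Raises KeyError if a character is not in the subset font
    return b"".join(reverse[ch] for ch in needle).hex().upper()
```

The result of `encode_needle` can then be searched for as a string, keeping in mind that a writer may emit the hex digits in lower case, and that the mapping is specific to one font in one document.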

I've seen a design of a PDF writer that output one glyph at a time: that is, it rendered the 'a' glyph, and for all locations in the current page that contained that glyph, output the bitmap at those coordinates. It produced somewhat legible text on a screen, but the result looked odd. It worked better on high-resolution printers. Again, to find a raw string in that kind of PDF file, you need to basically OCR the document first. (A simpler approach might be to render lines in reverse, upsetting the possibility to just scan for raw strings.)

So ... no, you can't rely on a scan to work in general.

But if you know of any other documents produced by the same application, and in the same way, you can check whether those PDF files can be scanned for known text. If they contain uncompressed PDF, you can try creating a document containing the text you are looking for, and see how it is converted into PDF text. But if the documents are compressed, you can probably assume that it is not going to work.

NeghVar

(@neghvar)

Topic starter 09/01/2021 4:41 am

@athulin

Thanks, but I found it. I did a search on Google maps with that address a few months ago. Did a data dump of that account and found it in the maps search history.
