Web Crawling and PDF Documents


liguoroa


Topic starter 10/03/2013 9:28 pm

Dear All,

I analyzed a complete copy of a web site (downloaded with the wget command) and found some PDF documents
containing "compromising" words.

Does anybody know whether PDF documents are analyzed by web crawlers during the search engine indexing phase?
I would like to establish whether a web search for these words could lead to this site.

Best Regards,
Andrea Liguoro
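
A first step for this kind of review is to list every PDF in the downloaded copy. The following is only a minimal sketch, assuming the wget mirror sits in a local directory; the function name `find_pdfs_in_mirror` is hypothetical. It checks the `%PDF-` file signature rather than the file extension, because a mirrored site can serve PDFs under other names.

```python
from pathlib import Path


def find_pdfs_in_mirror(root):
    """Return the files under `root` that start with the PDF signature.

    Assumption: `root` is the directory created by the wget download.
    Files are recognised by their first bytes ("%PDF-"), not by name.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # A PDF normally begins with the header "%PDF-<version>".
            if handle.read(5) == b"%PDF-":
                hits.append(path)
    return hits
```

Calling `find_pdfs_in_mirror("mirror_dir")` (the directory name is a placeholder) returns a sorted list of paths. It only locates the files; it does not search their text, which is usually stored in compressed streams.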


jaclaz


11/03/2013 1:01 am

Does anybody know if pdf documents are analyzed by web crawler during the search engine indexing phase?
I would like to establish if these words may potentially connect to this site in a web search.

Sure they are :), otherwise how would you find (as you normally do) a number of .pdf files in (say) Google results? 😯

Of course that applies only to "wordprocessor documents printed/saved to .pdf" and not to "scans".

Try googling for "forensic investigations BEST PRACTICES" (without quotes).
You will get, among other results, a number of .pdf files.

jaclaz


Belkasoft


11/03/2013 3:57 pm

Of course that applies only to "wordprocessor documents printed/saved to .pdf" and not to "scans".

jaclaz, to my surprise, this also applies to scanned PDF files that contain only images (scanned text) and nothing else. Apparently, Google uses an OCR engine to index such documents.

Proof link:
http://googleblog.blogspot.de/2008/10/picture-of-thousand-words.html

In addition, Google can OCR and index such PDF files when they are uploaded to Google Docs:
http://support.google.com/drive/bin/answer.py?hl=en&answer=176692


jaclaz


11/03/2013 7:20 pm

jaclaz, to my surprise, this also applies to scanned PDF files that only contain images (scanned text) and nothing else.

Thanks for the heads up / update :), I remember reading something about that but I thought it was highly experimental.

After all it is not that bad, though

AS foT the softWal'e altornatives，. ilT1I)fenlenting solutionS in haTdware

is a bit debatable ;)

jaclaz


PaulSanderson


11/03/2013 7:27 pm

I did a job a number of years back that involved a number of scanned PDFs and found that the PDF spec allows a scanned file to carry, in addition to the scanned image, the text 'hidden' within the document, so the PDF retains the look of the scanned document but is also searchable.
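
One rough way to tell a pure image scan from a scan that also carries such a hidden text layer is to look for font resources, since text in a PDF, visible or not, needs a font. The sketch below is a heuristic under stated assumptions, not a PDF parser: the function name `pdf_mentions_fonts` is hypothetical, it only looks for the `/Font` key in the raw bytes and in Flate-compressed streams, and it ignores encryption and other stream filters.

```python
import re
import zlib


def pdf_mentions_fonts(data: bytes) -> bool:
    """Heuristic: True if the PDF bytes reference a /Font resource.

    Assumption: a scan with no /Font anywhere is image-only, while a
    /Font entry suggests some text layer (possibly a hidden OCR layer).
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with a PDF header")
    if b"/Font" in data:
        return True
    # Newer PDFs may keep dictionaries inside compressed object streams,
    # so also try to inflate each stream and search the result.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image stream)
        if b"/Font" in inflated:
            return True
    return False
```

A `True` result does not prove the text is meaningful or complete; extracting the layer with a proper PDF tool is still needed to see what a crawler could index.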
