Web Crawling and PDF Documents

(@liguoroa)
Estimable Member
Joined: 16 years ago
Posts: 43
Topic starter  

Dear All,

I downloaded an entire web site (using the wget command), analyzed it, and found some PDF documents
containing "compromising" words.

Does anybody know whether PDF documents are analyzed by web crawlers during the search engine indexing phase?
I would like to establish whether a web search for these words could lead to this site.

Best Regards,
Andrea Liguoro


   
jaclaz
(@jaclaz)
Illustrious Member
Joined: 18 years ago
Posts: 5133
 

Does anybody know whether PDF documents are analyzed by web crawlers during the search engine indexing phase?
I would like to establish whether a web search for these words could lead to this site.

Sure they do, otherwise how would you find (as you normally do) a number of .pdf's in (say) Google results? 😯

Of course that applies only to "word processor documents printed/saved to .pdf" and not to "scans".

Try googling for "forensic investigations BEST PRACTICES" (without quotes).
You will get among other results a number of .pdf's.
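
To run the same kind of check against one specific site instead of the whole web, the search can be narrowed with Google's `site:` and `filetype:` operators. Below is a minimal Python sketch that only builds the query URL and does not send it. The helper name `build_pdf_query_url`, the domain `example.com` and the search word are placeholders for illustration, not anything taken from the case above.

```python
from urllib.parse import urlencode


def build_pdf_query_url(site, words):
    """Build a Google search URL limited to PDF files on a single site.

    `site` and `words` are supplied by the caller; nothing is sent over
    the network, the function only assembles the URL string.
    """
    # Wrap each term in double quotes so that it is matched as a phrase.
    terms = " ".join(f'"{word}"' for word in words)
    query = f"site:{site} filetype:pdf {terms}"
    # urlencode percent-encodes the operators and turns spaces into "+".
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder values: replace with the real domain and the words of interest.
print(build_pdf_query_url("example.com", ["some word"]))
```

If the resulting search lists the PDFs in question, the crawler has indexed their text. An empty result proves less, since the files may simply not have been crawled yet.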

jaclaz


   
(@belkasoft)
Estimable Member
Joined: 17 years ago
Posts: 169
 

Of course that applies only to "word processor documents printed/saved to .pdf" and not to "scans".

jaclaz, to my surprise, this also applies to scanned PDF files that only contain images (scanned text) and nothing else. Apparently, Google uses an OCR engine to index such documents.

Proof link:
http://googleblog.blogspot.de/2008/10/picture-of-thousand-words.html

In addition, Google can OCR and index such PDF files when they are uploaded to Google Docs:
http://support.google.com/drive/bin/answer.py?hl=en&answer=176692


   
jaclaz
(@jaclaz)
Illustrious Member
Joined: 18 years ago
Posts: 5133
 

jaclaz, to my surprise, this also applies to scanned PDF files that only contain images (scanned text) and nothing else.

Thanks for the heads up / update, I remember reading something about that but I thought it was highly experimental.

After all it is not that bad, though

AS foT the softWal'e altornatives,. ilT1I)fenlenting solutionS in haTdware

is a bit debatable ;)

jaclaz


   
PaulSanderson
(@paulsanderson)
Honorable Member
Joined: 19 years ago
Posts: 651
 

A number of years back I did a job that involved many scanned PDFs, and I found that the PDF spec allows a scanned file to carry, in addition to the scanned image, the text 'hidden' within the document, so the PDF retains the look of the scanned document but is also searchable.
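
A rough way to triage a downloaded set of PDFs for such a text layer, without any PDF library, is to look for a font reference: anything that draws text (hidden or visible) needs a font, while a pure image scan normally has none. The Python sketch below is only a heuristic under that assumption. The function name `has_font_reference` is made up for illustration, only Flate-compressed streams are inflated, and encrypted files are not handled, so treat a negative result as "unknown" rather than "image only".

```python
import re
import zlib

# Matches the /Font name but not longer names such as /FontDescriptor.
FONT_RE = re.compile(rb"/Font\b")
# Captures the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_reference(pdf_bytes):
    """Heuristic: True if the PDF data mentions a font anywhere."""
    # Older PDFs keep their dictionaries uncompressed in the file body.
    if FONT_RE.search(pdf_bytes):
        return True
    # Newer PDFs may pack dictionaries into compressed object streams,
    # so try to inflate every stream and search the result as well.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG image), skip it
        if FONT_RE.search(inflated):
            return True
    return False


# Tiny hand-made fragments standing in for real files.
with_text = b"%PDF-1.4\n1 0 obj\n<< /Type /Font /Subtype /Type1 >>\nendobj\n"
image_only = b"%PDF-1.4\n1 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
print(has_font_reference(with_text), has_font_reference(image_only))
```

For real files, read each one with `open(path, "rb").read()` and pass the bytes in. A positive result only says that text may be present. What the text actually says still has to be extracted with a proper PDF tool.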


   