Dear All,
I downloaded the whole web site (using the wget command), analyzed it, and found some PDF documents
containing "compromising" words.
Does anybody know if PDF documents are analyzed by web crawlers during the search engine indexing phase?
I would like to establish whether these words could lead to this site in a web search.
Best Regards,
Andrea Liguoro
Does anybody know if PDF documents are analyzed by web crawlers during the search engine indexing phase?
I would like to establish whether these words could lead to this site in a web search.
Sure they are :) , otherwise how would you find (as you normally do) a number of .pdf's in (say) Google results? 😯
Of course that applies only to "wordprocessor documents printed/saved to .pdf" and not to "scans".
Try googling for "forensic investigations BEST PRACTICES" (without quotes).
You will get among other results a number of .pdf's.
jaclaz
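To check the same thing for one specific site, the search can be narrowed with Google's `site:` and `filetype:` operators. The following is only a minimal sketch: the function name `pdf_search_url`, the site `example.com` and the word list are hypothetical placeholders, not anything taken from the original post.

```python
from urllib.parse import urlencode


def pdf_search_url(site, term):
    """Build a Google search URL restricted to PDF files on one site."""
    # site: and filetype: are Google search operators; the double
    # quotes ask for an exact match on the term.
    query = f'site:{site} filetype:pdf "{term}"'
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical site and words, for illustration only.
for word in ["confidential", "internal use only"]:
    print(pdf_search_url("example.com", word))
```

Opening each printed URL in a browser shows whether the search engine already associates that word with a PDF on the site.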
Of course that applies only to "wordprocessor documents printed/saved to .pdf" and not to "scans".
jaclaz, to my surprise, this also applies to scanned PDF files that only contain images (scanned text) and nothing else. Apparently, Google uses an OCR engine to index such documents.
Prooflink
http//
In addition, Google can OCR and index such PDF files when uploaded to Google Docs
http//
jaclaz, to my surprise, this also applies to scanned PDF files that only contain images (scanned text) and nothing else.
Thanks for the heads up / update :) , I remember reading something about that, but I thought it was highly experimental.
After all it is not that bad, though:
AS foT the softWal'e altornatives,. ilT1I)fenlenting solutionS in haTdware
is a bit debatable ;)
jaclaz
I did a job a number of years back that involved a number of scanned PDFs, and found that the PDF spec allows a scanned file to carry, alongside the scanned image, the text 'hidden' within the document. The PDF retains the look of the scanned document but is also searchable.
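One way to see which downloaded PDFs carry a text layer (native or hidden) and which of the words they contain is sketched below. It assumes poppler's `pdftotext` command-line tool is installed and on the PATH; the function names and the `mirror/` directory are hypothetical. An empty result only means that no text layer was found, and per the posts above such a file may still be indexed through OCR.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return the text layer of a PDF via pdftotext ("" if there is none)."""
    # "-" as the output file sends the extracted text to stdout.
    # Raises FileNotFoundError if pdftotext is not installed.
    result = subprocess.run(
        ["pdftotext", "-q", str(pdf_path), "-"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def find_terms(text, terms):
    """Return the terms that occur in text, ignoring case."""
    lowered = text.lower()
    return [t for t in terms if t.lower() in lowered]


def scan_mirror(root, terms):
    """Map each PDF under root to the terms found in its text layer.

    A value of None marks a PDF with no extractable text, which
    suggests an image-only scan.
    """
    report = {}
    for pdf in sorted(Path(root).rglob("*.pdf")):
        text = extract_text(pdf)
        report[str(pdf)] = find_terms(text, terms) if text.strip() else None
    return report
```

For example, `scan_mirror("mirror/", ["confidential"])` would walk a hypothetical wget download directory and report, per file, either the matching words or `None`.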