Extract indexed websites

LeGioN
(@legion)
Posts: 51
Trusted Member
Topic starter
 

Hi,

This might be a really dumb question..
But here is the scenario

Somebody creates a webpage.
It gets indexed by google.
It then gets deleted.

The webpage is no longer accessible, but you can still see bits of it through just good ol' fashioned googling, as it has been indexed.

Is there a way to extract everything that google has indexed?

If this even makes sense :)

/LeGioN

 
Posted : 25/03/2019 8:25 am
LeGioN
(@legion)
Posts: 51
Trusted Member
Topic starter
 

Additional info
I have tried the Wayback Machine website unsuccessfully, as the page I need was not captured.

 
Posted : 25/03/2019 8:47 am
(@tootypeg)
Posts: 173
Estimable Member
 

Not sure I fully understand the scenario. Maybe it's still in the browser cache of a suspect? For example, make Chrome work offline and rebuild the page from the cache?

 
Posted : 25/03/2019 9:46 am
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

As I see it, a page that no longer exists has EITHER been archived (on the Wayback Machine or on other services) or not.
If not, and if it has been crawled by Google (usually it has, since the Google crawler is d@mn efficient), it may be in the Google cache.
The Google cache is only temporary, so you might (or might not) be "on time" to still get it.
Also, unlike archive.org/Wayback Machine, the Google cache only holds the "last" version Google visited, so if the page has been replaced by another page, even briefly, you will find the latter in the Google cache.
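
For the "archived or not" half of that, a quick scripted check is possible. This is only a rough sketch: it assumes the Wayback Machine availability endpoint at archive.org/wayback/available and its JSON layout (an "archived_snapshots" object holding a "closest" entry). The function names are made up for illustration, and example.com stands in for the real URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD hint."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the replay URL of the closest capture, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Perform the network request (hypothetical helper, not run on import)."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # example.com is a placeholder for the deleted page.
    print(lookup("example.com/deleted-page", "20190325"))
```

A result of None only means that this one archive has no capture of that exact URL; it says nothing about the Google cache or the other archives.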

To access the Google cache easily, you may want to try
http://cachedview.com/

There are other archiving/caching resources. Even if they are "tiny" when compared to Google or archive.org, it costs nothing to check whether, by sheer luck, something of interest has been cached/archived by them. For example:
https://www.waybackmachinedownloader.com/blog/alternative-sites-like-archive-org/

A "complete" list is here
https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
(though most are dedicated to "institutional" websites)

jaclaz

 
Posted : 25/03/2019 10:01 am
LeGioN
(@legion)
Posts: 51
Trusted Member
Topic starter
 

This was the sort of stuff I was hoping you'd show up with!
Tried both cachedview and wayback with not much success, but I am going to give wayback another go.
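
One way to give Wayback another go is to stop asking for the exact page and instead list every capture under the site prefix, in case the content survived under a slightly different URL. The sketch below assumes the Wayback CDX server at web.archive.org/cdx/search/cdx and its url, matchType, output, fl, collapse and limit parameters. The helper names are hypothetical and example.com is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_prefix_url(site_prefix, limit=50):
    """Build a query listing captures of every URL starting with site_prefix."""
    params = {
        "url": site_prefix,
        "matchType": "prefix",                  # match everything under the prefix
        "output": "json",                       # first row is the column header
        "fl": "timestamp,original,statuscode",  # fields to return
        "collapse": "urlkey",                   # one row per distinct URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_records(rows):
    """Turn the header-plus-rows JSON array into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def replay_url(record):
    """Build the Wayback replay URL for one capture record."""
    return "https://web.archive.org/web/{timestamp}/{original}".format(**record)


def list_captures(site_prefix, limit=50):
    """Perform the network request (hypothetical helper, not run on import)."""
    with urlopen(cdx_prefix_url(site_prefix, limit), timeout=60) as resp:
        body = resp.read().decode("utf-8")
    return rows_to_records(json.loads(body) if body.strip() else [])


if __name__ == "__main__":
    for rec in list_captures("example.com/"):
        print(rec["statuscode"], replay_url(rec))
```

If the listing comes back empty, the site was most likely never captured there at all, and the other archives are the next thing to try.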

I had some success with Google Index Retriever by ElevenPaths, but it did not quite get me all the good stuff I was hoping for.

And my bad, tootypeg: I did not specify that there are no physical devices involved. Just a deleted URL. :)

/LeGioN

 
Posted : 25/03/2019 10:42 am