
Website imaging

16 Posts
9 Users
0 Reactions
1,978 Views
(@armresl)
Noble Member
Joined: 21 years ago
Posts: 1011
Topic starter  

Very interesting. I would have thought it would be in the few hundred MB range, especially given the content. Did it capture all of the Flash and Java, or just most of it?

So this was with Offline Explorer?
And for viewing, what steps did you take? I've not used that software before.

Thanks for going the extra mile, that was a pretty cool thing to do.

Have you pointed the product you mention at a site like the one I mentioned, or something similar, to see how it works?

I ran snickers.com "quick and dirty" using default settings. Acquisition is no problem at all: 240 files in 48 folders at ~96 MB.
Playback on OE's (or any local) webserver is perfect, with no difference from the original. Export/presentation is a little tricky for such sites. To avoid any difference in appearance, you have to use the exe viewer, which is in fact its own web server. That is no problem at this size, but loading executable archives of several hundred MB would take some time.
This site itself is quite an easy task, as the project was not configured to follow external links, except for pictures.
The real problems are YouTube and social media, as you don't get a truly functional copy and it is difficult to reasonably restrict the crawling to the necessary objects.
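
For anyone without Offline Explorer, a rough analogue of this kind of scoped acquisition can be sketched with wget (wrapped in Perl here, since Perl and wget come up later in the thread). The URL, output directory and flags below are illustrative assumptions; they approximate, rather than replicate, the OE project settings described above, and in particular they simply stay on the one domain instead of selectively following external picture links.

    #!/usr/bin/perl
    # Sketch only: approximates a scoped site acquisition with wget.
    # The target URL and output directory are placeholders, not the
    # exact project settings described above.
    use strict;
    use warnings;

    my $url    = 'http://www.snickers.com/';   # illustrative target
    my $outdir = 'snickers-image';             # illustrative output directory

    my @cmd = (
        'wget',
        '--mirror',            # recursive download with timestamping
        '--page-requisites',   # fetch images, stylesheets and other page requisites
        '--convert-links',     # rewrite links for local, offline browsing
        '--adjust-extension',  # save HTML/CSS with usable file extensions
        '--wait=1',            # pause between requests, be polite to the server
        "--directory-prefix=$outdir",
        $url,
    );

    system(@cmd) == 0 or die "wget failed: $?\n";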


   
(@c-r-s)
Estimable Member
Joined: 14 years ago
Posts: 170
 

Yes, I did it in Offline Explorer, because Inquiry does not seem to be able to export the objects properly. In addition, my copy of it is not up to date, while OE is.

This is a file list of the site after being exported to a directory, sorted by extension:
http://www.2shared.com/document/l71AiNZX/snickerscom.html

On a quick look I didn't figure it out: how can the videos under /media/flv be reached from the main site?

The projects can be viewed directly in OE, which has a built-in webserver. It can be accessed externally as well, so the downloaded projects can be browsed from the LAN (unless you've used "URL macros", something like wildcards).
As mentioned, the export to a directory or to some of the other options (MHT, CHM) is not suitable for viewing here. I don't really know why, but the Flash elements do not play that way. Maybe a web security policy, although I did add a trusted path in the Flash settings, of course. I'm not really experienced in fine-tuning the export function, as I seldom use it for exact copies; those snapshots stay on the OE machine for LAN access.
But viewing on a webserver or within the executable viewer works fine. The viewer is an export option, just like the other formats and (zip-)folders.


   
(@Anonymous 6593)
Guest
Joined: 17 years ago
Posts: 1158
 

Looking for your opinions on a good way to get a copy (as best can be had) of a website.

Depends on what 'a copy' means. Something you can browse locally? Something you can search over? Or something else?

Even web application testers (who also want to see all parts of a web site) seem to agree that 'spidering' a web site (i.e. walking over all parts of it) takes some automation plus appreciable manual work. Web sites that react to the browser you use require particular care. Not all such testing saves a copy, though, so the ultimate goals are not quite comparable, even if the 'view all parts' bit is.

This is particularly true for web sites where the content is very dynamic. Flash, Java, and other client-side executables are the main problem; you need a full browser engine with plugins for those.

I've been trying out Burp Suite a bit lately. It's a rather smart logging proxy, so it is possible to 'log' all parts of the web site you visit, either manually through a web browser or automatically through the 'Spider web site' mode. As it is a proxy, the web pages are still rendered in your browser, so manual work is always possible, and since you have access to the HTTP/HTML responses, you can always search for pages/frames with Java and Flash content for manual browsing.

You *don't* get a nice static web site to browse offline, which may be a problem if you need that. You do get the full responses from every URI you've visited. (You can 'scope' your work so that only URIs belonging to a certain domain, a certain IP address, etc. are logged/spidered.)

Google for 'Burp Suite'; there is a free version that does practically all of the proxying/spidering stuff, so you should be able to get an idea of whether it suits your goals. It's not updated quite as frequently as the commercial edition, which is more targeted at web application testers.
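
As a small illustration of the proxy idea (plain LWP here, not anything Burp-specific, and the target URL is a placeholder): anything fetched through the proxy listener, Burp's default being 127.0.0.1:8080, ends up in the proxy history, whether it came from a browser or from a script.

    #!/usr/bin/perl
    # Sketch only: route scripted requests through a local intercepting proxy
    # (Burp's default listener is 127.0.0.1:8080) so every response is logged there.
    # Plain HTTP only; HTTPS interception additionally needs the proxy's CA certificate.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');   # browser-like user agent
    $ua->proxy('http', 'http://127.0.0.1:8080');            # send traffic via the proxy

    my $resp = $ua->get('http://www.example.com/');          # placeholder URL
    die 'Request failed: ' . $resp->status_line . "\n" unless $resp->is_success;

    print length($resp->decoded_content), " bytes fetched (and logged by the proxy)\n";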


   
jimmy
(@jimmy)
Eminent Member
Joined: 18 years ago
Posts: 47
 

Well, way back we used a tool named BlackWidow.
BlackWidow will download all file types, such as pictures and images, audio and MP3, videos, documents, ZIP archives, programs, CSS, Macromedia Flash, PDF, PHP, CGI, HTM and other MIME types, from any web site.
Not sure about the latest versions, but in 2008-2009 it did a great job for us in solving some mysterious cases…


   
azrael
(@azrael)
Honorable Member
Joined: 19 years ago
Posts: 656
 

Once upon a time I wrote a quick and dirty Perl script with wget to image a site. It had the advantage that I was able to hash each file as it completed, and I could follow links, get images, download JavaScript & CSS, etc. I recall building it around the Perl Cookbook's "find broken links" code …

However, I should caveat that this was a few years ago now, and I imagine it would barf horribly on streaming media!
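
A minimal sketch of the hashing side of such a script (not azrael's original; it assumes the site has already been mirrored into a local directory with wget, and hashes the tree afterwards rather than as each file completes):

    #!/usr/bin/perl
    # Sketch only: walk a previously mirrored directory tree and record a
    # SHA-1 hash for every file. The directory name is a placeholder.
    use strict;
    use warnings;
    use File::Find;
    use Digest::SHA;

    my $root = 'snickers-image';   # placeholder: output directory of the wget mirror

    find(sub {
        return unless -f $_;                 # skip directories and other non-files
        my $sha = Digest::SHA->new(1);       # SHA-1; swap in 256 if preferred
        $sha->addfile($_);
        printf "%s  %s\n", $sha->hexdigest, $File::Find::name;
    }, $root);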


   
(@sentinel)
Active Member
Joined: 18 years ago
Posts: 5
 

The UNIX 'wget' command is very efficient at copying a live website and storing the content for subsequent review, or as a backup utility.


   