
Website imaging

16 Posts
9 Users
0 Reactions
1,978 Views
(@armresl)
Noble Member
Joined: 21 years ago
Posts: 1011
Topic starter  

Very interesting. I would have thought it would be in the few hundred MB range, especially given the content. Did it capture all of the Flash and Java, or just most of it?

So this was with Offline Explorer?
And for viewing, what steps did you take? I've not used that software before.

Thanks for going the extra mile, that was a pretty cool thing to do.

Have you pointed the product you mention at a site like the one I mentioned, or something similar, to see how it works?

I ran snickers.com "quick and dirty" using default settings. Acquisition is no problem at all: 240 files in 48 folders at ~96 MB.
Playback on OE's (or any local) webserver is perfect, with no difference from the original. Export/presentation is a little tricky for such sites. To avoid any difference in appearance, you have to use the exe viewer, which is in fact its own web server. That is no problem at this size, but loading executable archives of several hundred MB would take some time.
This site itself is quite an easy task, as the project was not configured to follow external links, except for pictures.
The real problems are YouTube and social media, as you don't get a truly functional copy and it is difficult to reasonably restrict the crawling to the necessary objects.
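
For anyone without Offline Explorer, a rough analogue of this kind of scoped acquisition can be sketched with wget (wrapped in Perl here, since Perl and wget come up later in the thread). The URL, output directory and flags below are illustrative assumptions; they approximate, rather than replicate, the OE project settings described above, and in particular they simply stay on the one domain instead of selectively following external picture links.

    #!/usr/bin/perl
    # Sketch only: approximates a scoped site acquisition with wget.
    # The target URL and output directory are placeholders, not the
    # exact project settings described above.
    use strict;
    use warnings;

    my $url    = 'http://www.snickers.com/';   # illustrative target
    my $outdir = 'snickers-image';             # illustrative output directory

    my @cmd = (
        'wget',
        '--mirror',            # recursive download with timestamping
        '--page-requisites',   # fetch images, stylesheets and other page requisites
        '--convert-links',     # rewrite links for local, offline browsing
        '--adjust-extension',  # save HTML/CSS with usable file extensions
        '--wait=1',            # pause between requests, be polite to the server
        "--directory-prefix=$outdir",
        $url,
    );

    system(@cmd) == 0 or die "wget failed: $?\n";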


   
(@c-r-s)
Estimable Member
Joined: 14 years ago
Posts: 170
 

Yes, I did it in Offline Explorer, because Inquiry does not seem to be able to export the objects properly. In addition, my copy of it is not up to date, while OE is.

This is a file list of the site after being exported to a directory, sorted by extension:
http://www.2shared.com/document/l71AiNZX/snickerscom.html

On a quick look I didn't figure it out: how can the videos under /media/flv be reached from the main site?

The projects can be viewed directly in OE, which has a built-in webserver. It can be accessed externally as well, so the downloaded projects can be browsed from the LAN (unless you've used "URL macros", something like wildcards).
As mentioned, the export to a directory or to some of the other options (MHT, CHM) is not suitable for viewing here. I don't really know why, but the Flash elements do not play that way. Maybe a web security policy, although I did add a trusted path in the Flash settings, of course. I'm not really experienced in fine-tuning the export function, as I seldom use it for exact copies; those snapshots stay on the OE machine for LAN access.
But viewing on a webserver or within the executable viewer works fine. The viewer is an export option, just like the other formats and (zip-)folders.


   
(@Anonymous 6593)
Guest
Joined: 17 years ago
Posts: 1158
 

Looking for your opinions on a good way to get a copy (as best can be had) of a website.

Depends on what 'a copy' means. Something you can browse locally? Something you can search over? Or something else?

Even web application testers (who also want to see all parts of a web site) seem to agree that 'spidering' a web site (i.e. walking over all parts of it) takes some automation plus appreciable manual work. Web sites that react to the browser you use require particular care. Not all such testing saves a copy, though, so the ultimate goals are not quite comparable, even if the 'view all parts' bit is.

This is particularly true for web sites where the content is very dynamic. Flash, Java, and other client-side executables are the main problem; you need a full browser engine with plugins for those.

I've been trying out Burp Suite a bit lately. It's a rather smart logging proxy, so it is possible to 'log' all parts of the web site you visit, either manually through a web browser or automatically through the 'Spider web site' mode. As it is a proxy, the web pages are still rendered in your browser, so manual work is always possible, and since you have access to the HTTP/HTML responses, you can always search for pages/frames with Java and Flash content for manual browsing.

You *don't* get a nice static web site to browse offline, which may be a problem if you need that. You do get the full responses from every URI you've visited. (You can 'scope' your work so that only URIs belonging to a certain domain, a certain IP address, etc. are logged/spidered.)

Google for 'Burp Suite'; there is a free version that does practically all of the proxying/spidering stuff, so you should be able to get an idea of whether it suits your goals. It's not updated quite as frequently as the commercial edition, which is more targeted at web application testers.
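
As a small illustration of the proxy idea (plain LWP here, not anything Burp-specific, and the target URL is a placeholder): anything fetched through the proxy listener, Burp's default being 127.0.0.1:8080, ends up in the proxy history, whether it came from a browser or from a script.

    #!/usr/bin/perl
    # Sketch only: route scripted requests through a local intercepting proxy
    # (Burp's default listener is 127.0.0.1:8080) so every response is logged there.
    # Plain HTTP only; HTTPS interception additionally needs the proxy's CA certificate.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');   # browser-like user agent
    $ua->proxy('http', 'http://127.0.0.1:8080');            # send traffic via the proxy

    my $resp = $ua->get('http://www.example.com/');          # placeholder URL
    die 'Request failed: ' . $resp->status_line . "\n" unless $resp->is_success;

    print length($resp->decoded_content), " bytes fetched (and logged by the proxy)\n";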


   
jimmy
(@jimmy)
Eminent Member
Joined: 18 years ago
Posts: 47
 

Well, way back we used a tool named BlackWidow.
BlackWidow will download all file types, such as pictures and images, audio and MP3, videos, documents, ZIP archives, programs, CSS, Macromedia Flash, PDF, PHP, CGI, HTM and other MIME types, from any web site.
Not sure about the latest versions, but in 2008-2009 it did a great job for us in solving some mysterious cases…


   
azrael
(@azrael)
Honorable Member
Joined: 19 years ago
Posts: 656
 

Once upon a time I wrote a quick and dirty Perl script with wget to image a site. It had the advantage that I was able to hash each file as it completed, and I could follow links, get images, download JavaScript & CSS, etc. I recall building it around the Perl Cookbook's "find broken links" code …

However, I should caveat that this was a few years ago now, and I imagine it would barf horribly on streaming media!
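
A minimal sketch of the hashing side of such a script (not azrael's original; it assumes the site has already been mirrored into a local directory with wget, and hashes the tree afterwards rather than as each file completes):

    #!/usr/bin/perl
    # Sketch only: walk a previously mirrored directory tree and record a
    # SHA-1 hash for every file. The directory name is a placeholder.
    use strict;
    use warnings;
    use File::Find;
    use Digest::SHA;

    my $root = 'snickers-image';   # placeholder: output directory of the wget mirror

    find(sub {
        return unless -f $_;                 # skip directories and other non-files
        my $sha = Digest::SHA->new(1);       # SHA-1; swap in 256 if preferred
        $sha->addfile($_);
        printf "%s  %s\n", $sha->hexdigest, $File::Find::name;
    }, $root);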


   
(@sentinel)
Active Member
Joined: 18 years ago
Posts: 5
 

The UNIX 'wget' command is very efficient at copying a live website and storing the content for subsequent review, or as a backup utility.


   