@4n6art The trick with the automated tools on the Wayback Machine, at least when I did it, was to tell them not to honor robots.txt. Also, because of the way the Wayback Machine functions, you have to view source on the start page and then identify the links as they correspond to the Wayback Machine, not the original server. You have to do a little bit of code play and creative thinking to get it to play right. But you're right, it is a PITA.
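A rough sketch (not a vetted tool) of what "identify the links as they correspond to the Wayback Machine" can look like in practice: fetch one archived snapshot and keep only the /web/<14-digit-timestamp>/ links that point back into web.archive.org rather than to the live server. The snapshot URL and timestamp below are placeholders.

# Pull an archived snapshot and list only the archive-rewritten links.
# SNAPSHOT is a placeholder; substitute the capture you are working from.
import re
import urllib.request

SNAPSHOT = "https://web.archive.org/web/20090101000000/http://www.example.com/"

with urllib.request.urlopen(SNAPSHOT) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# The Wayback Machine rewrites in-page links into the /web/<timestamp>/<original-url>
# form, so matching that pattern keeps the crawl inside the archive.
archived_links = re.findall(r'href="(/web/\d{14}[^"]*)"', html)

for link in sorted(set(archived_links)):
    print("https://web.archive.org" + link)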
Dealing with web-based evidence in court is becoming an issue for investigators and courts worldwide. There is a methodology for dealing with it and making the evidence collected legally defensible. Digital evidence from the Internet should be handled no differently than digital evidence on a hard drive.

Documenting web-based evidence should include procedures for collection, preservation and presentation. These procedures include identifying the collection location by IP address, domain registration and geolocation of the server. Documentation can include archiving the entire webpage including the HTML source code, snapshotting the page (taking a picture) or videoing the page as it exists. Verification of the documentation should include logging and automated processes, even keystroke logging or capturing the TCP/IP traffic of the machine used to do the collection. Preservation should include documentation of the collection process, the logging through automated processes, and saving of the evidence in a secure manner. Presentation is the reporting of the methods and process in a manner that is easy for the end user to view and understand.

Internet evidence can be collected in a quick and defensible manner and used successfully in court.
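To make the collection/preservation steps above concrete, here is a minimal Python sketch, assuming stdlib only and placeholder file names: it resolves the server IP, archives the raw HTML, hashes it and writes a collection log. Domain registration (whois) and geolocation lookups would be separate steps recorded alongside this log.

# Minimal illustration of collection + preservation: resolve the server IP,
# save the HTML source, hash it, and log the process with a UTC timestamp.
import hashlib
import json
import socket
import urllib.request
from datetime import datetime, timezone
from urllib.parse import urlparse

def collect(url: str, out_prefix: str = "evidence") -> dict:
    host = urlparse(url).hostname
    collected_at = datetime.now(timezone.utc).isoformat()
    server_ip = socket.gethostbyname(host)        # collection location by IP

    with urllib.request.urlopen(url) as resp:     # archive the HTML source
        raw = resp.read()

    html_path = f"{out_prefix}.html"
    with open(html_path, "wb") as f:
        f.write(raw)

    log = {                                       # preservation log
        "url": url,
        "server_ip": server_ip,
        "collected_at_utc": collected_at,
        "sha256": hashlib.sha256(raw).hexdigest(),
        "saved_as": html_path,
    }
    with open(f"{out_prefix}.log.json", "w") as f:
        json.dump(log, f, indent=2)
    return log

# Example: collect("http://www.example.com/", "case001_page01")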
Did you know that the job was to capture a social networking site before you took it?
I ask from a billing perspective so that you know the amount of time you will need to spend in advance. Also if the tool is unfamiliar there may be a learning curve.
I've been asked to capture a MySpace, Facebook and LinkedIn website for a client. I have the usernames/passwords of the people whose websites I need to capture.
Questions
- Will website capture programs like HTTrack, WebSnake, Wget, etc. work with social networking sites and get me everything I want?
- If not, aside from PDFing each page, how are others capturing these sites?
Any and all suggestions appreciated.
Thank you folks!
-=Art=-
When I've done this in the past I've used VMware. Set up a clean VMware machine and test accounts, start the built-in video recording, set up a network capture of all traffic to/from the VMware machine, and then run your test (a rough sketch of the packet-capture step is at the end of this post). One VMware machine/packet cap per test. Check everything worked.
Then run it for real, one new VMware machine per test.
Now image each VM.
I ended up with 20+ test cases, each with a VMware machine, video, notes and the associated packet caps! It can be very time consuming.
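For the packet-capture-per-test step, a sketch along these lines is one option, assuming a Linux host with tcpdump installed and sufficient privileges; the interface name (vmnet8) and file names are placeholders, and the "run the test" pause is just illustrative.

# Start one capture per test, stop it when the test in the VM is done,
# then hash the pcap immediately so it can be verified later.
import hashlib
import signal
import subprocess

PCAP = "test_case_01.pcap"

# Capture everything to/from the VM's network interface (placeholder name).
cap = subprocess.Popen(["tcpdump", "-i", "vmnet8", "-w", PCAP])

try:
    input("Run the test inside the VM, then press Enter to stop the capture...")
finally:
    cap.send_signal(signal.SIGINT)   # let tcpdump flush and close the file
    cap.wait()

with open(PCAP, "rb") as f:
    print(PCAP, hashlib.sha256(f.read()).hexdigest())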
Snagit. Great software! It will capture the whole page, but I don't believe there is sound.
Good Luck.
It is insane to capture a commercial web site in my opinion.
Most (if not all) large sites use dynamic creation of pages. That is, there is no static HTML as such in the background. Most pages you see are compilations of code chunks assembled based on variables. These variables can range from simple ones like date, time and weather to ones as complex as how often a user visited a specific "page" or looked at an image.
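A quick way to demonstrate that point (the URL below is a placeholder): fetch the same page twice and compare hashes. For dynamically generated sites the two "copies" of the same page frequently won't match byte-for-byte.

# Fetch the same URL twice and compare SHA-256 hashes of the responses.
import hashlib
import urllib.request

URL = "http://www.example.com/"

def fetch_hash(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

first, second = fetch_hash(URL), fetch_hash(URL)
print("identical" if first == second else "different", first, second)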
I would have to write a disclaimer twice as long as the actual capture.
By the way, I also use "ScreenGrab!" to generate PNG images of the full-blown pages.
It is like asking to map fog…