Searching the internet for images by metadata

(@gmarshall139)
Posts: 378
Reputable Member
Topic starter
 

The goal is to find images posted to the internet from a specific camera. So if the camera's serial number (or some other unique identifier) is contained within an image's metadata, could it be searched for to locate other pictures created by the same device?

How about creating a web crawler to do the same?

 
Posted : 31/08/2009 9:08 pm
keydet89
(@keydet89)
Posts: 3568
Famed Member
 

If you can extract EXIF data from a JPG on an analysis system, I don't see why you couldn't do the same thing via this sort of functionality.

Good luck.

 
Posted : 31/08/2009 11:55 pm
watcher
(@watcher)
Posts: 125
Estimable Member
 

The goal is to find images posted to the internet from a specific camera. …

A crawler that extracts and compares EXIF data is certainly technically possible; in fact, it wouldn't be that hard. That said, there are some practicalities to consider.

Size and time alone would necessitate some kind of focus-narrowing criteria. I suspect that new pictures are added to the web faster than you could crawl them.

Additionally, most images on the web outside of photo sites tend to be small low resolution images. Generally, external editing/resizing software does not keep the EXIF data in the new smaller images. Unless the image was taken directly by the camera in question and posted unedited, there is a good chance the EXIF data is gone.
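To illustrate the point about stripped metadata: a crawler could cheaply test whether a downloaded JPEG still carries an EXIF block before doing any deeper parsing. What follows is only a minimal sketch using the Python standard library. The function name `has_exif` is made up for illustration, and it assumes a well-formed JPEG header with no fill bytes between segments.

```python
import struct


def has_exif(jpeg_bytes: bytes) -> bool:
    """Return True if the JPEG data contains an APP1 segment holding EXIF data.

    Hypothetical helper: walks the header segments only and stops at the
    start of the compressed image data.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return False
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            return False  # not positioned on a marker; treat as malformed
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS (image data begins) or EOI
            return False
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False
```

Images that fail this check can be skipped without further work, which helps with the volume problem as well.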

 
Posted : 01/09/2009 12:26 am
(@wmpwi)
Posts: 1
New Member
 

The goal is to find images posted to the internet from a specific camera. So if the camera's serial number (or some other unique identifier) is contained within an image's metadata, could it be searched for to locate other pictures created by the same device?

How about creating a web crawler to do the same?

You just happened to hit real close to a hot button of mine right now. I've been doing a lot of research on how I might be able to use EXIF data in our investigations and we've already had enough luck to keep me encouraged. The crawl is an interesting idea, but I see some wisdom in what Watcher said.

If one can design the crawler, then it shouldn't be a big leap to focus the crawl on selected web sites. Anyone prolific enough to catch a manufacturing case and careless enough to leave in the metadata may also frequent Picasa, Flickr, or the like. I've pulled EXIF data from those. Now just figure out how to crawl or scrape their data and I'll sign on. I'll be watching for updates.
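Focusing the crawl on selected sites mostly comes down to a scope check on every link before it is queued. Here is a rough sketch of that idea, assuming a simple host allowlist. The host names and the `in_scope` function are placeholders for illustration, not real targets or part of any particular crawler.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the sites actually in scope.
ALLOWED_HOSTS = {"photos.example.com", "images.example.org"}


def in_scope(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True if url is http(s) and its host is an allowed host or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Match on the exact host or a dot-separated suffix so that look-alike
    # names such as "notphotos.example.com" are rejected.
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

A crawler would call `in_scope(link)` on each extracted link and drop anything that returns False.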

 
Posted : 01/09/2009 4:02 am
(@gmarshall139)
Posts: 378
Reputable Member
Topic starter
 

Additionally, most images on the web outside of photo sites tend to be small low resolution images. Generally, external editing/resizing software does not keep the EXIF data in the new smaller images. Unless the image was taken directly by the camera in question and posted unedited, there is a good chance the EXIF data is gone.

I agree that these two points are the major limitations. Any good resources for programming web crawlers?

 
Posted : 01/09/2009 5:56 pm
(@gmarshall139)
Posts: 378
Reputable Member
Topic starter
 

I found something of a modular crawler here to play around with

http://www.cs.cmu.edu/~rcm/websphinx/

 
Posted : 01/09/2009 6:11 pm
(@kovar)
Posts: 805
Prominent Member
 

Greetings,

Be aware that if you crawl large volumes of data from a professionally run site, they may shut you down. This often happens when imaging web sites without a throttle set on your application. The site will detect excessive bandwidth demands from your IP and either shut you off completely or throttle you way back. wget and other similar tools have settings to limit the bandwidth demands.

Crawling a photo site would likely trigger similar responses.
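With wget, the --wait and --limit-rate options cover this. In a home-grown crawler the same idea can be sketched as a fixed minimum delay between requests. This is only an illustration: the function names and the 2-second default are assumptions, not a recommended setting for any particular site.

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def throttled_fetch_all(urls, min_interval=2.0, fetch=fetch,
                        sleep=time.sleep, clock=time.monotonic):
    """Fetch urls one at a time, starting requests at least min_interval seconds apart.

    fetch, sleep and clock are injectable so the pacing logic can be
    exercised without touching the network.
    """
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)  # pause until the minimum interval has elapsed
        last_start = clock()
        results[url] = fetch(url)
    return results
```

This limits the request rate rather than the bandwidth, so large files still download at full speed once a request starts.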

-David

 
Posted : 02/09/2009 2:10 am
(@seanmcl)
Posts: 700
Honorable Member
 

Look at

www.rxfn.com/projects

I have installed APF and BFD on a couple of client sites, and they do exactly what you describe: they look for an unusual number of connection attempts within a short period of time and create an iptables rule to block the offending IP address.
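For a feel of what a crawler is up against, the general idea of counting connection attempts per IP inside a time window can be sketched as below. This is not how APF or BFD are actually implemented. The class name and thresholds are made up for illustration, and a real tool would add a firewall rule instead of filling a set.

```python
from collections import defaultdict, deque


class ConnectionRateMonitor:
    """Hypothetical sliding-window counter of connection attempts per IP."""

    def __init__(self, max_attempts=20, window_seconds=10.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # ip -> timestamps inside the window
        self.blocked = set()

    def record(self, ip, timestamp):
        """Record one attempt; return True if the IP is blocked after this attempt."""
        if ip in self.blocked:
            return True
        attempts = self._attempts[ip]
        attempts.append(timestamp)
        # Discard attempts that have fallen out of the window.
        while attempts and timestamp - attempts[0] > self.window_seconds:
            attempts.popleft()
        if len(attempts) > self.max_attempts:
            self.blocked.add(ip)  # stand-in for adding a firewall rule
            return True
        return False
```

A crawler that spaces its requests so that it never exceeds the threshold inside the window would not trip this kind of check.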

Of course, this is Linux based, but most of the contraband that you see out there would likely be on Linux hosted servers.

 
Posted : 02/09/2009 2:45 am
(@gmarshall139)
Posts: 378
Reputable Member
Topic starter
 

I've found it pretty easy to pull down images off of web sites with a crawler. So I can now process them in the normal way with a metadata parser or even a simple keyword search since I am looking for one specific string. Not an elegant or efficient solution though.
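The simple keyword search over the downloaded files could look something like the sketch below. It assumes the identifier is stored in the file as a plain byte string (for example an ASCII serial number), which will not hold for every camera. The function name and the example string in the usage note are made up.

```python
from pathlib import Path


def files_containing(root, needle: bytes, chunk_size=1 << 20):
    """Yield paths of files under root whose raw bytes contain needle."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        tail = b""
        with path.open("rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                # Prepend the end of the previous chunk so a match that
                # straddles a chunk boundary is still found.
                window = tail + chunk
                if needle in window:
                    yield path
                    break
                tail = window[-overlap:] if overlap > 0 else b""
```

For example, `list(files_containing("downloads", b"SN0123456"))` would list every file in a hypothetical downloads folder that contains that string anywhere in its bytes.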

So far I haven't been blocked, but I'm just running some tests and not really pushing it. It's also not a suitable technique for child pornography investigations, for obvious reasons. But that's not what I need now anyway.

 
Posted : 02/09/2009 6:11 am