
Child Exploitation Hash Sets

28 Posts
10 Users
0 Likes
5,878 Views
tracedf
(@tracedf)
Posts: 169
Estimable Member
 

So there is a given image with a given hash.

Since the hash is known, I can change just one byte of the image and obtain a file that looks indistinguishable from the original but passes under the radar of a hash comparison.

Publishing the known hashsets has consequences.

<snip>

You don't need to know the hash to change the images. Any collector/distributor of child pornography would be smart to write a program that can toggle a random pixel in each image to break the hash. Releasing the hashes does nothing to aid the child pornographer.

I can't see trying to use hashes to filter images as they are downloaded (too much latency), but it would be useful for identifying child pornography stored on a workstation or file server. If it is detected, the best next move may depend on the locality, but I would run it by my organization's attorneys and coordinate with local law enforcement to determine what our response should be. With ordinary content filtering, we get a lot of false positives because many sites are categorized based on keywords, so a NY Times article about sexual assault on college campuses can get categorized as pornographic. With hashes of known images, a positive result should be definitive 99.99% of the time; the only exception would be images that were added to the hash set by mistake (a mis-identification of an adult pornographic image, maybe).
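A scan of that sort is straightforward to sketch. The plain-text hash-set format and the helper names below are hypothetical, just to show the shape of the approach:

```python
import hashlib
import os

def load_hash_set(path):
    """Load a hypothetical plain-text hash set: one lowercase MD5
    hex digest per line (real hash-set formats vary)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_tree(root, known_hashes):
    """Yield paths under `root` whose MD5 appears in `known_hashes`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if md5_of_file(path) in known_hashes:
                    yield path
            except OSError:
                continue  # unreadable file: skip it, don't abort the scan
```

A hit from a scan like this is only a lead; as noted above, the response to one belongs with counsel and law enforcement.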

In the K-12 environment, we had school resource officers who were sworn police officers so we could have leveraged them in our response.

I think there is more benefit to sharing the information than keeping it secret (excepting new images from open investigations).

For testing software, any hash set works so I agree that these are not needed for that purpose.

 
Posted : 15/09/2016 1:21 am
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

You don't need to know the hash to change the images. Any collector/distributor of child pornography would be smart to write a program that can toggle a random pixel in each image to break the hash.

But then the whole hashsets concept is totally useless. 😯

I mean, if every collector/distributor/redistributor actually "injects" a few bytes and creates a "random" hash, the hashset will never find any positive, not even if it grows to billions of hashes, and it will likely start giving lots of false positives: each image will be regenerated several times, creating several new hashes, and if those are all added to the hashset, sooner or later it will contain every possible hash.

Maybe it's time to have image recognition techniques instead of hashes …
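One concrete middle ground between exact hashes and full image recognition is perceptual hashing: hash the coarse visual structure instead of the raw bytes, so a one-pixel tweak barely moves the hash. A minimal sketch of an average hash ("aHash") over an 8×8 grayscale grid, contrasted with MD5 on the same data (real systems such as Microsoft's PhotoDNA are far more robust than this toy):

```python
import hashlib

def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale image, given as a
    list of 64 intensities in 0..255. Each bit records whether a
    pixel is above the image's mean intensity."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A synthetic 8x8 image: dark left half, bright right half.
image = ([10] * 4 + [200] * 4) * 8
tweaked = list(image)
tweaked[0] += 1  # nudge one pixel by one intensity level

# The cryptographic hash changes completely...
md5_a = hashlib.md5(bytes(image)).hexdigest()
md5_b = hashlib.md5(bytes(tweaked)).hexdigest()
# ...while the perceptual hash barely moves (here, not at all).
dist = hamming(average_hash(image), average_hash(tweaked))
```

Matching then becomes "Hamming distance below a threshold" rather than string equality, which is what defeats the single-pixel trick.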

jaclaz

 
Posted : 15/09/2016 12:30 pm
tracedf
(@tracedf)
Posts: 169
Estimable Member
 

… if they are added to the hashset, sooner or later the hashset will contain every possible hash.

Maybe it's time to have image recognition techniques instead of hashes …

jaclaz

Even with a 128-bit hash, exhausting the hash space really isn't an issue. Even a handful of individual collisions is pretty improbable. As far as I know, the people who commit these crimes are not doing this, but it would be relatively easy to do if they had any programming skills. Supplementing hashsets with image recognition would be a good move, and the technology exists (e.g. Google's reverse image search).
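For what it's worth, the birthday bound makes the "collisions are improbable" point concrete: among n random 128-bit hashes, the chance of any accidental collision is roughly n(n-1)/2^129. Back-of-the-envelope, in code:

```python
def birthday_collision_prob(n, bits):
    """Approximate probability that n uniformly random `bits`-bit
    hashes contain at least one collision (valid when n*n << 2**bits)."""
    return n * (n - 1) / 2 ** (bits + 1)

# Even a billion distinct images hashed with a 128-bit function:
p = birthday_collision_prob(10 ** 9, 128)
# p is on the order of 1e-21 -- effectively zero.
```

By the same formula, you would need on the order of 2^64 images before an accidental 128-bit collision becomes likely.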

This is a bigger problem in computer security/incident response, where the bad guys are constantly tweaking their tools and using techniques that generate new versions with trivial differences. In those cases, it is more difficult to identify their tools because you might have many different hashes or signature strings for the same tool.

-Steven

 
Posted : 15/09/2016 8:15 pm
(@armresl)
Posts: 1011
Noble Member
 

Why are they so restrictive about the hash sets? They can't be used to recreate the images. If they made these more widely available, I think they would find that many organizations would proactively scan for them and report offenders to law enforcement. I worked in a K-12 school district and we would have loved to have a way to identify if any of our staff/teachers ever downloaded child exploitation photos.

You are 100% right. Most of the time, it's just cops being cops and objecting just to object.

The argument will happen a lot of times if you happen to work for the defense. More to the point, the number of road blocks placed in your path if you are non-LE grows very quickly.

LE would (and should, IMHO) be very cautious about releasing hash sets externally.

 
Posted : 16/09/2016 7:20 am
Chris_Ed
(@chris_ed)
Posts: 314
Reputable Member
 

While it is of course programmatically easy to change a file in order to generate a new hash, there will always be enough uncertainty in detection that no huge efforts will be made in this regard. With a readily available hashset, there is no uncertainty.

If the hashsets were made publicly available I would be utterly shocked if there weren't sites on Tor which, within 24 hours of this availability, would guarantee (and advertise) that their entire collection of material is not found in any LE hashset.

Hash sets are not entirely ideal, yes, and perhaps with a large enough collection of child abuse images then we could effectively train a machine to spot them with decent accuracy, but for right now it is still the fastest way to detect this sort of stuff.

I work in the private sector and I can appreciate the frustrations, but IMO there are some things which rightfully should remain in the LE domain.

 
Posted : 16/09/2016 12:30 pm
PaulSanderson
(@paulsanderson)
Posts: 651
Honorable Member
 

If you are developing a tool that you want to work with a particular hash set, then you don't need the hashes themselves, as others have said.

The format of the hashset(s) is all you need, and I see no reason why that can't be publicly available.

 
Posted : 16/09/2016 12:58 pm
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

Hash sets are not entirely ideal, yes, and perhaps with a large enough collection of child abuse images then we could effectively train a machine to spot them with decent accuracy, but for right now it is still the fastest way to detect this sort of stuff.

Allow me to partially disagree.
Image recognition is already accurate enough to identify images, maybe not as accurate as one might like it to be, and possibly the real issue is processing power/time.
But one could use the approach to EXCLUDE "not-CP" images.
I mean, say you have to analyze 10,000 images.
You pass them through a program that recognizes categories like "panorama", "buildings", "trees", etc., or more generally flags any image that does not contain a human figure; what remains is further analyzed.

There is this site by Wolfram (the people that make Mathematica) that (at least to me) is impressive
https://www.imageidentify.com/

And even something *like* google image search, once provided with a large enough set of images would be capable of doing at least the exclusion.
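The exclusion pipeline is easy to express in code. The classifier below is a stand-in stub (a real implementation would call a trained model or a vision API like the ones above, and the filenames are invented); the triage logic is the point:

```python
def classify(image_path):
    """Placeholder for a real image classifier. Returns a set of
    category labels for the image. The table below is fake data
    standing in for a trained model's output."""
    fake_labels = {
        "beach.jpg": {"panorama", "sea"},
        "office.jpg": {"person", "buildings"},
        "forest.jpg": {"trees"},
    }
    return fake_labels.get(image_path, set())

def triage(image_paths):
    """Split images into 'excluded' (classified, no human figure)
    and 'needs_review' (everything else, including images the
    classifier couldn't label), per the exclusion approach above."""
    excluded, needs_review = [], []
    for path in image_paths:
        labels = classify(path)
        if labels and "person" not in labels:
            excluded.append(path)
        else:
            needs_review.append(path)
    return excluded, needs_review
```

Note the conservative default: anything the classifier can't label stays in the review pile, so the automation only ever shrinks the examiner's workload, never the evidence set.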

It seems to me like we are already a generation ahead of what was discussed here a few years ago:
http://www.forensicfocus.com/Forums/viewtopic/t=9693/

jaclaz

 
Posted : 16/09/2016 5:09 pm
Chris_Ed
(@chris_ed)
Posts: 314
Reputable Member
 

That's actually a really good idea - I hadn't considered using it to exclude irrelevant stuff. It could even give you a breakdown of the general subjects so that you could dip test, if you wanted. Like

54043 - cats
3334 - cars
849332 - flowers

It still has the problem of requiring a significantly sized server farm to do this in any sort of reasonable time, so it's probably out of the question for many provincial police forces, but on a nationwide scale…

 
Posted : 16/09/2016 5:19 pm
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

That's actually a really good idea - I hadn't considered using it to exclude irrelevant stuff. It could even give you a breakdown of the general subjects so that you could dip test, if you wanted.

That is the good thing about exchanging ideas: everyone may have different ways of seeing the same thing. As an example, I saw the real drawback of a "public CP hashset" in a totally opposite way:

If the hashsets were made publicly available I would be utterly shocked if there weren't sites on Tor which, within 24 hours of this availability, would guarantee (and advertise) that their entire collection of material is not found in any LE hashset.

I would expect soon to find on the "normal" web all kinds of images, including and especially demotivational posters and lolcats modified to produce a hash included in the CP hashset …
… creating tens, hundreds, thousands, millions of false positives on all computers … 😯

jaclaz

 
Posted : 16/09/2016 6:48 pm
tracedf
(@tracedf)
Posts: 169
Estimable Member
 

I would expect soon to find on the "normal" web all kinds of images, including and especially demotivational posters and lolcats modified to produce a hash included in the CP hashset …
… creating tens, hundreds, thousands, millions of false positives on all computers … 😯

jaclaz

That's not possible given any of the currently known attacks on MD5 or SHA-1. There are two basic criteria for a hash function:

1) It should not be feasible to find two inputs with the same hash (a collision).
2) Given a hash, it should not be possible to find an input that produces that specific hash.

MD5 fails at #1; there are known collisions. Researchers have been able to produce collisions in SHA-1 (using 64 GPUs for 10 days), but only if they can pick the Initialization Vector which, in practice, you cannot.

Collisions, which violate criterion #1, are different from what you need for criterion #2, because you don't have to match a specific hash. If you want to produce two messages, A and B, that have the same hash value X, you are free to modify both A and B. They don't have to land on any specific value as long as they are the same.

Think of the difference this way:

1) Find two people that have the same birthday.
2) Find someone else that has MY birthday.

What you are talking about is an attack on criterion #2: given known hash values (from the CP hashset), find additional images that produce the same hash values (a second-preimage attack). There is no currently known attack that would allow you to do this for MD5 or SHA-1.

SHA-2 and SHA-3 (which I haven't seen used in forensics as far as I can recall) are currently safe against #1 and #2.

Without a serious advance in the cryptanalysis of MD5 and/or SHA-1, poisoning the hashset by producing non-CP images that match the CP hashset is computationally infeasible. If such an advance occurs, all forensic products and hash sets would need to move away from MD5 and adopt another hash algorithm.
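The gap between the two criteria can even be seen empirically on a deliberately weakened hash. The sketch below truncates MD5 to 24 bits (a toy construction, not a statement about real MD5): a birthday-style collision shows up after only a few thousand trials, while hitting one specific target value would take on the order of 2^24 attempts.

```python
import hashlib

def toy_hash(data, bits=24):
    """MD5 truncated to `bits` bits -- a deliberately weak toy hash."""
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)

# Criterion 1 (collision): hash counter strings until ANY value repeats.
seen = {}
trials = 0
while True:
    msg = str(trials).encode()
    h = toy_hash(msg)
    if h in seen:
        break  # found two distinct inputs with the same 24-bit hash
    seen[h] = msg
    trials += 1

# By the birthday bound, `trials` lands on the order of
# sqrt(2**24) ~ 4096, vastly less than the ~2**24 (16.7 million)
# attempts expected to hit one SPECIFIC target (criterion 2).
```

The same square-root gap is why collision attacks on real hash functions always arrive years before preimage attacks, if the latter arrive at all.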

-Steven

 
Posted : 16/09/2016 9:53 pm
Page 2 / 3