These days computers can use high-performance AI technology to optimize energy grids for entire cities, provide accurate sports analysis, create movie trailers, detect the spread of cancer and so on – the areas where it can be used are almost endless. At Griffeye, our AI technology detects child sexual abuse content in massive image and video datasets. Here’s how we trained it.

Training AI technology on the right data, data that is both relevant and sufficiently large, is absolutely necessary for it to work and produce high-quality results. This was one of the most important aspects for us, since the goal was to detect with high precision the distinctive features that characterize child sexual abuse (CSA) images and videos.
Our AI technology has been trained on categorized CSA material at Taskforce Argos of the Queensland Police in Australia, one of the world’s – if not the world’s – most renowned victim identification units.
Taskforce Argos’ database is quality assured, and each image has been reviewed and categorized by police in Australia according to the criteria that collectively describe CSA material. We have also trained our AI technology on adult pornography and other irrelevant content, so that it learns the difference between relevant and irrelevant material.
The Griffeye AI technology was pre-trained to know what an image is and what’s important in an image such as faces and people. At that stage, our AI technology could understand images but couldn’t recognize sexual abuse of children.
The training set at Taskforce Argos consisted of around 300,000 unique images. The algorithm was exposed directly to the illegal images to make it understand what illegal images look like and find similarities and relationships between them. In addition, legal images were added during the training session, some of which were visually similar to the illegal images so that the algorithm learnt the specific details that determine whether an image depicts child sexual abuse or not.
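Griffeye has not published the framework or network architecture it uses, but the general recipe described above (start from a network pre-trained on generic images, then fine-tune it as a binary classifier on domain-specific material) can be sketched roughly as follows. PyTorch and a standard ResNet backbone are stand-ins here, and the folder name and class labels ("relevant" vs. "irrelevant") are placeholders, not the actual data:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets

# Start from a backbone pre-trained on generic images, so the network
# already understands edges, textures, faces and people.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Replace the final layer with a binary head: relevant vs. irrelevant material.
model.fc = nn.Linear(model.fc.in_features, 2)

# Hypothetical folder layout with one subfolder per class; some of the
# "irrelevant" images are deliberately visually similar to the relevant ones.
train_data = datasets.ImageFolder(
    "training_set/",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One pass over the training data; in practice this runs for many epochs.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```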
In addition to the training set, a validation set of images is used to find out how the training is going. This set is smaller than the training set and is used to validate whether the technology has been properly trained. Does the algorithm understand the relationships? And how specific or general is the classification? As long as the classification performance on the validation set improves, training continues.
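In code, that validation-driven loop is ordinary early stopping: keep training as long as the validation score improves, and stop when it no longer does. A minimal sketch continuing the example above; val_loader is a hypothetical loader over the validation set:

```python
def evaluate(model, loader):
    """Return the fraction of correctly classified images in a loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

best_val_accuracy = 0.0
epochs_without_improvement = 0
patience = 3  # stop after three epochs without improvement on the validation set

for epoch in range(100):
    # One pass over the training data (same loop as in the sketch above).
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()

    val_accuracy = evaluate(model, val_loader)
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation performance has stopped improving
```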
The last step of the training process was a final test to measure how accurately the AI technology identifies and classifies images that depict CSA. This final test was made on yet another set of material, one that had not been used during training or validation.
One of the biggest challenges with machine learning in general is “overfitting”, which means that the algorithm becomes too good at classifying the specific images it is trained on. The classification becomes too specific, instead of capturing the general similarities and relationships between the objects in the pictures. When that happens, the learned connections can no longer be generalized to new images. That’s largely why the validation set is so important. However, because the validation set is used over and over again to guide the training, there is a risk that “overfitting” will eventually affect it as well. This in turn explains the need for the final accuracy test: it gives a true answer about how good the technology really is.
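The final test itself is simple once the model has been selected: run it once over material it has never seen, and report the numbers that matter. A sketch continuing the example above, using scikit-learn’s metrics; test_loader is a hypothetical loader over that held-out material:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

model.eval()
all_predictions, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:  # material never used for training or validation
        predictions = model(images).argmax(dim=1)
        all_predictions.extend(predictions.tolist())
        all_labels.extend(labels.tolist())

# Precision: of the images flagged as relevant, how many really are?
# Recall: of the truly relevant images, how many were found?
print("accuracy :", accuracy_score(all_labels, all_predictions))
print("precision:", precision_score(all_labels, all_predictions))
print("recall   :", recall_score(all_labels, all_predictions))
```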
Another difficulty – and perhaps the greatest challenge of all when training AI technology on CSA material – is the availability of data to train on. In other cases where you train AI technology, you can find and even create the pictures you need. But with child sexual abuse images it becomes difficult. We can neither create nor collect illegal material to train the technology on. So we had to find different ways to work around the problem.
Although there are relatively few users who currently have access to the AI technology (since it’s currently a beta version), several users and organizations have provided detailed feedback. The general message is unquestionably positive. Some have already managed to work the Griffeye AI into their daily work, and with good results. But as a beta version, there’s still a lot to work on.
We’ve received valuable feedback from users pointing out patterns in the mistakes the technology makes. For example, small images within a collage are difficult to classify, because the algorithm can’t find the details in the individual pictures. At present, we scale images down before analysis, since larger images require more computing power, and after downscaling the details in a collage’s sub-images are largely lost.
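The effect of that downscaling is easy to quantify with a back-of-the-envelope calculation. The numbers below are illustrative only, not Griffeye’s actual input sizes:

```python
# A hypothetical 2000 x 2000 px collage made up of 4 x 4 sub-images,
# downscaled to a typical 224 x 224 classifier input.
collage_size = 2000   # pixels per side of the full collage
grid = 4              # 4 x 4 sub-images
model_input = 224     # pixels per side after downscaling

sub_image_before = collage_size // grid   # 500 px per sub-image in the original
sub_image_after = model_input // grid     # 56 px per sub-image after downscaling

print(f"Each sub-image: {sub_image_before} px before, ~{sub_image_after} px after")
```

At roughly 56 pixels a side, most of the detail the classifier relies on is simply no longer there.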
Another example where the classifier sometimes makes mistakes is certain types of images of children, such as holiday pictures on the beach. For the algorithm, these images are often too similar to illegal pictures. However, by training the algorithm on these types of images, the AI technology can learn to understand the difference, and quickly classify what is illegal.
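One common way to act on that kind of feedback is hard-negative mining: collect the legal images the classifier wrongly flags, give them the correct label, add them to the training data and fine-tune again. A rough sketch continuing the earlier example; the folder name is a placeholder and its layout mirrors the training set:

```python
from torch.utils.data import ConcatDataset, DataLoader

# Legal images (for example, ordinary beach and holiday pictures) that the
# current model wrongly flagged, now correctly labelled as irrelevant.
hard_negatives = datasets.ImageFolder("hard_negatives/", transform=train_data.transform)

# Mix them into the existing training data and fine-tune once more.
combined_loader = DataLoader(ConcatDataset([train_data, hard_negatives]),
                             batch_size=64, shuffle=True)

model.train()
for images, labels in combined_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```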
Training AI technology is an ongoing process. The more categorized material we get access to, the better it becomes. Therefore, the next step will be to train our technology on new data from other organizations and authorities. In addition to training the technology on more illegal material, we will also train it on larger data sets of legal images.
The ambition is also to train the AI technology to reflect, with great precision, how CSA is defined and categorized in different countries. Imagine a baseline AI technology that can be used straight away in most countries, complemented by specific AI algorithms tuned to how CSA material is legally defined in each country.
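One way such a setup could be built, as a conceptual sketch rather than Griffeye’s actual design, is a shared backbone with one lightweight classification head per jurisdiction: the baseline head is used everywhere, while each country-specific head is fine-tuned on material categorized under that country’s legal definition.

```python
import torch
import torch.nn as nn

class MultiJurisdictionClassifier(nn.Module):
    """One shared feature space, one classification head per country."""

    def __init__(self, feature_size, countries):
        super().__init__()
        # Baseline head: the broadly shared definition, usable in most countries.
        self.baseline_head = nn.Linear(feature_size, 2)
        # One small head per country, fine-tuned on locally categorized data.
        self.country_heads = nn.ModuleDict(
            {country: nn.Linear(feature_size, 2) for country in countries}
        )

    def forward(self, features, country=None):
        # 'features' are the embeddings produced by the shared backbone.
        if country in self.country_heads:
            return self.country_heads[country](features)
        return self.baseline_head(features)

# Usage sketch: classify the same backbone features under different definitions.
classifier = MultiJurisdictionClassifier(feature_size=2048, countries=["AU", "SE", "US"])
features = torch.randn(8, 2048)                  # placeholder for backbone output
baseline_scores = classifier(features)           # baseline definition
au_scores = classifier(features, country="AU")   # country-specific definition
```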
Research in AI and machine learning is advancing rapidly, but the basis for how the models are trained is the same. The amount and quality of data used to train the algorithm determine how good the results are. The massive volume of child sexual abuse images is deeply concerning, of course, but at least we can turn that to our advantage using AI technology. New and improved GPUs are released yearly, with more focus on machine learning tasks, which makes it possible to keep up with the increased volumes of data. AI technology can help to lower the physical and mental burden for investigators, help in identifying victims and perpetrators more quickly and accurately, and ultimately stop and prevent child sexual abuse.