by James Zjalic
Much has been written about the CSI phenomenon within digital forensics circles, but is there a way we as experts can reduce this effect, maybe not globally but at least amongst our own clients? In just the last couple of weeks, I’ve had a request to enhance the speaker on the other end of a phone call, in a recording in which that voice sounded like something you would hear in a cartoon. It had the rhythm of somebody speaking, but that’s about all it had going for it. Another request was to enhance a video recording in which the two individuals were seated at a distance, in a dark room, with sunlight streaming through a window across the camera lens and a lamp in front of one of the individuals’ faces. A third and final example was a request to enhance the screen of a mobile phone from pictures of said mobile phone. That shouldn’t be a problem, you think, until you consider that the pictures were taken from over a meter away, of a phone that was turned away from the camera lens.
The CSI effect takes its name from the hit TV show, but its roots have now spread much further afield with the rise in shows and films featuring digital forensics, such as Homeland, James Bond and Spooks. As entertainment blurs with reality, people believe that much more can be achieved than is actually possible within the laws of physics. The 100% voice-match algorithms and facial mapping techniques used on TV break the laws of both logic and science. For example, taking measurements of the human face (such as the distance between the eyes, the length of the face and the width of the nose) that are accurate enough to differentiate two individuals would require high-definition photos taken from the perfect angle. Forensic images generally come from CCTV cameras located high above the suspect’s face (poor angle) and are of poor resolution due to the encoding format or the distance of the individual from the camera. So, sorry CSI, that facial mapping technique just isn’t going to work.
Limitations of Physics
There are basic limitations of audio, image and video forensics based on the physics of the real world. I will attempt to break a few down to serve as an example.
Digital audio is captured by taking samples of real-world acoustic events through a microphone. The sampling process doesn’t differentiate between the spoken word, the noise from a truck driving past, or the music playing in the background. It just combines those sounds into a single signal, like the mixing of ingredients within a soup. Imagine you have 5 ingredients in your tomato soup: tomatoes, oil, basil, garlic and salt, and the tomatoes are analogous to the speech within a recording. You can finely strain the soup to remove the basil (this might be a filter removing the low-frequency hum). You can then attempt to remove some of the oil (which may be the noise), but in doing so you remove some of the tomatoes, which you certainly do not want to do. Do you see the problem? There are advanced techniques, such as blind source separation and methods that take a sample of the noise (the oil) and use deep learning to predict which areas are noise and which are speech, but you cannot expect to remove all of that noise and end up with a crystal-clear, high-quality recording. Experiments have also shown that current enhancement algorithms generally do not improve the intelligibility of a recording (the number of words that can be determined accurately), only its quality (how pleasant it is to listen to).
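The soup problem can be sketched numerically. The example below is a minimal illustration under stated assumptions, not a forensic enhancement method: the "recording" is just three pure tones (a 50 Hz hum and two stand-in speech components at 200 Hz and 1000 Hz), the sample rate and cutoff are arbitrary choices, and the filter is a basic first-order high-pass. Real speech is far more complex than two tones.

```python
import math

SAMPLE_RATE = 8000                                      # samples per second (assumed)
HUM_HZ, SPEECH_LOW_HZ, SPEECH_HIGH_HZ = 50, 200, 1000   # illustrative tones only
N_SAMPLES = SAMPLE_RATE                                 # one second of audio


def tone(freq_hz, n_samples):
    """A unit-amplitude sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(n_samples)]


def high_pass(samples, cutoff_hz):
    """First-order high-pass filter (a discretised RC circuit)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / SAMPLE_RATE
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[n] - samples[n - 1]))
    return out


def amplitude_at(samples, freq_hz):
    """Amplitude of one frequency component, found by correlating with sine and cosine."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)


# The microphone records the sum of everything: hum and "speech" become one signal.
mixture = [
    h + lo + hi
    for h, lo, hi in zip(
        tone(HUM_HZ, N_SAMPLES), tone(SPEECH_LOW_HZ, N_SAMPLES), tone(SPEECH_HIGH_HZ, N_SAMPLES)
    )
]

# "Strain the soup": filter out the low-frequency hum.
filtered = high_pass(mixture, cutoff_hz=150)

# Fraction of each component that survives the filter (1.0 means untouched).
gains = {
    f: amplitude_at(filtered, f) / amplitude_at(mixture, f)
    for f in (HUM_HZ, SPEECH_LOW_HZ, SPEECH_HIGH_HZ)
}
for freq, gain in gains.items():
    print(f"{freq:>5} Hz component keeps {gain:.0%} of its amplitude")
```

The hum is reduced the most, but the lower speech component is attenuated as well: the filter cannot tell which energy belongs to speech and which to noise, only which frequencies to pass.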
As a quick experiment, open any image on your computer and zoom in as far as you can go. Here you will see the building blocks of an image, the individual pixels. If you zoom out a little you may still be able to see them, but the image is a little clearer. Zoom out further and the image becomes clearer still, as the pixels become too small for our eyes to make out and we see only the bigger picture. If there is a suspect in the distance that requires zoom, the image quality will decrease as the magnification increases, and at worst the face may be composed of a block of pixels just 10 x 10 in dimension. It may be clear that it is a human face (just about), but that’s all that can be seen. Enhancement doesn’t change the fact that the camera has captured only 100 pixels with which to show all the detail in a human face. If the image is doubled in size in an attempt to improve the clarity (increasing the number of pixels to 20 x 20), all that has happened is that new pixels have been interpolated between the existing ones, with values estimated from the neighboring pixels of the original 10 x 10 grid. Brightening the image can’t change the number of pixels captured. Neither can an increase in contrast. Not even sharpening can save you now; it’s just an impossible task, governed by the laws of physics.
There are many instances when CCTV cameras may capture a vehicle, but the license plate appears blurred. This is usually because the CCTV system’s capture settings (a low frame rate and a long exposure for each frame) are too slow for the speed of the vehicle, so the plate moves across several pixels while a single frame is being captured. Speed cameras utilize fast capture rates, as their sole function is to capture license plates and enough images to calculate a car’s speed from point A to point B. If you were driving at 200mph you may beat the speed camera, just as driving at 30mph will probably beat the CCTV camera, but that experiment isn’t advised, not even in the name of science.
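The point about interpolation can be demonstrated with a small sketch. It assumes a very crude camera model (each captured pixel is the average of a 2 x 2 patch of the scene) and uses bilinear interpolation as the "enhancement"; both are illustrative choices, and the function names are hypothetical. Two different 20 x 20 scenes, one full of fine detail and one completely flat, are captured at 10 x 10 and then doubled in size.

```python
def downsample(img, factor):
    """Average each factor x factor block into one pixel (a crude sensor model)."""
    out = []
    for r in range(0, len(img), factor):
        row = []
        for c in range(0, len(img[0]), factor):
            block = [img[r + dr][c + dc] for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upscale_bilinear(img, factor):
    """Enlarge an image by interpolating between neighbouring pixel values."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h * factor):
        # Map the output pixel centre back into source coordinates, clamped to the edges.
        y = min(max((r + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for c in range(w * factor):
            x = min(max((c + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Scene A: fine black-and-white detail at the single-pixel level.
scene_detailed = [[255 if (r + c) % 2 == 0 else 0 for c in range(20)] for r in range(20)]
# Scene B: a featureless mid-grey surface.
scene_flat = [[127.5] * 20 for _ in range(20)]

captured_detailed = downsample(scene_detailed, 2)   # what the camera stores: 10 x 10
captured_flat = downsample(scene_flat, 2)

enhanced = upscale_bilinear(captured_detailed, 2)   # "enhanced" back up to 20 x 20

print("Captured images identical:", captured_detailed == captured_flat)
print("Enhanced size:", len(enhanced), "x", len(enhanced[0]))
print("Enhanced image matches the detailed scene:", enhanced == scene_detailed)
```

Both scenes produce the same 10 x 10 capture, so no amount of processing applied to that capture can reveal which scene was in front of the camera. The enlarged image has more pixels, but every one of them is derived from the same 100 captured values.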
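Some back-of-the-envelope arithmetic gives a sense of scale. The sketch below assumes the vehicle is moving across the field of view and that blur is dominated by how far it travels while one frame is exposed. The exposure times, frame rate and pixels-per-meter figure are illustrative assumptions, not the specifications of any real CCTV or speed camera.

```python
MPH_TO_MPS = 1609.344 / 3600   # meters per second in one mile per hour


def meters_travelled(speed_mph, seconds):
    """Distance a vehicle covers in the given time."""
    return speed_mph * MPH_TO_MPS * seconds


def blur_in_pixels(speed_mph, exposure_s, pixels_per_meter):
    """Approximate motion blur: distance moved during one exposure, in image pixels."""
    return meters_travelled(speed_mph, exposure_s) * pixels_per_meter


PIXELS_PER_METER = 40   # assumed image resolution at the vehicle's distance

# Assumed CCTV settings: 1/50 s exposure, 5 frames per second.
cctv_blur = blur_in_pixels(30, exposure_s=1 / 50, pixels_per_meter=PIXELS_PER_METER)
cctv_gap = meters_travelled(30, seconds=1 / 5)

# Assumed fast-capture settings: 1/2000 s exposure.
fast_blur = blur_in_pixels(30, exposure_s=1 / 2000, pixels_per_meter=PIXELS_PER_METER)

print(f"CCTV blur at 30mph: about {cctv_blur:.1f} pixels")
print(f"Distance travelled between CCTV frames: about {cctv_gap:.1f} m")
print(f"Fast-capture blur at 30mph: about {fast_blur:.2f} pixels")
```

Under these assumptions the plate smears across roughly ten pixels in each CCTV frame, which is easily enough to merge neighboring characters, while the fast exposure keeps the smear below a single pixel.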
How do you solve a problem like the CSI Effect?
So how can we solve the CSI effect? Maybe by re-framing the problem and asking, “How can we manage clients’ expectations?” Although this article has focused on the discipline I am most familiar with, it is a condition that plagues all aspects of forensic science. One method may be to explain to clients, before work commences, the limitations, the reasons for those limitations and how they relate to the case at hand. In the media forensics field, I have found it beneficial to take the time to send a small sample to the client before the work is instructed (for example, a 10-second segment of the enhanced audio), to ensure that their expectations are managed beforehand. It is certain to be an issue that will always haunt us as forensic scientists, as TV shows set expectations for our craft higher and higher, at a rate that science will never be able to match. If it does nothing else for forensic science, it may at least bring new scientists to the field, attracted by the flashy technology of these TV shows, and in doing so bring in more expectations to be managed.
 Honorable Donald E. Shelton, “The ‘CSI Effect’: Does It Really Exist?,” National Institute of Justice, Mar-2018.
 Shoko Araki, Shoji Makino, Ryo Mukai, Tsuyoki Nishikawa, and Hiroshi Saruwatari, “Fundamental limitation of frequency domain blind source separation for convolved mixture of speech,” in 2001 IEEE International Conference, 2001, vol. 5.
 M. G. Jafari and M. D. Plumbley, “Convolutive blind source separation of speech signals in the low-frequency bands,” in Audio Engineering Society Convention 123, 2007.
 T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, “Blind source separation of more sources than mixtures using overcomplete representations,” IEEE Signal Process. Lett., vol. 6, no. 4, pp. 87–90, 1999.
 P. C. Loizou and G. Kim, “Reasons why Current Speech-Enhancement Algorithms do not Improve Speech Intelligibility and Suggested Solutions,” IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 1, pp. 47–56, Jan. 2011.
 Hany Farid, Image Forensics. London: MIT Press, 2016.
About The Author
James Zjalic is a Media Forensics Analyst and partner at Verden Forensics in the UK. Education includes a 1st Class Bachelor’s Degree in Audio Engineering and an expected Master’s Degree in Media Forensics from the National Centre for Media Forensics in Denver, Colorado. Research includes work on image authentication for The Pentagon’s Defense Advanced Research Projects Agency (DARPA) and peer-reviewed publications on subjects including forensic acoustics and audio authentication.