Benjamin, you're an Associate Professor of Information Studies at McGill University – can you tell us more about the role and how you entered academia?
Certainly. As you say, I am currently an Associate Professor of Information Studies at McGill University and previously was an Associate Professor of Information Systems Engineering at Concordia University. I am particularly interested in developing new, scalable data mining methods for privacy protection and crime investigation.
In 2003, after working in the software industry for four years, I noticed there was a need for scalable data mining methods. As a result, I resigned from my job at SAP Business Objects and pursued a Ph.D. in computing science, specializing in data mining, at Simon Fraser University. Recently, “big data” has become a hot research topic, but data miners have been working on “big data” for more than 20 years already.
Your research focuses on designing intelligent systems for the purpose of crime investigation. How did you become interested in these topics?
After joining the Computer Security team at Concordia in 2007, I had a lot of opportunities to interact with different law enforcement units in Canada. In the meetings, I found that there is a big gap between the state-of-the-art data mining methods in the literature and the current software tools used by law enforcement officers. A lot of important evidence can be collected from the suspects’ digital devices, from laptops to smart phones. The challenge is how to efficiently retrieve the relevant information from such a large volume of (unstructured) textual data.
Your paper "Subject-based semantic document clustering for digital forensic investigations" has recently been published in Data & Knowledge Engineering. What prompted this line of research?
We had a collaborative research project with Sûreté du Québec, the Quebec Provincial Police. In the meeting, we found that the law enforcement officers were using a traditional keyword-based approach to search the documents from the confiscated computers. The quality of the retrieved results depended largely on the experience of the investigator. We were confident that we could do a much better job than that.
Can you give us a brief overview of the paper and the conclusions you reach?
We assume that the investigator has access to a large volume of textual data collected from a confiscated computer. The textual data can be the entire hard drive in a computer, e-mails, chat log messages, etc. The traditional approach is to build an index of the documents, and let the user enter some key terms for searching. However, this approach does not work well for crime investigation. For example, a drug dealer will never use the word “drugs” in his e-mails. When an investigator searches for “drugs”, she is really searching for the documents (e-mails) that are related to the topic “drugs”. An e-mail may be related to drugs even though the word “drugs” does not appear in it. This is the motivation for developing a subject-based search engine for crime investigation.
The scientific contribution we made in this paper is to capture the vocabulary of the suspect, and then use the suspect’s own vocabulary to search against his documents. Experiments show that this approach is much more effective than the traditional keyword-based approach, in which the search terms are provided solely by the investigator.
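To make the idea concrete, here is a minimal Python sketch of the difference between plain keyword search and a search expanded with the suspect's own vocabulary. It is not the method from the paper: the co-occurrence heuristic, the function names (`keyword_search`, `expand_with_suspect_vocabulary`), the stopword list, and the toy e-mails are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this toy example.
STOPWORDS = {"the", "a", "to", "is", "at", "of", "for", "and", "on", "in", "me"}


def tokenize(text):
    """Lowercase a document and strip basic punctuation from each word."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def keyword_search(docs, terms):
    """Return indices of documents containing at least one search term."""
    terms = set(terms)
    return [i for i, doc in enumerate(docs) if terms & set(tokenize(doc))]


def expand_with_suspect_vocabulary(docs, seed_terms, min_docs=2):
    """Add words that recur in the documents matched by the seed terms.

    Assumption: a word appearing in at least `min_docs` seed-matching
    documents is treated as part of the suspect's vocabulary for the subject.
    """
    seed_terms = set(seed_terms)
    counts = Counter()
    for i in keyword_search(docs, seed_terms):
        for word in set(tokenize(docs[i])):
            if word not in seed_terms and word not in STOPWORDS:
                counts[word] += 1
    return seed_terms | {w for w, c in counts.items() if c >= min_docs}


# Toy e-mails (invented for this sketch, not real case data).
emails = [
    "The cocaine is ready, bring the snow package tonight",
    "Cocaine price went up, snow package costs more",
    "Meet me at the dock for the snow package",
    "Lunch on Friday at the usual cafe",
]

investigator_terms = {"drugs", "cocaine"}
plain_hits = keyword_search(emails, investigator_terms)
expanded_terms = expand_with_suspect_vocabulary(emails, investigator_terms)
expanded_hits = keyword_search(emails, expanded_terms)

print("plain keyword hits:", plain_hits)
print("expanded vocabulary:", sorted(expanded_terms))
print("expanded hits:", expanded_hits)
```

In this toy data the third e-mail contains none of the investigator's terms, so a plain keyword search misses it, while the expanded vocabulary picks it up through the suspect's recurring words. A real system would need far more careful term selection than this simple co-occurrence count.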
It seems like your semantic document clustering technique may also be useful in artificial intelligence, particularly the word sense disambiguation algorithm. Is this something you've considered as a future application of your research?
Actually, it is the other way around! We are utilizing word sense disambiguation algorithms in our method. We probably won’t make contributions in word sense disambiguation in our future research, but we will utilize the algorithms from this area.
You've developed a Cyber Forensic Search Engine for indexing, clustering and search. Have you seen any cases of it being used in investigations so far?
We conducted our experiments on real-life criminal data, and the results are encouraging. Recently, we shared our software program and source code with law enforcement units, but I am not sure if they have utilized our tools in any case yet.
What do you think the next major developments will be in digital forensic examination? Which areas need to be explored further?
Social media, such as Twitter and Facebook, will be a valuable source of information for crime investigation. Yet, the general public has raised serious concerns about privacy. We need to seek a balance between privacy protection and effective data mining. This brings us to another of my research directions, privacy-preserving data mining, which aims at achieving effective data mining without compromising individual privacy. I have demonstrated some success in privacy-preserving data mining in the healthcare sector. It will be interesting to see whether similar techniques can be applied to crime investigation.
What do you do in your spare time?
I have been interviewed many times. You are the first one asking this question. 🙂
My time is consumed by research, teaching, and my two kids. If I still have spare time, I enjoy reading about astrophysics and the history of different religions.
Dr. Benjamin Fung is an Associate Professor of Information Studies (SIS) at McGill University, an Affiliate Associate Professor of Information Systems Engineering (CIISE) at Concordia University, and a Research Scientist of the National Cyber-Forensics and Training Alliance Canada (NCFTA Canada). He received a Ph.D. degree in computing science from Simon Fraser University in 2007.