Cinthya, tell us a bit about yourself. What's your role, and what does a day in your life look like?
I am a U.S. Army veteran. I recently fulfilled one of my biggest goals, which was being awarded a Bachelor’s degree in Cyber Systems from the University of New Haven, and now I am working on my Master’s degree in the same field. Thus, my days are a little busy since I’m a full-time graduate student. I also work for the Cyber Forensics Research and Education Group (www.UNHcFREG.com), conducting research and administering a digital forensics platform known as the Artifact Genome Project (AGP).
What was it that first sparked your interest in digital forensics, and how did you get started in the field?
When I first started working towards my Bachelor’s degree, it was not in Cyber Systems. In fact, that degree did not even exist at the University of New Haven yet. I believe it was two years or more into my degree that I decided to take two classes, one in cyber security and one in cyber forensics, and I became very interested in the field.
After taking those two introductory courses and seeing how enthusiastic, wise, and knowledgeable my professors were in their fields, I was inspired to become even more interested. Honestly, it was the second day of the cyber forensics class that sparked my interest in digital forensics: the class was an almost three-hour hands-on lab on bagging and tagging evidence from a fictitious crime scene.
You've recently published a paper about datasets for digital forensics. Could you outline some of the challenges you address in the paper?
Digital forensics is a fairly new domain. One of the main themes addressed in the paper is the challenge scientists face when conducting research in digital forensics. The lack of real-world datasets is a big obstacle when trying to produce realistic results, because in most cases realistic data cannot be shared with the forensics community due to privacy restrictions and proprietary rights.
Besides that, other challenges addressed include a lack of dataset sharing within the community, scarce research on new technologies, insufficient efforts, the absence of standards on how to release datasets, and finally, the absence of a central dataset repository for the digital forensics community.
Why is replication of results so important, and how can we ensure this happens more often?
In the scientific domain, scientists share datasets so that results can be reproduced and their respective fields can advance. The forensics field should be no different.
Obviously, we know that there are other limitations involved; however, in order to progress and continue to discover and overcome new challenges, it is especially important for the forensics community to support each other, not only by conducting research but also by releasing the datasets that were produced or used in that research. This helps others who want to conduct the same research, with either the same or a different methodology, to at least have the same datasets on hand. This would ensure that the expected results are reproducible and comparable.
It is not a surprise for researchers to use datasets to test new tools, but if the datasets originally used to test a tool are not provided, then there is always a risk that the results will not match the original.
The only way to ensure this happens more often is for researchers to step up and start sharing more of the datasets used in their research. It is understandable there are times when sharing datasets would not be possible due to privacy rights, but I believe that there are ways to mask private information and avoid exposing private data.
Sharing datasets not only helps reproduce results but also helps other researchers use the same datasets in different experiments. Moreover, certain types of datasets and case scenarios are of great benefit to academia for teaching students about digital forensics.
What are some of the main differences between real world and synthetic data sets, and what effect does this choice have on research?
Experimenting on real-world data is crucial for developing reliable algorithms and tools. Researchers in digital forensics often want to solve real-world problems, and in order to produce realistic results they need to use real-world datasets. Synthetic datasets are not favored when the goal is to produce realistic results; simulated or generated datasets are more suitable when real-world data is not available and researchers have no other option. It is important to realize, however, that the main way we can learn from our past and advance in this field is to do our best to conduct experiments on real-world data.
Can you recommend any example data sets for researchers, or places where they might be able to find these?
Part of our research was to do just that. We examined 715 published articles and also searched online for datasets that the forensics community can use in their experiments. We offer a centralized dataset repository where all of our results are published, with extensive details and a link to each dataset’s location.
Examples of datasets include hard disk drives, network traffic and so on. Dataset repositories include Digital Corpora and Cyber Impact Trust. You can find our results on our website.
We also encourage researchers to contribute datasets to our site that might be of great interest to others. Another platform for sharing digital forensic artifacts is the Artifact Genome Project, which I currently administer at the University of New Haven; the project can be visited here.
Could you briefly outline for us the results of your research?
Based on our research, we discovered that out of the 715 articles, 351 (49%) used datasets in their experiments, while 51% focused on studies informing the community about standards, techniques, policies, and topics such as programming and algorithms.
Out of those 351 articles, 54.4% favored reusing datasets that were available prior to their research; these were datasets acquired from other parties.

The remaining 45.6% of the 351 articles created their data during the research, and unfortunately only 3.8% of those articles released their datasets after the experiment concluded. Thus, we concluded that only 29.0% of the articles in total have data that was shared within the community and is available to use now.
Furthermore, we found that most articles relied on experiment-generated datasets more than on any other type. We believe this is because, in many cases, there is a lack of real-world datasets available to the community. Real-world data was the second most used type of dataset, and synthetic datasets were the least used. These are the main results; nevertheless, our article includes further results based on other factors we analyzed. I encourage anyone who is interested in this topic to read the article, which is freely available here.
Are you working on any new research at the moment?
Yes, I am working on new research, and we are working towards completing it this year. However, due to the magnitude of the research and its unpublished results, it is currently not possible for me to discuss it here.
Do you have any advice for students of digital forensics?
Yes: never stop learning about this field. Technology keeps advancing, and so should your knowledge. As we’ve seen with recent hacking stories at major organizations and government entities, hackers never rest, and neither does technology. There is so much material available online that can help you grow in this field as well, so never stop looking.
Finally, when you're not researching, what do you enjoy doing in your spare time?
In my spare time, I love spending time with my family and two dogs. I also enjoy going out to concerts and the movies, and of course attending or watching my favorite team’s football games.
Cinthya Grajeda Mendez is a Graduate Student Researcher at the University of New Haven. You can read the full research paper here.