Presenters: James Billingsley, Principal Solutions Consultant, Cybersecurity & Investigations, Nuix; Robert J. O’Leary, Senior Solutions Engineer, Investigations SME, Nuix
Join the forum discussion here.
View the webinar on YouTube here.
Read a full transcript of the webinar here.

Robert O’Leary: My name, as I said, is Robert O’Leary. I’m a Senior Solutions Engineer, here with Nuix’s North American commercial sales team. I have 20 years of experience with the New Jersey State Police Field Operations and Investigations sections, and I started the High Tech Crimes & Investigation Support Unit in 1996. I also served as the Director of the National Institute of Justice Electronic Crime Technology Center of Excellence. I have 20 years of experience in conducting and supervising criminal and civil digital forensic investigations, incident response, litigation support, tool testing and evaluation, and training. I am a certified computer forensic examiner, and I’ve also been qualified as an expert witness in both US state and federal courts.
James, our presenter for today, has over a decade of experience in computer forensics. Before joining Nuix, James worked as a Senior Breach Investigation Consultant, leading PCI investigations for clients such as Visa and MasterCard. As a senior eDiscovery consultant, he has supported legal reviews as part of enterprise-scale global investigations. James has completed over a hundred cases supporting police forces and government agencies in the United Kingdom, and he has also served as an expert witness in the UK courts. He is a co-author of internet browser forensics tools that have been used by SANS, law enforcement agencies, and corporations in over 65 countries. James works with the Nuix team of advisors and training instructors supporting the United Nations Cyber Security Awareness program.
James Billingsley: Thanks very much for the introduction, Rob, and good afternoon, everybody. I hope you’re all having a good day so far, and thanks for spending your available time with us in order to talk about data visualizations today – data visualization specifically in the world of digital forensics. We’re going to start today by taking a step back and looking at what techniques people have used to interpret data throughout history, and how these techniques benefit us as the interpreters, because in order to understand how we’re going to apply data visualizations in a meaningful way to forensics, it’s important to consider the purpose of the technique we’re using, the right way and wrong way that we can go about doing that, and why, as investigators, we need to be aware of the benefits these techniques can bring to our work. One of the first people to add some format or structure to the data they were recording was this man, Claudius Ptolemy, who, nearly 2000 years ago, in Alexandria, Egypt, was known as something of a polymath, but it’s his approach to astronomy and maths that interests us particularly.
Claudius created something he called the handy tables. These provided a look-up reference table for calculating trigonometric functions and the positions of the planets. This is one of the earliest examples of arranging data into columns and rows, and the aim here was to allow for a more practical arrangement of the data and allow it to be more easily digested and utilized. And this proved very popular and was widely adopted as the technique at the time.
It wasn’t then until the 17th century that the beginnings of visual thinking arose, starting with the French philosopher René Descartes. He took the next step and developed a two-dimensional coordinate system for displaying numerical values to assist with performing things like mathematical operations. So we’re talking your classic horizontal axis for one variable and vertical axis for another variable, and you plotted the results of these equations on to graphs, and observed the different graph line shapes that emerged. Although he may have not immediately known this, this idea was later to be recognized as a highly efficient means to present information to others. The data can be more easily interpreted in this form, and it also tells a story about the data.
Fast forward to the next century – people really started to explore the potential of graphics for the communication of quantitative data. Leading the way was a Scotsman called William Playfair. He was the creator of many of today’s graphs, such as the pie chart and the bar chart shown here. This bar chart shows the value of a quarter of wheat in shillings and also in days of wages of a good mechanic at the time. And as well as some impressive calligraphy going on there, what this bar chart shows us is the change of this variable over time, with the price, shown in pink, rising sharply into the 19th century on the right, far beyond the wages of the mechanic at the time, shown at the bottom in green.
So this type of chart is great at quickly and easily allowing us to see the change of variables over time and compare them to each other. But just because data can be visualized doesn’t mean it’s immediately better. There are many stories that we can extract using that data, and we need to clearly identify what story we want to extract and view the data from a point of reference that reveals the story the data is telling in a simple, clear format.
So let’s have a look at a quick example here, of a good, effective way to tell a story using a visual, and also a bad, ineffective way. Here’s a list of people who have varying amounts of financial savings. And if I asked you to tell me who has the most money in the bank using this linear, textual view of the data, it might take a few seconds to extract that answer. And if this was a much longer list, then it would take even longer to extract that information. Well, how about now? Well, not only does it still take a few seconds to find the answer, but you’d be forgiven for initially interpreting that David, at the top, has the most savings. And if we spend a little more time looking, we can see that’s not the case.
So sure, this chart looks fancy here, but it’s failed in its primary purpose, which is to show us the simple story that we’re trying to see, answering the simple question that we have to ask.
And how about now? So the chart doesn’t have to be pretty as long as it serves its purpose. So we can now quickly see that Michael, on the right, has the most savings. So it tells us that story, tells us that answer, in a simple, quick fashion.
If the field of data visualizations interests you at all, I can highly recommend checking out books written by Stephen Few. He’s written some great books on data visualization and how the visuals themselves should be designed with a focus on leveraging how the human brain interprets and processes visual information. And we want to keep this idea in mind, otherwise the data visualizations we use won’t achieve their primary purpose.
So it’s not to say that our visualizations can’t be aesthetically pleasing or beautiful to look at as well as functional, but we should keep in mind that without the efficient delivery of our message, our story, then the visual is entirely redundant, no matter how visually pleasing the design may be. And here at Nuix we’re acutely aware of the abundance of ineffective visualizations which have been baked into software products in the past. So we focus each of our visuals on the efficiency that it carries out to tell a story from the data. Each of those visuals needs to be able to answer your question more clearly and not slow you down in your task.
Now we’re going to talk about graphical representation of data for human consumption from two different standpoints. Firstly, the role of visualizations – the role they play in the relationship between man and machine, how we interact with the computer data – and then we’ll move on to visualizations for investigation purposes. It’s easy to highlight the importance of visualization of computer data for humans if we just take the graphical user interface – so the primary communicator between the human who’s interacting with the machine, and what they want to do with the data stored on the machine, and how they want to drive it, and how that’s displayed to the user. And the fact that this is how computer data is best interpreted by humans is exactly the driving force behind the progression of the graphical user interface on modern machines.
So we started interacting with the computer data using simple text representations, and then, come the late 1970s, the majority of advances in the field occurred, largely due to the computer power required to visualize data becoming more accessible. And here, on the right we see one of the first graphical user interfaces ever designed, by Xerox’s Palo Alto Research Center. So this interface, with its draggable, overlapping windows, mouse-driven navigation, and pop-up menus – this was the birth of the interfaces that we’re all really familiar with today.
Now, this idea seems very simple to us now – we progressed from that linear text interaction to a graphic-based interface over 40 years ago. And this shift from linear text review to graphical visualization is a logical progression for all fields involving data analytics. And computer forensics really isn’t any exception to that.
So we know that visuals are the way that we prefer to work with the data, to manipulate it, to drive it, but what about visuals for investigation purposes?
So let’s talk about how we visualize data in order to extract the investigation answers or the stories that we need, and how people started to do this in the past.
Probably one of the most famous uses of data visualization in an investigative capacity in history was just over 160 years ago, when London was experiencing a devastating outbreak of cholera in the area of Soho. And here on the map, drawn by a Dr John Snow, each of the black marks in the middle highlights a death on the streets from cholera. And although the prevailing theory was that cholera was an airborne disease, Dr Snow hypothesized that the evidence suggested cholera was actually a water-borne disease. So he proceeded to plot the locations of all the water wells in the area on the same map. This immediately started to tell a story with the data and helped reveal the exact root cause of the outbreak – the well on Broad Street. He immediately ordered the decommissioning of the well, saving many lives, and targeting the outbreak of the disease at its source.
So this is a really fantastic example of how visualizing the data that they had in the right way solved a key investigation into the cholera outbreak and highlighted the cause of the problem that they didn’t know existed before. So visualizing the data both led to the critical investigation finding and allowed simple communication of that finding.
Florence Nightingale was another person to understand the importance of data visualizations around statistics. She created many visuals like this one, based on the conditions of medical care in the military field hospital she managed, and presented these reports to Parliament, who would likely not have been able to easily read or understand traditional statistical reports, but using this kind of view, would have been in a much better position to understand the major issues and tackle them more efficiently. So here, visualizations proved a key reporting method for actually communicating her findings.
More often than not, forensic investigations involve an analysis of data representing human interaction with a computer and with other computer users. So it’s no surprise that, as computers play a key role in human interaction in modern society, visualizations can be a great way to expose these communication patterns and relationships. But long before Facebook ever existed, scientists were attempting to record and analyze relationship patterns. So if we go as far back as the 1930s, graphical representations were being used to map social networks. Here on the left we have a typical linear text record in tabular form, which attempts to map social interactions in a fourth grade classroom. And it was a psychiatrist named Jacob Moreno who was the first to leverage visualizations, as shown here on the right.
So we have our boys represented by triangles and our girls represented by circles, and the lines showing the social interactions recorded between them. Now, if I ask you questions about the data set, you can really quite easily answer them now. So is there any segregation between groups of students? Yes, at this young age there seems to be quite a clear split between the males and the females. Who were the key actors or key players in the groups? We can see some actors have more lines connecting to them, so they’re playing stronger relationship roles within the group. Who acts as the go-betweens connecting the group, so which actors are playing the roles between the groups? And lastly, are there any outsiders? And we can see some separation in the girls’ group on the top right-hand corner there.
So again, if we ensure the visualization used is simple and appropriate for the questions we need to ask, then the answers are actually very easy to extract.
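As a rough illustration of the same idea, here’s a minimal sketch – in Python, with entirely made-up names and interaction records – of how those same questions about key actors and outsiders could be answered programmatically from a simple adjacency map:

```python
from collections import defaultdict

# Hypothetical interaction records: (actor_a, actor_b) pairs observed
# between students, loosely mirroring the sociogram idea.
interactions = [
    ("Ann", "Beth"), ("Ann", "Cora"), ("Beth", "Cora"),
    ("Dan", "Eli"), ("Dan", "Finn"), ("Eli", "Finn"),
    ("Cora", "Dan"),          # a go-between linking the two groups
]
students = {"Ann", "Beth", "Cora", "Dan", "Eli", "Finn", "Gwen"}  # Gwen: no ties

# Build an undirected adjacency map.
graph = defaultdict(set)
for a, b in interactions:
    graph[a].add(b)
    graph[b].add(a)

# Key actors: the most connections (highest degree).
degree = {s: len(graph[s]) for s in students}
key_actors = sorted(students, key=degree.get, reverse=True)

# Outsiders: no recorded interactions at all.
outsiders = [s for s in students if degree[s] == 0]

print(sorted(key_actors[:2]))  # → ['Cora', 'Dan']
print(outsiders)               # → ['Gwen']
```

The same degree counts that drive the example are what a network visualization encodes as node size or line density – the chart just lets the eye do the sorting.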
The graphic shown on this slide relates to data-driven journalism, but the steps are actually largely relevant to what we as digital investigators do. The model treats journalism primarily as a workflow, and its essential steps map closely onto everything we do in an investigation workflow – digging into the data, structuring and cleansing the data, filtering for specific interests, and then visualizing the data and making a story. And this model is focused on enabling reporters to tell untold stories or find new angles.
So this really struck me as not being unique at all to reporting, but essentially the core process behind revealing a relevant story within the data, regardless of the field of expertise. So firstly, you’ll be mining and triaging the data at a detailed level, then you’ll be analyzing and visualizing the remaining relevant data – and it’s this visualizing that’s historically been done quite poorly in digital forensics. It’s our final, important step before we need to translate the data into a report, a story, or some kind of real-world event. So if we fail to give this step of visualizing the information sufficient importance, we’re slowing down the overall investigation process, losing time, and potentially missing key facts and overlooking important trends and anomalies within the data. And these all impact our investigation efficiency and the accuracy of our end story.
So let’s talk about how we approach investigation technique in the field of digital forensics. These are the sorts of data representations we’re all familiar with. You’ve got your linear text analysis playing a big part in digital forensic analysis at present. On the left we’ve got our binary text view of the data, and on the right we’ve got a typical text listing of files and their metadata and properties. So together with some sort of native file viewer, this has really been the playground of forensic practitioners for more than a decade.
So what about finding the data that’s relevant to your investigation? This is largely based on keyword searches, which provide the cornerstone to every digital investigation. And depending on who is conducting the investigation, these keywords might be compiled by the forensic investigator themselves, the investigating officer in charge may submit keywords of interest, or in the realm of eDiscovery, the list of keywords will typically be chosen by the legal team involved. And the problem that arises here is that you’ve got non-technical investigators compiling keywords, which the examination will be based on. So this presents a number of challenges, which we’ll touch on shortly.
There are typically two ways to apply these keywords to search for notable data. The first is entire-disk searching, and the second is searching against a text index of the data being examined. Entire-disk searches used to be the typical approach for a number of years, as they provide the ability to run search terms in hex across the raw disk data, and they’re very accurate in their application. However, they are very time-consuming to run, and create problems when new or amended keywords are presented. So each new presentation of keywords, which is not uncommon to receive throughout the duration of an investigation, adds a very long search-time overhead.
So a more favored approach in recent years has been the second one – to conduct keyword searches across a text index of the digital data. The data is prepared up front, including recovering deleted data, decoding data, and conducting OCR of any scanned documentation. Once that’s completed, it allows for near-instantaneous search results, and that means if we do have any new keywords that need to be added at a later stage in the investigation, we can do that on an ad hoc basis, with no major time loss.
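As a toy illustration of why the indexed approach makes follow-up keywords cheap, here’s a minimal sketch in Python (the documents and item IDs are invented): the expensive tokenizing pass happens once, up front, and every subsequent keyword becomes a near-instant dictionary lookup rather than another pass over the raw data:

```python
import re
from collections import defaultdict

# Hypothetical extracted items: item id -> text content.
documents = {
    "email_001": "Meeting about the shipment on Friday",
    "chat_042":  "Did the shipment arrive? Call me.",
    "doc_007":   "Quarterly finance report",
}

# Build the inverted index once: token -> set of item ids containing it.
index = defaultdict(set)
for item_id, text in documents.items():
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        index[token].add(item_id)

def search(keyword):
    """Near-instant lookup: no re-reading of the underlying data."""
    return sorted(index.get(keyword.lower(), set()))

print(search("shipment"))  # → ['chat_042', 'email_001']
print(search("invoice"))   # → []
```

A real forensic index also handles stemming, deleted-data carving, and OCR output, but the economics are the same: new keywords arriving mid-investigation cost a lookup, not a re-scan.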
So we have our keywords, we have our text index, and we can run them across in a time-efficient manner. So what are the other issues to be aware of?
Well, there are a number of other issues to consider. Keywords may be too specific. So if we pick a very small, exact list of keywords, we run the risk of missing large quantities of data that may be related to our investigation but simply don’t feature the keywords that we’ve been selecting. If we’re missing data, it’s possible we’re going to miss key evidence to the investigation. So the success of the investigation is heavily resting on the quality of those keywords that we’re choosing. We must also consider that keywords can be too vague. So if we select keywords that cast the net too wide, then we risk bringing in too much data into scope, which will be almost entirely irrelevant to the investigation. So such large quantities of data can simply not be manually reviewed by an investigator within a practical timeframe, and it would take far too long and be hugely inefficient to carry out each time.
So standalone keyword searching with a linear text review can present a very slow workflow and can often lack any real context of the results. Effective keyword searching is a field in itself that requires a good amount of experience and expertise to get the appropriate results back. Knowledge of the proper syntax of the application is important, so the technical knowhow of the forensic investigator is required here. Detailed knowledge of the background of the case is often important to select accurate keywords. And many investigators also know various pitfalls and best practices they’ve picked up over experience over time.
It’s easier than you might think to choose bad keywords – so for example, many non-technical investigators may use the suspect name as one of the keywords, which sounds practical enough. However, what if the user’s operating system user account name is their name, which it typically is? This means we’re going to get huge volumes of irrelevant results for this keyword, as the name now forms part of the operating system directory structure and will be stored in a great number of locations on disk. The same applies for words which are typically always part of the operating system directory structure – so things like “windows” or “program”. This may sound obvious, but we’ve all probably encountered this sort of issue on many occasions in the past in our investigations.
Keyword length can also cause issues. If a short keyword of less than, say, four characters is selected, then the frequency with which this text string randomly occurs across the data will typically be quite high. So searching for acronyms such as company names can also return huge volumes of irrelevant data.
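To make those pitfalls concrete, here’s a small hypothetical triage sketch that flags keyword hits falling inside operating-system directory paths and warns on very short keywords. The paths, keywords, and prefix list are illustrative only:

```python
# Hypothetical search hits: (keyword, path where it matched).
hits = [
    ("smith",   r"C:\Users\smith\AppData\Local\Temp\cache.dat"),
    ("smith",   r"C:\Evidence\letters\smith_contract.docx"),
    ("windows", r"C:\Windows\System32\drivers\etc\hosts"),
]

# Path prefixes that routinely contain user and OS names and generate noise.
NOISY_PREFIXES = (r"c:\windows", r"c:\users", r"c:\program files")

def triage(keyword, path):
    """Return warning flags for a single keyword hit."""
    flags = []
    if len(keyword) < 4:
        flags.append("short keyword: expect random matches")
    if path.lower().startswith(NOISY_PREFIXES):
        flags.append("hit inside OS directory structure")
    return flags

for kw, path in hits:
    print(kw, path, triage(kw, path))
```

Running this, only the hit under `C:\Evidence` comes back clean – the suspect’s name matching its own user-profile path is exactly the noise described above.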
Traditional forensics relies heavily on keyword searching and linear text review, which in themselves do present a number of issues to overcome. So what visualizations would investigators typically use to help improve the efficiency of the overall workflow? Well, from my experience, the only visual I ever really saw being used in forensic investigations for many, many years looked something like this. This represents an overview of a computer hard drive and allows you to visualize how the data is distributed and clustered across it. And in this example, the blue cells are areas of allocated data and the black cells are areas of unallocated data. So this means an investigator can directly access specific locations on the drive if they want to.
Although this visualization is quite handy, it really provides very limited scope for any real data analysis benefits, and forensic investigation workflow would really benefit by moving beyond this and utilizing more visual analytical tools. So let’s have a look at some examples of a few data visualizations that allow us to investigate a communication data set using Nuix to do that, and let’s also keep in mind that each visual has a functional purpose and is designed to interpret the data in a way that allows us to answer our questions more efficiently. Otherwise, the visuals themselves are just redundant.
So here is some communication data. This data is extracted from a handful of mobile phones and we may have some call records in here, some digital photographs, some email, a range of different data types here, and it’s all presented to us in a linear text format. So how would we interact with this data? Obviously, we’ve got the application of keyword searching and date ranges, together with some sort of sorting and filtering capability. This will form the basis of a traditional analysis workflow. As we all know, this is fairly time-consuming work, and if you think back to our financial savings spreadsheet from earlier, we’re actually fairly slow to extract information from the data in this form.
Now, without applying any keyword searches or sorting, let’s simply change our visual standpoint around this data set. So here is the same data, viewed as a communications network map. At a quick glance, the picture is capable of communicating information about these communications that the text list just wasn’t able to do. An investigator can process this information in a fraction of the time that it would take them to read the text list of communications and extract the same answers. We can still interact with this data through the application of keywords and date ranges, and even physically moving through the nodes on the visual. But before we do that, we can immediately start to comment on the primary communication accounts or telephone numbers in use, and the lines of communication each account was using.
If we change the visual around our data again, here now we’re looking at the Context Timeline interface. The same methods for interaction apply, but instantly now again, I can look at the data and say that it’s coming from three mobile devices. I can now comment on how much data is recorded on each device, suggesting how heavily used it was. I can see gaps in device usage across the timeline, and I can comment on which device was used most recently. So with very little interaction – here we’re just using a static image and a slide deck – we’re actually learning about our data. So from here we could drill into this timeline and start to interact with it, and learn even more about the data.
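The gap-spotting that a timeline view gives you at a glance can also be sketched in code. Here’s a minimal, hypothetical example (the device names and timestamps are invented) that reports periods where a device went quiet:

```python
from datetime import datetime, timedelta

# Hypothetical event timestamps per extracted device.
events = {
    "device_1": ["2014-01-05", "2014-01-20", "2014-07-14", "2014-07-15"],
    "device_2": ["2014-03-01", "2014-03-02"],
}

def usage_gaps(timestamps, min_gap=timedelta(days=30)):
    """Return (start, end) pairs where consecutive events are far apart."""
    ts = sorted(datetime.strptime(t, "%Y-%m-%d") for t in timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a >= min_gap]

for device, ts in events.items():
    for start, end in usage_gaps(ts):
        print(f"{device}: quiet from {start:%Y-%m-%d} to {end:%Y-%m-%d}")
```

A timeline visual encodes exactly this: a long horizontal stretch with no markers is a usage gap that may itself be evidentially interesting.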
Let’s shift our standpoint a final time around this dataset. Here we see our data plotted out on a geo-location map. Now, some of this data will be Exif data from photographs, maybe resolved IP addresses, and other data involving GPS coordinate information. Again, I can immediately comment on locations where the device has been used or has interacted in some way, maybe communications of some form. Imagine this case involved an illegal diamond trade in Sierra Leone, Africa. Within seconds, I could drill into that location and confirm that there is data I’d like to investigate further in this case. So is it photographs taken in the area, is it communications associated with the area? You’ll see on the map as well, interestingly, we have a hotspot of data focused over the northeastern coast of the US. So does this information correlate with our case investigation intelligence? Now we notice how our investigation is already becoming more fluid before we’ve even considered the compilation of keywords.
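A very rough sketch of how such a geographic hotspot could be surfaced from raw coordinates – simply bucketing points into grid cells and counting – might look like this (all coordinates are invented for illustration):

```python
import math
from collections import Counter

# Hypothetical GPS coordinates recovered from Exif data and call records.
points = [
    (8.48, -13.23), (8.47, -13.24), (8.49, -13.22),   # West Africa cluster
    (40.71, -74.01), (40.72, -74.05),                  # US northeast cluster
    (51.51, -0.13),                                    # lone European point
]

def hotspots(coords, cell=1.0):
    """Bucket points into ~1-degree grid cells, densest cells first."""
    cells = Counter((math.floor(lat / cell), math.floor(lon / cell))
                    for lat, lon in coords)
    return cells.most_common()

for cell, count in hotspots(points):
    print(f"grid cell {cell}: {count} data points")
```

A mapping interface does the same aggregation continuously as you zoom, which is why a cluster of markers jumps out to the eye long before any query would have found it.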
In Nuix we understand the need for visualizations in order to review results sets in as many ways as possible. So it allows you, as the investigator, to change your point of reference and see the data from a visual standpoint appropriate to the question that you need to answer quickly. The visuals we just touched on are all available in the Nuix interface and developed with a keen focus on investigation efficiency and functionality above all else.
So as well as the traditional text results view, graphics view, your binary hex view, we’ve got these additional visualizations, some of which we’ve touched on already. So the network mapping view, the geo-location view. And some of these visuals, such as the communication map on the left, have actually been baked into our tool for over ten years now. So we’ve been pushing visualizations since day one. And these have been proven to be highly effective in delivering functionality to the user, and that’s why they’ve remained core components of the platform ever since they were put in.
Here, in the event map, we have another way of mapping communication threads over time, and a number of different focus timeline views to help us identify patterns and trends emerging over time. And last, but not least, the context interface, which we added to the tool last year – this provides a powerful way to analyze and move through your data visually in one single interface. So from here we can get a high-level overview of the data, we can map data relationships, we can expose critical data sources, or follow events on the timeline.
At this point, I’ll finish off today just by dropping into Nuix Workbench application, to quickly demonstrate how you move between a number of those different visualizations available that we’ve touched on today. If you’re not familiar with the interface already, this is the Nuix Investigation Platform, and we’ve got data from a few different mobile phones loaded into a case here. And I’m going to start looking at this data using a communication network map.
So now let’s say that I’m interested in whichever person they’re communicating with most heavily. I’m going to increase the weighting in this top menu to around 100, and you’ll start to see the interface filter down. And we’re left with just one interaction here, and this is the strongest line of communication in the case. I’m going to go ahead and look into this conversation, these exchanges. And we can have a quick look at these on one of our focus timelines.
As we can see, there was a lot of communication happening over on the left, and then nothing for a period of about a year, and then we’ve got this burst of activity, which happens in July, over on the right.
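That kind of burst is easy to see on a timeline, and equally easy to sketch programmatically. Here’s a minimal, hypothetical example (the timestamps are invented) that flags months whose message volume far exceeds the average:

```python
from collections import Counter
from datetime import datetime

# Hypothetical message timestamps between the two accounts.
messages = (
    ["2013-%02d-15" % m for m in (1, 2, 3, 4, 5)]   # steady early activity
    + ["2014-07-%02d" % d for d in range(1, 21)]    # July burst of 20 messages
)

# Count messages per calendar month.
per_month = Counter(datetime.strptime(t, "%Y-%m-%d").strftime("%Y-%m")
                    for t in messages)

# Flag months whose volume far exceeds the per-month average.
avg = sum(per_month.values()) / len(per_month)
bursts = [m for m, n in per_month.items() if n > 3 * avg]
print(bursts)  # → ['2014-07']
```

The threshold here (three times the average) is an arbitrary illustrative choice; the timeline visual lets the analyst judge the same anomaly by eye without committing to one.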
Now I’m going to go back to my Results view, and I’m going to throw this over to our Context interface. And I’m just going to arrange the layout of this data a bit more appropriately. You may just be a few seconds behind me here, but what you should be able to see in a moment is that immediately, I can tell you, now that we’re looking at this line of communication, it’s all coming from mobile device #3. And I’ve thrown it on to another timeline, and we can see this burst of activity happening in July over on the far right.
Now I’m going to focus in on this and find out what type of things they’re talking about. I can expose entities. And we can quickly see that they’re discussing a person called Stacey, for a start. Or maybe I could grab that whole conversation, throw it back up to the Workbench view, and look at the text [shingles] to start to understand fragments of that conversation. And you might notice a few references talking about the exchange of drugs here.
At this point, we haven’t used a single keyword yet. We’ve started to learn about the data available simply by shifting our visual standpoint in a number of different ways. I’m just going to pop back to our slide deck now, and quickly summarize the session today.
So why are visualizations increasingly an important part of modern-day digital forensics? Quite simply, because they’re the single easiest way for the human brain to interpret information. By leveraging data visualizations more in our investigation workflow, we’ll be able to discover more and new information that we might otherwise have missed, and get to the key evidence in a much more efficient manner suitable for growing data volumes. I think there’s always going to be a place for some form of keyword searching in digital investigation, but it’s how much we rely on that technique and how we visualize the results of the searches that will dictate how effectively we as examiners can handle data as part of modern-day investigations.
Visualizations will also, I believe, form a core part of reporting and communicating investigation findings to others in the near future as well. So this is an area we’re also very keen to work on within Nuix.
I’ll just say a quick thank you again, for joining us today.
End of transcript