Presenters: Paul Slater, Director of Forensic Solutions EMEA, Nuix and Ady Cassidy, Director of Investigation Solution Consultancy – Global, Nuix
Paul Slater: Hello, everyone. I’d first of all like to extend a big thank-you [indecipherable] webinar on how to accelerate digital investigations using advanced eDiscovery techniques. Before we begin, I’d like to thank Jamie at Forensic Focus for making this possible, and I wanted to let you know that there is a discussion board on the Forensic Focus website [that has been set up] specially for this webinar.
We’ve got quite a lot of stuff to cover over the next 45 minutes or so, but we do welcome your questions, and you should have the ability to send us a question via the panel on the screen. We’ve divided the webinar into a number of distinct sections, and we’ll of course try to remember to ask if there are any questions when we move on to the next topic.
We’ve also put aside 15 minutes or so at the end of the webinar for any questions, so again, if you didn’t get a chance earlier or we forget to ask, then you can also do so at the end. Finally, just to let you know that this webinar is being recorded, and I believe it will be made available on demand in the next few days. So it will be available both on the Nuix website and also on the Forensic Focus website.
So again, thanks for taking the time to join. I do hope the information provided is both thought-provoking and useful.
Before we begin, and [start and go on to the agenda], I thought it might be useful to spend a few minutes just to introduce ourselves to you, and provide a little information on our backgrounds. So in true corporate style, we’ve included our pictures on the slide so you can put a face to the voices.
My name is Paul Slater, and I am the Director of Forensic Solutions here at Nuix. I’ve been with Nuix for about six months, eight months now. So whilst I’m quite new to the investigations team that we’re building here at Nuix, I do bring over 20 years of experience within investigation. And I’ve probably got around 15 years’ worth of hands-on experience, both within digital forensics and eDiscovery, having worked both within law enforcement and also in the corporate world. [indecipherable] my background, I’ve been a police officer, and spent 7 years working within two of the big four consultancy firms, where I would often help clients to look at ways to improve their digital forensics [workflows], helping them to integrate eDiscovery [workflows] and best practices within traditional digital forensic environments. In essence, I helped them to work smarter, not harder. That’s part of the reason why I’m pleased to have the opportunity to speak to you all today. Finally, part of my role is to look after the product development [indecipherable] investigator, and this means I get to spend quite a lot of my time talking to our development teams, helping to ensure that as we build out the product, it has all the functionality that digital investigators need it to have.
And I think it’s worth saying that as part of this work, I work very closely with my counterparts who have similar roles within our electronic discovery and information governance teams, and this crossover [ultimately] benefits wider development of the product, and we’ll also show you some of this today in the webinar.
So that’s enough about me. And with that, I’m going to hand over to my co-presenter, Ady Cassidy. Ady.
Ady Cassidy: Yeah, thanks, Paul. Hello, everybody. My name is Ady Cassidy, and I would also like to extend my welcome to you to our webinar. Thank you very much for giving us an hour of your time today. As you can see from my profile here, I also come from a law enforcement background, having spent 19 years as a serving police officer, and many of those years as a digital investigator, in what’s known in the UK as the High Tech Crime Unit.
In the later years of my career, and prior to joining Nuix in 2011, I worked for a forensic and eDiscovery services company in [the city of London], where I was instrumental in building and managing an end-to-end investigation and discovery solution. And notably, it was during this time that I became most interested in the blend of tools available from investigations and eDiscovery, and how they could become of mutual benefit to each other and of course a benefit to myself as an investigator. On a number of occasions, I’ve headed investigations and used both of my skills from both disciplines to enhance the investigative practices. So I very much look forward to having this opportunity to sharing with you today some of my observations and experiences.
Yeah, Paul, I think it’s probably a good time to kick off.
Paul: Okay, thanks, Ady. So moving on, to today’s agenda. Over the next 45 minutes or so, we’d like to talk to you about how we can improve the efficiency of our digital investigations. As you’ve already heard, both Ady and I have worked both in corporate and law enforcement, so hopefully we can provide a balanced perspective on some of the topics we’re going to talk about today. We’ll also try and cover [up to in] the session where key evidence is often found during an investigation, and we plan on sharing some of our real-world experiences in order to help illustrate some of these points, especially when it comes to some of the eDiscovery-based [workflows] [indecipherable], and how that [indecipherable] can be used within a traditional forensic investigation.
Finally, the main part of this webinar will be to show in detail three specific workflows, how these are implemented within Nuix, and why we think that if you [were to incorporate] them within your current investigations, they would allow you to maximize the [volume] of the digital data and would allow you to spend more time doing the important aspects of your job.
Our starting point for today is that we need to improve the efficiency of our digital investigations. And to be clear, this is not something that we [are new] in saying. It’s a message that we hear time and time again from our clients. Part of the reason is that over the last few years, the mass rise of electronically stored information has meant that investigators now face a constant battle to find the truth in ever larger, more varied, and increasingly complex stores of digital evidence.
Within both corporate and law enforcement environments, the need to find the needle in the digital haystack has never been more challenging. And when I speak with almost any law enforcement agency about their software, their workflows and processes, I typically get the same response back. They tell me that they have long backlogs of cases still awaiting processing and investigation. They tell me that their teams are overstretched and often understaffed, and they tell me that they’ve been asked to do much more in less time, often with smaller budgets than in previous years.
And the same is actually true when I speak to corporate people as well, where the need to report to the board or to ensure that business as usual is restored often drives the investigative process. And here, where many investigations are driven by cost, the need to quickly find relevant information is critical. But what’s more interesting is that when I get the opportunity to talk [indecipherable] process in digital evidence, it’s clear that in many cases, they are still using tools, workflows, and processes that were designed before this explosion of digital data.
This is not to say, though, that they’re doing things wrong. In almost all the cases that I’ve looked at, the forensic processes are robust. But times have changed, technology has moved on, and other disciplines, including eDiscovery, have shown that there are now much more efficient workflows, techniques, and toolsets available to digital forensic investigators. And this is what we’re going to show you over the next few sessions.
First, before we move on, I want to just spend a few minutes perhaps dispelling a few myths. Historically, digital forensics and eDiscovery have been considered as distinct and separate professions dealing with the handling of digital evidence, and I’m sure that some of you have experience of digital evidence [through] electronic discovery, and perhaps quite a few of you on this call have your own views as to [its place] within the digital forensics community. However, I believe that by adopting some of the tools, processes, and workflows used in eDiscovery, which incidentally typically works with even larger volumes of digital data than are found in a traditional forensic investigation, we can look to improve the efficiency of these investigations, and we can allow investigators to zero in on critical data and focus their time, to allow them to undertake digital forensics on all of the important data.
[To prepare for] this webinar, I looked up some of the definitions out on the internet, and there’s one on the screen here now. And it was clear that digital forensics is seen by many as a painstaking analysis of bits and bytes, a digital [indecipherable], if you’re old enough to remember who that person was. Someone whose job it is to piece together fragments of deleted or recovered data to prove or disprove a crime. Whereas eDiscovery often gets painted in a different, less technical light, and is seen by many as a lightweight version of its digital forensics cousin, with practitioners often just finding, extracting, and processing emails and documents for lawyers to review.
So let’s take a deeper look at the definition. Digital forensics investigates everything, including deleted files or remnants from former files that have been [partially overwritten]. And this definition on the screen goes on to suggest that a forensic examiner must pay particular attention to certain operating system files, and log files, and temporary files, and the file remnants [indecipherable] allocated clusters.
And if you compare this with the definition for eDiscovery, it says – whereas eDiscovery filters out program, temporary and system files, and processes only active user accessible files, and this usually involves Microsoft Office or other document types and such. And these types of files are then processed in an eDiscovery engine, where they are indexed and catalogued, and then usually loaded into a Litigation Support Platform for lawyers to review.
So these definitions go back a few years. And as I’ve said, technology and processes have a habit of moving on quickly in this industry. So my question is: are these definitions true today? I’m sure many of you on the call can see parallels in both definitions. It’s true that in digital forensics we will often look within system files, wanting to establish the owner or user of a particular computer system. And it’s also true that in eDiscovery, we often zoom in on the emails and user files within the live file system.
But I also think it’s true the other way around. When I was performing digital investigations for both police and [indecipherable] corporate, whilst I was cognizant of the need to examine system files, log files, and [indecipherable] allocated, I would always start my investigation looking for the low-hanging fruit, often finding crucial evidence within the live file system, and in many cases, particularly those involving [indecipherable] images, as they’re known in the UK, I would find that the names and locations of files and videos within the live file system would often add significant weight to the prosecution case. I’d often find copies of incriminating documents stored in the My Documents folder. And very often, the evidence I would exhibit and put forward to the court was produced without the need for me to perform deep, forensic, technical analysis.
I would of course use my digital forensics skills to attempt to make the case as water-tight as possible, [in effect trying to put things on keyboards], and would of course use my experience and knowledge to consider the possible challenges to my work, but more often than not, most of my time was spent doing eDiscovery work, examining documents, emails, and images. And I found that as time went by, as the size and number of [seizures] increased and the time I had available to me decreased, I would have to focus much more of my time and effort to ensure that I could find [potentially relevant] material for analysis. And I’m sure many of you have actually had similar experiences.
So I think today there are many similarities between digital forensics and eDiscovery, and I think that now both of these disciplines show similar processes [in terms of] workflows. And as the volume, variety, and complexity of storage devices increases, it’s even more important that investigators can quickly identify potentially relevant material for analysis. So one of the ways we can accelerate our investigations is to consider a change in workflow. At Nuix, we advocate a methodology that we call content-based forensic triage.
This workflow allows investigators to quickly identify all potentially relevant information across multiple evidence sources at once, using a single tool, and a single [pane of glass] view into the data. Because [Nuix was built with] big data, it provides investigators with a complete picture across all data sets at the same time, and this workflow was also designed to allow an iterative process to the investigation. It allows investigators to build the investigation in layers, rather than simply running every forensic process against every single device. Because when you consider the basic laws of return, the more forensic processes that we perform on a device, the more data that we have to examine as a result of that. This approach allows investigators to search for evidence across all sources at the same time, and to use some of the most powerful discovery [tool out there] [indecipherable].
Let’s consider a typical digital forensic investigation. Today, we have an ever-growing number and complexity of digital sources of data. Things like hard drives, PCs, USB external devices, memory flash cards, email files from systems such as Outlook and Exchange and Lotus Notes, the rise in cloud systems which provide both email and storage services. In corporate environments we have things like Microsoft SharePoint; we’re all aware of systems to back up data and back up [text], and who can forget those interesting and complicated mobile devices and [forms] and tablets that we see much more of these days?
I think it becomes clear that something really needs to change as it becomes more difficult to examine all of these different devices, as sometimes, arbitrary decisions are made as to which sources are analyzed first, and more worryingly, which sources are not examined at all, which of course makes it even more difficult to find that ever-elusive needle in the digital haystack.
So finally, what lessons can we learn from eDiscovery? Well, I’m going to hand it over to Ady in a minute, who is going to discuss the first of three workflows and has a firm grounding in it within eDiscovery. And I believe that if we apply these… using tools such as [Nuix, we can aid] a digital investigation.
Before we move on, Ady, is there anything you want to add to what I’ve just said?
Ady: No, thanks, Paul. There was just one thing – I think it’s great that you brought the definitions to this discussion, because although they are somewhat out of date, they’re still very strong in the relevant industries. I think one of the observations I have of that is the fact that in eDiscovery and the eDiscovery workflow, there’s an assumption of where the actual data is going to be, usually before it’s even been collected. And that assumption, that agreement, kind of defines that collection, whereas in investigations, it’s a much wider collection, and there’s no assumption as to how that collection is to take place. However, when the actual workflow, when the analytics is taking place, then we tend to find – certainly in my experience, and I’m sure people will respond to this – in a very high percentage of the jobs we get involved in in investigations, we do tend to find the evidence where we expect to find it. So if we are doing document… looking for emails, we tend to find them in email containers. If we’re looking at an internet-based investigation, we tend to find our answers in the internet history, or the index.dat or the internet cache files. So very much, it’s a case of we do actually undertake a discovery process, even in investigations, because we do indeed go after that low-hanging fruit. I just wanted to add that before we moved on.
Paul: Well, thanks, Ady. Thanks for that. Okay, so moving on then. Over to Ady, and [hopefully you’ve got control].
Ady: Okay, thanks, Paul. So we’re going to look at three very distinct areas that are very popular and used in discovery tools these days. Some of you may have heard of some of these and some may not. But they are essentially… the first one we’re going to discuss is [near-duplication]. The second is one of [indecipherable] identities, and the third, we’re going to wrap it up with the visual analytics, and specifically looking at the Nuix visual analytics application.
So we’re going to tackle this first one head-on – near-duplication. What exactly is a near-duplicate? Firstly, we need to define a duplicate. I’m sure many people on this call fully understand the process of de-duplication and how we go about that. But just to clarify – it’s one of the most important workflows in any eDiscovery, and certainly also in digital forensics. It’s the ability to identify and filter – and filter either in or out – duplicate items and duplicate documents. So in discovery, typically, both lawyers and investigators only really want to see a single copy of a document or an email during an investigation or a review. Also, from a perspective of provenance, they may need to know of the existence of other copies. It’s important to know where the duplicates are, even though we’re not actually looking at them. But for them to undertake a review of the data set, they really just want to see one item. We can draw parallels to this in our investigations. On many cases that I’ve worked on, I’ve been required to view and assess inappropriate and illegal material, which, to put it as nicely as I can, is a most unpleasant task. So to decrease the impact of this, we use MD5 to de-duplicate, and so we’re only really reviewing a single instance of each one of those images or media files.
So as I’m sure this audience all knows, the most common method of identifying duplicates is to perform a cryptographic hash on the [contents] of each file. Commonly used are MD5, and also, the [indecipherable] algorithm is used in this process too.
So by comparing a file’s hash value to another file’s hash value, this allows the system and ultimately the investigator to identify them as duplicates. This works very, very well for exact binary duplicates, and a lot of our workflow is around this process. But obviously it does not allow for instances where documents or emails or portions of text are visually identical or similar, i.e. they contain the same text though they’re not cryptographically identical. So let’s look at this just a little bit deeper here.
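The exact-match approach described here can be sketched in a few lines of Python. This is only a minimal illustration of hash-based de-duplication in general, not how Nuix implements it; the file names and byte contents are invented for the example, and `md5_of` and `group_exact_duplicates` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict


def md5_of(content: bytes) -> str:
    """Return the MD5 digest of a file's raw bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def group_exact_duplicates(items: dict[str, bytes]) -> list[list[str]]:
    """Group item names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name, content in items.items():
        groups[md5_of(content)].append(name)
    # Only groups with more than one member are duplicates of each other.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Invented evidence items: two identical files, plus the same text
# wrapped in a different (made-up) container format.
evidence = {
    "laptop/report.docx": b"DOCX-container:Quarterly report",
    "usb/report_copy.docx": b"DOCX-container:Quarterly report",
    "usb/report.pdf": b"PDF-container:Quarterly report",
}

duplicate_groups = group_exact_duplicates(evidence)
print(duplicate_groups)
```

The two byte-identical files are grouped together, while the third item, which carries the same text in different bytes, gets a different hash and is missed. That is the limitation discussed next.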
Okay, so on the face of it, the two documents that we see here appear identical. They look the same, they read the same, the formatting is the same. In other words, in essence, if I was to print these two documents and put them side by side, they would in fact be identical. But as we can see, one’s been created in the Microsoft Word application, and one’s been created in Adobe. So let’s have a look at the binary level of these items. And we can see now there’s a very distinct difference between the two items that we have.
So in an investigation, we would have to review both of these, because the MD5 algorithm that we’re using to [de-duplicate] documents simply wouldn’t match these two documents, even though contextually, they are the same. However, it’s worth noting that in some investigations, we do want to find duplicates, and that could be crucial: for example, finding the existence of textually duplicate documents in a different format. And I think I’m just going to invite Paul – because I was speaking to Paul the other day, and he certainly gave a very good example of this.
Paul, can you add to… a use case where this has been useful?
Paul: Yeah, sure. One of the things [indecipherable] a few years ago was an investigation related to the alleged theft of company IP, so intellectual property. The company suspected that one of their critical documents had been copied and was being provided to a third-party competitor. So there was obviously an investigation as part of all of this. And we went down the traditional route of looking for the documents and looking for the documents across the various components of data that the client had. And we didn’t find it. But what we actually noticed was that the suspect had previously installed a PDF-creator onto his machine. So we used this kind of… this similar document technology to look for things that looked similar, and we were able to see, using the context of the text, that there was actually a document that had been PDF’d which was the critical document, and we could show that it had been copied across on to an external device. The point being there that we wouldn’t have found this using normal forensics, because it wasn’t the same document, in effect, albeit it had the same contents.
Ady: Yeah, thanks, Paul. And I think that the need to find duplicates in this manner, at a contextual level rather than a binary level, becomes ever more important. We’ve just got to look at how we live our lives these days – and at the moment I’m travelling, so I’m sitting in front of a computer, I’ve got an iPhone, I’ve got an iPad, I’ve got it all backed up to a cloud service. [Of course I’ve got] a computer sitting at home. If I send a single email right now, that email is then present on at least five or six devices, and probably more. And if I was to come under investigation myself, then, following the forensic methodology, all of the devices would be collected. And there would be many, many duplicates, simply because of the way we live and the mobile world that we live in.
Because of that, really, the need to find contextual duplicates becomes ever more important to us as investigators.
I think I’ve gone one slide too far ahead there.
My apologies for that. I was a bit haphazard on my mouse-pad.
So this is where near-duplicate technology, found in such tools as Nuix, can really start to help us. Nuix technology gets around this problem by comparing the textual content of documents, and it does this by extracting the text and using the all-familiar MD5 hash, but applying it many times over, to phrases of around five words that are found within the documents. This technology is actually called w-shingling, or shingles for short. You may have heard it called that.
Now, I don’t profess to understand the mathematics, because I don’t have a math degree. But in essence the algorithm first identifies and extracts all of the text in each file or each item that we come to. It then removes superfluous characters, leaving letters and numbers [indecipherable] are then normalized and converted into a single case… or into lower case. And then it splits this text into tokens, or overlapping words, to build a shingle list.
And on the next slide, I’ll be able to give you a graphical example of this. [Shingle] does not attempt to understand the meaning of the text, so there’s no sort of natural language processing. It’s purely agnostic in relation to the language that’s being used. So it does work across languages as well. So in essence, it works to be able to match the context of documents, regardless of any format – whether it’s a PDF file, an email, a .doc, or a .docx file. We’re not really interested in [indecipherable] any more, we’re more interested in looking at the text level.
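The steps just described (extract the text, strip everything but letters and numbers, lower-case it, then split it into overlapping word runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Nuix implementation: the function names `normalize` and `shingles` are hypothetical, the text is assumed to be already extracted from its container, and the default of five words per shingle follows the "around five words" figure mentioned in the talk.

```python
import re


def normalize(text: str) -> list[str]:
    """Lower-case the text and keep only runs of letters and digits."""
    # [^\W_]+ matches Unicode letters and digits, so no language is assumed.
    return re.findall(r"[^\W_]+", text.lower())


def shingles(text: str, w: int = 5) -> list[tuple[str, ...]]:
    """Return every overlapping run of w consecutive tokens."""
    tokens = normalize(text)
    if len(tokens) < w:
        # Text shorter than one shingle: treat it as a single short shingle.
        return [tuple(tokens)] if tokens else []
    return [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]


from_word = shingles("The quick, brown fox JUMPS over the lazy dog.")
from_pdf = shingles("the quick brown fox jumps over   the lazy dog")

print(len(from_word))   # nine tokens give five 5-word shingles
print(from_word[0])
```

Because punctuation, case, and spacing are removed before shingling, the two differently formatted inputs yield the same shingle list, which is what lets the same text be matched across file formats.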
So once we’ve built this [shingle list with a set of documents], we can query the index to find documents that contain the same or even similar textual content. To do this, we use an algorithm. It’s very well published, and it’s easy to research. It’s called the Jaccard similarity algorithm, which is a statistical method for comparing the similarity and diversity of sample sets. It’s used very much so in a lot of the search engines, and [indecipherable] been produced. It’s also used in academia to identify areas of plagiarism.
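For reference, the Jaccard measure is the size of the intersection of two sets divided by the size of their union. A minimal sketch, using placeholder shingle labels rather than real hash values:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Placeholder shingle sets for two documents (labels are illustrative only).
doc_a = {"s1", "s2", "s3", "s4"}
doc_b = {"s2", "s3", "s4", "s5"}

score = jaccard(doc_a, doc_b)
print(score)  # 3 shared shingles out of 5 distinct ones
```

A score of 1.0 means the two shingle sets are identical, and 0.0 means they share nothing.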
Today, the method is used, as I said, in a lot of disciplines, including biology, genealogy, mathematics, computer science, and now, to our benefit, we can add to that list digital forensics. So by way of example, just to make sure that we’re all understanding this, let’s use the phrase “Sometimes we can compare apples to oranges.”
This slide goes some way towards showing this from a graphical perspective. I’ve already said that Nuix uses five tokens. But for the purpose of this graphic, we’re going to reduce that to two. So we’ve taken that phrase, and now we’ve split it across… tokenized it in portions of two. And now we can see that we’ve got six segments of that sentence. So what we will do – what the algorithm does – is produce an MD5 value for each of those six segments. So now, instead of having a single MD5 for each item or each document, we have a list of MD5 values, which we call a shingle list, for each of the documents.
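The slide's example can be reproduced directly. The sketch below splits the phrase into overlapping two-word segments and hashes each one with MD5; how the real product joins and encodes the words before hashing is not stated in the talk, so the single-space join and UTF-8 encoding here are assumptions.

```python
import hashlib

phrase = "Sometimes we can compare apples to oranges"
words = phrase.lower().split()

# Overlapping two-word segments: seven words give six segments.
segments = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

# One MD5 value per segment; together they form the shingle list.
shingle_list = [hashlib.md5(s.encode("utf-8")).hexdigest() for s in segments]

for digest, segment in zip(shingle_list, segments):
    print(digest, segment)
```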
So we can very, very quickly see – what we’re looking at on this screen is the Nuix application – we have… part of the tab that we offer is the Differential tab. So we can use this technology and put these two documents side by side. Now we can see that these are two identical documents, and one is, as we’ve already mentioned, [inaudible] PDF file, but in the Nuix environment, these are identical documents, given this algorithm and this technology that we’re using.
One of the most powerful things about shingles is that we can use this technology not only to find items that are the same as each other, but items that are similar to each other. So what we’re essentially doing is broadening the algorithm and saying that we accept fewer matches within our MD5 shingle list, to be able to [broaden, to find] items that are similar to each other, based upon their textual value.
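Accepting fewer matches amounts to lowering a similarity threshold. The sketch below shows the idea on three invented snippets; the three-word shingle size, the threshold values, and the names `shingle_set`, `similarity`, and `near_duplicates` are all assumptions made for illustration and do not reflect Nuix settings.

```python
import re
from itertools import combinations


def shingle_set(text: str, w: int = 3) -> set[str]:
    """Set of overlapping w-word shingles from normalized text."""
    tokens = re.findall(r"[^\W_]+", text.lower())
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingle_set(a), shingle_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def near_duplicates(docs: dict[str, str], threshold: float) -> list[tuple[str, str]]:
    """Pairs of document names whose similarity meets the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if similarity(docs[x], docs[y]) >= threshold
    ]


docs = {
    "v1": "The supplier shall deliver the goods within thirty days of the order date",
    "v2": "The supplier shall deliver the goods within sixty days of the order date",
    "memo": "Please remember to book your place at the Christmas party",
}

print(near_duplicates(docs, threshold=1.0))  # textually identical only
print(near_duplicates(docs, threshold=0.5))  # broadened to similar items
```

With the threshold at 1.0 nothing matches, because one word differs between the two contract versions. Lowering it to 0.5 links them (they share 8 of 14 distinct shingles), while the unrelated memo stays out.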
This can really offer us some value when we’re in our investigations. For example, if we can identify similar documents found across different devices, this could help us to prove that an original Word document was printed to a PDF, just as in Paul’s scenario. Or even emailed out of the company.
It also shows how a document has evolved over time. I’ve been involved in my time in a number of cases involving contracts – employment contracts and agreement contracts. Well, contracts, by the nature of how they’re produced, produce many different iterations of a single document, crossing lots of different formats. [So it’ll] go through many, many versions, gets amended by different people, maybe even converted between lots of formats. And using shingles, we can quickly identify all versions of the document, and perhaps use visual analytics, which I’ll talk about a little bit later in the session, to show this on a timeline and how a document has evolved. But our first step is to actually cluster them together and bring these documents together with our application, so that when we do have to review one of these items, we can immediately see that it has a related item [count], based upon this algorithm.
Perhaps the most important one, for me, is to help to link file fragments. [This is certainly exciting], because obviously, Paul and I come from that forensic background. So we’re used to dealing with artifacts that we pull out of unallocated space, artifacts that we’ve pulled out of system files, that prove our evidential point. And quite often, they’re very difficult to deal with. So this technology also allows us to start linking items that we’ve found in transient and [volatile] areas of the hard drive, and start linking them to areas that are allocated in allocated space. And I have a slide a little bit later on, before we wrap this part of the session up, just to sort of demonstrate how Nuix manages that.
So these are two similar documents. We can see that Nuix has now presented to us the documents again side by side, but [in this time], it’s highlighted portions of the text for us. These aren’t duplicates – these are near-duplicates. The highlighted portions basically are the Nuix technology showing us where the differences are. The portions [indecipherable] are the same, there is no highlighting at all, that is clear text. And the portions of the documents that are similar are highlighted in blue for us, on this example. And the portions of documents that differ completely are highlighted in green.
So very, very quickly, we can start taking that contract investigation to a whole new level. Because we can put these documents side by side and start seeing the evolution of any particular document.
And as I said, we can also use shingles to cluster groups of items together. And this is just as important when we’re clustering them in or filtering them in as evidence, or filtering them out, [to remove] some types of emails and documents. For example – a great example that is very, very common in eDiscovery and large email investigations is to find [indecipherable] emails, such as market forecasts and weekly forecasts, that really [don’t matter] to the investigation, to pull those items together. And because they’re forwarded on to people and replied to, the iterations change as they move through a corporation.
A typical one is [arranging the] Christmas party – even in a modest company, where you have 100-150 employees, arranging the Christmas party could generate anything up to 500-600 emails on that subject. So to bring them all back together again, and actually filter them out, helps to de-clutter our investigation.
One of the by-products of having these lists of shingles is the way that we can actually access them within our application to make our keywords much more intelligent, and the way we approach our keywords. I’d like to just call on a [use case] that I was involved in a few years ago, and for those of you who have met me or heard me speak, [inaudible] story. But I was asked to carry out an investigation on a computer. And the owner of the computer, the suspect, was named Mouse. So I was asked to use the word “Mouse” as a keyword and of course draw evidence from the results of that keyword.
Now, it was indeed a Microsoft-based, Windows-based computer, and so the keyword of “Mouse” really did give me a lot of false positives. What I was actually looking for was almost impossible to find in the results that I was working with or that I’d been asked to produce. And working with false positives is something that we tend to get very used to as investigators, but this technology can help us to actually move away from that, to be able to [indecipherable] the database about the context of our keywords before we actually take action on it.
This next slide shows a very small data set; it’s much smaller than the one that I was involved in. But it does illustrate this quite well. So we can now see that I can ask Nuix to show me all of the shingles – in other words, all of the words in the context that surrounds them – and I can apply my keyword. In this case, I’ve applied the keyword “mouse” because it’s topical to the example. And now I can see very clearly where “mouse” appears in my data set. But I can also see where I can exclude it very, very quickly. Because I can move on from this screen, and I can run through this very short list here and see that only two of the items within my data actually refer to something that isn’t a computer mouse or that isn’t something that’s generated by or refers to a computer.
So this is how we can use Nuix, and use this shingling technology, to be able to test our keywords and test for relevance, and try [indecipherable] to exclude those false positives.
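[Editor's illustration] The keyword test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nuix's implementation: the function names, the shingle width of three words and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def word_shingles(text, width=3):
    """Return every run of `width` consecutive words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + width]) for i in range(len(words) - width + 1)]

def shingles_containing(texts, keyword, width=3):
    """Count the shingles, across all texts, that contain the keyword."""
    counts = Counter()
    for text in texts:
        for shingle in word_shingles(text, width):
            if keyword in shingle:
                counts[" ".join(shingle)] += 1
    return counts

# Hypothetical sample data: two computer-related hits and one person's name.
documents = [
    "Click the left mouse button to continue",
    "Move the mouse pointer over the icon",
    "I met Mr Mouse at the station yesterday",
]

contexts = shingles_containing(documents, "mouse")
for phrase, count in contexts.most_common():
    print(count, phrase)
```

Reading down the printed phrases, contexts such as "left mouse button" can be ruled out at a glance, leaving phrases such as "met mr mouse" for closer review.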
Finally, before we move on from near-duplicates and this part of the discussion – I did mention how we can start taking items in clusters from [unallocated] space and use this technology again to be able to link them to [errors] that we do know about in allocated space. And the significance of this is that we can now… Nuix processes [indecipherable] [unallocated items] when we process the data. So we can quickly zoom into a potential [indecipherable]. Rather than traditional methods of having to manually review all of these [transient]… and now, quite frankly, very large areas of data repositories, we can now start making some [movements into the…] how we can review the data.
So what we look at on the screen now – hopefully you can see it – is I’ve highlighted an item. This item has been recovered from unallocated space. And I’ve highlighted in yellow the actual, logical location where this item was carved from. But I’ve also circled, in the review screen there, that Nuix is telling me, although I’m looking at this item, there is an item within my database that has a contextually similar value. And the screen that’s popped out is actually the matching item within that database. And we can see from the filepath that this file I’ve recovered from unallocated space is actually [a pure] identical to the one that’s been found and can be accounted for in allocated space.
And these are great examples about how we can start using this technology to really streamline the way that we have to review the data, and how we review our [indecipherable], and potentially trying to reduce the amount of false positives. Paul’s already talked about finding the needle in the haystack. Our job, with eDiscovery and using this technology [makes] that haystack a lot, lot smaller, and so therefore quicker to find the needle, and hopefully makes our job a lot quicker with these investigations.
I’m going to hand over to Paul, but before I do so, I’m just going to check the questions [screen] to see if we got any questions on the…
Paul: We do have one, Ady. One which might be interesting to you. One question is: Isn’t the use of black box algorithms quite controversial in eDiscovery? Isn’t this what we were talking about [indecipherable]
Ady: Yeah, that’s a great question. Yeah, absolutely. I would concur that it is [at the moment] a bit of controversy. However, near-duplicates and the algorithm, the shingling algorithm that you use, I wouldn’t define as being black box technology. As I said, it’s a very open-source technology that’s openly used, and it’s agreed as a scientific method of being able to use this Jaccard algorithm to find similar items. And it’s actually very, very easy to understand. If you actually google w-shingling, you’ll be able to see that the actual algorithm itself is only about ten lines long, and it’s actually quite easy to get your head around. So there’s no black box kind of phenomenon around that part of the technology. [crosstalk]
Ady: … thanks very much for asking it. I’m going to hand over to Paul, who’s going to take us through for the next article.
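[Editor's illustration] To show how short the w-shingling and Jaccard calculation mentioned in this answer can be, here is a minimal Python sketch. It is a generic textbook version, not the code Nuix ships; the shingle width of three words and the sample sentences are assumptions.

```python
def shingle_set(text, w=3):
    """Build the set of w-word shingles for a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical sample texts: a message, a near-duplicate, and an unrelated one.
original = "please confirm the venue for the christmas party on friday"
reply = "please confirm the venue for the christmas party on monday"
unrelated = "quarterly invoice totals are attached for your review"

sim_reply = jaccard(shingle_set(original), shingle_set(reply))
sim_unrelated = jaccard(shingle_set(original), shingle_set(unrelated))
print(round(sim_reply, 2), round(sim_unrelated, 2))
```

A value close to 1 means the two items share almost all of their shingles; a value of 0 means they share none. A near-duplicate threshold is then simply a cut-off on that score.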
Paul: Okay, thanks, Ady. Let’s hope that I’ve got control back.
Okay, so thanks, everyone. The next thing I want to talk to you about is named entities. The second technique which [we’ll spend the next five to seven minutes about]… I’m sure many of you on the call are familiar with regular expressions, which, within digital forensics, is a method for looking for certain types of information within an evidence base, by searching using a pattern structure. So for example, if I ask you to describe a credit card number, I’m sure that many of you on the call will be able to say, “Well, it’s four lots of numbers, sometimes separated by spaces, and sometimes separated by a dash.” A regular expression converts this English interpretation into something that the computer can understand. So, for example, we might represent that by a number, a number, a number, a number, a dash or a space, followed by a number, a number, a number, a number, dash or a space, etc etc.
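[Editor's illustration] The spoken description above, four digits, then a dash or a space, repeated four times, translates into a regular expression along these lines. This Python sketch assumes 16-digit numbers in groups of four with optional separators; real card formats vary, and the sample text is invented.

```python
import re

# Four groups of four digits, each optionally separated by a space or a dash.
CARD_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")

sample = (
    "Paid with 4111 1111 1111 1111 on Monday; "
    "ref 1234-5678-9012-3456, tel 020 7946 0000."
)

matches = CARD_PATTERN.findall(sample)
print(matches)
```

The pattern only checks the shape of the text, so it will also pick up any other 16-digit sequence laid out the same way.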
So at its simplest level, named entities are basically Nuix’s interpretation of these regular expressions. And interestingly, because we have already indexed the file contents and its metadata, as part of our processing workflow, we can automatically and intelligently identify and filter these named entities at the beginning of the investigation. And these can include things such as company names, credit cards, emails or IP addresses, monetary values or information relating to people, such as passports and IDs. And I think this allows the investigator to quickly focus and zoom in on potentially relevant material based on the various classifications of entities that we can use.
So as I said, named entities, within Nuix, follow the standard, regular expression structure. And I’ve got a slide in a minute just to give you an example of what that might look like. But named entities can also provide a powerful source of intelligence, allowing the investigator to pre-load an investigation with specific data types to automatically bring these to the surface for analysis. And because, with Nuix, we are able to process lots of different sources of evidence at the same time within the single pane of glass approach, as we talked about previously, we can also rapidly cross-reference this intelligence across the whole data set in order to reveal relationships between people, places and devices, in a far easier way, and perhaps a far more robust way than when done by a normal human being during an investigation, perhaps using some third-party tools or plug-ins.
And we can even build automated workflows around this. So for example, perhaps you want to always identify any email in a data set that contains references to credit card numbers. Perhaps we want to further filter this down to include all of those emails that were sent externally to our organization. Perhaps we want to look at one that was sent to a country known to be perhaps on the hot-list of fraud or corruption. With named entities, we can do this incredibly quickly, and we can even save these as a dynamic search to allow us to do this time and time again.
So, just summarizing that, within Nuix, we extract intelligence from all the items whilst the data is being processed, and at the same time it’s being indexed. This allows investigators to quickly see what is potentially important in the data set. Because named entities follow the standard [reg-ex] structure, it’s easy for investigators to build their own [indecipherable] into the tool. On the slide here, you can just see a quick example of what the syntax would look like for a named entity to identify monetary values.
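[Editor's illustration] The slide itself is not reproduced in this transcript, so the syntax Nuix uses is not shown here. As a rough idea of what a monetary-value expression can look like, the following Python sketch matches a dollar, Euro or pound sign followed by an amount. The pattern and the sample text are assumptions, not the contents of the Nuix library file.

```python
import re

# A currency symbol, optional whitespace, digits with optional thousands
# separators, and an optional two-digit decimal part.
MONEY_PATTERN = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")

sample = "Invoice for $1,250.00 and €300, deposit £45.50 due Friday."
amounts = MONEY_PATTERN.findall(sample)
print(amounts)
```

Supporting another currency in a sketch like this is a matter of adding its symbol to the character class at the start of the pattern.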
The final slide [indecipherable], to finish off, just a brief conversation around named entities. This is what Nuix looks like when we open up and we look at entities within Nuix. It allows us to group them together by type, to help structure our workflow. So for example, we can keep together similar types of entities, such as those relating to locations, for example, or social media, or even entities relating to identifying passwords. And Nuix provides users with a number of different ways of viewing and interpreting these named entities. We can, for example, quickly filter on credit cards to show all the credit card numbers that we’ve found within the data set, which as I said previously, can be incredibly useful to quickly [gain] intelligence. For example, you may be investigating a particular type of crime, and you simply want to get a list of all the credit card or account numbers that somebody was in possession of.
Or, I should have said earlier, that’s one of the things that we also do, to help with this [indecipherable] that named entities can [indecipherable] is that when we look at things like credit cards, we [value] the intelligence [indecipherable], so what we actually do is take all the entities that look like credit cards, and we pass them through an algorithm called the [indecipherable], which is used by many of the banks already to validate that a credit card is actually genuine. So what this actually means is that when we provide you the results back, we actually provide actual credit card numbers, and not just things that follow a similar pattern, such as a sequence of 1111, for example. And the second thing you can do within this is actually you can choose to see the list of files that are responsive to the named entities. So for example, you can drill down into the documents there, or the emails, to see which ones need further analysis. And again, from a workflow perspective, this is incredibly powerful, as it can allow you [for example to find] all the emails that contain references to, say, a personal ID or monetary values, [bundle these] together into a review package, and perhaps pass this on to the financial crime team to investigate further, or in a corporate environment maybe pass these on to compliance for their attention.
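[Editor's illustration] The name of the validation algorithm is inaudible in the recording. One widely used checksum for payment card numbers is the Luhn algorithm, and the Python sketch below assumes that kind of check purely for illustration; it is not confirmed as the one Nuix applies.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# A well-known test number passes; a run of ones with the same shape does not.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("1111 1111 1111 1111"))
```

A checksum like this separates strings that merely look like card numbers from strings that could actually be one, which is the distinction drawn above.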
That [kind of covers that up, really]. Ady, are there any comments about that before we move on?
Ady: Yeah, there is a question that’s just come in, and I’ll just pull up the question screen here. The question is – and it’s a fair question: Does the use of named entities not slow down the indexing? As I said, it’s a good question. Nothing in this world is free. So if we decide that we are going to switch on the function of extracting named entities, we are indeed doing more work with the source data when we process it. So I would say it has an impact, but it doesn’t necessarily slow it down. But also, it brings on to quite a good discussion about how Nuix can refresh its data sets. So even if it’s not switched on in the initial part of processing, so that we can push the data through to the application as fast as possible, we can come back and refresh this function, by isolating certain areas or certain file types.
So as I said, it’s a great question, but [indecipherable] multiple answer to it, in the way that we deal with the named entities.
Ady: There is another question actually – how does the company regular expression work? Does it look for [limited PLC] etc? And I can see how the other expressions could work, but I’m not sure how the company one would be structured.
It’s a good question. If the person who’s asked that question has actually got the application Nuix, these can be found in the library, and the actual construction of the [regular expression] files, it’s called a [regxp] file. But yes, essentially, it does look for [LTD] and [PLC], and allows us to use it to go into this file and amend it. And the one that’s commonly amended is the monetary values. Because we only… out of the box, Nuix locates dollar signs, Euro signs, and of course the pound sign. But of course there are many more monetary values that need to be accounted for. So that one does tend to be changed quite a lot by our users. But it is a completely open library. So if there’s any concern about what’s being used or which regular expression, and how that’s being used, then that is freely available within the program files within a very simple file path, to be able to go and have a look at that, and of course make your own changes. [crosstalk]
Paul: I was just going to say – and there are obviously lots of websites out there on the internet which do provide these libraries as well, so again, people can use those websites to actually build their own libraries of [named expressions] and drop them into Nuix to use…
Ady: I do think that the most empowering aspect of this is the fact that the library is open to the user, and a very, very quick [indecipherable] in my experience, a couple of years ago, I was speaking with a major airline, we were putting in an installation, in Europe, and they wanted to have the airport codes and the airline tail numbers extracted from every piece of data that was actually brought into the investigation. To them, that was a most important entity, because of the nature of the business. So, similar case of accessing the file, [quoting] the regular expression file behind it, so it was available just as you see on the screen that Paul’s showing right now. So it is, it’s very powerful in the way that we can allow the users to interact with it.
Paul: Thanks, Ady.
Ady: There is just one more question, and that’s a very, very good question, and I just want to cover this off before we move on, Paul, if I may. It’s [indecipherable] question that I… my fault, I missed it in relation to near-duplicates. And the question was: Can you transfer data that tells us what the near-duplicates are into another review platform? And the [requester] has mentioned Relativity here. [indecipherable] despite this near-duplicate [indecipherable] [is most important to] Nuix in this new environment.
It’s a good question, and the answer is yes. We very much take the near-duplicate value, but we create these values, we create them as meta-data fields. And we also collate the pivot value, the pivot item, and also the similarity index marking – so basically, how closely similar it is to the pivot. So we can actually… because [if we push these] into meta-data fields, when we’re opening them up [in our Relativity] desktop client, or our meta-data mapping client, whichever process we’re using, then we can match these fields in, and then make them relational fields in the [subject of review] platform. And I’ve worked with this with a few customers, and it actually works very, very well. It just adds another relation field so that the items are clustered within the item. So thank you very much for that question.
Okay, I think we’re good to move on, are we, Paul?
Paul: Yeah. We can pick up anybody’s questions [indecipherable] time.
Ady: I’m aware of the time, and I just wanted to… there was one more subject that we want to cover off. And that’s visual analytics. Those of you who are using Nuix and have seen Nuix before may or may not be aware that there is a visual analytic program that’s available. Just a few brief words about visualization. What visualization can do – it can obviously clearly help us to see what’s in our data, from a visual representation. Equally, and sometimes more importantly, it helps us to show what’s actually missing from our data. So we can start… we can have trends of data, and it doesn’t actually restrict itself to email trends. I’ve got a couple of slides where we can see where credit card numbers and monetary values start coming in and out of the data set.
So using techniques such as network diagrams and timelines, investigators can very, very quickly see connections and flows of information between suspects and custodians, between different data types, between different file formats. This can very, very quickly narrow down data sources, and allow [indecipherable] to get into greater depth far more quickly. Also, as I said, it may reveal information that… of gaps in our evidence, which may indeed itself warrant further investigation.
When we open the visual analytics platform, we actually come across this dashboard, which is highlighted in the screen in front of us. So we can immediately come here after processing, before we actually start the investigation, to answer some very basic questions, some very useful questions, through a visual representation. For example, we may want to know how much data there is. Now, when I started in digital forensics, 15-16 years ago, there was actually quite a simple answer to this question. If I put a 512 MB hard drive through [indecipherable] or my application program, then I pretty much know that that’s the size of my data that I’m dealing with. But today, we’ve already alluded to the fact that there’s cloud repositories, there’s multiple devices in even the simplest of investigations. So the answer to that – how much data – is actually quite a difficult one to answer. So we can do it the visual aspect, and actually sort of make those counts immediately to the investigator.
And certainly, in eDiscovery, “how much data?” is usually the first question that’s being asked. What data types – what type of file formats have we got, have we got a lot of document files, have we got a lot of emails, have we got a lot of internet cache, have we got a lot of specific types of file types that are important to the investigation. And again, it goes back to the discussion is once we’ve identified that we have large amounts of a single file type, we can make the decision, are they important to my investigation? Because if they’re not, I can suppress them. I can make my haystack much, much smaller from a visual representation.
And what forensic artifacts exist? What are the user habits? I’ve used this one to great effect in my time, about showing the habit of offending, of an offender, on a little timeline graphical representation. This is extremely powerful. It’s not actually evidence in itself, but it’s evidence of offending, which is a whole new aspect that we can start [inaudible].
And where should I focus my investigation time, immediately, knowing how much data there is, [the baseline] that it crosses, and the make-up of the data? It’s rather like walking into a crime scene and being able to make an immediate assessment of what you have to deal with, and making sure that you’re targeting the right areas before you go head-on into that investigation.
So I’m a real big fan of visualization, as you may gather, and I use it on every case. When we were putting this webinar together, I was thinking of [use] cases for near-duplicates and [use] cases for [extracting entity]. But thinking of a [use] case for visualization – I eventually use it on everything I get involved with these days, because it literally lets the data talk to me, it tells me a story, without me having to make any kind of investigation, keyword searching, or enquiries into the data set. It gives me a very good and very powerful picture.
So what we’re looking at at the moment is a screenshot of how we look at email gap analysis. It may be important that these emails have dropped during these years, or it may be unimportant. And [it could be] – we’re looking at emails right now – but it could be credit card enumerations, it could be web cache, it could be somebody signing on and signing off of the computer. So we can very quickly start using these graphical representations to get a much, much clearer and bigger picture.
One of the most exciting things about visualization, from a Nuix perspective, is that it’s not just a picture that paints a thousand words. It’s also interactive, it can be part of our workflow. And the reason why I’ve put this screen here is so that you can sort of see the right-click menu within the visual representation. So once you’ve found your data, you can actually send that back to the workbench, because Nuix visual analytics works alongside the workbench itself. And this is where it’s all-empowering, because you can bring visualization into your workflow when and if you need to. [indecipherable] a very quick example – if I’ve done a keyword search and I’ve got 2000 hits, my choice is to look at them in a linear manner, or I could throw them into a visualization and get a much bigger picture and a much broader aspect about what my keyword searches and my results are actually reporting to me. And then I can perhaps interact with that and throw that back to the workbench to continue my workflow. So it’s the fact that it’s interactive that really does make it a more exciting product.
The next screenshot here shows the extracted entities. We’ve already discussed extracted entities and how they can be useful. We can visualize those extracted entities. What we’re looking at at the moment is a data set that has credit card and monetary values in it. However, there are no instances between April and July. And this is to demonstrate the importance of gap analysis – why are there no instances between this time? Was it because the computer was broken, it was in a computer repair shop? Was it because the suspect was on holiday? Or was it simply because the internet was down or… some other reasons. So we can start to very, very quickly see our results, based upon timeline analysis.
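[Editor's illustration] The gap analysis described here comes down to counting hits per time period and looking for periods with none. The Python sketch below shows that idea with monthly buckets; the function name and the dates are invented for the example and have nothing to do with the data set on the slide.

```python
from collections import Counter
from datetime import date

def months_without_hits(dates):
    """List the (year, month) buckets between the first and last hit that are empty."""
    if not dates:
        return []
    counts = Counter((d.year, d.month) for d in dates)
    year, month = min(counts)
    last = max(counts)
    gaps = []
    while (year, month) <= last:
        if (year, month) not in counts:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps

# Hypothetical dates on which credit card entities were found.
hits = [
    date(2014, 1, 14), date(2014, 2, 3), date(2014, 3, 28),
    date(2014, 8, 5), date(2014, 9, 19),
]

gaps = months_without_hits(hits)
print(gaps)
```

The output lists the empty months, which is the cue for the questions raised above about what was happening during that period.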
I’ve already mentioned a couple of times during this last hour that we now live in a mobile world, we all have handheld devices, so geo-tagging becomes very, very prevalent in our lives, and should be also equally as prevalent in our investigations. What we’re looking at on the screen here is a screenshot of the images that were located in an iPhone that’s been pushed into Nuix. I’m sure many of you [may, and some of you are] not aware that we can ingest a [indecipherable] file formats into our application. So we naturally start to inherit new meta-data, and quite exciting meta-data that’s into meta-data fields, and [indecipherable] fields, that actually can start to plot a person, or at least that device, in a specific place on the planet that we’re living.
And this can be all-empowering. One of the great by-products of this is that if you bring in certain [indecipherable] images off certain devices, you can start seeing how that device has communicated with the cell tower, because there’s a meta-data field that shows cell tower information. So it literally draws a plot of where that device has been on a map, as you’ll see right here.
Okay, so just to finally wrap this up – this just gives a visual sort of aspect to what I’ve been saying. We’ve carried out a keyword search, we’ve got this list of items, and we’re very used to dealing with these lists in our [other] applications. We go through them one at a time, we read the text, we look at the meta-data, and we move on to the next one. But the human brain is only really so… [cognitive as it can be dealing] with lists. At some point, it becomes a laborious task. And that point of it becoming a task [indecipherable] quite short amount of time. So it makes [indecipherable] to be able to say, “I’ve now got these emails, I’ve now got these SMS and BlackBerry messages, I now want to see a little bit of the bigger picture.” And the bigger picture is to be able to draw a graph, the bigger picture is to be able to see who’s talking to who, who’s our common communicator, and how many times have those communications taken place. So we can click on any of these links between these identities and [custodians], and immediately go back to those specific communication ones. And that now includes Skype, SMS, call records, and then all of the other communication devices that Nuix supports.
And just finally, we could throw this into an event map. So once we’ve finally found the data we’re looking for, we can plot it into a date and time. And this is ideal for seeing messages that have been forwarded on or blind copied through, so again, a visual representation is all-empowering, not just from an investigation perspective, but also report-writing. For many, many times – and I’m sure Paul has got many stories about how many times he’s stood in court and had to try and explain something – well, the visualization of that [indecipherable] the visualization of our process becomes all-empowering if we have a tool like this at our disposal.
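[Editor's illustration] The data behind a communication graph like the one described can be built by counting messages between each pair of people. This Python sketch is an assumption about how one might do that by hand; the names, the message list and the function names are invented, and it says nothing about how Nuix builds its own graph.

```python
from collections import Counter

def count_links(messages):
    """Count messages exchanged between each unordered pair of people."""
    links = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                links[frozenset((sender, recipient))] += 1
    return links

def common_communicators(links):
    """Total the message counts on every link that touches each person."""
    totals = Counter()
    for pair, count in links.items():
        for person in pair:
            totals[person] += count
    return totals

# Hypothetical messages as (sender, [recipients]), from any mix of sources.
messages = [
    ("alice", ["bob"]),
    ("bob", ["alice", "carol"]),
    ("carol", ["alice"]),
    ("alice", ["bob"]),
    ("dave", ["alice"]),
]

links = count_links(messages)
totals = common_communicators(links)
print(totals.most_common(1))
```

Each key in `links` is an edge in the graph with its message count as the weight, and the person with the highest total is the common communicator.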
Okay, I think that’s the last slide on visualization.
Ady: Any specific questions? I think, well, given the time, we’ll open that to any specific questions about anything.
Paul: Yeah. Ady, just one thing to add from me on that one. We have recently [or will be recently adding] support for [Oxygen] as well to the product, so that, I think, will cover all of the main mobile device tools, so again, we’re just increasing the capability of the product really, that…
Ady: Yeah, that’s great. Thanks for adding that in, Paul.
Paul: Okay, so – obviously, we are kind of getting pretty close on time. I just want to get to the Summary slide if we can.
Okay, so I can see the summary. So thanks very much for your time today, listening to Ady and I. We’ve hopefully shown you just three of many eDiscovery workflows that are built into Nuix that can help make digital investigations far more efficient. We’ve talked about near-duplicates, which are incredibly powerful [indecipherable] find similar documents, both within live file sets and file fragments. We’ve talked about [indecipherable] again, how we can use them to kind of narrow in some potentially important information within our data set. And of Ady’s… one of Ady’s favorite topics is visual analytics, and hopefully you’ve got a good understanding of how that can allow us to [do and allocate assessments] in effect, and to generate dynamic views of the data, helping to bring to life [normally very flat] data, into life really.
And just to be clear, for me really, these techniques are not really there to [dumb down] the value of the forensic investigation. It’s exactly the opposite. I honestly believe that by adopting tools and process and these workflows, within our investigations, we can actually improve the efficiency of them, and make so much better use of our available time, and then just perform digital forensics on all of the most important data, which of course helps us to find that ever-elusive needle in the digital haystack.
So thanks again very much, everybody, for your time today. We don’t really have time for questions I’m guessing, but as I said at the beginning of this session, there is a webinar forum on the Forensic Focus website, and we would obviously invite you to put any questions to us on that site, and our email addresses are obviously on the screen, [indecipherable] wanted to contact either Ady or me or perhaps you wanted a copy of the presentation. We’re more than happy to do that.
So that’s it from me, Ady. Any comments from you?
Ady: That’s a fantastic roundup, Paul. Thanks for that. And I just really wanted to add my thanks to everybody who’s listened to Paul and I talk for the last hour. It’s always great to share our experiences. We both come from the field of hardcore forensics, so we are still [indecipherable]. And what we’ve shown you today is just a couple of the tools, a couple of the functions, where eDiscovery and investigation start to meet each other and start to really show a benefit to each other and appreciation. So I thank you for your time and the opportunity to speak to you.
Paul: Thank you very much, everybody. Thank you.
End of Transcript