Scalability: A Big Headache


by Dominik Weber

About the Author

Dominik Weber is a Senior Software Architect for Guidance Software, Inc.

In this month's installment, I will take a break from a specific problem and talk about a fundamental issue with deep forensics: Scalability.

Scalability is simply the ability of our forensic tools and processes to perform on larger data sets. We have all witnessed the power of Moore's law. Hard drives are getting bigger and bigger. A 2 TB SATA hard drive can be had for well under $100. With massive storage space being the norm, operating systems and software leverage it more and more. For instance, my installation of Windows 7 with Office is ~50 GB. Browsers cache more data, and many temporary files are created. Since Windows Vista introduced the TxF layer for NTFS, transactional file systems are the norm, and the operating system keeps restore points, Volume Shadow Copies and previous versions. Furthermore, a lot of the old, deleted file data no longer gets overwritten.

This "wastefulness" is a boon to forensic investigators. Many more operating system and file system artifacts are being created. Data is spread across L1, L2 and L3 caches, RAM, flash storage, SSDs and hard drive caches. For instance, the thumbnail cache now stores data from many volumes, and Windows Search happily indexes a lot of user data, creating artifacts and allowing analysis of its data files.

That was the good news. The bad news is that most of this data is in more complex, new and evolving formats, requiring more developer effort to stay current. For instance, I am not aware of any forensic tool that analyzes Windows Search databases - not that I have had time to look (if you know of such a tool, please post in the forum topic - see below). Worse than that is the need to thoroughly analyze the data. Traditionally, the first step is to acquire the data to an evidence file (or a set thereof). The data must be read, hashed, compressed and possibly encrypted. All this takes time, despite the appearance of new multi-threaded and pipelined acquisition engines (for instance in EnCase V6.16) and the growing prevalence of high-speed hardware solutions. Luckily, this step is linear in time, meaning that acquiring a full 2 TB hard drive will take twice as long as a full 1 TB drive. Note that unwritten areas of hard drives are usually filled with the same byte pattern (generally 00 or FF); these areas compress very well, yielding faster acquisition rates.
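To make the acquisition step concrete, here is a minimal sketch in Python of a single-pass read, hash and compress loop. It is only an illustration of the idea, not how EnCase or any other tool implements acquisition; the function name `acquire`, the chunk size, and the choice of MD5 and zlib are my assumptions, and encryption and the evidence file format are left out entirely.

```python
import hashlib
import io
import os
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical block size, not taken from any tool


def acquire(source, chunk_size=CHUNK_SIZE):
    """Read a stream once; return (MD5 hex digest, bytes read, bytes after compression)."""
    digest = hashlib.md5()
    compressor = zlib.compressobj()
    bytes_in = 0
    bytes_out = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)                              # hash the raw data
        bytes_out += len(compressor.compress(chunk))      # count compressed output
        bytes_in += len(chunk)
    bytes_out += len(compressor.flush())                  # drain the compressor
    return digest.hexdigest(), bytes_in, bytes_out


if __name__ == "__main__":
    size = 4 * 1024 * 1024
    blank = io.BytesIO(bytes(size))        # 00-filled, like an unwritten area
    used = io.BytesIO(os.urandom(size))    # random bytes stand in for poorly compressible data
    for name, stream in (("blank", blank), ("used", used)):
        md5, n_in, n_out = acquire(stream)
        print(name, md5, n_in, n_out)
```

Each chunk is touched a fixed number of times, which is why the work grows linearly with the size of the drive. The 00-filled stream shrinks to a tiny fraction of its size, while the random stream does not shrink at all.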

The next step is the interpretation of the file systems by the forensic tool, and here we run into some issues. We want to detect and display deleted and overwritten files properly in their hierarchy, detect cross-linked files and compute unallocated space. This process is non-linear in both the number of files and the size of the drive. A lot of the processes involved are n-squared, meaning that doubling the input quadruples the processing time. For instance, if a 125 GB NTFS drive took 1 minute to process, a 2 TB drive (16 times the size) would take 256 minutes, or about 4.3 hours. In reality the figure is worse, because smaller data sets can be handled in memory without paging; the file system cache and large amounts of RAM really help there. With large data sets, the system starts paging and performance degrades, because the I/O subsystem must read from the evidence while also paging memory in and out (or accessing the database files). The forensic tool pays this price because it must analyze the file system as a whole, rather than having the luxury of being concerned with a single non-deleted file or directory at a time. That is why no such slowdown is noticeable on your own computer. Well, that and the fact that CPUs, graphics and disk I/O have become more powerful to handle the new load of the applications and operating system.
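The arithmetic behind that example can be checked with a few lines of Python. This is only a back-of-the-envelope model: it assumes run time grows as a pure power of the drive size and treats 2 TB as 2000 GB; the function name `projected_minutes` is made up for the illustration.

```python
def projected_minutes(base_minutes, base_gb, target_gb, exponent):
    """Scale a measured run time to a larger drive, assuming time ~ size ** exponent."""
    return base_minutes * (target_gb / base_gb) ** exponent


if __name__ == "__main__":
    # Figures from the text: a 125 GB drive takes 1 minute; the target is 2 TB.
    for label, exponent in (("linear", 1), ("n-squared", 2)):
        minutes = projected_minutes(1, 125, 2000, exponent)
        print(f"{label}: {minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

A linear process would need 16 minutes for the larger drive; an n-squared one needs 16 x 16 = 256 minutes, a little over four hours, before any paging penalty is counted.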

The next steps also take more time: carving, and detecting and mounting file structures (zip/cab files, Office documents, PST/DBX and similar). Here we run into similar issues, and this step adds more items to the whole data set.

Now we have a large number of files and a lot of data to sift through in order to find the data of interest. New and inexperienced investigators can quickly be overwhelmed by the data flood. So what can we do about this? There are some practices currently in use that help with this data flood at the expense of some computing / automated analysis time:

1) Hash analysis: All the files get hashed and these hashes get looked up in hash sets. This determines whether a file is part of a standard software package or operating system installation. There are also hash sets for malware and known child pornography images.

2) Indexing / search analysis: The data either gets searched raw with keywords that would hit on terms of interest (victim's name, credit card number etc.) or the whole text is indexed.

3) Custom analysis: Images can be carved out, and custom scripts, plug-ins and processes that analyze the data can be run; these can range from image finding and skin tone analysis to detection of encrypted data and email thread or web cache data mining.

4) Data reduction: Data sets can be reduced by filtering and other criteria, and the remaining data is put into its own reduced data set (logical evidence files).
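As an illustration of items 1) and 4), the following Python sketch hashes every file under a directory and sorts it into one of three buckets by looking the hash up in two sets. Everything here is an assumption made for the example: the names `file_md5` and `triage`, the bucket labels, and the idea that the hash sets are plain Python sets of MD5 hex strings. Real hash libraries come in their own formats and are far larger.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(root, known_good, known_bad):
    """Sort every file under root into 'ignore', 'alert' or 'review' by hash lookup."""
    result = {"ignore": [], "alert": [], "review": []}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        md5 = file_md5(path)
        if md5 in known_bad:          # e.g. a malware hash set
            result["alert"].append(path)
        elif md5 in known_good:       # e.g. a standard OS / software install
            result["ignore"].append(path)
        else:                         # unknown: left for the investigator
            result["review"].append(path)
    return result
```

Because a set lookup takes roughly constant time, this step stays close to linear in the amount of data hashed. The "review" bucket is the reduced data set; in practice it would be written out as a logical evidence file rather than kept as a list of paths.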

While these things do help with human processing time, they also increase the automated processing time. This time is usually spent up-front, leading to larger lag times. Unfortunately, most of these algorithms scale worse than linearly. Thus, the fundamental challenge in forensic tool design / programming is to use advanced algorithms that have better run times. Multi-threading, pipelining and better use of the hardware platform to optimize I/O, CPU and RAM throughput are needed. These techniques take much longer to design, implement and debug because they are much more complex, the data sets run against them are large and the results must be validated. This large effort is very hard to achieve in open-source projects, and even in commercial products it takes effort, commitment and larger architectural changes. This effort must be balanced against the need to release products and provide updates and new features.
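To show what "a better run time" can mean in practice, here is a small Python sketch that finds cross-linked files (files whose cluster runs overlap) in two ways. The data model is hypothetical: each run is a `(file name, start cluster, length)` tuple with a length of at least 1, and neither function is taken from any real forensic tool.

```python
from itertools import combinations


def overlaps_pairwise(runs):
    """Compare every run with every other run: O(n^2) comparisons."""
    found = set()
    for (name_a, start_a, len_a), (name_b, start_b, len_b) in combinations(runs, 2):
        if name_a == name_b:
            continue
        if start_a < start_b + len_b and start_b < start_a + len_a:
            found.add(frozenset((name_a, name_b)))
    return found


def overlaps_sorted(runs):
    """Sort by start cluster, then sweep: O(n log n) plus the overlaps reported."""
    found = set()
    active = []  # earlier runs that may still overlap later ones, as (end, name)
    for name, start, length in sorted(runs, key=lambda run: run[1]):
        # Drop runs that ended at or before the current start cluster.
        active = [(end, other) for end, other in active if end > start]
        # Whatever is left started earlier and has not ended yet: an overlap.
        for _, other in active:
            if other != name:
                found.add(frozenset((other, name)))
        active.append((start + length, name))
    return found


if __name__ == "__main__":
    runs = [("a.doc", 0, 10), ("b.jpg", 5, 10), ("c.txt", 20, 5), ("d.zip", 24, 3)]
    print(overlaps_pairwise(runs) == overlaps_sorted(runs))
```

Both functions return the same pairs, but on a healthy volume with few cross-links the sorted sweep does far less work as the number of runs grows. The price is exactly what the paragraph above describes: the faster version is harder to reason about and has to be validated against the simple one.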

So, dear reader, as a forensic developer, I am curious: what do you think about the current state of forensic tools? Are they adequate? What other things do you do in order to deal with the data volume? How should the "new feature" be balanced against the scalability of the existing process?

Click here to discuss this article.



Dominik Weber is a Senior Software Architect for Guidance Software, Inc. He has a Master's degree in Computer Science from the University of Karlsruhe, Germany, and worked for video game companies (Activision) and on computer animation / motion-capture projects (Jay Jay the Jet Plane) before joining Guidance Software in 2001. He can be reached at [email protected]

Guidance Software is recognized worldwide as the industry leader in digital investigative solutions. Its EnCase and Enterprise platforms provide the foundation for government, corporate and law enforcement organizations to conduct thorough, network-enabled, and court-validated computer investigations. Worldwide there are more than 30,000 licensed users and thousands attend its renowned training programs annually.