Hello again forum members. Does anyone have any direct experience of forensic examination of a Hadoop cluster and if so, would you be prepared to share some commentary on tools and methodology you have used in such an environment?
Thanks
What do you need to know? … there's a lot :) How big is the cluster, and what are you looking to find?
I don't have a specific scenario in mind - rather I'm interested in understanding general aspects of the methodology you would apply if investigating a Hadoop cluster.
But to get started, let's assume something like:
a single HDFS filesystem of 1 petabyte of data spread over ~1000 1 TB drives shared equally between ~100 data nodes;
for the MapReduce crunching, 100 multicore machines (8-16 cores each) with 32 GB of RAM each.
And as for what you are looking for - two scenarios:
1) Needle in a haystack: you are looking for potentially small amounts of evidence hidden somewhere in the filesystem with little clue as to where.
2) Assessment of scale: you have already uncovered problem material, have evidence to suggest there's a lot more, and need to discover how big a problem you have.
Interesting question. I certainly wouldn't image all of those drives. Best practices of working off a copy kind of go out the window when you get that large. :)
Keyword searching and hashing come to mind first. There are existing examples of content keyword searching and hashing for Hadoop/HDFS. You could also dump file metadata pretty easily.
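For example, dumping file metadata really is just a matter of walking the namespace through the FileSystem API. Here's a rough, untested sketch - the class name is just a placeholder, and it assumes the cluster's client configuration (core-site.xml / hdfs-site.xml) is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataDump {

    // Recursively print path, size, mtime, owner and permissions for every file.
    static void walk(FileSystem fs, Path p) throws Exception {
        FileStatus[] entries = fs.listStatus(p);
        if (entries == null) return;
        for (FileStatus f : entries) {
            if (f.isDir()) {
                walk(fs, f.getPath());
            } else {
                System.out.printf("%s\t%d\t%d\t%s\t%s%n",
                        f.getPath(), f.getLen(), f.getModificationTime(),
                        f.getOwner(), f.getPermission());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the NameNode
        walk(fs, new Path(args.length > 0 ? args[0] : "/"));
        fs.close();
    }
}

Run it from any machine with client access to the cluster and the Hadoop jars on the classpath, and redirect the output somewhere for triage.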
This also interests me, as I believe cloud solutions often rely on HDFS and similar distributed FSes which relocate and duplicate data across multiple machines.
It is sort of a chicken-and-egg problem. The only saving grace is knowing what one is looking for…
Is HDFS able to "lock" data located on compute/slave nodes - that is, to prevent relocation, further replication, and rebalancing of the data?
Maybe it is my old-fashioned notions, but the inability to mount or access files from HDFS outside of HDFS disjoints my thinking…
Geoff, I agree absolutely that the best practices of "take a forensically sound image and work from that" fall apart completely here - and this is a very modestly sized cluster I've described. Those practices were written assuming that the object under investigation would be a discretely packaged unit - i.e. a single disk. The question is, what are the NEW best practices for these different circumstances?
With respect to keyword searching, the use of hashes and the dumping of file metadata, we have a problem with direct examination of the information located at any one node, as any particular data object coming in is always "chunked" and spread across N data nodes, as well as being duplicated to N others (*) - such that each data node only holds 1/Nth of anything. Thus running such searches relies on us being able to talk to the whole filesystem. Some of these options are built into the tools which Hadoop itself provides us - however therein lies the rub that JHUP raises - we are forced to use the system to examine the system. If I have managed to bypass the name node and write directly to disk on a number of data nodes, then the filesystem doesn't even "know" what's there.
(*) depending on configuration. One further interesting aspect of Hadoop data distribution is that it deliberately contains a randomizing element for the distribution of (tertiary) copies - such that you literally do not know where those copies are without consulting the tracking index (which you could lose in a worst case scenario).
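For what it's worth, you can at least ask the filesystem itself where the chunks of a given file have ended up - either with something like hadoop fsck <path> -files -blocks -locations, or programmatically. A rough, untested sketch (class name made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // talks to the NameNode
        FileStatus f = fs.getFileStatus(new Path(args[0]));
        // Ask the NameNode which hosts hold each chunk of this file
        BlockLocation[] blocks = fs.getFileBlockLocations(f, 0, f.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    b.getOffset(), b.getLength(), java.util.Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}

Of course, all of that depends on the name node being alive and trustworthy - which is rather the point.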
I was, of course, referring to searching/hashing/metadata within HDFS, likely using a MapReduce app written in Java, like the example code that ships with Hadoop. Though, if you had a way to mass-search the normal filesystem data across all the nodes while ignoring the HDFS data, that would be worthwhile as well (automated F-Response connections, perhaps). The real problem with examining a system with itself is that you need an expert in the system to determine, first, whether it's been tampered with.
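The shape of such a MapReduce job is roughly this - a map-only grep that records the file and byte offset of every hit. An untested sketch for line-oriented text input (binary formats would need a different InputFormat); the class name and the search.keyword property are just placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordSearch {

    public static class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String keyword;

        @Override
        protected void setup(Context context) {
            keyword = context.getConfiguration().get("search.keyword");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains(keyword)) {
                // Record the file and byte offset of every hit
                String file = ((FileSplit) context.getInputSplit()).getPath().toString();
                context.write(new Text(file + " @ " + offset.get()), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("search.keyword", args[2]);
        Job job = new Job(conf, "keyword search");
        job.setJarByClass(KeywordSearch.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);                   // map-only: just dump the hits
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You'd run it with something like hadoop jar search.jar KeywordSearch /data/in /tmp/hits "keyword", and the nice part is that the cluster does the I/O for you, spread across all the nodes.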
jhup, there are FUSE drivers for HDFS now that let you mount it if you really want to… though that would probably be the slow way of doing things.
Geoff, VERY interesting to hear that FUSE has extended to Hadoop - I had not heard that and it provides for some interesting alternative ways to get access to the data!
I don't think I've tested them recently, but they've been worked on for a while.
The question is, what are the NEW best practices for these different circumstances?
Best practices are born of experience… so it's up to those of us with the experience to define them. :D
With respect to keyword searching, the use of hashes and the dumping of file metadata, we have a problem with direct examination of the information located at any one node, as any particular data object coming in is always "chunked" and spread across N data nodes, as well as being duplicated to N others (*) - such that each data node only holds 1/Nth of anything. Thus running such searches relies on us being able to talk to the whole filesystem. Some of these options are built into the tools which Hadoop itself provides us - however therein lies the rub that JHUP raises - we are forced to use the system to examine the system.
I tend to think the old notion that we should be able to examine any given system without affecting it is just that, an old notion. For example, memory forensics has turned into a huge sub-field of forensics, and it's pretty much impossible to image RAM from a system without having some effect on it.
With HDFS, there are some nice principles at play, though. First, it's effectively a write-once filesystem: files cannot be modified once they're created (some newer versions of Hadoop allow appending to files, but you can't overwrite any existing data). Second, HDFS guarantees integrity, with blocks subject to automatic checksumming and verification. So, if you want to do forensics on HDFS itself, you can use the system without jeopardizing any of the data already stored in HDFS. However, you can, of course, jeopardize other system data on the individual nodes (e.g., syslogs, dmesg output, etc.).
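As a practical aside, you can also ask HDFS for the checksum it already maintains for a file and record it in your notes - a quick, untested sketch (class name made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksum {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);
        // HDFS keeps CRC checksums for small chunks of every block; getFileChecksum()
        // rolls them up into a single value for the whole file.
        FileChecksum sum = fs.getFileChecksum(p);
        if (sum != null) {
            System.out.println(p + "\t" + sum.getAlgorithmName() + "\t" + sum);
        }
        fs.close();
    }
}

Bear in mind the value depends on the block size and checksum settings, so it's best treated as a way to show a file hasn't changed between two points in time on the same cluster, not as a general-purpose content hash to compare against another tool.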
If you really need to keyword search files within HDFS, like you would in a normal forensics tool, please shoot me an email.
If I have managed to bypass the name node and write directly to disk on a number of data nodes, then the filesystem doesn't even "know" what's there.
Well, HDFS is a distributed filesystem created on top of the nodes' filesystems. If you have access to a DataNode directly, then you can write data to the underlying filesystem on that system, but it doesn't mean that the data will be considered part of HDFS. To a large extent, the NameNode is HDFS, just as the MFT is NTFS… if a file isn't in the MFT, it's not properly part of an NTFS filesystem.
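One practical angle on that: each DataNode keeps its blocks as ordinary files on the local filesystem (named blk_<blockid>, with a matching .meta checksum file alongside) under its dfs.data.dir. So you could inventory the block files on each node and reconcile that list against what the NameNode reports; anything on disk that HDFS doesn't know about deserves a closer look. A rough local-scan sketch (untested, and the directory layout varies a bit between Hadoop versions):

import java.io.File;

// Inventory raw HDFS block files on a DataNode's local disk so they can be
// reconciled against the NameNode's view of the filesystem.
public class BlockInventory {

    static void scan(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) {
                scan(f);
            } else if (f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
                System.out.println(f.getAbsolutePath() + "\t" + f.length() + "\t" + f.lastModified());
            }
        }
    }

    public static void main(String[] args) {
        // Point it at the DataNode's data directory, i.e. wherever dfs.data.dir points
        scan(new File(args[0]));
    }
}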
(*) depending on configuration. One further interesting aspect of Hadoop data distribution is that it deliberately contains a randomizing element for the distribution of (tertiary) copies - such that you literally do not know where those copies are without consulting the tracking index (which you could lose in a worst case scenario).
Without a NameNode, a secondary NameNode, or dumps from them, I think it would be well-nigh impossible to reconstruct any understanding of the HDFS namespace. You may be able to come up with heuristics for restitching common Hadoop file formats, though, like SequenceFiles, which have a pretty strict structure.
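For what it's worth, SequenceFiles start with a three-byte 'SEQ' magic followed by a version byte, so a crude first pass over orphaned block files could just scan for that signature. An untested sketch (plain Java, no Hadoop needed):

import java.io.FileInputStream;

// Scan a raw block file for SequenceFile headers. SequenceFiles begin with the
// three-byte magic "SEQ" plus a version byte, which makes a simple carving signature.
public class SeqHeaderScan {
    public static void main(String[] args) throws Exception {
        byte[] magic = { 'S', 'E', 'Q' };
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[8192];
        long pos = 0;
        int n, matched = 0;
        while ((n = in.read(buf)) > 0) {
            for (int i = 0; i < n; i++, pos++) {
                if (buf[i] == magic[matched]) {
                    if (++matched == magic.length) {
                        System.out.println("possible SequenceFile header at offset "
                                + (pos - magic.length + 1));
                        matched = 0;
                    }
                } else {
                    matched = (buf[i] == magic[0]) ? 1 : 0;
                }
            }
        }
        in.close();
    }
}

Only the first block of a file would carry the header, of course, so this tells you which orphaned blocks are the starts of SequenceFiles, not how to stitch the rest back together.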
Interesting discussion!
Jon