Looking for fragments
I have an assignment that really doesn't come under the heading of "forensics" per se, and what I need to do is very straightforward but there may be better or worse ways to go about it.
What I need to do is to examine hard drives on Windows machines to determine whether an application used to create healthcare documents "cleans up" after itself when it shuts down, and doesn't leave any identifiable patient information scattered around.
I will have a file that provides information about the patient records that have been processed on the machine, and this would provide the text search terms I'd be looking for. Each line of the file contains delimited information about one patient:
BROWN, MARY, 030538, F, 555, ELM, STREET, BOOGYDOWN, RI, 99999
SMITH, JOHN, 061862, M, 8453, WOOD, ROAD, FUNKYTOWN, RI, 99998
There are absolutely NO evidentiary issues involved in this assignment as we are not looking for evidence of crime or misconduct. We are merely seeking to prove (or disprove) whether the process of creating these electronic records inadvertently leaves protected health information behind so that this can be addressed.
In essence, I'm talking about a super-grep, searching every sector. I'm thinking about using a disk editor such as WinHex (the forensic edition). Something like EnCase would be overkill, and too expensive for this project anyway. Since I'm freelancing, I can't ethically use my employer's forensic software or workstation.
In a perfect world, I'd be able to automate the search against the text file line-by-line, but that's probably asking too much.
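To make it concrete, the kind of automation I'm picturing would read the patient file, take the name fields as search terms, and sweep a raw image of the drive in overlapping chunks so that hits spanning a chunk boundary aren't missed. A rough sketch (the file names are placeholders, not anything real):

```python
CHUNK = 1024 * 1024  # read the image 1 MiB at a time


def load_terms(patient_file):
    """Pull surname and forename from each delimited line of the patient file."""
    terms = set()
    with open(patient_file, "r") as f:
        for line in f:
            fields = [p.strip() for p in line.split(",")]
            if len(fields) >= 2:
                terms.add(fields[0].encode())  # surname, e.g. b"BROWN"
                terms.add(fields[1].encode())  # forename, e.g. b"MARY"
    return terms


def scan_image(image_path, terms):
    """Return (term, absolute_offset) for every raw hit in the image."""
    # Keep a tail of (longest term - 1) bytes so matches straddling two
    # read chunks are still found; duplicates are removed by offset.
    overlap = max((len(t) for t in terms), default=1) - 1
    hits = []
    with open(image_path, "rb") as img:
        pos = 0
        tail = b""
        while True:
            block = img.read(CHUNK)
            if not block:
                break
            buf = tail + block
            base = pos - len(tail)  # file offset where buf begins
            for term in terms:
                start = 0
                while (i := buf.find(term, start)) != -1:
                    hits.append((term, base + i))
                    start = i + 1
            tail = buf[-overlap:] if overlap else b""
            pos += len(block)
    return sorted(set(hits), key=lambda h: h[1])
```

Searching the raw image catches allocated and unallocated space alike; the terms should probably also be searched as UTF-16-LE (`term.decode().encode("utf-16-le")`), since Windows applications often write text that way.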
Any expert suggestions?
In essence, I'm talking about a super-grep, searching every sector.
Or … you're talking about establishing a base environment, for instance in VMware or Virtual PC, taking a snapshot/establishing a differential drive, and letting the application run and close. At that point, any changes made to the drive are in the post-snapshot/differential drive. Searching that (with the usual caveats about grep expressions across sector boundaries) would probably cut the searching job down quite a bit. You may even be able to eyeball the changes. Extracting each sector from the differential file should provide you with the best base for searches.
(The Virtual PC disk format is easy to understand – some programming is probably needed.)
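If both the before and after states can be exported as flat raw images, the comparison itself is simple; a sketch (assumes equal-sized raw images, not the VHD container format itself):

```python
SECTOR = 512  # assumed sector size


def changed_sectors(base_path, after_path, sector=SECTOR):
    """Compare two equal-sized raw images sector by sector and return
    (sector_number, after_bytes) for every sector that differs."""
    changed = []
    with open(base_path, "rb") as base, open(after_path, "rb") as after:
        n = 0
        while True:
            b = base.read(sector)
            a = after.read(sector)
            if not b and not a:
                break
            if a != b:
                changed.append((n, a))
            n += 1
    return changed
```

Only the changed sectors then need to be fed into the text search, which is usually a tiny fraction of the disk.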
An even more quick-and-dirty method would be to use something like Sandboxie and run the application in that environment. That, however, will only help you find traces left in files: new and modified files are collected in the sandbox, but if files are deleted, their old contents are lost in this setup (though I realize I haven't looked at Sandboxie for a while, and things may have changed).
The problem, though, seems to be ensuring that file fragmentation won't defeat the searches. It looks like you would want a tool that splits search strings to allow for sector breaks. That is, patterns of the form A|B, where | represents a sector boundary, so that A|B is interpreted as 'A at the end of any sector and B at the beginning of any other sector'. Of course, very short search patterns will cause false positives that need to be weeded out.
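That boundary-split idea can be sketched as follows: generate every A|B split of a term, then look for sectors ending with A paired with sectors beginning with B (a toy sketch; 512-byte sectors assumed):

```python
SECTOR = 512  # assumed sector size


def split_patterns(term: bytes):
    """All ways to break `term` across a boundary: (head, tail) pairs."""
    return [(term[:i], term[i:]) for i in range(1, len(term))]


def boundary_hits(image: bytes, term: bytes, sector=SECTOR):
    """Find sector pairs where the head of `term` ends one sector and the
    tail begins another (not necessarily adjacent, due to fragmentation)."""
    n = len(image) // sector
    sectors = [image[i * sector:(i + 1) * sector] for i in range(n)]
    hits = []
    for head, tail in split_patterns(term):
        ends = [i for i, s in enumerate(sectors) if s.endswith(head)]
        starts = [i for i, s in enumerate(sectors) if s.startswith(tail)]
        for e in ends:
            for st in starts:
                if e != st:
                    hits.append((term, head, tail, e, st))
    return hits
```

As noted above, one-character heads or tails will match a lot of sectors, so the short splits need manual weeding.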
But I have a feeling I'm stating the obvious.
If the names/data are not going to be encrypted/altered, you can use a simple hex editor, such as Tiny Hexer, link given here.
It has a configurable script to extract "text data" from binary files; you can, of course, also use it on a disk.
It will extract all data recognized as "text" into either a .txt or an .htm file, which you can later search with your "wordlist" items.
I don't think there is an issue with actual strings written across sectors. Usually these kinds of programs use databases that do not "scatter" entries across sectors, as they use fixed-length fields. I would check for that kind of behaviour only if nothing has been found "normally" first, as a "last attempt".
As I understand it, you just need to search the unallocated space, and possibly undeleted temp files.
It depends on the size of your data files: if they are large (>100 MB), they will typically be fragmented, and smaller files, such as .DOC or .XLS files, can also become fragmented while being edited.
A straight sector-by-sector search will not tell you whether the search string sits in the live data file or in a left-over fragment.
Are you saying the system is just a workstation accessing a remote file? In that case, I would suggest that the page file (pagefile.sys) is the most likely place to retain patient data. This would vary with memory size, the applications loaded, the time the machine has been turned on, etc. The only way to prevent this would be to ensure no virtual memory is allowed.
Firstly, I think that the sandbox is the better option, and the one that will make it easiest to identify data written to disk - it will narrow your search area hugely, and you may even be able to verify it manually, depending upon the footprint.
You've been reasonably explicit that you are interested in disks, however your statement
"cleans up" after itself when it shuts down, and doesn't leave any identifiable patient information scattered around.
isn't specifically disk-related - you might like to grab the memory and have a look to see if it "cleans up" there too - or at least whether the OS does. It's another avenue of attack for someone who is truly motivated. There is also the possibility that, at some indeterminate point in the future, that memory may be written to disk - during hibernation, as part of another application's poor memory management, or the like - which might additionally affect the application's footprint. It's got to be worth a few bonus marks if nothing else 😉
Other than that, you could "dd" an image of the disk to a single file, use "strings" to extract the strings from it to a file, then "grep" with regular expressions until you either find something useful, satisfy yourself there is nothing, or run out of time :-)
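Where the Unix tools aren't handy on Windows, the strings-then-grep step is easy to approximate in a few lines of Python (a sketch, operating on image data already read into memory):

```python
import re


def ascii_strings(data: bytes, min_len=4):
    """Rough equivalent of `strings`: printable ASCII runs with offsets."""
    return [(m.start(), m.group().decode("ascii"))
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]


def grep_strings(data: bytes, pattern: str):
    """`strings | grep`: keep only the extracted runs matching `pattern`."""
    rx = re.compile(pattern)
    return [(off, s) for off, s in ascii_strings(data) if rx.search(s)]
```

Keeping the offset of each hit makes it easy to go back to the raw image in a hex editor and see what file or structure the fragment belongs to.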
So you have a disk image? Does it have the OS and health care application(s) on it? Is it bootable?
Or is the image something created in a laboratory specifically for this assignment?
Also, are you looking for records of the format that you described, or the data in those records?
I could try to help you with a grep (not a 'super-expert' at this, but I've done enough to probably help you create something pretty workable - although likely to be pretty slow, I suspect).
If you could specify the likely bounds for each field that'd help.
However this would be to create a grep that searches for these strings in that particular format.
If you have access to the software/system, I'd suggest (as I think others are alluding to) creating/adding/modifying records (using all functions as necessary) for a known set of text - for example, creating records for 'averylongforenamestring', 'averylongsurnamestring', and so forth. Keep a record of all the long unique strings you use for the test, then keyword-search the entire drive for those known strings, with files mounted as necessary, various codepages, and so forth.
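A sketch of generating such records and the matching keyword list (the field layout follows the sample lines in the question; the marker names are made up):

```python
import uuid


def make_test_record(n):
    """Build one synthetic record in the file's delimited format, with long
    unique name fields that can't collide with real patient data."""
    surname = f"SURNAMEMARKER{n:04d}{uuid.uuid4().hex[:8]}".upper()
    forename = f"FORENAMEMARKER{n:04d}{uuid.uuid4().hex[:8]}".upper()
    record = (f"{surname}, {forename}, 010170, F, 1, TEST, "
              f"STREET, TESTTOWN, RI, 00000")
    return record, [surname, forename]


def make_wordlist(count):
    """Synthetic records to enter into the application, plus the keyword
    list to sweep the drive for afterwards."""
    records, keywords = [], []
    for n in range(count):
        rec, kws = make_test_record(n)
        records.append(rec)
        keywords.extend(kws)
    return records, keywords
```

Because every marker is long and random, a hit anywhere on the drive after the application exits is unambiguous - there's no weeding out of coincidental matches, unlike searching for real names.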