Greetings,
I heard rumors to that effect. I suspect yours is farther along than mine.
The documentation, such as it is, for mine is here
http//
Skeletool is the utility/example script and dfir contains the various libraries that I am building.
It really is just a proof of concept at the moment.
-David
keydet89 wrote
I hear that a lot…but what I rarely hear is anyone asking to have a one-shot tool updated to do directories/subdirectories…
If I understand you right, Catalogue does just that, point it at a folder and off it goes through sub-folders gathering metadata like crazy. Not perfect, really doesn't like hidden folders but there's always a way round.
Funny thing is, it was about the first tool I bought 'cos it did just the exact job we needed at the time. And I thought it was such an obvious solution - but apparently not, there seems to be more effort put into cleansing or changing metadata.
David - unusual - is hard to define. But if you're familiar with the McLaren/Ferrari IP scandal a few years ago, that's kinda (but not exactly) what we're faced with. So, say I work for McLaren and the Ferrari case is done and dusted but I want to ensure that I'm clean on an ongoing basis and I see "Ferrari" in the Company column of metadata hoovered from one of my new start engineers, something's not right. So I do some digging and find the same guy has worked for three different companies and they all appear in my systems, we have a problem. But I can't just do a keyword search for "Ferrari" cos it will return way too many false positives.
Both - with zero programming experience/education but a desire to get some skills, would you argue against Python as a starting point? BTW, passed my GCFE last week :D
Cheers - you both give a lot to this community, I've learned a helluva lot from you and I bet others have too.
I've previously used software called ListIt - http//
Hi Sydney, appreciated.
FYI that link appears to be broken but I found them on http//
I've emailed them asking if their product does what I need. Fingers crossed.
Hi Cults14,
I think you could try a tool named FOCA, which does a good job of extracting metadata from large sets of documents. FOCA supports the following document formats: .doc, .ppt, .pps, .xls, .docx, .pptx, .ppsx, .xlsx, .sxw, .sxc, .sxi, .odt, .ods, .odg, .odp, .pdf, .wpd, .svg, .svgz, .jpg (Office 2007/2010 formats supported!).
You can use FOCA to extract metadata from a single file, a selected folder or even a website (e.g. your intranet). In that last case it downloads all files from the selected URL prior to analysis.
Vendor and download here
http//
http//
I've been using FOCA for quite a long time and have found it to be a good piece of software.
As for the license: "FOCA is free for use in any environment, including but not necessarily limited to personal, academic, commercial, government, business, non-profit, and for-profit."
I hope you will find it useful.
Let me know your thoughts.
-Greg
IIRC recent Office fileformats are basically .ZIP files. You can do a mass rename of say .docx files to .zip, then unpack them to (separate) folders and do a string search on the XML files that contain the Metadata.
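For what it's worth, the rename step can probably be skipped if you script it: Python's standard zipfile module opens a .docx/.xlsx/.pptx as-is. Below is a minimal sketch of the string search described above, limited to the docProps/ parts where the document properties normally live. The function name is made up for illustration, and it assumes the XML parts are UTF-8 encoded, which is what Office normally writes.

```python
import zipfile


def property_parts_containing(path, needle):
    """Return the names of docProps/*.xml parts whose text contains needle.

    The search is case-insensitive. Assumes the parts are UTF-8 encoded.
    """
    needle = needle.lower()
    hits = []
    # An Office 2007+ file is a ZIP package, so no rename to .zip is needed.
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.startswith("docProps/") and name.endswith(".xml"):
                text = archive.read(name).decode("utf-8", errors="replace")
                if needle in text.lower():
                    hits.append(name)
    return hits
```

Calling property_parts_containing("report.docx", "ferrari") on a hypothetical file would return the property parts that mention the keyword, without touching the body text in word/document.xml.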
MDCR - yes I'm aware of the file format and where the properties are. Trouble is that most often we don't know what we're looking for up front so string searches aren't any good. To go back to the Ferrari example, even if we knew that Ferrari was a keyword, it would most likely be in so many documents that we would get far too many false positives for this to be workable. And mis-spellings, acronyms and abbreviations just add to the confusion.
Plus, we want to look for Authors as well (once we find documents with relevant data in the Company field). We've found this useful in the past, e.g. an Author found in Word metadata (by filtering on Company column in Excel output of metadata listing) turns out to be a key contact that we didn't know about before. NOW we can do string searches regardless of whether the string is in metadata or the text of the file.
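As a rough illustration of that workflow (filter on Company, then look at the Authors), here is a small Python sketch that works without knowing the keyword up front: it tallies every Company value in a metadata listing and collects the Authors attached to any company that is not on a known-good list. The column names "Company" and "Author", the function name and the idea of an "expected" list are all assumptions for the example, not something any particular tool produces.

```python
from collections import Counter, defaultdict


def summarise_companies(rows, expected=()):
    """Count Company values and collect authors for the unexpected ones.

    rows     -- iterable of dicts with 'Company' and 'Author' keys (assumed names)
    expected -- company names considered normal for this organisation
    """
    expected_lower = {name.lower() for name in expected}
    counts = Counter()
    authors = defaultdict(set)
    for row in rows:
        company = (row.get("Company") or "").strip()
        if not company:
            continue  # blank Company fields tell us nothing here
        counts[company] += 1
        # Compare case-insensitively so "ourco" is not flagged as foreign.
        if company.lower() not in expected_lower:
            author = (row.get("Author") or "").strip()
            if author:
                authors[company].add(author)
    return counts, dict(authors)
```

The rows could come straight from csv.DictReader over the metadata listing. Sorting the counts from rarest to most common is one way to surface the "unusual" values first; misspellings and abbreviations still show up as separate entries, which is arguably what you want when reviewing by eye.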
It is perhaps surprising that a $35 piece of software was able to do so much for us in the past. Damn M$ for changing the file format :x
HTH
Metadata extraction in docx, xlsx and pptx is not hard. If you know a scripting language you just need to handle decompression and text extraction from a handful of preknown xml files. Dump output to a csv and you've saved the money.
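A minimal sketch of that approach in Python, using only the standard library: walk a folder and its sub-folders, open each .docx/.xlsx/.pptx as a ZIP, read docProps/core.xml and docProps/app.xml, and write one CSV row per file. The function names and the choice of columns are illustrative assumptions; files that are not valid ZIP packages (corrupt, password-protected, Office lock files) are simply skipped here rather than reported.

```python
import csv
import os
import zipfile
import xml.etree.ElementTree as ET

OFFICE_EXTENSIONS = (".docx", ".xlsx", ".pptx")
# Columns are an assumption for this sketch; add more property names as needed.
FIELDS = ["path", "creator", "lastModifiedBy", "Company", "created", "modified"]


def extract_row(path):
    """Return one CSV row (a dict) of properties for a single Office file."""
    row = dict.fromkeys(FIELDS, "")
    row["path"] = path
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        for part in ("docProps/core.xml", "docProps/app.xml"):
            if part not in names:
                continue
            for element in ET.fromstring(archive.read(part)):
                # Drop the '{namespace}' prefix ElementTree adds to tag names.
                tag = element.tag.rsplit("}", 1)[-1]
                if tag in row and tag != "path" and element.text:
                    row[tag] = element.text.strip()
    return row


def write_metadata_csv(top_folder, csv_path):
    """Walk top_folder recursively and write metadata rows to csv_path.

    Returns the number of files written to the CSV.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for folder, _subfolders, files in os.walk(top_folder):
            for name in files:
                if not name.lower().endswith(OFFICE_EXTENSIONS):
                    continue
                path = os.path.join(folder, name)
                try:
                    writer.writerow(extract_row(path))
                    count += 1
                except (zipfile.BadZipFile, ET.ParseError, OSError):
                    # Unreadable files are skipped in this sketch.
                    continue
    return count
```

Something like write_metadata_csv("exported_docs", "metadata.csv") (both paths hypothetical) would produce a listing that opens in Excel and can be filtered on the Company column.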
Greetings,
Thank you for the background, I appreciate it. 'tis an interesting problem. Modern ediscovery tools are working on solutions to it using a variety of approaches, including context searching. Keywords may be going the way of the dinosaur, though probably very slowly as explaining the new science to the court could take years, if not decades.
Python is, in my mind, easier to learn than Perl, but I think that is an individual thing. I think it is easier to read, and forces you to write clearer code. I think it has better OOP support.
They both have rich support options in the form of blogs, examples, training materials, classes, books, etc. They both have a wide variety of open source libraries to do many things. They're both available on most any system you might use.
-David
Update - no-one's come up with any tools which will list all metadata from all MS Office 2007/2010 docs in csv or similar format, most suggestions have been along the lines of keyword searching - which of course assumes you know what you're looking for (which we often don't).
However, Win7 native search features let you search in the Company and Author fields (and others) using the following syntax:
system.author:<keywords>
system.company:<keywords>
system.lastmodifiedby:<keywords>
Clearly you need to be in the right folder and indexing helps, but IMO it's actually rather good - you can even use Boolean AND/OR/NOT
Meantime, with zilch programming experience, I'm off to find out if it's possible to script this in Python. Wish me luck!!
Cheers