Metadata Extraction

Cults14 · 2011-07-14T19:36:03Z

Hi, hope someone can help.

One of the aspects of my job is to routinely search for unwanted Intellectual Property on our systems, mainly in MS Office and PDF files - file and mail stores. I have been using a (nearly) free tool called Metadataminer Catalogue which you point at a drive or folder, press OK, and it goes off and strips out as much metadata as it can including - significantly for us - Company and Author for MS Office and Producer for PDF. And gives the option for exporting to Excel where we filter on Company and Author columns for quick return on time invested.

Whenever we go through an IP exercise, this is the route which turns up most stuff most quickly, and also helps point us in the right direction for additional keyword searches. It's been really helpful.

BUT - there's always a BUT isn't there?

Catalogue can't cope with Office 2007 and later "x" files - it returns nothing in the Company and Author field. And the publishers have no plans to update the product.

I've used Harlan's WMD.PL in the past but find this is suitable for small volumes, not the thousands we have to cope with. And AFAIK it's for Word only, not the remainder of the Office Suite.

So, does anyone out there know of anything which will do what I'm looking for? I seem to recall that pinpoint had something but I suspect that it's been incorporated into Harvester. NUIX collects the right stuff as far as I recall but that's not the point of NUIX. And the price point is prohibitive for us.

Any ideas anyone? We don't need to change metadata, just find it.

Cheers

Page 2 / 3

kovar

(@kovar)

Prominent Member

Joined: 18 years ago

Posts: 805

15/07/2011 1:06 am

Greetings,

I heard rumors to that effect. I suspect yours is farther along than mine.

The documentation, such as it is, for mine is here:

http://code.google.com/p/opensourceforensics/wiki

Skeletool is the utility/example script and dfir contains the various libraries that I am building.

It really is just a proof of concept at the moment.

-David

Cults14

(@cults14)

Reputable Member

Joined: 17 years ago

Posts: 367

Topic starter 15/07/2011 4:34 am

keydet89 wrote:

I hear that a lot…but what I rarely hear is anyone asking to have a one-shot tool updated to do directories/subdirectories…

If I understand you right, Catalogue does just that, point it at a folder and off it goes through sub-folders gathering metadata like crazy. Not perfect, really doesn't like hidden folders but there's always a way round.

Funny thing is, it was about the first tool I bought 'cos it did just the exact job we needed at the time. And I thought it was such an obvious solution - but apparently not, there seems to be more effort put into cleansing or changing metadata.

David - unusual - is hard to define. But if you're familiar with the McLaren/Ferrari IP scandal a few years ago, that's kinda (but not exactly) what we're faced with. So, say I work for McLaren and the Ferrari case is done and dusted but I want to ensure that I'm clean on an ongoing basis and I see "Ferrari" in the Company column of metadata hoovered from one of my new start engineers, something's not right. So I do some digging and find the same guy has worked for three different companies and they all appear in my systems, we have a problem. But I can't just do a keyword search for "Ferrari" cos it will return way too many false positives.

Both - with zero programming experience/education but a desire to get some skills, would you argue against Python as a starting point? BTW, passed my GCFE last week :D

Cheers - you both give a lot to this community, I've learned a helluva lot from you and I bet others have too.

Sydney36

(@sydney36)

New Member

Joined: 16 years ago

Posts: 1

15/07/2011 7:40 am

I've previously used software called ListIt - http://www.forensictools.com.au/software/index.html. You can point it at a directory, and it will list all files found along with metadata. Not free though ($40 Australian)…

Cults14

(@cults14)

Reputable Member

Joined: 17 years ago

Posts: 367

Topic starter 15/07/2011 2:29 pm

Hi Sydney, appreciate.

FYI that link appears to be broken but I found them on http://www.forensictools.com.au/contact/index.html

I've emailed them asking if their product does what I need. Fingers crossed.

gmkk

(@gmkk)

Active Member

Joined: 14 years ago

Posts: 13

15/07/2011 3:23 pm

Hi Cults14,

You might try a tool named FOCA, which does a good job of metadata extraction from large sets of documents. FOCA supports the following document formats: .doc .ppt .pps .xls .docx .pptx .ppsx .xlsx .sxw .sxc .sxi .odt .ods .odg .odp .pdf .wpd .svg .svgz .jpg (Office 2007/2010 formats supported!).

You can use FOCA to extract metadata from a single file, a selected folder or even a website (e.g. your intranet). In that last case it downloads all files from the selected URL prior to analysis.

Vendor and download here:
http://www.informatica64.com/foca/
http://www.informatica64.com/DownloadFOCA/

I've been using FOCA for quite a long time and have found it to be a good bit of software.

As for the license: "FOCA is free for use in any environment, including but not necessarily limited to personal, academic, commercial, government, business, non-profit, and for-profit."

I hope you will find it useful.

Let me know your thoughts.

-Greg

MDCR

(@mdcr)

Reputable Member

Joined: 15 years ago

Posts: 376

16/07/2011 3:09 pm

IIRC recent Office file formats are basically .ZIP files. You can do a mass rename of, say, .docx files to .zip, then unpack them to (separate) folders and do a string search on the XML files that contain the metadata.
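
For anyone wanting to try this without renaming or unpacking anything, here is a minimal Python sketch of the same idea. It assumes the usual Office 2007+ package layout, where Author and Last Modified By sit in docProps/core.xml and Company sits in docProps/app.xml. The function name `read_office_properties` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the two property parts of an Office 2007+ package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}


def read_office_properties(path):
    """Return Author, LastModifiedBy and Company for one .docx/.xlsx/.pptx file.

    Hypothetical helper: assumes the standard docProps/ layout and returns
    empty strings for any part or element that is missing.
    """
    props = {"Author": "", "LastModifiedBy": "", "Company": ""}
    # The file is opened directly as a zip archive; no rename is needed.
    with zipfile.ZipFile(path) as package:
        members = set(package.namelist())
        if "docProps/core.xml" in members:
            core = ET.fromstring(package.read("docProps/core.xml"))
            props["Author"] = core.findtext("dc:creator", default="", namespaces=NS) or ""
            props["LastModifiedBy"] = (
                core.findtext("cp:lastModifiedBy", default="", namespaces=NS) or ""
            )
        if "docProps/app.xml" in members:
            app = ET.fromstring(package.read("docProps/app.xml"))
            props["Company"] = app.findtext("ep:Company", default="", namespaces=NS) or ""
    return props
```

Calling `read_office_properties("report.docx")` (file name is only an example) gives back a small dictionary, so the values can be listed rather than searched for by keyword.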

Cults14

(@cults14)

Reputable Member

Joined: 17 years ago

Posts: 367

Topic starter 17/07/2011 3:55 am

MDCR - yes I'm aware of the file format and where the properties are. Trouble is that most often we don't know what we're looking for up front so string searches aren't any good. To go back to the Ferrari example, even if we knew that Ferrari was a keyword, it would most likely be in so many documents that we would get far too many false positives for this to be workable. And mis-spellings, acronyms and abbreviations just add to the confusion.

Plus, we want to look for Authors as well (once we find documents with relevant data in the Company field). We've found this useful in the past, e.g. an Author found in Word metadata (by filtering on Company column in Excel output of metadata listing) turns out to be a key contact that we didn't know about before. NOW we can do string searches regardless of whether the string is in metadata or the text of the file.

It is perhaps surprising that a $35 piece of software was able to do so much for us in the past. Damn M$ for changing the file format x

HTH
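
The filter-on-Company, then look-at-Author step described in this post could be sketched in Python roughly as follows. It assumes a CSV metadata listing with columns named "Company" and "Author" (those column names, the `known_companies` parameter and the function name are assumptions for illustration, not the actual Catalogue export format).

```python
import csv
from collections import defaultdict


def authors_by_company(csv_path, known_companies=()):
    """Map each unexpected Company value to the set of Authors seen with it.

    known_companies lists the values that are expected (e.g. your own
    organisation) and should be left out; matching ignores case.
    """
    known = {name.strip().lower() for name in known_companies}
    found = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            company = (row.get("Company") or "").strip()
            # Skip rows with no Company value or with an expected one.
            if not company or company.lower() in known:
                continue
            found[company].add((row.get("Author") or "").strip())
    return dict(found)
```

Because nothing is searched for up front, every Company value that is not on the expected list shows up, along with the Authors attached to it, which can then feed later keyword searches.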

joakims

(@joakims)

Estimable Member

Joined: 15 years ago

Posts: 224

17/07/2011 5:02 am

Metadata extraction in docx, xlsx and pptx is not hard. If you know a scripting language you just need to handle decompression and text extraction from a handful of preknown xml files. Dump output to a csv and you've saved the money.
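
As a rough illustration of that approach, the sketch below walks a folder tree, opens each .docx/.xlsx/.pptx as a zip archive, reads the two property parts (docProps/core.xml and docProps/app.xml) and writes one CSV row per file. The function name `catalogue`, the column names and the choice of fields are assumptions made for the example; files that cannot be opened as a zip (corrupt, or password-protected files that are not zip packages) are still listed, with empty fields.

```python
import csv
import os
import zipfile
import xml.etree.ElementTree as ET

OOXML_EXTENSIONS = (".docx", ".xlsx", ".pptx")

# Fully qualified tag names for the properties of interest.
CREATOR = "{http://purl.org/dc/elements/1.1/}creator"
LAST_MODIFIED_BY = (
    "{http://schemas.openxmlformats.org/package/2006/metadata/core-properties}lastModifiedBy"
)
COMPANY = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Company"


def _part_root(package, member):
    """Parse one XML part of the package; return None if missing or malformed."""
    try:
        return ET.fromstring(package.read(member))
    except (KeyError, ET.ParseError):
        return None


def _text(root, tag):
    """Text of a direct child element, or '' when the part or element is absent."""
    if root is None:
        return ""
    return root.findtext(tag, default="") or ""


def catalogue(top_folder, csv_path):
    """Walk top_folder and write one CSV row per Office 2007+ file found.

    Returns the number of data rows written.
    """
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Path", "Author", "LastModifiedBy", "Company"])
        for folder, _subfolders, files in os.walk(top_folder):
            for name in files:
                if not name.lower().endswith(OOXML_EXTENSIONS):
                    continue
                path = os.path.join(folder, name)
                try:
                    with zipfile.ZipFile(path) as package:
                        core = _part_root(package, "docProps/core.xml")
                        app = _part_root(package, "docProps/app.xml")
                    row = [path, _text(core, CREATOR),
                           _text(core, LAST_MODIFIED_BY), _text(app, COMPANY)]
                except (zipfile.BadZipFile, OSError):
                    # Unreadable or not a zip package: list it with empty fields.
                    row = [path, "", "", ""]
                writer.writerow(row)
                rows_written += 1
    return rows_written
```

The resulting CSV opens in Excel, so the existing habit of filtering on the Company and Author columns should carry over. Legacy .doc/.xls/.ppt and PDF files are not handled by this sketch.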

kovar

(@kovar)

Prominent Member

Joined: 18 years ago

Posts: 805

18/07/2011 1:45 am

Greetings,

Thank you for the background, I appreciate it. 'tis an interesting problem. Modern ediscovery tools are working on solutions to it using a variety of approaches, including context searching. Keywords may be going the way of the dinosaur, though probably very slowly as explaining the new science to the court could take years, if not decades.

Python is, in my mind, easier to learn than Perl, but I think that is an individual thing. I think it is easier to read, and forces you to write clearer code. I think it has better OOP support.

They both have rich support options in the form of blogs, examples, training materials, classes, books, etc. They both have a wide variety of open source libraries to do many things. They're both available on most any system you might use.

-David

Cults14

(@cults14)

Reputable Member

Joined: 17 years ago

Posts: 367

Topic starter 29/08/2011 7:55 pm

Update - no-one's come up with any tools which will list all metadata from all MS Office 2007/2010 docs in csv or similar format, most suggestions have been along the lines of keyword searching - which of course assumes you know what you're looking for (which we often don't).

However, Win7 native search features let you search in Company and Author Field (and others) using the following syntax:
system.author:<keywords>
system.company:<keywords>
system.lastmodifiedby:<keywords>

Clearly you need to be in the right folder and indexing helps, but IMO it's actually rather good - you can even use Boolean AND/OR/NOT

Meantime, with zilch programming experience, I'm off to find out if it's possible to script this in Python. Wish me luck!!

Cheers
