
Metadata Extraction

(@cults14)
Reputable Member
Joined: 17 years ago
Posts: 367
Topic starter  

Hi, hope someone can help.

One of the aspects of my job is to routinely search for unwanted Intellectual Property on our systems, mainly in MS Office and PDF files - file and mail stores.

I have been using a (nearly) free tool called Metadataminer Catalogue: you point it at a drive or folder, press OK, and it goes off and pulls out as much metadata as it can including, significantly for us, Company and Author for MS Office files and Producer for PDFs. It also gives the option of exporting to Excel, where we filter on the Company and Author columns for a quick return on time invested.

Whenever we go through an IP exercise, this is the route which turns up most stuff most quickly, and also helps point us in the right direction for additional keyword searches. It's been really helpful.

BUT - there's always a BUT isn't there?

Catalogue can't cope with Office 2007 and later "x" files (docx, xlsx, pptx) - it returns nothing in the Company and Author fields. And the publishers have no plans to update the product.

I've used Harlan's WMD.PL in the past but find this is suitable for small volumes, not the thousands we have to cope with. And AFAIK it's for Word only, not the remainder of the Office Suite.

So, does anyone out there know of anything which will do what I'm looking for? I seem to recall that pinpoint had something but I suspect that it's been incorporated into Harvester. NUIX collects the right stuff as far as I recall but that's not the point of NUIX. And the price point is prohibitive for us.

Any ideas anyone? We don't need to change metadata, just find it.

Cheers


   
keydet89
(@keydet89)
Famed Member
Joined: 21 years ago
Posts: 3568
 

cults14 wrote: "I've used Harlan's WMD.PL in the past but find this is suitable for small volumes, not the thousands we have to cope with. And AFAIK it's for Word only, not the remainder of the Office Suite."

Actually, it does work pretty well on other document formats within the Office suite, prior to 2003. I also have other tools that can work, as well.

When you say that the tool is not suitable for the thousands of files that you have to cope with, what do you mean?

cults14 wrote: "So, does anyone out there know of anything which will do what I'm looking for? I seem to recall that pinpoint had something but I suspect that it's been incorporated into Harvester. NUIX collects the right stuff as far as I recall but that's not the point of NUIX. And the price point is prohibitive for us.

Any ideas anyone? We don't need to change metadata, just find it."

If you're looking to add something like the MS Office suite beyond 97 (i.e., docx/pptx, etc.), I'd look to something like ExifTool or read_open_xml.pl, discussed here: http://computer-forensics.sans.org/blog/2009/07/10/office-2007-metadata/

If you're thinking that these tools don't work for you because they only handle one file at a time, remember that they're open source and can be modified to search through directories and subdirectories for files.
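To give a feel for what "search through directories and subdirectories" might look like, here is a minimal Python sketch. It is not Harlan's code and not how read_open_xml.pl works internally; it simply assumes that Office 2007+ files are zip packages containing docProps/core.xml and docProps/app.xml, and reads Author, LastModifiedBy and Company from them. The function names and CSV columns are made up for the illustration.

```python
import csv
import os
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the Office Open XML property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}
OOXML_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def _find_text(package, part, xpath):
    """Return the text of the first element matching xpath in a package part, or ''."""
    try:
        root = ET.fromstring(package.read(part))
    except (KeyError, ET.ParseError):
        return ""  # part missing or malformed
    node = root.find(xpath, NS)
    return (node.text or "") if node is not None else ""


def read_ooxml_metadata(path):
    """Hypothetical helper: pull Author, LastModifiedBy and Company from one file."""
    with zipfile.ZipFile(path) as package:
        return {
            "Path": path,
            "Author": _find_text(package, "docProps/core.xml", "dc:creator"),
            "LastModifiedBy": _find_text(package, "docProps/core.xml", "cp:lastModifiedBy"),
            "Company": _find_text(package, "docProps/app.xml", "ep:Company"),
        }


def walk_to_csv(top, out_csv):
    """Walk top and its subdirectories, writing one CSV row per readable OOXML file."""
    fields = ["Path", "Author", "LastModifiedBy", "Company"]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for folder, _subdirs, names in os.walk(top):
            for name in names:
                if os.path.splitext(name)[1].lower() not in OOXML_EXTENSIONS:
                    continue
                try:
                    writer.writerow(read_ooxml_metadata(os.path.join(folder, name)))
                except (zipfile.BadZipFile, OSError):
                    continue  # skip encrypted, corrupt or unreadable files
```

Calling walk_to_csv on a drive or folder should give a CSV that opens in Excel, one row per file. Password-protected or damaged files are silently skipped in this sketch, so for an audit trail you would probably want to log them instead.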

HTH


   
(@cults14)
Reputable Member
Joined: 17 years ago
Posts: 367
Topic starter  

Hi Harlan, thanks for the prompt reply. I guess this is a cross between eDiscovery and Digital Forensics, you could call this EDA or Triage depending. Part of the problem is that we don't always know what we're looking for until we find it – we're basically looking for unusual stuff, stuff that shouldn't be there.

In terms of thousands – we have a project upcoming where an agreement we have with a JV partner is coming to an end, and we need to flush any of their technical documentation out of our systems (and provide an audit trail). Although we're only talking about 30 users for this project, we have had others with upwards of 130 users.

That's local hard drives, attachments in local PSTs, shared drives on servers, Home drives on servers, attachments in Mailbox (Exchange), and attachments in Enterprise Vault (Exchange) – and we NEVER delete from EV. It's some piece of work. On one project, we had a couple of guys with over 50,000 data files each on their local hard drives which we needed to analyse; Catalogue gave us a heads-up within a couple of hours that we had a problem. There are other things we look out for, e.g. unusually large volumes of TIF, PDF or JPG files which could indicate scanning of hard copy documents.
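The check for unusually large volumes of TIF, PDF or JPG files could be sketched roughly as below. This is only an illustration in Python: the extension list, the function names and the threshold of 500 are assumptions for the example, not figures from our projects.

```python
import os
from collections import Counter

# Extensions treated as "possible scans" in this sketch (an assumption).
SCAN_EXTENSIONS = {".tif", ".tiff", ".pdf", ".jpg", ".jpeg"}


def count_scan_like_files(top):
    """Count files per folder whose extension suggests a scanned document."""
    counts = Counter()
    for folder, _subdirs, names in os.walk(top):
        hits = sum(1 for n in names if os.path.splitext(n)[1].lower() in SCAN_EXTENSIONS)
        if hits:
            counts[folder] = hits
    return counts


def folders_over_threshold(top, threshold=500):
    """Return (folder, count) pairs at or above threshold, largest first."""
    return [(f, c) for f, c in count_scan_like_files(top).most_common() if c >= threshold]
```

Counting per folder rather than per drive is a design choice here: a single folder full of images tends to stand out more clearly than a total spread across a whole disk.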

I appreciate what you're saying about Open Source, but I've come to this at a late stage in life from a non-techie background so have zero programming skills. We have a small development department but it's been scaled back recently and there's no slack at the moment.

I'll have a look-see if it's possible to do a command-line batch file based on ExifTool, with output to txt or (better) csv.
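For what it's worth, ExifTool does have a recursive option (-r), a CSV output option (-csv) and an extension filter (-ext), so a small wrapper along these lines may be enough. This is a hedged Python sketch: the tag names are assumptions and should be checked against what ExifTool actually reports for a few sample files, and the function names are invented for the example.

```python
import subprocess


def build_exiftool_command(top, extensions=("docx", "xlsx", "pptx", "pdf"),
                           tags=("Author", "Creator", "Company", "Producer")):
    """Build an ExifTool command line for a recursive CSV dump of selected tags."""
    command = ["exiftool", "-r", "-csv"]
    for ext in extensions:
        command += ["-ext", ext]          # only process these file types
    command += ["-" + tag for tag in tags]  # only report these tags
    command.append(top)
    return command


def dump_metadata(top, out_csv):
    """Run ExifTool (assumed to be on PATH) and save its CSV output to out_csv."""
    with open(out_csv, "wb") as handle:
        # check=False: a non-zero exit is not treated as fatal here, since
        # ExifTool may report errors on individual files in a large run.
        subprocess.run(build_exiftool_command(top), stdout=handle, check=False)
```

The same thing typed directly at a command prompt would be something like `exiftool -r -csv -ext docx -ext xlsx -ext pptx -ext pdf -Author -Creator -Company -Producer C:\data > metadata.csv`, where C:\data is just a placeholder path.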

I may end up going with the begging bowl though...


   
(@kovar)
Prominent Member
Joined: 18 years ago
Posts: 805
 

Greetings,

Long shot, but have you looked at dtSearch's network spider version? That may do what you want.

-David


   
(@cults14)
Reputable Member
Joined: 17 years ago
Posts: 367
Topic starter  

Oops typo - meant ECA not EDA, in case you thought there was yet another TLA out there :D


   
(@cults14)
Reputable Member
Joined: 17 years ago
Posts: 367
Topic starter  

David, thanks for that. I hadn't considered a search tool as often we don't know what we're looking for until we find it. Hence stripping out metadata and filtering in Excel looking for anything unusual is how we've worked in the past.

If dtSearch can output Office doc metadata to Excel without specific search terms then OK. Hope I'm not too stuck in my ways here.
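The "filter in Excel and look for anything unusual" step could also be roughed out in code. The Python sketch below assumes a CSV export with a header row and a list of values we expect to see (our own company name, say); the function name, column name and expected values are all hypothetical.

```python
import csv
from collections import Counter


def unexpected_values(csv_path, column, expected):
    """Count values in column that are not in the expected set (case-insensitive)."""
    known = {value.strip().lower() for value in expected}
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value and value.lower() not in known:
                counts[value] += 1
    return counts.most_common()
```

Something like unexpected_values("metadata.csv", "Company", ["Our Company Ltd"]) would then list every other Company value with the number of files carrying it, most frequent first. It does not replace eyeballing the data, it just shortens the list.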


   
(@kovar)
Prominent Member
Joined: 18 years ago
Posts: 805
 

Greetings,

You may be a bit stuck in your existing methods with the result that some alternatives are closed to you. I don't know how you're determining "unusual" so I can't really suggest options for you. If you must use humans to detect unusual activity then you may be very limited in your options.

-David


   
keydet89
(@keydet89)
Famed Member
Joined: 21 years ago
Posts: 3568
 

cults14 wrote: "I appreciate what you're saying about Open Source, but I've come to this at a late stage in life from a non-techie background so have zero programming skills. We have a small development department but it's been scaled back recently and there's no slack at the moment."

I hear that a lot…but what I rarely hear is anyone asking to have a one-shot tool updated to do directories/subdirectories…


   
(@kovar)
Prominent Member
Joined: 18 years ago
Posts: 805
 

Greetings,

I'm working on a tool that'll run a variety of other tools over a list of images that are either already mounted or will be mounted by the tool. This is based on a request from one person, and talking to others about their workflow.

It could easily be extended to run a one shot tool over multiple directories, or over remote systems.

It'd not be terribly efficient though.

-David


   
keydet89
(@keydet89)
Famed Member
Joined: 21 years ago
Posts: 3568
 

David,

We should talk…I'm working on something similar.


   