Hi, hope someone can help.
One of the aspects of my job is to routinely search for unwanted Intellectual Property on our systems, mainly in MS Office and PDF files - file and mail stores.
I have been using a (nearly) free tool called Metadataminer Catalogue which you point at a drive or folder, press OK, and it goes off and extracts as much metadata as it can, including (significantly for us) Company and Author for MS Office and Producer for PDF. It also gives the option of exporting to Excel, where we filter on the Company and Author columns for a quick return on time invested.
Whenever we go through an IP exercise, this is the route which turns up most stuff most quickly, and also helps point us in the right direction for additional keyword searches. It's been really helpful.
BUT - there's always a BUT isn't there?
Catalogue can't cope with Office 2007 and later "x" files (docx, xlsx, pptx) - it returns nothing in the Company and Author fields. And the publishers have no plans to update the product.
I've used Harlan's WMD.PL in the past but find this is suitable for small volumes, not the thousands we have to cope with. And AFAIK it's for Word only, not the remainder of the Office Suite.
So, does anyone out there know of anything which will do what I'm looking for? I seem to recall that pinpoint had something but I suspect that it's been incorporated into Harvester. NUIX collects the right stuff as far as I recall but that's not the point of NUIX. And the price point is prohibitive for us.
Any ideas anyone? We don't need to change metadata, just find it.
Cheers
I've used Harlan's WMD.PL in the past but find this is suitable for small volumes, not the thousands we have to cope with. And AFAIK it's for Word only, not the remainder of the Office Suite.
Actually, it does work pretty well on other document formats within the Office suite, prior to 2003. I also have other tools that can work, as well.
When you say that the tool is not suitable for the thousands of files that you have to cope with, what do you mean?
So, does anyone out there know of anything which will do what I'm looking for? I seem to recall that pinpoint had something but I suspect that it's been incorporated into Harvester. NUIX collects the right stuff as far as I recall but that's not the point of NUIX. And the price point is prohibitive for us.
Any ideas anyone? We don't need to change metadata, just find it.
If you're looking to add something like the MS Office suite beyond 97 (i.e., docx/pptx, etc.), I'd look to something like ExifTool or read_open_xml.pl, discussed here http//
If you're thinking that these tools don't work for you because they only handle one file at a time, remember that they're open source and can be modified to search through directories and subdirectories for files.
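To illustrate the directory-walking point, here is a minimal sketch in Python. It is not WMD.PL or read_open_xml.pl, just an assumption-laden stand-in: it treats each .docx/.xlsx/.pptx file as a zip archive, reads the creator from docProps/core.xml and the Company from docProps/app.xml, and writes one CSV row per file. The function names (read_ooxml_props, scan_tree, write_csv) and the column names are made up for the example.

```python
import csv
import os
import sys
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OOXML property parts.
DC = "{http://purl.org/dc/elements/1.1/}"
EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"

OOXML_EXTS = {".docx", ".xlsx", ".pptx"}


def _find_text(zf, part, tag):
    """Return the text of the first matching child element in a zip part, or ''."""
    try:
        root = ET.fromstring(zf.read(part))
    except (KeyError, ET.ParseError):
        # Part is missing from the archive or is not well-formed XML.
        return ""
    node = root.find(tag)
    return (node.text or "") if node is not None else ""


def read_ooxml_props(path):
    """Extract author (dc:creator) and company from one OOXML file."""
    with zipfile.ZipFile(path) as zf:
        return {
            "path": path,
            "author": _find_text(zf, "docProps/core.xml", DC + "creator"),
            "company": _find_text(zf, "docProps/app.xml", EP + "Company"),
        }


def scan_tree(top):
    """Walk top and every subdirectory, yielding one row per OOXML file."""
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            if os.path.splitext(name)[1].lower() in OOXML_EXTS:
                full = os.path.join(dirpath, name)
                try:
                    yield read_ooxml_props(full)
                except (zipfile.BadZipFile, OSError):
                    # Unreadable or corrupt file: keep the path, leave fields blank.
                    yield {"path": full, "author": "", "company": ""}


def write_csv(top, out_path):
    """Write the scan results to a CSV that can be opened and filtered in Excel."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["path", "author", "company"])
        writer.writeheader()
        writer.writerows(scan_tree(top))


if __name__ == "__main__" and len(sys.argv) == 3:
    write_csv(sys.argv[1], sys.argv[2])
```

Saved under a name of your choosing, it would be run with two arguments: the folder to scan and the CSV file to write. It only covers the 2007 and later formats; older .doc/.xls/.ppt files and PDFs would still need a separate tool.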
HTH
Hi Harlan, thanks for the prompt reply. I guess this is a cross between eDiscovery and Digital Forensics; you could call this EDA or Triage, depending. Part of the problem is that we don't always know what we're looking for until we find it – we're basically looking for unusual stuff, stuff that shouldn't be there.
In terms of thousands – we have a project upcoming where an agreement we have with a JV partner is coming to an end, and we need to flush any of their technical documentation out of our systems (and provide an audit trail). Although we're only talking about 30 users for this project, we have had others with upwards of 130 users.
That's local hard drives, attachments in local PSTs, shared drives on servers, Home drives on servers, attachments in Mailbox (Exchange), and attachments in Enterprise Vault (Exchange) – and we NEVER delete from EV. It's some piece of work. On one project, we had a couple of guys with over 50,000 data files each on their local hard drives which we needed to analyse; Catalogue gave us a heads-up within a couple of hours that we had a problem. There are other things we look out for, e.g. unusually large volumes of TIF, PDF or JPG files, which could indicate scanning of hard copy documents.
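The check for unusually large volumes of TIF, PDF or JPG files could be roughed out along these lines. This is only a sketch: the function names are hypothetical, and the threshold of 500 is an arbitrary assumption, since what counts as "unusually large" will depend on the user and the drive.

```python
import os
from collections import Counter

# Extensions that may indicate scanned hard copy documents.
SCAN_EXTS = {".tif", ".tiff", ".pdf", ".jpg", ".jpeg"}


def count_scan_like_files(top):
    """Count TIF/PDF/JPG files in each directory under top (recursive walk)."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(top):
        n = sum(1 for f in files if os.path.splitext(f)[1].lower() in SCAN_EXTS)
        if n:
            counts[dirpath] = n
    return counts


def flag_directories(top, threshold=500):
    """Return (directory, count) pairs at or above threshold, largest first."""
    return [(d, n) for d, n in count_scan_like_files(top).most_common()
            if n >= threshold]
```

Calling flag_directories on a user's home drive would list the folders worth a closer look; the counts are per folder, not totals for the whole tree.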
I appreciate what you're saying about Open Source, but I've come to this at a late stage in life from a non-techie background so have zero programming skills. We have a small development department but it's been scaled back recently and there's no slack at the moment.
I'll have a look-see if it's possible to do a command-line batch file based on ExifTool, with output to txt or (better) csv.
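For what it's worth, ExifTool does have options for recursing into subfolders (-r), restricting by extension (-ext) and writing CSV (-csv), so a batch run looks feasible. The sketch below wraps that in Python rather than a batch file. It assumes exiftool is installed and on the PATH, and the choice of tags (Author, Creator, Company, Producer) and extensions is a guess at what would be useful here, not something tested against real data.

```python
import subprocess


def build_exiftool_command(top,
                           tags=("Author", "Creator", "Company", "Producer"),
                           exts=("doc", "docx", "xls", "xlsx", "ppt", "pptx", "pdf")):
    """Build an exiftool command line: recursive scan, CSV output, selected tags."""
    cmd = ["exiftool", "-r", "-csv"]
    for ext in exts:
        cmd += ["-ext", ext]          # only process files with these extensions
    cmd += ["-" + t for t in tags]    # only report these tags
    cmd.append(top)                   # folder to scan
    return cmd


def run_to_csv(top, out_path):
    """Run exiftool over top and save its CSV output to out_path."""
    with open(out_path, "wb") as fh:
        # check=False: do not raise on a non-zero exit status, so whatever
        # output exiftool produced is still saved even if some files failed.
        subprocess.run(build_exiftool_command(top), stdout=fh, check=False)
```

The resulting CSV has one row per file with a SourceFile column plus one column per tag found, which should open in Excel for the usual filtering on Company and Author.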
I may end up going with the begging bowl though...
Greetings,
Long shot, but have you looked at dtSearch's network spider version? That may do what you want.
-David
Oops, typo - meant ECA not EDA, in case you thought there was yet another TLA out there :D
David, thanks for that. I hadn't considered a search tool as often we don't know what we're looking for until we find it. Hence stripping out metadata and filtering in Excel looking for anything unusual is how we've worked in the past.
If dtSearch can output Office doc metadata to Excel without specific search terms then OK. Hope I'm not too stuck in my ways here.
Greetings,
You may be a bit stuck in your existing methods with the result that some alternatives are closed to you. I don't know how you're determining "unusual" so I can't really suggest options for you. If you must use humans to detect unusual activity then you may be very limited in your options.
-David
I appreciate what you're saying about Open Source, but I've come to this at a late stage in life from a non-techie background so have zero programming skills. We have a small development department but it's been scaled back recently and there's no slack at the moment.
I hear that a lot…but what I rarely hear is anyone asking to have a one-shot tool updated to do directories/subdirectories…
Greetings,
I'm working on a tool that'll run a variety of other tools over a list of images that are either already mounted or will be mounted by the tool. This is based on a request from one person, and talking to others about their workflow.
It could easily be extended to run a one shot tool over multiple directories, or over remote systems.
It'd not be terribly efficient though.
-David
David,
We should talk…I'm working on something similar.