The first thing I usually get asked when investigating a healthcare system is whether PHI or PII resides on the system. People ask because, if it does, HIPAA may require the owners of the system to send out notifications within a specific time period.
Does anyone have any techniques they use to determine if PHI or PII resides on a system?
We should be able to do a keyword search on a drive after indexing the contents. I found this short list of regular expressions:
^.*(ssn|social|security).*$
^.*name.*$
^.*address.*$
^.*city.*$
^.*state.*$
^.*zip.*$
^.*county.*$
^.*precinct.*$
^.*(email|e-mail|mail).*$
Has anyone else compiled such a list, or does anyone have other ideas on how to automate this task?
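For what it's worth, here is a minimal Python sketch of how the patterns above could be run against a list of labels such as column or field names. The function name `flag_labels`, the case-insensitive matching, and the sample column names are all my own assumptions for illustration; they are not taken from any particular forensic tool.

```python
import re

# The patterns quoted above. Case-insensitive matching is an assumption;
# the original list does not say how case should be handled.
PII_PATTERNS = [
    r"^.*(ssn|social|security).*$",
    r"^.*name.*$",
    r"^.*address.*$",
    r"^.*city.*$",
    r"^.*state.*$",
    r"^.*zip.*$",
    r"^.*county.*$",
    r"^.*precinct.*$",
    r"^.*(email|e-mail|mail).*$",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PII_PATTERNS]


def flag_labels(labels):
    """Return the labels (e.g. column or field names) that match any pattern."""
    return [label for label in labels if any(rx.match(label) for rx in COMPILED)]


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    columns = ["patient_name", "SSN", "visit_date", "zip_code", "room_number"]
    print(flag_labels(columns))
```

Note that these patterns are plain substring tests, so they will also flag unrelated labels (for example, anything containing "estate" matches the "state" pattern). Hits are candidates for manual review, not proof that PII is present.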
The first thing I usually get asked when investigating a healthcare system is to identify whether PHI or PII reside on the system. People ask this because if it does then HIPAA …
The first thing I usually do when seeing acronyms in a new post is to ask the OP to spell them out
even if they are identifiable by the context.
jaclaz
When I delve into a new cultural sub-category, I like to get samples of known data.
That is, in your case I would get a database of personally identifiable information (PII) as it is structured in the target system and extract keywords from that.
I would do the same for protected health information (PHI), and anything else covered by the Health Insurance Portability and Accountability Act (HIPAA).
In my opinion, it is easier, and gives much better results, to use a sample of known data to find similar data than to try to guess.
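As a rough illustration of that idea, here is a small Python sketch that derives "shape" patterns from a few known values and then searches text for anything with the same shape. This is only one possible way to use known samples. The helper names (`shape_of`, `build_patterns`, `find_similar`) and the sample values are hypothetical, and real data from the target system would replace them.

```python
import re


def shape_of(value):
    """Turn a known value into a regex describing its shape.

    ASCII digits become a digit class, ASCII letters become a letter class,
    anything else is kept as an escaped literal. Runs of the same class are
    collapsed into a {n} repeat count.
    """
    parts = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            token = r"\d"
        elif ch.isascii() and ch.isalpha():
            token = "[A-Za-z]"
        else:
            token = re.escape(ch)
        if parts and parts[-1][0] == token:
            parts[-1][1] += 1
        else:
            parts.append([token, 1])
    return "".join(t if n == 1 else f"{t}{{{n}}}" for t, n in parts)


def build_patterns(known_values):
    """Compile one pattern per distinct shape found in the known sample."""
    shapes = {shape_of(v) for v in known_values}
    # Lookarounds keep a match from starting or ending inside a longer token.
    return [re.compile(r"(?<!\w)" + s + r"(?!\w)") for s in sorted(shapes)]


def find_similar(text, patterns):
    """Return every substring of text that has one of the known shapes."""
    hits = []
    for rx in patterns:
        hits.extend(m.group(0) for m in rx.finditer(text))
    return hits


if __name__ == "__main__":
    # Hypothetical known samples, standing in for values from the target system.
    known = ["123-45-6789", "MRN0012345"]
    text = "Patient MRN0098765 seen 2014-03-02, SSN 987-65-4321."
    print(find_similar(text, build_patterns(known)))
```

The point of working from samples is that the patterns follow whatever format the target system actually uses (record numbers, ID layouts and so on) rather than a generic guess. Free-text values such as names have no fixed shape, so for those the keyword approach is probably still needed.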