There are some SSN standards… Sort of…
There will never be an SSN in which any one group is all zeros, i.e. 000-nn-nnnn, nnn-00-nnnn, and nnn-nn-0000 are all invalid.
The first two parts have to be below 772-80-nnnn, as of today. They represent the geographic location where the SSN was applied for or issued, depending on the issue date (before or after March 1972), and a batch number (the second set). The Social Security Administration publishes the issued batches monthly.
666-nn-nnnn is unassigned (albeit not officially acknowledged).
987-65-4320 through 987-65-4329 are used for example purposes.
The last four digits are just sequence numbers.
Here is some reference - http//
Hope this helps with that grep.
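For what it's worth, the rules above can be folded into a search roughly like the one below. This is only a sketch in Python: the names `SSN_RE`, `CEILING` and `find_ssns` are made up for illustration, it assumes hyphen-separated ASCII text, and it treats the "below 772-80" ceiling as a simple numeric comparison, which is a simplification that should be checked against the current SSA list.

```python
import re

# Three hyphen-separated groups; lookaheads reject all-zero groups and area 666.
SSN_RE = re.compile(
    r"(?<!\d)"               # not glued to a longer run of digits
    r"(?!000|666)(\d{3})"    # area: not 000, not 666
    r"-(?!00)(\d{2})"        # group: not 00
    r"-(?!0000)(\d{4})"      # serial: not 0000
    r"(?!\d)"
)

# Assumption: ceiling taken from the post above ("below 772-80-nnnn").
CEILING = (772, 80)


def find_ssns(text):
    """Return (offset, match) for every candidate SSN that passes the rules."""
    hits = []
    for m in SSN_RE.finditer(text):
        area, group, serial = (int(g) for g in m.groups())
        if (area, group) >= CEILING:
            continue  # above the highest area/group pair quoted above
        if area == 987 and group == 65 and 4320 <= serial <= 4329:
            continue  # reserved example range
        hits.append((m.start(), m.group(0)))
    return hits
```

Anything this returns is still only a candidate; it says nothing about whether the number was ever actually issued.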
Hogfly,
Okay, so after all that…do you have an EnScript for the OP? Do you have an EnScript that searches for PII?
*sigh* No, I haven't written one, though it's been planned by me and one other person. I suggested very valid alternatives that satisfy 90-95% of PII searches. The rest can be satisfied with a regex.
How is PII defined? How do you search for any person's first or last name across all the space on a hard drive? For a name to be considered PII, does it have to be paired with something else…such as an SSN?
It is even more complicated than that, if you are in the United States. Of the 50 states, there are 43 different legal requirements for reporting loss/theft of PII. Name and SSN are the most obvious. Until recently, it was the default in Arizona that the driver's license number was the SSN, and the Blues (health insurance) used to use the SSN as the identifier for the policy holder.
In some states, name and driver's license number or name and birthdate with one other identifier (address, phone number, mother's maiden name, etc.) is reportable.
Name, account number and PIN are reportable in many states, but you don't need to have the complete name. For example, if I have the user login and PIN for a brokerage account, that is PII. I had a case where someone had purchased a stolen laptop and examination of the disk revealed valid userids and PINs for a number of customers of a brokerage firm which could have been used to manipulate the account holders' holdings. These were discovered only because Internet history data showed signs that a single account had been accessed and using the account parameters (stupidly passed as parameters in the URL) we were able to find the file containing the rest of the accounts. But there were no names, per se.
Bottom line, as others have mentioned, is that PII can be very difficult to spot unless you know what you are looking for. You can try doing searches for things like "account" or "username" or "userid" or "passwd" or "pass"… well, you get the idea.
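A keyword sweep like that might look something like this in Python. Treat it as a rough sketch: `KEYWORDS`, `KEYWORD_RE` and `keyword_hits` are hypothetical names, the keyword list is just the handful mentioned above, and it only matches single-byte (ASCII) text, so strings stored as UTF-16 would need their own patterns.

```python
import re

# Longer keywords come first so "passwd" is preferred over "pass".
KEYWORDS = [b"account", b"username", b"userid", b"passwd", b"pass"]
KEYWORD_RE = re.compile(
    b"|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE
)


def keyword_hits(data, context=20):
    """Return (offset, keyword, surrounding bytes) for every keyword hit."""
    hits = []
    for m in KEYWORD_RE.finditer(data):
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        hits.append((m.start(), m.group(0).lower(), data[start:end]))
    return hits
```

The surrounding bytes are there because a bare hit on "pass" is meaningless until someone looks at what sits next to it.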
I'm not sure that some of what we're seeing posted in this thread is going to be beneficial.
Case in point…while I was a member of a team on the PCI QIRA list (I'm still on the team, but we're not on the list any longer), during one of my recertification sessions, a tool for locating PCI data was discussed. I was one of the few forensic responders in the session, as most of the attendees were assessors. I made the point that the tool mentioned was insufficient for use by QIRA teams, as it only notified you that a file had been found to contain PCI data. My point was that for assessors, only one valid credit card number needs to be identified on a system…for QIRA forensic responders, *ALL* possible credit card numbers need to be identified.
My point is that we have to take care in how we search for PII data, because the same conditions are true. All PII data needs to be identified for the purposes of notification. There can be serious consequences if only 80% of the PII data is revealed and only those individuals are notified.
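To illustrate the difference between "this file contains card data" and "here is every candidate number", a responder-style search could look roughly like the sketch below. It is not the tool discussed in that session; `CARD_RE`, `luhn_ok` and `find_card_numbers` are names invented here, the pattern is deliberately loose (13 to 19 digits with optional single spaces or hyphens), and a passing Luhn check only means a number is plausible, not that it is a real account.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(rb"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits):
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(data):
    """Return (offset, digits) for EVERY Luhn-valid candidate, not just the first."""
    hits = []
    for m in CARD_RE.finditer(data):
        digits = re.sub(rb"[ -]", b"", m.group(0)).decode("ascii")
        if luhn_ok(digits):
            hits.append((m.start(), digits))
    return hits
```

Recording the offset of each hit is the point here: it is what lets you go back and work out whose data it is for notification purposes.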
Another issue that needs to be recognized is that not all PII (or PHI) is in an easily searchable format. I have run a variety of searches on a system for PII data, all of which came up negative…only to find significant amounts of PII in scanned images (.TIF, etc.).
I guess the overall revelation about this issue is that examining a single system (say, just a laptop hard drive image) can be an intensive, iterative, manual (and hence expensive) process.
My point is that we have to take care in how we search for PII data, because the same conditions are true. All PII data needs to be identified for the purposes of notification. There can be serious consequences if only 80% of the PII data is revealed and only those individuals are notified.
Absolutely. On the other hand, there is tremendous cost to offering credit protection to individuals on the basis that their data may have been compromised. In addition to the direct costs, there is the indirect cost of damage to the reputation of the client who was keeper of the data (look at the Heartland case as an example).
Once you find what may be a valid name/SSN pair, verification is not inexpensive, especially if you have a large number of names. Credit reporting agencies typically charge between $15 and $45 per name/SSN pair whether they ultimately prove to be valid or not.
In other words, false positives can be as costly as false negatives. Either way, closing the barn door after the cow has escaped is expensive.
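To put the per-pair fee quoted above in perspective, the arithmetic is simple enough to sketch. The function name and the 10,000-pair figure in the usage note are purely hypothetical; only the $15 and $45 fees come from the post.

```python
def verification_cost_range(pair_count, low_fee=15, high_fee=45):
    """Return the (low, high) dollar cost of verifying pair_count name/SSN pairs.

    The fee is charged per pair whether or not the pair proves valid,
    so false positives are included in pair_count.
    """
    return pair_count * low_fee, pair_count * high_fee
```

For a hypothetical 10,000 candidate pairs, `verification_cost_range(10000)` works out to between $150,000 and $450,000 before any notification or credit protection costs.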
Sean,
Exactly. This is where the question of "was sensitive data processed by or stored on the system" is a real issue, even today…largely because in many of the incidents I and others respond to, the answer from the customer is, "I don't know."
Unfortunately, I think this is also where our industry suffers from the "CSI Effect", as well…