Revealed: Secret PIIs in your Unstructured Data!


Personally identifiable information or PII is pretty intuitive. If you know someone’s phone, social security, or credit card number, you have a direct link to their identity. Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take your money. In some of my recent blogging, I’ve referred to the blurring of lines between PII and non-PII data. Case in point: it’s been known for at least 10 years that there are specific pieces of data, which in isolation may appear anonymous, but when taken together they’re just as effective at identifying a person as traditional PII.

The easiest to understand of these so called quasi-PIIs is the trio of full birth date, zip code, and gender. If a company  published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could with very high likelihood find the name and address of the person behind that data.

Why would this work?  At a very basic level, the identity thief is effectively doing the work of a detective–essentially going through lists looking for matches. The lists in this case are voting records, which are available from most US towns and counties at a nominal fee– typically around $40. Voting records contain name, address, and most importantly full birth date; zip codes can be easily determined from the address.

By looking for matching birth dates and zip codes, savvy hackers narrow down the search to a few names. Add gender information and for most zip codes in the US, hackers can arrive at a unique name. Of course, the more additional information or clues gathered, especially taken from social media and other web sites, the easier it is to filter out names when there’s more than one candidate.

A quick back of the envelope calculation tells you why one might do very well with this approach. Taking 365 days—ignoring leap years—and multiplying by an average age of 80, it works out that a complete birth date gives 29,200 “bins” to place a zip code’s worth of people. If you have gender information, you double the number of slots, to 58,400.

I can hear nitpickers out there saying that voting rolls contain names of those over the age of 18, so you would have to remove 6570 slots. True enough, but researchers have shown it’s possible to exploit Facebook’s leaky handling of data on school age minors to partially address this gap.

In any case, based on the last US census, there are over 40,000 zip codes, with an average of only 7000 people per zip code. On a gut level, it seems there’s a good chance most of those 7000 people will find themselves alone in one of those 58,400 slots. In other words, the odds are very good that most of them won’t share the same date of birth, zip code, and gender.

The real validation of this type of  hacking attack came from Carnegie Mellon University computer science professor and data privacy expert Latanya Sweeney, who ran the numbers back in 2000. Using then current census data (broken down by zip codes and age groups), she was able to identify 87% of the people in the US working with just those three non-PIIs.

Fortunately, Sweeney’s research and results from other experts have made their way to policy makers. For example, when medical research on patients is published, HIPAA’s Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Full dates (e.g., admission, birth) must also have the year removed.

With US regulations on PII varying by the particular legislation, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification. They’ve called for all companies to achieve a “reasonable level of confidence” that their public data can’t be linked back to an individual. Clearly, the combination of birth date, zip code, and gender would fail that test.

Are there other quasi-PII’s out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on web sites and social forums. In a possible scenario, think of an online retailer collecting preference data about its customers—sports interests, hobbies, etc.—along with geographic data and perhaps income information.

These data items would not be considered traditional PII.  If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data.  Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.

For companies that want to stay ahead of the coming stricter de-identification rules—that are being considered here in the US  and will likely become law in the EU—it would be worth their while to start carefully reviewing their non-PII data. Wherever that data might be on their file system.

Advertisements

2 thoughts on “Revealed: Secret PIIs in your Unstructured Data!

  1. Hey there! Would you mind if I share your blog with
    my twitter group? There’s a lot of people that I think would really appreciate your content. Please let me know. Cheers

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s