Personally Identifiable Information Hides in Dark Data

May 3, 2013

To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Their working definition encompasses vanilla identifiers: social security and credit card numbers, and all the other usual suspects. With the additional words “reasonable basis to believe that the information can be used to identify the individual”, HIPAA’s definition takes in digital handles such as emails, IP addresses and even facial imagery. But there’s a little more to HIPAA’s PII definition, and it applies specifically to free form text (commonly found in word processing documents, spreadsheets, presentations, etc.)

The complete list of HIPAA’s PIIs is enumerated in the law’s Safe Harbor guidelines. In plain-speak, these guidelines tell health IT administrators what information is considered private, requiring special authorization to view or process. It includes the aforementioned identifiers, as well as medical record numbers, health insurance IDs, and some others. By the way, we’ve conveniently put this PII list in our omnibus data protection compliance whitepaper.

An unstated assumption made by many is that PII only lives in structured formats—in other words, fields in a database. Readers of this blog of course know that PIIs are often likely to be harvested from the massive amounts of human generated dark data found on corporate files servers.

The HIPAA regulators have understood this as well. In clarifying the rules for removing PII —“de-identifying”—data for publication and general usage, they explicitly cover the possibility that PII can also reside in free-form text. I’ve excerpted the key paragraph from their de-identification best practices below :

PHI [protected health information] may exist in different types of data in a multitude of forms and formats in a covered entity.  This data may reside in highly structured database tables, such as billing records. Yet, it may also be stored in a wide range of documents with less structure and written in natural language, such as discharge summaries, progress notes, and laboratory test interpretations … The de-identification standard makes no distinction between data entered into standardized fields and information entered as free text (i.e., structured and unstructured text)— an identifier listed in the Safe Harbor standard must be removed regardless of its location.

Got that? PHI, which is essentially PII along with other sensitive medical information, embedded in spreadsheets, docs, and presentations is just as worthy of HIPAA privacy protections as fields in databases.

So if we follow these ideas—PIIs can be anything that reasonably links to an individual, and this data can exist in text—to their logical conclusion, then we need to consider a new possibility. Suppose this sentence from a doctor’s notes were uploaded to a file server:

The patient, a technical content specialist at Varonis, a software company, has been complaining about tennis elbow.

The natural question to ask is whether “technical content specialist at Varonis” is a PII?

It’s not a PII in the sense of a uniquely coded key such as social security number or health insurance ID that links back to a person. But in another sense, it acts very much like PII. Don’t believe me? Try typing that phrase into Google and see what comes up.

We’re really talking more about the meaning of the text—or as experts would say, the semantic value—rather than actual letters, numbers, and other syntax. But HIPAA’s Safe Harbor rule even takes this into account: it specifically notes that the “knowledge” in free text can also be used to point back to a person.

As a practical matter, the HIPAA rules mean that any reference to a patient’s job title and company is a violation of the law’s privacy protections.

This leads to a broader discussion on what’s called the “semantic web”. In brief, Google and a few others are already doing leading edge work on extracting meaning and knowledge from web content. You can see for yourself how well Google does this by entering the keywords “height of the empire state building” in a search. You’ll get back an actual answer, 1454’, in addition to all the docs with that exact phrase.

The larger point is that along with stealing PIIs, hackers and cyber thieves are also getting better at mining and interpreting human generated text for personal details, and then building more convincing fake identities to be used in social attacks, such as phishing and pretexting.

Bottom line: these bits and pieces of personal information that are scattered across file servers in clear-text documents can be used to identify an individual with very high likelihood.

That’s important to keep in mind when someone in your company asks, “do we know what’s in our files and the risks involved if our servers are breached?”


Clash of Compliance Cultures: Old vs. New World

February 11, 2013

In the last few years, US companies have not been shy about expressing their feelings on the EU’s Data Protection Directive (DPD). There’s a major social media player, for example, with a European HQ in Ireland that’s been publicly critical of a proposed “right to be forgotten” rule for letting consumers delete their online data. There’s also a search engine service that, while not openly objecting, is instead suggesting it’s already doing a darn good job of meeting the DPD’s rules.

US companies have begun to learn that the data privacy rules and expectations they’re accustomed to in the US are viewed differently on the other side of the Atlantic. The EU Charter–the European constitution—explicitly lists data protection as a fundamental right. That’s roughly like having a US amendment devoted to encryption, which, at this time, there isn’t.

This is not to say there’s a complete privacy compliance chasm between the US and EU.

Healthcare companies have long had extensive regulatory obligations under HIPAA for securing health information, alerting consumers about breaches, and gaining consent on information transfers. US companies in the banking and credit sectors could point to parallels in Gramm-Leach-Bliley and the Fair Credit Reporting Act.

While US medical and financial companies have had to deal with privacy and security legal burdens, that’s not been the case with the social media players. Because the Data Protection Directive covers all companies collecting data—not just ones in select, albeit important, industries—and through its Safe Harbor treaty it snags US firms as well, it’s not surprising that US Internet-based companies face the most culture shock when conducting business in the EU.

The ultimate issue is that in the new information economy data is revenue, and so deleting it is like, well, burning legacy paper currency.

Besides the right to data erasure differences, another sticking point between US social media companies and the EU is on rules for reasonable data retention limits. But this again reflects mostly differences between old and new economies.  After all, outside the social media world, it’s generally considered good security policy—limiting data breach liabilities—to keep PII data to a minimum and erase it when it’s no longer necessary. For example, the credit card vendors, through their PCI industry standard, emphatically remind corporations with regard to credit card numbers that “if you don’t need it, don’t store it! ”

But new regulatory forces along with changes in consumer attitudes may tilt social media companies towards a European view.

The FTC’s new privacy framework that was published earlier last year—and that I always come back to—calls for minimizing data collection of consumer data and sensible retention limits. There’s a (stalled) bill in the Senate, revealingly entitled “The Commercial Bill of Rights”, which will implement some EU-style data and privacy protections. The bill’s scope, by the way,  covers anycompany that “collects, uses, transfers, or stores covered information concerning more than 5,000 individuals.”

Good data protection and privacy best practices may one day become as American as espressos and lattes.



Defensible Disposal with Automation

September 13, 2012

It’s no secret that the data on corporate servers is growing exponentially. Documents, presentations, media, spreadsheets, and other files are constantly being created and moved onto servers, and after a while, most of it is rarely used, if at all. However, much of this stale data also must be retained in order to comply with regulatory compliance, or to maintain business continuity.

Many IT departments are faced with the reality of having to either continually expand their storage infrastructure or try to accurately determine which data can be safely disposed. The first option is costly and results in basically paying for information you’ll never use, while the latter can be costly in terms of man-hours and brainpower, especially without an automated process in place.

Let’s examine the options a bit closer.

Do Nothing

While it seems like a simpler solution to keep expanding your hardware and try to hold onto every bit just in case it is needed some time in the future, this sort of inaction with regards to defensible disposal is simply not a viable option. Allowing vast amounts of data to accumulate will make it increasingly difficult for users to find relevant data, slow down e-discovery, cause servers to perform poorly, and possibly even crash them, costing your business precious time and money.

Do Anything

Taking the wrong action can be just as damaging. Deleting your CEO’s old email archive might result in a very uncomfortable conversation; disposing of files that you are legally obligated to retain (for HIPAA, HITECH, SOX, etc.) can cost people their jobs, and possibly result in legal action. That’s something no IT professional ever wants to have to deal with.

Do the Right Thing

It should be clear by now exactly why proper defensible disposal techniques are integral to the survival of any business, especially those with sensitive data. Proper disposal techniques can save money and time by streamlining the process of deleting useless data and allowing for admins to focus on other more pressing needs.

If you’re finding the process itself takes quite a bit of planning and/or some sophisticated technology to do most of the heavy lifting, consider automating with technology like the Varonis Data Transport Engine. Varonis DTE simplifies the process of defensible disposal by leveraging our Metadata Framework, allowing admins to automatically and continually delete or migrate data based on a wide array of criteria, such as the content of the file or the date it was last accessed by a human user. This ensures that information that needs to be retained isn’t disposed of by accident and the data that can be safely deleted proceeds safely to bit-heaven, or bit bucket, or /dev/null.


Follow

Get every new post delivered to your Inbox.

Join 751 other followers