Mass marketing vs personalisation (infographic)

May 9, 2013

85 percent of us know that websites track our online shopping behavior, a new report from ecommerce optimization company Monetate says, and 75 percent of us want retailers to use our personal information to customize our shopping experiences.

That’s going back to the future, according to Monetate: going back to a time when all commerce was personal.

But there is a yin and a yang here.

While we may want personalized experiences, and we want websites to be smart — to know us, essentially, and act as an intelligent, solicitous person might — privacy is part of the picture. A good third of us don’t want our website activity tracked, and a quarter of us don’t want the websites we shop to personalize our experience at all.

Monetate has four tips for online retailers:

  1. Use marketing automation technology and big data to assist with personalization
  2. Target segments with relevant content based on what you know about them
  3. Don’t think of channels, think of customers first
  4. Be in it for the long haul, not the quick win

All the data, in visual form:

[Infographic: personal vs. mass marketing]
Read more at http://venturebeat.com/2013/05/07/mass-marketing-vs-personalization-infographic/#qItF8VoBijgGBY1R.99


Personally Identifiable Information Hides in Dark Data

May 3, 2013

To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Its working definition encompasses vanilla identifiers: social security and credit card numbers, and all the other usual suspects. With the additional words “reasonable basis to believe that the information can be used to identify the individual”, HIPAA’s definition takes in digital handles such as email addresses, IP addresses and even facial imagery. But there’s a little more to HIPAA’s PII definition, and it applies specifically to free-form text (commonly found in word processing documents, spreadsheets, presentations, etc.).

The complete list of HIPAA’s PIIs is enumerated in the law’s Safe Harbor guidelines. In plain-speak, these guidelines tell health IT administrators what information is considered private, requiring special authorization to view or process. It includes the aforementioned identifiers, as well as medical record numbers, health insurance IDs, and some others. By the way, we’ve conveniently put this PII list in our omnibus data protection compliance whitepaper.

An unstated assumption made by many is that PII only lives in structured formats, in other words, fields in a database. Readers of this blog of course know that PII can often be harvested from the massive amounts of human-generated dark data found on corporate file servers.

The HIPAA regulators have understood this as well. In clarifying the rules for removing PII from data (“de-identifying” it) for publication and general usage, they explicitly cover the possibility that PII can also reside in free-form text. I’ve excerpted the key paragraph from their de-identification best practices below:

PHI [protected health information] may exist in different types of data in a multitude of forms and formats in a covered entity.  This data may reside in highly structured database tables, such as billing records. Yet, it may also be stored in a wide range of documents with less structure and written in natural language, such as discharge summaries, progress notes, and laboratory test interpretations … The de-identification standard makes no distinction between data entered into standardized fields and information entered as free text (i.e., structured and unstructured text)— an identifier listed in the Safe Harbor standard must be removed regardless of its location.

Got that? PHI, which is essentially PII along with other sensitive medical information, embedded in spreadsheets, docs, and presentations is just as worthy of HIPAA privacy protections as fields in databases.
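To make the regulators' point concrete, here is a minimal Python sketch that scans free-form text for a few identifier patterns. The three patterns (SSN-like numbers, email addresses, IPv4 addresses) and the function name `find_identifiers` are illustrative assumptions; they are not the Safe Harbor list, and real classification needs far more than regular expressions.

```python
import re

# Hypothetical, deliberately simple patterns; not a complete Safe Harbor list.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_identifiers(text):
    """Return {label: [matches]} for every pattern that hits in free text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits

# Made-up note text for illustration only.
note = "Pt. SSN 123-45-6789, contact jdoe@example.com, seen from 10.0.0.12."
print(find_identifiers(note))
```

Pattern matching of this kind only catches identifiers with a recognizable syntax. It says nothing about what a sentence means.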

So if we follow these ideas—PIIs can be anything that reasonably links to an individual, and this data can exist in text—to their logical conclusion, then we need to consider a new possibility. Suppose this sentence from a doctor’s notes were uploaded to a file server:

The patient, a technical content specialist at Varonis, a software company, has been complaining about tennis elbow.

The natural question to ask is whether “technical content specialist at Varonis” is a PII?

It’s not a PII in the sense of a uniquely coded key such as social security number or health insurance ID that links back to a person. But in another sense, it acts very much like PII. Don’t believe me? Try typing that phrase into Google and see what comes up.

We’re really talking more about the meaning of the text—or as experts would say, the semantic value—rather than actual letters, numbers, and other syntax. But HIPAA’s Safe Harbor rule even takes this into account: it specifically notes that the “knowledge” in free text can also be used to point back to a person.

As a practical matter, the HIPAA rules mean that any reference to a patient’s job title and company is a violation of the law’s privacy protections.

This leads to a broader discussion on what’s called the “semantic web”. In brief, Google and a few others are already doing leading edge work on extracting meaning and knowledge from web content. You can see for yourself how well Google does this by entering the keywords “height of the empire state building” in a search. You’ll get back an actual answer, 1454’, in addition to all the docs with that exact phrase.

The larger point is that along with stealing PIIs, hackers and cyber thieves are also getting better at mining and interpreting human generated text for personal details, and then building more convincing fake identities to be used in social attacks, such as phishing and pretexting.

Bottom line: these bits and pieces of personal information that are scattered across file servers in clear-text documents can be used to identify an individual with very high likelihood.

That’s important to keep in mind when someone in your company asks, “do we know what’s in our files and the risks involved if our servers are breached?”


FTC Warning on Sharing Files in the Cloud

March 26, 2013

As part of a research project I’m doing on data breaches, I came across some great practical advice about file sharing in the cloud, courtesy of the Federal Trade Commission. By the way, the FTC also has  extensive information on security incidents. In any case, this 2010 report warns businesses to carefully review the risks of sharing data outside the corporate intranet via cloud services.

The FTC reminds medical and financial organizations that they are under special obligations to protect social security and bank account numbers, healthcare data, and other personal information. But any business that has PII that can potentially leak out of its IT infrastructure will find the FTC’s guidelines very useful.

It’s not that the FTC is against external data sharing in the cloud—which they refer to in the report as P2P file sharing—but they ask companies to consider the risks. Here’s a key section that nicely summarizes the drawbacks:

People who use P2P file sharing software can inadvertently share files. They might accidentally choose to share drives or folders that contain sensitive information, or they could save a private file to a shared drive or folder by mistake, making that private file available to others. In addition, viruses and other malware can change the drives or folders designated for sharing, putting private files at risk  … Once a user on a P2P network downloads someone else’s files, the files can’t be retrieved or deleted. What’s more, files can be shared among computers long after they have been deleted from the original source computer …

And for those companies that do use P2P, the FTC suggests a few measures to improve security:

  • Bring the P2P software in-house and only give access to authorized users
  • Delete sensitive information you don’t need, and restrict where files with sensitive information can be saved
  • Use appropriate file-naming conventions that are less likely to disclose the contents
  • Monitor your network to detect unapproved P2P file sharing programs

If you’re currently looking for an in-house solution that satisfies the requirements above, check out DatAnywhere. DatAnywhere offers the cloud experience without the cloud. It’s a no-compromise security solution that uses your organization’s existing file sharing infrastructure to provide file sync services, mobile device access, browser access, and 3rd party collaboration.


Revealed: Secret PIIs in your Unstructured Data!

March 26, 2013

Personally identifiable information or PII is pretty intuitive. If you know someone’s phone, social security, or credit card number, you have a direct link to their identity. Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take your money. In some of my recent blogging, I’ve referred to the blurring of lines between PII and non-PII data. Case in point: it’s been known for at least 10 years that certain pieces of data, which may appear anonymous in isolation, are just as effective at identifying a person as traditional PII when taken together.

The easiest to understand of these so-called quasi-PIIs is the trio of full birth date, zip code, and gender. If a company published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could with very high likelihood find the name and address of the person behind that data.

Why would this work? At a very basic level, the identity thief is effectively doing the work of a detective, essentially going through lists looking for matches. The lists in this case are voting records, which are available from most US towns and counties at a nominal fee, typically around $40. Voting records contain name, address, and most importantly full birth date; zip codes can be easily determined from the address.

By looking for matching birth dates and zip codes, savvy hackers narrow down the search to a few names. Add gender information and for most zip codes in the US, hackers can arrive at a unique name. Of course, the more additional information or clues gathered, especially taken from social media and other web sites, the easier it is to filter out names when there’s more than one candidate.
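The matching step can be sketched in a few lines of Python. This is only an illustration: the field names, the record layout, and the sample rows are all made up, and the function name `link_records` is hypothetical. It assumes both lists carry birth date, zip code, and gender in a comparable form.

```python
from collections import defaultdict

def link_records(deidentified, voter_roll):
    """Pair each de-identified row with the voter-roll names that share
    the same (birth_date, zip_code, gender) triple."""
    index = defaultdict(list)
    for voter in voter_roll:
        key = (voter["birth_date"], voter["zip_code"], voter["gender"])
        index[key].append(voter["name"])
    matches = []
    for row in deidentified:
        key = (row["birth_date"], row["zip_code"], row["gender"])
        matches.append((row, index.get(key, [])))
    return matches

# Made-up sample data for illustration only.
voter_roll = [
    {"name": "A. Example", "birth_date": "1970-03-14", "zip_code": "10001", "gender": "F"},
    {"name": "B. Sample", "birth_date": "1970-03-14", "zip_code": "10001", "gender": "M"},
    {"name": "C. Person", "birth_date": "1982-11-02", "zip_code": "10001", "gender": "F"},
]
deidentified = [
    {"birth_date": "1970-03-14", "zip_code": "10001", "gender": "F", "diagnosis": "tennis elbow"},
]

for row, names in link_records(deidentified, voter_roll):
    status = "unique match" if len(names) == 1 else f"{len(names)} candidates"
    print(row["diagnosis"], "->", names, f"({status})")
```

In this toy data, birth date and zip code alone leave two candidates. Adding gender narrows the list to one.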

A quick back of the envelope calculation tells you why one might do very well with this approach. Taking 365 days (ignoring leap years) and multiplying by a lifespan of roughly 80 years, it works out that a complete birth date gives 29,200 “bins” in which to place a zip code’s worth of people. If you have gender information, you double the number of slots, to 58,400.

I can hear nitpickers out there saying that voting rolls contain names of those over the age of 18, so you would have to remove 6570 slots. True enough, but researchers have shown it’s possible to exploit Facebook’s leaky handling of data on school age minors to partially address this gap.

In any case, based on the last US census, there are over 40,000 zip codes, with an average of only 7000 people per zip code. On a gut level, it seems there’s a good chance most of those 7000 people will find themselves alone in one of those 58,400 slots. In other words, the odds are very good that most of them won’t share the same date of birth, zip code, and gender.
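The gut feeling can be checked with a rough calculation. The sketch below uses the figures quoted above and adds one strong assumption of its own: that each person falls into a slot independently and uniformly at random. Real populations are not uniform across ages or birthdays, so treat the result as a ballpark only. The function name `prob_alone` is hypothetical.

```python
DAYS_PER_YEAR = 365      # ignoring leap years, as in the text
LIFESPAN_YEARS = 80      # the text's figure
GENDERS = 2
PEOPLE_PER_ZIP = 7000    # the text's average per zip code

slots = DAYS_PER_YEAR * LIFESPAN_YEARS * GENDERS  # 58,400 slots

def prob_alone(people, slots):
    """Chance that a given person shares a slot with nobody else,
    assuming independent, uniform placement (a simplification)."""
    return (1 - 1 / slots) ** (people - 1)

print(slots)
print(round(prob_alone(PEOPLE_PER_ZIP, slots), 3))
```

Under that uniform assumption, a bit under nine in ten people in an average zip code would sit alone in their slot.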

The real validation of this type of hacking attack came from Carnegie Mellon University computer science professor and data privacy expert Latanya Sweeney, who ran the numbers back in 2000. Using then-current census data (broken down by zip codes and age groups), she was able to identify 87% of the people in the US working with just those three non-PIIs.

Fortunately, Sweeney’s research and results from other experts have made their way to policy makers. For example, when medical research on patients is published, HIPAA’s Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Dates (e.g., admission, birth) must also be reduced to the year alone.

With US regulations on PII varying by the particular legislation, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification. They’ve called for all companies to achieve a “reasonable level of confidence” that their public data can’t be linked back to an individual. Clearly, the combination of birth date, zip code, and gender would fail that test.

Are there other quasi-PII’s out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on web sites and social forums. In a possible scenario, think of an online retailer collecting preference data about its customers—sports interests, hobbies, etc.—along with geographic data and perhaps income information.

These data items would not be considered traditional PII.  If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data.  Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.

For companies that want to stay ahead of the coming stricter de-identification rules, which are being considered here in the US and will likely become law in the EU, it would be worth their while to start carefully reviewing their non-PII data, wherever that data might be on their file system.


Is DNA Really Personally Identifiable Information (PII)? No. Maybe? Yes!

February 5, 2013

Biometric data is at the limits of what current personal data privacy laws consider worthy of protection. This type of identifier covers fingerprints, voiceprints, and facial images. While the risk factors are not nearly as threatening to consumers as more traditional PII, they do exist. Until recently, the dangers of biometric identification using DNA were more theoretical than real. That has suddenly changed. An article in The New York Times last month put a spotlight on research that proved the feasibility of identifying a person—getting a specific name and address—all from a DNA sequence posted online.

It’s not that regulators have overlooked biometric identifiers. Under HIPAA’s safe harbor rules, for example, the Department of Health and Human Services has a list of 18 e-PHIs that would need to be removed from public medical data for it to be effectively considered de-identified. Along with IP addresses, URLs, email addresses, HHS mentions biometric data, with voiceprints and fingerprints given as the only examples.

I’ve already written about how the Federal Trade Commission, another key US agency involved in data privacy regulation, has issued new guidelines to companies collecting facial images. Driving the FTC’s suggestions—mostly directed at retailers—are the recent improvements in image recognition technology and the availability of massive amounts of tagged photos on social media sites. Image matching software is now good enough so that a face captured by a store’s mall kiosk can eventually reveal ethnicity, mood, and with good likelihood, an actual name behind the face.

The risk of linking a name to a set of fingerprints is less serious for the general public, unless you have a criminal record. However, after the Graduate Management Admission Council (GMAC) began using fingerprints to establish the identity of students taking their “GMATs” for admission to US business schools, the testing company realized there could be privacy issues.

GMAC ultimately decided to use palm scans, which are based on digitizing vein patterns. Since public databases of hand veins don’t exist, the possibility of identification is eliminated.

I would have put DNA into the same category as palm scans: there’s advanced matching technology—available even at the consumer level—but without a public database, there isn’t much of a privacy issue, and therefore DNA is not really a PII.

However, this is not true anymore, and that was the starting point for the researchers mentioned in the Times article. There are actually two public genealogy databases for tracking down one’s ancestry, Ysearch and SMGF, with a combined 135,000 records of DNA data and covering about 39,000 unique last names.

These genealogy databases simply accept a key—actually a pattern on the Y-chromosome—and then return a surname (along with a confidence level). The idea behind these services is to help subscribers find their ancestors and learn more about family backgrounds.

The researchers then examined whether they could narrow down their search. They assumed that they had the state of residency of the subject along with a birthdate—both of these, by the way, are not considered PII under current HIPAA rules. With these three data points and public US Census data, they were able to prove that successful DNA matches would lead to just 12 people on average. That’s a stunning end result from starting with just a DNA pattern.

How good is the DNA “keyword” match at finding a last name? The researchers projected a success rate of 12% for males—since it’s based on the Y chromosome—with a 5% false positive. This is not nearly as accurate as the facial scans, but still a cause for concern. They concluded that the risk of this DNA-based last name search will grow in the future, and there are other scientists and experts who are calling for more public discussion.

I decided to check the privacy policy of one of the DNA testing services. Here’s the good news. They’ll only release your DNA data to third parties with your consent; they treat genetic data as personal data (like name and address), and they say that the genetic data is stored on “secure servers”.

However, thinking purely in terms of bytes, folders, and access rights, I’m wondering how truly secure those DNA files are, and whether there are already hackers looking to get that data using the same techniques and exploits they use to snatch credit card numbers and other personally identifiable information.


The #1 legal concern: data security

January 30, 2013

Inside Counsel magazine recently reported that data security is the top issue cited by more than half of in-house lawyers. This was reflected in a conversation yesterday at the IACCM Board Meeting, where both lawyers and non-lawyers highlighted its growing importance.

The Inside Counsel article focuses on the need to understand the nature of the data possessed within a business and then to take steps for its protection. It concentrates largely on worries over regulatory compliance and reporting, so various forms of personal data lie at the forefront of concerns. Since some level of hacking appears inevitable, the advice relates largely to the steps needed to limit potential fines and to eliminate the need for reporting. Much of this revolves around encryption, but also the need to analyze data flows to ensure weak spots are identified.

At the IACCM meeting, perhaps because more of the companies represented are b2b, the focus was somewhat different. For them, data security was also about critical business data – product development, strategic plans, customer records. The concern is more around the exposure that arises from links with trading partners – the extent to which shared systems or information access creates a gateway to wider data loss. The implications of this force companies to consider a wider array of solutions. This includes terms and conditions that commit trading partners to appropriate steps and contain penalties for failure. It often incorporates some right of audit or validation.

But ultimately, terms and conditions are a relatively weak form of protection, because the most likely reasons for a data security breach are that a trading partner lacks size and sophistication or that it lacks integrity. And these issues will typically be fixed in only one of two ways: do the work in-house, or select top quality partners who cannot afford reputational damage.


7 Recommendations for Data Protection by Forrester’s Andras Cser

November 27, 2012

by David Gibson

Last week Varonis hosted a webinar on using strong identity context to help protect data, where I was joined by Andras Cser of Forrester. Andras shared really interesting insights on the impact of data breaches, what got stolen, how they happened, and what you can do to better protect yourself.

On the topic of entitlement reviews, Andras shared, “You have to get into a fairly rigid and rigorous structure of attestations, and basically that means you would want to have a campaign that runs every quarter, clearly understand the mappings between people, groups and resources that they’re accessing, and have managers look at their employees’ access rights, data elements, data access, and also application users should be granted some way of overseeing who has access to the data their application actually generates.”
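As a rough illustration of the mappings Andras describes, the Python sketch below resolves people to groups to resources and bundles the result per manager for a review cycle. Everything here is hypothetical: the data layout, the names, the paths, and the function `build_review_campaign`. It is not how any particular product does it.

```python
from collections import defaultdict

def build_review_campaign(managers, memberships, group_access):
    """Return {manager: {employee: sorted resources}} for one review cycle.

    managers:     {employee: manager}
    memberships:  {employee: [groups]}
    group_access: {group: [resources]}
    """
    campaign = defaultdict(dict)
    for employee, manager in managers.items():
        resources = set()
        for group in memberships.get(employee, []):
            resources.update(group_access.get(group, []))
        campaign[manager][employee] = sorted(resources)
    return dict(campaign)

# Made-up sample data for illustration only.
managers = {"alice": "dana", "bob": "dana", "carol": "erin"}
memberships = {"alice": ["finance"], "bob": ["finance", "hr"], "carol": []}
group_access = {
    "finance": ["/shares/ledger"],
    "hr": ["/shares/payroll", "/shares/reviews"],
}

for manager, staff in build_review_campaign(managers, memberships, group_access).items():
    for employee, resources in staff.items():
        print(f"{manager} reviews {employee}: {resources}")
```

Each manager then sees one line per employee and can attest to the access or flag it.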

Andras also shared illuminating key case studies from organizations that are protecting hundreds of terabytes to petabytes of data that are growing at 1-2.5% per week. It was fun for me to hear a fresh perspective on what works and what doesn’t when you’re trying to manage and protect data at scale.

Some of Andras’ recommendations were:

To see all seven of Andras’ recommendations, register to download and watch the full data protection webinar here.


The New Privacy Environment: European Union Leads the Way on Personal Data Protection

October 24, 2012

We all understand the risks in accidentally revealing a social security number. But are there other pieces of less identifying or even anonymous information that taken together act like a social security number? The European Union is breaking new ground on consumer privacy as it begins to reform its own regulations. The EU’s broader ideas on personal identity have even made their way across the pond into proposed new US regulations.

The history of the European Union’s consumer privacy and data security regulations begins with its 1995 Data Protection Directive, or EU 95/46/EC for security wonks. EU directives provide guidance to its member nations’ legislatures, who then are free to craft their own specific laws. The DPD has been influential in shaping the vocabulary and, less charitably, the jargon of the consumer privacy discussion on both sides of the Atlantic.

In the US, the starting point for discussion on data security is Sarbanes-Oxley, which became law in 2002. In comparing and contrasting the two, it’s fair to say the DPD was more focused on securing consumer information and, unlike SOX, more inclusive in covering both public and private companies. To this day there’s no single comprehensive law on consumer privacy in the US.

The EU’s original directive is significant because it defined personal data as “information relating to an identified or identifiable natural person”. For example, by EU rules, street address, name, and phone number are personal data; height, eye color, and model of car you drive are not. This notion of personal data as a type of key is part of the definition used in privacy laws outside the EU–including the US. In North America, though, we’ve come up with our own term for personal data, calling it instead “personally identifiable information” or PII.

By the way, the EU regulators intentionally created a less explicit definition of personal data so that it would encompass new technologies. In 2012, data related to an identifiable person could now be an email address, IP address, and for some EU nations, even a photo image.

To bring the story up to date, security experts began to realize that along with personal data there was other data, let’s call it quasi-personal, that if released could also be used to relate back to an individual. The data magic to accomplish identification typically requires matching a collection of anonymous data points (birth dates or years, zip codes, ethnicity, and perhaps car model driven) against publicly available databases.

For example, there are well documented cases involving anonymized hospital discharge records subsequently used to re-identify the original patients!

With Facebook now up to 1 billion active users, it’s fair to say that the Web is overflowing with personal data at all levels of detail. Essentially social networks have provided hackers, the new ominous player on the scene, with a huge public repository to match against (cf. Matt Honan).

To get a better understanding of how it’s possible to re-identify an individual, let’s review a variation on the aforementioned case. While the technique is not always guaranteed to uniquely identify a person (this depends on the available related information), it can often produce a narrowed down list of highly likely subjects.

Suppose, for argument’s sake, a European mortgage company analyzes a health report from a large public hospital. The records show that five individuals were being treated for a rare disease. Their ages were also published. Assuming the patients live near the hospital, the mortgage lender then simply filters its database on zip code and birth year. Working with a smaller set of records, it then scans social media sites or other online forums, filtering on the retrieved names and other data, all the while looking, for say, “get well” messages. If it finds a few matches, and with the additional new data points from the social site … I think you see where this is leading.

The good news is that the EU countries have long recognized that their laws have not kept pace. And the EU governing body is currently in the process of reforming the 1995 directive, taking into account the new realities of public data on the Web and the blurring of personal and anonymous data. To get a sense of the EU’s new thinking on personal data, refer to this work-in-progress paper.

And there are also rumblings of change in the US along the same lines as the EU reforms.


The State of Data Protection [INFOGRAPHIC]

September 28, 2012

In the age of big data, businesses are creating, processing, storing, and sharing information at an alarming rate. A significant amount of the data is highly sensitive or confidential and should be properly safeguarded. It’s unnerving to think about the possibility of our own personal information sitting on servers, possibly unencrypted and open to everyone.

We hope that companies are complying with SOX, HIPAA, PCI, and other regulations but, as we know, hope is not a strategy – so we decided to take a hard look at the current state of data protection.

In March of 2012 we surveyed over 200 individuals in the IT community, asking about their current data protection practices and confidence levels, and how data protection practices correlate with data protection activities.

The results may surprise you. While over 80% reported that they store data belonging to customers, vendors, and other business partners, only 26% reported being very confident that data stored within their organization is protected.

Enjoy, share, embed our infographic and download the full report to learn which data protection activities truly matter.

[Infographic: The State of Data Protection]


European Data Protection Reform Update: Summary of the 25 January 2012 Announcement

May 30, 2012

I know we are a few months out, but we spotted this really interesting information on the European Data Protection Reform:

Summary of the Changes

The following key areas of the reform will impact on privacy and data protection compliance for organisations:

  • A Single Set of Rules: The Proposed Regulation provides for a single set of rules for all organisations processing personal data in the European Union. It will replace the first Data Protection Directive (published in 1995), which will be repealed. This Proposed Regulation will have direct effect in all Member States and, as a result, will achieve greater harmonisation than if the reform was made by a revised Directive, which carries with it a risk of inconsistent implementation by Member States, as witnessed with the implementation of the Data Protection Directive. In addition to the Proposed Regulation, there will be a new Directive on protecting personal data processed for the purposes of prevention, detection, investigation or prosecution of criminal offences and related judicial activities.
  • Fines: National data protection authorities will be allowed to impose fines of up to 2% of the worldwide gross revenue of an organisation. The 2011 proposal had set this amount at 5% of worldwide gross revenue.
  • “One-Stop Shop”: The Proposed Regulation implements a “one-stop shop” approach to data protection compliance in the European Union, meaning that an organisation only needs to comply with the data protection laws in place in the jurisdiction in which it has its main establishment. This is similar to the passporting system and principle of home state supervision, which is already reflected in European financial services regulation. In addition, the Proposed Regulation will have extra-territorial effect. This means it will apply to organisations (such as many U.S. businesses) that are not established in the European Union, but are active in the European Union market and offer their services to European Union citizens.
  • Data Breach Notification: The Proposed Regulation imposes a general requirement on all businesses to notify data protection authorities and data subjects in the event of a data breach. Notice of data breaches must be provided to the data protection authority “where feasible” within 24 hours, and to affected data subjects “without undue delay.” While breach notification has recently become a requirement for telecommunications and internet service providers, the Proposed Regulation extends this requirement to all organisations. Given the increase in global cyber risks and the reputational impact and associated costs of data losses and breaches, this aspect of the reform is likely to have a significant impact on organisations.
  • Consent: Where consent is to be used as a justification for processing personal data, the Proposed Regulation requires that it must be given explicitly, rather than assumed. This will cause particular concern for e-commerce organisations worried about how to obtain consent without detrimentally affecting the user experience.
  • Data Portability: The Proposed Regulation also introduces a new individual right of data portability, which is designed to facilitate an individual’s access to personal data. This requires organisations to permit customers to move their data to new organisations offering similar products or services. This is also intended to improve competition among services. While this may sound relatively straightforward, in practice the costs of migrating data from one system to another can vary significantly, and may be particularly burdensome for cloud providers and social networks.
  • The “Right to be Forgotten”: The Proposed Regulation also adds a new “right to be forgotten” which allows an individual to require an organisation to delete personal data where there is no longer any legitimate reason for keeping it. This new right is more stringent than the existing obligation for data controllers not to keep data for longer than is necessary.
  • International Transfer of Data: The Proposed Regulation provides for a shift in the rules to reflect the way that data is currently transferred internationally. They seek to address the problem that current data protection laws function only within a given territory, usually defined along national borders, and do not reflect the reality of international business. In particular, organisations making use of the cloud will be collecting data in one territory and subsequently processing it in numerous other territories. The Proposed Regulation will simplify the requirements for organisations seeking to do this. In addition, it also aims to improve the current system of “binding corporate rules” to make compliance less burdensome – “binding corporate rules” are typically a set of intra-corporate global privacy policies that satisfy the European Union standard of adequacy when organisations are seeking to transfer the data outside of the EEA. The Proposed Regulation would require all data protection authorities to recognise “binding corporate rules” approved by an individual data protection authority.
  • Data protection by design and by default: The Proposed Regulation requires data controllers to only collect and retain personal data to the minimum extent necessary in relation to the purposes for which they are intended by design to be processed. This will be particularly controversial for organisations seeking to undertake data analytics of their mass repositories of data.
  • Accountability and Data Protection Officers: The Proposed Regulation seeks to increase the accountability of data controllers and data processors, including by requiring that they carry out data protection impact assessments prior to risky data processing activities. In addition, organisations with over 250 full time employees will be required to have a Data Protection Officer.
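
The numeric thresholds in the summary above can be captured in a small Python sketch. The constants come from the summary itself; the function names are hypothetical, and starting the 24-hour clock at the moment the breach is detected is an assumption made here for illustration, not something the summary specifies.

```python
from datetime import datetime, timedelta

MAX_FINE_PERCENT = 2                    # "up to 2% of the worldwide gross revenue"
AUTHORITY_NOTICE = timedelta(hours=24)  # "where feasible" within 24 hours
DPO_EMPLOYEE_THRESHOLD = 250            # "over 250 full time employees"

def max_fine(worldwide_gross_revenue):
    """Upper bound on a fine under the proposed 2% cap."""
    return worldwide_gross_revenue * MAX_FINE_PERCENT / 100

def authority_notice_deadline(breach_detected_at):
    """Latest time to notify the authority, assuming the clock starts at detection."""
    return breach_detected_at + AUTHORITY_NOTICE

def needs_dpo(full_time_employees):
    """True when the headcount exceeds the proposed threshold."""
    return full_time_employees > DPO_EMPLOYEE_THRESHOLD

print(max_fine(1_000_000_000))
print(authority_notice_deadline(datetime(2012, 5, 30, 9, 0)))
print(needs_dpo(251))
```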

 

