Personally Identifiable Information Hides in Dark Data

May 3, 2013

To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Their working definition encompasses vanilla identifiers: social security and credit card numbers, and all the other usual suspects. With the additional words “reasonable basis to believe that the information can be used to identify the individual”, HIPAA’s definition takes in digital handles such as emails, IP addresses and even facial imagery. But there’s a little more to HIPAA’s PII definition, and it applies specifically to free form text (commonly found in word processing documents, spreadsheets, presentations, etc.)

The complete list of HIPAA’s PIIs is enumerated in the law’s Safe Harbor guidelines. In plain-speak, these guidelines tell health IT administrators what information is considered private, requiring special authorization to view or process. It includes the aforementioned identifiers, as well as medical record numbers, health insurance IDs, and some others. By the way, we’ve conveniently put this PII list in our omnibus data protection compliance whitepaper.

An unstated assumption made by many is that PII only lives in structured formats, in other words, fields in a database. Readers of this blog of course know that PIIs are just as likely to be harvested from the massive amounts of human-generated dark data found on corporate file servers.

The HIPAA regulators have understood this as well. In clarifying the rules for removing PII from data (“de-identifying” it) for publication and general usage, they explicitly cover the possibility that PII can also reside in free-form text. I’ve excerpted the key paragraph from their de-identification best practices below:

PHI [protected health information] may exist in different types of data in a multitude of forms and formats in a covered entity.  This data may reside in highly structured database tables, such as billing records. Yet, it may also be stored in a wide range of documents with less structure and written in natural language, such as discharge summaries, progress notes, and laboratory test interpretations … The de-identification standard makes no distinction between data entered into standardized fields and information entered as free text (i.e., structured and unstructured text)— an identifier listed in the Safe Harbor standard must be removed regardless of its location.

Got that? PHI, which is essentially PII along with other sensitive medical information, embedded in spreadsheets, docs, and presentations is just as worthy of HIPAA privacy protections as fields in databases.

So if we follow these ideas—PIIs can be anything that reasonably links to an individual, and this data can exist in text—to their logical conclusion, then we need to consider a new possibility. Suppose this sentence from a doctor’s notes were uploaded to a file server:

The patient, a technical content specialist at Varonis, a software company, has been complaining about tennis elbow.

The natural question to ask is whether “technical content specialist at Varonis” is a PII.

It’s not a PII in the sense of a uniquely coded key such as social security number or health insurance ID that links back to a person. But in another sense, it acts very much like PII. Don’t believe me? Try typing that phrase into Google and see what comes up.

We’re really talking more about the meaning of the text—or as experts would say, the semantic value—rather than actual letters, numbers, and other syntax. But HIPAA’s Safe Harbor rule even takes this into account: it specifically notes that the “knowledge” in free text can also be used to point back to a person.
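To make the gap between syntax and meaning concrete, here is a minimal Python sketch of the kind of pattern-based scan that catches coded identifiers in free text. The pattern set, the function name `find_identifiers`, and the sample SSN and email address are all illustrative assumptions, and the patterns are nowhere near a complete Safe Harbor check.

```python
import re

# Illustrative patterns only (an assumption for this sketch); the Safe Harbor
# list covers many more identifier types than these three.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_identifiers(text):
    """Return (label, matched_text) pairs for coded identifiers in free text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

# Made-up sample text containing two syntactic identifiers.
coded_note = "Patient (SSN 078-05-1120) can be reached at jane.doe@example.com."

# The sentence from the doctor's notes quoted earlier in this post.
semantic_note = (
    "The patient, a technical content specialist at Varonis, "
    "a software company, has been complaining about tennis elbow."
)

print(find_identifiers(coded_note))     # two hits: the SSN and the email
print(find_identifiers(semantic_note))  # no hits at all
```

The scan flags the coded identifiers but returns nothing for the doctor's note, even though the job title and company can point back to one person. Catching that kind of reference takes analysis of what the text means, not just what it looks like.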

As a practical matter, the HIPAA rules mean that any reference to a patient’s job title and company is a violation of the law’s privacy protections.

This leads to a broader discussion on what’s called the “semantic web”. In brief, Google and a few others are already doing leading edge work on extracting meaning and knowledge from web content. You can see for yourself how well Google does this by entering the keywords “height of the empire state building” in a search. You’ll get back an actual answer, 1454’, in addition to all the docs with that exact phrase.

The larger point is that along with stealing PIIs, hackers and cyber thieves are also getting better at mining and interpreting human generated text for personal details, and then building more convincing fake identities to be used in social attacks, such as phishing and pretexting.

Bottom line: these bits and pieces of personal information that are scattered across file servers in clear-text documents can be used to identify an individual with very high likelihood.

That’s important to keep in mind when someone in your company asks, “do we know what’s in our files and the risks involved if our servers are breached?”


Customer Decision & Big Data: A possible Journey

April 26, 2013

The customer is king. Always, whether in B2B or B2C settings. Much has been written this week on the importance of a customer-centric approach, in which B2B organizations need to develop a much deeper understanding of the modern Customer Decision Journey.

Questions have been raised as to whether Multichannel Marketing Mix approaches have been based on the right models and research to measure results.

Adding to the hype is a report to be issued by the Council for Research, which is currently investigating measurement issues related to digital video advertising; that report will in turn form the basis of an Advertising Research Foundation inquiry into the quality of the models.

“We believe it’s important to bring a combination of modeling, information and expertise to decisions,” a P&G spokesman said in a statement to AdAge. “We have clear evidence that marketing-mix modeling, combined with other information and expertise, has helped to improve return on investment of our marketing spending and media buying.”

Besides measurement, what remains key is to reach the customer with a message that limits the risk of ad avoidance, a phenomenon that has been on the increase lately.

Can big data really improve the customer experience with personalized ads, products and service offerings?

For certain, big data can say a lot about preferences and even location. But with constantly increasing terabytes of data in structured, semi-structured and unstructured formats, making sense of it all is, to say the least, challenging.

All the more so for businesses that do not have their own platform from which to gather this data, nor the technical tools or analyst expertise to navigate and make sense of data gathered from their websites, blogs and external social platforms.

Some even ask whether Big Data is in reality an opportunity only for big players the likes of Google.

What do you think?

Thanks to http://moniagalardi.com/2013/04/25/customer-decision-big-data-a-possible-journey/

 


EU to Google: We Really Mean it About Data Retention Limits

April 22, 2013

“Are these data and privacy protection regulations serious or are they just for show?” I’ve been hearing that question lately from the tech reporters and journalists who’ve been contacting me. Even after pointing out extensive case files and other documented incidents on government and legal sites, I’m still left with the feeling that it’s just not proof enough.

Fate has finally intervened.

With the EU Commission’s complaint against Google’s privacy policies reaching a conclusion, I now have a teachable moment to convince the naysayers that this stuff is serious business.

When Google changed its privacy terms in early 2012, the fine print was also being looked at by EU regulators. Google may have thought it was making it easier for consumers with a single policy covering all its web services, but others felt a bit differently. The Article 29 Working Party is in charge of advising the EU Commission on its data security and privacy rules, which are contained in the Data Protection Directive or DPD. In late 2012, they filed a complaint against Google, and addressed a letter to Mr. Page.

In so many words, the Article 29 folks said the search engine company had not done enough to follow DPD rules on consumer privacy.

Security experts, compliance gurus, CIOs, and other interested players would normally have to get the real story about this intersection of legal and tech in niche publications or in the back pages of certain business sections, or perhaps in a blog of a major data governance player. Since this is Google, and it appears that the EU is willing to go to the mat on this one (in other words, there will be fines), the story is now moving up in importance and appearing more prominently in business sections of mainstream publications.

You can read from the regulator’s report to learn about the long list of Google’s privacy shortcomings, which are conveniently bold-faced. I offer a few of their choice phrases: “no valid consent”, “incomplete or approximate information”, and “retention periods must be appropriate in regards to the purpose.”

Whoa! The EU (technically the individual national data protection authorities led by France’s CNIL) will fine a major American online service provider over their … data retention policy?

Of course, having data retention policies and procedures (what to keep, what to archive) in place is just IT common sense. But you’re probably thinking that just because an organization doesn’t have explicit data retention or migration plans doesn’t mean it has broken the law.

Actually, it’s not only the EU that takes this IT procedure seriously. Data retention limits also show up in the US’s HIPAA rules for personal health data and in some financial data security regulations. But usually the limits—measured in years—are the amount of time an electronic document must be kept.

The EU, though, views data collection and retention with a goal of “data minimization” in mind: companies should store the minimum amount of personal data and limit the duration to what “must be appropriate in regards to the purpose”. That’s essentially the language of the DPD law. In other words, you just can’t keep personal consumer data unless there’s a legitimate business reason, you have to say what that reason is, and you have to say how long you’re going to keep it.

According to France’s CNIL, Google has to this date refused to provide any information about its data retention policies after being requested to do so.

And the EU Commission has been very clear that there will be consequences for not following its rules. How bad could the fines be for violating, either willfully or negligently, the DPD? The head of the Commission is suggesting they could run as high as 2% of global sales.

Last year Google earned revenues of over $45 billion. You do the math on what it means for not taking data compliance regulations seriously.
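For anyone who would rather not do the math by hand, here is a quick sketch in Python. It makes two simplifying assumptions: it treats “over $45 billion” as exactly $45 billion, and it assumes the suggested 2% ceiling would be applied in full.

```python
# Figures taken from this post; both are rounded, so the result is a
# ballpark estimate and not an actual fine amount.
annual_revenue_usd = 45_000_000_000
max_fine_percent = 2

# Integer arithmetic avoids floating-point rounding noise.
max_fine_usd = annual_revenue_usd * max_fine_percent // 100

print(f"Potential fine: ${max_fine_usd:,}")  # Potential fine: $900,000,000
```

Under those assumptions, the exposure is on the order of $900 million.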


Watch Google Show Off 4 Glass Apps

April 5, 2013


How the biggest DDOS attack in history could have been easily avoided, or not

April 5, 2013

The recent DDOS attacks aimed at Spamhaus hammer home three very important points that we must learn in our new digital society: 1) how dependent we are on digital communication, 2) how interdependent our networks have become, and 3) how drastic the consequences are when basic “blocking and tackling” measures are not taken.

This particular attack has not only affected Spamhaus; it has also affected internet speed and availability for millions of users and sites in the UK and in Europe. According to an article by John Markoff and Nicole Perlroth in the New York Times, “a number of computer security specialists pointed out that the attacks would have been impossible if the world’s major Internet firms simply checked that outgoing data packets truly were being sent by their customers, rather than botnets.”

The article also discusses how the attack would have been much less successful (or not successful at all) if more internet providers followed the best practice guidance released 13 years ago (2000) by the IETF (Internet Engineering Task Force) in BCP38.

While the article does a good job explaining the high level concepts of the attack, here is a little more detail on how the attack works, and how these attacks can be stopped:

Imagine some “attacker” can “spoof” your phone number so that your number shows up on other people’s phones when they call. Now imagine the attacker calls a bunch of people and hangs up before they answer— you’ll probably get a bunch of calls back from those people, because it looks like you called and hung up when you didn’t. Now imagine thousands of attackers doing this—you’d certainly have to change your phone number. With enough calls, the entire phone system would be impaired.

That’s similar to what’s happening in this DDOS attack. Attackers are spoofing Spamhaus’s IP addresses (IP addresses are like a phone number on the internet), sending traffic (let’s call this “stimulus”) to servers that they know will respond to this traffic, and these servers dutifully send their responses back to Spamhaus’ servers. Armed with the power of thousands of computers in a botnet, the attackers are sending a lot of stimulus. To make matters worse, the responses are much larger, in terms of size, than the stimulus. This means that for every packet of stimuli, there are many more response packets. (In our example above, imagine that all those hang up calls were to phone numbers that would automatically leave 3 minute messages on your voicemail or keep calling back over and over).

So what servers are drowning Spamhaus (and the rest of us) in response packets?  These servers are called domain name servers, or DNS, and perform a critical function—they match a human friendly name (e.g. google.com) with a machine friendly number (i.e. an IP address). Computers need to know each other’s IP addresses in order to communicate (or the IP address of the firewall that is protecting the computers).

DNS in friendly terms? When you try to browse to google.com, your computer queries a DNS to learn its IP address. If your computer can’t connect to a DNS, or the DNS can’t resolve google.com to an IP address, you’re out of luck. You can see this in action by going to a command prompt or shell on your computer, and typing:

nslookup www.google.com

If successful, you’ll see one or more IP addresses for Google.

Without DNS, instead of typing http://www.google.com in our web browser, we’d be typing, “173.194.75.105” or something similar. I can’t even remember my own phone number anymore—imagine if we had to remember these?

Why is DNS so vulnerable? The primary protocol that DNS servers happen to use is called UDP (User Datagram Protocol). This is important because UDP is “connectionless,” meaning there is no “handshake” when the initial connection is set up. “Handshakes,” like those used in TCP communications, offer a reasonable amount of host authentication—in other words, with TCP connections, you can be reasonably certain that both computers are who they say they are. With UDP, you cannot be sure, especially with short bursts of communications like DNS queries.

So, using a botnet, the attackers are sending millions of DNS queries that appear to be from the victim’s computer (“spoofing” the victim’s IP addresses), and the much larger responses from the DNS servers actually go to the victim’s computers. It’s kind of the ultimate “crank call.”
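To get a feel for why the responses dwarf the stimulus, here is a rough Python sketch that builds the raw bytes of a minimal DNS query and compares its size with a response. The helper names, the query ID, and above all the 2,900-byte response size are assumptions chosen for illustration; they are not measurements from the Spamhaus attack.

```python
import struct

def build_dns_query(hostname, query_id=0x1234, qtype=255):
    """Encode a minimal DNS query (header plus one question) as raw bytes."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    # QTYPE 255 asks for all record types; QCLASS 1 is the Internet class.
    return header + qname + struct.pack("!HH", qtype, 1)

def amplification_factor(query_size, response_size):
    """How many bytes reach the victim for each byte the attacker sends."""
    return response_size / query_size

query = build_dns_query("example.com")
assumed_response_size = 2900  # hypothetical size of a large response, in bytes

print(len(query))                                            # 29
print(amplification_factor(len(query), assumed_response_size))  # 100.0
```

With those assumed numbers, every byte of stimulus turns into a hundred bytes landing on the victim, which is why a modest botnet can generate an enormous flood.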

How can these attacks be stopped? Follow the guidance in BCP38, which explains how internet providers can filter out spoofed traffic. The idea is simple— every router (the devices that connect the internet) understands which addresses should be coming from which direction (interface, in router terms). If a packet arrives that says it’s coming from an IP address that shouldn’t be arriving from that interface, the packet should be dropped.
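The filtering rule can be sketched in a few lines of Python. This is only an illustration of the idea, not router configuration: the interface names, the `ALLOWED_PREFIXES` table, and the `should_forward` function are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical table: which source prefixes may legitimately arrive on
# which customer-facing interface.
ALLOWED_PREFIXES = {
    "eth0": [ipaddress.ip_network("192.0.2.0/24")],
    "eth1": [ipaddress.ip_network("198.51.100.0/24")],
}

def should_forward(interface, source_ip):
    """Forward a packet only if its source address belongs on this interface."""
    source = ipaddress.ip_address(source_ip)
    return any(source in prefix for prefix in ALLOWED_PREFIXES.get(interface, []))

print(should_forward("eth0", "192.0.2.10"))   # True: a real customer address
print(should_forward("eth0", "203.0.113.5"))  # False: spoofed source, drop it
```

A spoofed query claiming to come from the victim fails this check at the first router that applies it, so it never reaches a DNS server.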

Why is this hard? It’s not. So why haven’t internet providers taken these simple steps?

Actually, most of them have. According to research by the MIT ANA Spoofer Project, cited in an article on Senki written in June of 2012, 80% of internet providers had already implemented the recommendations in BCP38, and were already blocking spoofed traffic. It’s the remaining 20% that remain responsible for allowing “spoofed” traffic.

We’re seeing more and more that when fundamental blocking and tackling is missing, our interdependence shows – when a few parties don’t take basic security measures, other parties suffer. Just like on the road, where a few (or many) distracted or careless drivers can cause harm to countless others, a group of sloppily configured routers can allow attackers to disrupt critical infrastructure that we’ve come to depend on.  80% just isn’t good enough.

We can’t turn off DNS. Though it’s theoretically possible to make everyone use TCP instead of UDP for DNS queries (which would make these queries much more difficult to spoof), so many people would be adversely affected during the transition that this might make things worse than just living with the DDOS attacks.

Our best choice is to create a culture of security and responsible computing, where it becomes unacceptable to be in the remaining 20%. Imagine if 20% of the drivers on the road didn’t obey traffic signals—it would no longer be safe to drive. It should be equally unacceptable that so many computers are now in botnet armies that can do such tremendous damage—80% isn’t really good enough there, either. If 20% of the computers in the world are allowed to become part of a botnet, we’re going to have much bigger problems. The culture of security and responsible computing needs to extend to internet providers, and internet users.


[Tech] It’s Official: Google Glass Is Here!

February 25, 2013

While Apple iWatch rumors continue to slog their way through the blog-o-sphere, Google has upped the ante. Google’s Glass is not a rumor, it’s real. In addition (according to Google) you can get one by the end of 2013 by entering and winning a special contest.

At least, Google calls it a contest. There are some unique rules. First, you have to pay $1,500 for your Glass, if you win. Also, you have to travel to New York, San Francisco or Los Angeles to pick your prize up. (UPS is not available.)

If that isn’t enough, you have to come up with a really creative idea about how you will use your Glass. If you need help coming up with ideas, Google has released a video entitled How it Feels [through Glass] that provides a behind-the-lens view of the Glass experience.

Google hasn’t specified how many “winners” there will be – supposedly, that will depend on the number of “really creative ideas.”

CNET reported that Glass will be able to connect via Bluetooth to both Android phones and the iPhone, while pulling data from Wi-Fi and using the 3G/4G feeds from the connected phone. Glass will not have its own cellular radio.


RUSSIAN SEARCH ENGINE YANDEX TAKING THE LEAD OVER BING, GOOGLE STILL ON TOP

February 11, 2013

Yandex, the Russian search engine, is emerging as a leading search engine, taking the lead over counterparts such as Bing. A comScore analysis of worldwide search engine queries recorded between November and December 2012 reveals that Microsoft websites processed 4.477 billion queries, while Yandex processed 4.844 billion.

Google was still miles ahead with 114.73 billion search queries and 65.2 percent market share. Second place belonged to Chinese search engine Baidu with 14.5 billion and 8.2 percent market share while Yahoo took the third place with 8.63 billion and 4.9 percent.

Although industry giant Microsoft is equipped with all the resources, Yandex, whose marketing budget is not even close to that of Bing, is coming out ahead of it in global search queries.

Source: Michael Bonfils, The Inquirer


GOOGLE DOMINATES THE MOBILE APP MARKET, HAS 5 OF THE TOP 6 APPS IN THE U.S.

January 24, 2013

Mobile Apps Rankings

Wondering why Apple (AAPL) is sinking so much effort into building its own Maps application? Because it doesn’t want Google (GOOG) to gobble up all the revenue from big-name mobile applications. ComScore has published its most recent monthly review of the top iOS and Android apps in the United States ranked by unique visitors and has found that Google captured 5 of the top 6 spots with Google Maps, Google Play, Google Search, Gmail and YouTube. In fact, Facebook (FB) was the only non-Google app to crack the top 6, although it also had the benefit of being the most-visited app in the entire country by a margin of more than 10 million unique visitors. iTunes was the only Apple app to crack the top 10, meanwhile, as it ranked eighth with roughly 46 million unique visitors last month.


HOW ONE DEALERSHIP GENERATES $30,000.00 A MONTH IN PURE PROFIT USING A SIMPLE REPORT WITH ALERTS.

January 15, 2013

By Joe Tareen (Business Intelligence & Customer Engagement Professional)

Systemic problems within organizations never go away by just being identified and talked about. Management and all the stakeholders have to move beyond words, discussions and mere introductions of new process initiatives to address the core issue.

One has to identify the core problem and set up self-monitoring systems to actively perform contextual tasks in the background. These contextual tasks should provide management with actionable intelligence to effectively and purposely engage team members with specific instructions to meet standards. This active and purpose-driven engagement should continue until a counterculture has developed or a new habit is ingrained within your organization to meet your change objectives.

The case in point:

A dealership that I had worked for presented me with a challenge: solving a chronic issue they faced on a monthly basis. Their Shop Supplies fee collection was well below the industry benchmark, and their repeated efforts to address this revenue shortfall with Service Advisors were not bearing fruit. The assumption was that Service Advisors were readily waiving the fees either to increase discounts or, in rare cases, to mitigate an adverse customer-related situation. Whatever the reasons, the dealership really needed this additional stream of income in the face of increasing expense liabilities and decreasing consumer demand for automotive repair and maintenance services.

The existing reporting systems were not intuitive enough to help mitigate this issue. They simply lacked the features that could identify the exact service transactions in which Shop Supplies fees were unjustly waived without providing a valid reason.

If you have ever visited or run a service department for a franchised dealership in the middle of a huge metropolitan city, you know that the dizzying speed with which these business units operate, and the multitude of customer interactions that take place within a very short time span, can cause a serious caffeine addiction for those who manage them. The management team just does not have the resources available to manually manage negative revenue exceptions, whatever they may be. Consequently, dealerships end up losing a substantial amount of potential profit each month.

The Solution:

After a careful analysis, we as a team established that any viable solution should at least consist of three important components:

  1. The solution should be ‘alert’ based and self-driven. In other words no activity should be required by management and the system should be smart enough to initiate all exception alerts via an email with details included.
  2. Transaction counts, in this case the number of repair orders automatically identified with missing Shop Supplies charges, needed to be viewable inside a grid report with drill-down capabilities right down to the repair order details. This would immediately help management identify reasons for not charging Shop Supplies before an individual Service Advisor is forwarded a request to provide an explanation.
  3. Report viewing as well as repair order details should be device neutral. In other words management should be able to access this intelligence using a browser, a tablet or a smartphone.

Luckily for us this was not a daunting task. The dealership, under the leadership of a very capable and forward-thinking General Manager, had already tasked me with developing such a reporting solution, albeit not for this specific purpose. As a matter of fact, we had already had the solution up and running in the cloud for six months before facing this challenge. We had successfully and very inexpensively utilized a third-party cloud database application platform on which we built our very own Service Revenue Analysis application, accessible via a simple browser. This cloud application platform was highly customizable, with the ability to integrate tightly with the Reynolds & Reynolds DMS system.

Our script imported all the invoiced repair orders from the DMS system on a nightly basis. Once the data was placed into our cloud application, we had the capability to massage the data relationally, set alerts based on exception formulas and create an unlimited number of Service report views and dashboards on the fly. These reports and dashboards consisted of interactive ad-hoc reports, gauges, charts, summaries, etc.

The users of the system also had the ability to easily create reports and dashboards per their own requirements. This introduced a very powerful element with which key stakeholders could, for the first time, devise proactive management strategies for the entire department based on fast-developing and slow-moving trends simultaneously.

This unprecedented insight generated a renewed interest and increased engagement in all of us. The entire team involved started to talk about how we could utilize this tool to its full potential by scaling it out to other departments. For the first time the dealership was able to bring their own Fixed Operations data to life and analyze it in a very multidimensional manner. We started to see relations and dependencies that we could not have imagined before. Also, we were no longer captive to the prepackaged and somewhat limited reports that were provided to us by other vendors. It is not that we didn’t find some of their reports useful; we just believed that the best reports and analytics are the ones created by the end users themselves. There is something to be said for the process of data discovery itself, which brings to the table a spirit and pleasure of its own. In the back of our minds we knew that New & Used Car Sales, Parts Sales, and Finance would be next. Possibilities were endless, benefits many.

Specifically for the challenge of Shop Supplies fees collection at hand, our cloud application allowed us to create a condition based view for all repair orders. This condition was fairly simple. Include in this view all repair orders where there should be a Shop Supplies charge but the Shop Supplies field is equal to zero dollars.

Our system then allowed us to create a simple email alert to the Service Manager and the General Manager based on this condition. The body of each email included the details of each repair order transaction, reducing the number of clicks needed to get to the actionable intelligence.

The grid report view was also available as a dashboard that an admin could share with anyone within or outside the dealership via a web link. This report was updated daily and automatically using our DMS integration piece, so the Service Manager was notified and provided details even before he arrived for work.
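For readers who think in code, the exception condition and the alert can be sketched roughly as follows in Python. This is not the cloud platform we used; the `RepairOrder` fields, the function names, and the sample orders are made up purely to illustrate the logic.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class RepairOrder:
    number: str
    advisor: str
    supplies_expected: bool  # should this order carry a Shop Supplies charge?
    shop_supplies: Decimal   # amount actually charged

def missing_shop_supplies(orders):
    """Repair orders that should carry a Shop Supplies charge but show $0."""
    return [o for o in orders if o.supplies_expected and o.shop_supplies == 0]

def alert_body(exceptions):
    """Plain-text email body with one line per exception."""
    lines = [f"{len(exceptions)} repair order(s) missing Shop Supplies:"]
    lines += [f"  RO {o.number} (advisor: {o.advisor})" for o in exceptions]
    return "\n".join(lines)

# Made-up sample of a nightly import.
orders = [
    RepairOrder("1001", "Advisor A", True, Decimal("24.95")),
    RepairOrder("1002", "Advisor B", True, Decimal("0.00")),
    RepairOrder("1003", "Advisor A", False, Decimal("0.00")),
]

exceptions = missing_shop_supplies(orders)
print(alert_body(exceptions))
```

Only repair order 1002 ends up in the alert: order 1001 was charged correctly, and order 1003 was never supposed to carry the fee. Putting the details in the message body is what lets a manager act on the exception without opening the report.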

Upon reviewing this specific actionable intelligence, the Service Manager was simply able to forward these alerts on to each responsible Service Advisor who had waived the Shop Supplies charge in the first place, with a stern message attached, of course. This action made each Service Advisor self-aware: they realized that they could not get away with this dereliction and had no choice but to refrain from waiving Shop Supplies unless there was a good enough documented reason.

The Conclusion:

Results were amazing to say the least. We implemented the solution in September of 2012 and prior to that we were averaging right around $20,000.00 in Shop Supplies fees collection. With the new system in place we started to see our monthly collection amount exceed $30,000.00. It turns out all we needed was a system that simply alerted us of these exceptions and the active management initiative portion was just as simple as forwarding these alerts to the guilty party. I guess you could call it ‘preventive maintenance’, of course pun intended here.

This important task could be performed by any member of management using any device, as long as they had access to an email client, which satisfied the last required component and made the effort ever so successful. This simple report continues to provide exceptional value to the dealership and has now organically expanded to include excessive Parts and Labor discount alerts per repair order, which have come to save the dealership an additional $2,500.00 to $5,000.00 in gross profit each month.

How much additional profit can your dealership generate each month using simple reports with built in alerts?

P.S. If your dealership faces a similar challenge as outlined above, feel free to contact me. I would love to engage in a discussion to help in identifying the right solution for you and your organization.


Humanizing Big Data

January 9, 2013

HUMAN FACE OF BIG DATA
Some App Results

In less than two months, more than 3 million share and compare questions have been answered, in more than 100 countries, through “The Human Face of Big Data” smartphone survey app.

By collating and analyzing these 3 million+ responses we gained some insightful conclusions related to the attitudes and approaches to life from men and women, young and old, all over the world. Here are just a few of the most interesting findings…

In asking the question “What is most important for good health – diet, exercise, environment or genes?” we discovered that Americans are more likely to believe that good health is in their hands, choosing diet and exercise, while Europeans seem to believe their health is predetermined or out of their control, predominantly selecting either genes or environment.

In response to the question “What do you do to help cope with stress most?” we learned that as we get older work and prayer tend to replace friends or the arts as our primary means of stress relief, indicating that older generations prefer to bury themselves in work or deal with stress on their own, rather than by seeking entertainment or distraction.

When asked “If I could alter the DNA of my unborn child I would improve their: lifespan, intelligence, immunity or appearance” the findings showed that Americans are most concerned about their children’s education and job prospects, while Europeans worry most about their children’s health, perhaps reflecting the current unemployment rates and standards of available healthcare in these two nations.

While these findings give only a brief snapshot of the world around us, the goal of this app was to encourage people to embrace the subject of big data and to consider its potential to help us shape and change our daily lives. Hundreds of striking examples of ways this is already happening are illustrated in the photographs, infographics and essays within the Human Face of Big Data book.

The anonymous data compiled from the app will be made available for educators, data scientists, researchers and the general public to access as a valuable research tool, in order to conduct further in-depth sifting and sorting of the results, which may one day be considered an invaluable snapshot of human history.

