THE BIG DATA EXPLOSION

August 15, 2014

Steve Fingerhut is a VP of Marketing at SanDisk.  In his inaugural guest blog post for the Metadata Era, Steve discusses how enhancing existing server investments with solid-state memory can speed up Big Data analytics while keeping costs in check.

Metadata readers know better than most that we’re living in an era of data: massive data generated by web transactions, our mobile devices, social media, and even our refrigerators and cars. The numbers are stunning. Data is growing at dizzying exponential rates: 90% of the world’s data was created over the last two years alone, and by 2020 the amount of data will have grown by 4,300%.

The majority of data produced today is termed ‘unstructured data’: data that does not fit neatly into traditional relational database systems.

This category usually includes emails, Word documents, PDFs, images, and now social media. To give you a glimpse of how much unstructured data we’re generating: every minute, 100 hours of video are uploaded to YouTube, and more than 100 billion Google searches are performed every month. But what do we do with all of this data?

Analytic Apps Crave Big Data

The giants of the web have long used data as a tool to help them understand customer behavior. For example, eBay produces 50 TB of machine-generated data every day(!), collecting and recording user actions to understand how customers interact with its website.

By analyzing data in their focus area, businesses can respond to patterns and make needed changes to improve sales, achieve higher engagement rates, enhance safety or help guide their overall business strategy.

Big Data is no longer just a tool for these web giants. Collecting, analyzing and utilizing data is critical for businesses of any size to remain competitive, and as such, businesses are collecting more data.

When it comes to Big Data, bigger is better. Analytics meant to forecast future probabilities (predictive analytics) become far more accurate as the data set grows to massive scale. So companies are expanding projects to help their big data grow even bigger.

Research shows that companies making major investments in Big Data projects to mine their data for insights are not only generating excess returns but also gaining competitive advantages. So it’s no wonder data is becoming the most precious commodity in organizations today.

Extending Memory

For data center managers contending with the growth of business data, finding infrastructure to store, archive, access, and process these huge data sets has become one of an organization’s biggest challenges.

One approach is to divide and conquer: distribute the data across separate servers with idle CPU capacity and storage resources. Having many computing units operating in parallel, say as part of a MapReduce platform, is one way, though a complex one, to handle the problem.
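To make the idea concrete, here is a minimal, single-machine sketch of the map/reduce pattern in Python: each “mapper” counts words in its own slice of the data, and a “reducer” merges the partial results. A real platform such as Hadoop distributes these steps across many servers; the word-count example and the four-worker split below are purely illustrative.

    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(lines):
        """Mapper: count the words in one slice of the data."""
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return counts

    def reduce_counts(partials):
        """Reducer: merge the partial counts produced by every mapper."""
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        data = ["big data is big", "data keeps growing", "big big data"]
        chunks = [data[i::4] for i in range(4)]   # split the work into 4 slices
        with Pool(4) as pool:                     # 4 parallel workers stand in for 4 servers
            partials = pool.map(map_chunk, chunks)
        print(reduce_counts(partials).most_common(3))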

Another idea is to squeeze more performance out of existing computing elements. For many kinds of Big Data applications, gigabytes of data points often have to be manipulated at a single time—for example, in complex statistical operations.

It’s far faster (by orders of magnitude) to have the data in memory when it’s needed rather than pulling it from disk storage. But keeping data sets in memory is usually feasible only on the most powerful (and expensive) high-end servers with very large memory spaces.

An effective way to contend with these challenges is to use SSDs, or flash-based disk drives. SSDs use the same type of memory found in mobile devices and cameras, but expanded and customized for far larger capacities and data center reliability. Would you be surprised to learn that the big web giants (like Amazon, Facebook and Dropbox) long ago moved to include flash-based solid-state drives in their storage infrastructure?

SSDs deliver far better performance than legacy storage: 100x that of old-fashioned hard drives. Fitted with SSDs, even standard servers can sort and crunch huge amounts of data without the much-feared “disk penalty” of losing valuable time seeking and accessing data blocks from a drive’s magnetic media.

Another benefit: with no mechanical parts, SSDs let companies take sudden, unpredictable disk failure off their list of risks!

A Real Added Value

But there’s more to it than that. You might be wondering about the cost impact of flash, and whether your organization can afford to implement SSDs. I’d argue that you can’t afford not to; let me explain why.

As we look at Big Data and analytics, applications are coping not only with huge data sets but also with data from multiple sources, often requiring tens of thousands (if not hundreds of thousands) of I/O operations per second (IOPS) for each workload.

To reach that level of performance with traditional drives, IT managers have had to ‘stitch together’ a huge pile of hard drives to jointly supply the needed IOPS. Bringing together so many drives not only adds complexity and additional points of failure, it also means managing and paying for more racks, more networking, more electricity, more cooling, and more floor space. Because SSDs deliver 100x the performance, you need far less hardware to analyze data and perform complex operations, which translates into cost savings on both infrastructure and operations.
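As a back-of-the-envelope illustration, here is a quick Python comparison of drive counts. The per-drive figures and the 100,000 IOPS target are assumptions made for this sketch, not measured values; real numbers vary widely by model and workload.

    import math

    target_iops = 100_000        # assumed workload requirement for this example
    hdd_iops = 200               # assumed per-drive figure for a 15K RPM hard drive
    ssd_iops = 20_000            # assumed per-drive figure for a data-center SSD

    hdds_needed = math.ceil(target_iops / hdd_iops)   # 500 spindles
    ssds_needed = math.ceil(target_iops / ssd_iops)   # 5 drives

    print(f"HDDs needed: {hdds_needed}, SSDs needed: {ssds_needed}")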

Let me add some numbers to support my claims. Recently, we conducted a test using Hadoop, a Big Data framework for large-scale data processing. We compared hard drives against solid-state drives to see not only how much of a performance gain SSDs can deliver, but also what impact they have on costs. As you may have already guessed, SSDs won big on both counts. We saw a 32% performance improvement on a 1 terabyte dataset and, better yet, a 22%-53% cost reduction, depending on the workload’s pattern of access to storage.

Getting the Job Done Right

When it comes to optimizing Big Data analytics, there’s a larger point to be made: it takes two elements to get the job done, hardware advances such as SSDs combined with smart software. Companies will need both to contend with the oncoming data tsunami and to make the most of their analytics so they can remain competitive.


Big data and the rise of augmented intelligence

August 11, 2014


THREE THINGS TO BE AWARE OF WITH LOW-COST DATA BACKUP SERVICES

August 7, 2014

I’m always a little surprised by the reaction from customers regarding off-site storage services. It goes something like, “Well, the price is so good that I don’t really need to know anything else.” From a pure accounting standpoint, I do see their point.

As a company goes down the road of evaluating low-cost backup and disaster recovery service providers, they should stop and “read the fine manual” as we say in IT: in this case, it’s the small print contained in the Terms of Service. I’ve looked at more than a few of these agreements and here are three key points that you should keep in mind:

1. Security Is Ultimately Your Responsibility

You’ll often see language in the ToS saying that “they take security seriously” and that “it’s very important”, but there’s additional legalese stating that the provider can’t be held liable for any damages as a result of data loss.

In fact, some of the ToS have a clause that explicitly says you are responsible for the security of your account. Yes, they will encrypt the data, and you may be given the option to hold the security keys. In a very strong sense, the security hot potato remains with you even though they have the data. When calculating the true costs and risks of these services, keep that in mind.
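If you do hold the keys, the practical upshot is encrypting data yourself before it ever leaves your network. Here is a minimal sketch of that idea in Python using the open-source cryptography package; the file names are hypothetical and the sketch only illustrates the concept, it is not a recommendation of any particular tool or workflow.

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()          # keep this key yourself; the provider never sees it
    with open("backup.key", "wb") as f:
        f.write(key)

    with open("quarterly-report.xlsx", "rb") as f:    # hypothetical file to back up
        ciphertext = Fernet(key).encrypt(f.read())

    with open("quarterly-report.xlsx.enc", "wb") as f:
        f.write(ciphertext)              # only this encrypted copy gets uploaded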

2.  Two-Factor Authentication?

As Metadata Era readers, you’re no doubt wondering about two-factor authentication. As a kind of virtual commercial landlord, these services hold data for lots of businesses, so you might expect building security to be tight (“show me your badge”). After all, these backup services are a magnet for hackers.

I didn’t see two-factor authentication listed as a standard part of the packages from the cloud providers I looked at. There are third-party services that can provide out-of-band authentication through a separate logon solution, but at an extra cost, and you’ll have to contract with them separately.
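For reference, an out-of-band second factor is conceptually simple. The sketch below uses the open-source pyotp library to generate and verify an RFC 6238 time-based one-time password; it only illustrates the concept and says nothing about how any particular third-party service implements it.

    import pyotp

    secret = pyotp.random_base32()   # provisioned once per user, e.g. via a QR code
    totp = pyotp.TOTP(secret)

    code = totp.now()                # what the user's phone app would display
    print("Second factor valid:", totp.verify(code))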

3.  Data Availability

You store your data in the cloud with these companies, so you’d expect some promise that the data will be there when you need it.  Of course, on the public Intertoobz, there are limits to what they can be responsible for. Typically there are clauses in the ToS that exclude the digital equivalent of acts of nature—e.g., DoS attacks.

Outside of unusual events, these backup services generally don’t even promise a level of availability: 99%, 99.9%, or pick your sigma. And the most they’re liable for when there’s a loss of data dialtone is the subscription fee.

This is not to say that you can’t get a better deal—Service Level Agreements (SLAs) that compensate when certain metrics aren’t met—but for low-price, one-size-fits-all bit lockers, there is usually little or no opportunity to negotiate.

 

If you already have an outsourced data backup or disaster recovery solution in place with a sensible SLA and you can truly estimate the cost savings, and you’re getting a blue-light deal, then more power to you.

However, for everyone else, a good in-house IT department using purchased archiving or transfer products can offer custom security, high availability, and guaranteed accountability.


Authentication Lessons from the Magic Kingdom: A Closer Look at Kerberos, Part I

August 7, 2014

The flaws in NTLM I’ve been writing about might lead you to believe that highly-secure authentication in a distributed environment is beyond the reach of mankind. Thankfully, resistance against hackers is not futile. An advanced civilization, MIT researchers in the 1980s to be exact, developed open-source Kerberos authentication software, which has stood the test of time and provides a highly-secure solution.

How good is Kerberos? Even Microsoft recognizes it as superior and openly recommends and supports it—albeit through their own version. The Kerberos protocol is more complicated than NTLM and harder to implement, but worth the effort.

It’s also quite difficult to explain in a blog post. Kerberos involves complex interactions in which “tickets” or tokens are handed over by clients to various guardian servers. Faced with having to discuss Kerberos using all the usual protocol diagrams (see Wikipedia if you must), I decided to look for a better approach.

While I’ve no proof of this, it’s possible that the Kerberos authors were inspired by a real-world authentication system used in theme parks, perhaps even Disney World’s.

The General Admission Ticket (Kerberos’s Ticket Granting Ticket)

I haven’t been to the Magic Kingdom in a good long time, but I do remember an overall admission “passport” that allowed one to enter the park but also included tickets for individual rides—by the way, you can read more about Disney ticketing here.

Kerberos has a similar general-admission concept. Unlike NTLM, users and client apps have to interact with what’s called a key distribution center (KDC), and initially its logon component, before they can even authenticate with and use individual services.

You can think of the front gate at Disney World, where you purchase the passport and the first round of authentication checks are made, as the Kerberos KDC logon service, and the Disney passport as what Kerberos refers to as the Ticket Granting Ticket or TGT.

As I recall, the passport lets you gain access to some of the rides, but then for the really good ones—say Pirates of the Caribbean—you’d have to pull out the individual tickets. So once in the park, the passport booklet always authenticates you as someone who has paid the fee to get in, thereby allowing Disney customers to use the individual ride tickets or purchase additional ones as well.

It Takes Three to Authenticate

NTLM authentication provides a two-party relationship between a user and a server with no central authentication authority. It’s perhaps more like a carnival where you show generic tickets (typically easy to forge!) directly to the ride attendant.

Kerberos, though, introduces a third component, the KDC. By the way, Kerberos refers to a mythological three-headed hound that guards the entrance to the underworld—we’re talking one tough guard doggy.

In the physical world, a complex administrative process is required to validate a document (theme park ticket, passport, driver’s license, etc.) as belonging to the holder and to make the paperwork difficult to duplicate. Perhaps not surprisingly, there’s similar complexity in issuing Kerberos’s TGT.

Here’s how it works.

Like NTLM, Kerberos uses passwords and other IDs indirectly, as keys to encrypt certain information. To start the authentication process, the Kerberos client sends basic identity data, a user name and IP address, to the KDC. The KDC logon component then validates this against its internal database of users and prepares the TGT, the digital version of the Disney passport.

The KDC first generates a random session ID. It uses that as a key to encrypt the identifying information sent by the client along with the session ID and some time stamp data. This forms the TGT. To say it another way, the TGT contains some unique data about the user along with the key that is then used to encrypt the whole shebang.

For this whole thing to work, the client needs the session ID. But of course you can’t pass it as plain text, so the session ID is itself encrypted with, what else, the user’s password, or more precisely the hash of the password. Then the two encrypted chunks of data are sent back to the user: the encrypted session ID and the encrypted TGT.
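To make the flow easier to follow, here is a minimal Python sketch of the exchange as described above. It follows this post’s simplified model (not the real Kerberos wire protocol) and assumes the open-source cryptography package; names such as kdc_issue_tgt and USER_DB are invented for illustration.

    import base64, hashlib, json, time
    from cryptography.fernet import Fernet

    def key_from_secret(secret: bytes) -> bytes:
        """Derive a Fernet key from a shared secret (here, a password hash)."""
        return base64.urlsafe_b64encode(hashlib.sha256(secret).digest())

    # The KDC's "internal database of users": username -> hash of the user's password.
    USER_DB = {"alice": hashlib.sha256(b"alice-password").digest()}

    def kdc_issue_tgt(username: str, client_ip: str):
        """Validate the basic identity data and return the two encrypted chunks."""
        stored_hash = USER_DB[username]            # unknown users raise KeyError
        session_key = Fernet.generate_key()        # the random "session ID"
        # The TGT: identity data + session key + timestamp, sealed under the session key.
        tgt_body = json.dumps({"user": username, "ip": client_ip,
                               "session_key": session_key.decode(),
                               "issued_at": time.time()}).encode()
        tgt = Fernet(session_key).encrypt(tgt_body)
        # The session key itself goes back sealed under the user's password hash.
        wrapped_session_key = Fernet(key_from_secret(stored_hash)).encrypt(session_key)
        return wrapped_session_key, tgt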

It’s a Small Authenticated World 

For all this to work, Kerberos makes the traditional assumption that only the user and the KDC have the password, a secret shared between the two of them. The client software asks the user for a password, hashes it, and uses the hash to decrypt the session ID sent back by the KDC. The unwrapped session ID then decrypts the TGT.

At this point, the Kerberos client has access to the overall IT theme park, but just as in Disney, needs to go through another process to, so to speak, ride the server.

Just stepping back a bit, we can see the beauty of the first interaction to gain the TGT. The client is authenticated since it has the hash of the user’s secret password that decrypts the session ID. The server is authenticated since it has the session ID used to encrypt the TGT.
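Continuing the sketch from the KDC section: the client side below can only recover the session key if it knows the user’s password, and the TGT only decrypts if it really was sealed under that session key by the KDC, which is the mutual-authentication point just made. Again, this is a simplified illustration, not real Kerberos.

    import base64, hashlib, json
    from cryptography.fernet import Fernet

    def key_from_secret(secret: bytes) -> bytes:
        """Same derivation the KDC sketch uses for the shared password hash."""
        return base64.urlsafe_b64encode(hashlib.sha256(secret).digest())

    def client_unwrap(password: str, wrapped_session_key: bytes, tgt: bytes):
        """Hash the password, recover the session key, then open the TGT with it."""
        password_hash = hashlib.sha256(password.encode()).digest()
        session_key = Fernet(key_from_secret(password_hash)).decrypt(wrapped_session_key)
        tgt_contents = json.loads(Fernet(session_key).decrypt(tgt))
        return session_key, tgt_contents

    # Example, reusing kdc_issue_tgt from the earlier sketch:
    # wrapped, tgt = kdc_issue_tgt("alice", "10.0.0.5")
    # session_key, contents = client_unwrap("alice-password", wrapped, tgt)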

Of course, the server side of mutual authentication generally is not an issue in theme parks—unless somehow fake Disney Worlds started popping up that were taking advantage of paying customers!

But as I pointed out last time, rogue servers are a problem with NTLM’s challenge-response protocol. Kerberos completely solves this both at initial authentication and, as we’ll see next time, in gaining access to individual IT rides.

There’s a lot to digest here. I’ll continue with the rest of the Kerberos process in the next post.

Thanks to http://www.varonis.com

 


4 THINGS YOU NEED TO KNOW ABOUT THE FUTURE OF FILE SHARING

August 6, 2014

Have you been in this movie?

You’ve been working for two months on a big project to analyze widgets — sales, marketing effectiveness, whatever. The first real deliverable is a presentation. A few versions are in your team’s shared folder, a few copies have been sent via email, one is in your home folder, your designer saved an update or two in Dropbox, and the final version will go in SharePoint.

You’re getting close, so your boss is now asking you to email her the latest version every other day (she doesn’t have access to the file server from her iPad). You’ve started to receive “Your mailbox is full” messages because the presentation is 15MB.

You want to pull your hair out. In the age of self-driving cars, shouldn’t file sharing be easier than this?

Check out our new whitepaper, 4 Things You Need to Know About the Future of File Sharing, to see if this story has a happy ending.


5 THINGS PRIVACY EXPERTS WANT YOU TO KNOW ABOUT WEARABLES

August 6, 2014

There’s been a lot of news lately in the health and fitness wearables space. Apple just announced they’re releasing an app called “Health” as well as a cloud-based platform, “HealthKit”. Somewhat related, Nike recently pulled the plug on its activity-tracking FuelBand. The conventional wisdom is that fitness trackers are on the decline, while the wearables market in general (think Google Glass and the upcoming iWatch) is still waiting for its defining moment.

And on the privacy front? In fact, there’s been a lot of movement there as well, and the FTC is all over it! They recently hosted a “Consumer Generated and Controlled Health Data” event, and all the speakers (the FTC Commissioner, technologists, attorneys, privacy experts) agreed that the potential of health-based wearables is huge, but that because health data is so sensitive, it needs special protection.

I’ve distilled their wisdom into 5 key things privacy experts want you to know about health data, the data generated by your wearables, your privacy, and why it’s so hard to create one law that protects it all.

1. Transparency and trust are essential

If health and fitness wearable makers create privacy policies that are ambiguous and don’t require consumer consent for data sharing, it may limit the benefits of these services for many people, especially those who are privacy conscious.  Why upload your health data when there’s no guarantee it will be kept private?

Some experts suggest short, clear-cut notices about the safety and protection of your data—something akin to a data nutrition label.

(Image: a sample data nutrition label)

 2. Your health data gets around

Latanya Sweeney, the FTC Chief Technologist and Professor of Government and Technology at Harvard University, attempted to document and map all flows of data between patients, hospitals, insurance companies, etc. She learned that it’s not really clear where data is going and it’s difficult to know all the places it might wind up.

Inspired by Sweeney, I checked whether some healthcare data might find its way outside the medical ecosystem. It does! The recent FTC report on the data broker industry (see appendix B) shows that brokers collect some sensitive patient data points.

(Image: map of health data flows. Source: DataMap)

3. Discharge data in disarray

Information about your hospital visit is known as discharge data. State law requires this data to be sent to whichever entity that law designates to receive it.

What do states do with your discharge data? It turns out 33 states sell or share it, and of those 33 states, only 3 are HIPAA compliant.

(Image: state discharge-data practices. Source: DataMap)

4. Geolocation is not to be overlooked

One very important privacy matter mentioned at the FTC event was geolocation. Many health and fitness apps and wearables mine data about your running routes or when you’re at the gym. Some apps may also be able to predict where you’re going to be at a certain time, or when you’re not home.

(Image: geolocation predictions)

5. There’s no free lunch

In exchange for a freemium health and fitness app, you are sharing A LOT of data. That’s not unusual in the free app world, but medical data is not the same as sharing your list of favorite movies.

Some users might trust the maker of their healthcare app or device, say Nike, but not realize that by using the product they’re consenting to having their health information sold and resold to third parties that may not be as trustworthy.

Jared Ho, an attorney in the FTC’s Mobile Technology Unit, tested 12 health and fitness apps and found that his data was sent to the developers’ websites as well as to 76 third parties, mostly advertising and analytics organizations.

Here’s what he found:

  1. 18 of the 76 third parties collected device identifiers, such as unique device IDs.
  2. 14 of the 76 third parties collected consumer-specific identifiers, such as user name, name, and email address.
  3. 22 of the 76 third parties received information about the consumers, such as exercise information, meal and diet information, medical symptoms, zip code, gender, and geolocation.

(Image: where wearable data travels)

No one can predict what will happen in the wearables market, but emerging business practices and technologies will inform and shape consumer privacy regulation, which remains a very hot topic. Questions such as who will, and who should, have access to one’s personal health data will no doubt remain part of the discussion.


HOW VARONIS HELPS WITH SARBANES-OXLEY (SOX)

August 6, 2014

If your company is publicly traded, or private but planning an initial public offering, SOX affects you.

Sarbanes-Oxley compliance projects can be slow and painstaking for IT departments.  Manually identifying sensitive information and building reports detailing data access can drain resources very quickly.  Luckily, Varonis can help streamline your SOX compliance projects, saving you lots of time and money!

In particular, we ensure that access to sensitive and important financial information residing on file servers, Exchange, and SharePoint sites is automatically ratcheted down to a business need-to-know, and that use of SOX-governed financial information is continuously monitored, so organizations have accurate, non-repudiable proof of data use and compliant behavior at all times.

Read the “How Varonis helps with SOX Compliance Brief” to learn more.

