Steve Fingerhut is a VP of Marketing at SanDisk. In his inaugural guest blog post for the Metadata Era, Steve discusses how enhancing existing server investments with solid-state memory can speed up Big Data analytics while keeping costs in check.
Metadata readers know better than others that we’re living in an era of data: massive data generated by web transactions, our mobile devices, social media and even our refrigerators and cars. The numbers are stunning. Data is growing at dizzying exponential rates: 90% of the world’s data was created over the last 2 years alone, and by 2020 data will increase by 4,300%.
The majority of data produced today is termed ‘Unstructured Data’, which is data that does not fit well into traditional relational database systems.
This category usually includes emails, Word documents, PDFs, images, and now social media. To give you a glimpse of how much unstructured data we’re generating: every minute, 100 hours of video are uploaded to YouTube, and more than 100 billion Google searches are done every month. But what do we do with all of this data?
Analytic Apps Crave Big Data
The giants of the web have long used data as a tool to help them understand customer behavior. For example, eBay produces 50TB of machine-generated data every day(!), collecting and recording user actions to understand how users interact with its website.
By analyzing data in their focus area, businesses can respond to patterns and make needed changes to improve sales, achieve higher engagement rates, enhance safety or help guide their overall business strategy.
Big Data is no longer just a tool for these web giants. Collecting, analyzing and utilizing data is critical for businesses of any size to remain competitive, and as such, businesses are collecting more data.
When it comes to Big Data, bigger is better. Analytics that are meant to forecast future probabilities (predictive analytics) become far more accurate as the data set grows to a massive scale. So companies are expanding projects to help their big data grow even bigger.
Research shows that companies with massive investments in Big Data projects to mine data for insights are not only generating excess returns but also gaining competitive advantages. So it’s no wonder data is becoming the most precious commodity of organizations today.
For data center managers who need to contend with the growth of business data, finding the infrastructure to store, archive, access, and process these huge data sets has become one of the biggest concerns, and it poses great challenges.
One approach is to divide and conquer the problem by distributing the data to separate servers with idle CPU capacity and storage resources. Having many computing units operating in parallel as, say, part of a MapReduce platform is one way, though a complex one, to handle the problem.
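To make the divide-and-conquer idea concrete, here is a minimal sketch of the map-and-reduce pattern in Python. It is an illustration only, not how any particular platform implements it: the function names (`split_into_chunks`, `map_chunk`, `merge_counts`, `word_count`) and the sample log lines are hypothetical, and threads on a single machine stand in for the separate servers a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Divide the records into roughly equal slices, one per worker."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: each worker counts the words in its own slice only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(records, num_workers=4):
    """Split the data, process the slices in parallel, then merge."""
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    # Hypothetical sample of recorded user actions.
    log_lines = ["search shoes", "view shoes", "search hats"]
    print(word_count(log_lines, num_workers=2))
```

Each worker sees only its own slice, so the map step needs no coordination; all of the cross-worker effort is concentrated in the merge. On a real cluster that merge also involves moving partial results over the network, which is a large part of the complexity mentioned above.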
Another idea is to squeeze more performance out of existing computing elements. For many kinds of Big Data applications, gigabytes of data points often have to be manipulated at a single time—for example, in complex statistical operations.
It’s far faster (by orders of magnitude) to have the data in memory at the time it’s needed instead of accessing it from disk storage. But this is often feasible only on the most powerful (and expensive) high-end servers with their very large memory spaces.
An effective route to contend with these challenges is to use SSDs, or flash-based solid-state drives, for this task. SSDs use the same type of memory found in mobile devices and cameras, but they’ve been expanded and customized to handle far larger capacities and to meet data center reliability requirements. Would you be surprised to learn that the big web giants (like Amazon, Facebook and Dropbox) have long since moved to include flash-based solid-state drives in their storage infrastructure?
SSDs deliver far better performance than legacy storage: 100x that of old-fashioned hard drives. Fitted with SSDs, even standard servers can sort and crunch huge amounts of data without the much-feared “disk penalty,” the valuable time lost seeking and accessing data blocks on a drive’s magnetic media.
Other benefits: because SSDs have no mechanical parts, companies can eliminate sudden, unpredictable disk failure from their list of risks!
A Real Added Value
But there is more to it than that. You might be wondering what the cost impact of flash is, and whether your organization can afford to implement SSDs. I actually think that you can’t afford not to, so let me explain why.
As we look at Big Data and analytics, applications are coping not only with huge data sets but also with data from multiple sources, often requiring tens of thousands (if not hundreds of thousands) of input/output operations per second (IOPS) for each workload.
To realize such high-level performance with traditional drives, IT managers have had to ‘stitch together’ a huge pile of hard drives to jointly supply the needed IOPS. But bringing together so many drives not only creates complexity and additional points of failure, it also means managing and paying for more racks, more networking, more electricity to power the infrastructure, more cooling and more floor space! Because SSDs deliver 100x the performance, you will require far less hardware to analyze data and perform complex operations, which translates to cost savings on both infrastructure and operations.
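The hardware arithmetic behind this argument can be sketched in a few lines of Python. The per-drive figure below is a hypothetical assumption chosen purely for illustration (real drives vary widely), the SSD figure simply applies the 100x claim to it, and the `drives_needed` helper is likewise invented for this example. The sketch also looks at IOPS alone and ignores capacity, redundancy and controller limits.

```python
import math

# Hypothetical figures for illustration only; real drives vary widely.
HDD_IOPS = 150             # assumed random IOPS for a single hard drive
SSD_IOPS = HDD_IOPS * 100  # the 100x claim applied to that assumption


def drives_needed(target_iops, iops_per_drive):
    """Smallest number of drives whose combined IOPS meets the target."""
    return math.ceil(target_iops / iops_per_drive)


if __name__ == "__main__":
    target = 100_000  # a workload at the low end of "hundreds of thousands"
    print("Hard drives needed:", drives_needed(target, HDD_IOPS))
    print("SSDs needed:", drives_needed(target, SSD_IOPS))
```

Under these assumptions the hard-drive count runs into the hundreds while the SSD count stays in single digits, and every drive removed also takes its share of rack space, power and cooling with it.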
Let me add some numbers to support my claims. Recently, we conducted a test using Hadoop, a Big Data framework used for large-scale data processing. We compared the use of hard drives vs. solid-state drives not only to see how large a performance gain SSDs can deliver, but also to calculate their impact on costs. As you may have already guessed, SSDs came out winning big on both counts. We saw a 32% performance improvement on a 1-terabyte dataset and, better yet, a 22%-53% cost reduction, depending on the workload’s pattern of access to the storage.
Getting the Job Done Right
When it comes to optimizing Big Data analytics, there’s still a larger point to be made. It takes two elements to get the job done: hardware advances (SSDs, for example) and smart software. Companies will need both to contend with the oncoming data tsunami and ensure they can make the most of their analytics to remain competitive.