by Rob Sobers
In big data-land, all the talk lately seems to be about machine data. There is a flood of machine data being spewed into log files and databases. We’ve got web traffic logs, application event logs, OS logs, call center records, GPS coordinates, sensor data, and much more.
Machine generated data is valuable, no doubt, but what about human generated data?
Here’s a fun thought experiment – which would you rather own:
1.) A data set containing every single visit to twitter.com with the IP address, date, time, referrer, etc. of every visitor, or 2.) a data set containing the content of every single tweet ever authored by a human.
Last I checked, there wasn’t much demand for Twitter’s Apache logs, but the company is making a buck or two selling their firehose of tweets to the likes of Google and Microsoft.
In addition to the vast sea of human generated content on the Internet, much of which has very low value (pick a random YouTube video), consider all the human generated data within your organization which, by definition, should have very high value density.
These are emails, Word documents, spreadsheets, presentations, audio files, and video files. Not only do these files take up the lion’s share of digital storage capacity, but we also tend to keep them around for a long time, and there is an enormous amount of metadata associated with them.
Why do we keep them for a long time? Partly because they take more effort to create; while some of this content is created by one person, much of it is now produced by teams of people who draft and iterate until the content is ready to be shared with more humans. Mostly, though, we keep the content we create because it’s important. It can convey all kinds of information: our thoughts, ideas, and plans, or medical and financial details.
Human generated content is big; the metadata is bigger. Interesting metadata about a file might be who created it, what type of file it is (spreadsheet, presentation), what folder it is stored in, who has been reading it, who has access to it, or who sent it in an email to someone else. Over its lifespan, a file is usually accessed by many people, copied, sent or moved around to many places in many file systems. This metadata is so big that if you collect and store it all in its raw form, before long its size will dwarf the files themselves.
Just as analyzing machine generated data has practical applications for business, analyzing the “big metadata” associated with human generated content has enormous potential. More than potential, harnessing the power of big metadata has become essential to manage, protect, and effectively collaborate in today’s organizations. Those that fail to adopt these technologies report that they have little confidence that their data is protected, that they don’t know where critical information resides or who it belongs to, and that they can no longer keep up with fundamental data protection activities.
For many organizations, human generated big data represents a new frontier of untapped potential. Now that we have the technology to listen to the heartbeat of our organization, we would be remiss not to. Some of the fundamental questions that you can start to answer:
- Who is creating the most content?
- Who is accessing the most data?
- Where is my sensitive data stored?
- Which servers aren’t being utilized?
- Is there anything abnormal going on?
And this is just the tip of the iceberg. Once you start combining data streams, the insights become far more distinctive and game-changing.
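To make a few of the questions above concrete, here is a minimal Python sketch of how they might be answered from a stream of file activity metadata. Everything in it is an assumption made for illustration: the `FileEvent` record, its field names, the action labels, and the sample users and servers are all hypothetical, and a real collection tool would capture far richer data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEvent:
    """One hypothetical metadata record: who did what to which file, and where."""
    user: str
    server: str
    path: str
    action: str  # assumed labels: "create" or "read"


def count_by_user(events, action):
    """Count how many events of the given action each user generated."""
    return Counter(e.user for e in events if e.action == action)


def idle_servers(events, all_servers):
    """Return servers that appear in the inventory but in no event at all."""
    active = {e.server for e in events}
    return sorted(set(all_servers) - active)


# Made-up sample data, purely for illustration.
events = [
    FileEvent("alice", "fs1", "/plans/q3.docx", "create"),
    FileEvent("alice", "fs1", "/plans/budget.xlsx", "create"),
    FileEvent("bob", "fs1", "/plans/q3.docx", "read"),
    FileEvent("bob", "fs1", "/plans/budget.xlsx", "read"),
    FileEvent("bob", "fs2", "/hr/handbook.pdf", "read"),
    FileEvent("carol", "fs1", "/plans/q3.docx", "read"),
]

# Who is creating the most content?
print(count_by_user(events, "create").most_common(1))  # [('alice', 2)]

# Who is accessing the most data?
print(count_by_user(events, "read").most_common(1))  # [('bob', 3)]

# Which servers aren't being utilized?
print(idle_servers(events, ["fs1", "fs2", "fs3"]))  # ['fs3']
```

Spotting abnormal behavior or locating sensitive data would need more than simple counts, such as a baseline of each user's normal activity or a scan of file contents, but the raw material is the same kind of event record.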
Whether you start with a general purpose big data solution or a vertical full-stack product, the key takeaway is this: collect the data now, because you never know when you will need it.