The mandate to every IT department these days seems to be: “do more with less.” The basic economic concept of scarcity is hitting home for many IT teams, not only in terms of headcount, but storage capacity as well. Teams are being asked to fit a constantly growing stockpile of data into an often-fixed storage infrastructure.
So what can we do given the constraints? The same thing we do when we see our own PC’s hard drive filling up – identify stale, unneeded data and archive or delete it to free up space and dodge the cost of adding new storage.
Stale Data: A Common Problem
A few weeks ago, I had the opportunity to attend VMWorld Barcelona, and talk to several storage admins. The great majority were concerned about finding an efficient way to identify and archive stale data. Unfortunately, most of the conversations ended with: “Yes, we have lots of stale data, but we don’t have a good way to deal with it.”
The problem is that most of the existing off-the-shelf solutions try to determine what is eligible for archiving based on a file’s “lastmodifieddate” attribute; but this method doesn’t yield accurate results and isn’t very efficient, either.
Why is this?
Automated processes like search indexers, backup tools, and anti-virus programs are known to update this attribute, but we’re only concerned with human user activity. The only way to know whether humans are modifying data is to track what they’re doing—i.e. gather an audit trail of access activity. What’s more, if you’re reliant on checking “lastmodifieddate” then, well, you have to actually check it. This means looking at every single file every time you do a crawl.
With unstructured data growing about 50% year over year and with about 70% of the data becoming stale within 90 days of its creation, the accurate identification of stale data not only represents a huge challenge, but also a massive opportunity to reduce costs.
4 Secrets for Archiving Stale Data Efficiently
In order for organizations to find an effective solution to help deal with stale data and comply with defensible disposition requirements, there are 4 secrets to efficiently identify and clean-up stale data:
1. The Right Metadata
In order to accurately and efficiently identify stale data, we need to have the right metadata – metadata that reflects the reality of our data, and that can answer questions accurately. It is not only important to know which data hasn’t been used in the last 3 months, but also to know who touched it last, who has access to it, and if it contains sensitive data. Correlating multiple metadata streams provides the appropriate context so storage admins can make smart, metadata-driven decisions about stale data.
2. An Audit Trail of Human User Activity
We need to understand the behavior of our users, how they access data, what data they access frequently, and what data is never touched. Rather than continually checking the “lastmodifieddate” attribute of every single data container or file, an audit trail gives you a list of known changes by human users. This audit trail is crucial for quick and accurate scans for stale data, but also proves vital for forensics, behavioral analysis, and help desk use cases (“Hey! Who deleted my file?”).
3. Granular Data Selection
All data is not created equally. HR data might have different archiving criteria than Finance data or Legal data. Each distinct data set might require a different set of rules, so it’s important to have as many axes to pivot on as possible. For example, you might need to select data based on its last access data as well as the sensitivity of the content (e.g., PII, PCI, HIPAA) or the profile of the users who use the data most often (C-level vs. help desk).
The capability to be granular when slicing and dicing data to determine, with confidence, which data will be affected (and how) with a specific operation will make storage pros lives much easier.
Lastly, there needs to be a way to automate data selection, archival, and deletion. Stale data identification cannot consume more IT resources; otherwise the storage savings diminish. As we mentioned at the start, IT is always trying to do more with less, and intelligent automation is the key. The ability to automatically identify and archive or delete stale data, based on metadata will make this a sustainable and efficient task that can save time and money.
Interested in how you can save time and money by using automation to turbocharge your stale data identification? Request a demo of the Varonis Data Transport Engine.