Some Amazing Things About Your File System

by Andy Green

I was recently asked by one of our sales people to come up with a few unusual facts about user behaviors or statistics related to networked file systems. She was looking for a good anecdote that would make our customers reconsider conventional IT wisdom. I think I’ve found something to raise an IT admin’s eyebrow.

To be fair, my discovery has been known about in a general way for a long time. It’s even become part of our popular culture. No, I don’t mean Murphy’s Law, which is well-appreciated by IT journeymen. I am referring to the proverbial 80-20 rule, which was explained to me, with more than a little hand waving, when I first started in IT. It went something like this: “80% of the data is explained by 20% of the facts”.

As with many simply stated rules, 80-20 hides some deep ideas. It turns out to describe key stats in complex systems spanning economics, marketing, sociology, as well as a few physical sciences. In recent years, the rule has been found to apply to another and more familiar complex creation–the Internet.

The fancier way to describe the 80-20 rule is to say that distributions of data (graphs of web site visits, web link references, and, as we’ll see later, file sizes) are governed by so-called power laws. Long tails and fat tails are other terms for the relative weightiness of events at the extreme end of the data curve, that is, compared with the thinner tails of the more beloved bell-shaped curve.

There is strong evidence for the rule. Much has been written about fat tails with respect to web stats. You can partially satisfy your own curiosity by looking at the web traffic data collected by Quantcast. According to them, perennial top sites such as Facebook, Google, Yahoo, Twitter, and a few others attract a disproportionate amount of total web visits.

From a quick back-of-the-envelope calculation using the Quantcast numbers, I tallied up close to 80% of monthly visitor traffic against just 40 of Quantcast’s top-ranked sites. These 40 sites, out of almost 400 million total web sites worldwide, are way, way less than 1% of the total. That’s a very skewed 80-20 pattern, closer to 80-.00001 when both sides are expressed as percentages!
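For anyone who wants to check the envelope math, here is a minimal sketch in Python. The two inputs are the rounded figures quoted above (40 sites, roughly 400 million web sites); the helper name `share_percent` is made up for this illustration.

```python
# Rounded figures from the text above; treat them as approximations.
top_sites = 40
total_sites = 400_000_000  # "almost 400 million" web sites


def share_percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


site_share = share_percent(top_sites, total_sites)
print(f"Top {top_sites} sites are {site_share:.5f}% of all sites")
```

So roughly 80% of the traffic lands on about 0.00001% of the sites.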

What does this have to do with file systems? Networked file servers are complex enough, with a large enough community of users accessing an ever-changing supply of resources (files, directories, and access permissions), to potentially behave in ways similar to the Web.

In graphing the distributions of file sizes, researchers long ago noticed–long pause–a similar kind of skewed curve. While it may not be a true power law, the telltale fat tail shows up for extreme file sizes. For example, you can check out this paper from the folks at Microsoft Research wherein they plot byte-counts for their corporate file system.

Being curious about my own aged home computer, a 10-year-old Dell running Windows XP, I decided to take a quick peek at a histogram of its file system, using a freebie utility. Here’s what I learned: out of almost 70,000 files taking up about 29 GB of space, a mere 83 files, or a shade more than 0.1% of the total file count, accounted for an astonishing 26% of the disk space!

Figure: Skewed disk utilization graph
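If you would rather not hunt for a utility, a rough version of the same measurement can be sketched in a few lines of Python. This is not the tool I used, just an illustration: the function names `file_sizes` and `top_share` are my own, and the sketch simply skips symbolic links and any file it cannot read.

```python
import os


def file_sizes(root):
    """Return the size in bytes of every regular file found under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    sizes.append(os.path.getsize(path))
            except OSError:
                # Unreadable or vanished file: skip it rather than abort the scan.
                continue
    return sizes


def top_share(sizes, n):
    """Return the fraction of total bytes held by the n largest files."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(sorted(sizes, reverse=True)[:n]) / total
```

Calling `top_share(file_sizes(some_folder), 83)` on a folder of your choice gives the share of bytes held by its 83 largest files, the same kind of figure as the 26% above. Your number will of course differ from mine.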

Even though I’m familiar with the research, I was still a little stunned to see the fat tail pattern play out on my personal computer. By the way, Microsoft Outlook® .pst files can reach huge sizes–you’ve been warned!

What’s going on to explain these renegade fat tails in corporate file systems?

One proposed idea is that we, as file users, copy existing files and then edit them, adding or subtracting content, for the next person down the chain to modify, and so on. Essentially, users are successively multiplying a file’s size by a random factor, and this has been shown to lead to fat-tailed file size curves.
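The multiply-by-a-random-factor idea is easy to try out. The sketch below is only a toy model, not the researchers' method: the number of edits per file, the starting size, and the choice of a lognormal factor with `sigma=0.5` are all assumptions picked for illustration. A purely multiplicative process like this one tends toward a lognormal curve, which is fat-tailed without being a true power law.

```python
import random
import statistics


def simulate_sizes(num_files=10_000, edits=20, sigma=0.5,
                   start_bytes=10_000, seed=42):
    """Toy model: each file is a chain of copy-and-edit steps, and every
    step multiplies the size by a random factor (assumed lognormal)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_files):
        size = float(start_bytes)
        for _ in range(edits):
            size *= rng.lognormvariate(0.0, sigma)
        sizes.append(size)
    return sizes


def top_fraction_share(sizes, fraction):
    """Return the share of total bytes held by the largest `fraction` of files."""
    n = max(1, int(len(sizes) * fraction))
    ordered = sorted(sizes, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


sizes = simulate_sizes()
print(f"median size: {statistics.median(sizes):,.0f} bytes")
print(f"mean size:   {statistics.mean(sizes):,.0f} bytes")
print(f"top 0.1% of files hold {top_fraction_share(sizes, 0.001):.1%} of the bytes")
```

In a run like this the mean lands well above the median, and a sliver of the files holds an outsized share of the bytes, which is the fat-tail signature.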

This copying behavior may also have a herd component to it. That is, we tend to edit files that have been copied or accessed more frequently. Preferences for popular files—or web sites or social networks—are also known to lead to fat-tailed distributions.
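The herd effect can be sketched the same way. The toy model below is a "rich get richer" process: at each step either a brand-new file appears, or an existing file is picked with probability proportional to how often it has already been picked. The function name and the parameters (`steps`, `new_file_prob`, `seed`) are assumptions for illustration, not measurements from any real file server.

```python
import random
from collections import Counter


def simulate_popularity(steps=50_000, new_file_prob=0.1, seed=7):
    """Toy rich-get-richer model of file accesses.

    `events` holds one entry per access, so picking a random entry selects
    a file with probability proportional to its current access count."""
    rng = random.Random(seed)
    events = [0]  # file 0 starts with a single access
    next_id = 1
    for _ in range(steps):
        if rng.random() < new_file_prob:
            events.append(next_id)  # a brand-new file enters the system
            next_id += 1
        else:
            events.append(rng.choice(events))  # popular files get picked more
    return Counter(events)


counts = simulate_popularity()
ranked = [count for _file_id, count in counts.most_common()]
top_n = max(1, len(ranked) // 100)
print(f"{len(ranked)} files, {sum(ranked)} accesses")
print(f"top 1% of files received {sum(ranked[:top_n]) / sum(ranked):.1%} of accesses")
```

Even though every file starts out equal, the early favorites keep pulling ahead, and a handful of files ends up with most of the attention.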

Based on my own experience as a user, I plead guilty to not only amending and expanding existing files but also echoing file permissions. When it came to read-write-execute permissions or access control entry (ACE) metadata, I was definitely a member of the herd, following what someone else had done. That is, until I started at Varonis.

There’s an IT moral to all this. Your user community is, unfortunately, propagating the “everyone” group or other harmful ACEs, and also unknowingly helping to push files into the red-zone of the file size curve.

For my money, herding behaviors alone are reason enough to use Varonis’s DatAdvantage to really understand and manage your organization’s networked file systems. A file system and its community of users form a kind of social network in which it is quite easy to amplify bad habits.

So you’ll want Varonis’s software to automatically spot these patterns and then take more direct control over shaping your file system’s overall profile.

