VDI and “The Psychic Desktop”

July 5, 2012

Interesting post from the guys at Fusion-io. The VDI solution is just amazing.

In my last 15 years of designing and implementing storage solutions, I have seen and heard a lot of things: some good, some bad, and some pretty unusual. In one strange accident, a 180-slot tape library tumbled down a flight of stairs (not my fault!). Then there was the $1 million SAN virtualization solution that caught fire at the first power-on because an electrician mixed up the cables in the 16A wall socket. I have also seen some poor design choices, like the customer who was running his virtual desktops from a high-latency, low-cost iSCSI solution because it was the only thing that would fit within his budget. The IT manager actually stopped having his lunch in the company restaurant for a few weeks because of all his complaining co-workers. These are exceptions, but they all started because of a small, bad decision.

In the storage world, people are used to measuring device latency in milliseconds. At Fusion-io, our device latency is just a couple of microseconds. Most users couldn’t care less about differences that small. It sounds ridiculous at first — 1/1000 of a second isn’t fast enough, so let’s build something hundreds of times faster. But when thousands of machines and processes start asking that storage device for data at the same time, it has an unbelievable impact. Anyone can build storage infrastructures with millions of IOPS, as long as you keep throwing racks and racks of hardware at it. But reducing latency calls for a change in the way we look at storage.
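To put rough numbers on that: by Little's Law, the throughput a device can sustain equals the number of outstanding requests divided by the latency of each request. The sketch below is only an illustration. The two latencies (1 ms and 50 µs) are made-up round figures, not measurements of any product, and the function name `max_iops` is hypothetical.

```python
def max_iops(latency_us: float, outstanding_requests: int = 1) -> float:
    """Little's Law: throughput = concurrency / latency."""
    return outstanding_requests * 1_000_000 / latency_us


# Illustrative latencies only (assumptions, not vendor figures).
for label, latency_us in [("1 ms device", 1_000), ("50 us device", 50)]:
    print(label, max_iops(latency_us), "IOPS with one request in flight")
```

With a single request in flight, the only way to serve more requests per second is to make each one finish sooner, which is why latency rather than raw IOPS becomes the interesting number.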

A few weeks ago, I was lost for words when I had a cup of coffee with an IT administrator who just migrated to a Fusion-io based VDI solution. He looked at me and said, “You guys have given us the psychic desktop. Thank you for that.” Not sure whether that was a good thing, I looked at him and asked him what he meant. He smiled and explained that one of his company’s users had come into the IT department to ask what they had done to his computer. The previous weekend, IT had migrated his virtual desktop from a storage array to the new Fusion-io based infrastructure. The user was very surprised. He said he had the feeling that his applications were opening so fast, it felt like the desktop knew where he was going to double-click before he actually did.

Over time, this employee got used to clicking on something and waiting for five or six seconds before something actually happened on his screen. He was shocked to see his complete Inbox in less than half a second after he clicked the e-mail icon. The user was convinced the company was running new software that gave his desktop some sort of future-telling capabilities. His question was actually serious, but at that moment, the IT department’s psychic desktop jokes were born.

Jerry Rijnbeek

Sales Engineer, Netherlands

 

 


Ten Things to Consider When Building a Storage System for Virtualization

July 5, 2012

Virtualization is quickly taking over data centers. Gone are the days when IT admins worried about managing operating systems running directly on physical server hardware. The manageability and cumulative performance advantages of virtualization have led to a growing trend where consumer operating systems like Microsoft Windows are run within virtual machines. These virtual machines are managed by a hypervisor (such as VMware’s vSphere) that mediates access to the physical hardware in the server node. Clusters of such server nodes are being put together to host several hundred or even thousands of virtual machines. Such clusters afford high availability and load balancing by permitting migration of virtual machines between server nodes.

Just like the rest of the physical hardware, the hypervisor also virtualizes the underlying storage for the virtual machines. Thus, each virtual machine may present a virtualized SCSI disk to the guest operating system running inside it. The data written to this virtualized disk cannot be simplistically mapped to an underlying physical disk. This is because this data needs to remain accessible once the virtual machine is migrated to another server node (e.g., upon a hardware failure). A sophisticated storage subsystem is therefore needed, one that can keep the data accessible despite the movement of virtual machines across server nodes.

The Nutanix Complete Cluster offers such a sophisticated storage subsystem that was designed specifically for virtualization workloads. This storage subsystem can be accessed by the hypervisor through industry standard iSCSI/NFS protocols. This blog talks about 10 key considerations that went into the design of this storage subsystem.

Elegance of Nutanix Design

1. Converged and distributed: Hardware trends in the past ten years indicate that disk capacities and speeds are growing at a much faster pace than network speeds. A cost-effective solution needs to be converged to leverage these trends – i.e., the storage needs to be placed close to the computation that accesses that storage and not across an expensive network fabric. The Nutanix offering epitomizes this by building a distributed storage subsystem using the local disks in the server nodes themselves. This is in sharp contrast to the single-headed SAN/NAS solutions that require expensive networking to deliver the high performance required by server clusters running virtual machines.

30 year old legacy design, the linchpin of our datacenters

2. Incremental scalability: As compute/storage needs grow, it should be possible to grow the system incrementally rather than requiring a complete hardware refresh as is typical with centralized SAN/NAS solutions. The Nutanix Complete Cluster is designed to be incrementally scalable, with no single point of bottleneck. While near linear scalability has been demonstrated in a cluster of 50 nodes, the design affords limitless scalability.

3. Performance: A storage system that treats performance as an afterthought opens itself up to one or more expensive architectural redesigns. The Nutanix Complete Cluster was designed to deliver high performance from the very outset. It combines traditional wisdom in distributed system design with new techniques to deliver high performance. These include a pipelined architecture, asynchronous request handling, extensive caching, and judicious use of Fusion ioMemory to hold frequently accessed data as well as metadata. The design specifically caters to virtualization workloads. For example, the NFS server implementation in the Nutanix Complete Cluster was designed to deliver high data IOPS (both random and sequential) rather than high namespace IOPS (which is what outdated benchmarks like SPECsfs primarily measure). This is especially suited to virtualization, as the bulk of the IO requests from guest VMs are converted into NFS read/write requests by the hypervisor when it accesses the underlying storage subsystem through the NFS protocol.

Nutanix Direct Data Path

4. Random IO: With potentially hundreds of virtual machines simultaneously issuing IO requests, the data access patterns appear random by the time they are incident on the underlying storage system. In contrast to traditional storage subsystem designs, Nutanix was designed with the intent of delivering high random IO performance from the very start. It uses techniques such as a distributed operation log to absorb random writes, careful placement of metadata indexes in high performance SSDs for quick lookups, and extensive use of caching and deduplication to absorb boot/login storms. Recently, a 40-node Nutanix cluster successfully ran VMware’s RAWC benchmark with a record-breaking 3000 virtual machines. More details on this VDI reference architecture can be found at http://bit.ly/yN9S01.

5. Fine-grained tiering: Gone are the days when the predominant form of persistent storage was magnetic disks with similar performance characteristics. Today data can be stored on a wide variety of media (e.g., SSDs and SAS/SATA drives), each affording different capacities and performance at a given price point. The storage subsystem in Nutanix recognizes these as separate tiers of storage and places data on them based on its temperature. Thus, hot data is placed on the faster SSDs while colder data might be placed on the slower SATA drives. As the temperature of data changes, the Nutanix Complete Cluster supports water-falling of data between tiers. To avoid polluting the SSDs with cold data, data is divided up into fine-grained units of a few megabytes that form the basis of data placement and migration. Such fine-grained management of data across tiers also enables Nutanix to quickly adapt to changing workloads.
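As a rough illustration of temperature-based placement (not the actual Nutanix algorithm, which the post does not describe in detail), the sketch below ranks small data units by a recent-access count and fills the SSD tier with the hottest ones. The names `Extent`, `recent_accesses`, and `place_extents`, and the use of an access count as "temperature", are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    extent_id: int
    recent_accesses: int  # hypothetical stand-in for "temperature"


def place_extents(extents, ssd_capacity):
    """Put the hottest extents on SSD until it is full; the rest go to SATA."""
    by_heat = sorted(extents, key=lambda e: e.recent_accesses, reverse=True)
    return {
        e.extent_id: ("ssd" if rank < ssd_capacity else "sata")
        for rank, e in enumerate(by_heat)
    }


extents = [Extent(1, 90), Extent(2, 3), Extent(3, 40), Extent(4, 0)]
print(place_extents(extents, ssd_capacity=2))
```

Re-running the placement as access counts change moves units between tiers. Because the unit of placement is small, one hot region of a large virtual disk does not drag the cold remainder onto SSD with it.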

Information Lifecycle Management

6. Consistency model: The Nutanix Complete Cluster can manage petabytes of data written by guest VMs. Just like other storage subsystems, it maintains metadata to enable the quick location of any data. Since losing data or returning stale data is not an acceptable option, a strict consistency model is supported. While relational database abstractions such as transactions can be used to implement strict consistency, this approach is known to be unscalable and slow. On the other hand, typical NoSQL approaches that maintain structured information as a set of key/value pairs are known to be highly performant, but typically only afford eventual consistency. The Nutanix Complete Cluster adopts a novel two-fold approach for delivering high performance despite supporting strict consistency. First, the metadata is kept in a NoSQL key/value store that was enhanced with the Paxos algorithm to provide strict consistency for updates of any given key’s value. Second, all metadata operations involving multiple keys are carefully sequenced so as to keep the overall metadata tree completely consistent at all times. This approach provides the best of both worlds – delivering high performance while supporting strict consistency.

7. Congestion management: Every major function in the Nutanix Complete Cluster is handled by a different component. A key aspect of the design is that flow/congestion control is built into each of these components. Without proper congestion management, a distributed system can come to a grinding halt by entering situations where useful work can no longer be done. As an example, the component that manages writes to a disk might become clogged with requests. As a result, a remote sender may time out its outstanding requests to the congested component and re-send them – causing further congestion. To avoid such situations, every component in the Nutanix Complete Cluster exerts appropriate flow control to ensure it accepts only as many requests as it can reasonably execute. In addition, stale or low priority requests are quickly dropped when congestion is detected.
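The two ideas in this item, accepting only as much work as can be executed and dropping stale requests, can be sketched with a bounded queue. This is a minimal illustration under assumed names (`AdmissionQueue`, `max_pending`, `max_age_s`), not the cluster's actual implementation. The clock is passed in explicitly so the behaviour is easy to follow.

```python
import collections


class AdmissionQueue:
    """Bounded request queue that pushes back when full and skips stale work."""

    def __init__(self, max_pending, max_age_s):
        self.max_pending = max_pending
        self.max_age_s = max_age_s
        self._pending = collections.deque()  # entries are (enqueue_time, request)

    def offer(self, request, now):
        """Accept the request if there is room; otherwise tell the sender to back off."""
        if len(self._pending) >= self.max_pending:
            return False
        self._pending.append((now, request))
        return True

    def take(self, now):
        """Return the next request that is still fresh, discarding stale ones."""
        while self._pending:
            enqueued_at, request = self._pending.popleft()
            if now - enqueued_at <= self.max_age_s:
                return request
        return None
```

A sender that gets `False` back from `offer` knows immediately that it should slow down, rather than discovering the congestion through a timeout and making it worse with a retry.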

8. Designed for high availability: A highly available storage subsystem does not have the luxury of going offline when a few of its components fail. These components might be either software or hardware. The storage subsystem in the Nutanix Complete Cluster was designed for fault tolerance. There is no single point of failure, and any component can fail and stay down for extended periods of time. Thus, any disk, node, network card, etc. may fail without affecting availability. All data is both replicated and checksummed to protect against faults. The number of replicas kept for the data is configurable – thus permitting simultaneous failure of one or more components without sacrificing availability.
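To show how replication and checksums work together, here is a small sketch in which plain dictionaries stand in for disks. The helper names `write_replicated` and `read_verified` and the choice of CRC-32 are assumptions for illustration. The post does not say which checksum the product uses.

```python
import zlib


def write_replicated(stores, key, data, replication_factor=2):
    """Store the data plus its checksum on the first `replication_factor` stores."""
    record = (zlib.crc32(data), data)
    for store in stores[:replication_factor]:
        store[key] = record


def read_verified(stores, key):
    """Return the first replica whose checksum still matches its data."""
    for store in stores:
        record = store.get(key)
        if record is None:
            continue  # replica missing, e.g. the disk is gone
        checksum, data = record
        if zlib.crc32(data) == checksum:
            return data
    raise IOError(f"no intact replica found for {key!r}")
```

A read that hits a missing or corrupted copy simply moves on to the next replica, so a single failure never surfaces to the guest VM.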

Anatomy of A Write IO; 10,000 ft. view

9. Replication fan-out: Distributed storage subsystems are often designed by mirroring one disk onto another. With disk capacities running into terabytes, this implies that failure of one disk would require reading all the data from the other healthy disk in order to restore replication. Not only does this create a hot-spot in the system by making one disk the bottleneck while others might be idle, it also increases the chances of data loss because the intense workload on the healthy disk might also cause it to fail. The Nutanix Complete Cluster avoids this by replicating each unit of data (comprising a few megabytes) on a disk to a random disk in the rest of the cluster. On a disk failure, the corresponding replicas can be read to restore replication – the restored second copy can also be placed on any disk in the cluster. Thus, recovering from a failed disk utilizes all of the cluster’s resources and avoids the formation of any hot-spots.
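The difference between mirroring and fan-out can be seen in a toy simulation. The function below counts how many data units each surviving disk has to supply after disk 0 fails. The disk count, the unit count, and the name `rebuild_sources` are illustrative assumptions, and the uniform random choice is only a stand-in for whatever placement policy the real system uses.

```python
import random
from collections import Counter


def rebuild_sources(num_disks, units_on_failed_disk, fan_out, seed=0):
    """Count the units each surviving disk must supply after disk 0 fails."""
    rng = random.Random(seed)
    survivors = list(range(1, num_disks))
    load = Counter()
    for _ in range(units_on_failed_disk):
        # Mirroring keeps every replica on disk 1; fan-out picks any survivor.
        replica_disk = rng.choice(survivors) if fan_out else 1
        load[replica_disk] += 1
    return load


mirrored = rebuild_sources(num_disks=20, units_on_failed_disk=1000, fan_out=False)
fanned = rebuild_sources(num_disks=20, units_on_failed_disk=1000, fan_out=True)
print("mirroring: busiest disk supplies", max(mirrored.values()), "units")
print("fan-out:   busiest disk supplies", max(fanned.values()), "units")
```

With mirroring, one disk serves the entire rebuild. With fan-out, the same work is spread thinly across the surviving disks, so no single disk becomes the hot spot.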

10. Continuous healing: Nutanix’s highly available storage subsystem cannot freeze to run a data consistency check (akin to the fsck found in Unix filesystems). The distributed nature of the system, coupled with the petabytes of data it can potentially manage, implies that faults will happen sooner or later – for example due to failed components. To discover and recover from such problems, the Nutanix Complete Cluster continuously heals itself by running a MapReduce over its metadata and taking appropriate corrective measures based on the issues found. For example, if a data unit is found to be under-replicated due to a failed component, a re-replication will be kicked off for that data unit. The MapReduce computation runs as a low-priority background job so as to not affect the performance of higher-priority IO requests emanating from the guest VMs. The use of MapReduce, which is widely used in Big Data analytics today, lends the Nutanix Complete Cluster the scalability to manage large amounts of data, while affording high availability at the same time.
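The under-replication check described above maps naturally onto a map step and a reduce step. The sketch below uses plain Python functions in place of a real MapReduce framework, and the metadata layout (a dictionary from data unit to the disks holding its replicas) is an assumption made for the example, as are all the function names.

```python
from collections import defaultdict


def map_phase(metadata, live_disks):
    """Emit (unit_id, 1) for each replica on a live disk, and (unit_id, 0) otherwise."""
    for unit_id, replica_disks in metadata.items():
        for disk in replica_disks:
            yield unit_id, int(disk in live_disks)


def reduce_phase(pairs):
    """Sum the live-replica counts per data unit."""
    counts = defaultdict(int)
    for unit_id, live in pairs:
        counts[unit_id] += live
    return counts


def find_under_replicated(metadata, live_disks, replication_factor=2):
    """List the units with fewer live replicas than the target."""
    counts = reduce_phase(map_phase(metadata, live_disks))
    return sorted(u for u, n in counts.items() if n < replication_factor)


metadata = {"u1": ["d1", "d2"], "u2": ["d2", "d3"], "u3": ["d1", "d3"]}
print(find_under_replicated(metadata, live_disks={"d1", "d3"}))
```

Each unit that comes out of the scan would then be handed to a background task that creates a fresh replica on a healthy disk.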

To summarize, the Nutanix Complete Cluster bridges the gap between computation and storage by converging these in a compact rackable unit, one or more of which can be stacked together to build a powerful virtualization appliance. The new demands imposed by virtualization workloads required an architecture that was built from the ground up to meet these requirements. The yardsticks of availability, performance, and scalability indicate that the Nutanix Complete Cluster is delivering on its promise, and is stretching the horizons of what was earlier possible in the realm of virtualization. Despite everything that has been delivered so far, there are a lot more exciting things in the pipeline. So stay tuned.


Fusion-io SDK gives developers native memory access, keys to the NAND realm

June 1, 2012

Thought your SATA SSD chugged along real nice? Think again. Fusion-io has just released an SDK that will allow developers to bypass all the speed-draining bottlenecks that rob NAND memory of its true potential (i.e., the kernel block I/O layer) and tap directly into the memory itself. In fact, Fusion-io is so confident of its products’ abilities that it prefers to call them ioMemory Application Accelerators rather than SSDs. The SDK allows developers native access to the ioMemory, meaning applications can benefit from the kind of hardware integration you might get from a proprietary platform. The principle has already been demonstrated earlier this year, when Fusion-io delivered one billion IOPS using this native access. The libraries and APIs are available now to registered members of its developer program.

