Chapter 1. The Evolution of Storage

Joe Arnold

In 2010, OpenStack Object Storage, code-named Swift, was launched as an inaugural project within the OpenStack cloud computing project. OpenStack Swift was a thoughtful and creative response to the unprecedented and precipitous growth in data. It introduced the world to an open-source storage system proven to run at extremely large scale.

The timing couldn’t have been more perfect. Just as the growth in online data was taking off, software-defined storage (SDS) with systems such as Swift was being developed. Object storage and SDS are logical next steps in the evolution of storage.

But before getting into Swift (in Chapter 2), we should examine the boom in unstructured data, its particular storage requirements, and how storage has evolved to include object storage and SDS. We will also explain how object storage compares to the more familiar block and file storage.

Storage Needs for Today’s Data

In this era of connected devices, demands on storage systems are increasing exponentially. Users are producing and consuming more data than ever, with social media, online video, user-uploaded content, gaming, and Software-as-a-Service applications all contributing to the vast need for easily accessible storage systems that can grow without bounds. A wide spectrum of companies and institutions is facing ever greater storage demands. In the geological and life sciences, machine-generated data is much more valuable when it is accessible. Video productions are taping more hours at higher resolution than ever before. Enterprises are capturing more data about their projects, and employees expect instant access and collaboration.

What is all this data? The majority is “unstructured” data. This means that the data does not have a predefined data model and is typically stored as a file as opposed to an entry in a database (which would be structured data). Much of this unstructured data is the ever-proliferating images, videos, emails, documents, and files of all types. This data is generated by consumers on billions of devices with more coming online each year. Data is also being generated by the many Internet-connected devices such as sensors and cameras across all types of industry.

The Growth of Data: Exabytes, Hellabytes, and Beyond

To understand the scale of this growth in data, consider that according to research conducted by the International Data Corporation (IDC) in 2013, worldwide installed raw storage capacity (byte density) will climb from 2,596 exabytes (EB) in 2012 to a staggering 7,235 EB in 2017.[1] Stored data is continuing to grow at ever faster rates, leading IDC to estimate that by 2020 the amount of data in the world will reach 35,840 exabytes. Divided by the world’s population, that is roughly 4 terabytes per person.

By now you might be wondering just how big an exabyte is. It’s an almost unimaginably large quantity, but let’s try. It’s equivalent to about one thousand petabytes (PB), one million terabytes (TB), or one billion gigabytes (GB). These figures will be meaningful to some, but if they’re not, we’ll contextualize this a bit more. An average book takes up about 1 megabyte. The largest library in the world, the Library of Congress, has about 23 million volumes, which total approximately 23 TB of data. So it would take over 43,000 libraries the size of the Library of Congress to generate an exabyte. Another example would be high-resolution photos, which are roughly 2 MB in size; an exabyte would be about 500 billion high-resolution photos.
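The arithmetic behind these comparisons can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units and the round sizes quoted above (1 MB per book, 2 MB per photo), and the variable names are ours.

```python
# Decimal (SI) units, matching the "one billion gigabytes" definition above.
MB = 10**6
TB = 10**12
EB = 10**18

BOOK_BYTES = 1 * MB            # assumed average book size, from the text
LIBRARY_VOLUMES = 23_000_000   # Library of Congress volumes, from the text
PHOTO_BYTES = 2 * MB           # assumed high-resolution photo size

library_bytes = BOOK_BYTES * LIBRARY_VOLUMES
libraries_per_exabyte = EB // library_bytes
photos_per_exabyte = EB // PHOTO_BYTES

print(library_bytes // TB)      # 23 (terabytes per library)
print(libraries_per_exabyte)    # 43478
print(photos_per_exabyte)       # 500000000000
```

Using binary units (2^60 bytes per exabyte) instead would shift these counts by a few percent, which does not change the picture.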

Storage capacity continues to grow, necessitating new labels for ever larger quantities of data. Beyond an exabyte (10^18 bytes), we have the zettabyte (10^21 bytes) and yottabyte (10^24 bytes), which follow the prefixes offered by the International System of Units (SI). In response both to the rapid growth of data and the proliferation of terms, some have suggested that we simply use “hellabyte,” for “hell of a lot of bytes.”

Requirements for Storing Unstructured Data

Unstructured data often needs to be stored in a way that ensures durability, availability, and manageability—all at low cost.

Durability
Durability is the extent to which a storage system can guarantee that data will never be permanently lost or corrupted, in spite of individual component failures (such as the loss of a drive or a computer). Data is arguably the most critical (and often irreplaceable) thing we create. Many types of data must be stored forever. In a data center, most unstructured data needs to be kept for a very long time to meet customer expectations, legal and regulatory requirements, or both.
Availability
Availability refers to a storage system’s uptime and responsiveness in the face of individual component failures or heavy system load. Unstructured data usually needs to be available in an instant across a variety of devices regardless of location; users want to access their data on their mobile devices, laptops at home, and desktops at work. Although some data can be archived, many users expect most of their data to be immediately available.
Manageability
Manageability—the level of effort and resources required to keep the system functioning smoothly in the long term—can easily be an issue with small storage systems, or even with a personal computer. The concept can include personnel, time, risk, flexibility, and many other considerations that are difficult to quantify. With larger storage systems coming online, manageability becomes critical. Storage should be easy to manage. A small number of administrators should be able to support a large number of storage servers.
Low cost
Unstructured data needs to be stored at low cost. With enough money, any storage problem can be solved. However, we live in a world of constraints. Business models and available budgets require low-cost data storage solutions. Many factors need to be accounted for in a system’s overall cost. These include initial up-front expenses, such as the cost of acquiring hardware. But there are also ongoing costs that might need to be factored in, such as additional software licenses, personnel costs to manage the system, and data center costs such as power and bandwidth.

No One-Size-Fits-All Storage System

Although it would be great if there were a one-size-fits-all solution for the large amounts of mostly unstructured data the world is generating, there isn’t. Storage systems entail trade-offs that we can think of as responses to their particular requirements and circumstances.

The CAP theorem, first advanced by Eric Brewer (University of California at Berkeley professor, Computer Science Division) in 2000, succinctly frames the problem. It states that distributed computer systems cannot simultaneously provide all three of the following:

Consistency
All clients see the same version of the data at the same time.
Availability
When you read or write to the system, you are guaranteed to get a response.
Partition tolerance
The system works when the network isn’t perfect.

Because these are incompatible, you have to choose the two that are most important for your particular circumstances when implementing a distributed computer system.[2] Because partition tolerance isn’t really optional in a distributed system, this means partition tolerance will be paired with either consistency or availability. If the system demands consistency (for example, a bank that records account balances), then availability needs to suffer. This is typically what is needed for transactional workloads such as supporting databases. On the other hand, if you want partition tolerance and availability, then you must tolerate the system being occasionally inconsistent. Although purpose-built storage systems offer an operator more reliability for a particular workload than a general-purpose storage system designed to support all workloads, if a system purports to do all three equally well you should take a closer look. There are always trade-offs and sacrifices.

Following the CAP theorem, Swift trades consistency for eventual consistency to gain availability and partition tolerance. This means Swift is up to the task of handling the workloads required to store large amounts of unstructured data. The term “eventual consistency” refers to a popular way of handling distributed storage. It doesn’t meet the ideal of providing every update to every reader before the writer is notified that a write was successful, but it guarantees that all readers will see the update in a reasonably short amount of time.

This allows Swift to be very durable and highly available. These trade-offs and the CAP theorem are explored in greater depth in Chapter 5.
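To make the idea of eventual consistency concrete, here is a toy model in Python. It is a sketch under our own assumptions (three replicas, a majority rule for acknowledging writes, a logical clock where the newest write wins) and is not a description of how Swift itself is implemented; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data; `online` models a network partition or fault."""
    online: bool = True
    data: dict = field(default_factory=dict)  # name -> (timestamp, value)


class ToyCluster:
    """Hypothetical three-replica store that favors availability."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.clock = 0  # logical timestamp; the newest write wins

    def write(self, name, value):
        """Store on every reachable replica; succeed if a majority accepted."""
        self.clock += 1
        accepted = 0
        for replica in self.replicas:
            if replica.online:
                replica.data[name] = (self.clock, value)
                accepted += 1
        return accepted > len(self.replicas) // 2

    def read(self, replica_index, name):
        """Read from one replica, which may be stale."""
        entry = self.replicas[replica_index].data.get(name)
        return entry[1] if entry else None

    def replicate(self):
        """Background pass: copy the newest version of each item everywhere."""
        newest = {}
        for replica in self.replicas:
            for name, entry in replica.data.items():
                if name not in newest or entry[0] > newest[name][0]:
                    newest[name] = entry
        for replica in self.replicas:
            if replica.online:
                replica.data.update(newest)


cluster = ToyCluster()
cluster.write("photo.jpg", "v1")
cluster.replicas[2].online = False       # replica 2 becomes unreachable
ok = cluster.write("photo.jpg", "v2")    # still succeeds: 2 of 3 accepted
cluster.replicas[2].online = True        # the partition heals
stale = cluster.read(2, "photo.jpg")     # replica 2 still holds the old value
cluster.replicate()
fresh = cluster.read(2, "photo.jpg")     # the update has now arrived
print(ok, stale, fresh)                  # True v1 v2
```

The write stays available during the partition, at the price of one replica briefly serving an older version. A system that demanded consistency would instead have to refuse the write or the stale read.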

Object Storage Compared with Other Storage Types

Different types of data have different access patterns and therefore can be best stored on different types of storage systems. There are three broad categories of data storage: block storage, file storage, and object storage.

Block storage
This stores structured data, which is represented as equal-size blocks (say, 2^12 bits per block) without putting any interpretation on the bits. Often, this kind of storage is useful when the application needs to tightly control the structure of the data. A common use for block storage is databases, which can use a raw block device to efficiently read and write structured data. Additionally, filesystems are used to abstract a block device, which then does everything from running operating systems to storing files.
File storage
This is what we’re most used to seeing as desktop users. In its simplest form, file storage takes a hard drive (like the one on your computer) and exposes a filesystem on it for storing unstructured data. You see the filesystem when you open and close documents on your computer. A data center contains systems that expose a filesystem over a network. Although file storage provides a useful abstraction on top of a storage device, there are challenges as the system scales. File storage needs strong consistency, which creates constraints as the system grows and is put under high demand. In addition, filesystems often require other features (such as file locking) that create a barrier for working well with large amounts of data.
Object storage
This will be familiar to those who regularly access the Internet or use mobile devices. Object storage doesn’t provide access to raw blocks of data; nor does it offer file-based access. Instead, it provides access to whole objects or blobs of data, generally through an API specific to that system. Objects are accessible via URLs using HTTP, similar to how websites are accessible in web browsers. Object storage abstracts storage locations as URLs so that the storage system can grow and scale independently from the underlying storage mechanisms. This makes object storage ideal for systems that need to grow and scale for capacity, concurrency, or both.
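As a rough illustration of object access over HTTP, the following Python sketch builds (but does not send) a request that would store a whole object at a URL. The endpoint, the account/container/object URL layout, the `X-Auth-Token` header, and all names are assumptions made for this example; check the API documentation of the system you are using before relying on any of them.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://storage.example.com/v1"


def object_url(account, container, obj):
    """Build the URL that names one object; each part is percent-encoded."""
    return "/".join([
        ENDPOINT,
        quote(account, safe=""),
        quote(container, safe=""),
        quote(obj),  # slashes inside the object name are kept as-is
    ])


def upload_request(url, body, token):
    """Prepare (but do not send) an HTTP PUT that stores a whole object."""
    return Request(url, data=body, method="PUT",
                   headers={"X-Auth-Token": token})


url = object_url("AUTH_demo", "vacation photos", "2014/beach.jpg")
request = upload_request(url, b"...image bytes...", token="example-token")
print(request.get_method(), request.full_url)
```

The client only ever names the object by its URL. Which drives or servers end up holding the bytes is left entirely to the storage system.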

One of the main advantages of object storage is its ability to distribute requests for objects across a large number of storage servers. This provides reliable, scalable storage for large amounts of data at a relatively low cost.

As the system scales, it can continue to present a single namespace. This means an application or user doesn’t need to—and some would say shouldn’t—know which storage system is going to be used. This reduces operator burden, unlike a filesystem where operators might have to manage multiple storage volumes. Because an object storage system provides a single namespace, there is no need to break data up and send it to different storage locations, which can increase complexity and confusion.
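One simple way to picture how a single namespace can be spread over many servers is to derive the storage locations from a hash of the object's name. The Python sketch below does this with plain modulo hashing. It is a deliberately naive stand-in with hypothetical server names, not the placement scheme of any particular product; production systems use more careful schemes so that adding a server does not reshuffle most of the data.

```python
import hashlib

# Hypothetical server names; a real deployment would have many more.
SERVERS = ["storage-01", "storage-02", "storage-03", "storage-04"]


def servers_for(object_path, replicas=3):
    """Pick `replicas` distinct servers from a hash of the object's name."""
    digest = hashlib.sha256(object_path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(replicas)]


placement = servers_for("AUTH_demo/photos/beach.jpg")
print(placement)
```

Because the placement is computed from the name alone, every client and every router arrives at the same answer without consulting a central directory, and the application never has to choose a volume.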

A New Storage Architecture: Software-Defined Storage

The history of data storage began with hard drives connected to a mainframe. Then storage migrated off the mainframe to separate, dedicated storage systems with in-line controllers. However, the world keeps changing. Applications are now much larger. This means their storage needs have pushed beyond what the architecture of an in-line storage controller can accommodate.

Older generations of storage often ran on custom hardware and used closed software. Typically, there were expensive maintenance contracts, difficult data migration, and a tightly controlled ecosystem. These systems needed tight controls to predict and prevent failures.

The scale of unstructured data storage is forcing a sea change in storage architecture, and this is where SDS enters our story. It represents a huge shift in how data is stored. With SDS, the entire storage stack is recast to best meet the criteria of durability, availability, low cost, and manageability.

SDS places responsibility for the system in the software, not in specific hardware components. Rather than trying to prevent failures, SDS accepts failures in the system as inevitable. This is a big change. It means that rather than predicting failures, the system simply works through or around those failures.

Unstructured data is starting to rapidly outpace structured data in both total storage and revenue. SDS solutions offer the best way to store unstructured data. By providing a way to deal with failures, it becomes possible to run your system on standard and open-server hardware—the kind that might fail occasionally. If you’re willing to accept this, you can easily add components to scale the system in increments that make sense for your specific needs. When whole systems can be run across mix-and-match hardware from multiple vendors—perhaps purchased years apart from each other—then migration becomes less of an issue.

This means that you can create storage that spans not just one rack, one networking switch, or even just one data center, but serves as a single system over large-scale, private, corporate networks or even the Internet. That is a powerful defense against the deluge of data that many businesses are experiencing.

Software-Defined Storage Components

An SDS system separates the intelligence and access from the underlying physical hardware. There are four components of an SDS system:

Storage routing

The storage routing layer acts as a gateway to the storage system. Pools of routers and services to access the routers can be distributed across multiple data centers and geographical locations. The router layer scales out with each additional node, allowing for more capacity for data access.

The routers in an SDS system can route storage requests around hardware and networking faults. When there is a hardware failure, the system applies simple rules to service the request by assembling the necessary data chunks or retrieving copies of the data from non-failed locations.

The processes in an SDS system account for access control, enable supported protocols, and respond to API requests.
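The idea of routing around a failed node can be sketched in a few lines of Python. This is an illustrative toy, assuming that whole copies of an object live on several nodes; the node names, the `NodeDown` exception, and the fetch function are all hypothetical.

```python
class NodeDown(Exception):
    """Raised by the toy fetch function when a storage node is unreachable."""


def read_with_failover(name, nodes, fetch):
    """Try each node that should hold a copy; return the first good answer."""
    errors = []
    for node in nodes:
        try:
            return fetch(node, name)
        except NodeDown as exc:
            errors.append(f"{node}: {exc}")
    raise NodeDown("no copy reachable; " + "; ".join(errors))


# Toy cluster state: storage-01 has failed, the other copies are healthy.
COPIES = {
    "storage-02": {"report.pdf": b"pdf bytes"},
    "storage-03": {"report.pdf": b"pdf bytes"},
}


def toy_fetch(node, name):
    """Stand-in for a network call to one storage node."""
    if node not in COPIES:
        raise NodeDown("unreachable")
    return COPIES[node][name]


data = read_with_failover(
    "report.pdf", ["storage-01", "storage-02", "storage-03"], toy_fetch)
print(data)  # b'pdf bytes'
```

The request succeeds even though the first node is down; the client sees only a slightly slower response.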

Storage resilience

In an SDS system, the ability to recover from failures is the responsibility of the software, not the hardware. Various data protection schemes are used to ensure that data is not corrupted or lost.

There can be separate processes running on the system to continuously audit the existing data and measure how well the data is protected across multiple storage nodes. If data is found to be corrupt or not protected enough, proactive measures can be taken by the system.
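A background audit of this kind might look roughly like the Python sketch below. It assumes that a checksum recorded at write time is available for comparison and that three copies are wanted; the function names and node names are hypothetical, and a real auditor would read data from disk rather than from a dictionary.

```python
import hashlib


def checksum(data):
    """Content fingerprint recorded when the object was first written."""
    return hashlib.sha256(data).hexdigest()


def audit(expected_checksum, copies, wanted_copies=3):
    """Return (healthy_nodes, corrupt_nodes, copies_missing) for one object."""
    healthy, corrupt = [], []
    for node, data in copies.items():
        if checksum(data) == expected_checksum:
            healthy.append(node)
        else:
            corrupt.append(node)
    missing = max(0, wanted_copies - len(healthy))
    return healthy, corrupt, missing


original = b"sensor readings"
expected = checksum(original)
# One good copy, one damaged copy, and one copy absent altogether.
copies = {"storage-01": original, "storage-02": b"sensor reXdings"}

healthy, corrupt, missing = audit(expected, copies)
print(healthy, corrupt, missing)  # ['storage-01'] ['storage-02'] 2
```

With this report in hand, the system can take the proactive step of copying the healthy version to two more locations and discarding the damaged one.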

Physical hardware

Within an SDS system, the physical hardware stores the bits on disk. However, nodes storing data are not individually responsible for ensuring durability of their own data, as that is the responsibility of the storage resilience systems. Likewise, when a node is down, the storage routing systems will route around it.

Out-of-band controller

SDS systems should be efficient to manage and scale. These distributed storage systems need an alternative form of management rather than a traditional storage controller, which intercepts each storage request. Therefore, an out-of-band, external storage controller is used by operators to orchestrate members of a distributed SDS system.

A controller can dynamically tune the system to optimize performance, perform upgrades, and manage capacity. A controller can also allow faster recoveries when hardware fails and allow an operator to respond to operational events. In this way, an SDS controller can orchestrate available resources (storage, networking, routing, and services) for the entire cluster.

Benefits of Software-Defined Storage

SDS systems can effectively manage scale and drive operational efficiencies in the infrastructure. Capacity management is a lot simpler with an SDS system, because each component is a member of a distributed system. Because of this arrangement, upgrades, expansions, and decommissions can be achieved without any downtime and with no need for forklift (physical) data migration.

The separation of physical hardware from the software allows for mix-and-match hardware configurations within the same storage system. Drives of varying capacity or performance can be used in the same system, enabling incremental capacity increases. This allows for just-in-time purchasing, which lets you take advantage of the technology innovation curve and avoid deploying too much storage.

SDS solutions are also often open source, which means better standards, more tools, and the ability to avoid lock-in to a single vendor. Open source encourages a large thriving ecosystem, where the diversity of the community members drives standards and tools. Now that we’re building applications that need to be compatible with more and more devices, creating and refining standards becomes more and more important.

Why OpenStack Swift?

Swift allows for a wide spectrum of uses, including supporting web/mobile applications, backups, and active archiving. Layers of additional services let users access the storage system via its native HTTP interface, or use command-line tools, filesystem gateways, or easy-to-use applications to store and sync data with their desktops, tablets, and mobile devices.

Swift is an object storage system, which, as we have discussed, means it trades immediate consistency for eventual consistency. This allows Swift to achieve high availability, redundancy, throughput, and capacity. With a focus on availability over consistency, Swift has no transaction or locking delays. Large numbers of simultaneous reads are fast, as are simultaneous writes. This means that Swift is capable of scaling to an extremely large number of concurrent connections and extremely large sets of data. Since its launch, Swift has gained a community of hundreds of contributors, gotten even more stable, become faster, and added many great new features.

Swift can also be installed on what is often referred to as commodity hardware. This means that standard, low-cost server components can be used to build the storage system. By relying on Swift to provide the logical software management of data rather than on specialized vendor hardware, you gain incredible flexibility in the features, deployment, and scaling of your storage system. This, in essence, is what software-defined storage is all about.

But what might be most interesting is what happens “under the hood.” Swift is a fundamentally new sort of storage system. It isn’t a single, monolithic system, but rather a distributed system that easily scales out and tolerates failure without compromising data availability. Swift doesn’t attempt to be like other storage systems and mimic their interfaces, and as a result it is changing how storage works.

This comes with some constraints, to be sure, but it is a perfect match for many of today’s applications. Swift is more and more widespread and is evolving into a standard way to store and distribute large amounts of data.

Conclusion

In recent years, two major changes to storage have come about in very short order. First, the emergence of web and mobile applications has fundamentally changed data consumption and production. This first started with the consumer web and has grown quickly with increasing numbers of enterprise applications.

The second major change has been the emergence of SDS, which enables large, distributed storage systems to be built with standards-based, commodity storage servers. This has dramatically reduced the costs of deploying data-intensive applications, as there is no reliance on individual hardware components to be durable.

In the next chapter we introduce OpenStack Swift. We will more fully discuss the features and benefits of Swift. After that you’ll get to dive into the particulars of the architecture of Swift.



[2] This principle is similar to the adage that says you can only pick two: fast, cheap, or good.
