Programming Amazon Web Services
Programming Amazon Web Services S3, EC2, SQS, FPS, and SimpleDB By James Murty
March 2008
Pages: 600



Table of Contents

Chapter 1: Infrastructure in the Cloud
The World Wide Web has grown quickly over the last couple of decades to become an invaluable resource for communication, research, and entertainment. The Web has also become an open platform on which powerful services and applications can be built by established companies and newcomers alike. It is a very accessible platform that allows even small companies to create web applications and build a business without requiring the backing of a large enterprise. A person or group with some expertise, some time, and a good enough idea can create a web application that competes with the offerings of larger corporations—or even carves out an entirely new market. On the Web, the size and marketing clout of a large corporation does not guarantee it a monopoly on the attention and patronage of a global audience.
The Web is full of opportunities for companies both large and small, but the smaller companies face a difficult problem: infrastructure.
Web applications that are popular and have thousands of users require significant infrastructure to provide the high performance and smooth experience that users demand. Industrial-strength infrastructure is very expensive to buy and maintain, so smaller companies with fewer users are often forced to do without. Yet in today’s world of web publicity flash storms caused by sites such as Slashdot and Digg, the difference between a web application serving a few dozen users and serving thousands may be no more than a glowing article and a few hours’ time.
Although this kind of attention may be exactly what you hope for, unless you have invested heavily in infrastructure, your application may not survive the onslaught. On the other hand, if you spend too much money on servers, bandwidth, hosting, and the management of all this infrastructure, there will be little left to develop the application itself. A dilemma facing many small development teams is how to strike the right balance between investing in application development and funding robust and scalable infrastructure.
Amazon offers a new and compelling solution to this dilemma in the form of infrastructure web services. These services allow application developers to avoid altogether the burden of buying and maintaining physical infrastructure by making it possible to rent virtual infrastructure instead. In this book we will show you how you can build your applications on top of Amazon’s services and effectively outsource your .
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Amazon Web Services for Infrastructure
In this book, we will examine four offerings from Amazon Web Services (AWS) that provide flexible and affordable infrastructure components on which you can build industrial-strength web applications.
Amazon Simple Storage Service: S3
Amazon Simple Storage Service (S3) offers secure online storage space for any kind of data, providing an alternative to building, maintaining, and backing up your own storage systems. It makes your data accessible to any other applications or individuals you allow from anywhere on the Web. There are no limits on how much data you can store in the service, how long you can store it, or on how much bandwidth you can use to transfer or publish it.
S3 is a scalable, distributed system that stores your information reliably across multiple Amazon data centers, and it is able to serve it quickly to massive audiences. Its storage application programming interface (API) is deliberately simple and makes no assumptions about the nature of the data you are storing. This simplicity means you can maintain complete control over how your data is represented in the service.
Amazon Elastic Compute Cloud: EC2 (beta)
Amazon Elastic Compute Cloud (EC2) makes it possible to run multiple virtual Linux servers on demand, providing as many computers as you need to process your data or run your web application without having to purchase or rent physical machines. In EC2 you have full control over each server with root access to the operating system (the root user is the ultimate system administrator on Linux machines), a configurable firewall to manage network access, and the freedom to install any software you please. Once you have set up an EC2 server the way you like it, you can save it permanently as a server image. You can then launch new servers from this image to create virtual machines that are preconfigured and ready to do your bidding.
Thinking Like Amazon
Before you start building applications based on AWS, it is worthwhile to consider the thinking behind these services. What were the key goals that led Amazon to build the services in the first place? And how did these goals influence the design and implementation of the services?
Initially, the AWS infrastructure services were not conceived as products to be sold to developers external to Amazon but were instead designed to meet specific needs within Amazon’s own internal systems. It was only later that these services were opened up to the public. The key implementation details of the services are therefore intended primarily to serve Amazon’s needs and will not necessarily use the methodologies or techniques common in the rest of the industry. Appreciating the reasoning behind the architectural decisions and their implementation details can help you to adjust your expectations for the services. This, in turn, will make it easier to design applications that work well with the services’ capabilities.
Amazon’s services are designed to power the Amazon.com web site and related partner applications. The services operate as small component cogs in a large service-oriented architecture (SOA). Each service performs a specific task as simply and efficiently as possible, while the strengths of many different services are combined as required to perform complex processes and build the rich Amazon.com web pages with which we are familiar. Amazon’s SOA has been developed over many years of hard-won experience to be highly scalable to meet growing demand and be highly reliable despite the inevitable hardware and network failures that will occur in such an environment.
The AWS infrastructure services we will examine in this book were designed to fulfill specific tasks in this SOA environment. You will gain the most from the services with the fewest headaches if you design your applications to work like Amazon’s. Instead of taking the traditional approach of building a system with the expectation that everything will work as expected all the time, and that problems will be so rare that you can deal with them as an afterthought, you need to accept from the start that failures will occur, and you should design your application to deal with them. For example, you should aim to build application components that can recover from temporary network glitches, gracefully handle error conditions, and restart quickly. Try to avoid creating architectural bottlenecks that are single points of failure. Instead, share the work burden between multiple components in a service pool that can be expanded or contracted in response to demand, and ensure that each component in the group can be easily replaced.
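The advice to recover from temporary network glitches can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not code from the AWS module developed later in the book: the with_retries helper, its parameter names, and the delay values are all hypothetical.

```ruby
# Hypothetical helper (not part of the book's AWS module): runs a block and
# retries it when one of the listed error classes is raised, waiting longer
# after each failed attempt.
def with_retries(max_attempts: 3, base_delay: 0.5, retry_on: [StandardError])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    # Give up and re-raise the original error once the attempts are used up.
    raise if attempt >= max_attempts
    # Exponential backoff: base_delay, then 2x, then 4x, and so on.
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A caller would wrap any operation that may fail temporarily, for example with_retries(retry_on: [IOError]) { send_request }, where send_request stands in for your own code. Permanent errors should not be listed in retry_on, because retrying them only delays the failure.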
Reality Check
It should go without saying that the infrastructure services provided by AWS will not be suitable for every circumstance or application. There are a number of things you need to consider carefully before deciding whether a full or partial move to Amazon’s virtual infrastructure is appropriate in your situation. In this section we will briefly discuss some of the common objections to making such a move and suggest counter-arguments to these objections. Our aim is not to persuade you one way or another, but to raise the issues you need to consider before you make up your own mind.
The infrastructure provided by AWS is only available when you have a working Internet connection and a clear network path to the services. It is vital to have a high-speed Internet connection. If you have an intermittent or slow connection to the Internet, these services will not be a practical option. Even with a fast connection, with the fragility of networking hardware and the vagaries of Internet traffic routing, it is likely that sooner or later you will be unable to reach AWS for a brief period of time. If your application is completely dependent on AWS, this could result in downtime for your application and disruption for your customers. With these kinds of issues, it may not be possible for Amazon to help, especially if the problem is caused by network resources outside its control.
Amazon seeks to be resistant to traffic routing problems by making its services accessible from multiple data center locations. These data centers are generally located in the United States, though S3 is also available from data centers located in Europe. Overall, for any web-based resource, there is always the possibility of losing connectivity, and you need to take this risk into account when planning your application.
Interfaces: REST and Query Versus SOAP
AWS infrastructure services are made available through three separate APIs: REST, Query, and SOAP. In this book we will focus only on the REST and Query APIs and will not demonstrate how to use the SOAP APIs. We have a number of reasons for doing this, reasons which will become clearer after a brief explanation of the differences between the interfaces.
REST interfaces
The REST interfaces offered by AWS use only the standard components of HTTP request messages to represent the API action that is being performed. These components include:
  • HTTP method: describes the action the request will perform
  • Uniform Resource Identifier (URI): path and query elements that indicate the resource on which the action will be performed
  • Request Headers: pieces of metadata that provide more information about the request itself or the requester
  • Request Body: the data on which the service will perform an action
Web services that use these components to describe operations are often termed RESTful services, a categorization for services that use the HTTP protocol as it was originally intended.
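To make the four components concrete, the following sketch builds (but does not send) a request with Ruby's standard Net::HTTP library. The build_put_request helper and the example URI, header, and body values are assumptions chosen for illustration; they are not part of any AWS API.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: assembles a PUT request from the four components
# described above. Nothing is transmitted; the request object is returned.
def build_put_request(uri_string, body, headers = {})
  uri = URI.parse(uri_string)
  # HTTP method (PUT) and the path/query portion of the URI
  request = Net::HTTP::Put.new(uri.request_uri)
  # Request headers: metadata about the request
  headers.each { |name, value| request[name] = value }
  # Request body: the data the service will act upon
  request.body = body
  request
end

request = build_put_request(
  'http://s3.amazonaws.com/bucket-name/object-name',
  'Hello',
  'Content-Type' => 'text/plain'
)
puts "#{request.method} #{request.path}"
```

A real AWS request built this way would still be rejected until it is signed.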
Query interfaces
The Query interfaces offered by AWS also use the standard components of the HTTP protocol to represent API actions; however, these interfaces use them in a different way. Query requests rely on parameters, simple name and value pairs, to express both the action the service will perform and the data the action will be performed on. When you are using a Query interface, the HTTP envelope serves merely as a way of delivering these parameters to the service.
To perform an operation with a Query interface, you can express the parameters in the URI of a GET request, or in the body of a POST request. The method component of the HTTP request merely indicates where in the message the parameters are expressed, while the URI may or may not indicate a resource to act upon.
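The two ways of delivering Query parameters can be sketched with Ruby's standard library. The endpoint http://service.example.com/ and the parameter names and values below are placeholders invented for this example, not a real AWS operation, and the helper method names are likewise hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: expresses the parameters in the URI of a GET request.
def query_get_uri(endpoint, params)
  uri = URI.parse(endpoint)
  uri.query = URI.encode_www_form(params)
  uri
end

# Hypothetical helper: expresses the same parameters in the body of a POST
# request. Nothing is transmitted; the request object is returned.
def query_post_request(endpoint, params)
  uri = URI.parse(endpoint)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(params) # form-encodes the pairs into the body
  request
end

params = { 'Action' => 'ExampleAction', 'Name' => 'value' }
puts query_get_uri('http://service.example.com/', params)
puts query_post_request('http://service.example.com/', params).body
```

In both cases the same name and value pairs reach the service; only their position in the HTTP message differs.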
Chapter 2: Interacting with Amazon Web Services
In this chapter, we will show you how to create and send HTTP requests that will be understood by the Amazon Web Services (AWS) infrastructure. To demonstrate how to communicate with the AWS REST and Query interfaces, we will build a communications library in Ruby. In later chapters we will build client programs that work with the API interface for each AWS infrastructure service. These service clients will take advantage of the low-level communication library we present in this chapter.
To interact with AWS, our communications library must create HTTP request messages that describe the actions to perform, and it must provide the data the service will operate on. The library will send request messages to the designated service, wait for a response, determine whether the request was successful, and pass the response back to the client for further processing. Although the infrastructure services have very different capabilities and applications, at the HTTP communication level they work in much the same way and will reuse the same library functionality.
REST-Based APIs
Amazon’s S3, EC2, SQS, FPS, and SimpleDB services are made available via two application programming interfaces (APIs) that are based on the standard features of the HTTP protocol: the REST and Query interfaces. Each service’s API defines the structure and content of the HTTP request messages the service will understand and the response messages it will return.
We will assume that our readers are at least somewhat familiar with the HTTP protocol and the process for sending and receiving HTTP requests. Because it is always worthwhile to brush up on the basics before tackling the harder stuff, we will run through the briefest of HTTP guides.
HTTP requests are made up of four main components:
URI
The Uniform Resource Identifier (URI) in an HTTP request identifies the resource the request will act upon. A URI can include four components: the protocol that will be used to transmit the request, the host name to which the request will be sent, a path that identifies a specific resource, and an optional query component that allows additional request parameters to be specified.
HTTP method
The HTTP method describes the kind of action the service will perform when it processes the request. AWS does not use the full range of HTTP methods available. In this book we will only address the following five methods:
GET
Retrieves all the information available from a resource as specified by the request’s URI, including both metadata and data content information. Metadata is returned as response headers, and the main data content is returned in the body of the response.
HEAD
Retrieves only the metadata information for a resource specified by the request’s URI. Unlike a GET request, the resource’s data content will not be returned.
PUT
Stores the request’s data content in a resource at the location specified by the request’s URI. If a resource already exists at the URI location, its content is replaced with the new data.
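The difference between these three methods can be illustrated with a toy in-memory store. This is a teaching sketch only: the ToyResourceStore class is hypothetical, runs entirely in local memory, and does not communicate with any AWS service.

```ruby
# Hypothetical in-memory model of the GET, HEAD, and PUT semantics
# described above. Paths play the role of request URIs.
class ToyResourceStore
  def initialize
    @resources = {}
  end

  # PUT: store the data at the given path, replacing any existing content.
  def put(path, data, metadata = {})
    @resources[path] = { data: data, metadata: metadata }
    nil
  end

  # GET: return metadata (as headers) together with the data (as the body).
  def get(path)
    resource = @resources.fetch(path)
    { headers: resource[:metadata], body: resource[:data] }
  end

  # HEAD: return only the metadata; the data content is not returned.
  def head(path)
    { headers: @resources.fetch(path)[:metadata], body: nil }
  end
end

store = ToyResourceStore.new
store.put('/notes.txt', 'first draft', 'Content-Type' => 'text/plain')
store.put('/notes.txt', 'second draft', 'Content-Type' => 'text/plain')
puts store.get('/notes.txt')[:body]
```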
User Authentication
AWS requires API request messages to be digitally signed by the owner of an AWS account. The services use this signature to confirm the identity of the sender and to ensure that the request has not been altered in transit. Generating request signatures and attaching them to your requests is a vital part of the communications process when using AWS.
Each AWS user account has an associated set of credentials that you use to sign your REST or Query request messages. These credentials, known as AWS Access Key Identifiers, are composed of a pair of text values that include an Access Key ID and a Secret Access Key. The Access Key ID identifies the AWS account holder who is making a request, and the Secret Access Key is used to calculate a digital signature for the request. As its name implies, your secret key must be kept private to ensure no one else sends requests to AWS pretending to be you. If you are afraid that your secret access key has been compromised, you can generate a new secret key at any time and invalidate the old one.
The SOAP interfaces use X.509 certificates to authenticate request messages instead of the Access and Secret keys. To use the SOAP interfaces, or tools based on this interface, you must obtain your public and private X.509 certificate files in addition to your AWS Access Key Identifiers.
To sign REST or Query API requests, you must generate a keyed Hash Message Authentication Code (HMAC) that authenticates the request. This means that each request message is summarized into a brief hash value, which is then cryptographically signed using the Secret Access Key credential associated with an AWS account. The result of the HMAC computation is a digital signature, a piece of data that could only have been generated by someone who knows the key properties of the request and possesses the AWS Secret Access Key credential. By attaching this signature to your request message, you sign the request in a way that allows AWS to know that you created the request and that the message has not been altered in transit.
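The HMAC computation itself is short when written with Ruby's OpenSSL bindings. The sketch below makes several assumptions that the text above does not state: it uses SHA-1 as the hash function, it Base64-encodes the result, and the request description string and key shown are made-up placeholders. The method name sign_description is hypothetical.

```ruby
require 'openssl'

# Hypothetical helper: computes a keyed HMAC over a request description.
# Assumptions: SHA-1 as the hash function and Base64 output encoding.
def sign_description(secret_access_key, request_description)
  digest = OpenSSL::Digest.new('sha1')
  hmac = OpenSSL::HMAC.digest(digest, secret_access_key, request_description)
  [hmac].pack('m0') # Base64 without line breaks
end

# Placeholder values for illustration only.
signature = sign_description('example-secret-key', "GET\n/bucket-name/object-name")
puts signature
```

Because the signature depends on both the key and the description, changing either one produces a different value, which is what allows the receiver to detect an altered message or an incorrect key.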
Performing AWS Requests
To perform API operations via the AWS REST or Query interfaces, our service client implementations will perform the following steps:
  • Create a URI specifying the service resource the request will act upon.
  • Build a request message containing all the information the service will need to perform the request.
  • Generate a request description string that summarizes the request message.
  • Authenticate the request by generating a signature from the request description string and attaching it to the request.
  • Transmit the HTTP request message to the service and receive a response.
  • Determine whether the request succeeded or failed. Raise a ServiceError if it failed.
  • Interpret the response sent by the service and obtain the result information specific to that operation.
The work required to perform the first two steps, which construct a request message, and the last step, in which the service’s response is interpreted, varies greatly across the different interfaces and operations. These steps will be performed by the service client implementations we will define in later chapters. The middle four steps, on the other hand, can be implemented once for each of the REST and Query interfaces, and they may be reused by all the services that use these interfaces.
In this section we will add methods to the AWS module that prepare and transmit requests for both the REST and Query interfaces. These methods will be generic enough to be reused by multiple AWS client implementations.
The AWS REST interfaces use request messages in which the HTTP method describes the action that will be taken. The REST interface we will focus on in this book—the one for the S3 service—understands the GET, HEAD, DELETE, and PUT methods. To perform a REST API operation, the communication library must be told which HTTP method to use and the URI of the resource the request will act upon. In addition to this basic information, some requests will include metadata information specified as request headers, and those that upload data to the service will also include a request body.
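The four reusable middle steps (describe, sign, transmit, check) can be sketched as follows. This is not the AWS module presented in the book: the description format, the Authorization header layout, the use of SHA-1, and every method name other than the ServiceError class name are assumptions made for illustration. The transport is passed in as a callable so that the sketch can run without a network connection.

```ruby
require 'net/http'
require 'openssl'

# Error raised when a request fails (class name taken from the steps above;
# this empty definition is an assumption).
class ServiceError < StandardError; end

# Step 3 (hypothetical format): summarize the method, path, and headers.
def describe_request(request)
  header_lines = request.to_hash.sort.map do |name, values|
    "#{name}:#{values.join(',')}"
  end
  ([request.method, request.path] + header_lines).join("\n")
end

# Step 4: sign the description and attach the signature to the request.
def sign_request(request, access_key_id, secret_access_key)
  digest = OpenSSL::Digest.new('sha1') # assumed hash function
  hmac = OpenSSL::HMAC.digest(digest, secret_access_key, describe_request(request))
  request['Authorization'] = "AWS #{access_key_id}:#{[hmac].pack('m0')}"
  request
end

# Steps 5 and 6: transmit via the supplied callable, then check the result.
def perform_request(request, access_key_id, secret_access_key, transport)
  sign_request(request, access_key_id, secret_access_key)
  response = transport.call(request)
  unless response.is_a?(Net::HTTPSuccess)
    raise ServiceError, "Request failed: #{response.code} #{response.message}"
  end
  response
end
```

A real transport could be a lambda that wraps Net::HTTP.start for the service host and calls http.request(request); a stub that returns a prepared response object is convenient for testing.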
Chapter 3: S3: Simple Storage Service
Amazon’s Simple Storage Service (S3) provides unlimited online storage space for files or data of any kind. Information stored in S3 is accessible from anywhere you have an Internet connection and is maintained in a highly scalable and reliable system. You can use S3 to securely store your personal data, to cheaply distribute content to the general public, or as a data storage component in a distributed web application architecture.
Amazon offers a Service Level Agreement for S3 that makes users eligible for service credits should the S3 uptime percentage fall below 99.9%. To claim these credits, users of the service must track any faults experienced by their applications due to S3 downtime, and they must provide Amazon with detailed logging documentation to corroborate the claim. For more information, refer to the Amazon S3 Service Level Agreement at http://www.amazon.com/gp/browse.html?node=379654011.
S3 Overview
S3’s data model is very simple, comprising only two kinds of storage resource: objects and buckets. Objects store data and metadata, and buckets are containers that can hold an unlimited number of objects. The simplicity of the system means it is very flexible and easily adapted to suit a range of purposes, but it also means that if you need to perform complex tasks, you may have to create more intelligent programs to make up for the lack of features in the storage model.
In addition to data storage, S3 provides access control mechanisms that allow you to keep your information private or make it public and accessible to anyone on the Internet. Access control settings are configured using a list of rules that describe who will be granted access to a resource and the kinds of access that will be permitted. Access control settings can be applied to both bucket and object resources.
Resources in S3 are identified using standard Uniform Resource Identifiers (URIs). This means that publicly accessible objects can be downloaded from a URI resembling any standard web site location, such as http://s3.amazonaws.com/bucket-name/object-name. S3 also allows resources to be accessed using alternative domain names. This feature allows you to publish links to your resources based on your own domain name, such as http://www.mysite.com/object-name, instead of the default S3-service domain name.
S3 is built on a distributed architecture within Amazon. Your data is stored redundantly within this architecture, spread across multiple physical servers and across multiple data centers in different locations. If you wish, you can specify a geographical location where your buckets will be stored. When this book was written, Amazon provided S3 data-center locations in the United States and Europe.
This storage strategy provides huge benefits in terms of redundancy, reliability, and scalability, but it also leads to some drawbacks that you must consider when building applications that use S3.
Interacting with S3
The S3 web service application programming interface (API) is made available through two interfaces: REST and SOAP. In this book we will use the REST interface.
The S3 service implementation presented in this chapter uses the REST API functionality in the AWS Ruby module. The AWS module includes methods, presented in Chapter 2, that perform authentication, transmission, and response checking of REST API requests.
The REST API interface for the S3 service uses five HTTP methods to perform API operations: GET, HEAD, PUT, DELETE, and POST. The meaning of each method varies slightly, depending on what kind of S3 resource the operation is targeting: an object, a bucket, an Access Control List (ACL), or the S3 service itself. The following table lists some of the operations you can perform on S3 resources using different HTTP methods.
Table: Acting on S3 resources with HTTP methods
Resource | GET | HEAD | PUT | DELETE | POST
S3 Service | List your buckets | - | - | - | -
Bucket | List the bucket’s objects | - | Create the bucket | Delete the bucket | -
Object | Retrieve the object’s data and metadata | Retrieve the object’s metadata | Create or replace the object | Delete the object | Create or replace the object
ACL (for a Bucket or Object resource) | Retrieve ACL settings | - | Apply new ACL settings | - | -
The most recent S3 API version available when this book was written was 2006-03-01. This version number is used as a component of the XML namespace of documents provided to and produced by the service, http://s3.amazonaws.com/doc/2006-03-01/.
In this chapter, we will gradually build up a complete implementation class called “S3” that you can use to interact with the S3 service. The following example shows a basic Ruby code stub that defines the S3 class, to which we will add API implementation methods as we proceed through the chapter. Save this code to a file named S3.rb in the same directory as the AWS module file AWS.rb, which we defined in Chapter 2.
Example. S3 class stub: S3.rb
require 'AWS'
require 'digest/md5'

class S3
  include AWS # Include the AWS module as a mixin

  S3_ENDPOINT = "s3.amazonaws.com"
  XMLNS = 'http://s3.amazonaws.com/doc/2006-03-01/'
    
  # S3 API implementation methods will go here...
  
end
Buckets
An S3 bucket is a container for data objects. A bucket does not contain any data; it is little more than a convenient way of grouping objects together. The closest computer system analogy to an S3 bucket would be a disk drive.
The access control permissions for each bucket can be configured to determine who can view the bucket’s contents, or add and remove objects in the bucket. Buckets are also used as the basis for the simple access-logging capabilities provided by S3.
You can configure your buckets to be based in different geographical locations. At the time this book was written, Amazon provided S3 data centers in two locations: the United States and Europe. When a bucket is based in a location, all the objects created inside that bucket are automatically stored in that location. Making your S3 resources available from a specific location can improve the performance of S3 for customers living in that region.
The storage and request fees for using S3 vary, depending on the location where your objects are stored (as discussed earlier). Buckets are created in the U.S. location by default, so unless you choose otherwise, you will be charged the cheapest usage rates available.
To work most efficiently over multiple locations, S3 uses alternative DNS names and request redirection techniques to ensure that service requests are sent to data centers in the region where the bucket is stored.
The use of alternative DNS names means that S3 clients must use the subdomain host-naming format in requests that refer to buckets located outside the United States, and that the names of these buckets must be compatible with the DNS system. If an S3 client refers to a non-U.S. bucket in the path of a request URI, instead of in a subdomain, the service will respond with a Permanent Redirect message (HTTP status 301). This response indicates that the client has used an inappropriate resource reference and must correct the reference before submitting a new request. The best way to avoid sending inappropriate references is to use the subdomain host-naming format whenever possible. We demonstrated how to use this approach in an earlier section.
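A client can apply the "subdomain whenever possible" rule with a small check like the one sketched below. The exact naming rules are an assumption here: this example treats a name as DNS-compatible when it is 3 to 63 characters long and each dot-separated label contains only lowercase letters, digits, and hyphens, and starts and ends with a letter or digit. Both method names are hypothetical, and object keys are not URL-encoded in this sketch.

```ruby
# Hypothetical check using an approximate DNS-compatibility rule
# (see the assumptions stated above).
def dns_compatible_bucket_name?(name)
  return false unless name.length.between?(3, 63)
  name.split('.', -1).all? do |label|
    label.match?(/\A[a-z0-9]([a-z0-9-]*[a-z0-9])?\z/)
  end
end

# Prefer the subdomain format; fall back to the path format only when the
# bucket name cannot be used as part of a hostname.
def bucket_uri(bucket, key = '')
  if dns_compatible_bucket_name?(bucket)
    "http://#{bucket}.s3.amazonaws.com/#{key}"
  else
    "http://s3.amazonaws.com/#{bucket}/#{key}"
  end
end

puts bucket_uri('my-bucket', 'WebPage.html')
puts bucket_uri('My_Bucket', 'WebPage.html')
```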
Objects
S3 objects are resources that store data. They are somewhat similar to the files in a standard computer system, but there are a number of important differences, which were summarized earlier.
An object can contain up to 5 GB of data, or it can be entirely empty. An object can store two types of information: data and metadata. The data stored by an object is its main content, such as a photo or text document. In addition to the data content, an object can store metadata that provides further information about the object, such as when it was created and the type of data it contains. You can store your own metadata information when you create or replace an object.
Each object resource in S3 can have access control permissions applied to it, allowing you to keep the object private, or to make it available to other S3 users or the general public.
Each object in S3 is identified by a unique name, known as its key, which uniquely identifies it within a bucket. Object keys must not be longer than 1,024 bytes when encoded as UTF-8, and they can contain almost any characters, including spaces and punctuation. Objects are similar to files, so it makes sense to use obvious names for your objects, as you would for a file, such as My Birthday Cake.jpg.
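Note that the key length limit is measured in bytes, not characters, so keys containing non-ASCII characters reach the limit sooner. The following sketch shows one way to check this in Ruby; the method name and the minimum length of one byte are assumptions made for this example.

```ruby
MAX_KEY_BYTES = 1024 # limit stated in the text above

# Hypothetical check: measures the key's size in bytes once encoded as UTF-8.
# Rejecting empty keys is an assumption of this sketch.
def valid_object_key?(key)
  key.encode('UTF-8').bytesize.between?(1, MAX_KEY_BYTES)
end

puts valid_object_key?('My Birthday Cake.jpg')
puts valid_object_key?("\u00e9" * 513) # 513 characters, but 1,026 bytes
```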
One major difference between the S3 storage model and the average computer file system is that S3 has no notion of a hierarchical folder or directory structure. S3 buckets contain objects—that is the beginning and end of the hierarchy imposed by the storage model. If you wish to impose a hierarchical structure for your objects in S3 to help organize and search them, you must construct this hierarchy yourself using the flexible naming capabilities of object keys. You can do this by choosing a special character or string to mark the boundaries between components of a hierarchical path and by storing your objects with key names that describe their full path in the hierarchy.
Because objects can be accessed using URIs, as if S3 was a standard web server, the most obvious character to use for delimiting the components of a hierarchical path is a forward slash (
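To show how a hierarchy can be layered over flat key names, the following sketch lists the entries found at one level of an imaginary folder tree. It works on a plain Ruby array of key names and does not call S3; the list_level method and the sample keys are invented for this illustration.

```ruby
# Hypothetical helper: given flat key names, returns the entries directly
# beneath a prefix. Keys with further delimiters are collapsed into a single
# "folder" entry that ends with the delimiter.
def list_level(keys, prefix = '', delimiter = '/')
  entries = keys.select { |key| key.start_with?(prefix) }.map do |key|
    rest = key[prefix.length..-1]
    head, separator, _tail = rest.partition(delimiter)
    separator.empty? ? key : prefix + head + delimiter
  end
  entries.uniq.sort
end

keys = [
  'photos/2008/cake.jpg',
  'photos/2008/party.jpg',
  'photos/index.html',
  'notes.txt'
]
p list_level(keys)            # top level
p list_level(keys, 'photos/') # one level down
```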
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Alternative Hostnames
S3 offers two mechanisms that allow you to use alternative hostnames to access the contents of your buckets. The first mechanism, called Virtual Hosting, allows you to use your own domain name as an alias for an S3 bucket. This feature makes the content distribution capabilities of S3 much more attractive, because you can provide your content through a domain name of your choice, while still serving data directly from S3. The second mechanism allows you to access your buckets via a subdomain of the S3 service domain.
To support alternative hostnames, S3 interprets the Host header in HTTP request messages to deduce the name of the bucket the request is referring to. Conveniently, the vast majority of HTTP client applications, including all web browsers, supply the Host header automatically when they make a request. Because the bucket name can be determined from the alternative hostnames, there is no need to include the bucket’s name in the URI path for requests that use alternative hosts.
The following table demonstrates the different URIs that can be used to refer to the same object in S3 using different hostnames. The first request uses a standard S3 URI to refer to an object, and it includes the bucket name in the URI’s path. The second request uses an S3 subdomain, in which case the hostname starts with the bucket’s name. The third uses a Virtual Host domain name, and the bucket name is represented by the entire hostname (this domain does not actually exist).
Hostname Type          URI                                              Bucket Name
Standard S3 hostname   http://s3.amazonaws.com/my-bucket/WebPage.html   my-bucket
S3 subdomain           http://my-bucket.s3.amazonaws.com/WebPage.html   my-bucket
Virtual Host           http://www.my-bucket.com/WebPage.html            www.my-bucket.com
As you can see, alternative hostnames allow you to publish objects in the topmost directory of a host. This can be very useful if you are using S3 as an alternative to a web server, and you need to provide configuration files, such as robots.txt, from the root directory.
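The three cases in the table can be sketched as a small function that derives the bucket name and object key from a request's Host header and URI path. This is only our illustration of the rules described above, not S3's actual implementation, and the name resolve_bucket is hypothetical.

```ruby
S3_HOST = 's3.amazonaws.com'

# Return [bucket_name, object_key] for the given Host header value
# and URI path, following the three hostname types in the table.
def resolve_bucket(host, path)
  host = host.downcase.sub(/:\d+\z/, '')   # ignore any port number
  path = path.sub(%r{\A/}, '')             # drop the leading slash
  if host == S3_HOST
    # Standard hostname: the bucket is the first path component
    bucket, key = path.split('/', 2)
    [bucket, key.to_s]
  elsif host.end_with?(".#{S3_HOST}")
    # S3 subdomain: the bucket is everything before the service domain
    [host.delete_suffix(".#{S3_HOST}"), path]
  else
    # Virtual Host: the entire hostname is the bucket name
    [host, path]
  end
end

p resolve_bucket('s3.amazonaws.com', '/my-bucket/WebPage.html')
p resolve_bucket('my-bucket.s3.amazonaws.com', '/WebPage.html')
p resolve_bucket('www.my-bucket.com', '/WebPage.html')
```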
Although alternative hostnames make it possible to provide object content from the root directory of a host, this is in no way a full replacement for the functionality of a standard web server application that can perform tasks like redirecting requests and providing directory listings. For example, a request for the root of a virtual hostname such as
Access Control Lists
The S3 service allows you to define access control permissions to specify who can access your buckets and objects, and what kind of operations can be performed. The group of permission settings applied to an S3 resource is called an Access Control Policy (ACP), though more often these settings are referred to as an Access Control List (ACL), because this list defines the permission settings.
Every resource in S3 has an ACL associated with it. The default ACL applied to objects and buckets when they are created or updated marks these resources as private, meaning that you as the owner have full control over the resource, and no one else can access or modify it. You can update the ACL permission settings of your resources at any time.
Access Control Lists contain a set of up to 100 grant rules. Each grant rule names the specific entity that can access a resource, called a grantee, and a single permission value that describes what the grantee can do with the resource. You control the access permission settings for a resource by adding grant rules to, or removing them from, the ACL settings document associated with that resource.
ACL grant rules only grant access permissions; they cannot forbid them. Access permissions must be explicitly granted to take effect.
Earlier we demonstrated how limited, “canned” access control permission settings can be applied when you create an object in S3. However, to take advantage of the full flexibility and power of the access controls available in S3, it is necessary to work with Access Control List configuration documents.
An Access Control List configuration document is an XML document that contains a set of grant rules. It includes two main sections. The first is an Owner section, which stores the ID and (optionally) the display name of the resource’s owner. The second section contains a listing of Grant XML elements that define the grant rules the document embodies.
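As a rough sketch of this two-part structure, the following Ruby method assembles an ACL document that gives the owner full control and grants read access to everyone. It is our own illustration rather than a listing from this chapter: the element names and the AllUsers group URI follow the S3 2006-03-01 schema as we understand it, while the method names and the owner values are placeholders.

```ruby
# URI that identifies the "everyone" group grantee (per the S3 schema)
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'
XSI_NS = 'http://www.w3.org/2001/XMLSchema-instance'

# Escape the characters that are significant in XML text content
def xml_escape(value)
  value.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Hypothetical helper: build an ACL document with an Owner section
# followed by two Grant elements.
def public_read_acl(owner_id, display_name = nil)
  name = display_name ? "\n    <DisplayName>#{xml_escape(display_name)}</DisplayName>" : ''
  <<~XML
    <AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
      <Owner>
        <ID>#{xml_escape(owner_id)}</ID>#{name}
      </Owner>
      <AccessControlList>
        <Grant>
          <Grantee xmlns:xsi="#{XSI_NS}" xsi:type="CanonicalUser">
            <ID>#{xml_escape(owner_id)}</ID>
          </Grantee>
          <Permission>FULL_CONTROL</Permission>
        </Grant>
        <Grant>
          <Grantee xmlns:xsi="#{XSI_NS}" xsi:type="Group">
            <URI>#{ALL_USERS_URI}</URI>
          </Grantee>
          <Permission>READ</Permission>
        </Grant>
      </AccessControlList>
    </AccessControlPolicy>
  XML
end

# The owner ID below is a made-up placeholder, not a real canonical ID
puts public_read_acl('example-owner-id', 'example-owner')
```

Because grant rules can only add permissions, making the object private again means sending a document without the second Grant element, not adding a rule that denies access.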
Here is an example ACL document for a publicly accessible object:
Server Access Logging (Beta)
To help S3 account holders monitor the API operations that are performed on their objects, the service provides a mechanism called Server Access Logging to generate log files detailing the requests made against their buckets. This logging mechanism is intended to provide similar information to the access log files produced by standard web servers. These logs can help you to monitor the usage of your S3 resources, particularly when these resources are publicly available. By reviewing the log files, you can judge the popularity of particular resources and the cost incurred by serving them. There is no cost for using the Server Access Logging feature, beyond the standard fees for storing the generated log files.
The Server Access Logging mechanism is a beta feature of the S3 service. Neither the timely delivery nor the accuracy of the log documents is guaranteed. The logging operates on a best-effort basis and is not intended to provide a reliable record of activity suitable for billing or auditing purposes.
Server Access Logging must be explicitly enabled for each bucket you wish to monitor; it is not active by default. When you enable logging for a bucket, S3 will track the requests performed against that bucket, or against the objects inside it, and will write the details to log files. These log files are periodically saved as objects in a bucket of your choice.
It is important to note that the log files are not updated dynamically and are only written to your logging bucket at intervals, so it can take some time for the logged information to become available. Once the log file objects are created, they are your responsibility; you can manipulate them as you would any other object, and you can delete them when you have finished with them.
To enable and configure Server Access Logging for your buckets, an S3 client must create an XML BucketLoggingStatus document describing the logging settings to apply. Every bucket in S3 has a BucketLoggingStatus document associated with it. This document can be retrieved or updated using a URI that specifies a bucket and includes a single parameter:
Signed URIs
The access control settings available in S3’s ACLs provide a powerful mechanism for sharing your S3 resources with third parties; however, there are some situations where ACLs may not provide as much flexibility as you would wish. For example, you may have sensitive resources in S3 that you wish to make available to someone who does not have an S3 account, but you are not prepared to make these resources public. Or you may wish to make a resource available for only a limited time.
To handle such scenarios, S3 includes a mechanism for creating preauthenticated request messages called signed URIs. A signed URI is authenticated in advance to perform an operation and can be used by anyone to access a resource in S3.
A signed URI looks like a standard S3 URI except that it includes three additional parameters:
AWSAccessKeyId
The AWS Access Key of the account holder.
Expires
This value indicates the time when the signed URI will expire, specified as the number of seconds since January 1, 1970. The URI will expire at this time and cannot be used afterwards.
Signature
The signature value preauthenticates the request message.
The following example defines a method that generates a time-limited, signed URI.
Example. Generate signed URI: S3.rb
def sign_uri(method, expires, bucket_name, object_key='', opts={})
  parameters = opts[:parameters] || []
  headers = opts[:headers] || {}

  # For a signed URI, the expiry time takes the place of the Date
  # header value when the signature is calculated
  headers['Date'] = expires

  uri = generate_s3_uri(bucket_name, object_key, parameters)
  signature = generate_rest_signature(method, uri, headers)

  # Append the three signed URI parameters to any existing query string
  uri.query = (uri.query.nil? ? '' : "#{uri.query}&")
  uri.query << "Signature=" + CGI::escape(signature)
  uri.query << "&Expires=" + expires.to_s
  uri.query << "&AWSAccessKeyId=" + @aws_access_key

  # Use the bucket name as the hostname for Virtual Host buckets
  uri.host = bucket_name if opts[:is_virtual_host]

  return uri.to_s
end
The URI generated by this method can be used to perform the specific operation that was preauthenticated and signed, such as a GET request on an object. The URI cannot be used to perform any request that does not exactly match the signed version. For example, any attempt to modify the URI by changing the object it refers to, or the HTTP method it uses, will cause the request to fail due to the mismatch between the request received by S3 and the signature value in the URI.
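To see why any change invalidates the URI, it helps to look at what goes into the signature. The sketch below is a standalone approximation of the calculation, assuming the HMAC-SHA1 request signing scheme used by the S3 REST API and a request with no Content-MD5, Content-Type, or x-amz- headers. The method names and the credentials are invented for illustration; this is not the generate_rest_signature method from S3.rb.

```ruby
require 'openssl'
require 'uri'

# The text that is signed: the HTTP method, two empty lines standing in
# for the Content-MD5 and Content-Type headers, the expiry time in place
# of a Date, and the resource path.
def sketch_string_to_sign(method, expires, bucket_name, object_key)
  [method, '', '', expires.to_s, "/#{bucket_name}/#{object_key}"].join("\n")
end

# HMAC-SHA1 of the string to sign, keyed with the secret key and
# Base64-encoded (pack('m0') encodes without line breaks).
def sketch_signature(secret_key, method, expires, bucket_name, object_key)
  text = sketch_string_to_sign(method, expires, bucket_name, object_key)
  [OpenSSL::HMAC.digest('sha1', secret_key, text)].pack('m0')
end

# Assemble a signed URI carrying the three extra parameters.
def sketch_signed_uri(access_key, secret_key, method, expires, bucket_name, object_key)
  signature = sketch_signature(secret_key, method, expires, bucket_name, object_key)
  query = URI.encode_www_form('AWSAccessKeyId' => access_key,
                              'Expires' => expires,
                              'Signature' => signature)
  "http://s3.amazonaws.com/#{bucket_name}/#{object_key}?#{query}"
end

# Made-up credentials, for illustration only
expires = Time.now.to_i + 3600
puts sketch_signed_uri('EXAMPLE-ACCESS-KEY', 'example-secret-key',
                       'GET', expires, 'my-bucket', 'WebPage.html')
```

The method, expiry time, bucket, and key are all part of the signed text, so altering any of them yields a different signature from the one embedded in the URI. Only the holder of the secret key can produce a matching value.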
Distributing Objects with BitTorrent
S3 provides a distribution service that allows you to provide content to a large number of people over the Web at low cost. However, if you are distributing a large amount of content to many people, the BitTorrent protocol can provide a much more efficient and cost-effective means of delivery than the normal web server approach. When you distribute data via BitTorrent, all the people downloading your content will share the data among one another, which makes the downloads faster and saves you some money in data-transfer fees.
Objects stored in S3 can be distributed with the BitTorrent protocol very easily. To use this protocol, you provide a torrent file to the BitTorrent client applications that will perform the download. Every object in S3 has a torrent file associated with it that is available through a GET request that specifies the object in the URI and includes the request parameter torrent. The service replies to these requests with an HTTP 200 response message containing a torrent file.
Here is a URI that will return the torrent file associated with an S3 object:
Objects must be publicly accessible to be made available using BitTorrent, because BitTorrent clients cannot authenticate themselves to S3. If you use a torrent file for a nonpublic object, the client program will not complain that the item is unavailable, but it will not be able to download any data.
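A torrent URI is simply the object's ordinary URI with the torrent parameter added. As a small sketch, the following method builds one with Ruby's standard uri library; the method name is hypothetical, and we assume the bucket name and key contain no characters that need escaping in a URI path.

```ruby
require 'uri'

# Build the URI that returns the torrent file for an object. The
# torrent parameter carries no value, so the query is just its name.
def torrent_uri(bucket_name, object_key)
  URI::HTTP.build(:host => 's3.amazonaws.com',
                  :path => "/#{bucket_name}/#{object_key}",
                  :query => 'torrent').to_s
end

puts torrent_uri('my-bucket', 'WebPage.html')
# prints http://s3.amazonaws.com/my-bucket/WebPage.html?torrent
```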
The following example defines a method that sends a GET request to an object with the torrent request parameter and writes the torrent file to an output object. Torrent files should be written directly to files, because they contain binary data and the torrent information must be stored in a stand-alone file to be available to BitTorrent clients.
Example. Get torrent file: S3.rb
def get_torrent(bucket_name, object_key, output)
  uri = generate_s3_uri(bucket_name, object_key, [:torrent=>nil])
  response = do_rest('GET', uri)
  output.write(response.body)
end
Here is a command that invokes the
Chapter 4: S3 Applications
The Amazon Simple Storage Service (S3) can be used in a number of ways to meet many different storage needs. You can use it as a basic online file repository for backing up files, for web site hosting, as the basis for a network-mounted filesystem, or as a distribution network. In this chapter we discuss how you can use S3 to fulfill some common tasks by taking advantage of some of the available tools, software libraries, or third-party services.
The S3 developer ecosystem is very active, and much of the third-party software available is open source and free. In many cases you can achieve a great deal without having to write your own code, and even if you have specific needs not yet met by existing software, there are mature libraries available in a range of languages that you can use to build your own solution.
Share Large Files
A very simple application for S3 involves using the service as a repository for sharing files that are too large to include in an email. There are a number of online services already available to do this job, but many charge monthly subscription fees if you need to share very large files; with S3 you can do this yourself at little cost.
To share a file with your friends or colleagues, you will need to upload the file to S3 and send a URI link to the S3 object in an email. Because your files may contain private information, we will keep them private by generating a signed Universal Resource Identifier (URI) link to the object so that only the people who receive the link from you can access it. An advantage of using a signed URI is that you can choose how long the link will remain valid.
The following example defines a simple Ruby script that will upload a file to S3 and print out a signed URI you can share with others. Save this script as sharefile.rb, then modify the BUCKET_NAME constant to reference a bucket you have already created in your S3 account.
Example. Share file script: sharefile.rb
# The name of the bucket where shared files will be stored
BUCKET_NAME = 'my-bucket'

require 'S3'

if __FILE__ == $0
  if ARGV.length < 1
    puts "Usage: #{$0} file_to_upload [hours_until_expiry]"
    puts "Links will be valid for 24 hours by default"
    exit
  end

  # Calculate the expiry time in seconds.
  hours_until_expiry = (ARGV[1].nil? ? 24 : ARGV[1].to_f)
  expiry_time = Time.now.to_i + (hours_until_expiry * 3600).to_i

  # Open the file in binary mode and find its name
  file = File.new(ARGV[0], 'rb')
  path, file_name = File.split(file.path)

  # Upload the file to an S3 object named after the file
  puts "Uploading file: #{file_name}, size: #{file.stat.size} bytes"
  s3 = S3.new
  s3.create_object(BUCKET_NAME, file_name, :data => file)

  # Generate a signed URI to share the S3 object
  puts "URI will be valid for #{hours_until_expiry} hours:"
  puts s3.sign_uri('GET', expiry_time, BUCKET_NAME, file_name)
end
Here is the command you would use to upload a PDF (portable document format) document to S3, and to generate a link that will remain valid for 24 hours:
Online Backup with AWS::S3
The possibility of maintaining online backups of your important files at little cost is one of the most obvious and compelling uses of S3. There are already a number of third-party tools available for backing up your files in S3 with support for file versioning and scheduled uploads. If you are looking for such a tool, check the Amazon Web Services (AWS) Solution Center to see what is available. However, because you sometimes need to create your own solution, we will work through a simple example that demonstrates how to create a very simple backup tool in Ruby using the AWS::S3 library.
Our objectives for this backup solution are very modest indeed. We will not store different file version snapshots, nor will we implement complex schemes to allow for efficient file renaming or rearrangement of large files into smaller, more manageable chunks. Our backup process will comprise only the following steps:
  • Find all the files in a local directory to be backed up.
  • List the objects that are already present in S3.
  • Upload the local files that are not already present in S3, or whose contents have changed since the object was last uploaded to S3.
  • Delete objects stored in S3 when the corresponding local file has been deleted or renamed.
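The comparison at the heart of these steps can be sketched without contacting S3 at all. In the code below, a hash that maps object keys to MD5 hex digests stands in for the S3 listing. We assume such a digest is available for each stored object, and the method names are our own. The actual upload and delete requests, which would go through the S3 library, are left out.

```ruby
require 'digest/md5'
require 'tmpdir'

# Map each file's path, relative to dir, to the MD5 hex digest of its
# contents. Hidden files are skipped by this glob pattern.
def local_digests(dir)
  digests = {}
  Dir.glob('**/*', :base => dir).each do |relative|
    full_path = File.join(dir, relative)
    next unless File.file?(full_path)
    digests[relative] = Digest::MD5.file(full_path).hexdigest
  end
  digests
end

# Decide what to do: upload files that are new or whose digest
# differs, and delete objects that no longer have a local file.
def backup_plan(local, remote)
  uploads = local.select { |key, digest| remote[key] != digest }.keys.sort
  deletes = (remote.keys - local.keys).sort
  { :upload => uploads, :delete => deletes }
end

# Try it out on a throwaway directory
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'same.txt'), 'unchanged')
  File.write(File.join(dir, 'edited.txt'), 'new contents')
  File.write(File.join(dir, 'added.txt'), 'brand new')
  remote = {
    'same.txt'    => Digest::MD5.hexdigest('unchanged'),
    'edited.txt'  => Digest::MD5.hexdigest('old contents'),
    'removed.txt' => Digest::MD5.hexdigest('gone')
  }
  p backup_plan(local_digests(dir), remote)
end
```

A renamed file needs no special handling under this scheme: it appears as one upload under the new name and one delete under the old name.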
In this example we will use the excellent Ruby S3 library, AWS::S3, which may be found at http://amazon.rubyforge.org/. Our example is based on version 0.4.0 of this library.
AWS::S3 provides an object-oriented view of resources and operations in S3 that makes it much easier to work with than the procedural application programming interface (API) implementation we presented earlier. We will define a simple Ruby script in the file s3backup.rb that will use this library to interact with S3.
First you must install the AWS::S3 library. This library is available as a Ruby gem package or as a download from the project’s web site that you can install manually. We prefer to use the convenient gem package that you can install from the command line.
S3 Filesystem with ElasticDrive
One of the most interesting potential uses for S3 is as an unlimited data store on top of which other filesystem interface abstractions can be built. A number of products and services use S3 as an underlying storage repository but expose it via different file or storage management protocols.
Some of these tools are designed to make S3 storage resources accessible to existing network-based tools that do not recognize S3—for example, as a File Transfer Protocol (FTP) or a Web-based Distributed Authoring and Versioning (WebDAV) service—and others aim to make the storage space in S3 available as a lower-level filesystem resource. Both of these approaches provide benefits, but it is the S3-based filesystem approach that we will concentrate on in this section, because it presents the most interesting possibilities. It also presents the most difficult challenges.
If it proves to be feasible to build whole filesystems on top of S3, many of the service’s limitations could be overcome in a very elegant way. Rather than having to use specialized S3 tools to access your storage space, you can make the service look and behave like a standard disk drive that stores data reliably in the cloud behind the scenes. On your computer you could copy files to and from this disk, even rename and rearrange them. In the background the changes you make would automatically be translated into API requests and stored in S3. This approach also makes it possible to use advanced disk management protocols, like RAID mirroring, to automatically manage synchronization between a local file system and S3, effectively giving you an effortless, online backup of all your files.
Great promise and potential, sadly, do not always lead to practical outcomes. There are a number of difficult issues that S3-backed filesystems must overcome to be considered reliable, economical, and agile enough to be used in real-world applications. There is some debate among S3 domain experts in the AWS forums as to whether it will ever be possible to achieve these three vital characteristics when using S3. This debate is highly technical and well beyond the scope of this book, so we will merely describe the main difficulties such filesystems face and leave it to you, the reader, to investigate further and make your own judgment about whether the filesystem approach is suitable for your purposes. After imparting these words of warning, we will proceed with an example that shows how to set up an S3-based filesystem to whet your appetite.
Mediated Access to S3 with JetS3t
The S3 service can be a very effective platform for sharing information when its simple access control mechanisms meet your needs, but the level of control possible with the service’s ACL settings may not always be sufficient. Some scenarios are difficult or impossible to achieve with ACL settings alone, for example when you wish to make your S3 storage available to customers or colleagues who do not have their own AWS account. In such cases you may need to provide your own intermediate service to mediate access to your S3 storage.
In this section we will demonstrate how to use tools available in the JetS3t Java library to mediate third-party access to your S3 storage. These tools include a client-side application, for interacting with S3 to upload and download files, and a server-side Gatekeeper component that decides whether the client, or user, should be authorized to perform these operations.
Disclaimer: The JetS3t project was created by the author of this book.
There are a number of ways you could share your S3 storage with others. Let us take a look at a few of the options to see why we think the JetS3t tools are worth considering.
Public write permission via an ACL
The simplest way to allow third parties to upload files to your S3 buckets is to grant write permission to the general public. If you apply this ACL setting, anyone with S3 client software can upload files into the bucket and replace or delete existing objects. This makes it easy to grant access to others, but the disadvantages of this approach should be clear: anyone can upload, replace, or delete objects in your bucket.
If you grant public write access to a bucket, you cede a great deal of control over what happens in your S3 accou