Chapter 4. S3 Applications

The Amazon Simple Storage Service (S3) can be used in a number of ways to meet many different storage needs. You can use it as a basic online file repository for backing up files, for web site hosting, as the basis for a network-mounted filesystem, or as a distribution network. In this chapter we discuss how you can use S3 to fulfill some common tasks by taking advantage of some of the available tools, software libraries, or third-party services.

The S3 developer ecosystem is very active, and much of the third-party software available is open source and free. In many cases you can achieve a great deal without having to write your own code, and even if you have specific needs not yet met by existing software, there are mature libraries available in a range of languages that you can use to build your own solution.

Share Large Files

A very simple application for S3 involves using the service as a repository for sharing files that are too large to include in an email. There are a number of online services already available to do this job, but many charge monthly subscription fees if you need to share very large files; with S3 you can do this yourself at little cost.

To share a file with your friends or colleagues, you will need to upload the file to S3 and send a URI link to the S3 object in an email. Because your files may contain private information, we will keep them private by generating a signed Uniform Resource Identifier (URI) link to the object, so that only the people who receive the link from you can access it. An advantage of using a signed URI is that you can choose how long the link will remain valid.

Example 4-1 defines a simple Ruby script that will upload a file to S3 and print out a signed URI you can share with others. Save this script as sharefile.rb, then modify the BUCKET_NAME constant to reference a bucket you have already created in your S3 account.

Example 4-1. Share file script: sharefile.rb
# The name of the bucket where shared files will be stored
BUCKET_NAME = 'my-bucket'

# The S3 client library implemented in Chapter 3
require 'S3'

if __FILE__ == $0
  if ARGV.length < 1
    puts "Usage: #{$0} file_to_upload [hours_until_expiry]"
    puts "Links will be valid for 24 hours by default"
    exit
  end

  # Calculate the expiry time in seconds.
  hours_until_expiry = (ARGV[1].nil? ? 24 : ARGV[1].to_f)
  expiry_time = + (hours_until_expiry * 3600).to_i

  # Open the file in binary mode and find its name
  file =[0], 'rb')
  path, file_name = File.split(file.path)

  # Upload the file to an S3 object named after the file
  puts "Uploading file: #{file_name}, size: #{file.stat.size} bytes"
  s3 =
  s3.create_object(BUCKET_NAME, file_name, :data => file)

  # Generate a signed URI to share the S3 object
  puts "URI will be valid for #{hours_until_expiry} hours:"
  puts s3.sign_uri('GET', expiry_time, BUCKET_NAME, file_name)
end

Here is the command you would use to upload a PDF (portable document format) document to S3, and to generate a link that will remain valid for 24 hours:

$ ruby sharefile.rb Documentation.pdf
Uploading file: Documentation.pdf, size: 1527187 bytes
URI will be valid for 24 hours:

When the command has finished, copy and paste the resulting URI into an email message, and the message’s recipients will be able to use the link to download the file for the next 24 hours.

To create a link that expires in a longer or shorter time, include the number of hours until the link should expire as a second parameter.

# Link will expire in 3 days (72 hours)
$ ruby sharefile.rb Documentation.pdf 72

# Link will expire in 30 minutes (0.5 hours)
$ ruby sharefile.rb Documentation.pdf 0.5

Remember to delete the file in S3 when you have finished sharing it.

Online Backup with AWS::S3

The possibility of maintaining online backups of your important files at little cost is one of the most obvious and compelling uses of S3. There are already a number of third-party tools available for backing up your files in S3 with support for file versioning and scheduled uploads. If you are looking for such a tool, check the Amazon Web Services (AWS) Solution Center to see what is available. However, because you sometimes need to create your own solution, we will work through a simple example that demonstrates how to create a very simple backup tool in Ruby using the AWS::S3 library.

Our objectives for this backup solution are very modest indeed. We will not store different file version snapshots, nor will we implement complex schemes to allow for efficient file renaming or rearrangement of large files into smaller, more manageable chunks. Our backup process will comprise only the following steps:

  • Find all the files in a local directory to be backed up.

  • List the objects that are already present in S3.

  • Upload the local files that are not already present in S3, or whose contents have changed since the object was last uploaded to S3.

  • Delete objects stored in S3 when the corresponding local file has been deleted or renamed.
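These decisions reduce to simple set comparisons over object key names. Here is a toy sketch in plain Ruby; the key names are invented for illustration.

```ruby
# Toy illustration: the backup decisions reduce to set operations over
# object key names. The key names below are invented for the example.
local_keys  = ['a.txt', 'docs/b.txt', 'c.txt']   # files found under the root path
remote_keys = ['a.txt', 'old.txt']               # objects listed in the bucket

new_keys      = local_keys - remote_keys  # upload: not yet present in S3
common_keys   = local_keys & remote_keys  # upload only if the content has changed
obsolete_keys = remote_keys - local_keys  # delete: no corresponding local file
```

The script we build below follows exactly this shape, with the "has the content changed?" test for common keys handled by an MD5 hash comparison.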

AWS::S3 Ruby Library

In this example we will use the excellent Ruby S3 library, AWS::S3. Our example is based on version 0.4.0 of this library.

AWS::S3 provides an object-oriented view of resources and operations in S3 that makes it much easier to work with than the procedural application programming interface (API) implementation we presented in Chapter 3. We will define a simple Ruby script in the file s3backup.rb that will use this library to interact with S3.

First you must install the AWS::S3 library. This library is available as a Ruby gem package or as a download from the project’s web site that you can install manually. We prefer to use the convenient gem package that you can install from the command line.

$ gem install aws-s3

S3Backup Class

Example 4-2 defines the beginning of a Ruby script that will back up your files. This script stub loads the libraries we will need, including the AWS::S3 library and the MD5 (Message-Digest algorithm 5) digest library. To keep everything nicely organized, we will define a Ruby class called S3Backup to contain our implementation methods. All the method definitions that follow in this section should be defined inside this class.

Example 4-2. S3Backup class stub: s3backup.rb
#!/usr/bin/env ruby

# Load the AWS::S3 library and include it to give us easy access to objects
require 'rubygems'
require 'aws/s3'
include AWS::S3

# Use the ruby MD5 digest tool for file/object comparisons
require 'digest/md5'

class S3Backup

  # Implementation methods will go here...

end

To establish a connection with S3, you must let the AWS::S3 library know what your AWS credentials are. Example 4-3 defines an initialize method for the S3Backup class that will include your credentials.

Example 4-3. Initialize an S3 connection: s3backup.rb
def initialize
  # Establish the connection AWS::S3 will use for all requests
  Base.establish_connection!(
    :access_key_id     => 'YOUR_AWS_ACCESS_KEY',
    :secret_access_key => 'YOUR_AWS_SECRET_KEY'
  )
end

List Backed-Up Objects

Before our program uploads files to S3, it needs to find out which files are already stored there so that only new or updated files will be uploaded. Example 4-4 defines a method that lists the contents of a bucket. As a convenience, this method will create a bucket if one does not already exist.

Example 4-4. List bucket contents: s3backup.rb
# Find a bucket and return the bucket's object listing.
# Create the bucket if it does not already exist.
def bucket_find(bucket_name)
  puts "Listing objects in bucket..."
  objects = Bucket.find(bucket_name)
rescue NoSuchBucket
  puts "Creating bucket '#{bucket_name}'"
  if not Bucket.create(bucket_name)
    raise 'Unable to create bucket'
  end
  objects = Bucket.find(bucket_name)
end

Find Files to Back Up

Example 4-5 defines a method that recursively lists the files and subdirectories contained in a directory path and returns the object names the files will be given in S3. The backup script will be given a directory path by the user to indicate the root directory location of the files to back up. Any file inside this root path will be uploaded to S3, including files inside subdirectories. When we store the files in S3, each object will be given a key name corresponding to the file’s location relative to the root path.

Example 4-5. List local files: s3backup.rb
# Find all the files inside the root path, including subdirectories.
# Return an array of object names corresponding to the relative
# path of the files inside the root path.
# The sub_path parameter should only be used internally for recursive
# method calls.
def local_objects(root_path, sub_path = '')
  object_names = []
  # Include subdirectory paths if scanning a nested hierarchy.
  if sub_path.length > 0
    base_path = "#{root_path}/#{sub_path}"
  else
    base_path = root_path
  end
  # List files in the current scan directory
  Dir.entries("#{base_path}").each do |f|
    # Skip current and parent directory shortcuts
    next if f == '.' || f == '..'
    file_path = "#{base_path}/#{f}"
    object_name = (sub_path.length > 0 ? "#{sub_path}/#{f}" : f)
    if
      # Recursively find files in subdirectory
      local_objects(root_path, object_name).each do |n|
        object_names << n
      end
    else
      # Add the object key name for this file to our list
      object_names << object_name
    end
  end
  return object_names
end

Back Up Files

We now have methods to list the objects in the target S3 bucket and to list the local files that will be backed up. The next step is to actually upload the new and changed files to S3. Example 4-6 defines a method to do this.

Example 4-6. Upload files: s3backup.rb
# Upload all objects that are not up-to-date in S3.
def upload_files(path, bucket, files, force=false, options={})
  files.each do |f|
    file ="#{path}/#{f}", 'rb') # Open files in binary mode

    if force || bucket[f].nil?
      # Object is not present in S3, or upload has been forced
      puts "Storing object: #{f} (#{file.stat.size})"
, open(file.path),, options)
    else
      obj = bucket[f]

      # Ensure S3 object is latest version by comparing MD5 hash
      # after removing quote characters surrounding S3's ETag.
      remote_etag = obj.about['etag'][1..-2]
      local_etag = Digest::MD5.hexdigest(

      if remote_etag != local_etag
        puts "Updating object: #{f} (#{file.stat.size})"
, open(file.path),, options)
      else
        puts "Object is up-to-date: #{f}"
      end
    end
  end
end

This method loops through the local file listing and decides which files should be uploaded by checking first whether the file is already present in S3. If the file is present in the target bucket, it checks whether the local file has changed since the S3 version was created. If the file is not present, it is uploaded immediately.

If the file is already present in the bucket, we have to find out whether the local version is different from the version in S3. The method generates an MD5 hash of the local file’s contents to find out whether it differs from the object stored in S3. The S3 object’s MD5 hash value is made available as a hex-encoded value in the object’s ETag property. If the hash value of the local file and the object match, then they have identical content, and there is no need to upload the file. If the hashes do not match, then we assume the local file has been modified and that it should replace the version in S3.

It can take some time and processing power to generate the MD5 hash values for files, especially if they are large, so this hash-comparison approach slows things down. A faster alternative would be to compare the dates of the local file and the S3 object to see whether the local file is newer; but such comparisons are risky, because the object creation date reported by S3 may differ from your local system clock. Because we are more concerned with protecting our data than doing things quickly, we prefer to use hashes; it is the safest approach.
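For contrast, a date-based check would look something like the following sketch. The skew margin is an arbitrary figure we chose for illustration: pick it too small and clock drift between your machine and S3 causes needless uploads, too large and genuine changes may be missed.

```ruby
require 'time'

# Sketch of the riskier timestamp comparison. The margin value is an
# arbitrary choice, illustrating the clock-skew problem described above.
CLOCK_SKEW_MARGIN = 300 # seconds

# local_mtime is a Time; s3_last_modified is the date string S3 reports.
def local_file_newer?(local_mtime, s3_last_modified)
  local_mtime > Time.parse(s3_last_modified) + CLOCK_SKEW_MARGIN
end
```

A file modified within the margin of the S3 object's timestamp is treated as unchanged, which is exactly the kind of silent misjudgment the hash comparison avoids.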

The upload_files method includes two optional parameters. The options parameter allows us to pass extra options to the method defined in the AWS::S3 library. Our script will use these options to specify an access control policy to apply to newly created objects. The method's force parameter is a Boolean value that allows users to force files to be uploaded, even if they are already present in the bucket. This option could be handy if the user wanted to force a change to the Access Control List (ACL) policy settings of all the objects in a backup bucket.

Delete Obsolete Objects

In addition to storing files in S3, our backup script will be able to delete obsolete objects from S3 when the corresponding local file has been removed or renamed. This step will help to prevent our backup bucket from filling up with outdated files. Example 4-7 defines a method that loops through the objects present in the target bucket and checks whether the listing of local files includes a corresponding file. If there is no local file corresponding to the object, it is deleted. In a more advanced backup scenario, these outdated objects would be kept for some time, in case the local files had been deleted by mistake; but such a feature is beyond the scope of this book.

Example 4-7. Delete objects: s3backup.rb
# Delete all objects that are not present in the local file path
def delete_obsolete_objects(bucket, local_files)
  bucket.each do |obj|
    if local_files.index(obj.key).nil?
      # Obsolete object, delete it
      puts "Deleting orphan object: #{obj.key}"
      obj.delete
    end
  end
end

Putting It All Together

The final step to complete the S3Backup class is to add a method to tie together all the steps required to perform a backup. Example 4-8 defines a backup method that performs this task. The methods we defined above should only be used from within the class itself, so we will make these methods private by using Ruby’s private macro.

Example 4-8. Perform backup: s3backup.rb
# Perform a backup to S3
def backup(bucket_name, path, force=false, options={})
  # Ensure the provided path exists and is a directory
  if not
    raise "Not a directory: '#{path}'"
  end

  puts "Uploading directory path '#{path}' to bucket '#{bucket_name}'"

  # List contents of the target bucket
  bucket = bucket_find(bucket_name)

  # List local files
  files = local_objects(path)

  # Upload files and delete obsolete objects
  upload_files(path, bucket, files, force, options)
  delete_obsolete_objects(bucket, files)
end

private :bucket_find, :local_objects
private :upload_files, :delete_obsolete_objects

The S3Backup class is now functionally complete, but the class by itself cannot be run as a script. Example 4-9 defines a block of code that will automatically invoke the S3Backup class when the Ruby script file is run from the command line. Add this code to the end of the script file, outside the body of the S3Backup class.

Example 4-9. Run block: s3backup.rb
if __FILE__ == $0
  if ARGV.length < 2
    puts "Usage: #{$0} bucket path [force_flag acl_policy]"
    exit
  end

  bucket_name = ARGV[0]
  path = ARGV[1]
  force_flag = !ARGV[2].nil?  # Any third argument forces re-upload
  acl_policy = (ARGV[3].nil? ? 'private' : ARGV[3])

  s3backup =
  s3backup.backup(bucket_name, path, force_flag, {:access => acl_policy})
end

The script is now ready to run. You can try it out with some of the following commands. However, be careful not to back up your files to an S3 bucket that already contains objects you wish to keep.

# Print a help message by not specifying the required parameters
$ ruby s3backup.rb                                       
Usage: s3backup.rb bucket path [force_flag acl_policy]

# Back up the directory Documents/ImportantDirectory to the bucket my-bucket
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
Uploading directory path 'Documents/ImportantDirectory' to bucket 'my-bucket'
Listing objects in bucket...
Creating bucket 'my-bucket'
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .

# Follow-up backups of the directory Documents/ImportantDirectory will run
# faster as only new or changed files will be uploaded
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
. . .
Object is up-to-date: Document1.txt
Object is up-to-date: Document2.txt
. . .

# Force the script to upload all the local files again, this time with the
# 'public' access control permission.
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory true public
. . .
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .

If you are serious about backing up your files to S3, you will likely need many backup features that are missing from this example; plus, we have not included a script to restore your files from S3 if a disaster strikes. We will leave these additional features as an exercise for the reader.

Content-Length Workaround

You may experience problems using version 0.4.0 of the AWS::S3 library with some web proxies, because the method that creates a bucket does not explicitly set the Content-Length header prior to performing the PUT request. Some web proxies refuse to pass on PUT messages that lack this header, even though the S3 service itself accepts them.

If you receive inexplicable Unable to create bucket error messages when you use the s3backup.rb script, try adding the workaround code in Example 4-10 to your script outside the S3Backup class.

Example 4-10. Content-Length fix: s3backup.rb
# Modification to AWS::S3 library to ensure bucket creation PUT requests 
# include a Content-Length header
class Bucket
  class << self
    def create(name, options = {})
      options['Content-Length'] = 0   # Explicitly set header
      put("/#{name}", options).success?
    end
  end
end

S3 Filesystem with ElasticDrive

One of the most interesting potential uses for S3 is as an unlimited data store on top of which other filesystem interface abstractions can be built. Appendix A lists a number of products or services that use S3 as an underlying storage repository but expose it via different file or storage management protocols.

Some of these tools are designed to make S3 storage resources accessible to existing network-based tools that do not recognize S3—for example, as a File Transfer Protocol (FTP) or a Web-based Distributed Authoring and Versioning (WebDAV) service—and others aim to make the storage space in S3 available as a lower-level filesystem resource. Both of these approaches provide benefits, but it is the S3-based filesystem approach that we will concentrate on in this section, because it presents the most interesting possibilities. It also presents the most difficult challenges.

If it proves to be feasible to build whole filesystems on top of S3, many of the service’s limitations could be overcome in a very elegant way. Rather than having to use specialized S3 tools to access your storage space, you can make the service look and behave like a standard disk drive that stores data reliably in the cloud behind the scenes. On your computer you could copy files to and from this disk, even rename and rearrange them. In the background the changes you make would automatically be translated into API requests and stored in S3. This approach also makes it possible to use advanced disk management protocols, like RAID mirroring, to automatically manage synchronization between a local file system and S3, effectively giving you an effortless, online backup of all your files.


Great promise and potential, sadly, does not always lead to practical outcomes. There are a number of difficult issues that S3-backed filesystems must overcome to be considered reliable, economical, and agile enough to be used in real-world applications. There is some debate among S3 domain experts in the AWS forums as to whether it will ever be possible to achieve these three vital characteristics when using S3. This debate is highly technical and well beyond the scope of this book, so we will merely describe the main difficulties such filesystems face and leave it to you, the reader, to investigate further and make your own judgment about whether the filesystems approach is suitable for your purposes. After imparting these words of warning, we will proceed with an example that shows how to set up an S3-based filesystem to whet your appetite.

There are three main criteria an S3-backed filesystem must meet to be practical:


Reliability

You must be able to rely on a data storage system to keep your information safe in a range of circumstances, especially when things go wrong. The difficulty for filesystems based on S3 is that there are two circumstances in which latency can cause the S3 version of your data and the local version to fall out of synch, at which point a local system failure will cause you to lose any data that has not been replicated in S3. In the worst-case scenario, an entire S3-backed filesystem could be corrupted if the data in S3 is not up-to-date.

The first latency problem is the time it takes for data to be copied over the network between your computer and S3. Network resources are generally much more constrained than filesystem resources, so there will always be a delay before local changes can be reflected in S3. The second latency problem is caused by S3 itself, which does not guarantee that you will always be able to retrieve the latest version of your data (see the discussion in “S3 Architecture” in Chapter 3).


Economy

S3 account holders are charged by Amazon based on the amount of data they transfer to and from S3, and the number of requests that are performed. If an S3-backed filesystem cannot adequately translate filesystem changes to efficient S3 operations, the potential exists for a great deal of data to be transferred back and forth and for a large number of requests to be performed in doing this. Such inefficiencies could cause the S3 usage costs to quickly add up and make an S3-backed filesystem prohibitively expensive.

This issue is less serious when the S3-backed filesystem exists on an Amazon Elastic Compute Cloud (EC2) instance, because data transfer between EC2 and S3 buckets located in the United States is free; however, the per-request charges can still add up, even when you are not being charged for bandwidth.


Agility

A filesystem that stores data using network-based services is inevitably slower than a physical disk drive attached to a computer. S3-backed filesystems can minimize or even eliminate delays caused by S3 network transmissions by caching the filesystem locally and using intelligent algorithms to ensure that as little data as possible is sent over the network. However, the more data that is cached, the greater the risk that the local filesystem and the version stored in S3 will fall out of synch, and the greater the risk that a local system crash could cause data loss.

ElasticDrive: S3 As a Virtual Block Device

ElasticDrive is a proprietary tool that makes S3 available as a low-level filesystem resource via Filesystem in Userspace (FUSE). ElasticDrive presents a virtual block device on your computer that looks like any other data-storage block device, and you can build any filesystem you like on the device, including RAID systems. Behind the scenes, the blocks that comprise the virtual block device are cached and automatically synchronized with S3, as shown in Figure 4-1. The intended result is that your local filesystem is seamlessly backed up to S3.

Figure 4-1. ElasticDrive provides a virtual block device backed by S3

A free trial version of ElasticDrive is available. This version imposes a maximum size limit on your virtual block device; however, it should still be sufficient to evaluate the product. As we noted previously, there are still some open questions about how reliable and effective S3-backed filesystems are in general, so bear this in mind when you are evaluating this tool, and be sure to try it out thoroughly with test data. Our guide is based on ElasticDrive version 0.4.0.

ElasticDrive has the following requirements:

  • The FUSE kernel module is available.

  • The FUSE utility programs are available.

  • FUSE and Python development libraries are installed.

In this application, we will step through the installation and configuration process for ElasticDrive on Fedora Core 4 (the distribution used in Amazon’s “Getting Started” public AMI for EC2). The process will be similar on other distributions, though the exact commands and application paths may differ.


EC2 is the best environment in which to test this approach due to the lower S3 usage cost for bandwidth and the greater S3 access speed available. Refer to later chapters for information on deploying applications on EC2 servers.

Setup and configuration

Install the FUSE and development libraries the tool requires.


We will assume that you have logged in to the computer as the root (administrative) user when you perform the commands listed in this section.

$ yum install fuse fuse-devel python-devel

Obtain a free evaluation version of the ElasticDrive application and the related documentation from the vendor’s web site. Extract the distribution tar file.

$ tar xvzf elasticdrive-0.4.0_dist.tar.gz

Run the ElasticDrive installation script in the distribution directory.

$ cd elasticdrive-0.4.0_dist

Follow the configuration instructions included with the program to configure ElasticDrive by editing the /etc/elasticdrive.ini file. You must update the S3 fuseblock variable to contain your S3 account credentials, the name of the bucket to use for storage, and the size of the virtual block device you wish to create. Here are the configuration settings we applied in elasticdrive.ini to create a 50MB virtual block device linked to the S3 bucket elasticdrive-test.

# fuseblock|/path/to/fuse/fuse="file:///tmp/foo.img?size=2000000000"


Long fuseblock settings may appear broken over two lines in a printout like the one above. In the real elasticdrive.ini file, each setting must be on a single line.

Create the mount folder you specified in the fuseblock path.

$ mkdir -p /home/jmurty/fuse

With ElasticDrive now configured, it is time to fire it up and try it out. Run the application as a daemon, and confirm that it started correctly by looking at the log file /var/log/elasticdrive.log.

$ /etc/elasticdrive.ini

$ tail /var/log/elasticdrive.log

Once you have confirmed that ElasticDrive runs without any errors, you can set up the virtual block device it provides to work as a standard filesystem. In this example we will create a filesystem on the block device, format it as the commonly used Linux filesystem called ext3, and mount it as a loop-back device.

Format the ed0 device file exposed within the fuse path as an ext3 filesystem.

# Create an ext3 filesystem on the device
$ /sbin/mke2fs -j -b 4096 /home/jmurty/fuse/ed0

Once you have formatted the virtual block device, you should be able to see a list of block stripe files stored in the S3 bucket that ElasticDrive is using. As you write data to the device, the blocks will be synchronized with S3 as stripe objects.

Now we will mount the new filesystem and write some data to it.

# Mount the new filesystem as a standard drive
$ mkdir /mnt/elastic
$ mount /home/jmurty/fuse/ed0 /mnt/elastic -o loop

# Create some files on the virtual drive
$ echo "Hello filesystem" > /mnt/elastic/hello.txt

The virtual drive you have just created should work just as you would expect a standard disk drive to work. In the background the ElasticDrive application will cache the filesystem changes you make and store them in S3 at intervals. If you have debugging turned on, you can watch this process occurring in the ElasticDrive log file. The best way to see some log activity is to copy a few hundred kilobytes of data to the drive and then unmount it to force ElasticDrive to write the data to S3.

Mediated Access to S3 with JetS3t

The S3 service can be a very effective platform for sharing information, when its simple access control mechanisms meet your needs; but the level of control possible with the service’s ACL settings may not always be sufficient. Some scenarios are difficult or impossible to achieve with ACL settings alone, such as if you wish to make your S3 storage available to your customers or colleagues to use when they do not have their own AWS account. In such cases you may need to provide your own intermediate service to mediate access to your S3 storage.

In this section we will demonstrate how to use tools available in the JetS3t Java library to mediate third-party access to your S3 storage. These tools include a client-side application, for interacting with S3 to upload and download files, and a server-side Gatekeeper component that decides whether the client, or user, should be authorized to perform these operations.


Disclaimer: The JetS3t project was created by the author of this book.

There are a number of ways you could share your S3 storage with others. Let us take a look at a few of the options to see why we think the JetS3t tools are worth considering.

Public write permission via an ACL

The simplest way to allow third parties to upload files to your S3 buckets is to grant write permission to the general public. If you apply this ACL setting, anyone with S3 client software can upload files into the bucket and replace or delete existing objects. This makes it easy to grant access to others, but the disadvantages of this approach should be clear: anyone can upload, replace, or delete objects in your bucket.

If you grant public write access to a bucket, you cede a great deal of control over what happens in your S3 account. You make it possible for malevolent individuals to use your account to store and distribute their own files, and you make yourself vulnerable to the risk that such individuals will overwrite the files stored in your bucket by legitimate users.

You could make your bucket less of a target by not granting read access to the public. In that case, the bucket acts as a drop-box into which anyone can upload their files, but they cannot obtain a listing of the files that are stored there. Anyone who knows the bucket name could still store and distribute their own files by making the objects they create publicly accessible. This approach is less risky than allowing complete public access, but it is still far from safe.

Intermediate Relay Server

A better way to share your S3 storage with others, while maintaining control over how it is used, is to provide your own server and software to act as a middleman between the client and S3. In this arrangement your server would allow trusted clients to log in and upload files, which the server would then relay to S3 for long-term storage.

This approach has a number of advantages. You can exert a great deal of control over who is able to access the server, using whatever authentication mechanism you prefer. You can use the server’s disk as a short- or long-term cache for files, so you are not wholly dependent on the S3 service being available. And your server can provide a simpler interface than that offered by S3; for example, it could accept file uploads via protocols not supported by S3 such as FTP or WebDAV.

There are also disadvantages to this approach. If your server is likely to handle a large amount of traffic, it will need enough processing power to receive and relay all the data, enough bandwidth to cope with the traffic, and enough disk space to store the files until they have been written to S3. This can lead to exactly the kind of infrastructure problems that the S3 service was designed to avoid. Worst of all, you will be paying double for bandwidth, because files must be uploaded twice: first to your intermediate server, then to S3; though you could minimize these fees by running your server in Amazon’s EC2 service.

Gatekeeper server

The third option for providing mediated access to your S3 storage, and the one we will pursue here, is to provide your own authorization server that acts as a gatekeeper to your S3 account. This server will receive requests from clients who wish to perform an operation on your account, and it will allow or deny this request based on criteria you define. If the client’s request is allowed, the server will send the client a preapproved, signed URI that the client’s software can use to interact with S3 to perform the operation. The Gatekeeper server only authorizes operations; it does not act as a mediator for the actual data transfer.

Like the other approaches, this option has some disadvantages. It requires that you run your own gatekeeper authorization server to generate signed URIs for clients, and it also requires that your clients use specialized software that can communicate with the gatekeeper and interact with S3, using signed URIs instead of AWS credentials.

Despite these drawbacks, this approach offers compelling advantages. Because the client’s software interacts directly with S3 to upload or download files, your bandwidth expenses are less than they would be with an intermediate server, and you do not have to worry about the intermediate server running out of space or bandwidth. Also, because the gatekeeper server merely receives client requests and responds with signed URIs, it does not need to be very powerful.

Finally, this approach really highlights the power of S3’s URI-signing feature and demonstrates how it can be used in a nonobvious way, which makes it an interesting example in its own right. The fact that much of the work has already been done in an open-source toolkit means there is little reason not to try it out.
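To make the signing step concrete, here is a minimal Ruby sketch of S3's query-string authentication, the mechanism that signed URIs rely on. The method and parameter names are our own for illustration; they are not part of any gatekeeper's API.

```ruby
require 'openssl'
require 'base64'
require 'cgi'

# Build a time-limited, presigned S3 URI using S3's query-string
# authentication. Anyone holding this URI can perform the operation
# until the 'expires' timestamp passes.
def signed_uri(method, bucket, key, access_key, secret_key, expires)
  resource = "/#{bucket}/#{key}"
  # String-to-sign: HTTP verb, Content-MD5, Content-Type (both empty
  # here), the expiry timestamp, and the resource path.
  string_to_sign = "#{method}\n\n\n#{expires}\n#{resource}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                secret_key, string_to_sign)
  signature = CGI.escape(Base64.encode64(digest).strip)
  "https://s3.amazonaws.com#{resource}" \
    "?AWSAccessKeyId=#{access_key}&Expires=#{expires}&Signature=#{signature}"
end
```

A gatekeeper simply performs this calculation on the client's behalf after deciding that a request is allowed, so the secret key never leaves the server.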

JetS3t Gatekeeper

The JetS3t project is an open-source suite of Java tools for working with S3 that includes an API implementation and a number of applications. The examples in this section are based on version 0.6.0 of JetS3t. The applications we are most interested in are the Gatekeeper servlet, which acts as an authorization service to generate signed URIs for clients, and the Cockpit Lite application, which is an S3 client program that interacts with S3 using signed URIs received from the gatekeeper. Figure 4-2 shows the interaction between the Gatekeeper servlet and the Cockpit Lite client applications.

The gatekeeper mediating clients’ access to S3
Figure 4-2. The gatekeeper mediating clients’ access to S3

To begin, you must download the JetS3t distribution from the project’s web site and unzip it. You must also have Java version 1.4.2 or later installed on both the server where the gatekeeper will run and on client computers that will run the Cockpit Lite client application.

Deploy the Gatekeeper servlet

The Gatekeeper authorization application is a standard Java servlet. To run the servlet, you must first install a servlet container. In this example we will use the open-source Apache Tomcat servlet container, version 5.5. To install Tomcat, download the core installation archive appropriate for your computer system from the project’s web site and install or decompress it.


In this example we will assume you are running the Linux operating system and installing all the software manually. If you are using Windows, you can use Tomcat’s setup.exe installer to do the hard work and take advantage of the extra graphical user interface (GUI) tools available on that platform.

Once the Tomcat core server is installed, start it up and confirm that you can visit the default Tomcat welcome page at http://localhost:8080/.

$ cd apache-tomcat-5.5.25/bin/
$ ./startup.sh

With Tomcat running, you can deploy the Gatekeeper servlet by copying the preprepared Gatekeeper web archive (WAR) file from the JetS3t distribution directly into Tomcat’s webapps directory. After a short delay, Tomcat should notice that the new file is present and will automatically decompress and run the servlet.

# Deploy the Gatekeeper WAR file to Tomcat's webapps directory
$ cp jets3t-0.6.0/servlets/gatekeeper/gatekeeper-0.6.0.war \
  apache-tomcat-5.5.25/webapps/

# Confirm that Tomcat has noticed the new servlet and started running it.
$ ls apache-tomcat-5.5.25/webapps/gatekeeper-0.6.0

By default, the preprepared Gatekeeper servlet is configured to make testing easy by authorizing all client requests. We will tighten up the security settings after we have tested the servlet. Confirm that the gatekeeper is running by visiting the servlet’s URL, http://localhost:8080/gatekeeper-0.6.0/GatekeeperServlet, in your web browser. You should see a brief welcome page stating that the servlet is running. If the servlet is not available, try stopping Tomcat and starting it again to allow it to recognize the servlet.


The Gatekeeper servlet writes log messages to Tomcat’s default log files, especially logs/catalina.out. If you are experiencing problems, check this log file to see detailed debugging information.

Although the gatekeeper claims it is ready, you must still configure it with your AWS credentials and tell it which bucket you will be making available through the servlet. Edit the gatekeeper’s web.xml configuration file stored in apache-tomcat-5.5.25/webapps/gatekeeper-0.6.0/WEB-INF/web.xml and set the appropriate values for the initialization parameters AwsAccessKey, AwsSecretKey, and S3BucketName. Here is the portion of the configuration file that contains the initialization parameters:
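The entries follow the standard servlet init-param form; the values shown below are placeholders that you must replace with your own credentials and bucket name.

```xml
<init-param>
  <param-name>AwsAccessKey</param-name>
  <param-value>YOUR_AWS_ACCESS_KEY</param-value>
</init-param>
<init-param>
  <param-name>AwsSecretKey</param-name>
  <param-value>YOUR_AWS_SECRET_KEY</param-value>
</init-param>
<init-param>
  <param-name>S3BucketName</param-name>
  <param-value>your-shared-bucket</param-value>
</init-param>
```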


Once this configuration is complete, the gatekeeper will be ready to respond to authorization requests made by the Cockpit Lite application. All of these requests will be allowed, and the users of Cockpit Lite will be able to do anything they wish, so do not make your gatekeeper publicly available until you have read the authorization options in “Authorization with HTTP Basic,” later in this chapter.

Configure and test Cockpit Lite

Cockpit Lite is an application that allows users to interact with S3 without requiring access to the account holder’s AWS credentials. Whenever the user performs an operation in Cockpit Lite, the application asks the gatekeeper to approve the operation and to issue a signed URI for the task the user wishes to perform. The application can be run as a stand-alone program, or it can be made available in a web page as a Java applet.

To test Cockpit Lite, we will start by running it in standalone mode to make it easier to manage. Once we have confirmed it can communicate with the Gatekeeper servlet we have just deployed, we will demonstrate how to make it available as an applet on a web site.

The first step in configuring Cockpit Lite is to ensure that it knows where to find the Gatekeeper servlet, so it can request authorizations. Edit the application’s properties file in the configs/ directory and make sure the gatekeeperUrl property refers to the URI of the Gatekeeper servlet you deployed, such as http://localhost:8080/gatekeeper-0.6.0/GatekeeperServlet.

Run Cockpit Lite by invoking the startup script appropriate for your platform from the bin/ directory (for example, bin/cockpitlite.bat on Windows). If all goes well, you will be presented with a graphical application for interacting with the S3 bucket you have made available via the gatekeeper. For more detailed instructions about using this application, please refer to the documentation available on the JetS3t web site.

If you intend to make the Cockpit Lite application available to other people, it will be much easier to direct them to a web page, rather than expecting them to obtain and install the JetS3t distribution. Fortunately, the distribution includes preprepared applet versions of Cockpit Lite, among other applications, in the applets directory. To deploy a browser-based version of the application, you can simply copy this applet directory and its contents into Tomcat’s ROOT web application directory.

$ cp -R jets3t-0.6.0/applets apache-tomcat-5.5.25/webapps/ROOT

You must configure the applet version of Cockpit Lite with the Gatekeeper URL in the same way as you did for the standalone version. Do this by editing the properties file you copied into Tomcat under webapps/ROOT/applets/.

Launch the applet by loading the preprepared web page now available at http://localhost:8080/applets/jets3t-cockpitlite.html. Because the applet needs to be able to read and write files on your computer, you will be prompted to confirm that you trust it. Answer “yes,” and the application should start up in your browser, ready for work.

Authorization with HTTP Basic

The system we have just set up is interesting but is not really an improvement on using public ACL settings on your S3 bucket. Because the Cockpit Lite client application is communicating with a default gatekeeper, anyone who runs the Cockpit Lite application is granted full access to the contents of your bucket. Now that we have the basics in place, it is time to look at how we can control third-party access to your S3 account.

At the simplest level, you can control who has access to your bucket by controlling who can access the Gatekeeper servlet. Because the gatekeeper is provided by a web server, you can use the authentication mechanisms offered by the server to require Cockpit Lite users to authenticate themselves before they can access the gatekeeper. This approach is relatively easy to implement and uses commonly available and well-understood techniques, but it results in an all-or-nothing situation in which every authorized user has full access to the bucket. If this is all the control you need, you can implement simple authentication by turning on HTTP Basic authorization.

To require Cockpit Lite users to provide login information to access the gatekeeper, we will activate HTTP Basic authorization for the servlet. While we are doing this, we will also define two distinct access roles, “gatekeeper” and “gatekeeper-admin,” for normal users and administrators. We will take advantage of these two access roles to provide custom role-based authorizations in a later example.

Let us configure Tomcat to create two login users who will belong to the normal and administrative access roles. Edit the Tomcat users’ XML file apache-tomcat-5.5.25/conf/tomcat-users.xml to include the two user elements defined below. You can set the username and password values to any values you like.

<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  . . .
  <user username="user" password="secret" roles="gatekeeper"/>
  <user username="admin" password="secret" roles="gatekeeper-admin"/>
</tomcat-users>

Restart the Tomcat server to force it to re-read this file and recognize the new users.

$ cd apache-tomcat-5.5.25/bin/
$ ./shutdown.sh
$ ./startup.sh

Now that we have user accounts, we can configure the Gatekeeper servlet to refuse requests from users who cannot authenticate themselves. Edit the servlet’s configuration file apache-tomcat-5.5.25/webapps/gatekeeper-0.6.0/WEB-INF/web.xml to include the additional security and login configuration elements defined below.

  . . .
  <security-constraint>
    <display-name>Gatekeeper Authorization</display-name>
    <web-resource-collection>
      <web-resource-name>Protected Area</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>gatekeeper</role-name>
      <role-name>gatekeeper-admin</role-name>
    </auth-constraint>
  </security-constraint>

  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Gatekeeper Authorization Required</realm-name>
  </login-config>
  . . .


Once you have made these changes and Tomcat has recognized them (you may have to restart the server before Tomcat will notice), you can test the login requirements by visiting the Gatekeeper servlet’s URL in your web browser. You should be prompted for a username and password and, unless you enter the login credentials you configured in Tomcat’s users’ file, you will not be able to view the status web page. The results will be similar when you run the Cockpit Lite application; you will be prompted to enter your credentials to allow the application to access the gatekeeper.

Customizable authorization modules

The JetS3t Gatekeeper servlet is intended to provide an extensible authorization framework that allows you to implement an S3 authorization service as powerful as you need through some configuration and Java coding. The servlet performs the request authorization process by calling on a number of code modules, each of which is responsible for a different part of the authorization process. By implementing your own modules and configuring the gatekeeper to use your version instead of the default one, you can take complete control over any aspect of this process.

The Gatekeeper servlet uses four replaceable modules to authorize client requests and return results. Each of these modules is implemented as a Java class that extends one of the following four base classes:


Bucket lister
Provides the client application with a list of the objects stored in an S3 bucket and informs it of the operations the user is allowed to perform. In the default implementation, the object listing includes the complete contents of an S3 bucket, though an alternative implementation may list only a subset of a bucket’s contents, depending on the identity of the client. The default implementation also tells the Cockpit Lite client that the user can perform any S3 operation, but an alternative might allow only a restricted set of operations.


Authorizer
Allows or denies the client’s requests to perform S3 operations. Methods in this class are provided with information they can use to make authorization decisions, such as details about the requested operation and information about the client making the request, including her IP address and username, if server authorization is turned on.

The gatekeeper runs each authorization request through the Authorizer to determine whether it should be passed on to the UrlSigner module. The default Authorizer implementation allows all requests. Alternative implementations might perform one or more of the following advanced functions:

  • Perform user authorization by comparing a user’s login credentials or point of origin information against a user database or directory service, like Lightweight Directory Access Protocol (LDAP).

  • Allow fine-grained access control by organizing users into roles with differing permissions. Some users may only be allowed to perform a limited set of S3 operations, but others could have full privileges.

  • Evaluate authorization requests based on the specific S3 object being accessed. Some users may have restricted access to some portions of a bucket’s object hierarchy.

  • Evaluate file upload (PUT) operations based on properties of the file that will be uploaded. File uploads could be restricted based on the name, content type, or size of the file.


UrlSigner
Generates the signed URL strings that the Cockpit Lite application uses to interact with S3. Signed URLs can be created for the full set of S3 REST operations, including GET, PUT, HEAD, and DELETE. This module is invoked after the Authorizer module has authorized a request, so its only responsibility is to generate the signed URL.

The default implementation of this module can generate all the signed URLs necessary for Cockpit Lite to work, so it will be sufficient for most users. However, a customized version of this module could provide some very powerful features under the right circumstances.

A customized UrlSigner module could be used to remap object names on the fly, presenting a different view of S3 resources to the user than is actually stored in the service. This feature is possible because the signed URLs generated by the gatekeeper need not correspond to the object structure shown in Cockpit Lite. By remapping object names, you could partition a single S3 bucket into a number of logical pieces using hierarchical object names. You could then share this bucket among many users, and these users would only be able to see and access the portion of the object hierarchy that belongs to them.
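As a hypothetical illustration of this remapping, a customized signer might silently prefix every key the user requests with a per-user partition before signing it. The naming scheme below is invented for the example:

```ruby
# Map the object key a user asks for onto a private partition of the
# bucket's key hierarchy. The user never sees the "users/<name>/" prefix;
# the gatekeeper signs the remapped key, so only that partition is reachable.
def remap_key(username, requested_key)
  "users/#{username}/#{requested_key}"
end
```

Two users asking for the same key then receive signed URLs for entirely different objects, which is what partitions the bucket.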


Transaction ID provider
Generates a transaction identifier that is meaningful for a specific application. This value may uniquely identify each authorization request message received from a client, or it may be used to group multiple request messages together into a single, logical transaction that shares a common identifier. The default implementation generates a random, globally unique identifier (GUID) for each authorization request. A custom implementation might obtain a more reliable value, such as a database sequence number, from an external system.
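The default behavior, one random GUID per authorization request, is simple to reproduce. A Ruby sketch of the idea (the method name here is our own, not JetS3t's):

```ruby
require 'securerandom'

# Produce one globally unique transaction ID per authorization request,
# mirroring the default implementation's random-GUID behavior.
def new_transaction_id
  SecureRandom.uuid
end
```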

Implement a custom authorization module

It is not possible to provide example alternative implementations for all of the gatekeeper modules in this book; to explore these more fully, you will have to refer to the JetS3t project documentation. We will stick to demonstrating the functionality most likely to be useful to many readers: a customized Authorizer module that gives different permissions to different users, depending on whether they belong to the normal or administrative access role. Our module will prevent nonadministrator users from deleting objects from S3. (Although these users will not be able to delete objects, they will still be able to overwrite them with new files; a more sophisticated Authorizer module would also disallow PUT operations that overwrite existing objects.)

To implement this customized behavior, we will need to do the following:

  • Define two Tomcat user accounts with the access roles “gatekeeper” and “gatekeeper-admin” (see “Authorization with HTTP Basic” for instructions).

  • Write a custom Authorizer implementation class.

  • Configure the gatekeeper to use our Authorizer class instead of the default one.

  • Build and deploy our customized Gatekeeper servlet.

To build your own version of the Gatekeeper servlet, you will need to decompress the source files provided with the JetS3t distribution, and you will need to have Sun’s Java Development Kit (JDK) version 1.4 or later installed on your system. To take advantage of the build scripts provided with JetS3t, you will also need to install the Apache ANT build tool.

With these requirements in place, you are ready to create a new implementation of the gatekeeper’s Authorizer module. Create a new file, ExampleAuthorizer.java, in the JetS3t project’s source code directory src/org/jets3t/servlets/gatekeeper/impl/. Edit this file to contain the code listed in Example 4-11.

Example 4-11. Custom gatekeeper Authorizer implementation: ExampleAuthorizer.java
package org.jets3t.servlets.gatekeeper.impl;

import org.jets3t.servlets.gatekeeper.Authorizer;
import org.jets3t.service.utils.gatekeeper.GatekeeperMessage;
import org.jets3t.service.utils.gatekeeper.SignatureRequest;
import org.jets3t.servlets.gatekeeper.ClientInformation;

/**
 * Authorizer implementation to disallow DELETE requests from
 * users not in the 'gatekeeper-admin' role.
 */
public class ExampleAuthorizer extends Authorizer {

    /**
     * Default constructor - no configuration parameters are required.
     */
    public ExampleAuthorizer(javax.servlet.ServletConfig servletConfig)
        throws javax.servlet.ServletException
    {
        super(servletConfig);
    }

    /**
     * Control which users can perform DELETE requests.
     */
    public boolean allowSignatureRequest(GatekeeperMessage requestMessage,
        ClientInformation clientInformation, SignatureRequest signatureRequest)
    {
        // Apply custom rules if this is a DELETE request.
        if (SignatureRequest.SIGNATURE_TYPE_DELETE.equals(
            signatureRequest.getSignatureType()))
        {
            // Return true if the user is a member of the "gatekeeper-admin"
            // access role, false otherwise.
            return clientInformation.getHttpServletRequest()
                .isUserInRole("gatekeeper-admin");
        } else {
            // Requests for operations other than DELETE are always allowed.
            return true;
        }
    }

    /**
     * Allow any user to obtain a listing of a bucket's contents.
     */
    public boolean allowBucketListingRequest(
        GatekeeperMessage requestMessage, ClientInformation clientInformation)
    {
        return true;
    }
}

This implementation code extends the Authorizer abstract class and implements two mandatory methods: allowSignatureRequest and allowBucketListingRequest. The only portion of this example code that does any real work is the allowSignatureRequest method. It checks whether a user has requested a DELETE operation and, if so, ensures that this user is a member of the access role called “gatekeeper-admin.” If a user is not a member of this access role, she will not be permitted to perform the delete operation.

To test this implementation, you must rebuild the Gatekeeper servlet application to include the implementation class, and you must configure it to use this implementation instead of the default one. To configure the Gatekeeper servlet to use an alternative Authorizer implementation class you must edit the servlet’s web.xml file. Rather than editing this file in the live Tomcat deployment, as we have previously, we will instead modify the predeployment file that is referenced by the ANT build scripts.

Copy the gatekeeper configuration file you edited previously in “Authorization with HTTP Basic” to the JetS3t directory.

$ cp apache-tomcat-5.5.25/webapps/gatekeeper-0.6.0/WEB-INF/web.xml \
  jets3t-0.6.0/servlets/gatekeeper-web.xml

Edit the copied file jets3t-0.6.0/servlets/gatekeeper-web.xml to change the AuthorizerClass initialization parameter to refer to your new Authorizer implementation class, following the example below.

. . .
<init-param>
  <param-name>AuthorizerClass</param-name>
  <param-value>org.jets3t.servlets.gatekeeper.impl.ExampleAuthorizer</param-value>
</init-param>
. . .

Run the ANT build script included with JetS3t to build a new Gatekeeper WAR file that includes the new Authorizer implementation and your modified configuration file.

$ cd jets3t-0.6.0
$ ant rebuild-gatekeeper

These commands will run the ANT build script build.xml to build the gatekeeper, and update the servlet WAR file with your changes. Copy the updated archive file servlets/gatekeeper/gatekeeper-0.6.0.war to Tomcat’s web application deployment directory.

$ cp jets3t-0.6.0/servlets/gatekeeper/gatekeeper-0.6.0.war \
  apache-tomcat-5.5.25/webapps/

If all goes well, the gatekeeper will reload and the new configuration will be applied. The next time you run Cockpit Lite, you will be unable to delete objects from S3 if you log in as the user who belongs in the “gatekeeper” access role. Log in as the user in the “gatekeeper-admin” access role instead, and you will once again be able to delete objects.

Next steps

In this chapter we have tried to keep the examples brief and simple to follow, but to make them so, we have avoided raising potential security or performance issues until now. If you intend to use the Gatekeeper servlet in a production system, or if you will expose it to the dangers of the Internet in any way, you will first have to secure it properly. The topic of securing web services is well beyond the scope of this book, but as a minimum you should consider taking the following steps:

  • Require all communication between Cockpit Lite and the Gatekeeper servlet to be transmitted using secure HTTP (HTTPS) instead of standard HTTP to prevent anyone from snooping on the transmissions.

  • Ensure that Tomcat is configured with security in mind by disabling any unnecessary servlets that are installed by default.

  • Consider protecting the Tomcat server by making it accessible only through a more hardened web server, like Apache.
