The possibility of maintaining online backups of your important files at little cost is one of the most obvious and compelling uses of S3. There are already a number of third-party tools available for backing up your files in S3 with support for file versioning and scheduled uploads. If you are looking for such a tool, check the Amazon Web Services (AWS) Solution Center to see what is available. However, because you sometimes need to create your own solution, we will work through an example that demonstrates how to create a very simple backup tool in Ruby using the AWS::S3 library.
Our objectives for this backup solution are very modest indeed. We will not store different file version snapshots, nor will we implement complex schemes to allow for efficient file renaming or rearrangement of large files into smaller, more manageable chunks. Our backup process will comprise only the following steps:
1. Find all the files in a local directory to be backed up.
2. List the objects that are already present in S3.
3. Upload the local files that are not already present in S3, or whose contents have changed since the object was last uploaded to S3.
4. Delete objects stored in S3 when the corresponding local file has been deleted or renamed.
In this example we will use the excellent Ruby S3 library, AWS::S3, which may be found at http://amazon.rubyforge.org/. Our example is based on version 0.4.0 of this library.
AWS::S3 provides an object-oriented view of resources and operations in S3 that makes it much easier to work with than the procedural application programming interface (API) implementation we presented in Chapter 3. We will define a simple Ruby script in the file s3backup.rb that will use this library to interact with S3.
First you must install the AWS::S3 library. This library is available as a Ruby gem package, or as a manual download from the project’s web site. We prefer the convenient gem package, which you can install from the command line.
$ gem install aws-s3
Example 4-2 defines the beginning of a Ruby script that will back up your files. This script stub loads the libraries we will need, including the AWS::S3 library and the MD5 (Message-Digest algorithm 5) digest library. To keep everything nicely organized, we will define a Ruby class called S3Backup to contain our implementation methods. All the method definitions that follow in this section should be defined inside this class.
Example 4-2. S3Backup class stub: s3backup.rb
#!/usr/bin/env ruby

# Load the AWS::S3 library and include it to give us easy access to objects
require 'rubygems'
require 'aws/s3'
include AWS::S3

# Use the ruby MD5 digest tool for file/object comparisons
require 'digest/md5'

class S3Backup

  # Implementation methods will go here...

end
To establish a connection with S3, you must let the AWS::S3 library know what your AWS credentials are. Example 4-3 defines an initialize method for the S3Backup class that will include your credentials.
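A minimal sketch of such an initialize method, using the library's Base.establish_connection! call with placeholder credential strings (replace them with your own AWS Access Key ID and Secret Access Key), might look like this:

# A minimal sketch of the initialize method: open a connection to S3
# with your AWS credentials. The credential strings below are
# placeholders and must be replaced with your own values.
def initialize
  Base.establish_connection!(
    :access_key_id     => 'YOUR_AWS_ACCESS_KEY_ID',
    :secret_access_key => 'YOUR_AWS_SECRET_ACCESS_KEY'
  )
end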
Before our program uploads files to S3, it needs to find out which files are already stored there so that only new or updated files will be uploaded. Example 4-4 defines a method that lists the contents of a bucket. As a convenience, this method will create a bucket if one does not already exist.
Example 4-4. List bucket contents: s3backup.rb
# Find a bucket and return the bucket's object listing.
# Create the bucket if it does not already exist.
def bucket_find(bucket_name)
  puts "Listing objects in bucket..."
  objects = Bucket.find(bucket_name)
rescue NoSuchBucket
  puts "Creating bucket '#{bucket_name}'"
  if not Bucket.create(bucket_name)
    raise 'Unable to create bucket'
  end
  objects = Bucket.find(bucket_name)
end
Example 4-5 defines a method that recursively lists the files and subdirectories contained in a directory path and returns the object names the files will be given in S3. The backup script will be given a directory path by the user to indicate the root directory location of the files to back up. Any file inside this root path will be uploaded to S3, including files inside subdirectories. When we store the files in S3, each object will be given a key name corresponding to the file’s location relative to the root path.
Example 4-5. List local files: s3backup.rb
# Find all the files inside the root path, including subdirectories.
# Return an array of object names corresponding to the relative
# path of the files inside the root path.
#
# The sub_path parameter should only be used internally for recursive
# method calls.
def local_objects(root_path, sub_path = '')
  object_names = []

  # Include subdirectory paths if scanning a nested hierarchy.
  if sub_path.length > 0
    base_path = "#{root_path}/#{sub_path}"
  else
    base_path = root_path
  end

  # List files in the current scan directory
  Dir.entries("#{base_path}").each do |f|
    # Skip current and parent directory shortcuts
    next if f == '.' || f == '..'

    file_path = "#{base_path}/#{f}"
    object_name = (sub_path.length > 0 ? "#{sub_path}/#{f}" : f)

    if File.directory?(file_path)
      # Recursively find files in subdirectory
      local_objects(root_path, object_name).each do |n|
        object_names << n
      end
    else
      # Add the object key name for this file to our list
      object_names << object_name
    end
  end

  return object_names
end
We now have methods to list the objects in the target S3 bucket and to list the local files that will be backed up. The next step is to actually upload the new and changed files to S3. Example 4-6 defines a method to do this.
Example 4-6. Upload files: s3backup.rb
# Upload all objects that are not up-to-date in S3.
def upload_files(path, bucket, files, force=false, options={})
  files.each do |f|
    file = File.new("#{path}/#{f}", 'rb') # Open files in binary mode

    if force || bucket[f].nil?
      # Object is not present in S3, or upload has been forced
      puts "Storing object: #{f} (#{file.stat.size})"
      S3Object.store(f, open(file.path), bucket.name, options)
    else
      obj = bucket[f]

      # Ensure S3 object is latest version by comparing MD5 hash
      # after removing quote characters surrounding S3's ETag.
      remote_etag = obj.about['etag'][1..-2]
      local_etag = Digest::MD5.hexdigest(file.read)

      if remote_etag != local_etag
        puts "Updating object: #{f} (#{file.stat.size})"
        S3Object.store(f, open(file.path), bucket.name, options)
      else
        puts "Object is up-to-date: #{f}"
      end
    end
  end
end
This method loops through the local file listing and decides which files should be uploaded by first checking whether each file is already present in S3. If the file is not present in the target bucket, it is uploaded immediately. If it is already present, the method checks whether the local file has changed since the S3 version was created.
If the file is already present in the bucket, we have to find out whether the local version is different from the version in S3. The method generates an MD5 hash of the local file’s contents to find out whether it differs from the object stored in S3. The S3 object’s MD5 hash value is made available as a hex-encoded value in the object’s ETag property. If the hash values of the local file and the object match, then they have identical content, and there is no need to upload the file. If the hashes do not match, then we assume the local file has been modified and that it should replace the version in S3.
It can take some time and processing power to generate the MD5 hash values for files, especially if they are large, so this hash-comparison approach slows things down. A faster alternative would be to compare the dates of the local file and the S3 object to see whether the local file is newer; but such comparisons are risky, because the object creation date reported by S3 may differ from your local system clock. Because we are more concerned with protecting our data than doing things quickly, we prefer to use hashes; it is the safest approach.
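To see the comparison in isolation, the following standalone snippet fetches an object's ETag and compares it with the MD5 hash of a local file. It is only a sketch: the bucket name, object key, and file path are hypothetical examples, and it assumes a connection has already been established as in the initialize method.

# Standalone illustration of the ETag/MD5 comparison.
# The bucket name, object key, and file path are hypothetical examples.
obj = S3Object.find('Document1.txt', 'my-bucket')
remote_etag = obj.about['etag'][1..-2]   # strip the surrounding quote characters
local_etag  = Digest::MD5.hexdigest(File.open('Documents/ImportantDirectory/Document1.txt', 'rb') { |f| f.read })
puts(remote_etag == local_etag ? 'Object is up-to-date' : 'Local file has changed')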
The upload_files method includes two optional parameters. The options parameter allows us to pass extra options to the S3Object.store method defined in the AWS::S3 library. Our script will use these options to specify an access control policy to apply to newly created objects. The method’s force parameter is a Boolean value that allows users to force files to be uploaded, even if they are already present in the bucket. This option could be handy if the user wanted to force a change to the Access Control List (ACL) policy settings of all the objects in a backup bucket.
In addition to storing files in S3, our backup script will be able to delete obsolete objects from S3 when the corresponding local file has been removed or renamed. This step will help to prevent our backup bucket from filling up with outdated files. Example 4-7 defines a method that loops through the objects present in the target bucket and checks whether the listing of local files includes a corresponding file. If there is no local file corresponding to the object, it is deleted. In a more advanced backup scenario, these outdated objects would be kept for some time, in case the local files had been deleted by mistake; but such a feature is beyond the scope of this book.
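A minimal sketch of such a delete_obsolete_objects method (the name matches the call in Example 4-8) might look like the following; it assumes, as in the AWS::S3 library, that the Bucket object exposes its listing through an objects method and that each listed object responds to key and delete.

# A sketch of the cleanup step: delete any object in the bucket that no
# longer has a corresponding local file in the list of object names
# produced by local_objects.
def delete_obsolete_objects(bucket, object_names)
  bucket.objects.each do |obj|
    if not object_names.include?(obj.key)
      puts "Deleting obsolete object: #{obj.key}"
      obj.delete
    end
  end
end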
The final step to complete the S3Backup class is to add a method to tie together all the steps required to perform a backup. Example 4-8 defines a back_up method that performs this task. The methods we defined above should only be used from within the class itself, so we will make these methods private by using Ruby’s private macro.
Example 4-8. Perform backup: s3backup.rb
# Perform a backup to S3
def back_up(bucket_name, path, force=false, options={})
  # Ensure the provided path exists and is a directory
  if not File.directory?(path)
    raise "Not a directory: '#{path}'"
  end

  puts "Uploading directory path '#{path}' to bucket '#{bucket_name}'"

  # List contents of the target bucket
  bucket = bucket_find(bucket_name)

  # List local files
  files = local_objects(path)

  # Upload files and delete obsolete objects
  upload_files(path, bucket, files, force, options)
  delete_obsolete_objects(bucket, files)
end

private :bucket_find, :local_objects
private :upload_files, :delete_obsolete_objects
The S3Backup class is now functionally complete, but the class by itself cannot be run as a script. Example 4-9 defines a block of code that will automatically invoke the S3Backup class when the Ruby script file is run from the command line. Add this code to the end of the script file, outside the body of the S3Backup class.
Example 4-9. Run block: s3backup.rb
if __FILE__ == $0
  if ARGV.length < 2
    puts "Usage: #{$0} bucket path [force_flag acl_policy]"
    exit
  end

  bucket_name = ARGV[0]
  path = ARGV[1]
  # Treat the optional force flag as a Boolean: only the string 'true' forces uploads
  force_flag = (ARGV[2] == 'true')
  acl_policy = (ARGV[3].nil? ? 'private' : ARGV[3])

  s3backup = S3Backup.new
  s3backup.back_up(bucket_name, path, force_flag, {:access => acl_policy})
end
The script is now ready to run. You can try it out with some of the following commands. However, be careful not to back up your files to an S3 bucket that already contains objects you wish to keep.
# Print a help message by not specifying the required parameters
$ ruby s3backup.rb
Usage: s3backup.rb bucket path [force_flag acl_policy]

# Back up the directory Documents/ImportantDirectory to the bucket my-bucket
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
Uploading directory path 'Documents/ImportantDirectory' to bucket 'my-bucket'
Listing objects in bucket...
Creating bucket 'my-bucket'
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .

# Follow-up backups of the directory Documents/ImportantDirectory will run
# faster as only new or changed files will be uploaded
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory
. . .
Object is up-to-date: Document1.txt
Object is up-to-date: Document2.txt
. . .

# Force the script to upload all the local files again, this time with the
# 'public' access control permission.
$ ruby s3backup.rb my-bucket Documents/ImportantDirectory true public
. . .
Storing object: Document1.txt (17091)
Storing object: Document2.txt (8517)
. . .
If you are serious about backing up your files to S3, you will likely need many backup features that are missing from this example; plus, we have not included a script to restore your files from S3 if a disaster strikes. We will leave these additional features as an exercise for the reader.
You may experience problems using version 0.4.0 of the AWS::S3 library with some web proxies, because the method that creates a bucket does not explicitly set the Content-Length header prior to performing the PUT request. Some web proxies refuse to pass on PUT messages that do not include this header, even though the S3 service itself accepts them.
If you receive inexplicable Unable to create bucket error messages when you use the s3backup.rb script, try adding the workaround code in Example 4-10 to your script outside the S3Backup class.