A Tutorial on aws-sdb: A Ruby Interface to AWS SimpleDB

This tutorial was written by Robert Jones of Craic Computing in September 2008.
The software was written by Tim Dysinger, not by Craic Computing.

Introduction

Amazon SimpleDB is a web service that allows you to store and query structured data using the Amazon Web Services (AWS) hosted infrastructure. It is a relatively new member of the AWS portfolio of web services and at the time of writing (September 2008) is in a Limited Beta release.

Despite the 'DB' in its name, SimpleDB should not be viewed as a traditional relational database. Rather, it stores data in large hashes where each unique key ('item' in the parlance of AWS) identifies attributes that are themselves a set of key/value pairs. SimpleDB stores these across a distributed system, indexes them and provides a fairly simple API that allows you to store, retrieve and query your data.

You can learn more about SimpleDB on the AWS Developer Connection.

Interfaces to SimpleDB are available in all the major programming languages. In Ruby one interface is the aws-sdb software written by Tim Dysinger. The project is hosted on RubyForge and on GitHub.

It provides a simple, direct interface to SimpleDB: it translates your requests into URLs for the SimpleDB REST API, passes those to the service and parses the responses.

Prerequisites

To use the service you need to set up an account with AWS. Refer to their documentation for information on the costs involved in using SimpleDB. In addition you will need to sign up for access to SimpleDB. At the time of writing this service is in Limited Beta, which means you may not be given access. As the service matures this limitation will undoubtedly be lifted.

Once you have an AWS account you will be provided with an Amazon Access Key ID and a Secret Access Key. Both of these need to be supplied to aws-sdb. A standard way of doing this is to set them as environment variables in your shell like this:

$ export AMAZON_ACCESS_KEY_ID='<your key id here>'
$ export AMAZON_SECRET_ACCESS_KEY='<your secret key here>'
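
Before starting irb it can help to confirm that Ruby can actually see both variables. This is a minimal sketch; the constant REQUIRED_KEYS and the helper missing_aws_keys are names made up for this tutorial and are not part of aws-sdb.

```ruby
# Hypothetical helper (not part of aws-sdb): report which of the two
# environment variables named above are unset or empty.
REQUIRED_KEYS = %w[AMAZON_ACCESS_KEY_ID AMAZON_SECRET_ACCESS_KEY].freeze

def missing_aws_keys(env = ENV)
  REQUIRED_KEYS.select { |name| env[name].to_s.empty? }
end

missing = missing_aws_keys
if missing.empty?
  puts 'Both AWS keys are set'
else
  puts "Missing: #{missing.join(', ')}"
end
```

The helper only checks that the variables exist. It does not verify that the keys are valid for SimpleDB.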

Installing aws-sdb

aws-sdb is distributed as a Ruby Gem and is installed with this command (note the hyphen in the name):

$ sudo gem install aws-sdb

To save you some typing I've created a file of information about several music CDs. Download the file music.yaml for use later in the tutorial.

We will use the irb interactive Ruby shell to test out the service. The simple-prompt option makes things a little easier to read. Load the libraries as shown. The first two are the only ones you need for aws-sdb; the other two are used for this tutorial. Note that the library name here uses an underscore: aws_sdb!

$ irb --simple-prompt
>> require 'rubygems'
>> require 'aws_sdb'
>> require 'pp'
>> require 'yaml'

Setup

The first step in interacting with SimpleDB is creating a new instance of the Service class. This is NOT setting up any kind of persistent connection, as you would with a relational database. All your interaction with SimpleDB will happen through individual REST requests and responses sent via HTTP. In fact all this call does is create an object with the account information needed to access the service and set up a Logger instance to help with debugging.

>> sdb = AwsSdb::Service.new

This returns this ugly looking response.

=> #<AwsSdb::Service:0x14c7060 @secret_access_key="yyyy", 
 @logger=#<Logger:0x14c6f0c @default_formatter=#<Logger::Formatter:0x14c6ed0 
 @datetime_format=nil>, @progname=nil, @logdev=#<Logger::LogDevice:0x14c6e94 
 @shift_age=0, @filename="aws_sdb.log",
 @mutex=#<Logger::LogDevice::LogDeviceMutex:0x14c6e58 
 @mon_entering_queue=[], @mon_count=0, @mon_owner=nil, 
 @mon_waiting_queue=[]>, @dev=#<File:aws_sdb.log>, 
 @shift_size=1048576>, @level=0, @formatter=nil>, 
 @base_url="http://sdb.amazonaws.com", @access_key_id="xxxx">

Note that this contains your Amazon keys which you do not want to leave lying around. I've changed mine here.

Create a SimpleDB Domain

A Domain in SimpleDB is the container that houses all your data. You create a new domain with create_domain.

>> sdb.create_domain('craic_test')
=> nil

nil is the correct return value

You can list your domains with list_domains

>> sdb.list_domains
=> [["craic_test"], ""]

Notice that the command is returning an array. The first element is itself an array. The second element will be an empty string unless you have a large number of results. In that case SimpleDB may not be able to return all of them in a single call and will give you a token with which to fetch the next batch of results. I'll describe that later in the tutorial.

You can also delete a domain, and ALL the data it contains, with delete_domain

>> sdb.delete_domain('craic_test')
=> nil

Storing data in your Domain

First, load the tutorial data file into irb.

>> music = YAML.load_file('music.yaml')
=> {607618005825=>{"artist"=>"The Fall", 
"title"=>"The Wonderful and Frightening World of The Fall", 
"date"=>1984, "genre"=>"Alternative", "tracks"=>16}, 
[...]

music is a Ruby Hash where the keys are the barcodes from a set of eight CDs and the values are hashes which have keys like 'title' and 'artist', etc. Here is an example of one of the CDs:

>> pp music[744861056720]
{"artist"=>"Mogwai",
 "title"=>"Happy Songs for Happy People",
 "date"=>2003,
 "genre"=>"Alternative",
 "tracks"=>9}

SimpleDB stores data in effectively the same way. This block of data will become a record, or 'item', in the domain hash that can be retrieved using the unique key, or 'item name', which in this case is the string '744861056720'.

Note: Here is a BIG difference between Ruby hashes and SimpleDB. Everything in SimpleDB is a String. So while we can use the integer 744861056720 as a hash key in Ruby, in SimpleDB we have to convert it to a string. aws-sdb will do this for us in most cases, with the important exception of SimpleDB queries.

Store a Record into SimpleDB and Retrieve it

You use the put_attributes method to add a new item. The arguments are the domain, the unique name for the item and a hash of attributes that will become the value of the item.

For convenience, set your SimpleDB domain as a variable

>> domain = 'craic_test'
>> sdb.put_attributes(domain, 744861056720, music[744861056720])
=> nil

Fetch that record out again with get_attributes, which takes a domain name and the unique item name as its arguments

>> sdb.get_attributes(domain, '744861056720')
=> {"artist"=>["Mogwai"], "title"=>["Happy Songs for Happy People"],
 "date"=>["2003"], "genre"=>["Alternative"], "tracks"=>["9"]}

Note: Our single values are returned to us as Arrays. SimpleDB lets you store multiple values for each key. More on this in the next section.
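
If you mostly store single values, the arrays can get in the way. Here is a small sketch of one way to collapse them in your own code. The helper unwrap_single_values is hypothetical and not part of aws-sdb; the hash is simply the get_attributes result shown above pasted in, so the example runs without touching SimpleDB.

```ruby
# The hash below is copied from the get_attributes output above.
attributes = {
  'artist' => ['Mogwai'],
  'title'  => ['Happy Songs for Happy People'],
  'date'   => ['2003'],
  'genre'  => ['Alternative'],
  'tracks' => ['9']
}

# Hypothetical helper: collapse one-element arrays to a plain String and
# leave multi-valued attributes as arrays.
def unwrap_single_values(attrs)
  attrs.each_with_object({}) do |(name, values), result|
    result[name] = values.size == 1 ? values.first : values
  end
end

cd = unwrap_single_values(attributes)
puts cd['artist']          # a String, no longer an Array
puts cd['tracks'].to_i + 1 # numbers come back as Strings, so convert first
```

Attributes that really do hold several values are left as arrays, so code using this helper has to be prepared for either form.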

Updating and Deleting Items

Updating or replacing the attributes of an item is straightforward, again using put_attributes with a hash containing the key/value pairs you want to change, or add.

Add a new attribute to an existing item like this:

>> sdb.put_attributes(domain, '744861056720', { 'label' => 'Matador' })
=> nil
>> sdb.get_attributes(domain, '744861056720')
=> { [...] "label"=>["Matador"]}

Update an existing attribute of an item, replacing the value, like this:

>> sdb.get_attributes(domain, '744861056720')
=> { [...] "tracks"=>["9"]}
>> sdb.put_attributes(domain, '744861056720', { 'tracks' => 10 })
=> nil
>> sdb.get_attributes(domain, '744861056720')
=> { [...] "tracks"=>["10"]}

You can also add a new value to an existing attribute by disabling the default mode of replacing the value. You do this by passing false as the final argument to put_attributes. As I mentioned, get_attributes returns values as arrays, even though they only contain single values in our examples. In essence, you can store arrays of values accessed by individual keys. This is extremely useful for storing things like keywords or tags where you want to query across multiple values.

>> sdb.put_attributes(domain, '744861056720', { 'tracks' => 11 }, false)
>> sdb.get_attributes(domain, '744861056720')
=> { [...] "tracks"=>["10", "11"] }

Deleting an item is simple with delete_attributes. The method name might suggest that you can delete individual attributes of an item, but this call removes the whole item. To remove a single attribute you would have to delete the item and recreate it with the correct attributes.

>> sdb.delete_attributes(domain, '744861056720')
=> nil

Querying SimpleDB

Queries are where things get really interesting but we need to load in more data from our local hash before we can explore this.

>> music.each do |key, value|
?> sdb.put_attributes(domain, key, value)
>> end
=> {607618005825=>{"artist"=>"The Fall", 
"title"=>"The Wonderful and Frightening World of The Fall", "date"=>1984, 
[...]

SimpleDB's syntax for query expressions is quite different from SQL. The simplest query uses an empty string and returns all the item names for all the records in the domain

>> sdb.query(domain, '')
=> [["04577803712", "638812713728", "88088215812", "682434190825", 
"607618005825", "744861056720", "095081004429", "767981103228"], ""]

Note that it returns the item names, not the attributes. Right now you have to fetch the attributes yourself by looping through the items with calls to get_attributes. SimpleDB recently added a QueryWithAttributes operation to do this for you. At the time of writing this is not yet implemented in aws-sdb, but that will undoubtedly be added. This simple loop will fetch all the attributes from the items returned by a query.

>> items, dummy = sdb.query(domain, '')
>> items.each do |item|
?> pp sdb.get_attributes(domain, item)
>> end
{"artist"=>["The Black Keys"],
[...]

Real queries involve passing a query string to SimpleDB. This example returns all records that have a given 'artist'.

>> sdb.query(domain, "['artist' = 'The Fall']")
=> [["607618005825", "682434190825"], ""]

Note: Pay particular attention to the query expression and how its elements are quoted. The syntax is sufficiently similar to other expressions in Ruby that it is easy to get in a muddle.

Note the form of the result from the query call. It's an array of two elements. The first is itself an array of the item names that matched the query. The second element is empty in this case. If you had a large number of results, and/or only requested a certain number of results at a time, this element would be a pointer to the next set of results. We don't need to worry about that with this tutorial.

The SimpleDB documentation describes the query syntax in detail. Here are some examples that should be intuitive.

>> sdb.query(domain, "['artist' = 'The Black Keys' or 'artist' = 'Mogwai']")
>> sdb.query(domain, "['date' >= '2003' and 'date' <= '2006']")
>> sdb.query(domain, "['artist' != 'The Black Keys']")
>> sdb.query(domain, "['artist' starts-with 'The']")

Note: The syntax does not allow you to search with arbitrary substrings or regular expressions.

These are all queries of a single 'predicate', or key, but you can query multiple predicates using the 'union' and 'intersection' operators. The syntax is very different from the equivalent expressions in SQL but you soon get the hang of it. Union is the equivalent of OR between two predicates and Intersection is the equivalent of AND.

>> sdb.query(domain, "['artist' starts-with 'The'] union 
                      ['title'  starts-with 'The']")
=> [["04577803712", "767981103228", "095081004429", "607618005825",
     "682434190825"], ""]
>> sdb.query(domain, "['artist' starts-with 'The'] intersection 
                      ['title'  starts-with 'The']")
=> [["095081004429", "682434190825", "607618005825"], ""]

You can sort the results of a query like this:

>> sdb.query(domain, "['date' > '1900'] sort 'date'")

Be aware that the field used for the sort MUST appear in the query, such that this example gives an error:

>> sdb.query(domain, "['date' > '1900'] sort 'artist'")
AwsSdb::InvalidQueryExpressionError: The specified query expression syntax 
is not valid.

Let me reiterate that you have to be very careful with quoting in your queries. These are passed directly to SimpleDB and the service ONLY works on Strings. Consider these examples:

>> sdb.query(domain, "['date' >= '2003' and 'date' <= '2006']")

...Correct

>> sdb.query(domain, "['date' >= 2003 and 'date' <= 2006]")
AwsSdb::InvalidQueryExpressionError: The specified query expression syntax 
is not valid.

...Error: SimpleDB expects Strings, not numbers

>> sdb.query(domain, ['date' >= '2003' and 'date' <= '2006'])
SyntaxError: compile error

...Error: Ruby thinks this query is an array
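
One way to avoid these quoting mistakes is to build the query string with a small helper rather than typing it by hand. The sketch below is an illustration only: sdb_quote and predicate are hypothetical names, not part of aws-sdb, and the backslash escaping of embedded quotes is my assumption about the query syntax, so check it against the Developer Guide before relying on it.

```ruby
# Hypothetical helpers (not part of aws-sdb) for building query strings.

# Wrap a name or value in single quotes, converting it to a String first.
# Escaping embedded quotes and backslashes with a backslash is an
# assumption about the query syntax; check the Developer Guide.
def sdb_quote(value)
  "'" + value.to_s.gsub(/([\\'])/) { "\\#{$1}" } + "'"
end

# Build a single-attribute predicate such as ['artist' = 'The Fall'].
def predicate(attribute, operator, value)
  "[#{sdb_quote(attribute)} #{operator} #{sdb_quote(value)}]"
end

puts predicate('artist', '=', 'The Fall')
puts predicate('date', '>=', 2003)
puts predicate('artist', 'starts-with', 'The')
```

The resulting strings can be passed as the second argument to sdb.query. Because every value goes through to_s and is quoted, passing the integer 2003 no longer produces an invalid expression.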

Note: Because SimpleDB uses strings for everything, you can run into problems with numerical data. The AWS SimpleDB Developer Guide addresses these in depth and you should refer to that.
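
To see why this matters, remember that strings are compared character by character. A CD with 9 tracks would match a query for more than '10' tracks. Zero-padding numbers to a fixed width before storing them is the usual remedy. The pad helper and the width of 5 are arbitrary choices for this sketch.

```ruby
# As strings, '9' sorts after '10' because comparison is character by character.
puts '9' > '10'   # true, although 9 < 10 as numbers

# Zero-padding to a fixed width makes string order match numeric order.
# The width of 5 is an arbitrary choice for this sketch.
def pad(number, width = 5)
  format("%0#{width}d", number)
end

puts pad(9)            # 00009
puts pad(9) < pad(10)  # true
```

You would need to pad both the values you store and the values you use in queries, and pick a width large enough for the biggest number you expect. Negative numbers need additional handling that this sketch does not cover.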

Handling Large Numbers of Items

In the simple examples given here there are only a few matches to any query. In the real world you are going to store a lot of data in SimpleDB and your queries can easily return hundreds of items. The service limits the number of results in any individual response to 250. If you have more than that then you have to fetch them in multiple responses, and SimpleDB helps you manage that by giving you a token that you include in the next request.

We can mimic the process in our small dataset by explicitly limiting the number of results returned in each query response. A query with an empty string should return all 8 items. Adding a third argument to the call limits this to two, in our case.

>> sdb.query('craic_test', '', 2)
=> [["04577803712", "638812713728"], "rO0ABX...AAAHB4"]

Here you see that the second element in the returned array is no longer an empty string. Instead there is a long string that represents where you are in the set of all results from your query. You pass this back to SimpleDB as the fourth argument in your next query call in order to fetch the next batch of results

>> sdb.query('craic_test', '', 2, "rO0ABX...AAAHB4")
=> [["88088215812", "682434190825"], "rO0ABX...AAAHB4"]

You repeat the process until the query returns an empty string as the second element in the array, which tells you there are no more results. Note that the token changes every time you make a request. They look very similar at first glance but they are unique.

Note: One function that SimpleDB does not provide is a count of the number of items that match your query. You need to step through each batch of results to figure this out yourself.
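
The token loop can be wrapped in a helper. The sketch below assumes the argument order used in the calls above (domain, query, maximum results, token) and assumes that nil is acceptable as the token on the first call. The name all_item_names is hypothetical, and FakeService is a stand-in I wrote so the example runs without an AWS account; it is not how SimpleDB generates tokens.

```ruby
# Hypothetical helper: keep calling query with the token from the previous
# response until the token comes back empty.
def all_item_names(sdb, domain, query, batch_size = 250)
  names = []
  token = nil
  loop do
    batch, token = sdb.query(domain, query, batch_size, token)
    names.concat(batch)
    break if token.to_s.empty?
  end
  names
end

# Stand-in for AwsSdb::Service so the sketch runs without an AWS account.
# It pages through a fixed list and uses the next offset as its token.
class FakeService
  def initialize(names)
    @names = names
  end

  def query(_domain, _query, max = nil, token = nil)
    start = token.to_i
    max ||= @names.size
    batch = @names[start, max] || []
    next_start = start + batch.size
    [batch, next_start < @names.size ? next_start.to_s : '']
  end
end

fake = FakeService.new(%w[04577803712 638812713728 88088215812 682434190825])
names = all_item_names(fake, 'craic_test', '', 2)
puts names.size   # a count has to be computed client-side like this
```

With a real service you would pass your AwsSdb::Service object in place of fake. Every batch is a separate request, so counting a large result set this way costs one call per batch.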

SimpleDB is NOT a Relational Database

The next step in a tutorial on Queries in Relational Databases would probably be a discussion on joins across multiple tables.

But SimpleDB is not relational and does not work like that. You can store multiple types of data in the same domain but they are all stored in the same hash. It is up to you to ensure that the attribute names are distinct between the different types of data. There is no equivalent to a relational database join. You need to fetch all records of type A that match your query, then all records of type B and compute the join in your code OUTSIDE of SimpleDB.
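
As a rough illustration of what computing the join in your own code looks like, here is a sketch using plain Ruby hashes in the shape that get_attributes returns. The data is made up: the 'type' and 'label_id' attributes and the label records are invented for this example and are not in music.yaml.

```ruby
# Made-up sample data in the shape get_attributes returns, keyed by item name.
# 'type', 'label_id' and the label records are invented for this sketch.
cds = {
  '744861056720' => { 'type' => ['cd'],
                      'title' => ['Happy Songs for Happy People'],
                      'label_id' => ['label-1'] },
  '607618005825' => { 'type' => ['cd'],
                      'title' => ['The Wonderful and Frightening World of The Fall'],
                      'label_id' => ['label-2'] }
}
labels = {
  'label-1' => { 'type' => ['label'], 'label_name' => ['Matador'] }
}

# Join each CD to its label in Ruby, since SimpleDB cannot do it for us.
joined = cds.map do |item_name, cd|
  label = labels[cd['label_id'].first]
  {
    'item'  => item_name,
    'title' => cd['title'].first,
    'label' => label ? label['label_name'].first : nil
  }
end

joined.each { |row| puts "#{row['title']}: #{row['label'] || 'unknown label'}" }
```

In a real application the two hashes would be filled by separate queries followed by get_attributes calls, which is where the cost of this approach lies.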

That restriction means that some applications are not good matches for SimpleDB. You need to think through all the ways you might need to query your data before committing to using SimpleDB.

Behind the Scenes of aws-sdb

The innards of aws-sdb are fairly straightforward. One way to interact with SimpleDB is through its REST API. Readers familiar with Ruby on Rails should know all about REST. aws-sdb does the job of translating your requests into REST requests, sending those to SimpleDB over HTTP and processing the REST response, which consists of a block of XML. The developer guide explains all the gory details but you can see what is being passed back and forth by looking at the aws-sdb log file.

By default, when you create a new AwsSdb::Service object it will create a Logger instance and output messages to the file 'aws_sdb.log' in your current working directory.

Here is the URL that aws-sdb sent to SimpleDB in a list_domains call:

D, [2008-08-29T13:01:54.948027 #14979] DEBUG -- :
	http://sdb.amazonaws.com?Action=ListDomains&
	AWSAccessKeyId=xxxx&SignatureVersion=1
	&Timestamp=2008-08-29T20%3A01%3A54Z&Version=2007-11-07
	&Signature=%2FM%2FsoKbIZM%2F3E%2FI522Wt1nzX8%3D

Note that I've replaced the AWS Access Key ID
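
The Signature parameter is what proves that the request came from the holder of the secret key. The following is a sketch of how AWS signature version 1 (the scheme indicated by SignatureVersion=1) is computed, as I understand it. It is not copied from the aws-sdb source, the function name signature_v1 is my own, and the keys are placeholders.

```ruby
require 'openssl'

# Sketch of AWS signature version 1. The parameter names and values are
# sorted by name (ignoring case), concatenated with no separators, signed
# with HMAC-SHA1 using the secret key, and Base64-encoded.
def signature_v1(params, secret_key)
  data = params.sort_by { |name, _| name.downcase }
               .map { |name, value| "#{name}#{value}" }
               .join
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), secret_key, data)
  [digest].pack('m0') # 'm0' is Base64 without line breaks
end

# Placeholder credentials and parameters, matching the list_domains call.
params = {
  'Action'           => 'ListDomains',
  'AWSAccessKeyId'   => 'xxxx',
  'SignatureVersion' => '1',
  'Timestamp'        => '2008-08-29T20:01:54Z',
  'Version'          => '2007-11-07'
}
puts signature_v1(params, 'yyyy')
```

The result still has to be URL-encoded before it goes into the query string, which is why the logged value contains sequences like %2F and %3D.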

And here is the XML response, formatted to make it easier to read

D, [2008-08-29T13:01:55.361647 #14979] DEBUG -- : 200
<?xml version="1.0"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
  <ListDomainsResult>
    <DomainName>craic_test</DomainName>
  </ListDomainsResult>
  <ResponseMetadata>
    <RequestId>9a36b228-e9e8-4db9-aa69-aa027339f68e</RequestId>
    <BoxUsage>0.0000071759</BoxUsage>
  </ResponseMetadata>
</ListDomainsResponse>
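
Pulling the useful values out of a response like this takes only a few lines. This sketch uses REXML from the Ruby standard distribution on the XML shown above, pasted in as a string; it illustrates the idea and is not a copy of the parsing code inside aws-sdb.

```ruby
require 'rexml/document'

# The response body shown above, pasted in as a string.
xml = <<~XML
  <?xml version="1.0"?>
  <ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
    <ListDomainsResult>
      <DomainName>craic_test</DomainName>
    </ListDomainsResult>
    <ResponseMetadata>
      <RequestId>9a36b228-e9e8-4db9-aa69-aa027339f68e</RequestId>
      <BoxUsage>0.0000071759</BoxUsage>
    </ResponseMetadata>
  </ListDomainsResponse>
XML

doc = REXML::Document.new(xml)

# Collect every domain name, and read the BoxUsage figure as a Float.
domains   = REXML::XPath.match(doc, '//DomainName/text()').map(&:to_s)
box_usage = REXML::XPath.first(doc, '//BoxUsage/text()').to_s.to_f

puts domains.inspect
puts box_usage
```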

This put_attributes call

>> sdb.put_attributes(domain, 744861056720, music[744861056720])

produces this URL:

D, [2008-08-29T13:28:33.666574 #14979] DEBUG -- :
http://sdb.amazonaws.com?Action=PutAttributes&
 Attribute.0.Name=artist&Attribute.0.Replace=true&Attribute.0.Value=Mogwai&
 Attribute.1.Name=title&Attribute.1.Replace=true&
 Attribute.1.Value=Happy+Songs+for+Happy+People&
 Attribute.2.Name=date&Attribute.2.Replace=true&Attribute.2.Value=2003&
 Attribute.3.Name=genre&Attribute.3.Replace=true&
 Attribute.3.Value=Alternative&
 Attribute.4.Name=tracks&Attribute.4.Replace=true&Attribute.4.Value=9&
 AWSAccessKeyId=xxxx&DomainName=craic_test&ItemName=744861056720&
 SignatureVersion=1&Timestamp=2008-08-29T20%3A28%3A33Z&
 Version=2007-11-07&Signature=ukULqaBBAFvwg2FAOSmZcsc%3D

and this response

D, [2008-08-29T13:28:34.189043 #14979] DEBUG -- : 200
<?xml version="1.0"?>
<PutAttributesResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07/">
  <ResponseMetadata>
    <RequestId>37cb244d-ff83-4e93-b582-76f58bc0f3d4</RequestId>
    <BoxUsage>0.0000220157</BoxUsage>
  </ResponseMetadata>
</PutAttributesResponse>

Notice the BoxUsage tag pairs in the XML output. Box Usage is the measure of machine time that SimpleDB uses to calculate the cost of each request.

Having all your requests sent to a log file can be useful for debugging or for when you are just getting started, but you probably want to tone this down a bit. You can pass your own instance of Logger to AwsSdb::Service.new. This example only sends fatal errors to STDERR:

>> logger = Logger.new(STDERR)
>> logger.level = Logger::FATAL
>> sdb = AwsSdb::Service.new(:logger => logger)

Conclusion

SimpleDB can be a great way to manage certain types of data, but is not for everyone. The aws-sdb gem from Tim Dysinger provides a simple and efficient way to access the service directly from Ruby.

From a Ruby on Rails application you can use the AWS SDB Proxy plugin that serves as a bridge between ActiveRecord and SimpleDB, using aws-sdb to make the calls. See Martin Rehfeld's Developer Connection article for more information.

Check out the AWS SimpleDB Developer Guide to understand the power and the limitations of the system and look at the source code for aws-sdb to understand its implementation.