NoSQL: Is It Time to Ditch Your Relational Database?

April 24, 2010

This week London hosted the largest NoSQL conference so far. The aim was to explore non-relational data stores, which have grown in prominence recently, particularly with Twitter joining Facebook and Digg as a Cassandra adopter.

Pragmatism was a common theme of all the presentations, and the principle of “using the right tool for the job” came up again and again. There seemed to be general agreement that what we are really talking about is “not only SQL”: in other words, using these technologies as a complement to relational databases, where many years of experience have been accumulated.

I recommend reading the presentations for day 1 and day 2, both of which are on myNoSQL (which Makoto Inoue called “The Hello of NoSQL” in his very entertaining talk on Tokyo Cabinet). The two days I was there were enjoyable for many reasons, including many general pearls of architectural wisdom, but I want to focus on practical examples where NoSQL is in use.

So how are people using NoSQL?

Matt Wall and Simon Willison described how they do things at the Guardian. They have an Enterprise Java platform that provides feeds on which front-end developers can build useful features. Around the edges, the team have used various tools for rapid development, including Redis (read Simon’s post) for the BNP heat map and a more performant version of the MP expenses review page, Google AppEngine for Zeitgeist, and Google Docs (specifically spreadsheets) for sharing data.

Jonathan Ellis gave a technical presentation on Cassandra, which is designed from the ground up for replicating data. Replication across nodes is achieved by streaming pre-sorted blocks sequentially.

Kevin Weil told us about the challenges they face at Twitter, starting with the 7TB of data they collect each day (writing that amount of data at a typical disk speed of 80MB/s would take 24.3 hours). In addition to adopting Cassandra, Twitter has developed FlockDB, which is a social graph store.
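The write-time figure above is easy to check with a quick calculation. This is only a back-of-envelope sketch: it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a single disk writing sequentially at a constant rate, and the function name is mine rather than anything from the talk.

```python
# Back-of-envelope check of the sequential write time quoted above.
# Assumption: decimal units, i.e. 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.

def write_time_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to write total_bytes sequentially at a fixed rate."""
    return total_bytes / bytes_per_second / 3600


daily_bytes = 7 * 10**12   # 7TB collected per day
disk_rate = 80 * 10**6     # 80MB/s, the disk speed quoted in the post

hours = write_time_hours(daily_bytes, disk_rate)
print(f"{hours:.1f} hours")  # 24.3 hours
```

In other words, a single disk could not even keep up with one day's data within that day, which is why the writes have to be spread across many machines.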

The BBC uses CouchDB as a key-value store for iPlayer, the home page layout, and the film network. Enda Farrell explained that they control access to CouchDB through an API, which allows them to support authorisation, sharding and JMX instrumentation.
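To make the idea of putting an API in front of the data store more concrete, here is a minimal sketch of such a layer. It is not the BBC's implementation: the class name, the token check and the hash-based sharding are all my own assumptions, and plain dictionaries stand in for the CouchDB nodes. Instrumentation is left out.

```python
import hashlib


class KeyValueFacade:
    """Hypothetical API layer in front of several key-value nodes."""

    def __init__(self, shard_count, allowed_tokens):
        # Each dict stands in for one storage node (an assumption for
        # illustration; a real layer would talk to the store over HTTP).
        self.shards = [{} for _ in range(shard_count)]
        self.allowed_tokens = set(allowed_tokens)

    def _shard_for(self, key):
        # Use a stable hash so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def _authorise(self, token):
        # Reject any caller whose token is not on the allow list.
        if token not in self.allowed_tokens:
            raise PermissionError("unknown client token")

    def put(self, token, key, value):
        self._authorise(token)
        self._shard_for(key)[key] = value

    def get(self, token, key):
        self._authorise(token)
        return self._shard_for(key).get(key)


store = KeyValueFacade(shard_count=3, allowed_tokens={"homepage-team"})
store.put("homepage-team", "layout:front", {"columns": 3})
print(store.get("homepage-team", "layout:front"))  # {'columns': 3}
```

Because clients only ever see put and get, the number of shards and the authorisation rules can change without touching the applications that use the store.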

Matthew Ford has been involved in a number of NoSQL projects, and covered the pros and cons of document-oriented data stores (CouchDB/MongoDB) and key-value stores.

Tobias Iversson’s slide sums up where NoSQL data stores fit into the architect’s toolbox: relational databases are suitable for the majority of cases, and we understand well how to manage these.

[Figure: Scaling to size vs complexity]

However, there are some specific cases, namely key-value stores and graph databases, where alternative solutions are better.

Deciding whether to use a document data store instead of a relational database (RDBMS) is more difficult. Cassandra has been proven with large data sets, with indexing done by pre-defining “supercolumns” that provide the mapping between indexes and their corresponding data values. CouchDB queries are done through views, while MongoDB allows ad-hoc queries. Relational and non-relational databases all deal with structured data, but a non-relational store keeps all the data for a record in one place, whereas an RDBMS requires you to join rows from normalised tables.
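The difference between joining normalised rows and keeping everything in one place can be shown with a small example. This is only a sketch: the blog post and comments are made-up data, SQLite stands in for the RDBMS, and a plain dictionary stands in for a document as CouchDB or MongoDB would store it.

```python
import sqlite3

# Relational form: two normalised tables, joined at read time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (post_id INTEGER, author TEXT, body TEXT);
    INSERT INTO posts VALUES (1, 'NoSQL in London');
    INSERT INTO comments VALUES (1, 'alice', 'Great write-up');
    INSERT INTO comments VALUES (1, 'bob', 'Thanks for the links');
""")
rows = conn.execute("""
    SELECT p.title, c.author, c.body
    FROM posts p JOIN comments c ON c.post_id = p.id
    WHERE p.id = 1
    ORDER BY c.author
""").fetchall()

# Document form: the same (made-up) data held as one nested record.
document = {
    "_id": 1,
    "title": "NoSQL in London",
    "comments": [
        {"author": "alice", "body": "Great write-up"},
        {"author": "bob", "body": "Thanks for the links"},
    ],
}

# Rebuild the nested shape from the joined rows to show both forms
# carry the same information.
rebuilt = {
    "_id": 1,
    "title": rows[0][0],
    "comments": [{"author": author, "body": body} for _, author, body in rows],
}
print(rebuilt == document)  # True
```

The document form is what the application usually wants to work with anyway; the relational form needs the extra rebuilding step, which is the job an object-relational mapper normally does.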

The structured data model feels like a more natural fit to the data model typically used in an application, and avoids the object-relational mapping problems associated with mapping a hierarchical structure to a set of flat tables. However, adopting one of the “newer” data stores is intrinsically more risky than using an RDBMS, because they have not been around for as long.

Although Cassandra is apparently suitable for single-server installations, I expect it will be the option for larger sites for some time, given the additional complexity associated with the super-column model. For smaller sites you may find CouchDB’s and MongoDB’s features appealing, such as CouchDB’s replication (also being adopted by MongoDB), and the easier interfacing through JSON. However, a relational database is still likely to be the right choice for the majority of cases.


The Digital Economy Bill: A Washed Up Bill from a Washed Up Government

April 7, 2010

Like many in the tech community I followed yesterday’s debate on the Digital Economy Bill. This Bill is being sponsored by Lord Mandelson, who developed an interest in clamping down on copyright infringement after he had dinner with David Geffen on his Corfu yacht.

The Bill confers heavy sanctions (to be finalised) for illegal filesharing, yet the Government seems determined to rush it through at the tail end of this Parliament without proper debate.

The turnout was poor; perhaps this was apathy, or the fact that Brown and Mandelson chose to announce the General Election on the same day. (My MP wasn’t there.)

We were initially told by Harriet Harman that the Bill would be dealt with through “super-affirmative” actions, as part of the Parliamentary washup. This means that a committee would look at the Bill and make any necessary amendments. As was pointed out by John Grogan (Labour), pushing through a contentious bill like this without proper debate is unprecedented (all other bills that have been handled in the washup have had broad agreement). Fiona Mactaggart also noted later that “secondary legislation Committees [are] not places where scrutiny occurs; they are another example of pathetic oversight by Parliament.”

On the Labour side John Robertson and Sion Simon were firmly in favour of the Bill. In one exchange, Tom Watson stated that as copyright infringement is theft, infringers should be allowed their day in court; Simon replied indignantly that they would (by appeal at a tribunal), which is a completely different thing.

Other notable points:

  • Bradshaw stated that the “technical measures” that may be imposed on repeat infringers would “not involve permanent disconnection”. These measures are covered by Clause 18 of the Bill, and Fiona Mactaggart pointed out later on that most MPs hadn’t yet seen the revised clause.
  • John Redwood asked a perfectly reasonable question about what someone could do with, for example, a downloaded song: copy it to another PC, or let others listen to it. Stephen Timms told him he was “barking up the wrong tree”, with Bradshaw scornfully following up “not for the first time”; Timms replied that rights owners wouldn’t need to do that. Redwood restated the question and Timms again misinterpreted it, refusing to allow any further comeback.
  • When Adam Afriyie challenged Tom Watson that it was the latter’s party who were introducing the bill, Watson shot back that the Tories had the power to stop it. (Afriyie coined the phrase “a washed up bill from a washed up government”.)
  • Fiona Mactaggart gave an impassioned and informed speech, pointing out that sharing intellectual property (IP) can create a market for that IP, something that many of the other MPs seemed to miss.

I find the Conservative Party’s argument (via Jeremy Hunt), that something needs to be done to prevent damage to the economy, rather weak. The assertion that hundreds of millions of pounds are lost every year is presumably an extrapolation of the figure from the British Phonographic Industry mentioned in the Digital Britain Report. Hunt said that the Tories reserve the right to return to the Bill and make appropriate amendments after the next election, but that presupposes that they are actually elected.

When Blair was Prime Minister it was terrorism that was used to justify unreasonable measures. Now it’s the economy.

Yesterday the Government again showed its contempt for democracy in this country. Remember this on May 6th.

The full text of the debate is available on Hansard: http://www.publications.parliament.uk/pa/cm200910/cmhansrd/cm100406/debindx/100406-x.htm

Update: As the Bill has evolved it has gradually strayed from the original intent of the Digital Britain White Paper. Tom Watson and others proposed a number of amendments last night, including tightening up the definitions, e.g. focusing on peer-to-peer file sharing instead of general online infringement, and challenging the starting assumption of liability on the part of the internet connection owner. Ultimately, Timms wasn’t having any of it. The only significant change in the Bill was to drop clause 43, which would have allowed unclaimed (orphaned) works of copyright to be used by others. The Bill was passed, with numerous Labour MPs filing in who hadn’t bothered to turn up for the debate, and very few Conservative MPs voting. Stephen Timms has lost whatever geek credentials he had, as he apparently doesn’t know what an IP address is. And it looks like the Bill is going to meet with a fair amount of resistance.