The Quanta of Data Processing System Triage – Part 1, the Quanta

I’ve been reading, re-reading, and updating my working knowledge of data processing systems of all types.  It struck me while doing this that if I held a few basic questions in mind, evaluating specific systems became easier.  The questions have more than one answer: each system I have studied can be represented by its set of answers, and that set explains the target problem the system is trying to solve.

I’m calling these questions “The Quanta of Data Processing System Triage”.  These Quanta are not all-inclusive.  They are, rather, simple heuristics that I’ve found helpful in quickly approximating what a given system X is trying to accomplish.  Since systems don’t describe themselves in exactly equivalent ways, comparing systems directly can be difficult.  The idea with the Quanta is to avoid the direct comparison altogether: instead, reduce each system to its answers to the Quanta and then compare the answers.  It’s an imperfect heuristic, but I find it a useful tool for coarse-grained comparison.

I’ll update my list as I go.  The current Quanta of Data Processing System Triage are:

  1. How much data can one CPU core process in one second?
  2. How many separate read requests can a disk handle in one second?
  3. How much data can be read (from a single request) from a disk?
  4. How can I reduce the amount of data I need to sift through?
  5. How can I parallelize my data processing?

That’s it.  The data processing systems I’ve been studying all try to optimize the answer to one or more of those questions in specific ways.
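To make Quanta 2 and 3 concrete, here is a minimal back-of-envelope sketch.  The disk figures (100 requests per second, 100 MB per second) and the data set size are illustrative assumptions of mine, not measurements of any particular hardware; the point is only how far apart the two answers can be.

```python
# Back-of-envelope comparison of Quanta 2 and 3 for a single disk.
# Every hardware figure below is an illustrative assumption, not a measurement.


def random_read_seconds(num_requests: int, requests_per_second: float) -> float:
    """Time to serve num_requests separate reads when the disk is request-bound."""
    return num_requests / requests_per_second


def sequential_read_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Time to stream total_bytes in one contiguous read."""
    return total_bytes / bytes_per_second


ASSUMED_REQUESTS_PER_SECOND = 100.0       # hypothetical answer to Quanta 2
ASSUMED_BYTES_PER_SECOND = 100_000_000.0  # hypothetical answer to Quanta 3

records = 1_000_000   # hypothetical data set
record_size = 1_000   # bytes per record

# Fetch every record with its own request versus scan the whole file once.
random_time = random_read_seconds(records, ASSUMED_REQUESTS_PER_SECOND)
scan_time = sequential_read_seconds(records * record_size, ASSUMED_BYTES_PER_SECOND)

print(f"one request per record: {random_time:,.0f} s")
print(f"one sequential scan:    {scan_time:,.0f} s")
```

Under these assumed figures, the per-record approach is three orders of magnitude slower than the scan, which is why so many systems work hard to turn many small reads into a few large ones.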

For example, suppose we want to know what a columnar data store is fundamentally doing.  For a hypothetical system, the (heuristic) answer might be that it is optimizing for OLAP queries that tend to scan entire tables, but only for specific columns:

  • (Quanta 1) Nothing special
  • (Quanta 2) Reduce the magnitude of this issue by serializing queries
  • (Quanta 3) Maximize this by storing data contiguously on disk and reading at throughput rates.
  • (Quanta 4) Reduce the amount of data read by storing information by column instead of by row.  Compress the stored data.
  • (Quanta 5) Put the software into a cluster (e.g. SMP)
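The Quanta 4 answer above can be sketched as simple arithmetic.  The table layout, column widths, row count, and 4x compression ratio below are all hypothetical values chosen for illustration; real column stores differ in many details.

```python
# Rough estimate of bytes read by a full-table scan that needs only some columns.
# The table, widths, and compression ratio are hypothetical illustration values.


def bytes_scanned_row_store(num_rows, column_widths):
    """A row layout reads every column of every row, wanted or not."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_column_store(num_rows, column_widths, wanted, compression_ratio=1.0):
    """A column layout reads only the wanted columns, optionally compressed."""
    raw = num_rows * sum(column_widths[name] for name in wanted)
    return raw / compression_ratio


# Hypothetical table: column name -> width in bytes.
widths = {"id": 8, "customer": 64, "region": 16, "amount": 8, "notes": 160}
rows = 1_000_000

row_bytes = bytes_scanned_row_store(rows, widths)
col_bytes = bytes_scanned_column_store(rows, widths, ["region", "amount"])
col_bytes_compressed = bytes_scanned_column_store(
    rows, widths, ["region", "amount"], compression_ratio=4.0
)

print(f"row store:                {row_bytes:,} bytes")
print(f"column store:             {col_bytes:,.0f} bytes")
print(f"column store, compressed: {col_bytes_compressed:,.0f} bytes")
```

The less data there is to read, the less the Quanta 3 limit matters, which is the whole point of the column layout in this hypothetical system.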

In future installments, I will explore each of the Quanta in more detail.

Simple way to do a headless install of Sun/Oracle Java6 on Ubuntu

I was looking for a simple way to perform a headless install of the Sun/Oracle Java6 JDK on Ubuntu on EC2.  After hunting around quite a bit, I found a recipe and thought I’d post it here.

Edit /etc/apt/sources.list and add this line:
deb lucid partner
Then, from the shell:
sudo apt-get update
echo "sun-java6-jdk shared/accepted-sun-dlj-v1-1 select true" | sudo debconf-set-selections
echo "sun-java6-jre shared/accepted-sun-dlj-v1-1 select true" | sudo debconf-set-selections
sudo apt-get -y install sun-java6-jdk

NoSQL in a Sharded MySQL Context

I’ve been studying NoSQL – why it exists, what motivates it, how it differs from mere sharding, etc.   And by NoSQL, I’m thinking of only the crop of “new databases” — Cassandra, MongoDB, Voldemort, etc.   It’s a complex topic.

However, it occurred to me that one way to understand aspects of the problem is to think of a (relatively) common DB architecture and ask how/why NoSQL believes it has a better answer.  So I’m working through that process on one DB topology: a sharded MySQL database system.   Given such a system, why might I consider using a NoSQL alternative?  Thoughts on the topic are below.

Please note that these are just thoughts in one direction — I’m not trying to capture, here, answers to the arguments a NoSQL evangelist would put forward.  Neither am I accepting the NoSQL arguments.  Rather, I’m simply trying to make myself capable of verbalizing one particular NoSQL argument, independent of whether I buy the argument itself.

So, here is the summary of the argument I’ve heard to date.  There are probably better arguments and phrasings to be had.

Suppose you are a large website and are using MySQL at scale.  A (relatively) common way to scale MySQL is to use sharding.  Each shard might be arranged in a master/slave configuration, with a master machine replicating to N read slaves.  Perhaps you are even using cloud facilities (such as Amazon’s Auto Scaling groups) to facilitate the entire operation.

You’re going to have to face a few realities:

  1. Master DB machines go down or get sick.   Given a number of read slaves of that master, it’s nontrivial to figure out which of the slaves becomes the new master and how to get the other read slaves to begin replicating from that new master at the proper point.  (Amazon’s upcoming RDS read slaves are, therefore, pretty amazing technology.)
  2. When master DB machines go down, you will incur a delay in setting up a new master.
  3. How do you add new shards and rebalance the data?
  4. You need a way to know which shard to talk to.  Perhaps by implementing your own version of Consistent Hashing?
  5. Caching is also your problem.  You can use off-the-shelf systems, say memcached, to reduce the work, but it’s still not zero work.
  6. You will most likely give up distributed joins between the various shards.
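For item 4, a home-grown shard lookup based on consistent hashing might look roughly like the sketch below.  This is a minimal illustration under my own assumptions: the HashRing class, the shard names, the use of MD5, and the 100 ring points per shard are all hypothetical choices, not taken from any particular system.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical shard locator; each shard owns several points on a ring."""

    def __init__(self, shards, points_per_shard=100):
        self.points_per_shard = points_per_shard
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        for i in range(self.points_per_shard):
            p = _point(f"{shard}#{i}")
            if p in self._owner:
                continue  # extremely unlikely collision: keep the existing owner
            bisect.insort(self._points, p)
            self._owner[p] = shard

    def shard_for(self, key):
        """Walk clockwise from the key's position to the next shard point."""
        i = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.shard_for(k) for k in keys}

ring.add_shard("shard-d")
after = {k: ring.shard_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a shard")
```

The property that matters for item 3 is that adding a shard only moves keys onto the new shard; keys never shuffle between the existing shards.  The lookup is the easy part, though: you still have to physically move the data yourself.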

Now, you do all this in the interest of making a CAP theorem tradeoff — you give up instantaneous consistency and distributed joins to get availability and resilience to network partitioning.   You are maximizing A and P.  And it’s all worth it, provided a database is what you really need.

But suppose instead that you find you don’t do *any* joins or range queries — not even in a single shard.  Perhaps, at the extreme, you are even storing blobs of data under a key.  The database becomes little more than a transaction manager.

At this point, the new crop of NoSQL becomes attractive — Cassandra, for example — and for the following reasons:

  1. Cassandra is already focused on making a CAP tradeoff, and in the same direction; it already does data replication among nodes, implements consistent hashing for reading the data, does quorum reads, etc.
  2. Because of its base assumptions, Cassandra is better at adding/removing nodes from the group: they can be added/removed without causing wholesale rebalancing and without losing data.  There is no single master write node whose failure causes the system to hang.  Clients are relatively isolated from concerns about mapping a read to the proper server to read from.
  3. You gain more than you lose.  Yes, you give up a lot of DB capability (e.g. range queries, arbitrary joins, proper data normalization, etc.).  But since you weren’t using those anyway, what have you really lost?  Moreover, you gain simpler administration, operation, and expansion of your CAP tradeoff.
  4. Caching in Cassandra is somewhat inherent and transparent.
  5. A key/value code stack is faster than a SQL stack, even MySQL’s stack.
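The quorum reads mentioned in item 1 rest on a simple overlap rule: with N replicas, if a write reaches W of them and a read consults R of them, then R + W > N guarantees the read sees at least one replica holding the latest write.  The sketch below is a toy model under my own assumptions (in-memory replicas, integer version numbers, hypothetical function names); it is not Cassandra’s API or implementation.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    version: int


class Replica:
    """Toy in-memory replica that keeps the newest version of each key."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        current = self.store.get(key)
        if current is None or version > current.version:
            self.store[key] = Versioned(value, version)

    def read(self, key):
        return self.store.get(key)


def quorum_write(replicas, key, value, version, w):
    """Write to the first w replicas (a stand-in for 'the w that acknowledged')."""
    for replica in replicas[:w]:
        replica.write(key, value, version)


def quorum_read(replicas, key, r):
    """Read from the last r replicas and return the newest version seen."""
    answers = [replica.read(key) for replica in replicas[-r:]]
    answers = [a for a in answers if a is not None]
    return max(answers, key=lambda a: a.version, default=None)


replicas = [Replica() for _ in range(3)]           # N = 3
quorum_write(replicas, "k", "old", version=1, w=3)  # everyone has version 1
quorum_write(replicas, "k", "new", version=2, w=2)  # third replica is now stale

overlapping = quorum_read(replicas, "k", r=2)  # R + W = 4 > N: overlap guaranteed
stale = quorum_read(replicas, "k", r=1)        # R + W = 3 = N: may miss the write

print(overlapping.value, stale.value)
```

In a sharded MySQL setup, this kind of bookkeeping is something you would have to build and operate yourself, which is part of the argument above.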