I’ve been studying NoSQL: why it exists, what motivates it, how it differs from mere sharding, and so on. By NoSQL, I mean only the recent crop of “new databases” (Cassandra, MongoDB, Voldemort, and the like). It’s a complex topic.
However, it occurred to me that one way to understand aspects of the problem is to take a (relatively) common DB architecture and ask how and why NoSQL believes it has a better answer. So I’m working through that process for one DB topology: a sharded MySQL system. Given such a system, why might I consider a NoSQL alternative? Thoughts on the topic are below.
Please note that these are just thoughts in one direction — I’m not trying to capture, here, answers to the arguments a NoSQL evangelist would put forward. Neither am I accepting the NoSQL arguments. Rather, I’m simply trying to make myself capable of verbalizing one particular NoSQL argument, independent of whether I buy the argument itself.
So, here is the summary of the argument I’ve heard to date. There are probably better arguments and phrasings to be had.
Suppose you are a large website and are using MySQL at scale. A (relatively) common way to scale MySQL is to use sharding. Each shard might be arranged into master/slave scenarios with a master machine replicating to N read slaves. Perhaps you are even using cloud facilities (such as Amazon’s scale groups) to facilitate the entire operation.
You’re going to have to face a few realities:
- Master DB machines go down or get sick. Given a master’s set of read slaves, it’s non-trivial to figure out which slave becomes the new master and how to get the remaining slaves to begin replicating from that new master at the proper point. (Amazon’s upcoming RDS read slaves are, therefore, pretty amazing technology.)
- When a master DB machine goes down, you will incur a delay while the new master is set up.
- How do you add new shards and rebalance the data?
- You need a way to know which shard to talk to, perhaps by implementing your own version of consistent hashing.
- Caching is also your problem. You can use off-the-shelf systems such as memcached to reduce the work, but it’s still not zero work.
- You will most likely give up distributed joins between the various shards.
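To make the shard-routing point above concrete, here is a minimal consistent-hashing sketch. The class name, shard names, and virtual-node count are all illustrative inventions, not taken from any particular library; a real ring would also handle replication and node health.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: keys and shards are hashed onto the
    same circle; a key belongs to the first shard point clockwise."""

    def __init__(self, shards, vnodes=64):
        # Each shard gets many "virtual nodes" so keys spread evenly.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def shard_for(self, key):
        # Find the first ring point at or past the key's hash,
        # wrapping around to the start of the circle if needed.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("user:42"))
```

The payoff of the virtual-node trick is that adding a shard moves only roughly 1/N of the keys, rather than reshuffling everything, which is exactly the rebalancing problem the bullet list describes.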
Now, you do all this in the interest of making a CAP theorem tradeoff — you give up instantaneous consistency and distributed joins to get availability and resilience to network partitioning. You are maximizing A and P. And it’s all worth it, provided a database is what you really need.
But suppose instead that you find you don’t do *any* joins or range queries — not even in a single shard. Perhaps, at the extreme, you are even storing blobs of data under a key. The database becomes little more than a transaction manager.
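That access pattern is easy to picture in code. Below is a hypothetical sketch of a relational table used purely as a key/value store, where every query is a primary-key get or put (sqlite3 stands in for MySQL so the example is self-contained; the table and key names are made up):

```python
import sqlite3

# A single two-column table: an opaque blob stored under a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

def put(key, blob):
    # Upsert by primary key -- the only write path the app uses.
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, blob))

def get(key):
    # Point lookup by primary key -- the only read path the app uses.
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

put("user:42", b'{"name": "alice"}')
print(get("user:42"))
```

When this is the entire query surface, the relational machinery (planner, joins, secondary indexes) is dead weight, which is the point of the paragraph above.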
At this point, the new crop of NoSQL becomes attractive — Cassandra, for example — and for the following reasons:
- Cassandra is already focused on making a CAP tradeoff, and in the same direction: it already replicates data among nodes, uses consistent hashing to locate data, performs quorum reads, and so on.
- Because of its base assumptions, Cassandra is better at adding and removing nodes from the group: they can be added or removed without wholesale rebalancing and without losing data. There is no single master write node whose failure hangs the system. Clients are largely isolated from the concern of mapping a read to the proper server.
- You gain more than you lose. Yes, you give up a lot of DB capability (e.g. range queries, arbitrary joins, proper data normalization, etc.), but since you weren’t using those anyway, what have you really lost? Meanwhile, you gain simpler administration, operation, and expansion under the same CAP tradeoff.
- Caching in Cassandra is somewhat inherent and transparent
- A key/value code stack is faster than a SQL stack, even MySQL’s stack
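The quorum-read idea mentioned above can be sketched in a few lines. This is an illustrative toy, not Cassandra’s actual implementation: it assumes three in-memory replicas, last-timestamp-wins conflict resolution, and the classic R + W > N overlap rule.

```python
import random

N, W, R = 3, 2, 2        # replica count, write quorum, read quorum; R + W > N
replicas = [{}, {}, {}]  # each replica maps key -> (timestamp, value)

def write(key, value, ts):
    # Return once W replicas have acknowledged; in a real system the
    # remaining replicas catch up later (or miss the write entirely).
    acks = 0
    for rep in replicas:
        rep[key] = (ts, value)
        acks += 1
        if acks >= W:
            break

def read(key):
    # Ask any R replicas and keep the newest version seen. Because
    # R + W > N, at least one of them overlaps the last write quorum.
    chosen = random.sample(replicas, R)
    versions = [rep[key] for rep in chosen if key in rep]
    return max(versions)[1] if versions else None

write("user:42", "v1", ts=1)
write("user:42", "v2", ts=2)
print(read("user:42"))  # "v2": every read quorum overlaps the latest write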