I’ve been reading, re-reading, and updating my working knowledge of data processing systems of all types. It struck me while doing this that if I held a few basic questions in mind, evaluating specific systems became easier. The questions have more than one answer: each system I have studied can be represented by its set of answers, and that set explains the target problem the system is trying to solve.
I’m calling these questions “The Quanta of Data Processing System Triage”. These Quanta are not all-inclusive. They are, rather, simple heuristics that I’ve found helpful in quickly approximating what a given system X is trying to accomplish. Since systems don’t describe themselves in exactly equivalent ways, directly comparing systems can be difficult. The idea with the Quanta is to avoid the direct comparison altogether: instead, reduce each system to its answers to the Quanta and then compare the answers. It’s an imperfect heuristic, but I find it a useful tool for coarse-grained comparison.
I’ll update my list as I go. The current Quanta of Data Processing System Triage are:
- How much data can one CPU core process in one second?
- How many separate read requests can a disk handle in one second?
- How much data can be read from a disk in a single request?
- How can I reduce the amount of data I need to sift through?
- How can I parallelize my data processing?
That’s it. The data processing systems I’ve been studying all try to optimize the answers to one or more of those questions in specific ways.
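To make the interplay between the second and third questions concrete, here is a rough back-of-the-envelope sketch. The function name `read_seconds` and every number in it (100 requests per second, 100 MB/s) are made-up assumptions for a hypothetical disk, not measurements of any real device. The model is also deliberately crude: it simply adds a fixed per-request cost to the transfer time.

```python
import math


def read_seconds(total_bytes, bytes_per_request, requests_per_second, bytes_per_second):
    """Estimate the seconds needed to read total_bytes from one disk.

    Crude model: every request pays a fixed cost (question 2), and the
    bytes themselves stream at the disk's throughput rate (question 3).
    """
    requests = math.ceil(total_bytes / bytes_per_request)
    request_overhead = requests / requests_per_second
    transfer_time = total_bytes / bytes_per_second
    return request_overhead + transfer_time


# Hypothetical disk (assumed numbers, for illustration only).
REQUESTS_PER_SECOND = 100
BYTES_PER_SECOND = 100_000_000
ONE_GB = 1_000_000_000

# Same gigabyte, read as many 4 KiB requests versus one large request.
small = read_seconds(ONE_GB, 4096, REQUESTS_PER_SECOND, BYTES_PER_SECOND)
large = read_seconds(ONE_GB, ONE_GB, REQUESTS_PER_SECOND, BYTES_PER_SECOND)

print(f"many small requests: {small:.2f} s")
print(f"one large request:   {large:.2f} s")
```

Under these assumed numbers the small-request read takes roughly 40 minutes and the single large read about 10 seconds, which is why so many systems work hard to turn many small reads into a few large ones.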
For example, suppose we want to know what a columnar data store is fundamentally doing. For a hypothetical system, the (heuristic) answer might be that it is optimizing for OLAP queries that tend to scan entire tables, but only for specific columns:
- (Quanta 1) Nothing special
- (Quanta 2) Reduce the magnitude of this issue by serializing queries
- (Quanta 3) Maximize this by storing data contiguously on disk and reading at throughput rates.
- (Quanta 4) Reduce the amount of data read by storing information by column instead of by row. Compress the stored data.
- (Quanta 5) Put the software into a cluster (e.g. MPP)
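The fourth answer in the list above can be sketched with some simple arithmetic. The table layout, column widths, row count, and compression ratio below are all invented for illustration, and the function names are hypothetical. The point is only to show why reading one column of a wide table touches far fewer bytes than reading every row in full.

```python
def bytes_scanned_row_store(num_rows, column_widths):
    """A row store reads every column of every row during a full scan."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_column_store(num_rows, column_widths, wanted, compression_ratio=1.0):
    """A column store reads only the requested columns, possibly compressed."""
    raw = num_rows * sum(column_widths[name] for name in wanted)
    return raw / compression_ratio


# Hypothetical table: column name -> assumed width in bytes.
widths = {"id": 8, "ts": 8, "region": 16, "amount": 8, "notes": 200}
rows = 1_000_000

row_bytes = bytes_scanned_row_store(rows, widths)
col_bytes = bytes_scanned_column_store(rows, widths, ["amount"])
col_bytes_compressed = bytes_scanned_column_store(
    rows, widths, ["amount"], compression_ratio=4.0  # assumed ratio
)

print(row_bytes, col_bytes, col_bytes_compressed)
```

With these assumed widths, a query that only needs `amount` scans 8 MB instead of 240 MB, and less again if the column compresses well.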
In future installments, I will explore each of the Quanta in more detail.