[System Design] Part#2: My notes on - How will you choose any component for System Design

Key database concepts that can help us choose the RIGHT database for our system design:

(1) CAP theorem
(2) Transactions / Need for ACID properties
(3) Schema structure: fixed (rarely changing) vs. dynamic
(4) Scalability needs

CAP theorem in database selection:

C+P (consistency + partition tolerance):

  • HBase (part of the Hadoop ecosystem; runs on top of HDFS in a Hadoop cluster)

Apache Cassandra [NoSQL]:

  • No ad-hoc query support: you need to model the data based on “how you query”!
  • Leaderless replication: a client can send a write to any node, which acts as the coordinator and forwards the write to the replicas. Instead of a single leader, tunable quorum reads and writes keep the replicas consistent.
  • Aggregation happens at the partition level, so aggregate queries on Cassandra need the partition key specified. For “aggregation” across partitions, it is better to use a separate processing engine such as Apache Spark (e.g. Spark Streaming).
  • No joins, and you cannot sort by a partition key column (ORDER BY works only on clustering columns)
  • Cassandra’s data distribution is based on consistent hashing. It works like this: every node has a token defining the range of hash values that node owns. During a write, Cassandra hashes the data’s partition key and checks the tokens to identify the node that owns that hash value. It stores the data on that node and replicates it to a number of other nodes. That number is the tunable replication factor, which is usually 3. This means your data is stored on 3 separate nodes, so if one or even two of them fail, a copy of your data is still available (whether a request succeeds also depends on the consistency level it asks for).
  • Scanning performance issue:
    Cassandra’s read is very quick and efficient as long as you know the primary key of the data you need. If you don’t, you may need to resort to scanning to find the required data. And Cassandra doesn’t like scans: if a scan takes longer than the configured timeout, it returns an error and you will probably not get your data back. However, if you integrate Cassandra with Apache Spark, performant scans become more feasible.
  • Cassandra is a good fit for IoT, recommendation and personalization engines, fraud detection, messaging systems, etc. Its quick write and read operations, coupled with low latency and linear scalability, make it a nice fit for these applications.
  • “Not good-fit for” :
    - When you want a lot of different types of queries or you can’t predict your data usage
    - When you want strong ACID compliance
    - When you want many-to-many mappings or join tables
    - When you don’t want a rigid schema

[Figure: Cassandra table partition example]
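
The consistent hashing notes above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: each node gets a single token (real clusters usually use many virtual nodes per machine), MD5 stands in for Cassandra's default Murmur3 partitioner, and the names `token_for`, `Ring` and `replicas` are made up for illustration, not a Cassandra API. Replicas are simply the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a string onto the ring. MD5 is an assumption made for this sketch."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sorted (token, node) pairs; a node owns the hash range that ends at its token.
        self.tokens = sorted((token_for(node), node) for node in nodes)

    def replicas(self, partition_key: str):
        """Return the nodes that would store this partition key."""
        key_token = token_for(partition_key)
        # First node whose token is >= the key's hash; wrap around past the end.
        start = bisect.bisect_left(self.tokens, (key_token, ""))
        count = min(self.replication_factor, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")
print(owners)  # three distinct nodes; the same key always maps to the same three
```

With a replication factor of 3, losing one or two of the returned nodes still leaves at least one copy of the partition, which is the availability argument made in the notes.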

ZooKeeper:

  • It exposes a file-system-like hierarchical namespace (znodes), and because it guarantees ordering it can back simple queue recipes. It is primarily a coordination service rather than a general-purpose distributed file system or message queue.
  • Although ZooKeeper provides functionality similar to the Paxos algorithm (a consensus protocol), its core consensus algorithm is not Paxos; it's called ZAB, short for ZooKeeper Atomic Broadcast. Like Paxos, it relies on a quorum for durability.
  • What’s “having a quorum”? It means that more than half of the nodes are up and able to talk to each other. If your client connects to a ZooKeeper server that is not part of a quorum, that server will not be able to answer any queries. This is how ZooKeeper protects itself against split brain in case of a network partition.
    Remember, ZooKeeper is a CP system w.r.t. the CAP theorem; it sacrifices Availability to achieve Consistency.
  • Possibly “Not a good fit for” =>
    * application logging; choose something with weaker consistency requirements
    * storing binaries; instead, store binaries in S3 and store the URLs in ZooKeeper
    * metrics; scaling is a problem
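
The quorum rule described above ("more than half of the nodes") is plain arithmetic, sketched below. The function names are hypothetical helpers for this note, not part of any ZooKeeper client library.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """True if the reachable servers can keep serving requests."""
    return reachable >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


# A 5-server ensemble split by a network partition into groups of 3 and 2:
print(has_quorum(5, 3))  # True  (majority side keeps working)
print(has_quorum(5, 2))  # False (minority side stops answering)
```

Only one side of any partition can hold a strict majority, which is why this rule prevents split brain. It also shows why ensembles are usually odd-sized: 4 servers tolerate 1 failure, the same as 3.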

Apache Spark:

  • Cassandra + Spark is suitable for “Recommendation Engine + Real-time analytics”

Amazon DynamoDB:

  • Key-value and document NoSQL database
  • A table is a collection of items, and each item is a collection of attributes
  • Uses the partition key (the first part of the primary key) to shard the data, and an optional sort key to order items within a partition
  • Supports both eventually consistent and strongly consistent reads
  • Conditional writes (a write is applied only if a stated condition on the item holds) help with complex transactions
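
To make the key layout and conditional writes concrete, here is a small in-memory model in Python. It is not the DynamoDB API: the `Table` class, the `version` attribute, and the `expected_version` parameter are assumptions made for this sketch, which only imitates the idea of "apply the write only if the item is still in the state I last saw".

```python
class ConditionFailed(Exception):
    """Raised when the write condition does not hold.

    DynamoDB itself reports this case as ConditionalCheckFailedException.
    """


class Table:
    """Toy table: items grouped by partition key, ordered by sort key."""

    def __init__(self):
        self._partitions = {}  # partition key -> {sort key -> item}

    def put(self, pk, sk, item, expected_version=None):
        """Write an item only if its stored version matches expected_version.

        expected_version=None means "the item must not exist yet".
        """
        partition = self._partitions.setdefault(pk, {})
        current = partition.get(sk)
        current_version = current["version"] if current else None
        if current_version != expected_version:
            raise ConditionFailed(
                f"expected version {expected_version}, found {current_version}"
            )
        partition[sk] = {**item, "version": (current_version or 0) + 1}

    def query(self, pk):
        """Return all items in one partition, ordered by sort key."""
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition)]


orders = Table()
orders.put("user#1", "2024-01-02", {"total": 10})
orders.put("user#1", "2024-01-01", {"total": 5})
orders.put("user#1", "2024-01-01", {"total": 7}, expected_version=1)  # succeeds
try:
    orders.put("user#1", "2024-01-01", {"total": 9}, expected_version=1)  # stale
except ConditionFailed as err:
    print("rejected:", err)
```

The second update is rejected because another writer already moved the item to version 2. This optimistic-locking pattern is one common way conditional writes are used to keep multi-step updates safe without locks.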

svalak

Passionate about learning ; Will write about #systemdesign #DSA #algorithms #linuxinternals #technology; Painting/Poem writing are my hobbies; Voracious Reader