Some notes on DynamoDB 2022 paper

Last modification on 2022-09-16

It's been a long time since DynamoDB publish a paper, I read the paper a few days ago, and I thought that would be one of the most practical papers in recent years for large-scale distributed systems from an engineering perspective. It doesn't even explain much about the architecture (because it doesn't need to, Shared Nothing systems look similar, and the authors know exactly who the readers of this paper are ^_^), and the paper is basically written in plain words, not fancy at all (and no math!). After all, DynamoDB doesn't need to try to 'prove' anything, it has fully proven itself in the past 10 years, both in terms of scale and stability, I found that the paper is relatively new, and there are not too many blogs talk about it, so why not I write something about it?

Predictable Performance > High Performance

This is the first point I put in this post, and it is also the point that touches me most deeply when we built TiDB's cloud service in recent years. I always make this statement on different occasions: Stable slow is better than unstable fast, .99 Latency is more reflective of the system design skills than Avg Latency.

I don't know if this is intentional, but it is also presented in the first section of the paper DynamoDB 2022, so you can see the importance.

DynamoDB wants to provide predictable performance, the first step is to abstract the workload, which introduces the concept of RCU, and WCU; in fact, RCU and WCU are very close to QPS in the traditional sense, only the size of the target item is added, so that you can do relatively accurate workload planning, for example, 1 WCU = 1KB item’s 1 OPS. When the user can describe the workload in terms of RCU and WCU, the first step to predictability is complete, DynamoDB's scheduler can do a lot of things like pre-partitioning and pre-allocating resources, because the hardware capabilities for different models can be simply abstracted into a combination of WCUs and RCUs.

Once you know the quota of each partition, it is probably a backpack problem for the scheduling. DynamoDB will consider the sum of the quotas of the partitions of the same machine, the sum should be less than the total RCU/WCU that the machine can provide when allocating the quota of partition, which is about 20%-30% from the example given in the paper. In the real world, inexperienced developers or system administrators will usually squeeze the last bit of CPU and IO out of the machine in order to pursue the 'ultimate' performance, and must see 100% CPU Usage before they are satisfied, but in such case the machine is already in a very unhealthy state, as reflected by the long tail latency of requests will become very high, even though there may be an increase in throughput, but because of this unstable long tail, the observed performance from user’s perspective is 'unpredictable'. In a production environment, we usually recommend over-provisioning about 30% of the machines for the same reason.

Burst is a simple idea. When allocating quota, DynamoDB will reserve some capacity for each partition . When the traffic spikes in the short term, use the reserved capacity. Adaptive Capacity is to dynamically adjust the quotas of different partitions after the user's workload is skewed (but the total amount cannot exceed the total quota of the table).

It is important to note that Burst and Adaptive Capacity are based on an assumption that the user's workload does not change much, and also the flow control is focused on the partition level (that is, almost at the Storage Node level), that is, local scheduling.

In a large-scale storage system, flow control is actually a global problem. Using TiDB as an example, TiDB's SQL layer is a layer is a stateless layer. The requests will be forwarded to TiKV. TiDB SQL layer is a kind of 'Request Router'(using the term from the paper), but if multiple TiDB SQL nodes are deployed, flow control is only done on TiKV (storage layer). In extreme cases, TiDB SQL nodes might still keep hitting the overloaded TiKV Node with requests. To solve this problem, you actually need to do flow control in the TiDB layer, return errors directly to the client on the TiDB layer, and not penetrate the overloaded TiKV nodes.

This part in DynamoDB's paper is a bit vague. My personal understanding of the strategy is that Request Router applies request quota to GAC periodically. GAC will maintain a global allocation policy, and if some partition has been overloaded, the corresponding Request Router can directly deny service to customers to protect the rest of the system. The flow control at the partition level is also kept as the last line of defense at the node level.

For a Shared-Nothing storage system, the key to scale-out is the sharding strategy, DynamoDB chooses a dynamic sharding strategy, just like TiKV, it also uses Range Split, but the difference is that TiKV's Region (similar to DynamoDB's Partition concept) by default use size as the splitting threshold (in TiDB 6.0, the Load-based Split strategy was introduced), DynamoDB directly adopts Load-based Split. Partitions have a default throughput threshold, which will be split when it exceeds this value. And once you start monitoring the state of the load distribution over the key range, it is easy to get the optimal splitting point (which is not always the midpoint).

Another interesting thing about splitting is, when you should avoid splitting, the paper mentions:

single row hotspot, which is well understood.
The access pattern is sequential order of keys (similar to iteration on Key). In this case, DynamoDB will avoid splitting.

To summarize, the core of DynamoDB's predictable performance lies in:

More accurate abstraction of Workload (WCU/RCU), instead of simple TPS/QPS/Data Size
Pre-allocate quotas to partitions and strictly limit the flow
leave a margin on the node-level as a temporary resource pool for unexpected scenarios (Burst)
use global information for traffic scheduling and do flow control at all levels

Failover with as little observable impact as possible

Let’s first talk about WAL. DynamoDB, like TiKV, also replicates logs to multiple replicas (the default is 3 as implied in the paper) via Distributed Consensus algorithm (Multi-Paxos for DynamoDB, Raft for TiKV), but DynamoDB also synchronizes WALs (note that it's not DB Snapshot) to S3 periodically for higher durability, which I understand is not only for higher durability, but also for PITR (Point-in-time -recovery).

Another interesting detail about failover is that when a node in DynamoDB's replication group fails, for example, if one of the 3 copies fails, the Group Leader will immediately add a Log Replica to the replication group, and the Log Replica is actually what we often call Witness (I will also use Witness below instead of Log Replica), which is a node that only stores logs and does not provide services. In fact, this is a very smart approach, for the above case, although also meets the majority, but the system is very fragile at this time, and to completely restore a new member of the time is usually longer (first copy DB snapshot and then apply the recent logs), especially the case of Partition Snapshot is relatively large, and the process of copying Snapshot may introduce extra pressure to existing peers.

Adding a Witeness is low cost (the time mentioned in the paper is in seconds), it can at least ensure the security of the logs during data recovery, and what’s more, for cross-AZ deployment scenarios, this optimization can also reduce the write latency in failover phrase. For example, we have two AZs, one of which is the primary, we call it AZ-A, which carries the main write traffic, and another AZ, AZ-B is a standby for disaster recovery (or serves some local read traffic). When a node in AZ-A hangs, the Raft Group of the data this node will not stop writing (after leader re-election), but according to the classical Raft, in order to meet the requirements of the majority, it must ask the peers in AZ-B and make sure the log is persist successfully and then could return the success to the client, which means the performance jitter is observed from the client (until a new peer in AZ-A is added). What we detect node fail, and immediately find a healthy node in AZ-A as a witness and add it to this unhealthy group, so the write to AZ-A can still achieve majority, which saves the latency of synchronizing log to AZ-B’s replica. From the client's observation, the system does not show significant jitter.

To reduce observable impact during failover, DynamoDB also improves Leader Election for replication groups. For large-scale systems, network jitter or network partitioning are common. Let’s image a case, let’s say a peer in replication group, we call it Peer X. Once X cannot connect to the Group Leader, according to the traditional election protocol, will rashly initiate a new election with a bigger voting term, and then other peers will stop voting for it, and then observed from the user side is that this part of the data is basiclly unavailable (service will be stopped during the election for data consistency), but the old leader might still be alive.

This is a common problem, the solution is also very straightforward. Actually, in TiKV 2.1, an optimization called Pre-Vote was introduced: when a peer wants to initiate an election, it will first ask other peers whether they will vote for itself, and at the same time, it will confirm the status of the old leader is still alive, and then it can launch a new election. In DynamoDB, a similar mechanism is also mentioned in the paper: before a peer initiates a new election, it asks other peers whether they also think the old Leader is disconnected, and if not, it means that it is the candidate’s own problem, and it does not affect the normal nodes.

Also, the worst failure for large systems is cascading failure, for example the DynamoDB failure in 2015 (https://aws.amazon.com/message/5467D2/) is a typical one. The improvements mentioned in this paper for the Metadata Service remind me of this case (I guess it was probably because of this case that improvements were made), and the solution is very intelligent, so I'll paraphrase a little.

DynamoDB observed that one of the root causes of cascading failures is: traffic mutation in a short time, and one of the common factors that cause traffic mutation is cache failure, although we mostly think that the higher the cache hit rate, the better (the paper probably mentions that the cache hit rate of the partition router table is about 99.75%), such a high cache hit rate means that when there is the cache failed(or a cache warm-up phase when a large number of new nodes join), the metadata service has to be able to carry 400 times (worst case, 0.25% → 100%) the traffic surge, DynamoDB solves this problem by:

adding a level of distributed memory cache MemDS in the middle of Request Router and Metadata Service . After the local cache of Request Router is missed, it does not access Meta Service directly, but accesses MemDS first, and then MemDS accesses Metadata Service in the background to fill the data. By adding a layer of cache for peak-shaving, it is equivalent to adding another layer of insurance, which is a common way.
The second way is very smart, just mentioned Request Router in fact through MemDS to get metadata, when the request did not hit the cache in MemDS, it’s easy to understand. But what is really smart is: even if the cache hits, MemDS will also asynchronously access MetaData Service. Reason: 1. it ensures that the existing cache in MemDS is updated as soon as possible 2. bring 'stable' traffic to the MetaData Service (although it may be larger)
The 'stable' but larger traffic is, for example, the equivalent of playing in the water so that you can have good confidence when the flood comes, :)

Also, GAC based on limited token(capacity) reduces the impact of cascading failures.

In addition, for services on the cloud, one big advantage is that the pace of release cloud is faster than traditional enterprise software, but the deployment of a new release is usually the most vulnerable time of the system, and DynamoDB, such a large-scale system is unlikely to do offline updates. For rolling updates, the only point to note is that the old and new versions of nodes will coexist during the update process, so the new version needs to be able to communicate with the nodes running the old version and then switch to the new protocol at some point (after all nodes have deployed the new version).

DynamoDB's work on stability and failover can be summarized in one sentence: minimizing observable client-side impact, which I think is actually part of 'predictable' performance.

Database ≠ Database as a Service

I would say the Dynamo described in the paper a decade ago is more like a DB(laugh). This DynamoDB is actually a real DBaaS (Database as a Service). You may wonder what is the difference. I think building a DBaaS is not simply deploying multiple database instances to the cloud and hosting them. From the user's point of view, the endpoint provided by a DBaaS may act like a database instance, but under the hood, the implementation may not be so straightforward. Take an extreme example: Let’s say a DBaaS provides SQLite as a service, I think it is likely that it will not really create a new container for every user, provision the environment, and then really start an SQLite process to expose publicly, it is likely to be a shared service, but just behave the same as SQLite externally, to better utilize the resources.

So to build a DBaaS, the first thing to consider is multi-tenancy, and the reason why DynamoDB has to be redesigned is that the old Dynamo does not support multi-tenancy, which is unacceptable for cloud services. When TiDB was transforming to the cloud, we learned that cloud-native is not simply moving a database to the cloud and deploying it for different users, but requires a deep transformation from the kernel to the control platform for the capabilities and environment provided by the cloud.

Another significant difference between DBaaS and Database is that DBaaS is often difficult to deploy locally, and a modern DBaaS is in fact, built with many micro-services or heavily dependent on features provided by cloud vendors (especially storage and security-related services). You can also see in the DynamoDB paper: Request Router is a service, in charge of connections from different tenants, GAC is a service, Authentication System is a service, MetaData is a service, Storage is a service, not to mention the dependencies on other services like S3/IAM and so on ...but it's interesting: the paper doesn't mention any EC2 or EBS, which makes me guess that DynamoDB's hardware infra is probably maintained by itself, i.e., running on the bare-metal machine.

Takeaways

For TiDB, the problem is a little bit more complex than DynamoDB, after all, TiDB Cloud provides SQL service, for example, if a user type: a SELECT * FROM table; it will make it challenging to calculate RCU (especially TiDB has computing push down), but it is not impossible, maybe I can write about this topic in the future. TiDB Cloud has recently completed the separate the control platform as an independent service, and the Session Management Service (similar to DynamoDB's Request Router) that is being split from TiDB’s kernel codebase recently. So for me, the DynamoDB paper reinforces our judgment about the importance of the path of unbundling and micro-services transformation for building a cloud database.

Finally, here are my takeaways from this paper:

The more you understand the abstraction of workload, the better it is to build predictable systems, and the more granular the measurement of workload, the more room you have to make money (or save costs).
Consider multi-tenancy for your system from the beginning, and from a global perspective.
On the cloud, the access layer is very important. There are many things we can do in the access layer to improve the predictability of the system, like flow control, high availability, tenant isolation, and non-stop update ...
Abstract scheduling, do flow control at different layers.
Use micro-services built platform to achieve multi-tenancy.
The use of cloud infrastructure will save a lot of work, such as S3

After reading this paper, in fact, I feel that there are many things that have not been written, such as Serverless, GC strategy, Distributed Transaction (ACID), etc., but these don't stop this paper from becoming a classic. I learned a lot from this paper, If any of the readers of this blog are in DynamoDB team, please let me know, I'd be happy to buy you a beer (if you're in the Bay Area 😉)

Author: Ed Huang(i@huangdx.net), CTO, PingCAP/TiDB