One Year with TiDB Serverless - Part I

Ed Huang, Co-founder / CTO of PingCAP

In July 2023, after TiDB Serverless officially went GA on AWS, and driven by the philosophy of "eating our own dog food," I began building my personal projects on it. Even setting aside my role as a co-founder of PingCAP, the usability of TiDB Serverless is genuinely delightful: it works out of the box, you can connect from anywhere, there is no need to worry about data consistency or scalability, the UI is great, and the ecosystem is compatible with the MySQL protocol. What’s more, it’s free. Serverless has also become a new trend in foundational data software in the Valley, as demonstrated by Databricks’ "Serverless for All" slogan during their Data + AI Summit keynote this year. So, after using our own product for nearly a year and building more than a dozen small projects, I have gathered some thoughts, and today I will summarize them. This blog will be divided into two parts: this first part focuses on macro-level thoughts, and the next will focus on best practices and usage insights at the micro level.

Open vs. Closed?

As a developer, I really don’t like closed ecosystems. I've seen many database service providers try to provide a "better" user experience by encapsulating everything and offering a "simple" RESTful API endpoint (I won’t name vendors here)... You can't see what's happening behind the scenes, you don't know where the workload bottlenecks might be, and you don’t even know where your data is stored. I believe that good usability shouldn't and doesn’t need to sacrifice transparency. The sense of insecurity brought by closed systems is a major reason why users hesitate to adopt them. In my opinion, a good practice would be:

  1. Rely on a solid open-source core and maintain an upstream-first principle to keep that core stable and open; this is the foundation of trust.
  2. Use widely validated protocols and ecosystems, with standard, universal migration paths (including import/export). For databases, the mainstream SQL protocol is the best choice, ensuring that users can easily migrate to other systems when necessary (see the sketch after this list).
  3. Focus cloud value-added services on optimizing the user experience and encapsulating operational best practices, rather than on hidden new features (features that do not exist in the open-source version or are exclusive to paid editions).
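
To make the second point concrete: because TiDB speaks the MySQL wire protocol, any stock MySQL driver works against it unchanged, with no vendor SDK in the way. Here is a minimal sketch in Python using pymysql; the endpoint and credentials are placeholders, and pointing the same code at a vanilla MySQL server only means changing the connection parameters.

```python
# A minimal sketch: the endpoint, user, and password below are
# placeholders. Because TiDB Serverless speaks the MySQL wire
# protocol, a stock MySQL driver (pymysql here) needs no vendor SDK.
import pymysql

conn = pymysql.connect(
    host="gateway01.us-east-1.prod.aws.tidbcloud.com",  # placeholder endpoint
    port=4000,                          # TiDB's default MySQL-protocol port
    user="your_user",
    password="your_password",
    database="test",
    ssl={"ca": "/etc/ssl/cert.pem"},    # serverless endpoints require TLS
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")     # ordinary SQL over the MySQL protocol
    print(cur.fetchone())
conn.close()

# Pointing the same code at a vanilla MySQL server only means changing
# host/port/credentials; nothing else in the code moves.
```

This symmetry is the migration guarantee in practice: the way in is also the way out.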

Many people challenge this: if everything is open-source, how do you commercialize it?

Indeed, some users may choose to use the open-source version and manage it themselves. However, except for the geeks who like to tinker (they wouldn’t pay anyway...), almost all enterprise users who choose this route do so for one of two reasons:

  1. Cost sensitivity
  2. Compliance

For cost-sensitive users, it’s true that, in the infrastructure field, an excellent architecture versus a poor one can make a huge difference in total cost, especially at large scale. However, most companies struggle to find talent with those skills, and that talent is itself a significant cost. In reality, people tend to overestimate short-term hardware costs (especially while the scale is small) and underestimate the long-term hidden costs of time, opportunity, and technical debt. Another question worth reflecting on as a vendor: have you provided a sufficiently low entry point, one that lets users quickly try the product and validate the direct value it brings?

Regarding compliance, there’s not much to say: ensure your product meets mainstream compliance standards. If a requirement is non-standard, assess whether it’s worth doing. Typically, users with non-standard compliance needs are large clients and may be worth pursuing; if the lifetime value (LTV) calculation shows it isn’t worth it, then don’t bother.

Even for these two types of customers, open source doesn’t hinder paid conversion. For large customers, open source is actually a source of security and confidence: even if they choose not to use your cloud product or pay for your services, the open-source version remains an option for them, and you are still indirectly creating value.

Therefore, you can see that an open-source and upstream-first strategy does not hinder commercialization. On the contrary, for large enterprises or cloud customers, it is one of the key reasons they choose you.

Scalability vs. Ease of Use? Not a Contradiction

Years of experience building database systems have taught me that building one is not a simple task, especially once scaling is required. More importantly, the moment you need to scale is often unpredictable, and if you wait until that moment to react, it may already be too late.

There’s a common misconception: Scalability typically means massive data and workloads, making such systems more expensive, complex, bulky, and slow. This might have been true 20 years ago, but today it likely needs revision: from a user’s perspective, scalable systems are simpler to use, have lower startup costs, and perform better than non-scalable systems. A great example is S3 (one of my favorite systems). There’s no denying that, under the hood, S3 is an incredibly complex and sophisticated distributed system (this blog by Amazon CTO Werner Vogels showcases its complexity and scale well: Link). But from the user’s perspective, it's incredibly simple, works out of the box, and, due to the scale of the underlying infrastructure, provides nearly infinite scalability and extremely high durability. And all of this is offered at a very low price, far lower than what it would cost to build a similar system yourself. Even if you don’t have massive amounts of data to store, if your application needs a static storage solution, you’d still use S3 because it’s just there.
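
To illustrate just how little of that internal complexity leaks out, here is what durably storing and reading back an object looks like with boto3; the bucket name is a placeholder, and standard AWS credentials are assumed to be configured.

```python
# Bucket name is a placeholder; assumes AWS credentials are already
# configured (environment, ~/.aws/credentials, or an instance role).
import boto3

s3 = boto3.client("s3")

# Durably store an object: one call, all the distributed machinery hidden.
s3.put_object(Bucket="my-example-bucket", Key="hello.txt", Body=b"hello, world")

# Read it back: one more call.
obj = s3.get_object(Bucket="my-example-bucket", Key="hello.txt")
print(obj["Body"].read())  # b'hello, world'
```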

Thinking along these lines, you’ll realize that if you want to offer a better user experience at lower cost, a distributed architecture is almost a necessity. Let me give an example from TiDB. Traditionally, distributed databases are thought to have more components and require more hardware resources, and in private environments, running a distributed database over a small dataset may indeed be uneconomical. But in the cloud, for small users, the startup cost is no longer an issue, as infrastructure costs are amortized across many tenants; meanwhile, from the user’s perspective, the scalability benefits of distributed systems still hold. Take note: many so-called cloud-native databases today simply place a single-node database instance on cloud storage and claim scalability. That claim doesn’t hold: a single-node database cannot scale even if its storage can expand, since the compute instance becomes the bottleneck; and the cost-sharing benefits of scale are limited, because you still have to allocate compute nodes to each customer rather than truly sharing them. TiDB Serverless’s startup cost is lower than that of AWS RDS for exactly this reason: a distributed database core plus cloud-native storage allows for:

  1. Extremely low startup costs,
  2. Scalability, maintaining a consistent user experience even as the scale grows.

This is why TiDB Serverless is designed the way it is today. I’ve written about its design in previous blogs, so I won’t go into detail here.
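
To make the amortization argument concrete, here is a back-of-the-envelope calculation. Every number below is hypothetical, chosen only to show the shape of the math, not actual AWS or TiDB pricing:

```python
# Back-of-the-envelope amortization. Every number is hypothetical,
# chosen only to show the shape of the math, not real pricing.

DEDICATED_INSTANCE_MONTHLY = 50.0  # smallest dedicated instance a tenant pays for
SHARED_POOL_MONTHLY = 5_000.0      # one multi-tenant pool: compute, gateway, storage
TENANTS_PER_POOL = 5_000           # mostly idle small workloads packed together

dedicated_per_tenant = DEDICATED_INSTANCE_MONTHLY
shared_per_tenant = SHARED_POOL_MONTHLY / TENANTS_PER_POOL

print(f"dedicated: ${dedicated_per_tenant:.2f}/month per tenant")  # $50.00
print(f"shared:    ${shared_per_tenant:.2f}/month per tenant")     # $1.00

# The per-tenant cost of the shared pool keeps falling as tenants are
# added; the dedicated-instance cost is a floor that never moves.
```

The exact figures don’t matter; what matters is that the per-tenant cost of a shared pool falls with the number of mostly idle tenants, while the per-tenant cost of dedicated instances doesn’t.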

Performance vs. Predictability?

When it comes to performance, everyone loves high performance. But think carefully: there’s a premise. In a private environment, the hardware cost has already been paid upfront in one go, so higher software performance means supporting more business on the same hardware, and thus a higher ROI. In the cloud, there’s a subtle shift: resources are pay-as-you-go, so you don’t have to pay for "unrealized" business. Assuming the system design itself is scalable, benchmarking peak throughput becomes a numbers game. It’s like how no one benchmarks DynamoDB’s QPS/TPS limits (except to test hotspot behavior) or tests S3’s capacity limits.

In the cloud era (or whenever hardware resources are abundant), what do users really want? I believe it’s "predictability." As I often tell my team: stable and slow > unstable and fast. The more critical a user’s scenario (OLTP workloads, for example), the more they seek predictability, and such customers are also more willing to pay for stability and predictability.

So how is predictability achieved? It boils down to two key points:

  1. At the macro level: ample resources and redundancy.
  2. At the micro level: fine-grained resource control.

Resources are essential: most stability issues stem from a shortage of hardware resources. In the past, for example, some users complained about TiDB’s instability, but on closer inspection we found that the system’s CPU load was already at 90%, or that disk I/O had reached its limit. No software can run well in such an environment; it’s like trying to cook without ingredients. With a cloud service, by contrast, we maintain the infrastructure ourselves, keeping resource usage at a reasonable level with ample redundancy. That redundancy is still amortized across many users and can take advantage of the cloud’s elasticity to be purchased on demand, which is why cloud services are often more stable than self-managed systems. Again, a distributed design is necessary here: with a single-node database architecture, even a cloud service with plenty of spare resources can only scale an instance up so far, since the ceiling is the largest machine available.

Fine-grained resource control at the micro level is essential to ensure that the system remains predictable and stable, even under heavy workloads. How does this work? Let’s break it down.

When system resources such as CPU or memory are close to fully utilized, applications running on top of the system become almost unusable, which means we cannot rely solely on the operating system’s overload protection; we need application-level flow control as well. For example, we can intercept, in advance, SQL queries that are likely to consume excessive resources, or reserve resources for high-priority workloads to ensure their stability. Lower-priority users (such as free-tier users) and offline workloads can be allocated to a shared resource pool, using resources more efficiently, avoiding waste, and further reducing overall costs. The sketch below shows one concrete form this can take.
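
One concrete realization of this idea is TiDB’s resource control feature (part of the open-source release since TiDB 7.1), which lets you cap and reserve capacity in terms of Request Units (RUs). The sketch below is illustrative only: the group names, RU budgets, and users are hypothetical, and the DDL follows the documented resource-group syntax.

```python
# Illustrative only: group names, RU budgets, and users are hypothetical,
# and the DDL is TiDB's documented resource-control syntax (open source
# since TiDB 7.1). Assumes 'oltp_user' and 'batch_user' already exist.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
with conn.cursor() as cur:
    # Reserve a guaranteed Request Unit (RU) budget for critical OLTP traffic.
    cur.execute("CREATE RESOURCE GROUP IF NOT EXISTS rg_oltp RU_PER_SEC = 2000")
    # Give offline / low-priority work a small pool that may burst only
    # when spare capacity exists, instead of competing with OLTP traffic.
    cur.execute(
        "CREATE RESOURCE GROUP IF NOT EXISTS rg_batch RU_PER_SEC = 200 BURSTABLE"
    )
    # Bind each user (or tenant) to its group.
    cur.execute("ALTER USER 'oltp_user' RESOURCE GROUP rg_oltp")
    cur.execute("ALTER USER 'batch_user' RESOURCE GROUP rg_batch")
conn.close()
```

With groups like these in place, a runaway batch query exhausts its own small budget instead of starving the OLTP traffic beside it.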

As we can see, many traditional notions about databases are being challenged in the cloud and serverless era, and we need to keep an open mind and embrace the new trends; there may be pleasant surprises in store! In the next article, I will introduce some workload patterns and best practices I’ve discovered while using TiDB Serverless (along with some friendly complaints). Stay tuned, and thank you for reading!