Big Data and Beyond: My Predictions for 2024

by Ed Huang, Cofounder / CTO, PingCAP

It's been a long time since I wrote about database trends, so I took advantage of the holiday break to write a bit. If I had to describe the world of data technology over the past few years with one word, it would be: change. From the architectural shifts in database kernel technology, to the innovations of new and old vendors, to the real-world workloads of users, I feel we are in a period of significant transformation. In this article, I will discuss some of the points that have left the strongest impression on me. These are personal opinions, so they might be wrong :)

The problem of big data has been (almost) solved, and elasticity will become the new hotspot

Looking back nine years, to when we were just starting to design TiDB, the biggest pain point for databases was scalability. This is evident from the systems born around that time, such as DynamoDB, Spanner, CockroachDB, TiDB, and even the early Aurora: their main selling point was better scalability than traditional RDBMSs. TiDB's initial motivation was likewise to remove the inconveniences of MySQL's sharding solutions, which were essentially a compromise made for scalability.

In 2024, we see that these systems have largely solved the general scalability problem. For general OLTP workloads below the scale of hundreds of terabytes, I believe the systems above can handle them with confidence. (Why hundreds of terabytes? Based on our observations, 100 TB is already a very large scale for most single applications.)

Most of these databases are shared-nothing systems. Shared-nothing systems usually rest on an assumption: the nodes in the cluster are equivalent; only then can data and workload be distributed evenly across all of them. This assumption works fine for scenarios with massive data and uniform access patterns, but many businesses still exhibit distinct hot and cold characteristics; one of the most common database issues we deal with is localized hotspots. If data access skew is a natural attribute of a business, the equivalence assumption becomes unreasonable. A more rational approach is to give hot data better hardware while keeping cold data on cheaper storage. For instance, TiDB separated storage nodes (TiKV) / computing nodes (TiDB) / metadata (PD) from the very beginning, and later introduced custom Placement Rules in version 5.0, letting users decide data placement strategies themselves and thereby weakening the assumption of node equivalence.
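
As a rough illustration of what such a hot/cold split can look like, here is a sketch using TiDB's later Placement Rules in SQL form (the storage labels such as disk=nvme / disk=hdd and the table names are assumptions made up for this example):

CREATE PLACEMENT POLICY hot_data  CONSTRAINTS="[+disk=nvme]";
CREATE PLACEMENT POLICY cold_data CONSTRAINTS="[+disk=hdd]";

ALTER TABLE orders_2024 PLACEMENT POLICY=hot_data;   -- recent, frequently accessed data
ALTER TABLE orders_2019 PLACEMENT POLICY=cold_data;  -- archived, rarely touched data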

But the ultimate solution lies in the cloud. Once the basic scalability problem is solved, people start pursuing higher resource-utilization efficiency. At this stage, for OLTP businesses, a better evaluation metric might be cost per request. In the cloud, the cost gap between computing and storage is significant: for cold data with no traffic, the cost is close to zero (storage is very cheap in the cloud), while computing is expensive, and online services inevitably need computing (CPU) resources. Therefore, using computing resources and cloud elasticity efficiently will become the key.
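
As a back-of-the-envelope sketch of the metric (the daily_usage table and its columns are made up for illustration), it is simply the bill divided by the traffic that bill served, broken down per workload:

-- Hypothetical per-day usage table: (day, workload, compute_cost_usd, storage_cost_usd, requests)
SELECT workload,
       SUM(compute_cost_usd + storage_cost_usd) / SUM(requests) AS cost_per_request_usd
FROM daily_usage
GROUP BY workload
ORDER BY cost_per_request_usd DESC;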

Also, please don't misunderstand me: elasticity does not always mean cheap. On-demand resources in the cloud are usually more expensive than pre-provisioned ones, so bursting continuously is definitely not cost-effective; in that case, reserved resources are the better fit. The cost of bursting is the fee users pay for uncertainty. If you think about it carefully, this might well become a profit model for cloud databases in the future.
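
A toy break-even calculation makes the point (the prices below are invented for illustration): if a reserved vCPU costs $0.06 per hour billed around the clock, and the same vCPU on demand costs $0.09 per hour billed only while in use, reservation only wins once utilization passes roughly two-thirds:

-- Hypothetical prices: reserved vCPU $0.06/hour (billed 24x7), on-demand $0.09/hour (billed only when used)
SELECT util,
       0.06 * 720        AS reserved_monthly_usd,    -- 720 hours in a 30-day month
       0.09 * 720 * util AS on_demand_monthly_usd
FROM (SELECT 0.3 AS util UNION ALL SELECT 0.67 UNION ALL SELECT 0.9) AS u;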

More diverse data access interfaces, vector search becoming a standard feature, but SQL remains the cornerstone

If CRUD applications were, in the past, a static encapsulation of database access, then with the proliferation of GenAI, especially in the form of chatbot products, the use of data will become more flexible and dynamic. Centralized data storage and applications used to exist because of technological limitations, which made it difficult to provide personalized services for individuals. Modern SaaS actually aspires to move in this direction, but providing a personalized experience for every user has been too costly in terms of computing power and development effort. GenAI and LLMs, however, greatly reduce the cost of providing personalized services (possibly just a few prompts), which brings several changes for databases:

  1. The value of data generated by an individual (or an organization) will become increasingly high, and such data usually won't be very large (not typically Big Data).
  2. GenAI will access data directly, in more dynamic and flexible ways, because that is the most efficient path.
  3. Data access will be initiated from the edge (directly by an Agent or GenAI).

A good example is GPTs. GPTs let you create your own ChatGPT through custom prompts (or documents) and user-provided RESTful APIs. ChatGPT will call the actions you specify whenever it deems necessary, and the timing and parameters of these calls are unpredictable to the backend action provider. It is foreseeable that GPTs will soon provide mechanisms for marking personal identity information. For action providers, this means the backend database will have one index that matters most: the user ID, and the rest follows naturally.
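
A minimal sketch of what this implies for an action provider's schema (the table and column names here are hypothetical): every piece of per-user context is keyed and indexed by the user's identity, because that is the one predicate every unpredictable call will carry.

CREATE TABLE user_memory (
    user_id     VARCHAR(64) NOT NULL,   -- identity attached to each GPT action call
    doc_id      BIGINT      NOT NULL,
    content     TEXT,
    updated_at  TIMESTAMP,
    PRIMARY KEY (user_id, doc_id),      -- user_id leads every lookup
    KEY idx_user_updated (user_id, updated_at)
);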

You might challenge this by asking: isn't RAG already the standard practice here? It is, but current RAG pipelines are almost all built statically, whereas knowledge should be updated in place, in real time.

Here, I must mention vector databases. Support for vectors was a focus last year, resulting in many specialized vector databases. However, I believe vector search does not warrant a separate database; instead, it should be a feature within existing databases, just like:

INSERT INTO tbl (user_id, vec, ...) VALUES (xxx, [f32, f32, f32, ...], ...);
SELECT * FROM tbl WHERE user_id = xxx
ORDER BY vec_distance(vec, [f32, f32, f32, ...]) LIMIT 10;

Such an access pattern is likely more in line with developers' intuition.

Relational databases inherently support insertions and updates. Combined with the search capability of vector indexes, they can turn RAG into a positive feedback loop in which facts are updated in real time (using LLMs to produce secondary summaries and then writing the updated entries back to the DB). In addition, bringing in the relational database eliminates the data silo created by standalone vector databases: when you can join the rows retrieved through a vector index with the other data in the same DB, the value of this flexibility becomes apparent.
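
Continuing the pseudo-SQL above (vec_distance, the summarization step, and the orders table are placeholders, not any particular product's API), the loop looks roughly like this: write the freshly generated summary back in place, then answer later questions by combining the vector index with ordinary relational filters and joins.

-- The LLM produced a fresh summary for this user: update the fact in place
UPDATE tbl SET content = 'new summary ...', vec = [f32, f32, f32, ...]
WHERE user_id = xxx AND doc_id = 42;

-- Retrieval: nearest neighbors for this user, joined with other relational data
SELECT t.content, o.order_status
FROM tbl t JOIN orders o ON o.user_id = t.user_id
WHERE t.user_id = xxx
ORDER BY vec_distance(t.vec, [f32, f32, f32, ...])
LIMIT 5;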

Another benefit is that the Serverless product form returns data ownership to users themselves. Think about it: in the past, our data was hidden behind the services of various internet companies, and we had no direct access to it. In the application scenarios of GenAI, however, data interaction becomes a triangular relationship: User - Data (RAG) - GenAI. Interestingly, this aligns with one of the ideals of Web3. It was hard to achieve in the Web2 era, yet the proliferation of GenAI may inadvertently fulfill Web3's aspiration of returning data ownership to users, which is quite interesting :)

Indeed, I believe that advanced RAG will become a very important new application scenario for databases, and in such scenarios, Serverless cloud database services will become the standard choice.

Database kernels are becoming cloud-native, and database software is evolving into database platforms, with a greater focus on usability and developer experience

As I've emphasized on various occasions, in the future databases will not just be software. I've already discussed the kernel-technology side in detail on my blog [https://me.0xffff.me/dbaas1.html], so I won't elaborate on it here. Another interesting topic is how Change Data Capture (CDC) will become standard for databases in the cloud era; that deserves a full article of its own, maybe in the future, so I'll simply skip it for now.

In this post, I want to talk about the observability of cloud databases. For cloud database services, the demands on observability are higher because, for developers, the service provider's dashboard is almost the only diagnostic tool. A lot has been written about observability in general, so due to space constraints I won't repeat the parts that stay the same in the cloud. What's different in the cloud is that every workload ends up on the customer's bill.

For users, a new question arises: why does my bill look like this, and what can I do to make it cheaper? The better the bill's explainability, the better the user experience. However, if the granularity of billing measurement is too fine, it can also hurt the product's performance and raise implementation costs, so a balance needs to be struck here. What is certain is that cost analysis can be a new perspective when thinking about the direction of observability products. This trend is evident from AWS's new Cost and Usage Dashboard and from Amazon CTO Dr. Werner Vogels's keynote at re:Invent on the art of cost-aware architecture.
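
As a sketch of what cost-oriented observability could look like (the statement_usage table, the request-unit metering, and the unit price are all hypothetical), the idea is simply to attribute the bill back to the statements that generated it, so that "why does my bill look like this" has a concrete answer:

-- Which statements drove this month's bill? (hypothetical metering schema and unit price)
SELECT statement_digest,
       SUM(request_units)              AS total_ru,
       SUM(request_units) * 0.00000025 AS est_cost_usd
FROM statement_usage
WHERE billing_month = '2024-01'
GROUP BY statement_digest
ORDER BY est_cost_usd DESC
LIMIT 20;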

Platformization of databases is not just about a beautiful web management dashboard and a stack of fancy features. I really like PlanetScale CEO Sam Lambert's description of developer experience on his personal blog [https://samlambert.com/dx].

A good tool is good because it embodies the clever ideas and taste of its designer, who must also be a heavy user; only then can they feel the subtle joys and pains without being so immersed as to become blind to them. This is a very high bar for product managers responsible for developer experience. Database management tools, which are not used frequently but are taken very seriously each time they are used, need to adhere to a few design principles in the cloud era:

  1. API First: Database platforms should provide stable/forward-compatible APIs. Everything that can be done on the management platform should be possible via API, and ideally, your management platform should be built on your API. This is also key to providing a fully functional, easy-to-use CLI tool.
  2. Use a unified authentication system: In the design phase, integrate the management authentication and user system with the database's internal authentication system. The traditional database's username and password-based permission system is not enough for the cloud era. This lays the foundation for subsequent integration with the cloud's IAM and Secret management systems.
  3. Build different, stable small tools for different functionalities (do one thing and do it well), but invoke them through a unified CLI entry point and a consistent semantic system. Good examples are rustup, and even git.

There are more tips about developer tools in an article on developer experience that I wrote a few years ago; it's still relevant, so feel free to check it out if you're interested.

That's all for now. To sum up: in 2024, data and database technology are still in a period of significant transformation, and no one can predict the future, because we are in an era of immense uncertainty. The good news is that innovation keeps emerging. What I predict today might be completely overturned, even by me, in a few months, which is quite normal. But if it can inspire you now, that's enough. Also, I wish everyone a Happy New Year!

Don't forget to use TiDB for developing your applications in the new year, thank you :)