Disclaimer: The details in this post have been derived from details shared online by the Datadog Engineering Team and the P99 Conference organizers. All credit for the technical details goes to the Datadog Engineering Team. Links to the original articles and sources are in the references section at the end of the post. We have attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

In the world of cloud monitoring, Datadog operates at a massive scale. The company's platform must ingest billions of data points every second from millions of hosts around the globe. This constant flow of information creates an immense engineering challenge: how do you store, manage, and query this data in a way that is not only lightning-fast but also cost-effective? For the Datadog Engineering Team, the answer was to build their own solution from the ground up.

In this article, we will look at how the Datadog engineering team built Monocle, their custom time series storage engine that powers the real-time metrics platform, and analyze the technical decisions and optimizations behind it.
High-Level Metrics Platform Architecture

Before diving into the custom database engine, it is important to understand where it fits. Monocle is just one specialized component within a much larger "Metrics Platform": the entire system responsible for collecting, processing, storing, and serving all customer metrics.

The journey of a single data point begins at the "Metrics Edge." This component acts as the front door, receiving the flood of data from millions of customer systems. From there, it is passed to a "Storage Router." As the name suggests, the router's main job is to process incoming data and decide where it needs to be stored.

This is where Datadog's first major design decision becomes clear. The Datadog Engineering Team recognized that not all data queries are the same. An engineer asking for a performance report from last year has very different needs than an automated alert checking for a failure in the last 30 seconds. To serve both, they split their storage into two massive, specialized systems.
A time series data point has two parts:

- Its identity: the metric name and set of tags (for example, "system.cpu.user" with "host:web-01" and "env:prod") that uniquely define the series.
- Its data: the timestamp and numeric value being recorded.
The Datadog Engineering Team made the critical decision to store these two parts in separate, specialized databases:

- An Index Database, which maps the human-friendly tags to the series they identify, so a query like "env:prod" can be resolved to a list of matching series.
- A real-time timeseries database (RTDB), which stores the raw timestamps and values for each series.
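As a toy illustration of this split (the names and structures below are assumptions for explanation, not Datadog's actual schema), the index resolves tags to series identifiers, and the timeseries store holds only raw points:

```python
# Illustrative sketch only; names and structures are assumptions,
# not Datadog's actual schema.

# The "index database": maps a tag to the set of series IDs it appears in.
index_db = {
    "env:prod": {101, 102},
    "host:web-01": {101},
}

# The "timeseries database" (RTDB): maps a series ID to (timestamp, value) points.
rtdb = {
    101: [(1700000000, 0.42), (1700000010, 0.45)],
    102: [(1700000000, 0.10)],
}

def query(tag):
    """Resolve a tag via the index, then fetch raw points from the RTDB."""
    series_ids = index_db.get(tag, set())
    return {sid: rtdb[sid] for sid in series_ids}
```

The key point is that each side can be optimized independently: the index for complex tag lookups, the timeseries store for raw read/write throughput.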
Using Kafka

Perhaps the most distinctive architectural decision the Datadog Engineering Team made is how their database clusters are organized. In many traditional distributed databases, the server nodes (the individual computers in the cluster) constantly talk to each other: they coordinate who is doing what, copy data between themselves (a process called replication), and figure out what to do if one of them fails.

Datadog's RTDB nodes do not do this. Instead, the entire system is designed around Apache Kafka, which acts as a central place where all new data is written first, before it even touches the database. This Kafka-centric design is the key to the cluster's stability and speed. See the diagram below that shows an RTDB cluster and the role of Kafka.

The Datadog Engineering Team uses Kafka to perform three critical functions that the database nodes would otherwise have to do themselves:

- Durability: every data point is persisted in Kafka first, so the log acts as a write-ahead record that nodes can replay.
- Replication: multiple nodes can consume the same Kafka partitions, so data is copied without node-to-node chatter.
- Failure recovery: if a node dies, a replacement simply resumes consuming from Kafka, with no complex coordination between nodes.
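A minimal sketch of the idea, with a Kafka-style log simulated as in-memory lists (the partitioning scheme and function names are assumptions, not Datadog's code):

```python
# Illustrative sketch of a Kafka-style log as the coordination point.
# The partitioning scheme and names are assumptions for illustration.

NUM_PARTITIONS = 4

# The "log": one append-only list per partition, written before any DB node sees data.
log = [[] for _ in range(NUM_PARTITIONS)]

def produce(series_hash, timestamp, value):
    """Writers append to the log; the partition is chosen by series hash."""
    partition = series_hash % NUM_PARTITIONS
    log[partition].append((series_hash, timestamp, value))

def consume(partition, offset, store):
    """A DB node replays its partition from a given offset into its local store.
    Returning the new offset is what makes crash recovery trivial: a restarted
    node just resumes consuming from where it left off."""
    for series_hash, ts, val in log[partition][offset:]:
        store.setdefault(series_hash, []).append((ts, val))
    return len(log[partition])
```

Because the log, not the nodes, is the source of truth, two nodes consuming the same partition end up with identical data, which is how replication falls out of the design for free.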
Monocle: The Custom-Built Engine in Rust

At the heart of each RTDB node is Monocle, Datadog's custom-built storage engine. See the diagram below.

This is where the team's pursuit of performance gets truly impressive. While earlier versions of the platform used RocksDB, a popular and powerful open-source storage engine, the team ultimately decided to build its own. By creating Monocle from scratch, they could tailor every decision to their specific needs, unlocking a new level of efficiency.

Monocle is written in Rust, a modern programming language famous for its safety guarantees and C-like performance. It is built on Tokio, a popular framework in the Rust ecosystem for writing high-speed, asynchronous applications that can handle many tasks at once without getting bogged down.

The Core Data Model: Hashing Tags

Monocle's key innovation is its simple data model. As mentioned, a time series is defined by its tags, like "system.cpu.user", "host:web-01", and "env:prod". This set of tags is what makes a series unique. However, these tag sets can be long and complex to search.

The Datadog Engineering Team simplified this dramatically. Instead of working with these complex strings, Monocle hashes the entire set of tags for a series, turning it into a single, unique number. The database then just stores data in a simple map from that tag hash to the series' data points.

This design is incredibly fast because finding all the data for any given time series becomes a direct, efficient lookup using that single hash. The separate Index Database is responsible for the human-friendly part: it tells the system that a query for "env:prod" corresponds to a specific list of tag hashes.

Inside Monocle

Monocle's speed comes from two main areas: its concurrency model and its storage structure. Monocle uses what is known as a "thread-per-core" or "shared-nothing" architecture: each CPU core in the server has its own dedicated worker, which operates in total isolation.
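The tag-hashing idea can be sketched in Python (Monocle itself is in Rust, and its actual hash function is not specified in the post; SHA-256 and the canonicalization below are assumptions for illustration):

```python
import hashlib

def tag_hash(tags):
    """Hash a set of tags into a single integer identifying the series.
    Sorting first makes the hash independent of tag order (an assumption
    about how such a scheme would work, not Monocle's actual code)."""
    canonical = ",".join(sorted(tags))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

# The store is then just a map: tag hash -> list of (timestamp, value) points.
store = {}

def write(tags, timestamp, value):
    store.setdefault(tag_hash(tags), []).append((timestamp, value))

def read(tags):
    """Reading a series is a single direct map lookup by hash."""
    return store.get(tag_hash(tags), [])
```

Note that the hash is one-way: the store alone cannot answer "which series match env:prod?" That question is exactly what the separate Index Database exists for.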
Each worker has its own data, its own cache, and its own memory; they share nothing. When new data comes in from Kafka, it is hashed, and the system sends it to the queue of the one worker that "owns" that hash. Since each worker is the only one that can ever access its own data, there is no need for locks, no coordination, and no waiting. This eliminates a massive performance bottleneck common in traditional databases, where different threads often have to wait for each other to access the same piece of data. See the diagram below.

Monocle's storage layout is a Log-Structured Merge-Tree (LSM-Tree), a design that is extremely efficient for write-heavy workloads like Datadog's. The main concepts associated with LSM-Trees are:

- Memtable: new writes land in an in-memory structure, which makes ingestion very fast.
- SSTables: when the memtable fills up, it is flushed to disk as an immutable, sorted file.
- Compaction: background jobs periodically merge these files, discarding obsolete data and keeping reads efficient.
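The shared-nothing ownership rule can be shown with a toy model (the structure and names below are assumptions; the real system runs one worker per CPU core on Tokio):

```python
# Toy sketch of shared-nothing sharding: each worker owns a disjoint slice
# of the hash space and keeps private state, so no locks are ever needed.

NUM_WORKERS = 8

class Worker:
    def __init__(self):
        self.data = {}   # private to this worker; no other worker touches it

    def ingest(self, series_hash, point):
        self.data.setdefault(series_hash, []).append(point)

workers = [Worker() for _ in range(NUM_WORKERS)]

def owner(series_hash):
    """Every hash is owned by exactly one worker, deterministically."""
    return workers[series_hash % NUM_WORKERS]

def ingest(series_hash, point):
    # Route the point to the single worker that owns this hash.
    owner(series_hash).ingest(series_hash, point)
```

Because ownership is a pure function of the hash, both writes and reads for a series always land on the same worker, which is what makes lock-free isolation possible.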
The Datadog Engineering Team added two critical optimizations to this design:
Staying Fast Under Pressure

Handling so many queries with the thread-per-core design creates a unique challenge. Since a query fans out to all workers, it is only as fast as its slowest worker. If one worker is busy with a background task, such as a heavy data compaction, it can stall the entire query for all the other workers. This is a classic computer science problem known as head-of-line blocking. To solve it, the team built a two-layer system to manage the query load and stay responsive.
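The "as fast as its slowest worker" effect is easy to see with a toy calculation (illustrative only; the details of the two-layer mitigation are not covered here):

```python
# Toy model of fan-out latency: a query must wait for every worker,
# so its latency is the maximum of the per-worker latencies.

def fanout_latency(worker_latencies_ms):
    return max(worker_latencies_ms)

# Seven fast workers and one stalled by a background compaction:
latencies = [2, 3, 2, 4, 3, 2, 250, 3]
```

With one worker stalled at 250 ms, the whole query takes 250 ms even though every other worker answered in under 5 ms; that single straggler is what the query-management layers exist to prevent.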
Conclusion

The Datadog Engineering Team's work on Monocle is far from over. They are already planning the next evolution of their platform, which involves two major changes.
To achieve this, the team will move to a columnar database format. In a columnar database, data is stored by column instead of by row, so a query can read only the specific tags and values it needs, which is a massive speedup for analytics. This is a complex undertaking that will likely require a complete redesign of the thread-per-core model, but it highlights Datadog's drive to push the boundaries of performance.

References:
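The row-versus-column difference can be sketched as follows (a generic illustration of columnar layout, not Monocle's planned format):

```python
# Row layout: each record stores all of its fields together,
# so scanning one field still drags every other field through memory.
rows = [
    {"ts": 1, "host": "web-01", "value": 0.4},
    {"ts": 2, "host": "web-02", "value": 0.5},
    {"ts": 3, "host": "web-01", "value": 0.6},
]

# Columnar layout: each field is stored contiguously.
# An analytics query over "value" touches only that one array.
columns = {
    "ts": [1, 2, 3],
    "host": ["web-01", "web-02", "web-01"],
    "value": [0.4, 0.5, 0.6],
}

def avg_value_columnar(cols):
    vals = cols["value"]   # reads a single column, skipping ts and host entirely
    return sum(vals) / len(vals)
```

For wide series with many tags, reading one column instead of whole rows is where the analytics speedup comes from.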





