TLDR Data 2025-11-06

Accelerate data and AI workflows with Databricks (Sponsor)

Disconnected tools slow down development. Databricks Data Intelligence Platform unifies data engineering, analytics, and ML on a single lakehouse built on Amazon S3.

Explore how you can use Databricks for vector indexing, embedding generation, and RAG deployment—all within your AWS environment. Technical resources walk you through integrations with Amazon Redshift, Amazon Bedrock, and SageMaker to simplify pipelines, optimize security, and move AI apps into production faster.

Start building today with a 14-day free trial in AWS Marketplace.

📱

Deep Dives

A Decade of AI Platform at Pinterest (15 minute read)

Pinterest evolved its AI infrastructure from fragmented, team-specific ML stacks to a unified platform supporting hundreds of millions of inferences per second, enabled by tiered abstractions for feature definition, storage, representation, and training frameworks. Key advances included transitioning to GPU-based inference (achieving 100x larger model capacity without increasing cost or latency), scaling architecture with Ray and Model Farm, and standardizing on PyTorch. Success stemmed from balancing local innovation against well-timed unification, with a focus on efficiency, iteration velocity, and enablement.

Glacierbase: Managing Iceberg Schema Migrations at Scale (7 minute read)

WHOOP's data platform manages large Iceberg tables, where schema inconsistencies can cause inefficiencies such as excessive compute or wasted reads. Glacierbase, inspired by Liquibase but tailored for open formats, standardizes migrations for high-value "silver" and "gold" layer tables (e.g., feature tables and ML datasets), excluding raw ingestion or stable CDC tables.

Replication Redefined: How We Built a Low-latency, Multi-tenant Data Replication Platform (7 minute read)

Datadog encountered scaling challenges with a shared PostgreSQL database that handled both OLTP traffic and complex search queries. To address this, it implemented asynchronous Change Data Capture: Debezium detects changes in PostgreSQL, Kafka streams them, and Kafka Connect sinks deliver them to a dedicated search platform. Data is denormalized during replication to optimize faceted searches and aggregations.
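To make the denormalize-during-replication idea concrete, here is a minimal sketch, not Datadog's implementation. It assumes change events shaped like Debezium's envelope (an `op` code plus `before`/`after` row images); the `id`, `org_id`, and `org_name` fields and the in-memory `index` dict standing in for the search platform are hypothetical.

```python
def apply_change(index, event, org_names):
    """Apply one change event to a denormalized search index (a plain dict here).

    `event` follows the Debezium envelope shape: "op" is "c" (create),
    "u" (update), "r" (snapshot read), or "d" (delete).
    `org_names` is a hypothetical lookup used to denormalize each row.
    """
    if event["op"] == "d":
        # Deletes carry the old row image in "before".
        index.pop(event["before"]["id"], None)
        return
    row = event["after"]
    # Denormalize: embed the organization name so searches need no join.
    index[row["id"]] = {**row, "org_name": org_names.get(row["org_id"])}


index = {}
orgs = {7: "Acme"}
apply_change(index, {"op": "c", "before": None,
                     "after": {"id": 1, "org_id": 7, "title": "Dashboard"}}, orgs)
apply_change(index, {"op": "u", "before": {"id": 1, "org_id": 7, "title": "Dashboard"},
                     "after": {"id": 1, "org_id": 7, "title": "Dashboard v2"}}, orgs)
```

Embedding the joined field at write time trades some replication work for simpler, faster faceted queries on the search side.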

🚀

Opinions & Advice

We Moved Analytics into an IDE — and Haven't Looked Back (8 minute read)

Cursor turned Faire's analytics work into an AI-native workflow by combining SQL generation, codebase search, and context from multiple systems in a single IDE, which cut analysis time from days to hours. Adoption required onboarding, QA habits, and visible wins, but usage grew rapidly once analysts saw that they could query data, inspect ETL logic, and generate runnable SQL in one place. Key insight for data professionals: the future of analytics is not just AI-assisted but AI-native, with analysts interrogating code and data directly and spending more time on decisions than on searching for context.

Event Streaming is Topping Out (14 minute read)

The event streaming market (led by Kafka/Confluent) is saturated: growth is slowing, prices are collapsing, and too many vendors are chasing too little demand. The author predicts heavy consolidation and pivots. Apache Kafka will remain, but many streaming companies and the business models built around it will not survive.

Cluster Fatigue. Polars and PyArrow to Postgres and Apache Iceberg (streaming mode) (5 minute read)

"Cluster fatigue" refers to the exhaustion and costs associated with managing distributed compute clusters for data processing. To combat this, the strategy involves migrating from resource-intensive distributed workflows to leaner, local or small-scale compute options leveraging tools like Polars and PyArrow for managing massive datasets, facilitating streaming ingestion into Postgres and writes to Apache Iceberg tables in streaming mode.

4 Senior Data Engineers Answer 10 Top Reddit Questions (27 minute read)

Four veteran data engineers advise choosing between data warehouses, lakes, and lakehouses according to practical constraints such as budget, deadlines, and team expertise. They recommend challenging initial demands for "real-time" processing to avoid undue complexity, and starting data quality efforts with straightforward tests that are refined as failures occur, while prioritizing business understanding and using techniques such as write-audit-publish (WAP).
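For readers new to write-audit-publish, the pattern can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3`, not code from the article; the `orders` and `orders_staging` tables and the three checks are hypothetical.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage a batch, audit it, and publish it only if every check passes.

    Returns the names of failed checks (an empty list means it was published).
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")

    # Write: land the batch in a staging table that consumers never read.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: each query counts violating rows; zero means the check passed.
    checks = {
        "null_id": "SELECT COUNT(*) FROM orders_staging WHERE id IS NULL",
        "negative_amount": "SELECT COUNT(*) FROM orders_staging WHERE amount < 0",
        "duplicate_id": "SELECT COUNT(id) - COUNT(DISTINCT id) FROM orders_staging",
    }
    failed = [name for name, sql in checks.items()
              if cur.execute(sql).fetchone()[0] > 0]

    if failed:
        conn.rollback()  # discard the staged rows; published data is untouched
        return failed

    # Publish: copy the audited batch into the table consumers query.
    cur.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")
    conn.commit()
    return failed
```

Starting with a handful of simple checks like these, and adding a new one each time a bad batch slips through, matches the failure-driven iteration described above.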

💻

Launches & Tools

QuackStore (GitHub Repo)

The QuackStore extension automatically stores frequently accessed file portions in a local, block-based cache. This dramatically reduces load times for repeated queries on the same data.
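The general idea behind a block-based cache can be sketched as follows. This is an illustration of the technique only, not QuackStore's code or API; the `BlockCache` class, the `fetch_block` callback, and the tiny block size are all hypothetical.

```python
class BlockCache:
    """Serve byte-range reads from fixed-size blocks, fetching each block once."""

    def __init__(self, fetch_block, block_size=4):
        self.fetch_block = fetch_block  # callable(block_index, block_size) -> bytes
        self.block_size = block_size
        self.blocks = {}                # block index -> cached bytes
        self.misses = 0                 # number of fetches that hit the slow source

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        chunks = []
        for idx in range(first, last + 1):
            if idx not in self.blocks:
                self.misses += 1
                self.blocks[idx] = self.fetch_block(idx, self.block_size)
            chunks.append(self.blocks[idx])
        # Trim the joined blocks down to the exact range requested.
        start = offset - first * self.block_size
        return b"".join(chunks)[start:start + length]


DATA = b"abcdefghijklmnop"  # stands in for a remote file
cache = BlockCache(lambda idx, size: DATA[idx * size:(idx + 1) * size])
first_read = cache.read(2, 5)   # fetches blocks 0 and 1
second_read = cache.read(3, 4)  # served entirely from cached blocks
```

Because only the touched blocks are stored, repeated queries over the same portion of a large file avoid refetching it without the whole file ever being copied locally.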

Apache Arrow's Final Frontier: Replacing Outdated Database Drivers (3 minute read)

Columnar, founded by core Apache Arrow contributors, has launched with $4 million in seed funding to address performance bottlenecks in analytical data transfers caused by legacy row-based protocols like ODBC and JDBC. Its new ADBC drivers, built on the Arrow columnar format, cut query times by more than 90% in some cases and eliminate costly serialization overhead. Supported targets include Redshift, MySQL, SQL Server, and Trino.

The Data Engineering Agent is Now in Preview (5 minute read)

Google has launched the Data Engineering Agent in BigQuery, a Gemini-powered tool that automates complex data engineering tasks, including pipeline creation, transformation, modeling, troubleshooting, and migration, using natural language prompts and best-practice automation.

🎁

Miscellaneous

Pragmatic Orthodoxy - Data Signals #1 - 03.11.25 (6 minute read)

Some pragmatic shifts are underway in data architecture: teams favor plain data replication over sophisticated history techniques, and virtualized, view-based models that materialize only proven hot paths (resources are cheap; your time is not). Semantic clarity and domain-driven modeling enable projecting various schema forms from a single semantic foundation, accelerating adaptation for new use cases. These approaches value measured investment in complexity, optimizing for human and operational efficiency rather than hypothetical future needs or technology limitations.

Catalog of Patterns of Distributed Systems (Website)

Patterns of Distributed Systems catalogs over 25 foundational architectural patterns addressing consistency, synchronization, partitioning, replication, and coordination challenges in distributed environments. Key techniques explored include clock-based ordering, leader election, majority quorum, idempotency, log replication, partition management, and hybrid clocks. These proven methods enable reliable data storage, efficient messaging, and robust state management in distributed data platforms.
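As a taste of one of these patterns, majority quorum rests on simple arithmetic: any two majorities of the same cluster share at least one node, so a write acknowledged by a majority is visible to any later majority. The sketch below is a minimal illustration; the function names and node labels are hypothetical, not taken from the catalog.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_committed(acks, cluster_size):
    """A write counts as committed once a majority of distinct nodes acknowledge it."""
    return len(set(acks)) >= majority(cluster_size)


# In a 5-node cluster, 3 distinct acknowledgements are enough.
committed = is_committed({"n1", "n2", "n3"}, cluster_size=5)
# Duplicate acknowledgements from the same node do not count twice.
not_committed = is_committed(["n1", "n1"], cluster_size=3)
```

Note that an even-sized cluster gains no extra fault tolerance: 4 nodes need 3 acknowledgements, the same as 5 nodes, while tolerating one fewer failure.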

⚡

Quick Links

Boring Ducklake Semantic Fishing Demo (GitHub Repo)

A multi-stage pipeline ingests and transforms the Google Analytics dataset into a structured fact table and semantic layer, enabling fast querying and analysis in DuckDB/MotherDuck.

ClickHouse Welcomes LibreChat: Introducing the Open-Source Agentic Data Stack (7 minute read)

ClickHouse has acquired LibreChat, integrating its open-source, multi-LLM chat platform as a core component of a unified Agentic Data Stack for agent-facing analytics.

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud
