The Pragmatic Guide to Federated AI: Building Compliant LLM/XGBoost Pipelines for Sensitive Data (7 minute read)
Practical federated learning in regulated sectors requires a layered architecture: synchronous orchestration (FedAvg), secure aggregation (keeping client updates confidential via masking or encryption), and differential privacy (guaranteeing release-time privacy). Models can be tailored for federated training, such as XGBoost (histogram sharing and per-client weighting) and TabNet (FedAvg with regularization and schema alignment). Monitoring for data/concept drift, rigorous audit trails, and differentiated release policies ensure robust governance and regulatory defensibility while maintaining model utility without ever centralizing raw data.

How to Use Apache Spark to Craft a Multi-Year Data Regression Testing and Simulations Framework (30 minute podcast)
Stripe engineered a Spark-based regression testing and simulation framework capable of processing over 400 billion rows (2–5 TB) of historical data in hours for migrating and backtesting critical billing systems. By architecting JVM service logic as reusable libraries, Stripe enables the same processing code to be invoked both in real time and at Spark scale. This massively parallel approach automates validation, what-if simulations, and preemptive defect detection, strengthening confidence in code changes and accelerating developer feedback loops.

CQRS Design Pattern (8 minute read)
CQRS separates write and read operations into independent models and databases to eliminate bottlenecks, enable independent scaling, and improve security in traditional and microservice architectures. Debezium solves the resulting replication challenge by using change data capture (CDC) to stream changes in real time from the write database to one or more read-optimized databases, including heterogeneous targets, as demonstrated in a live Quarkus-based voting application.
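The CQRS-plus-CDC flow described above can be sketched with a toy in-memory example. All names here are illustrative, not Debezium's API, and a real CDC connector tails the database's transaction log rather than an application-level list:

```python
# Toy CQRS sketch: commands mutate a write store; a CDC-style tailer
# replays the change log into a read-optimized projection.
# Names are hypothetical, for illustration only.

class WriteStore:
    def __init__(self):
        self.rows = {}
        self.change_log = []  # stand-in for the DB transaction log

    def upsert(self, key, value):  # command side
        self.rows[key] = value
        self.change_log.append(("upsert", key, value))

class ReadModel:
    def __init__(self):
        self.votes_by_option = {}  # read-optimized aggregate

    def apply(self, event):  # CDC consumer
        op, key, value = event
        if op == "upsert":
            self.votes_by_option[value] = self.votes_by_option.get(value, 0) + 1

def replicate(store, read_model, offset):
    """Stream unseen change-log events to the read model, CDC-style."""
    for event in store.change_log[offset:]:
        read_model.apply(event)
    return len(store.change_log)  # new offset for the next poll

store, model = WriteStore(), ReadModel()
store.upsert("voter-1", "cats")
store.upsert("voter-2", "dogs")
store.upsert("voter-3", "cats")
offset = replicate(store, model, 0)
print(model.votes_by_option)  # {'cats': 2, 'dogs': 1}
```

The key design point is the offset: the read model lags the write model by however many events have not yet been replayed, which is exactly the eventual consistency CQRS accepts in exchange for independent scaling.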
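The federated-averaging step at the core of the FedAvg orchestration in the first item above can be sketched in a few lines. This is a minimal illustration under the usual FedAvg assumption of example-count weighting, not the article's implementation; the secure-aggregation and differential-privacy layers it describes are omitted:

```python
# Minimal FedAvg sketch: average client model updates weighted by each
# client's example count. In a compliant pipeline, secure aggregation
# (masking/encryption) and differential privacy (clipping, noise) would
# wrap this step so the server never sees raw per-client updates.

def fed_avg(client_updates):
    """client_updates: list of (weights, n_examples) pairs,
    where weights is a list of floats (a flattened model)."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    # Weighted sum of each parameter across clients, normalized
    return [
        sum(w[i] * n for w, n in client_updates) / total
        for i in range(dim)
    ]

# Three clients with different amounts of local data
updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
    ([5.0, 6.0], 60),
]
global_weights = fed_avg(updates)
print(global_weights)  # [4.0, 5.0] -- pulled toward the data-rich client
```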
Building Your Own Schema.org (8 minute read)
Organizations can scale data integration by inverting responsibility: instead of a central team cleaning and stitching data, each application maps its own outputs to shared concepts defined in an internal schema.org-style JSON-LD context. This lets the central team simply pull in pre-aligned data to build a knowledge graph, delivering organization-wide integration with far less bottlenecked human effort.

Predicting the Map of Requirements for Long-Term Data Platform Relevance (11 minute read)
A "Cuboid" framework that maps analytics needs against consumer type, inquiry mode, and decision tier reveals current, emergent, and inevitable gaps in data consumption. Enduring data platforms intentionally design for these "vacant spots" by decoupling data products as stable, semantic interfaces between ever-evolving business needs and changing tools or architectures. This "data as interfaces" approach creates resilience against tool churn, preserving long-term relevance and sustaining ROI as enterprise requirements shift over time.

Why (Senior) Engineers Struggle to Build AI Agents (6 minute read)
Senior engineers struggle to build AI agents because their traditional deterministic mindset, centered on strict types, predictable control flow, and error-free code, clashes with probabilistic systems that thrive on ambiguity, natural language state, and non-linear behavior. Success requires embracing semantic flexibility, handing control to the agent, treating errors as inputs, replacing unit tests with evaluations, and designing forgiving APIs: a fundamental shift from enforcing correctness to engineering for resilience and trust.

Writes in DuckDB-Iceberg (3 minute read)
DuckDB v1.4.2 introduces full delete and update support for Iceberg v2 tables, expanding on its previously released read and initial write capabilities. Data engineers can now perform standard SQL UPDATE and DELETE operations with merge-on-read semantics and respect for Iceberg table properties. However, updates are currently limited to non-partitioned, non-sorted tables, with only positional deletes supported.

StarRocks Incremental MV: A Bridge Over Shifting Ice (9 minute read)
StarRocks now supports true incremental view maintenance (IVM) for append-only, Iceberg-backed materialized views, enabling fast row-level updates proportional to actual data changes rather than full or partition-level refreshes. Leveraging deterministic snapshot-based deltas, aggregate combinator functions, and an optimizer rule set, the framework brings efficient, native incremental computation directly into the database, minimizing external infrastructure and reducing compute costs. Upcoming Iceberg V3 lineage and V4 Root Manifest features will further streamline CDC and delta discovery.

Water, Water Everywhere: How Microsoft Ignite 2025 Turned Data Into Intelligence (15 minute read)
Microsoft reported Azure's 39% growth to a $93B run rate, with its data and analytics products outpacing even that (Microsoft Fabric rose 60%, Cosmos DB 50%, and SQL Database Hyperscale 75%). Major Ignite 2025 launches included HorizonDB (cloud-native PostgreSQL), open-source DocumentDB (99.4% MongoDB compatibility), and DiskANN-powered vector search integrated across databases and Fabric. Fabric unifies operational and analytical data via OneLake, supporting cross-platform queries, real-time intelligence, and a robust semantic/ontology layer for AI and business users.

Cloud fragility is costing us billions (4 minute read)
Critical cloud outages at major hyperscalers like AWS, Google Cloud, and Azure increasingly trigger cascading failures across businesses due to complex, opaque indirect dependencies. Outages not only cause direct disruptions, costing organizations hundreds of millions in downtime, lost transactions, and reputational harm, but also expose insufficient planning and disaster recovery in digital ecosystems.

What Happens When AI Runs Out of Data? (5 minute read)
Concerns about AI "running out of data" misunderstand the real challenge: static, text-based datasets cannot capture the richness of experiential, multimodal data streams. The next wave of AI development is shifting toward simulation, robotics, sensor-rich interaction, and real-world feedback, leveraging gigabyte-scale embodied signals rather than just curated text. This experiential learning approach (already exemplified in robotics and autonomous vehicles) expands data volume and diversity dynamically, removing the ceiling imposed by traditional web scraping.

Want to advertise in TLDR?
If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR?
Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud