Git Data Workflows šŸ“‚, Continuous Batching Explained šŸ”, Spec-Driven SQL Automation ⚙️


TLDR Data 2025-11-27

šŸ“±

Deep Dives

Continuous batching from first principles (12 minute read)

Continuous Batching enables LLM inference engines to dynamically mix and match live requests at the token-generation level, letting finished sequences exit and new ones join, keeping GPUs fully utilized. This yields dramatic gains: much higher throughput, lower latency, and better memory efficiency, especially under high-load or variable-length workloads. In real-world deployments, it turns many GPUs into high-throughput, low-cost LLM servers rather than idle resources.
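The scheduling idea can be sketched in a few lines. This toy loop (not any real engine's scheduler) generates one token per active sequence per step, evicting finished sequences and admitting queued ones immediately:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy token-level scheduler: each step generates one token for every
    active sequence; finished sequences free their slot immediately and
    queued requests join mid-flight, so the batch stays full."""
    queue = deque(requests)   # (request_id, tokens_to_generate)
    active = {}               # request_id -> tokens still to generate
    steps = 0
    while queue or active:
        # Admit new work as soon as slots free up -- the key difference
        # from static batching, which waits for the whole batch to drain.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
        steps += 1
    return steps

# One long request mixed with short ones: 6 decode steps, where a static
# batch of 4 would hold b/c/d's slots idle until "a" finished (8 steps).
print(continuous_batching([("a", 6), ("b", 2), ("c", 2), ("d", 2),
                           ("e", 2), ("f", 2), ("g", 2)]))
```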
Graphs, Algorithms, and My First Impression of DataFusion (19 minute read)

Implementing the two-phase "big-star/small-star" connected components algorithm in Apache DataFusion delivers 4-5x faster performance and half the memory footprint compared to Spark GraphFrames on in-memory graph workloads. DataFusion's Rust-based architecture provides superior UDF performance, robust SQL/DataFrame tools, and efficient I/O, but lacks distributed execution and disk persistence for iterative algorithms. These constraints limit scalability for large-scale or out-of-core graph processing. DataFusion excels for high-performance, in-memory, and developer-friendly graph analytics on medium-sized data sets.
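For readers unfamiliar with the algorithm, here is a single-machine Python sketch of the alternating big-star/small-star rounds (the article implements them as DataFusion SQL/DataFrame passes, not like this); at the fixpoint every node is linked to its component's minimum ID:

```python
from collections import defaultdict

def _adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def big_star(edges):
    """Connect each node's strictly larger neighbors to its minimum neighbor."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        m = min(nbrs | {u})
        out.update((v, m) for v in nbrs if v > u)
    return out

def small_star(edges):
    """Connect each node and its smaller neighbors to their minimum."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        small = {v for v in nbrs if v < u} | {u}
        m = min(small)
        out.update((v, m) for v in small if v != m)
    return out

def connected_components(edges):
    edges = {(u, v) for u, v in edges if u != v}
    while True:
        new = small_star(big_star(edges))
        if new == edges:          # fixpoint: the graph is a forest of stars
            break
        edges = new
    labels = {}
    for u, v in edges:            # each edge now points node -> component min
        labels[u] = v
        labels.setdefault(v, v)
    return labels

print(connected_components({(1, 2), (2, 3), (4, 5)}))
```

The distributed appeal is that both star operations are per-node map/aggregate steps, which is exactly what a columnar engine like DataFusion executes well.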
How Zalando Delivers Real-Time Insights to Its Partner Brands (16 minute read)

Zalando replaced slow, manual data exports with Delta Sharing, giving thousands of partner brands secure, real-time, zero-copy access to governed Delta Tables, so no ETL or data duplication is required. This eliminated ~1.5 FTEs of monthly manual work per partner, slashed onboarding time to minutes, and enabled everyone from small retailers using Excel to large brands using Spark to instantly query fresh, TB-scale data with full history and strict governance.
Reducing Experiment Duration with Predicted Control Variates (8 minute read)

Etsy reduced A/B test duration by up to 50% by introducing Predicted Control Variates (PCV), a statistical technique that uses pre-experiment user features to predict each user's counterfactual control-metric behavior and then subtract that prediction from the observed treatment metric as a powerful noise-reduction covariate.
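A minimal sketch of the underlying variance-reduction step, assuming a CUPED-style linear adjustment; the simulated data and the stand-in covariate are illustrative, not Etsy's:

```python
import random

random.seed(7)

def control_variate_adjust(y, x):
    """Subtract from metric y the component explained by covariate x
    (in PCV, x is the model-predicted control-metric behavior).
    theta is the OLS slope cov(x, y) / var(x); the adjusted metric keeps
    the same mean but has lower variance, so tests need fewer samples."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    theta = cov / var_x
    return [b - theta * (a - mx) for a, b in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

# Simulated users whose observed metric is largely predictable from a
# pre-experiment feature (illustrative numbers only).
x = [random.gauss(0, 1) for _ in range(5000)]        # predicted behavior
y = [2 * a + random.gauss(0, 0.5) for a in x]        # observed metric
y_adj = control_variate_adjust(y, x)
print(variance(y), variance(y_adj))   # variance drops sharply, mean unchanged
```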
šŸš€

Opinions & Advice

Stop Hacking SQL: How to Build a Scalable Query Automation System (9 minute read)

A robust SQL automation system needs to treat queries as code, not UI clicks or ad-hoc scripts. Data engineers should move to spec-driven jobs in Git with templates, validation, dry-runs, cost limits, and CI-based deployment, which eliminate silent failures, copy-paste drift, and runaway spend. Adding structured logs and metrics makes jobs observable, predictable, and far easier to maintain at scale.
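A sketch of what "queries as code" can look like in practice, with hypothetical spec fields and a stubbed dry-run estimator standing in for a real warehouse API:

```python
# Hypothetical job spec of the kind the article argues for: declarative,
# versioned in Git, validated before anything runs. Field names are
# illustrative, not taken from the article.
SPEC = {
    "name": "daily_orders_rollup",
    "query": "SELECT order_date, SUM(amount) FROM orders GROUP BY order_date",
    "schedule": "0 6 * * *",
    "max_scanned_gb": 50,       # cost guardrail enforced before execution
}

REQUIRED = {"name", "query", "schedule", "max_scanned_gb"}

def validate(spec):
    """Fail fast in CI instead of silently at runtime."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"spec missing fields: {sorted(missing)}")
    if "select *" in spec["query"].lower():
        raise ValueError("SELECT * is banned: declare columns explicitly")
    return True

def deploy(spec, estimate_scanned_gb):
    """estimate_scanned_gb stands in for a warehouse dry-run API that
    prices the query without executing it."""
    validate(spec)
    estimated = estimate_scanned_gb(spec["query"])
    if estimated > spec["max_scanned_gb"]:
        raise RuntimeError(f"dry run estimates {estimated} GB, over limit")
    return {"job": spec["name"], "estimated_gb": estimated, "status": "deployed"}

print(deploy(SPEC, estimate_scanned_gb=lambda q: 12))
```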
Branch, Test, Deploy: A Git-Inspired Approach for Data (12 minute read)

Git-like workflows for data, enabled by tools like LakeFS, Nessie, and Tigris, introduce instant branching, zero-copy cloning, robust rollback, and atomic snapshots across massive data sets without physical duplication. By leveraging metadata-driven approaches, data engineers can test, revert, and parallelize changes on production-scale data with minimal overhead, eliminating slow manual copies. This paradigm dramatically accelerates CI/CD cycles and simplifies complex rollback scenarios.
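A toy model of the metadata-driven idea, assuming a content-addressed object store where a branch is just a map of paths to object IDs (names are illustrative, not LakeFS or Nessie APIs):

```python
class MetadataCatalog:
    """Toy model of metadata-driven branching (in the spirit of LakeFS or
    Nessie, not their actual APIs): a branch is a mapping from logical
    paths to immutable object IDs, so branching copies pointers, not data."""
    def __init__(self):
        self.objects = {}                 # id -> bytes (immutable store)
        self.branches = {"main": {}}      # branch -> {path: object_id}

    def write(self, branch, path, data):
        oid = f"obj{len(self.objects)}"
        self.objects[oid] = data          # new object; old versions remain
        self.branches[branch] = {**self.branches[branch], path: oid}

    def create_branch(self, src, dst):
        self.branches[dst] = dict(self.branches[src])   # zero data copied

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]

cat = MetadataCatalog()
cat.write("main", "events.parquet", b"v1")
cat.create_branch("main", "experiment")   # instant, regardless of data size
cat.write("experiment", "events.parquet", b"v2")
print(cat.read("main", "events.parquet"))        # b'v1' -- main untouched
print(cat.read("experiment", "events.parquet"))  # b'v2'
```

Rollback falls out of the same design: since old objects are never overwritten, reverting a branch is just restoring an earlier path-to-ID mapping.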
Why Most Enterprise Ontologies & Knowledge Graphs Fail (6 minute read)

Enterprise knowledge graphs and ontologies face an unavoidable challenge: consensus, stability, completeness, and transparency are all unattainable at scale. Effective solutions require dynamic reconciliation algorithms, adaptive models for ongoing change, mechanisms for continuous learning and expansion, and trust management even without full data visibility. Building enterprise knowledge graphs requires embracing these realities to ensure alignment with the complex, distributed nature of real-world organizations.
On Idempotency Keys (7 minute read)

Idempotency keys enable exactly-once processing in distributed systems by letting consumers detect and skip duplicate messages from at-least-once delivery, using atomic persistence to make retries safe. Common approaches include random or time-based UUIDs for simplicity, and monotonic sequences or database WAL positions (via CDC) for storage efficiency and natural ordering. The best choice depends on scale, concurrency needs, and operational complexity.
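A minimal consumer sketch using SQLite to illustrate the atomic-persistence point: the idempotency key and the side effect commit in one transaction, so a redelivery either sees the key (and skips) or redoes everything:

```python
import sqlite3

def make_consumer(db_path=":memory:"):
    """At-least-once consumer made effectively exactly-once: the key
    insert and the business side effect share one transaction."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS processed (key TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS balances "
               "(account TEXT PRIMARY KEY, amount INT)")

    def handle(message):
        try:
            with db:  # atomic: commits both rows or rolls both back
                db.execute("INSERT INTO processed VALUES (?)",
                           (message["key"],))
                db.execute(
                    "INSERT INTO balances VALUES (?, ?) "
                    "ON CONFLICT(account) DO UPDATE "
                    "SET amount = amount + excluded.amount",
                    (message["account"], message["amount"]),
                )
            return "applied"
        except sqlite3.IntegrityError:
            return "duplicate"   # key already seen: safe to ack and skip

    return handle, db

handle, db = make_consumer()
msg = {"key": "uuid-123", "account": "a1", "amount": 10}
print(handle(msg))   # applied
print(handle(msg))   # duplicate -- the redelivery is a no-op
print(db.execute("SELECT amount FROM balances").fetchone()[0])  # 10
```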
šŸ’»

Launches & Tools

Apache Hudi 1.1 is Here—Building the Foundation for the Next Generation of Lakehouse (15 minute read)

Apache Hudi 1.1 introduces a pluggable table format framework with Iceberg/Delta adapters, partition-aware indexing, dynamic bucket sizing, storage-based locking, and major engine-specific boosts. It delivers up to 15x faster clustering, 4x faster lookups, and 2-3x higher Flink throughput through zero-copy processing and optimized metadata handling.
Super fast aggregations in PostgreSQL 19 (4 minute read)

PostgreSQL 19 introduces a major query-optimizer enhancement that lets the engine aggregate data before joining reference tables, significantly accelerating aggregation queries, especially for large fact tables joined to small lookup tables. Early benchmarks show up to a 5x speedup, with no code or configuration changes required. This optimization boosts performance for common GROUP BY operations.
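The rewrite can be illustrated outside SQL. This Python sketch (toy data) shows why aggregating the fact table before the join is safe when the join key is unique in the lookup table, while shrinking the intermediate result to one row per key:

```python
from collections import defaultdict

# Toy fact and dimension tables (hypothetical data).
fact = [("p1", 10), ("p1", 20), ("p2", 5), ("p2", 5), ("p1", 1)]  # (product_id, amount)
dim = {"p1": "Widget", "p2": "Gadget"}                            # product_id -> name

def join_then_aggregate():
    """The naive plan: expand every fact row through the join, then sum."""
    totals = defaultdict(int)
    for pid, amount in fact:
        totals[dim[pid]] += amount
    return dict(totals)

def aggregate_then_join():
    """The optimizer's rewrite: sum per key first (one row per key),
    then join the tiny partial result to the lookup table."""
    partial = defaultdict(int)
    for pid, amount in fact:
        partial[pid] += amount
    return {dim[pid]: total for pid, total in partial.items()}

assert join_then_aggregate() == aggregate_then_join()
print(aggregate_then_join())  # {'Widget': 31, 'Gadget': 10}
```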
šŸŽ

Miscellaneous

Evolution and Scale of Uber's Delivery Search Platform (9 minute read)

Uber's semantic search platform leverages a Qwen-based two-tower DNN with Matryoshka Representation Learning (MRL) for flexible embedding sizes, scalar quantization (int7), and shard-level K tuning, combined with locale-aware lexical fields, pre-filters, ANN search in Lucene, and lightweight neural re-ranking. These innovations deliver >50% latency reduction, ~50% storage savings, 34% lower latency from K-tuning, and sustained >0.95 recall, while biweekly automated index refreshes keep results fresh.
Building agentic RAG with PostgreSQL and n8n (4 minute read)

Agentic RAG transforms traditional linear retrieval-augmented generation pipelines by implementing a reasoning loop, enabling AI agents to dynamically choose between SQL queries and vector searches in PostgreSQL via n8n orchestration. You can consolidate chat memory, vector storage, and business data within a single Postgres instance to eliminate infrastructure complexity typical of multi-database setups.
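A stripped-down sketch of the routing step, with a keyword heuristic standing in for the LLM's tool choice and stub tools in place of Postgres and the vector store:

```python
def route(question, tools):
    """Toy version of the agentic step the article describes: instead of a
    fixed retrieve-then-generate pipeline, the agent picks a tool per
    question. In a real n8n setup the LLM makes this call; here a keyword
    heuristic stands in for the model's reasoning."""
    wants_numbers = any(
        w in question.lower() for w in ("how many", "total", "average", "count")
    )
    tool = "sql" if wants_numbers else "vector_search"
    return tool, tools[tool](question)

tools = {
    "sql": lambda q: "SELECT count(*) FROM orders",          # structured lookup
    "vector_search": lambda q: "top-3 similar doc chunks",   # semantic lookup
}
print(route("How many orders shipped last week?", tools))
print(route("What does our refund policy say?", tools))
```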

Quick Links

Stripe's Zero-Downtime Data Movement Platform Migrates Petabytes with Millisecond Traffic Switches (3 minute read)

Stripe's Zero-Downtime Data Movement Platform enables petabyte-scale database migrations with traffic cutovers as fast as milliseconds.
Cloudflare Global Outage Traced to Internal Database Change (3 minute read)

A minor ClickHouse permission tweak cascaded into a global CDN failure.

Want to advertise in TLDR? šŸ“°

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? šŸ’¼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.
