Netflix’s LLM Personalization Model 🎬, LinkedIn’s Rust-Based DB 🐟, Postgres Health Check 🏥


TLDR Data 2025-11-20

📱

Deep Dives

FishDB: a Generic Retrieval Engine for Scaling LinkedIn's Feed (12 minute read)

LinkedIn's FishDB is a Rust-based, generic retrieval engine optimized for recommender systems like feeds. It employs a scatter-gather architecture with a broker distributing queries across 48 sharded partitions (16 replicas each) powered by a lambda-architecture ingestion pipeline, in-memory inverted/forward/reference indexes optimized for low indirection and copy-on-write updates, RocksDB-backed attribute stores, and a Volcano-style query engine with tree-walk interpretation for complex expressions.
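
For intuition, here is a minimal Python sketch of the scatter-gather pattern only: a broker fans the query out to shards in parallel and merges their per-shard top-k results. The Shard class and scoring function are illustrative stand-ins, not FishDB's Rust internals.

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    def score(query, feats):
        # Placeholder relevance function; FishDB evaluates full query expression trees here.
        return sum(feats.get(term, 0.0) for term in query)

    class Shard:
        """One partition holding an in-memory view of its documents (hypothetical interface)."""
        def __init__(self, docs):
            self.docs = docs  # {doc_id: {term: weight}}, standing in for the real indexes

        def top_k(self, query, k):
            scored = ((score(query, feats), doc_id) for doc_id, feats in self.docs.items())
            return heapq.nlargest(k, scored)

    def broker_search(shards, query, k=10):
        # Scatter: query every partition in parallel.
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            partials = pool.map(lambda shard: shard.top_k(query, k), shards)
        # Gather: merge the per-shard results into a single global top-k.
        return heapq.nlargest(k, (hit for part in partials for hit in part))
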
Integrating Netflix's Foundation Model into Personalization applications (7 minute read)

Netflix has centralized its personalization efforts with a large Foundation Model, streamlining user preference learning and supporting three production integration patterns: batch-refreshed embeddings via an Embedding Store, subgraph integration for real-time inference, and customized fine-tuned model deployments. The embeddings approach offers scalable, low-latency access but can suffer from staleness, while subgraph integration unlocks deeper personalization at higher complexity and compute cost. The modular framework enables data teams to tailor recommendations to diverse application constraints.
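
As a rough sketch of the first pattern (batch-refreshed embeddings), the Python below assumes an offline job periodically writes foundation-model embeddings into a key-value store, while the serving path does cheap dot-product scoring and falls back when an embedding is missing or stale. Store and function names are assumptions, not Netflix's actual APIs.

    import numpy as np

    class EmbeddingStore:
        """Hypothetical store refreshed by an offline batch job that runs the foundation model."""
        def __init__(self):
            self.user_emb = {}   # user_id -> np.ndarray
            self.item_emb = {}   # item_id -> np.ndarray

    def rank_items(store, user_id, candidate_ids, fallback):
        u = store.user_emb.get(user_id)
        if u is None:
            # Missing (or too stale) embedding: fall back, e.g. to a popularity ranking.
            return fallback(candidate_ids)
        items = np.stack([store.item_emb[i] for i in candidate_ids])
        scores = items @ u                       # low-latency dot-product scoring at request time
        return [candidate_ids[i] for i in np.argsort(-scores)]
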
How Dash uses context engineering for smarter AI (5 minute read)

Dropbox improved Dash's agentic performance by consolidating many retrieval tools into a single unified "Dash Search" tool, filtering results at runtime using a knowledge graph to deliver only highly relevant context, and delegating complex subtasks like query construction to a specialized search agent. These three context-engineering strategies reduce noise and tool sprawl, prevent context overload, and balance token usage, cost, latency, and reliability.
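
A hedged Python sketch of those three strategies working together; every interface here (connectors, knowledge graph, query sub-agent) is hypothetical, not Dropbox's implementation.

    def dash_search(user_id, user_query, connectors, knowledge_graph, query_agent, budget=20):
        # (3) Delegate query construction to a specialized search sub-agent.
        structured = query_agent.build_query(user_query)
        # (1) One unified tool: fan out to every connector instead of exposing a tool per source.
        hits = [hit for connector in connectors for hit in connector.search(structured)]
        # (2) Knowledge-graph filtering: keep only results connected to this user's context,
        #     so the model's context window isn't flooded with marginally relevant documents.
        relevant = [h for h in hits if knowledge_graph.is_relevant(user_id, h)]
        return sorted(relevant, key=lambda h: h.score, reverse=True)[:budget]
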
🚀

Opinions & Advice

Why Strong Consistency? (6 minute read)

Eventual consistency, while useful for rare low-latency trade-offs, complicates high-availability services by demanding sophisticated routing, error-handling, and testing. Aurora DSQL delivers strong consistency across all replicas by combining monotonic journal updates with timestamp-based queries, where replicas simply wait for all prior writes to be applied, so developers can write straightforward code rather than consistency hacks.
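
A minimal Python sketch of that read path, under simplifying assumptions (single-version key-value state, no MVCC): the replica applies journal entries in commit order, and a read at timestamp T simply blocks until the applied high-water mark reaches T.

    import threading

    class Replica:
        def __init__(self):
            self.state = {}
            self.applied_ts = 0                    # monotonic high-water mark of applied writes
            self._cv = threading.Condition()

        def apply(self, commit_ts, key, value):
            # Journal entries arrive and are applied in strictly increasing commit-time order.
            with self._cv:
                assert commit_ts > self.applied_ts
                self.state[key] = value
                self.applied_ts = commit_ts
                self._cv.notify_all()

        def read(self, key, read_ts):
            # Strong consistency: wait until every write with commit time <= read_ts is visible.
            with self._cv:
                self._cv.wait_for(lambda: self.applied_ts >= read_ts)
                return self.state.get(key)
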
Tips for Building Knowledge Graphs (15 minute read)

Knowledge graphs offer distinct advantages over traditional relational databases for modeling highly interconnected and complex domains, especially beyond the 30-table threshold. They simplify schema evolution, enable advanced inferencing via standards like OWL and SHACL, and streamline business logic by embedding process knowledge directly into the data layer. Integrating knowledge graphs with LLMs via structured APIs enhances security and query expressivity. However, the primary challenge (and cost driver) remains acquiring, structuring, and maintaining high-quality, domain-relevant data.
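
As a toy illustration of the modeling difference, the rdflib snippet below stores facts as triples and answers a multi-hop question with a single SPARQL pattern; the schema is invented for the example.

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, RDF.type, EX.Engineer))
    g.add((EX.alice, EX.worksOn, EX.pipeline42))
    g.add((EX.pipeline42, EX.dependsOn, EX.warehouse))
    g.add((EX.bob, EX.owns, EX.warehouse))

    # "Who owns anything that Alice's projects depend on?" - one graph pattern, no join
    # tables, and new edge types can be added later without a schema migration.
    q = """
    SELECT ?owner WHERE {
      <http://example.org/alice> <http://example.org/worksOn> ?proj .
      ?proj <http://example.org/dependsOn> ?asset .
      ?owner <http://example.org/owns> ?asset .
    }
    """
    for row in g.query(q):
        print(row.owner)
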
The Network is the Product: Data Network Flywheel, Compound Through Connection (7 minute read)

Data value compounds not through isolated products, but via interconnected data ecosystems where feedback loops drive continual learning and intelligence. Transitioning from siloed models to a networked "Data Flywheel" amplifies value, speed, and trust, as every new data product, user context, and global quality protocol reinforces system-wide outcomes. Prioritizing connection density, context-driven design, and distributed quality assurance turns data platforms into self-accelerating engines of innovation and actionable insight.
💻

Launches & Tools

dbt's new Fusion Engine for smarter, cost-effective data ops (Sponsor)

Data teams face an impossible choice: move fast and explode cloud costs, or manage spend and sacrifice quality. The new dbt Fusion engine eliminates this tradeoff with state-aware orchestration that skips unchanged models and tests automatically, achieving 29% efficiency gains while maintaining data freshness. For a closer look, join the live session (December 3 / 4) and hear how Fusion is helping Obie Insurance and Analytics8 move toward faster pipelines and reduced waste.
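
For intuition, here is a generic Python sketch of state-aware skipping (not dbt Fusion's implementation): fingerprint each model's SQL together with its upstream fingerprints, then rebuild only what changed since the previous run.

    import hashlib, json
    from graphlib import TopologicalSorter

    def fingerprint(model_sql, upstream_fingerprints):
        payload = json.dumps({"sql": model_sql, "upstream": sorted(upstream_fingerprints)})
        return hashlib.sha256(payload.encode()).hexdigest()

    def plan_run(models, previous_state):
        """models: {name: {"sql": str, "deps": [upstream names]}}; returns models to rebuild."""
        order = TopologicalSorter({n: set(m["deps"]) for n, m in models.items()}).static_order()
        fingerprints, to_run = {}, []
        for name in order:
            fp = fingerprint(models[name]["sql"], [fingerprints[d] for d in models[name]["deps"]])
            fingerprints[name] = fp
            if previous_state.get(name) != fp:   # this model, or something upstream, changed
                to_run.append(name)
        return to_run
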
pgFirstAid (GitHub Repo)

pgFirstAid is a lightweight, single-function PostgreSQL health check that instantly returns prioritized performance and stability issues with recommended fixes. It covers key areas like missing primary keys, bloat, outdated statistics, and inefficient indexes, is safe to run in production, and is designed for anyone to use, not just DBAs.
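
As an example of the kind of check it performs, the snippet below flags user tables with no primary key by querying the system catalogs from Python. This is illustrative only (the connection string is a placeholder); see the repo for pgFirstAid's actual function and output format.

    import psycopg2

    MISSING_PK = """
    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
      AND NOT EXISTS (
          SELECT 1 FROM pg_constraint con
          WHERE con.conrelid = c.oid AND con.contype = 'p'
      );
    """

    # "dbname=app" is a placeholder connection string for the example.
    with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
        cur.execute(MISSING_PK)
        for schema, table in cur.fetchall():
            print(f"Missing primary key: {schema}.{table}")
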
sqlmap (GitHub Repo)

sqlmap is an open-source tool for automating SQL injection discovery and exploitation across all major databases. It supports six injection techniques (boolean-based blind, time-based blind, error-based, UNION query-based, stacked queries, and out-of-band), along with full database fingerprinting, data extraction, file system operations, and OS-level command execution when privileges allow.
DuckDB Internals - Part 4: Optimizer Overview (21 minute read)

DuckDB's optimizer is a sophisticated, extensible component central to its OLAP performance, transforming unoptimized logical plans into efficient ones via rule-based transformations. Encapsulated in the Optimizer class, it applies 26 built-in rules to simplify expressions, reorder operations, and push down filters, and it supports plugins via OptimizerExtension for custom pre/post-optimization hooks.
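
One way to poke at this from DuckDB's Python client is to compare EXPLAIN output with a rule toggled off via the disabled_optimizers setting; the rule name below (filter_pushdown) follows DuckDB's internal naming and may vary by version.

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT range AS id, range % 100 AS grp FROM range(1000000)")
    sql = "SELECT grp, count(*) FROM (SELECT * FROM t) sub WHERE grp = 7 GROUP BY grp"

    def show_plan(label):
        print(label)
        for _, plan in con.execute("EXPLAIN " + sql).fetchall():
            print(plan)

    show_plan("with the full optimizer (filter pushed below the subquery)")
    con.execute("SET disabled_optimizers = 'filter_pushdown'")
    show_plan("with filter_pushdown disabled")
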
Ax 1.0: Efficient Optimization With Adaptive Experimentation (5 minute read)

The open-source platform Ax powers adaptive optimization for large-scale ML systems at Meta, replacing brute-force searches (grid/random) with Bayesian and sequential methods for hyperparameters, metrics, and system tuning. It supports complex constraints, noisy observations, parallel suggestions, and early stopping. A research paper detailing the system's architecture, features, and performance is linked in the article.
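
A minimal ask/tell loop using Ax's earlier AxClient service API with a toy objective; note that the Ax 1.0 release covered here introduces a new Client interface, so treat this as an illustrative pre-1.0-style sketch rather than the 1.0 API.

    from ax.service.ax_client import AxClient, ObjectiveProperties

    ax_client = AxClient()
    ax_client.create_experiment(
        name="tune_model",
        parameters=[
            {"name": "lr", "type": "range", "bounds": [1e-5, 1e-1], "log_scale": True},
            {"name": "dropout", "type": "range", "bounds": [0.0, 0.5]},
        ],
        objectives={"val_accuracy": ObjectiveProperties(minimize=False)},
    )

    def evaluate(params):
        # Stand-in for training + evaluation; returns a (possibly noisy) metric.
        return 1.0 - (params["lr"] - 0.01) ** 2 - 0.1 * params["dropout"]

    for _ in range(20):
        params, trial_index = ax_client.get_next_trial()   # Bayesian suggestion, not grid/random
        ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(params))

    best_parameters, metrics = ax_client.get_best_parameters()
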
🎁

Miscellaneous

How Can You Identify an Agentic AI Use Case? (10 minute read)

Agentic AI can automate complex, reasoning-heavy tasks that are repetitive, expert-dependent, or involve scattered and unstructured data, dramatically cutting human effort. The catch is that the scope must be clearly bounded, the tools well-defined (potentially with subagents), and enough upfront documentation invested to eliminate ambiguity and prevent incomplete automation.
Training a Tokenizer for BERT Models (4 minute read)

Training a custom WordPiece tokenizer for BERT using Hugging Face's tokenizers and datasets libraries involves loading a corpus, training the tokenizer from an iterator with a 30,522-word vocabulary and BERT special tokens, enabling padding/truncation, and saving the final tokenizer for testing and downstream BERT fine-tuning.
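
A condensed sketch of that workflow using the Hugging Face tokenizers and datasets libraries; the corpus (wikitext-2) and output file name are example choices, not necessarily the article's.

    from datasets import load_dataset
    from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

    def batch_iterator(batch_size=1000):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i : i + batch_size]["text"]

    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

    trainer = trainers.WordPieceTrainer(
        vocab_size=30522,                  # BERT's original vocabulary size
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

    tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
    tokenizer.enable_truncation(max_length=512)
    tokenizer.save("bert-wordpiece-tokenizer.json")

    print(tokenizer.encode("Training a tokenizer for BERT.").tokens)
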

Quick Links

All You Can Do Before Airflow (5 minute read)

Start with simple orchestration and scale only when complexity demands it.
State, Scale, and Signals: Rethinking Orchestration with Durable Execution (52 minute podcast)

Durable execution shifts distributed system reliability from an application concern to a platform guarantee.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of data engineering professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Joel Van Veluwen, Tzu-Ruey Ching & Remi Turpaud


Manage your subscriptions to our other newsletters on tech, startups, and programming. Or if TLDR Data isn't for you, please unsubscribe.
