// Short technical notes from active engagements. Anonymized when needed. We publish what we'd want to read.
Eval sets without at least 20% painful edge cases give you false confidence. A short note on why we deliberately overweight failure modes.
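A minimal sketch of what we mean by overweighting: reserve a fixed share of the eval set for known failure modes instead of sampling uniformly. The triple layout and the `is_edge_case` flag are illustrative assumptions, not from the note itself.

```python
import random

def build_eval_set(examples, n, edge_case_fraction=0.2, seed=0):
    """Sample an eval set that reserves a fixed share for edge cases.

    `examples` is a list of (input, label, is_edge_case) triples --
    the field layout is illustrative. Uniform sampling would give
    edge cases their base rate; here we pin them at a floor.
    """
    rng = random.Random(seed)
    edge = [e for e in examples if e[2]]
    easy = [e for e in examples if not e[2]]
    n_edge = min(len(edge), max(1, round(n * edge_case_fraction)))
    sample = rng.sample(edge, n_edge) + rng.sample(easy, n - n_edge)
    rng.shuffle(sample)
    return sample
```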
BM25 + dense retrieval beats either alone — but only above a corpus size threshold. Here's where the line is, with numbers from three recent engagements.
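One common way to combine the two retrievers is reciprocal rank fusion; a minimal sketch, assuming each retriever returns a ranked list of document ids (the note doesn't specify the fusion method, and `k=60` is the conventional default, not a number from our engagements).

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over ranked doc-id lists (best first),
    e.g. one list from BM25 and one from a dense retriever.

    Each list contributes 1 / (k + rank) per document; documents
    ranked highly by either retriever float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```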
Most "internal SEO tools" fail because they target a workflow that's actually 4 different workflows. A short framework for scoping copilot projects.
You don't always have ground-truth labels in production. Here's how we run drift detection on outputs alone, and where it breaks.
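One label-free signal is the shape of the output score distribution itself; a minimal sketch using the population stability index between a reference window and a current window. The 10-bin choice and the usual "PSI > 0.2 means drift" reading are heuristics we're assuming here, not numbers from the note.

```python
import numpy as np

def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index between two 1-D score samples.

    Bin edges come from quantiles of the reference window, so each
    reference bin holds ~1/bins of the mass; `eps` guards against
    log(0) in empty bins. Needs no ground-truth labels.
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
    ref_frac = np.histogram(reference, edges)[0] / len(reference) + eps
    cur_frac = np.histogram(current, edges)[0] / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

The obvious failure mode, per the teaser: output distributions can stay stable while the input-output relationship drifts underneath them.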
Three real numbers from a recent invoice-processing engagement. The OCR is rarely the bottleneck.
A 92%-accurate classifier with bad calibration is worse than an 87%-accurate one with good calibration. Why, and what to do about it.
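Calibration here is usually measured with expected calibration error; a minimal sketch with the conventional 10 equal-width confidence bins (the note itself doesn't commit to a metric or bin count).

```python
import numpy as np

def expected_calibration_error(confidences, correct, bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin,
    weighted by the fraction of samples in the bin.

    `confidences` are predicted probabilities in (0, 1];
    `correct` is 0/1 per prediction. Low ECE means the model's
    stated confidence tracks its actual hit rate.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)
```

A 92%-accurate model that says "0.99" on everything racks up a large ECE; the 87% model whose confidences match its hit rates lets you route low-confidence cases to a human, which is usually what matters in production.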
One technical note per month, in your inbox. No promotions. Read by ~3,800 ML practitioners.
→ get in touch