The Small Model Infrastructure Nobody Built (So We Did)
Details
Abstract
I wrote an article about making embedding inference fast. Flash Attention, memory hierarchy, all the usual advice. Then I posted it, and the internet told me I was wrong: attention is 6% of the workload, my optimization advice was backwards, and I'd never actually profiled the thing I was writing about.
Here's the thing: we were building an embedding inference engine when this happened. And the corrections didn't just fix my article; they exposed gaps in everything we'd seen in the ecosystem. TEI gives you one model per container, so five models means five deployments. Infinity crashes under load. Everyone assumes you know exactly which model you want before you start the server. Nobody handles what happens when you're running three models and memory fills up.
So we built what we actually needed: an inference engine where you load models at query time, not deploy time. Where you can hot-swap between dozens of models on one GPU without restarting anything. Where memory pressure triggers LRU eviction instead of an OOM crash. Where trying a new embedding model doesn't mean a rebuild. This talk is the story of how internet feedback reshaped both an article and a product. I'll show you the profiler traces that proved the commenters right, the infrastructure gaps we found when we looked harder, and how we filled them. Small-model inference is the infrastructure nobody built. So we did.
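To make the idea of query-time loading with LRU eviction concrete, here is a minimal sketch in Python. It is not the engine's actual code: the `ModelCache` class, the `loader` callback, and the `sizes_mb` table are hypothetical names invented for illustration, and memory is tracked with a simple counter rather than real GPU measurements.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class ModelCache:
    """Toy cache: loads a model on first request, evicts the least
    recently used model when a (hypothetical) memory budget is exceeded."""

    def __init__(self, budget_mb: int, loader: Callable[[str], object],
                 sizes_mb: Dict[str, int]) -> None:
        self.budget_mb = budget_mb
        self.loader = loader        # assumed: builds a model from its name
        self.sizes_mb = sizes_mb    # assumed: known footprint per model
        self.used_mb = 0
        self._models: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str) -> object:
        # Cache hit: mark the model as most recently used and return it.
        if name in self._models:
            self._models.move_to_end(name)
            return self._models[name]

        size = self.sizes_mb[name]
        if size > self.budget_mb:
            raise MemoryError(f"{name} does not fit in the budget at all")

        # Memory pressure: evict least recently used models until it fits.
        while self.used_mb + size > self.budget_mb:
            evicted, _ = self._models.popitem(last=False)
            self.used_mb -= self.sizes_mb[evicted]

        # Load at query time, not at deploy time.
        self._models[name] = self.loader(name)
        self.used_mb += size
        return self._models[name]

    def loaded(self) -> List[str]:
        """Model names ordered from least to most recently used."""
        return list(self._models)


if __name__ == "__main__":
    sizes = {"model-a": 400, "model-b": 400, "model-c": 400}
    cache = ModelCache(1000, loader=lambda n: f"<{n}>", sizes_mb=sizes)
    for request in ["model-a", "model-b", "model-a", "model-c"]:
        cache.get(request)
    print(cache.loaded())
```

In this sketch, the fourth request does not fit alongside the two resident models, so the least recently used one (`model-b`) is dropped instead of the process running out of memory. A real engine would also have to free device memory and deal with requests that are still in flight on the evicted model.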
About the Speaker
Filip is a Machine Learning Engineer working in Developer Relations at Superlinked, shaping open source technical strategy, shipping product features, and working on small LLMs.
Bridging ML research with production systems and developer communities, Filip specialises in building end-to-end AI solutions, leading developer relations initiatives, and crafting AI/ML engineering solutions that transform research into business value.
He regularly speaks at AI conferences and events, translating complex technical concepts for diverse audiences.
Tech startups, philosophy, and art inspire his creative problem-solving approach.
