A small team of researchers-who-build. We ship agent systems that survive Tuesday mornings — eval-first, MCP-native, process rewards where they earn their keep. Sometimes we chase a paper down a rabbit hole for the sport.
Half our time is client work — consulting, training, agent evaluations. The other half is whatever the group finds interesting this month. Both halves make each other better.
We design, build, and harden agent systems that survive real users — MCP-native, sub-agent-orchestrated, boring on purpose. If your last agent broke at step 40, call us.
If you're deploying an agent and want to know whether it works on Tuesdays, not just in demos, we build the harness (τ-bench-style, process-aware), run the numbers, and hand you a reproducible suite.
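For the curious, here is a toy sketch of what "process-aware" means in practice. Every name in it is illustrative, not our actual harness; the real suite is shaped around your tools and traces. The point it makes: a task is graded on its trajectory, not just its final answer, so a lucky answer reached through a broken run still fails.

```python
# Illustrative only: a task checks both the outcome and the process.
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str          # name of the tool the agent called
    args: dict         # arguments it passed

@dataclass
class Trace:
    steps: list[Step]  # the full trajectory, pulled from your traces
    final_answer: str

@dataclass
class Task:
    prompt: str
    expected_answer: str
    required_tools: list[str] = field(default_factory=list)
    forbidden_tools: list[str] = field(default_factory=list)

def grade(task: Task, trace: Trace) -> dict:
    """Grade outcome and process separately; both must pass."""
    called = [s.tool for s in trace.steps]
    return {
        "outcome_ok": trace.final_answer.strip() == task.expected_answer.strip(),
        "process_ok": all(t in called for t in task.required_tools)
        and not any(t in called for t in task.forbidden_tools),
    }

if __name__ == "__main__":
    task = Task(
        prompt="Refund order #1234",
        expected_answer="refund issued",
        required_tools=["lookup_order", "issue_refund"],
        forbidden_tools=["delete_order"],
    )
    trace = Trace(
        steps=[Step("lookup_order", {"id": "1234"}),
               Step("issue_refund", {"id": "1234"})],
        final_answer="refund issued",
    )
    print(grade(task, trace))  # {'outcome_ok': True, 'process_ok': True}
```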
Process rewards, RLVR, synthetic data pipelines, Claude Skills that hold up under adversarial use. Where fine-tuning actually earns its keep, and where it doesn't.
Workshops for engineering teams who want to stop pasting prompts into chat and start shipping agents. Hands-on, no slides. We bring the eval harness.
Small experiments on what nobody's paying us for yet — long-horizon agent reliability, process rewards without labels, MCP-server topology analysis. Published when it holds up.
We build tools for ourselves, then publish the ones that survive. A few have landed in the Claude marketplace. MIT by default.
Weekly. We read the paper so you don't have to. Sometimes the writeup is better than the paper.
The one-weekend kind. Browser agents doing unreasonable things, voice loops that answer before you finish the question, Skills that probably shouldn't ship. Occasionally interesting.
We don't do discovery phases or steering committees. We do a call, then a prototype, then a demo. Most of our engagements look roughly like this:
You describe the problem. We ask annoyingly specific questions. By the end we either send a scope or politely decline.
Shared repo from day one. Traces in your observability tool. Friday demo on the real stack, not a staged one. No status reports.
We build the eval suite before we build the polish. You keep the code, the eval harness, the traces. We stay on-call for six more weeks.
We'd love to show you a polished case-study carousel. We love our clients' confidentiality more. Here's what we can say; ask us about the rest on a call and we'll show you what we're allowed to.
Each line below is a real engagement. Names and domains redacted; numbers and methods real. If one sounds like the thing you need, that's a good reason to get in touch.
We met in university. Most of us are still there — some studying, some already shipping, all still meeting in the same group chat. If we work with you, you'll meet everyone who's relevant.
Low-ceremony research log. Dates, code, traces, honest findings. Closer in spirit to an Obsidian Bases view than to Medium — more internal #research channel than blog. Lately on our desk: the 2026 AI Index, the Mythos Preview system card, and the Qwen3.5 weight drop.
// more coming. we write when we have something to say.
Tell us in two paragraphs what's going on and what "done" looks like. If it's a fit, we'll send back a scope within a week. If it isn't, we'll say so, and probably point you to someone better.