MiaoMind · est. 2025

Special forces for AI.
For hire, or for fun.

A small team of researchers-who-build. We ship agent systems that survive Tuesday mornings — eval-first, MCP-native, process rewards where they earn their keep. Sometimes we chase a paper down a rabbit hole for the sport.

~10 // people
2–4 wks // typical sprint
Opus 4.6 // daily driver
0 // slide decks shipped

// what we do

Two modes. One team.

Half our time is client work — consulting, training, agent evaluations. The other half is whatever the group finds interesting this month. Both halves make each other better.

For hire

// paid work

Agent systems for production

We design, build, and harden agent systems that survive real users — MCP-native, sub-agent orchestrated, boring on purpose. If your last agent broke at step 40, call us.

2–4 weeks · flat fee · you keep the repo

Evals & red-teaming

If you're deploying an agent and want to know whether it works on Tuesdays, not just in demos, we build the harness (τ-bench-style, process-aware), run the numbers, and hand you a reproducible suite.

fixed scope · your data stays yours · repo + report
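
If you're wondering what "process-aware" buys you: the harness scores intermediate steps, not just final answers. A minimal sketch of the idea in Python; the trace fields and checks below are illustrative, not our client harness.

```python
# Toy process-aware check: score every step of an agent trace,
# not just the final answer. Field names ("tool", "path") are
# hypothetical; adapt to whatever your agent framework logs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepCheck:
    name: str
    passed: Callable[[dict], bool]  # inspects one step of a trace

def score_trace(trace: list[dict], checks: list[StepCheck]) -> dict[str, float]:
    """Per-check pass rate over every step of one agent run."""
    hits = {c.name: 0 for c in checks}
    for step in trace:
        for c in checks:
            hits[c.name] += int(c.passed(step))
    return {name: n / max(len(trace), 1) for name, n in hits.items()}

# Two example checks: the agent only calls tools that exist,
# and it never writes outside its scoped workspace.
checks = [
    StepCheck("known_tool", lambda s: s.get("tool") in s.get("available_tools", [])),
    StepCheck("in_workspace", lambda s: str(s.get("path", "")).startswith("/workspace")),
]
```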

Post-training & Skill authoring

Process rewards, RLVR, synthetic data pipelines, Claude Skills that hold up under adversarial use. We'll tell you where fine-tuning actually pays off, and where it doesn't.

4–8 weeks · open-weight friendly · model-agnostic

Team training

Workshops for engineering teams who want to stop pasting prompts into chat and start shipping agents. Hands-on, no slides. We bring the eval harness.

1–3 days on-site · or 4 weeks async

For fun

// self-directed

Frontier research

Small experiments on what nobody's paying us for yet — long-horizon agent reliability, process rewards without labels, MCP-server topology analysis. Published when it holds up.

ongoing · notes below · repos on GitHub

Shared Skills & MCP servers

We build tools for ourselves, then publish the ones that survive. A few have landed in the Claude marketplace. MIT by default.

open source · skills + mcp · half-library, half-playbook

Reading group

Weekly. We read the paper so you don't have to. Sometimes the writeup is better than the paper.

public archive · posts in Notes

Demos that shouldn't exist

The one-weekend kind. Browser agents doing unreasonable things, voice loops that answer before you finish the question, Skills that probably shouldn't ship. Occasionally interesting.

irregular · no SLA · no regrets

// how we work

Short loops. Honest numbers.

We don't do discovery phases or steering committees. We do a call, then a prototype, then a demo. Most of our engagements look roughly like this:

01 · week 0

30-minute call

You describe the problem. We ask annoyingly specific questions. By the end we either send a scope or politely decline.

02 · week 1–2

Prototype in the open

Shared repo from day one. Traces in your observability tool. Friday demo on the real stack, not a staged one. No status reports.

03 · week 3–4

Eval, harden, hand off

We build the eval suite before we build the polish. You keep the code, the eval harness, the traces. We stay on-call for six more weeks.

// selected work

Most of it is under NDA.

We'd love to show you a polished case-study carousel. We also love our clients' confidentiality more. Here's what we can say — ask us about the rest on a call and we'll show you what we're allowed to.

Shapes, not stories.

Each line below is a real engagement. Names and domains redacted; numbers and methods real. If one sounds like the thing you need, that's a good reason to get in touch.

  • Ops agent for a bio lab · 14 MCP servers, 80-step workflows · shipped
  • On-prem coding agent · Qwen3.5-Coder + post-training, beats cloud on their stack · shipped
  • Contract-review post-training · process rewards, 3× cheaper at parity · shipped
  • Browser agent for back-office ops · τ-bench-style internal eval suite · live
  • Voice dispatch agent, realtime · sub-300ms turn latency · live
  • Long-horizon research assistant · 100+ step traces, human-auditable · ongoing
  • Weekend demo · still on our laptops, maybe yours · gremlin

// team

Ten pairs of eyes.

We met in university. Most of us are still there — some studying, some already shipping, all still meeting in the same group chat. If we work with you, you'll meet everyone who's relevant.

Evan · Agent orchestration · sub-agents. Reads distributed systems papers for fun.
Zayd · Evals · τ-bench-style harnesses. Trusts nothing until the numbers say so.
Freya · Post-training · process rewards. Currently obsessed with unlabeled traces.
Lucas · Agent runtime · MCP infra. Likes when things are boring in production.
Clara · Skill authoring · agent UX. Asks "but what does the human see?" a lot.
Elias · Long-horizon agents · reliability. Has the patience to stare at 100-step traces.
Arthur · Browser agents · human-in-the-loop. Every workflow gets an escape hatch.
Lin · Synthetic data · self-play pipelines. Writes surprisingly clean Rust.
Kai · Voice & realtime · sub-300ms loops. Shaves milliseconds like it's personal.
Anya · Wild cards · demos that shouldn't exist. Ships them anyway.
// note
Some of us are still in grad school. That's a feature, not a bug: it's why we read papers in the morning and ship before dinner. Our internal wiki runs on Obsidian; the aesthetic rubbed off. Stay SOTA to stay SOTA.

// notes

We read the paper so you don't have to.

Low-ceremony research log. Dates, code, traces, honest findings. Closer in spirit to an Obsidian Bases view than to Medium — more internal #research channel than blog. Lately on our desk: the 2026 AI Index, the Mythos Preview system card, and the Qwen3.5 weight drop.

2026-04-14

The 2026 AI Index · the chart everyone is quoting, and the one they're skipping

SWE-bench Verified went 60% → ~100% in a year — that's the headline chart. The one we're watching: the Foundation Model Transparency Index dropped 18 points (58 → 40). What that means for anyone trying to build honest evals.
index
2026-04-07

Reading Mythos Preview's system card so you don't have to

Anthropic built a 93.9%-on-SWE-bench model, measured it, then decided not to ship it. We read the system card cover-to-cover. Three moves worth copying: the welfare section, the refusal framing, and how the eval appendix is written.
reading
2026-03-10

The 50-step cliff is real

We ran four agent frameworks on the same 100-step ops workflow. All of them nose-dive between steps 40 and 60. Here's where, why, and which failure modes are avoidable — with traces.
agents
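
The measurement itself fits in a dozen lines. A sketch of how we locate a cliff, assuming each run logs a per-step on-track flag; the logging format here is hypothetical, yours will differ.

```python
# Locate the step cliff: per-step survival across many agent runs.
# runs[i][k] is True iff run i was still on-track at step k.
# (Hypothetical trace format; adapt to what your framework logs.)

def survival_curve(runs: list[list[bool]], max_steps: int = 100) -> list[float]:
    """Fraction of runs still on-track at each step."""
    curve = []
    for k in range(max_steps):
        alive = [r[k] for r in runs if len(r) > k]
        curve.append(sum(alive) / len(alive) if alive else 0.0)
    return curve

def cliff_step(curve: list[float]) -> int:
    """Index of the steepest single-step drop in survival."""
    drops = [curve[k] - curve[k + 1] for k in range(len(curve) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)
```
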
2026-02-28

Process rewards without labels · our notebook

Following the DPO-PRM hybrid thread. On our code-review task: 71% → 84% pass rate with ~2k unlabeled agent traces. Writeup + eval set attached. No, we don't have it running in production yet — but we're close.
post-training
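
The core trick, as we read the thread: contrast steps from traces that eventually passed against steps from traces that failed, so the only supervision is the outcome bit your harness already emits. A hedged sketch; the trace fields and pairing logic are illustrative, not the exact recipe from the writeup.

```python
# Sketch: outcome-labeled traces -> step-level preference pairs,
# so a process reward model can train without human step labels.
# Assumes each trace dict has "task", "steps", and a final "outcome".
import random

def step_preference_pairs(traces: list[dict], n_pairs: int = 1000) -> list[tuple]:
    passed = [t for t in traces if t["outcome"] == "pass"]
    failed = [t for t in traces if t["outcome"] == "fail"]
    pairs = []
    for _ in range(n_pairs * 10):  # bounded attempts; not all tasks overlap
        if len(pairs) >= n_pairs or not passed or not failed:
            break
        p, f = random.choice(passed), random.choice(failed)
        if p["task"] != f["task"]:
            continue  # only contrast runs of the same task
        m = min(len(p["steps"]), len(f["steps"]))
        if m == 0:
            continue
        k = random.randrange(m)
        pairs.append((p["steps"][k], f["steps"][k]))  # (chosen, rejected)
    return pairs
```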

// more coming. we write when we have something to say.

// get in touch

Have a problem that doesn't fit a slide deck?

Tell us in two paragraphs what's going on and what "done" looks like. If it's a fit, we'll send back a scope within a week. If it isn't, we'll say so, and probably point you to someone better.

hello@miaomind.com
Not ready? Read our notes first.
We read everything. We reply within 2 business days.