What I haven't learned yet
I’ve spent the last few weeks writing posts about how I use AI coding agents. Rules. Skills. Subagents. Workflows. Each post was confident enough to read like I knew what I was doing. Most of the time I do, more or less. But there’s a real list of things I’m still confused about, and I’ve been hesitating to write about them because they don’t fit the tone of “here’s what works for me.”
I’m going to write that list anyway. Not because vulnerability is content — it can be, and that’s a different post — but because I think the list itself is interesting. The shape of what an early-but-not-beginner practitioner is still figuring out probably says something useful about where the field actually is.
MCP servers — I’ve barely scratched the surface
MCP, the Model Context Protocol, is the thing that lets AI agents talk to external systems through a standardized interface. It’s been around for a little while now. People are building servers for everything: filesystems, databases, calendars, search engines, custom internal APIs. The agent’s client connects to those servers, and the model gains a structured way to do things outside its own context.
I’ve used MCP servers — the Slack MCP, the Context7 documentation MCP, a couple of others. They work. They’re useful when they’re already configured.
What I haven’t done is write one. I haven’t sat down and built an MCP server for a custom internal system, exposed the right surface area, thought through authentication and rate limiting and error handling. The honest reason: I haven’t needed one badly enough to push through the activation energy. The existing MCP servers cover most of what I want.
What I suspect I’m missing is the moment where writing a custom MCP becomes the obviously-right move for a specific problem. I imagine that moment is when an agent needs to operate on an internal system more deeply than a script-based skill can manage. But I don’t have first-hand experience there yet.
The shape of my gap: I can use MCP. I cannot yet design one. The conceptual model — server-side state, tool registration, resource versus tool versus prompt distinctions, how clients negotiate capabilities — is still a little hazy in my head.
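One way to hold the tool, resource, and prompt distinctions apart is a toy model in plain Python. This is a sketch under loud assumptions: it is not the MCP SDK and not the wire protocol, and `ToyServer`, `call_tool`, `read_resource`, `get_prompt`, and the `tickets://open` URI are all hypothetical names invented for illustration. The `capabilities()` check is a simplified stand-in for the capability exchange a real client and server perform when they connect.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyServer:
    """Hypothetical stand-in for an MCP-style server; not the real SDK."""
    # Tools: actions the model may invoke, possibly with side effects.
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)
    # Resources: read-only data, addressed by a URI-like key.
    resources: dict[str, str] = field(default_factory=dict)
    # Prompts: reusable templates that a user selects and fills in.
    prompts: dict[str, str] = field(default_factory=dict)

    def capabilities(self) -> set[str]:
        """Advertise only the primitive kinds that have something registered."""
        registries = {
            "tools": self.tools,
            "resources": self.resources,
            "prompts": self.prompts,
        }
        return {kind for kind, registry in registries.items() if registry}

    def call_tool(self, name: str, **arguments: str) -> str:
        # A client should only call tools the server said it offers.
        if "tools" not in self.capabilities():
            raise RuntimeError("server did not advertise tools")
        return self.tools[name](**arguments)

    def read_resource(self, uri: str) -> str:
        return self.resources[uri]

    def get_prompt(self, name: str, **arguments: str) -> str:
        return self.prompts[name].format(**arguments)


server = ToyServer()
server.tools["create_ticket"] = lambda title: f"created ticket: {title}"
server.resources["tickets://open"] = "TICKET-1: login page times out"
server.prompts["triage"] = "Summarize this ticket and suggest an owner: {ticket}"

advertised = server.capabilities()
result = server.call_tool("create_ticket", title="fix login timeout")
ticket = server.read_resource("tickets://open")
message = server.get_prompt("triage", ticket=ticket)
```

The design question the toy leaves open is the one I actually care about: which parts of an internal system belong behind a tool, which are better exposed as readable resources, and which are really just prompt templates.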
If you’ve built an MCP server and have opinions about when it’s the right tool versus when a plain script-skill is enough, please tell me. The current literature is mostly tutorials; what I want is the seasoned-engineer take.
LangGraph — I copied patterns without fully understanding the runtime
This is an embarrassing one. The agent project I work on is built on LangGraph — a Python framework for building stateful, multi-step AI agents using a graph of nodes. I’ve contributed plenty of code to that project. The graph runs. Things go through it.
But if I’m honest, my mental model of how the runtime actually works is fuzzier than it should be. I know the broad pieces — checkpointing, streaming, interruption, subgraph composition, serialization — but I don’t have the deep model I want for their edge cases. What gets saved, when, in what order? How do overlapping channels behave at subgraph boundaries? Why do some Pydantic shapes round-trip cleanly through the checkpointer and others don’t?
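One general mechanism behind round-trip surprises can be shown without LangGraph at all. The sketch below is an assumption-heavy illustration: it uses plain `json` and dataclasses, not Pydantic and not LangGraph’s checkpointer, whose serializer may handle these cases differently. `FlatState`, `TrickyState`, and `round_trip` are hypothetical names. The point is only that any type the wire format cannot represent comes back as something else.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FlatState:
    step: int
    notes: list[str]  # lists exist in JSON, so this survives unchanged


@dataclass
class TrickyState:
    step: int
    span: tuple[int, int]  # JSON has no tuple type; it decodes as a list


def round_trip(state):
    """Serialize to JSON text, then rebuild the same dataclass from the decoded dict."""
    payload = json.dumps(asdict(state))
    return type(state)(**json.loads(payload))


flat = FlatState(step=1, notes=["a", "b"])
tricky = TrickyState(step=1, span=(0, 5))

flat_ok = round_trip(flat) == flat        # True: every field has a JSON equivalent
tricky_ok = round_trip(tricky) == tricky  # False: span comes back as [0, 5]
```

Whether the checkpointer fails in this particular way is exactly what I would need to confirm by reading the source; the toy only gives me a hypothesis to test.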
I’ve shipped working code despite these gaps. The pattern is: copy a shape that works elsewhere in the codebase, adapt it to my case, run the tests. When something is weird, I ask the agent or a teammate. One teammate in particular has a much deeper model of LangGraph than I do, and several of these gaps closed only when they explained the runtime to me.
That’s not a sustainable model long-term. The fact that I can ship without a full mental model means the framework is well-designed; it doesn’t mean I can keep avoiding the depth forever.
The work to fix this is mine. It looks like reading the LangGraph source, building a few toy graphs from scratch (not adapted from existing ones), and probably writing a few notes-to-self that I’d be embarrassed to publish.
Prompt engineering subtleties I keep getting wrong
I’ve written a lot of prompts. Some of them I’m proud of. Some of them I look at six months later and don’t recognize.
The pattern in the ones that don’t age well is usually that I overspecified at the wrong layer. I added constraints that were too brittle, examples that were too specific, instructions that contradicted each other in edge cases I didn’t anticipate. The prompt worked for the case I wrote it for and produced surprising failures when the input drifted.
Specific things I keep getting wrong:
- Order of instructions. I know it matters. I do not have a reliable model of which order beats which other order, beyond “important things first, examples after instructions.” Different model families seem to react differently to ordering choices, and I haven’t built up the intuition for each.
- When to use few-shot examples vs. when they hurt. Few-shot examples often help. They also sometimes anchor the model to surface patterns that don’t generalize. I’ve watched the same examples improve the output in one task and degrade it in another. I cannot reliably predict which.
- Negative instructions. “Don’t do X” is famously fragile. The model often does X anyway, or avoids X by also avoiding Y, when Y is what I actually wanted. I have heuristics (“rephrase as positive instructions when possible”) but no deep model.
- The temperature/sampling tradeoffs. I usually leave the defaults. I know there are tasks where lower temperature would help and others where higher would help. I do not run my own evaluations to find out which.
- Reasoning vs. non-reasoning models. The newest reasoning models behave differently from instruction-tuned ones. Prompts that worked great on the older generation sometimes underperform on the new one. I’ve adapted by trial and error rather than by understanding.
The thread connecting these gaps is that I’ve been writing prompts as a craft — by feel, with iteration. I haven’t built up the equivalent of an empirical practice, where I run controlled comparisons and learn from the results. That practice is what would turn the craft into engineering. I haven’t started it.
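On the temperature item in the list above, the mechanism is simple even if the right setting per task is not. A minimal sketch, assuming the standard softmax-with-temperature formulation; the logits are made up, and `softmax_with_temperature` is a hypothetical helper rather than any provider’s API:

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)
default = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

# The top token's share shrinks as temperature rises.
top_shares = [round(p[0], 3) for p in (cold, default, hot)]
```

Knowing that lower temperature concentrates probability on the top token does not tell me which tasks benefit from it. That part still needs the evaluations I have not been running.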
Evaluation infrastructure I haven’t really invested in
This is the meta-version of the previous gap. Building rigorous evaluation harnesses — small suites of test inputs that score model outputs against expected behaviors — is the mature-engineer way to get past prompt-engineering-by-feel.
I have one or two of these. Mostly small, mostly informal. They catch obvious regressions and miss subtle ones.
What I haven’t built is the infrastructure to run a meaningful evaluation routinely: a tagged dataset of representative inputs, scoring rubrics that produce comparable numbers across runs, a way to see “this prompt change improved the score by X” in a way I can actually trust.
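The smallest version of that infrastructure might look like the sketch below. Everything in it is an assumption for illustration: `Case`, `fake_model`, and `evaluate` are hypothetical names, the model call is a deterministic stub so the example runs offline, and the rubric is the crudest one possible (a required substring). A real harness would swap in an actual model call and a better scorer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    tag: str           # category label, so scores can be broken down
    user_input: str
    must_contain: str  # crude rubric: output must include this substring


def fake_model(prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for a real model call; deterministic for the demo."""
    if "cite the file" in prompt:
        return f"Fix in utils.py: {user_input}"
    return f"Fix: {user_input}"


def evaluate(
    prompt: str,
    cases: list[Case],
    model: Callable[[str, str], str] = fake_model,
) -> dict[str, float]:
    """Return the pass rate per tag, plus an 'overall' pass rate."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        output = model(prompt, case.user_input)
        ok = case.must_contain.lower() in output.lower()
        for key in (case.tag, "overall"):
            total[key] += 1
            passed[key] += ok  # True counts as 1, False as 0
    return {key: passed[key] / total[key] for key in total}


CASES = [
    Case("bugfix", "null check missing", "utils.py"),
    Case("bugfix", "off-by-one in loop", "utils.py"),
    Case("explain", "what does retry do", "fix"),
]

baseline = evaluate("You are a code assistant.", CASES)
candidate = evaluate("You are a code assistant. Always cite the file.", CASES)
delta = candidate["overall"] - baseline["overall"]
```

Because both prompts run against the same tagged cases with the same scorer, `delta` is a number that means the same thing from one run to the next, which is the property my informal checks lack.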
The reason is the same as the MCP gap, mostly: I can ship without it, by accepting some volatility in output quality. The cost of building it is real, and the immediate payback is unclear because most of the prompts I write today aren’t surviving long enough to merit the investment.
I think that reasoning is wrong. I should build the harness eventually, and the longer I wait, the more compounding I lose.
Multi-agent coordination patterns
I have one subagent in regular use (the security reviewer) and a handful of other agents that I occasionally spawn for parallel work. I have not built anything that would qualify as a coordinated multi-agent system: multiple specialized agents that pass work between each other, negotiate over shared state, or run in long-running collaborative loops.
The literature here is busy. There are papers, frameworks, and arguments. Some of them seem genuinely promising; others seem like solutions in search of problems. I don’t have the experience to tell the difference.
What I’d want to know better:
- When does the multi-agent coordination overhead pay off, versus a single agent with the right rules and skills?
- What patterns hold up under real workloads, and which ones look good in demos and fall apart in production?
- How do coordination failures show up, and how do you debug them?
Until I have answers I can defend, I’m staying in the single-parent-agent-with-occasional-subagents pattern, which has been working. Moving to a more elaborate setup before I understand the tradeoffs would be premature.
Fine-tuning — never seriously tried
I have done zero fine-tuning of language models. I read about it. I see other people doing it. I have never produced a fine-tuned model myself, even for a hobby project.
Why the gap exists: every time I’ve considered fine-tuning, the answer to “would a better prompt or a smaller model with the right context get me 90% of the way there?” has been yes. Fine-tuning has stayed below the threshold of “the obvious next move.”
That may continue to be true. It also may stop being true at some point — for tasks where the right prompt is too long, or where the latency budget rules out a large model, or where the use case is repetitive enough that a specialized small model would clearly win.
What I want to know: which signals indicate that fine-tuning is the right move? And conversely, which apparent signals are red herrings that lead people to fine-tune when they shouldn’t?
What I’m doing about it
I’m not doing nothing, but I’m also trying not to dabble shallowly in all of it. The first area I should genuinely commit to is evaluation infrastructure, because better evals make better prompt engineering possible, and better prompt engineering makes better skills and subagents. After that, probably LangGraph internals, because the project I work on uses them daily.
MCP servers, multi-agent coordination, and fine-tuning can wait until they’re closer to my current work.
A small request
If any of this is something you know well, I want to learn from you. Nobody has a complete map of this territory yet, and pretending otherwise ages badly.
If you have something to teach me — about MCP, LangGraph, evaluation, prompts, anything else on the list — please send it. I’ll read it carefully.