Random Thoughts

AI and knowledge work

What I haven't learned yet

Friday, May 8, 2026

  • human-written
  • #ai
  • #learning
  • #developer-experience
  • #community
  • #mcp
  • #langgraph
  • #prompt-engineering
  • #openai
  • #anthropic
  • #python

I’ve spent the last few weeks writing posts about how I use AI coding agents. Rules. Skills. Subagents. Workflows. Each post was confident enough to read like I knew what I was doing. Most of the time I do, more or less. But there’s a real list of things I’m still confused about, and I’ve been hesitating to write about them because they don’t fit the tone of “here’s what works for me.”

I’m going to write that list anyway. Not because vulnerability is content — it can be, and that’s a different post — but because I think the list itself is interesting. The shape of what an early-but-not-beginner practitioner is still figuring out probably says something useful about where the field actually is.

[Image: an antique-style sepia map on aged parchment. Some landmasses are fully inked and detailed; others are only dashed outlines, marked with question-mark glyphs and flanked by sea monsters.]
The inked regions are where I can write confidently. The dashed regions are where I either copy charts or hand off. The whole point of this post is to name the second territory out loud.

MCP servers — I’ve barely scratched the surface

MCP — the Model Context Protocol — is the thing that lets AI agents talk to external systems through a standardized interface. It’s been around for a little while now. People are building servers for everything: filesystems, databases, calendars, search engines, custom internal APIs. The model integrates with those servers and gains a structured way to do things outside its own context.

I’ve used MCP servers — the Slack MCP, the Context7 documentation MCP, a couple of others. They work. They’re useful when they’re already configured.

What I haven’t done is write one. I haven’t sat down and built an MCP server for a custom internal system, exposed the right surface area, thought through authentication and rate limiting and error handling. The honest reason: I haven’t needed one badly enough to push through the activation energy. The existing MCPs cover most of what I want.

What I suspect I’m missing is the moment where writing a custom MCP becomes the obviously-right move for a specific problem. I imagine that moment is when an agent needs to operate on an internal system more deeply than a script-based skill can manage. But I don’t have first-hand experience there yet.

The shape of my gap: I can use MCP. I cannot yet design one. The conceptual model — server-side state, tool registration, resource versus tool versus prompt distinctions, how clients negotiate capabilities — is still a little hazy in my head.
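A toy sketch of the resource versus tool versus prompt distinction, in plain Python. It is not the MCP SDK and does not speak the protocol; `ToyServer`, its decorators, and the ticket example are hypothetical names made up for illustration. It only models the rough idea: tools are actions the model can invoke, resources are data read by URI, prompts are reusable templates, and a server advertises which of these it offers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyServer:
    """Hypothetical stand-in for an MCP-style server; not the real SDK."""
    name: str
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    resources: Dict[str, Callable[[], str]] = field(default_factory=dict)
    prompts: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def tool(self, fn):
        # Tools: actions the model may choose to call, possibly with side effects.
        self.tools[fn.__name__] = fn
        return fn

    def resource(self, uri: str):
        # Resources: data the client reads, addressed by a URI.
        def register(fn):
            self.resources[uri] = fn
            return fn
        return register

    def prompt(self, fn):
        # Prompts: reusable templates a user can pick and fill in.
        self.prompts[fn.__name__] = fn
        return fn

    def capabilities(self) -> List[str]:
        # Advertise only the categories this server actually registered.
        registries = {
            "tools": self.tools,
            "resources": self.resources,
            "prompts": self.prompts,
        }
        return sorted(kind for kind, reg in registries.items() if reg)


server = ToyServer("tickets")


@server.tool
def close_ticket(ticket_id: int) -> str:
    return f"ticket {ticket_id} closed"


@server.resource("tickets://open")
def open_tickets() -> str:
    return "101, 102"


@server.prompt
def triage(ticket_text: str) -> str:
    return f"Classify the severity of this ticket: {ticket_text}"


print(server.capabilities())
print(server.tools["close_ticket"](101))
print(server.resources["tickets://open"]())
```

A real server would add transport, schemas, authentication, and error handling on top of this; the sketch only separates the three kinds of thing a server can expose.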

If you’ve built an MCP server and have opinions about when it’s the right tool versus when a plain script-skill is enough, please tell me. The current literature is mostly tutorials; what I want is the seasoned-engineer take.

LangGraph — I copied patterns without fully understanding the runtime

This is an embarrassing one. The agent project I work on is built on LangGraph — a Python framework for building stateful, multi-step AI agents using a graph of nodes. I’ve contributed plenty of code to that project. The graph runs. Things go through it.

But if I’m honest, my mental model of how the runtime actually works is fuzzier than it should be. I know the broad pieces — checkpointing, streaming, interruption, subgraph composition, serialization — but I don’t have the deep model I want for their edge cases. What gets saved, when, in what order? How do overlapping channels behave at subgraph boundaries? Why do some Pydantic shapes round-trip cleanly through the checkpointer and others don’t?
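One generic reason a shape can fail to round-trip, independent of LangGraph: the serialized format cannot represent every Python type. The sketch below uses only the standard library's `json` module, which is an assumption for illustration and not LangGraph's actual checkpoint serializer; `AgentState`, `save`, and `load` are hypothetical names. It shows the class of problem, not the specific cause in any real checkpointer.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass
class AgentState:
    # Hypothetical state shape, chosen to show two lossy cases.
    step: int
    span: Tuple[int, int]      # tuples have no JSON equivalent
    scores: Dict[int, float]   # JSON object keys are always strings


def save(state: AgentState) -> str:
    # Serialize the state to a JSON string.
    return json.dumps(asdict(state))


def load(blob: str) -> AgentState:
    # Rebuild the state without any type coercion, as a naive loader would.
    return AgentState(**json.loads(blob))


before = AgentState(step=3, span=(10, 20), scores={1: 0.5})
after = load(save(before))

print(after.step == before.step)  # the int survives
print(after.span)                 # the tuple comes back as a list
print(after.scores)               # the int key comes back as a string
print(after == before)            # so the round trip is not lossless
```

A validating model layer can coerce some of these back on load and not others, which is one plausible place for "some shapes work, some don't" to come from.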

I’ve shipped working code despite these gaps. The pattern is: copy a shape that works elsewhere in the codebase, adapt it to my case, run the tests. When something is weird, ask the agent or a teammate. The teammate, in particular, has a much deeper model of LangGraph than I do, and several of these gaps closed only when they explained the runtime to me.

That’s not a sustainable approach long-term. The fact that I can ship without a full mental model means the framework is well-designed; it doesn’t mean I can keep avoiding the depth forever.

The work to fix this is mine. It looks like reading the LangGraph source, building a few toy graphs from scratch (not adapted from existing ones), and probably writing a few notes-to-self that I’d be embarrassed to publish.

Prompt engineering subtleties I keep getting wrong

I’ve written a lot of prompts. Some of them I’m proud of. Some of them I look at six months later and don’t recognize.

The pattern in the ones that don’t age well is usually that I overspecified at the wrong layer. I added constraints that were too brittle, examples that were too specific, instructions that contradicted each other in edge cases I didn’t anticipate. The prompt worked for the case I wrote it for and produced surprising failures when the input drifted.

Specific things I keep getting wrong:

  • Order of instructions. I know it matters. I do not have a reliable model of which order beats which other order, beyond “important things first, examples after instructions.” Different model families seem to react differently to ordering choices, and I haven’t built up the intuition for each.
  • When to use few-shot examples vs. when they hurt. Few-shot examples often help. They also sometimes anchor the model to surface patterns that don’t generalize. I’ve watched the same examples improve the output in one task and degrade it in another. I cannot reliably predict which.
  • Negative instructions. “Don’t do X” is famously fragile. The model often does X anyway, or avoids X by also avoiding Y, when Y is what I actually wanted. I have heuristics (“rephrase as positive instructions when possible”) but no deep model.
  • The temperature/sampling tradeoffs. I usually leave the defaults. I know there are tasks where lower temperature would help and others where higher would help. I do not run my own evaluations to find out which.
  • Reasoning vs. non-reasoning models. The newest reasoning models behave differently from instruction-tuned ones. Prompts that worked great on the older generation sometimes underperform on the new one. I’ve adapted by trial and error rather than by understanding.
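For the temperature bullet above, the underlying mechanism is at least simple to state: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. A minimal sketch in plain Python; the logit values and the function name `softmax_with_temperature` are made up for illustration, and real APIs typically layer other sampling controls (top-p, top-k) on top of this.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Knowing the mechanism does not say which setting suits which task; that still takes running the comparison.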

The thread connecting these gaps is that I’ve been writing prompts as a craft — by feel, with iteration. I haven’t built up the equivalent of an empirical practice, where I run controlled comparisons and learn from the results. That practice is what would turn the craft into engineering. I haven’t started it.

Evaluation infrastructure I haven’t really invested in

This is the meta-version of the previous gap. Building rigorous evaluation harnesses — small suites of test inputs that score model outputs against expected behaviors — is the mature-engineer way to get past prompt-engineering-by-feel.

I have one or two of these. Mostly small, mostly informal. They catch obvious regressions and miss subtle ones.

What I haven’t built is the infrastructure to run a meaningful evaluation routinely: a tagged dataset of representative inputs, scoring rubrics that produce comparable numbers across runs, a way to see “this prompt change improved the score by X” in a way I can actually trust.
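A minimal sketch of what those three pieces could look like, in plain Python. Everything here is hypothetical: `fake_model` stands in for a real model call, the two prompt strings and the dataset are invented, and the rubric is a bare substring check. The point is only the shape: tagged cases, a scorer that returns a comparable number, and a per-tag comparison between two prompt variants.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    tag: str       # category used to slice the results
    text: str      # input handed to the model
    expected: str  # substring the output must contain to pass


def fake_model(prompt: str, text: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    # It "handles" negation only when the prompt mentions it.
    if "not" in text.split() and "negation" in prompt:
        return "negative"
    return "positive" if "good" in text else "negative"


def score(prompt: str, cases: List[Case],
          model: Callable[[str, str], str]) -> Dict[str, float]:
    """Return the pass rate per tag, so runs are comparable across prompts."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.tag] += 1
        if case.expected in model(prompt, case.text):
            passed[case.tag] += 1
    return {tag: passed[tag] / total[tag] for tag in total}


CASES = [
    Case("plain", "this is good", "positive"),
    Case("plain", "this is bad", "negative"),
    Case("negation", "this is not good", "negative"),
    Case("negation", "not good at all", "negative"),
]

baseline = score("Classify the sentiment.", CASES, fake_model)
candidate = score("Classify the sentiment. Watch for negation.", CASES, fake_model)

for tag in baseline:
    delta = candidate[tag] - baseline[tag]
    print(f"{tag}: {baseline[tag]:.2f} -> {candidate[tag]:.2f} ({delta:+.2f})")
```

With a real model the outputs are noisy, so a trustworthy version would also need enough cases per tag and repeated runs before a delta means anything.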

The reason is the same as the MCP gap, mostly: I can ship without it, by accepting some volatility in output quality. The cost of building it is real, and the immediate payback is unclear because most of the prompts I write today aren’t surviving long enough to merit the investment.

I think that reasoning is wrong. Eventually I should build the harness. The longer I wait, the more compounding benefit I lose.

Multi-agent coordination patterns

I have one subagent in regular use (the security reviewer) and a handful of other agents that occasionally spawn for parallel work. I have not built anything that would qualify as a coordinated multi-agent system: multiple specialized agents that pass work among themselves, negotiate over shared state, or run in long-running collaborative loops.

The literature here is busy. There are papers, frameworks, and arguments. Some of them seem genuinely promising; others seem like solutions in search of problems. I don’t have the experience to tell the difference.

What I’d want to know better:

  • When does the multi-agent coordination overhead pay off, versus a single agent with the right rules and skills?
  • What patterns hold up under real workloads, and which ones look good in demos and fall apart in production?
  • How do coordination failures show up, and how do you debug them?

Until I have answers I can defend, I’m staying in the single-parent-agent-with-occasional-subagents pattern, which has been working. Moving to a more elaborate setup before I understand the tradeoffs would be premature.

Fine-tuning — never seriously tried

I have done zero fine-tuning of language models. I read about it. I see other people doing it. I have never produced a fine-tuned model myself, even for a hobby project.

Why the gap exists: every time I’ve considered fine-tuning, I’ve asked “would a better prompt or a smaller model with the right context get me 90% of the way there?” and the answer has been yes. Fine-tuning has stayed below the threshold of “the obvious next move.”

That may continue to be true. It also may stop being true at some point — for tasks where the right prompt is too long, or where the latency budget rules out a large model, or where the use case is repetitive enough that a specialized small model would clearly win.

What I want to know: which signals indicate that fine-tuning is the right move? And conversely, which apparent signals are red herrings that lead people to fine-tune when they shouldn’t?

[Image: an antique-style logbook page in sepia ink, divided into three columns of small cards: four fully inked sketches in the first, two partially drawn sketches in the second, and four faint dashed outlines with question-mark glyphs in the third.]
Three columns, no due dates. The point isn't to clear the page — it's to be honest about which territory each thing belongs to.

What I’m doing about it

I’m not doing nothing, but I’m also trying not to dabble shallowly in all of it. The first area I should genuinely commit to is evaluation infrastructure, because better evals make better prompt engineering possible, and better prompt engineering makes better skills and subagents. After that, probably LangGraph internals, because the project I work on runs on LangGraph every day.

MCP servers, multi-agent coordination, and fine-tuning can wait until they’re closer to my current work.

A small request

If any of this is something you know well, I want to learn from you. Nobody has a complete map of this territory yet, and pretending otherwise ages badly.

If you have something to teach me — about MCP, LangGraph, evaluation, prompts, anything else on the list — please send it. I’ll read it carefully.