Random Thoughts

AI and knowledge work

What I haven't learned yet

Friday, May 8, 2026

  • human-written
  • #ai
  • #learning
  • #developer-experience
  • #community
  • #mcp
  • #langgraph
  • #prompt-engineering
  • #openai
  • #anthropic
  • #python
[Image: a hand-drawn 17th-century-style map in sepia on aged parchment. Some landmasses are fully inked and surveyed in fine detail; others are only faint dashed outlines marked with question marks, sea monsters rising from the water beside them.]
The inked regions are where I can write confidently. The dashed regions are where I either copy charts or hand off. The whole point of this post is to name the second territory out loud.

I’ve spent the last few weeks writing posts about how I use AI coding agents. Rules. Skills. Subagents. Workflows. Each post was confident enough to read like I knew what I was doing. Most of the time I do, more or less. But there’s a real list of things I’m still confused about, and I’ve been hesitating to write about them because they don’t fit the tone of “here’s what works for me.”

I’m going to write that list anyway. Not because vulnerability is content — it can be, and that’s a different post — but because I think the list itself is interesting. The shape of what an early-but-not-beginner practitioner is still figuring out probably says something useful about where the field actually is.

If you know any of this better than I do — and several of you almost certainly do — I’d be glad to hear from you. The contact channels for the blog are at the bottom; the comment thread is open under each post. Honest corrections are the most valuable thing I get out of writing publicly.

This post is tagged human-written. No assistant helped me put this list together. I wanted the discomfort to be on the page exactly as I’m carrying it.

MCP servers — I’ve barely scratched the surface

MCP — the Model Context Protocol — is the thing that lets AI agents talk to external systems through a standardized interface. It’s been around for a little while now. People are building servers for everything: filesystems, databases, calendars, search engines, custom internal APIs. An agent connects to those servers and gains a structured way to act on systems outside its own context.

I’ve used MCP servers — the Slack MCP, the Context7 documentation MCP, a couple of others. They work. They’re useful when they’re already configured.

What I haven’t done is write one. I haven’t sat down and built an MCP server for a custom internal system, exposed the right surface area, thought through authentication and rate limiting and error handling. The reason is honest: I haven’t yet needed one badly enough to push through the activation energy. The existing MCPs cover most of what I want.

What I suspect I’m missing is the moment where writing a custom MCP becomes the obviously-right move for a specific problem. I imagine that moment is when an agent needs to operate on an internal system more deeply than a script-based skill can manage. But I don’t have first-hand experience there yet.

The shape of my gap: I can use MCP. I cannot yet design one. The conceptual model — server-side state, tool registration, resource versus tool versus prompt distinctions, how clients negotiate capabilities — is still a little hazy in my head.
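
From skimming the SDK docs, the entry point at least looks smaller than my hazy model suggests. A minimal sketch using the official Python SDK’s FastMCP interface; the “internal-tickets” server and its lookup_ticket tool are hypothetical, and the hard parts are stubbed out:

```python
# A minimal MCP server sketch with the official Python SDK.
# "internal-tickets" and lookup_ticket are hypothetical examples.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tickets")

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return a short summary of an internal ticket."""
    # A real server would call the internal API here and deal with
    # auth, rate limiting, and error handling: the parts I haven't
    # thought through yet.
    return f"Ticket {ticket_id}: (stub) status unknown"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, for local clients
```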

If you’ve built an MCP server and have opinions about when it’s the right tool versus when a plain script-skill is enough, please tell me. The current literature is mostly tutorials; what I want is the seasoned-engineer take.

LangGraph — I copied patterns without fully understanding the runtime

This is an embarrassing one. The agent project I work on is built on LangGraph — a Python framework for building stateful, multi-step AI agents using a graph of nodes. I’ve contributed plenty of code to that project. The graph runs. Things go through it.

But if I’m honest, my mental model of how the runtime actually works is fuzzier than it should be. Specifically:

  • Checkpointing. I know LangGraph persists state to a checkpointer (Postgres in our case). I know the format is roughly msgpack. I do not have a deep model of how the checkpointer reconciles concurrent updates, how the channel system interacts with the runtime in edge cases, what the failure modes are when the graph is interrupted mid-step.
  • Streaming and interruption. The graph supports streaming intermediate results and human-in-the-loop interrupts. I’ve used both. I don’t have a clear sense of the exact lifecycle — what gets saved, when, in what order — when an interrupt happens mid-node.
  • Subgraph composition. I’ve written graphs that use subgraphs. I roughly know how state flows in and out. The exact semantics of channel mapping at the boundary, especially when the parent and subgraph have overlapping channel names, I’ve worked around rather than understood.
  • The serialization layer. Custom Pydantic models with non-trivial fields (datetimes, enums, nested models) sometimes don’t round-trip cleanly through the checkpointer. I have a few workarounds. I do not have a unified model of why certain shapes break. (A minimal diagnostic for this is sketched just after this list.)
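
Before blaming the checkpointer, the first thing worth doing is ruling out Pydantic itself: if a model can’t survive its own dump/validate cycle, the checkpointer never had a chance. A minimal round-trip check, with a hypothetical State model standing in for the real ones:

```python
# Rule out Pydantic before suspecting the checkpointer's serde.
# The State model here is a hypothetical stand-in.
from datetime import datetime, timezone
from enum import Enum
from pydantic import BaseModel

class Status(str, Enum):
    OPEN = "open"
    DONE = "done"

class State(BaseModel):
    created_at: datetime
    status: Status

original = State(created_at=datetime.now(timezone.utc), status=Status.OPEN)
restored = State.model_validate_json(original.model_dump_json())
assert restored == original, "round-trip changed the value"
```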

I’ve shipped working code despite these gaps. The pattern is: copy a shape that works elsewhere in the codebase, adapt it to my case, run the tests. When something is weird, ask the agent or a teammate. One teammate in particular has a much deeper model of LangGraph than I do, and several of these gaps closed only when they explained the runtime to me.

That’s not a sustainable model long-term. The fact that I can ship without a full mental model suggests the framework is well-designed; it doesn’t mean I can keep avoiding the depth forever.

The work to fix this is mine. It looks like reading the LangGraph source, building a few toy graphs from scratch (not adapted from existing ones), and probably writing a few notes-to-self that I’d be embarrassed to publish.
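
For concreteness, the kind of toy graph I mean: built from scratch against the StateGraph API, with an in-memory checkpointer, and small enough to inspect exactly what gets saved per step. A sketch; the counter state is deliberately trivial.

```python
# A from-scratch toy: a one-node graph with an in-memory checkpointer,
# small enough to see what the runtime persists on each step.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    count: int

def increment(state: State) -> dict:
    return {"count": state["count"] + 1}

builder = StateGraph(State)
builder.add_node("increment", increment)
builder.add_edge(START, "increment")
builder.add_edge("increment", END)

graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "toy-1"}}
print(graph.invoke({"count": 0}, config))  # {'count': 1}

# The checkpointer now holds per-step snapshots for this thread:
for snapshot in graph.get_state_history(config):
    print(snapshot.values)
```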

Prompt engineering subtleties I keep getting wrong

I’ve written a lot of prompts. Some of them I’m proud of. Some of them I look at six months later and don’t recognize.

The pattern in the ones that don’t age well is usually that I overspecified at the wrong layer. I added constraints that were too brittle, examples that were too specific, instructions that contradicted each other in edge cases I didn’t anticipate. The prompt worked for the case I wrote it for and produced surprising failures when the input drifted.

Specific things I keep getting wrong:

  • Order of instructions. I know it matters. I do not have a reliable model of which order beats which other order, beyond “important things first, examples after instructions.” Different model families seem to react differently to ordering choices, and I haven’t built up the intuition for each.
  • When to use few-shot examples vs. when they hurt. Few-shot examples often help. They also sometimes anchor the model to surface patterns that don’t generalize. I’ve watched the same examples improve the output in one task and degrade it in another. I cannot reliably predict which.
  • Negative instructions. “Don’t do X” is famously fragile. The model often does X anyway, or avoids X by also avoiding Y, where Y was the thing I actually wanted. I have heuristics (“rephrase as positive instructions when possible”) but no deep model.
  • The temperature/sampling tradeoffs. I usually leave the defaults. I know there are tasks where lower temperature would help and others where higher would help. I do not run my own evaluations to find out which.
  • Reasoning vs. non-reasoning models. The newest reasoning models behave differently from instruction-tuned ones. Prompts that worked great on the older generation sometimes underperform on the new one. I’ve adapted by trial and error rather than by understanding.

The thread connecting these gaps is that I’ve been writing prompts as a craft — by feel, with iteration. I haven’t built up the equivalent of an empirical practice, where I run controlled comparisons and learn from the results. That practice is what would turn the craft into engineering. I haven’t started it.
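
Starting it could be smaller than I keep imagining. Something like this, sketched with the OpenAI Python client: the same inputs against two prompt variants, scored crudely. The model name, the inputs, and the one-sentence check are all hypothetical stand-ins.

```python
# The smallest controlled comparison I can picture: same inputs, two
# prompt variants, one crude score. Everything here is a stand-in.
from openai import OpenAI

client = OpenAI()
inputs = ["refund request, angry tone", "refund request, neutral tone"]
variants = {
    "positive": "Summarize the message in one sentence.",
    "negative": "Summarize the message. Don't exceed one sentence.",
}

for name, system_prompt in variants.items():
    wins = 0
    for text in inputs:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
            ],
        )
        output = resp.choices[0].message.content
        wins += output.count(".") <= 1  # crude one-sentence check
    print(f"{name}: {wins}/{len(inputs)}")
```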

Evaluation infrastructure I haven’t really invested in

This is the meta-version of the previous gap. Building rigorous evaluation harnesses — small suites of test inputs that score model outputs against expected behaviors — is the mature-engineer way to get past prompt-engineering-by-feel.

I have one or two of these. Mostly small, mostly informal. They catch obvious regressions and miss subtle ones.

What I haven’t built is the infrastructure to run a meaningful evaluation routinely: a tagged dataset of representative inputs, scoring rubrics that produce comparable numbers across runs, a way to see “this prompt change improved the score by X” in a way I can actually trust.
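
The plumbing for that is small, which makes the procrastination more embarrassing. A pure-Python sketch of the skeleton; the cases, tags, and stand-in scorer are all hypothetical, and the hard part (the rubric) is exactly what the stub hides:

```python
# Skeleton of a tagged eval dataset with comparable scores across runs.
# The cases and the scorer are hypothetical; run_prompt stands in for
# the real model call.
import json
import statistics

CASES = [
    {"tags": ["refund"], "input": "I want my money back", "expected": "refund"},
    {"tags": ["greeting"], "input": "hi there", "expected": "other"},
]

def run_prompt(text: str) -> str:
    # Stand-in for the real model call.
    return "refund" if "money" in text else "other"

def score(case: dict) -> float:
    return 1.0 if run_prompt(case["input"]) == case["expected"] else 0.0

scores = [score(c) for c in CASES]
print(json.dumps({"mean": statistics.mean(scores), "n": len(scores)}))
```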

The reason is the same as the MCP gap, mostly: I can ship without it, by accepting some volatility in output quality. The cost of building it is real, and the immediate payback is unclear because most of the prompts I write today aren’t surviving long enough to merit the investment.

I think this is wrong. Eventually I should build the harness. The longer I wait, the more compounding I lose.

Multi-agent coordination patterns

I have one subagent in regular use (the security reviewer) and a handful of other agents that occasionally spawn for parallel work. I have not built anything that would qualify as a coordinated multi-agent system — multiple specialized agents that pass work between each other, negotiate over shared state, or run in long-running collaborative loops.

The literature here is busy. There are papers, frameworks, and arguments. Some of them seem genuinely promising; others seem like solutions in search of problems. I don’t have the experience to tell the difference.

What I’d want to know better:

  • When does the multi-agent coordination overhead pay off, versus a single agent with the right rules and skills?
  • What patterns hold up under real workloads, and which ones look good in demos and fall apart in production?
  • How do coordination failures show up, and how do you debug them?

Until I have answers I can defend, I’m staying in the single-parent-agent-with-occasional-subagents pattern, which has been working. Moving to a more elaborate setup before I understand the tradeoffs would be premature.

Fine-tuning — never seriously tried

I have done zero fine-tuning of language models. I read about it. I see other people doing it. I have never produced a fine-tuned model myself, even for a hobby project.

Why the gap exists: every time I’ve considered fine-tuning, the answer to “would a better prompt or a smaller model with the right context get me 90% of the way there?” has been yes. Fine-tuning has stayed below the threshold of “the obvious next move.”

That may continue to be true. It also may stop being true at some point — for tasks where the right prompt is too long, or where the latency budget rules out a large model, or where the use case is repetitive enough that a specialized small model would clearly win.

What I want to know: which signals indicate that fine-tuning is the right move? And conversely, which apparent signals are red herrings that lead people to fine-tune when they shouldn’t?

[Image: a 17th-century-style logbook page in sepia, divided into three columns of small claim-cards. The first column’s cards are fully inked landmarks; the second’s are half-drawn works in progress; the third’s are only faint dashed silhouettes, each marked with a question mark.]
Three columns, no due dates. The point isn't to clear the page — it's to be honest about which territory each thing belongs to.

What I’m doing about it

I’m not doing nothing. I’ve been chipping away at these gaps with low-friction reading and small experiments. I’ve started — and abandoned — a couple of toy projects to probe specific pieces of LangGraph that I don’t understand.

But I’ve also been deliberate about not rushing. Each of these areas needs sustained investment, and I’d rather pick one and get to real fluency than dabble in all of them and remain shallow across the board.

The first one I’ll genuinely commit to is probably evaluation infrastructure. It’s the one whose payback compounds across every other gap on this list. Better evals make better prompt engineering possible. Better prompt engineering makes better skills and subagents. The whole stack levels up if the bottom layer (objective measurement) gets stronger.

After that, probably the LangGraph internals — because the project I work on uses them daily, and the gap there has the most direct cost on my work.

MCP servers, multi-agent coordination, fine-tuning — those will wait. They’re not adjacent enough to my current work to merit the investment yet.

A small request

I’ll close with the same request I started with. If any of this is something you know well, I want to learn from you.

The community of practitioners working on this stuff seriously is not large. Each of us is sitting on private knowledge that would be valuable to the rest. I’ve tried to share what I know in this series. I’d be glad to be on the receiving end of corrections, in particular for the parts of the list above where I’m clearly fumbling.

The honest version is that nobody, today, has a complete map of the territory. We’re all early. Pretending otherwise is the kind of thing that ages badly. So I’m naming what I haven’t learned yet, on purpose, because I’d rather be on the receiving end of useful corrections than on the receiving end of “you confidently posted that and you were wrong.”

If you’ve read this far and you have something to teach me — about MCP, LangGraph, evaluation, prompts, anything else on the list — please send it. I’ll read it carefully.

Further reading