Random Thoughts

Developer tooling

Behavioral guardrails for AI coding assistants

Thursday, April 30, 2026

  • ai-assisted
  • #ai
  • #ai-agents
  • #vibecoding
  • #developer-experience
  • #best-practices
  • #karpathy
  • #prompt-engineering
  • #code-quality
  • #cursor
  • #claude
[Header illustration: a pen-and-ink rendering of four fluted Doric columns, each bearing a distinct carved medallion (a thoughtful head, a pruned branch, a scalpel, a laurel wreath around a target), flanking a stone road that recedes to a single vanishing point.]
Four behaviors, four columns. The road runs between them. The agent walks through.

I have an always-on rule file at the top of every project I work in. It’s about a hundred lines long. It doesn’t describe my tech stack, my coding style, or my project conventions. It describes four behaviors I want the AI coding agent to follow when it’s writing or reviewing code.

The rule is loaded into every conversation. Every model I use reads it before doing anything. The behaviors are deliberately tool-agnostic — they don’t depend on which AI agent I’m running today, or which language model is behind it. They target the kinds of mistakes any LLM-assisted coding workflow makes when you don’t push back.

Each behavior came from watching the agent get something wrong, more than once, in a way that felt structural rather than incidental. This post is about those four behaviors, why they’re each worth their own section in the rule file, and the work each one does in practice.

Where the four came from

The four are loosely adapted from observations Andrej Karpathy has shared about the kinds of pitfalls LLMs fall into when writing code. The wording is mine, the framing is mine, but the underlying observation is his: there are recurring failure modes that are predictable once you know what to look for, and worth naming explicitly so the agent (and you) can avoid them.

I’ll walk through each one with the wording I actually use, the failure mode it addresses, and the kind of work it does on real tasks.
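For orientation, the overall shape of such a file can be sketched like this. This is a hypothetical layout, not my literal file; the section slogans, the trivial-task escape hatch, and the closing check are the ones quoted throughout this post:

```markdown
# Always-on behavioral guardrails

For trivial tasks (renames, typo fixes), use judgment and skip the ceremony.

## 1. Think Before Coding
Don't assume. Don't hide confusion. Surface tradeoffs.

## 2. Simplicity First
Minimum code that solves the problem. Nothing speculative.

## 3. Surgical Changes
Touch only what you must. Clean up only your own mess.

## 4. Goal-Driven Execution
Define success criteria. Loop until verified.

## Success check
These guidelines are working if: fewer unnecessary changes in diffs,
fewer rewrites due to overcomplication, and clarifying questions come
before implementation rather than after mistakes.
```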

1. Think Before Coding

Don’t assume. Don’t hide confusion. Surface tradeoffs.

The failure mode this addresses is the agent’s bias toward action. Ask a vague question and the agent starts producing diffs. By the time you realize the question had multiple valid answers, you’ve got a half-implemented version of one of them.

The rule’s body, in the file:

Before implementing:
- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them — don't pick silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.

Four lines that change the behavior in a noticeable way.

The first line — state your assumptions explicitly — is the one I underweighted at first and now consider essential. The agent always has assumptions. Without the rule, those assumptions stay implicit and you only discover them when the diff doesn’t do what you expected. With the rule, the agent says them out loud. “I’m assuming the input is already validated upstream. Is that correct?” Five seconds to confirm or correct, instead of fifteen minutes of debugging later.

The third line — push back when warranted — was harder to make stick. The default mode of an AI coding agent is helpful agreement. Even when the request is dumb, the agent wants to do it. The rule explicitly tells it to push back. The result, after a few sessions of getting used to it, is that the agent will sometimes say “the simpler approach here is X; do you want that instead?” and I’ll realize my original framing was overcomplicated. That’s the agent doing its job.

The fourth line is the one I add for myself as much as for the model. If something is unclear, stop. Name what’s confusing. Once the agent stops trying to fake fluency, the conversation gets faster, not slower. Confusion that gets named gets resolved. Confusion that gets hidden becomes bugs.

2. Simplicity First

Minimum code that solves the problem. Nothing speculative.

This addresses the most common failure I’ve seen, by a wide margin: the agent writes more code than the problem needs. Abstractions for single-use functions. Configuration options for things that will never be configured. Error handling for cases that can’t happen. A pattern of “what if we needed to extend this later?” applied to code that will never be extended.

The rule’s body:

- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.

Ask yourself: "Would a senior engineer say this is overcomplicated?"
If yes, simplify.

The “would a senior engineer say this is overcomplicated?” check is the one that does the most work. It’s a single mental test the agent can apply to its own output before committing it. The model is good at running that test on itself once it’s been told to. Without the prompt, it doesn’t run it at all.
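A minimal, hypothetical illustration of the difference (the names are mine, not from any real diff): the same requirement (reject empty usernames and usernames containing spaces) written two ways, the way the agent tends to write it and the minimum the rule asks for.

```python
# Speculative version: configurability nobody asked for.
class UsernameValidator:
    def __init__(self, min_length=1, max_length=None, allow_spaces=False):
        self.min_length = min_length
        self.max_length = max_length
        self.allow_spaces = allow_spaces

    def validate(self, username: str) -> bool:
        if len(username) < self.min_length:
            return False
        if self.max_length is not None and len(username) > self.max_length:
            return False
        if not self.allow_spaces and " " in username:
            return False
        return True

# What was actually asked for: minimum code that solves the problem.
def is_valid_username(username: str) -> bool:
    return bool(username) and " " not in username
```

Both versions pass the same checks today; only one of them has to be read, maintained, and configured tomorrow.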

I’ll write a whole post tomorrow about what this looks like in practice — the specific shapes of overengineering, the before/after diffs, the moments where I had to push back on a 200-line solution that wanted to be 50. The short version: this rule, on its own, changed more about the quality of generated code than any other single change I made.

3. Surgical Changes

Touch only what you must. Clean up only your own mess.

This addresses an opposite-flavored failure: when asked to change one thing, the agent fixes seven other things it didn’t need to. Reformatting that adjacent function. Updating that comment that was technically slightly out of date. Renaming that variable while it was in there.

Each individual touch is well-intentioned. The aggregate is a diff that’s hard to review, hard to revert, and hard to explain in a commit message.

The rule:

When editing existing code:
- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it — don't delete it.

When your changes create orphans:
- Remove imports/variables/functions that YOUR changes
  made unused.
- Don't remove pre-existing dead code unless asked.

The test: every changed line should trace directly to the
user's request.

That last sentence is the test. Every changed line should trace directly to the user’s request. You can apply it line by line to a diff. If you can’t trace a line back to the request, it shouldn’t be in the diff.

The most common pushback against this rule is “but the adjacent code really was wrong.” It probably was. The fix isn’t to silently include it in this diff. The fix is to mention it (“I noticed helper_x is unused; happy to remove it in a separate change if you want”) and move on. That separation makes review possible and revert safe. It’s the engineering practice the rule is trying to encode.

There’s a second clause that matters: clean up only your own mess. If a change you made leaves an import unused, remove it. That’s yours. If an import was already unused before your change, that’s pre-existing — leave it. The principle keeps the diff scope-tight in both directions.
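A small sketch of the orphan rule in Python (the module and the change are invented for illustration). Suppose the request was "replace the JSON config format with key=value lines":

```python
# Before the change, the module looked like this:
#
#     import os    # pre-existing and already unused: NOT yours to remove
#     import json  # used only by parse_config
#
#     def parse_config(text):
#         return json.loads(text)
#
# After the change, `json` is an orphan YOUR edit created, so it goes.
# The already-dead `import os` stays (mention it, don't delete it).

def parse_config(text: str) -> dict:
    # One key=value pair per line; blank lines are ignored.
    return dict(line.strip().split("=", 1)
                for line in text.splitlines() if line.strip())
```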

[Illustration: a pen-and-ink diptych of two classical entablatures of equal width; the left is crowded with ornament that obscures the beam, the right is trimmed to three triglyphs aligned with the structural rhythm.]

Same architrave, half the carving. Every chiseled mark should trace directly to the structure.

4. Goal-Driven Execution

Define success criteria. Loop until verified.

The fourth behavior addresses what happens at the end of a task. Without this rule, the agent finishes when it runs out of things to do. With this rule, the agent finishes when the task is verified done.

The rule:

Transform tasks into verifiable goals:
- "Add validation" → "Write tests for invalid inputs,
  then make them pass"
- "Fix the bug" → "Write a test that reproduces it,
  then make it pass"
- "Refactor X" → "Ensure tests pass before and after"

For multi-step tasks, state a brief plan:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]

Strong success criteria let you loop independently.
Weak criteria ("make it work") require constant clarification.

The reframing in the first block is the move. “Add validation” by itself is open-ended; the agent will produce something validation-shaped and stop. “Write tests for invalid inputs, then make them pass” gives the agent both the implementation goal and the verification criterion. The agent now has an objective measure of done.
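In Python, the reframed version of "add validation" might look like the following hypothetical sketch, where the tests are written first and define done:

```python
# Step 1: write the tests for invalid inputs. They are the success criterion.
def test_rejects_nonpositive_quantity():
    assert not is_valid_order({"quantity": 0})
    assert not is_valid_order({"quantity": -3})

def test_accepts_positive_quantity():
    assert is_valid_order({"quantity": 2})

# Step 2: make them pass, and no more than that.
def is_valid_order(order: dict) -> bool:
    return order.get("quantity", 0) > 0

# The loop terminates when the checks pass,
# not when the agent runs out of things to do.
test_rejects_nonpositive_quantity()
test_accepts_positive_quantity()
```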

For multi-step tasks, the plan-with-verification format is the structural fix. Each step has its own check. The agent doesn’t move to step two until step one passes. This sounds obvious in retrospect; without the rule it doesn’t happen, because the agent will happily stack four uncertain steps on top of each other and call the result “done.”

The last sentence is the diagnostic. If you have to keep clarifying “is it actually working?” after the agent claims done, your success criteria were weak. The rule makes you (and the agent) commit to stronger criteria up front, so the verification loop runs without you babysitting it.

Why these four and not others

I tried longer lists. They didn’t survive contact with reality.

A list of fifteen behaviors is a list nobody reads. The agent loads it, sure, but a long list dilutes any individual instruction. The four-item version is short enough that the agent treats each one as binding rather than as a suggestion in a sea of suggestions.

I also tried adding more specific behaviors — “always write tests,” “prefer functional over imperative style,” “use type hints.” Each of those is sometimes right, often wrong, and depends on the project. Putting them in the always-on rule made the rule less universal and less useful. They belong in project-specific rules, where the context is appropriate.

The four behaviors that stayed are the ones that apply regardless of stack, project, or task. Think before you act. Don’t overengineer. Touch what you must, no more. Define done. Those four hold across every codebase I’ve ever worked in.

A short closing tradeoff

These guidelines bias the agent toward caution over speed. For trivial tasks — renaming a variable, fixing a typo — they’re overhead. The rule file says so explicitly, with one line at the top: “For trivial tasks, use judgment.”

That escape hatch matters. A rule that turns a one-line fix into a multi-step planning exercise is making the system slower, not better. The point of the rule is to install good defaults for the non-trivial case, where the agent’s tendency to act prematurely or overengineer would otherwise produce real damage. On trivial tasks, the agent recognizes the lower stakes and skips the ceremony.

How I’d know the rule is working

The rule file ends with a short success criterion of its own:

These guidelines are working if: fewer unnecessary changes
in diffs, fewer rewrites due to overcomplication, and
clarifying questions come before implementation rather than
after mistakes.

That’s the test. If diffs are tighter, if I’m rewriting less, and if the agent is asking the smart question at the start of a task instead of after — the rule is doing its job. If those signals aren’t moving, the rule isn’t earning its place in the always-on context.

So far, on the projects where I’ve used it, the signals have moved in the right direction. Not all the way. The agent still occasionally produces a 200-line solution that wants to be 50, and I still catch myself accepting too-clever changes that violate the surgical-changes principle. But the floor is meaningfully higher than it was before, which is what guardrails are for.

Further reading