Behavioral guardrails for AI coding assistants
I have an always-on rule file at the top of every project I work in. It’s about a hundred lines long. It doesn’t describe my tech stack, my coding style, or my project conventions. It describes four behaviors I want the AI coding agent to follow when it’s writing or reviewing code.
The rule is loaded into every conversation. Every model I use reads it before doing anything. The behaviors are deliberately tool-agnostic — they don’t depend on which AI agent I’m running today, or which language model is behind it. They’re about the same kinds of mistakes that any LLM-assisted coding workflow makes when you don’t push back.
Each behavior came from watching the agent get something wrong, more than once, in a way that felt structural rather than incidental. This post is about those four behaviors, why they’re each worth their own section in the rule file, and the work each one does in practice.
Where the four came from
The four are loosely adapted from observations Andrej Karpathy has shared about the kinds of pitfalls LLMs fall into when writing code. The wording is mine, the framing is mine, but the underlying observation is his: there are recurring failure modes that are predictable once you know what to look for, and worth naming explicitly so the agent (and you) can avoid them.
I’ll walk through each one with the wording I actually use, the failure mode it addresses, and the kind of work it does on real tasks.
1. Think Before Coding
Don’t assume. Don’t hide confusion. Surface tradeoffs.
The failure mode this addresses is the agent’s bias toward action. Ask a vague question and the agent starts producing diffs. By the time you realize the question had multiple valid answers, you’ve got a half-implemented version of one of them.
The rule’s body, in the file:
Before implementing:
- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them — don't pick silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.
Four short lines, and they change the agent’s behavior in a noticeable way.
The first line — state your assumptions explicitly — is the one I underweighted at first and now consider essential. The agent always has assumptions. Without the rule, those assumptions stay implicit and you only discover them when the diff doesn’t do what you expected. With the rule, the agent says them out loud. “I’m assuming the input is already validated upstream. Is that correct?” Five seconds to confirm or correct, instead of fifteen minutes of debugging later.
The third line — push back when warranted — was harder to make stick. The default mode of an AI coding agent is helpful agreement. Even when the request is dumb, the agent wants to do it. The rule explicitly tells it to push back. The result, after a few sessions of getting used to it, is that the agent will sometimes say “the simpler approach here is X; do you want that instead?” and I’ll realize my original framing was overcomplicated. That’s the agent doing its job.
The fourth line is the one I add for myself as much as for the model. If something is unclear, stop. Name what’s confusing. Once the agent stops trying to fake fluency, the conversation gets faster, not slower. Confusion that gets named gets resolved. Confusion that gets hidden becomes bugs.
2. Simplicity First
Minimum code that solves the problem. Nothing speculative.
This addresses the most common failure I’ve seen, by a wide margin: the agent writes more code than the problem needs. Abstractions for single-use functions. Configuration options for things that will never be configured. Error handling for cases that can’t happen. A pattern of “what if we needed to extend this later?” applied to code that will never be extended.
The rule’s body:
- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.
Ask yourself: "Would a senior engineer say this is overcomplicated?"
If yes, simplify.
The “would a senior engineer say this is overcomplicated?” check is the one that does the most work. It’s a single mental test the agent can apply to its own output before committing it. The model is good at running that test on itself once it’s been told to. Without the prompt, it doesn’t run it at all.
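To make that test concrete, here is a hypothetical before/after sketch in Python. The task ("read the timeout value from a JSON config file, defaulting to 30") and every name in it (ConfigLoader, read_timeout) are invented for illustration; the first version packs in the shapes from the list above, the second is what was actually asked for.

# Hypothetical task: "read the 'timeout' value from a JSON config file,
# defaulting to 30". All names below are invented for illustration.

import json
from pathlib import Path

# Overcomplicated shape: a single-use abstraction, configurability that
# wasn't requested, and defensive branches nobody asked for.
class ConfigLoader:
    def __init__(self, path, encoding="utf-8", default_timeout=30, strict=False):
        self.path = Path(path)
        self.encoding = encoding
        self.default_timeout = default_timeout
        self.strict = strict

    def load(self):
        if not self.path.exists():
            if self.strict:
                raise FileNotFoundError(self.path)
            return {}
        return json.loads(self.path.read_text(encoding=self.encoding))

    def timeout(self):
        value = self.load().get("timeout", self.default_timeout)
        if not isinstance(value, int):
            raise TypeError("timeout must be an integer")
        return value

# What the request actually needed:
def read_timeout(path):
    with open(path) as f:
        return json.load(f).get("timeout", 30)

The second version is the one the senior-engineer test keeps.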
I’ll write a whole post tomorrow about what this looks like in practice — the specific shapes of overengineering, the before/after diffs, the moments where I had to push back on a 200-line solution that wanted to be 50. The short version: this rule, on its own, changed more about the quality of generated code than any other single change I made.
3. Surgical Changes
Touch only what you must. Clean up only your own mess.
This addresses an opposite-flavored failure: when asked to change one thing, the agent fixes seven other things it didn’t need to. Reformatting that adjacent function. Updating that comment that was technically slightly out of date. Renaming that variable while it was in there.
Each individual touch is well-intentioned. The aggregate is a diff that’s hard to review, hard to revert, and hard to explain in a commit message.
The rule:
When editing existing code:
- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it — don't delete it.
When your changes create orphans:
- Remove imports/variables/functions that YOUR changes
made unused.
- Don't remove pre-existing dead code unless asked.
The test: every changed line should trace directly to the
user's request.
That last sentence is the test. Every changed line should trace directly to the user’s request. You can apply it line by line to a diff. If you can’t trace a line back to the request, it shouldn’t be in the diff.
The most common pushback against this rule is “but the adjacent code really was wrong.” It probably was. The fix isn’t to silently include it in this diff. The fix is to mention it (“I noticed helper_x is unused; happy to remove it in a separate change if you want”) and move on. That separation makes review possible and revert safe. It’s the engineering practice the rule is trying to encode.
There’s a second clause that matters: clean up only your own mess. If a change you made leaves an import unused, remove it. That’s yours. If an import was already unused before your change, that’s pre-existing — leave it. The principle keeps the diff scope-tight in both directions.
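As a small hypothetical illustration of both directions (the module and every name in it are invented), suppose the request was "stop logging the request payload in handle()":

# Before the change. All names are invented for illustration.
import json      # used only by the logging line below
import logging
import os        # already unused before this change (pre-existing dead code)

logger = logging.getLogger(__name__)

def handle(payload):
    logger.info("payload: %s", json.dumps(payload))
    return {"status": "ok", "items": len(payload)}

# After the change. The json import goes: this edit made it unused.
# The os import stays: it was dead before the edit, so it gets mentioned
# in the reply, not deleted in this diff.
import logging
import os

logger = logging.getLogger(__name__)

def handle(payload):
    return {"status": "ok", "items": len(payload)}

Every changed line (the logging call and the json import) traces back to the request; the os import does not, so it stays out of the diff.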
4. Goal-Driven Execution
Define success criteria. Loop until verified.
The fourth behavior addresses what happens at the end of a task. Without this rule, the agent finishes when it runs out of things to do. With this rule, the agent finishes when the task is verifiably done.
The rule:
Transform tasks into verifiable goals:
- "Add validation" → "Write tests for invalid inputs,
then make them pass"
- "Fix the bug" → "Write a test that reproduces it,
then make it pass"
- "Refactor X" → "Ensure tests pass before and after"
For multi-step tasks, state a brief plan:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
Strong success criteria let you loop independently.
Weak criteria ("make it work") require constant clarification.
The reframing in the first block is the move. “Add validation” by itself is open-ended; the agent will produce something validation-shaped and stop. “Write tests for invalid inputs, then make them pass” gives the agent both the implementation goal and the verification criterion. The agent now has an objective measure of done.
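Here is a hypothetical sketch of that reframing in Python, with invented names (parse_age and the test functions), where the tests are the success criterion and the implementation exists to make them pass:

# Hypothetical reframing of "add validation" as verifiable goals.
# parse_age and the tests are invented for illustration; run with pytest.
import pytest

def parse_age(raw):
    """Written to make the tests below pass."""
    value = int(raw)              # non-numeric input raises ValueError
    if not 0 <= value <= 150:
        raise ValueError(f"age out of range: {value}")
    return value

def test_rejects_non_numeric_input():
    with pytest.raises(ValueError):
        parse_age("not a number")

def test_rejects_out_of_range_age():
    with pytest.raises(ValueError):
        parse_age("999")

def test_accepts_valid_age():
    assert parse_age("42") == 42

The agent can loop against those tests on its own; "make it work" gives it nothing to loop against.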
For multi-step tasks, the plan-with-verification format is the structural fix. Each step has its own check. The agent doesn’t move to step two until step one passes. This sounds obvious in retrospect; without the rule it doesn’t happen, because the agent will happily stack four uncertain steps on top of each other and call the result “done.”
The last sentence is the diagnostic. If you have to keep clarifying “is it actually working?” after the agent claims done, your success criteria were weak. The rule makes you (and the agent) commit to stronger criteria up front, so the verification loop runs without you babysitting it.
Why these four and not others
I tried longer lists. They didn’t survive contact with reality.
A list of fifteen behaviors is a list nobody reads. The agent loads it, sure, but a long list dilutes any individual instruction. The four-item version is short enough that the agent treats each one as binding rather than as a suggestion in a sea of suggestions.
I also tried adding more specific behaviors — “always write tests,” “prefer functional over imperative style,” “use type hints.” Each of those is sometimes right, often wrong, and depends on the project. Putting them in the always-on rule made the rule less universal and less useful. They belong in project-specific rules, where the context is appropriate.
The four behaviors that stayed are the ones that apply regardless of stack, project, or task. Think before you act. Don’t overengineer. Touch what you must, no more. Define done. Those four hold across every codebase I’ve ever worked in.
A short closing tradeoff
These guidelines bias the agent toward caution over speed. For trivial tasks — renaming a variable, fixing a typo — they’re overhead. The rule file says so explicitly, with one line at the top: “For trivial tasks, use judgment.”
That escape hatch matters. A rule that turns a one-line fix into a multi-step planning exercise is making the system slower, not better. The point of the rule is to install good defaults for the non-trivial case, where the agent’s tendency to act prematurely or overengineer would otherwise produce real damage. On trivial tasks, the agent recognizes the lower stakes and skips the ceremony.
How I’d know the rule is working
The rule file ends with a short success criterion of its own:
These guidelines are working if: fewer unnecessary changes
in diffs, fewer rewrites due to overcomplication, and
clarifying questions come before implementation rather than
after mistakes.
That’s the test. If diffs are tighter, if I’m rewriting less, and if the agent is asking the smart question at the start of a task instead of after — the rule is doing its job. If those signals aren’t moving, the rule isn’t earning its place in the always-on context.
So far, on the projects where I’ve used it, the signals have moved in the right direction. Not all the way. The agent still occasionally produces a 200-line solution that wants to be 50, and I still catch myself accepting too-clever changes that violate the surgical-changes principle. But the floor is meaningfully higher than it was before, which is what guardrails are for.