Behavioral guardrails for AI coding assistants
I have an always-on rule file at the top of every project I work in. It’s about a hundred lines long. It doesn’t describe my tech stack, my coding style, or my project conventions. It describes four behaviors I want the AI coding agent to follow when it’s writing or reviewing code.
The rule is loaded into every conversation. Every model I use reads it before doing anything. The behaviors are deliberately tool-agnostic — they don’t depend on which AI agent I’m running today, or which language model is behind it. They’re about the same kinds of mistakes that any LLM-assisted coding workflow makes when you don’t push back.
Each behavior came from watching the agent get something wrong, more than once, in a way that felt structural rather than incidental. This post is about those four behaviors, why they’re each worth their own section in the rule file, and the work each one does in practice.
Where the four came from
The four are loosely adapted from observations Andrej Karpathy has shared about the kinds of pitfalls LLMs fall into when writing code. The wording is mine, the framing is mine, but the underlying observation is his: there are recurring failure modes that are predictable once you know what to look for, and worth naming explicitly so the agent (and you) can avoid them.
I’ll walk through each one with the wording I actually use, the failure mode it addresses, and the kind of work it does on real tasks.
1. Think Before Coding
Don’t assume. Don’t hide confusion. Surface tradeoffs.
The failure mode this addresses is the agent’s bias toward action. Ask a vague question and the agent starts producing diffs. By the time you realize the question had multiple valid answers, you’ve got a half-implemented version of one of them.
The rule’s body, in the file:
Before implementing:
- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them — don't pick silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.
Four short lines, and they change the agent’s behavior in a noticeable way.
The first line — state your assumptions explicitly — is the one I underweighted at first and now consider essential. The agent always has assumptions. Without the rule, those assumptions stay implicit and you only discover them when the diff doesn’t do what you expected. With the rule, the agent says them out loud. “I’m assuming the input is already validated upstream. Is that correct?” Five seconds to confirm or correct, instead of fifteen minutes of debugging later.
The third line — push back when warranted — was harder to make stick. The default mode of an AI coding agent is helpful agreement. Even when the request is dumb, the agent wants to do it. The rule explicitly tells it to push back. The result, after a few sessions of getting used to it, is that the agent will sometimes say “the simpler approach here is X; do you want that instead?” and I’ll realize my original framing was overcomplicated. That’s the agent doing its job.
The fourth line is the one I add for myself as much as for the model. If something is unclear, stop. Name what’s confusing. Once the agent stops trying to fake fluency, the conversation gets faster, not slower. Confusion that gets named gets resolved. Confusion that gets hidden becomes bugs.
2. Simplicity First
Minimum code that solves the problem. Nothing speculative.
This addresses the most common failure I’ve seen, by a wide margin: the agent writes more code than the problem needs. Abstractions for single-use functions. Configuration options for things that will never be configured. Error handling for cases that can’t happen. A pattern of “what if we needed to extend this later?” applied to code that will never be extended.
The rule’s body:
- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.
Ask yourself: "Would a senior engineer say this is overcomplicated?"
If yes, simplify.
The “would a senior engineer say this is overcomplicated?” check is the one that does the most work. It’s a single mental test the agent can apply to its own output before committing it. The model is good at running that test on itself once it’s been told to. Without the prompt, it doesn’t run it at all.
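To make that test concrete, here is a hypothetical before/after sketch in Python. The task ("read the timeout value from a JSON config file, defaulting to 30") and every name in it (ConfigLoader, read_timeout) are invented for illustration; the first version packs in the shapes from the list above, the second is what was actually asked for.

# Hypothetical task: "read the 'timeout' value from a JSON config file,
# defaulting to 30". All names below are invented for illustration.

import json
from pathlib import Path

# Overcomplicated shape: a single-use abstraction, configurability that
# wasn't requested, and defensive branches nobody asked for.
class ConfigLoader:
    def __init__(self, path, encoding="utf-8", default_timeout=30, strict=False):
        self.path = Path(path)
        self.encoding = encoding
        self.default_timeout = default_timeout
        self.strict = strict

    def load(self):
        if not self.path.exists():
            if self.strict:
                raise FileNotFoundError(self.path)
            return {}
        return json.loads(self.path.read_text(encoding=self.encoding))

    def timeout(self):
        value = self.load().get("timeout", self.default_timeout)
        if not isinstance(value, int):
            raise TypeError("timeout must be an integer")
        return value

# What the request actually needed:
def read_timeout(path):
    with open(path) as f:
        return json.load(f).get("timeout", 30)

The second version is the one the senior-engineer test keeps.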
I’ll write a whole post tomorrow about what this looks like in practice — the specific shapes of overengineering, the before/after diffs, the moments where I had to push back on a 200-line solution that wanted to be 50. The short version: this rule, on its own, changed more about the quality of generated code than any other single change I made.
3. Surgical Changes
Touch only what you must. Clean up only your own mess.
This addresses an opposite-flavored failure: when asked to change one thing, the agent fixes seven other things it didn’t need to. Reformatting that adjacent function. Updating that comment that was technically slightly out of date. Renaming that variable while it was in there.
Each individual touch is well-intentioned. The aggregate is a diff that’s hard to review, hard to revert, and hard to explain in a commit message.
The rule:
When editing existing code:
- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it — don't delete it.
When your changes create orphans:
- Remove imports/variables/functions that YOUR changes
made unused.
- Don't remove pre-existing dead code unless asked.
The test: every changed line should trace directly to the
user's request.
That last sentence is the test. Every changed line should trace directly to the user’s request. You can apply it line by line to a diff. If you can’t trace a line back to the request, it shouldn’t be in the diff.
The most common pushback against this rule is “but the adjacent code really was wrong.” It probably was. The fix isn’t to silently include it in this diff. The fix is to mention it (“I noticed helper_x is unused; happy to remove it in a separate change if you want”) and move on. That separation makes review possible and revert safe. It’s the engineering practice the rule is trying to encode.
There’s a second clause that matters: clean up only your own mess. If a change you made leaves an import unused, remove it. That’s yours. If an import was already unused before your change, that’s pre-existing — leave it. The principle keeps the diff scope-tight in both directions.
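As a small hypothetical illustration of both directions (the module and every name in it are invented), suppose the request was "stop logging the request payload in handle()":

# Before the change. All names are invented for illustration.
import json      # used only by the logging line below
import logging
import os        # already unused before this change (pre-existing dead code)

logger = logging.getLogger(__name__)

def handle(payload):
    logger.info("payload: %s", json.dumps(payload))
    return {"status": "ok", "items": len(payload)}

# After the change. The json import goes: this edit made it unused.
# The os import stays: it was dead before the edit, so it gets mentioned
# in the reply, not deleted in this diff.
import logging
import os

logger = logging.getLogger(__name__)

def handle(payload):
    return {"status": "ok", "items": len(payload)}

Every changed line (the logging call and the json import) traces back to the request; the os import does not, so it stays out of the diff.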
4. Goal-Driven Execution
Define success criteria. Loop until verified.
The fourth behavior addresses what happens at the end of a task. Without this rule, the agent finishes when it runs out of things to do. With this rule, the agent finishes when the task is verifiably done.
The rule:
Transform tasks into verifiable goals:
- "Add validation" → "Write tests for invalid inputs,
then make them pass"
- "Fix the bug" → "Write a test that reproduces it,
then make it pass"
- "Refactor X" → "Ensure tests pass before and after"
For multi-step tasks, state a brief plan:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
Strong success criteria let you loop independently.
Weak criteria ("make it work") require constant clarification.
The reframing in the first block is the move. “Add validation” by itself is open-ended; the agent will produce something validation-shaped and stop. “Write tests for invalid inputs, then make them pass” gives the agent both the implementation goal and the verification criterion. The agent now has an objective measure of done.
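Here is a hypothetical sketch of that reframing in Python, with invented names (parse_age and the test functions), where the tests are the success criterion and the implementation exists to make them pass:

# Hypothetical reframing of "add validation" as verifiable goals.
# parse_age and the tests are invented for illustration; run with pytest.
import pytest

def parse_age(raw):
    """Written to make the tests below pass."""
    value = int(raw)              # non-numeric input raises ValueError
    if not 0 <= value <= 150:
        raise ValueError(f"age out of range: {value}")
    return value

def test_rejects_non_numeric_input():
    with pytest.raises(ValueError):
        parse_age("not a number")

def test_rejects_out_of_range_age():
    with pytest.raises(ValueError):
        parse_age("999")

def test_accepts_valid_age():
    assert parse_age("42") == 42

The agent can loop against those tests on its own; "make it work" gives it nothing to loop against.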
For multi-step tasks, the plan-with-verification format is the structural fix. Each step has its own check. The agent doesn’t move to step two until step one passes. This sounds obvious in retrospect; without the rule it doesn’t happen, because the agent will happily stack four uncertain steps on top of each other and call the result “done.”
The last sentence is the diagnostic. If you have to keep clarifying “is it actually working?” after the agent claims done, your success criteria were weak. The rule makes you (and the agent) commit to stronger criteria up front, so the verification loop runs without you babysitting it.
Why these four and not others
I tried longer lists. They didn’t survive contact with reality.
A list of fifteen behaviors is a list nobody reads. The agent loads it, sure, but a long list dilutes any individual instruction. The four-item version is short enough that the agent treats each one as binding rather than as a suggestion in a sea of suggestions.
I also tried adding more specific behaviors — “always write tests,” “prefer functional over imperative style,” “use type hints.” Each of those is sometimes right, often wrong, and depends on the project. Putting them in the always-on rule made the rule less universal and less useful. They belong in project-specific rules, where the context is appropriate.
The four behaviors that stayed are the ones that apply regardless of stack, project, or task. Think before you act. Don’t overengineer. Touch what you must, no more. Define done. Those four hold across every codebase I’ve ever worked in.
A short closing tradeoff
These guidelines bias the agent toward caution over speed. For trivial tasks — renaming a variable, fixing a typo — they’re overhead. The rule file says so explicitly, with one line at the top: “For trivial tasks, use judgment.”
That escape hatch matters. A rule that turns a one-line fix into a multi-step planning exercise is making the system slower, not better. The point of the rule is to install good defaults for the non-trivial case, where the agent’s tendency to act prematurely or overengineer would otherwise produce real damage. On trivial tasks, the agent recognizes the lower stakes and skips the ceremony.
How I’d know the rule is working
The rule file ends with a short success criterion of its own:
These guidelines are working if: fewer unnecessary changes
in diffs, fewer rewrites due to overcomplication, and
clarifying questions come before implementation rather than
after mistakes.
That’s the test. If diffs are tighter, if I’m rewriting less, and if the agent is asking the smart question at the start of a task instead of after — the rule is doing its job. If those signals aren’t moving, the rule isn’t earning its place in the always-on context.
So far, on the projects where I’ve used it, the signals have moved in the right direction. Not all the way. The agent still occasionally produces a 200-line solution that wants to be 50, and I still catch myself accepting too-clever changes that violate the surgical-changes principle. But the floor is meaningfully higher than it was before, which is what guardrails are for.