Why AI Ignores Your "Use TDD" Instruction: The Compliance Gap Between Instruction and Rationale

You wrote "Use TDD" in CLAUDE.md, but the AI doesn't write tests. Or it writes them, but they're clearly going through the motions. If you've tried practicing test-driven development with AI coding tools, you've likely experienced this.

The instruction is right there. So why doesn't the AI follow it? The cause may not be an AI capability problem — it may be a problem with the type of instruction.

Same Instruction, Same Prompt, Completely Different Results

Here's the most striking result from our controlled experiment.

In an experiment developing a workflow approval system in TypeScript, we gave the prompt "Use Playwright to verify the entire system end-to-end" near the end of the project. There were three conditions, all with the same instruction in CLAUDE.md: "Develop using TDD." The only difference was whether the design intent recorded via sqlew included the rationale for why TDD was adopted.

The results diverged dramatically.

In the conditions where the rationale was recorded, the AI interpreted "verify" as "write re-runnable test specs" and spontaneously generated Playwright test files. In the best case, it created 10 spec files containing 25 E2E tests, detecting 8 bugs in the process.

In the condition with the instruction only and no rationale, the AI interpreted the same "verify" as "manually check in the browser." E2E tests generated: zero. Bugs detected: zero.


The Difference Between "What to Do" and "Why to Do It"

The implication is clear. Even with the same instruction written down, AI behavior can diverge completely. What made the difference wasn't the presence or absence of instruction — it was the presence or absence of rationale.

"Develop using TDD" is a what instruction. "We adopted TDD to ensure testability and enable early regression detection" is a why rationale.

The AI that knew the rationale could interpret the ambiguous word "verify" in alignment with the underlying purpose. If TDD exists for regression detection, then verification means writing re-runnable tests. The AI without rationale interpreted "verify" at face value — "look and confirm."
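To make the distinction concrete, here is a hypothetical CLAUDE.md fragment (the wording is illustrative, not the exact text used in the experiment) showing an instruction alone versus an instruction paired with its rationale:

```markdown
## Development policy (instruction only)

- Develop using TDD.

## Development policy (instruction + rationale)

- Develop using TDD.
  - Why: we adopted TDD to ensure testability and enable early regression
    detection. Any request to "verify" behavior should therefore produce
    re-runnable test code, not a one-off manual check.
```

The second form gives the AI a criterion for resolving ambiguous words like "verify" in line with the purpose behind the policy.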

This isn't a capability problem. A human worker given tasks without knowing the purpose will also default to the most literal interpretation. AI behaves the same way.

Research Confirms the Limits of Instruction Following

This phenomenon has academic backing. LLM instruction following research reports that compliance rates decline as reasoning chains grow longer. In complex contexts like the late stages of a project, instructions given early on are particularly prone to being ignored.

Rationale, however, operates through a different mechanism. Instructions function as "rules to follow," while rationale serves as "criteria for judgment." Rules are forgotten as complexity increases, but judgment criteria are referenced more, not less, as situations grow complex. In our experimental data, the frequency with which the AI referenced prior decisions accelerated through the later stages of the project in the condition with design intent.

The Hidden Cost of Missing Rationale

In the instruction-only condition, the absence of E2E tests was discovered after the fact, requiring an additional task. This supplementary work alone cost 29 minutes and 21,751K tokens.

This wasn't a case of forgetting to write "create tests" as an instruction. The instruction was in CLAUDE.md from the start. It was the absence of rationale that changed AI behavior.

This kind of hidden cost likely occurs frequently in everyday AI coding. When AI doesn't behave as expected, we tend to think "maybe the instructions weren't detailed enough." But what was actually missing may have been rationale, not instructions.

Provide Rationale, Not More Instructions

sqlew records design decisions in a structured format — context, decision, and consequences — and makes them available to AI agents via MCP for just-in-time reference.
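As a sketch of what such a record might contain (the field names and wording here are illustrative assumptions, not sqlew's actual schema), a decision captured in the context/decision/consequences style could look like:

```json
{
  "context": "The approval workflow has many state transitions, so regressions are easy to introduce unnoticed.",
  "decision": "Develop using TDD.",
  "consequences": "Testability is designed in from the start; 'verification' means running re-runnable tests, enabling early regression detection."
}
```

When an agent retrieves a record like this via MCP, the consequences field supplies exactly the judgment criterion that a bare "Use TDD" instruction lacks.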

Instead of writing "Develop using TDD," record why TDD was adopted. That single difference produced the gap between 25 E2E tests and zero. Rather than piling on more instructions, provide the rationale behind decisions. We believe this shift in thinking is the key to fundamentally improving AI coding quality.


References

  • "Rediscovering Architectural Decision Records: How Persistent Design Context Improves LLM Code Generation" — Shingo Kitayama (2026) — sqlew Efficacy Study
  • "The Instruction Gap: LLMs Get Lost in Following Instructions" — Liu et al. (2025) — arXiv:2601.03269
  • "Scaling Reasoning, Losing Control: Measuring Instruction Following in Reasoning Models" — He et al. (2025) — arXiv:2505.14810

sqlew OSS

  • Retain your projects' Memories
  • No external transmission
  • Open source & free forever
View on GitHub

sqlew Cloud

  • Team collaboration ready
  • Easy to set up, including audit features
  • 14-day free trial available
Try for Free