Building Test Scenarios with Claude Code (Claude CLI)

What determines the quality of development delegated to an AI agent is not so much its ability to generate code as the "verification mechanism" in place. Based on research into official and practical English-language blogs, this article explains how to build test scenarios with Claude Code (Claude CLI).

This article summarizes the key points of the English-language official documentation and technical blogs we researched. The source for each technique is listed in the "References" section at the end of the article. Please check the original articles for full details.

The core: "a verification loop Claude can run on its own"

The top principle Anthropic emphasizes officially is to "give Claude a way to verify its work." Verification refers to "anything that returns a signal Claude can read within the conversation — test suites, build exit codes, linters, scripts that compare output against expected values, screenshots compared against a design." Claude "works, runs checks, reads the results, and iterates until they pass" (Source 1).

For this reason, it is effective to include concrete test cases in your prompt. For example, rather than "implement an email validation function," write something like the following.

Write a validateEmail function. Example test cases: user@example.com → true, invalid → false, user@.com → false. Run the tests after implementing it.
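As a minimal sketch of what such a prompt could lead to, the three example cases can be written as a table that runs mechanically against the implementation. The regular expression and the names emailCases and runEmailCases are illustrative assumptions, not part of the cited sources, and the pattern is a simplification rather than a complete address validator.

```typescript
// Hypothetical implementation: accepts "local@label.label...", where every
// domain label is non-empty. Not a full RFC-compliant validator.
function validateEmail(input: string): boolean {
  return /^[^\s@]+@[^\s@.]+(\.[^\s@.]+)+$/.test(input);
}

// The three cases from the prompt, as [input, expected] pairs.
const emailCases: Array<[string, boolean]> = [
  ["user@example.com", true],
  ["invalid", false],
  ["user@.com", false],
];

// Returns one message per failing case; an empty array means all cases pass.
function runEmailCases(): string[] {
  const failures: string[] = [];
  for (const [input, expected] of emailCases) {
    const actual = validateEmail(input);
    if (actual !== expected) {
      failures.push(`${input}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```

Because runEmailCases returns readable failure messages, the result is a signal Claude can read and act on within the conversation.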

Enforce TDD "explicitly"

Left to its own devices, Claude tends to write the implementation first. So explicitly instruct it to follow the RED → GREEN → REFACTOR order. The Shipyard blog states that "the process of writing tests before writing the feature produces higher-quality results," and recommends having Claude write E2E tests with Cypress or Playwright (Source 4).

alexop.dev introduces a "TDD skill" that separates each phase into a subagent and structurally prevents skipped steps with blocking instructions such as "do not proceed to the GREEN phase until you have confirmed the test fails." The test-authoring agent is required to ensure that "the test must fail when run — confirm this before returning" (Source 5).

The RED-GREEN-REFACTOR flow: (1) write a failing test first → (2) confirm it really fails (RED) → (3) commit → (4) write the minimal implementation that makes the test pass (GREEN) → (5) clean up (REFACTOR). Tests function as "an external standard of judgment that stays accurate even as a session drags on" (Sources 1, 4, 5).
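The ordering above can be sketched as a small gate that refuses to advance without the required evidence. This is an illustration only, not the implementation of the alexop.dev skill; the names Phase, TestRun, and nextPhase are hypothetical, and "passed" is assumed to mean that the test command exited with code 0.

```typescript
type Phase = "RED" | "GREEN" | "REFACTOR";

interface TestRun {
  passed: boolean; // assumed: true when the test command exited with code 0
}

// Returns the phase that may be entered next, or throws when the evidence
// needed to leave the current phase is missing.
function nextPhase(current: Phase, run: TestRun): Phase {
  switch (current) {
    case "RED":
      // The new test must be seen failing before any implementation exists.
      if (run.passed) {
        throw new Error("RED not confirmed: the new test already passes");
      }
      return "GREEN";
    case "GREEN":
      // The minimal implementation must make the test pass.
      if (!run.passed) {
        throw new Error("GREEN not reached: the test still fails");
      }
      return "REFACTOR";
    case "REFACTOR":
      // Cleanup must keep the suite green before the next cycle starts.
      if (!run.passed) {
        throw new Error("REFACTOR broke the tests");
      }
      return "RED";
  }
}
```

Throwing instead of returning a flag mirrors the blocking style of the instructions quoted above: a skipped step stops the workflow rather than being silently tolerated.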

Unit tests alone are not enough — E2E verification

Anthropic's engineering blog clearly points out an easily overlooked failure mode: "Claude would make code changes and even verify with unit tests or curl, yet fail to recognize that the feature did not work end to end."

The solution is to give it a browser automation tool. When the agent was set up to test the app by operating it like a real user, they report that "it became able to identify and fix bugs that cannot be seen from the code alone, dramatically improving performance." They also describe a technique where, at the start of each session, the development server is started and an initial health check (send a chat and check whether a response comes back) is run to detect a broken state (Source 2).

A crucial prohibition: the same blog explicitly states that "tests must not be deleted or edited, because doing so leads to missing features and overlooked bugs." A rule preventing Claude from "deleting tests to turn things green" is essential to running test scenarios (Source 2).
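One way to back this rule with a mechanical check, sketched here under stated assumptions, is to scan the output of `git diff --name-status` for test files that were deleted, modified, or renamed. The cited blog does not describe this script. The file-name conventions in isTestFile and the function names are hypothetical, and the check is meant for implementation phases, since writing new tests legitimately adds files.

```typescript
// Assumption: tests live under test/, tests/ or __tests__/, or are named
// *.test.* / *.spec.*. Adjust to the project's own layout.
function isTestFile(path: string): boolean {
  return (
    /(^|\/)(tests?|__tests__)\//.test(path) ||
    /\.(test|spec)\.[cm]?[jt]sx?$/.test(path)
  );
}

// Input: text printed by `git diff --name-status <base>`. Each line holds a
// status code, a tab, and a path (renames carry the old path, then the new).
function findTestViolations(nameStatus: string): string[] {
  const violations: string[] = [];
  for (const line of nameStatus.split("\n")) {
    if (line.trim() === "") continue;
    const fields = line.split("\t");
    const kind = (fields[0] ?? "").charAt(0); // "R100" becomes "R"
    const path = fields[1]; // for renames this is the old name
    // Only deletions, modifications, and renames are flagged; additions pass.
    if (kind !== "D" && kind !== "M" && kind !== "R") continue;
    if (path !== undefined && isTestFile(path)) {
      violations.push(`${kind} ${path}`);
    }
  }
  return violations;
}
```

A non-empty result could be used to fail the run, so that removing a test never counts as a way to turn the suite green.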

Long-running autonomous execution: state files and retry limits

When running many scenarios autonomously, use a state file to track the result of each test. In Nathan Onn's "Ralph Loop," a status.json records the state of each test case (pending / pass / fail / knownIssue) and the number of fix attempts, serving as Claude's persistent memory. The loop proceeds as "select a test → run it in the browser → move on if it passes, fix and re-test if it fails." Fixes are capped at three attempts; beyond that, the case is marked as a known issue (knownIssue) to prevent infinite loops. Completion is only declared via an explicit signal such as <promise>ALL_TESTS_RESOLVED</promise> (Source 3).
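The bookkeeping described above might look like the following sketch. The field names (id, status, fixAttempts) and the helper functions are assumptions made for illustration; the actual layout of status.json in the Ralph Loop article may differ. Only the four states, the three-attempt cap, and the completion signal come from the text.

```typescript
type Status = "pending" | "pass" | "fail" | "knownIssue";

interface TestCase {
  id: string;
  status: Status;
  fixAttempts: number; // how many fixes have been started for this case
}

const MAX_FIX_ATTEMPTS = 3;

// Picks the next case that still needs work, if any.
function selectNext(cases: TestCase[]): TestCase | undefined {
  return cases.find((c) => c.status === "pending" || c.status === "fail");
}

// Records one browser run. A failure starts another fix attempt until the
// cap is reached; a failure after the third fix marks the case knownIssue.
function recordRun(tc: TestCase, passed: boolean): TestCase {
  if (passed) return { ...tc, status: "pass" };
  if (tc.fixAttempts >= MAX_FIX_ATTEMPTS) {
    return { ...tc, status: "knownIssue" };
  }
  return { ...tc, status: "fail", fixAttempts: tc.fixAttempts + 1 };
}

// Emits the completion signal only when no case is pending or failing.
function completionSignal(cases: TestCase[]): string | null {
  const resolved = cases.every(
    (c) => c.status === "pass" || c.status === "knownIssue",
  );
  return resolved ? "<promise>ALL_TESTS_RESOLVED</promise>" : null;
}
```

Because recordRun returns a new object rather than mutating its argument, the caller decides when to write the updated state back to the file, which keeps the state on disk as the single source of truth between iterations.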

Separate the grader from the implementer

A classic way to raise quality is "not letting the agent that wrote the code grade its own work." Have a verification subagent, or a separate session with fresh context (the Writer/Reviewer pattern), challenge the results to catch overfitting to the tests. That said, the official guidance cautions that "a reviewer instructed to look for gaps tends to report something even for sound work. Tell it to raise only gaps related to correctness or requirements" (Source 1).

A checklist for building test scenarios

Our perspective: At Hashito System, we view test scenarios as "specifications written on the premise that they will be handed to an AI." Preparing scenarios up front, with clear expected values and a pass/fail result that can be returned mechanically, is the key to raising both the productivity and the quality of AI-driven development. We can help from the test-strategy design stage onward, so feel free to contact us.

Frequently Asked Questions (FAQ)

What is the most important thing when building test scenarios with Claude Code?

The top principle Anthropic emphasizes officially is to "give Claude a way to verify its work." Verification refers to any signal Claude can read within the conversation: test suites, build exit codes, linters, scripts that compare output against expected values, screenshots compared against a design, and more. Claude works, runs checks, reads the results, and iterates until they pass. For this reason, it is effective to include concrete test cases with inputs and expected values in your prompt.

How do I make Claude Code follow TDD properly?

Left to its own devices, Claude tends to write the implementation first, so explicitly instruct it to follow the RED → GREEN → REFACTOR order. Write a failing test first, confirm that it really fails (RED) and commit, write the minimal implementation that makes the test pass (GREEN), and clean up (REFACTOR). It is also effective to separate each phase into a subagent and structurally prevent skipped steps with blocking instructions such as "do not proceed to GREEN until you have confirmed the test fails."

Aren't unit tests alone enough?

Sometimes they are not. Even after making code changes and verifying them with unit tests or curl, Claude may fail to recognize that a feature does not work end to end. Giving it a browser automation tool so it can operate the app like a real user lets it identify and fix bugs that cannot be seen from the code alone, improving performance. It is equally important to prohibit deleting or editing tests, since doing so leads to missing features and overlooked bugs.

References

  1. Anthropic, "Best practices for Claude Code" (official documentation) — https://code.claude.com/docs/en/best-practices
  2. Justin Young / Anthropic, "Effective harnesses for long-running agents" (November 26, 2025) — https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
  3. Nathan Onn, "How to Make Claude Code Test and Fix Its Own Work (The Ralph Loop Method)" (February 13, 2026) — https://www.nathanonn.com/claude-code-testing-ralph-loop-verification/
  4. Natalie Lunbeck / Shipyard, "E2E Testing with Claude Code" (July 3, 2025) — https://shipyard.build/blog/e2e-testing-claude-code/
  5. Alexander Opalic, "A Claude Code TDD Skill: Forcing Red-Green-Refactor Discipline" (November 30, 2025) — https://alexop.dev/posts/custom-tdd-workflow-claude-code-vue/
  6. Claude Fast, "Claude Code Workflow: Create Tight Feedback Loops" — https://claudefa.st/blog/guide/development/feedback-loops