Experiments
AgentV eval files are the only runnable authoring artifact. Use top-level
experiment: inside eval.yaml for runtime choices: targets, workers,
timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.

    name: support-regression

    experiment:
      targets: [codex-gpt5, claude-sonnet]
      workers: 2
      timeout_seconds: 720
      repeat:
        count: 4
        strategy: pass_at_k
        cost_limit_usd: 2.00

    workspace:
      hooks:
        before_all:
          command: ["bash", "-lc", "bun install && bun run build"]

    tests:
      - id: refund-eligibility
        input: Can this customer get a refund?
        criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval
files. Do not use both experiment: and execution: in the same eval.
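
As a rough sketch of what migrating a legacy file could look like: the source only says execution: is an alias, so this assumes it carries the same keys as experiment:. Under that assumption the rename is the only change. The target name and values are reused from the example above.

```yaml
# Legacy form (still accepted). Assumption: same keys as experiment:.
# execution:
#   targets: [codex-gpt5]
#   workers: 2

# Preferred form: the same runtime block under experiment:.
name: support-regression
experiment:
  targets: [codex-gpt5]
  workers: 2
tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly
```

Keeping only one of the two blocks in a file avoids the conflict described above.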
Tests Imports
Use tests[] for composition, imports, and selection.

    tests:
      - include: evals/support/*.eval.yaml
        type: suite
        select:
          test_ids:
            - refund-*
            - missing-order-date
          tags: regression
          metadata:
            priority: high
        run:
          threshold: 1.0
          repeat:
            count: 2
            strategy: pass_all
      - include: cases/*.cases.yaml
        type: tests
      - include: cases/regression.jsonl
        type: tests
      - cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata,
workspace, shared input, shared assertions, and tests. The child suite’s
experiment: or legacy execution: runtime block is ignored; the parent eval’s
runtime block controls the run.
type: tests imports only raw test entries. It intentionally drops shared
context from an imported eval suite, so parent suite fields apply to those raw
cases.
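
To make the contrast concrete, here is a minimal sketch of a wrapper eval that uses both import types. The wrapper name and both file paths are hypothetical.

```yaml
# Hypothetical wrapper eval; names and paths are illustrative only.
name: support-wrapper

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install"]

tests:
  # type: suite keeps the child suite's own metadata, workspace,
  # shared input, shared assertions, and tests.
  - include: evals/support/refunds.eval.yaml
    type: suite

  # type: tests takes only the raw test entries, so the wrapper's
  # fields (such as the workspace above) apply to them instead.
  - include: evals/billing/invoices.eval.yaml
    type: tests
```

In both cases the runtime block of the imported file is not used; the wrapper's experiment: block, if any, controls the run.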
tests[].select.test_ids filters imported test IDs with glob patterns.
tests[].select.tags filters each imported case’s effective metadata.tags.
Effective case tags are suite-first and deduped:
suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags
remain suite identity metadata for discovery and reporting; selection reads
the merged case metadata view.
tests[].select.metadata filters case metadata by key/value, where selector
values may be scalars or lists.
Globbed include paths are resolved in deterministic path order, then test order.
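
The tag merge and the selectors can be sketched together. The imported suite shown in the comments is hypothetical. The text above does not say how several selectors combine, so this sketch assumes a case must satisfy all of them and that a list value matches any listed value.

```yaml
# Hypothetical imported suite, evals/support/refunds.eval.yaml:
#   tags: [support]
#   metadata:
#     tags: [regression]
#   tests:
#     - id: refund-late-order
#       metadata:
#         tags: [regression, slow]
#         priority: high
#
# Effective tags for refund-late-order (suite-first, deduped):
#   support, regression, slow

tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids:
        - refund-*            # glob over imported test IDs
      tags: [regression]      # checked against the effective tags
      metadata:
        priority: [high, critical]   # assumed: matches either value
```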
String-valued tests and string entries inside tests[] are raw-case import
shorthand. They are equivalent to include with type: tests and may point at
raw case files, directories, or globs. Importing another eval suite must use
object form with include: and type: suite.
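
A short sketch of the shorthand forms next to their object equivalent. The `---` separators only divide the alternatives shown here; a real eval file would use one of them. The directory path in the last variant is hypothetical.

```yaml
# Shorthand: a string entry inside tests[] imports raw cases.
tests:
  - cases/smoke/*.cases.yaml
---
# Equivalent object form of the entry above.
tests:
  - include: cases/smoke/*.cases.yaml
    type: tests
---
# String-valued tests: the whole key is one raw-case import,
# here pointing at a directory.
tests: cases/smoke/
```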
Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does
not recursively load suite runtime blocks.
Imported suite artifacts are nested under the source suite name inside a wrapper
eval result directory, for example
.agentv/results/<wrapper-eval>/<timestamp>/<imported-suite>/<test-id>/....
Direct tests owned by the wrapper eval and raw case imports live directly under
<test-id>/....
Scoped Run Overrides
Use scoped run: blocks for result interpretation and scheduling policies that
vary by include group or test case. Precedence is:
test.run > tests[].run > experiment

    experiment:
      target: agent
      threshold: 0.8
      repeat:
        count: 3
        strategy: pass_at_k

    tests:
      - include: ./evals/flaky-agentic/**/*.eval.yaml
        type: suite
        select:
          tags: [agentic]
        run:
          repeat:
            count: 3
            strategy: pass_at_k

      - include: ./evals/regression/**/*.eval.yaml
        type: suite
        select:
          tags: [must-pass]
        run:
          threshold: 1.0
          repeat:
            count: 2
            strategy: pass_all

      - id: critical-case
        input: "..."
        criteria: Must pass exactly
        run:
          threshold: 1.0
          repeat:
            count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and
budget_usd. Candidate-changing fields such as target and targets stay
parent-level under experiment:. Workspace mutation belongs in
workspace.hooks, and runner-specific setup belongs in targets[].hooks.
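
A minimal sketch of the two scoped fields not shown in the example above, timeout_seconds and budget_usd. The include path and all numeric values are hypothetical.

```yaml
experiment:
  targets: [codex-gpt5, claude-sonnet]   # candidate choice stays parent-level
  threshold: 0.8

tests:
  # Include-group override: a longer timeout and its own budget.
  - include: ./evals/slow-agentic/**/*.eval.yaml
    type: suite
    run:
      timeout_seconds: 1200
      budget_usd: 5.00

  # Per-test override: test.run takes precedence over tests[].run
  # and over experiment.
  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
```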
Lifecycle Ownership
experiment: configures evaluation policy. It does not own commands that
prepare files, dependencies, repos, or target-specific runner state.
| Need | Put it in |
|---|---|
| Install dependencies, build the repo, seed files | workspace.hooks.before_all |
| Reset or apply per-case state | workspace.hooks.before_each / workspace.hooks.after_each |
| Configure an agent runner or provider variant | targets[].hooks |
| Choose targets, repeats, pass policy, budget, threshold | experiment |

    workspace:
      hooks:
        before_all:
          command: ["bash", "-lc", "bun install && bun run build"]

    targets:
      - name: agent-with-skills
        provider: codex
        hooks:
          before_each:
            command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]

    experiment:
      target: agent-with-skills
      repeat:
        count: 3
        strategy: pass_at_k

Repeat Runs

repeat supports the same core strategies as repeated attempts:

    experiment:
      repeat:
        count: 3
        strategy: mean
        cost_limit_usd: 1.50

Supported strategies:
| Strategy | Behavior |
|---|---|
| pass_at_k | Uses the best passing attempt; early-exits by default unless early_exit: false is set |
| pass_all | Uses the weakest attempt score, so every repeated attempt must meet the threshold |
| mean | Aggregates repeated attempt scores by mean |
| confidence_interval | Uses the lower bound of a 95% confidence interval as the conservative score |
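
As a worked illustration of how the choice of strategy changes the outcome, the sketch below uses three hypothetical attempt scores and a threshold of 0.8. How the confidence interval is computed is not described here, so no figure is given for it.

```yaml
# Hypothetical attempt scores for one test: 0.6, 0.9, 1.0
#
#   pass_all -> weakest score, 0.6               (below 0.8, fails)
#   mean     -> (0.6 + 0.9 + 1.0) / 3 = ~0.83    (meets 0.8, passes)
#   pass_at_k -> at least one attempt meets 0.8  (passes)

experiment:
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_all   # swap for mean, pass_at_k, or confidence_interval
```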
AgentV also accepts runs and early_exit under experiment: as shorthand for
repeat-run policy:

    experiment:
      runs: 4
      early_exit: true

Do not set both repeat and runs in the same runtime block.
Result Layout
Default eval runs write to:

    .agentv/results/<eval-name>/<timestamp>/

Imported source suite metadata appears in index.jsonl rows and manifests.
AgentV does not add a redundant suite directory when the result group is already
the eval name.