🧪 Experiment Campaign: test-quality-sentinel
Workflow file: .github/workflows/test-quality-sentinel.md
Selected dimension: model_size
Triggered by: ab-testing-advisor on 2026-07-05
Background
test-quality-sentinel reviews pull requests that add or modify tests, scores the quality of those tests, and decides whether to approve or request changes. I chose model_size because this workflow mixes structured rubric application, code-diff interpretation, and review writing, making it a good candidate to test whether a smaller model can preserve decision quality at lower cost.
Hypothesis
Null hypothesis: the medium and large variants do not improve review usefulness acceptance rate compared to the small variant, which serves as the baseline.
Alternative hypothesis: a larger reasoning-capable model improves review usefulness acceptance rate by at least 15 percentage points versus a smaller model, without materially increasing false-positive change requests.
Experiment Configuration
Add the following experiments: block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):
experiments:
  model_size:
    variants: [small, medium, large]
    description: "Measures whether test-quality analysis quality justifies larger-model cost for PR review decisions."
    hypothesis: "H0: no change in review_usefulness_acceptance_rate. H1: medium or large improves review_usefulness_acceptance_rate by >=15% versus small."
    metric: review_usefulness_acceptance_rate
    secondary_metrics: [run_success_rate, median_comment_length]
    guardrail_metrics:
      - name: false_positive_request_changes_rate
        direction: min
        threshold: 0.10
    min_samples: 121
    weight: [34, 33, 33]
    start_date: "2026-07-05"
    issue: #aw_campaign1
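As a rough planning aid, the sketch below estimates how many total runs the configuration above implies. It assumes, without confirmation from the gh-aw documentation, that `weight` gives each variant's relative share of runs and that `min_samples` applies per variant. The function name `runs_needed` is hypothetical.

```python
import math


def runs_needed(min_samples: int, weights: list[int]) -> int:
    """Estimate total runs so every variant is expected to reach min_samples.

    Assumption (not confirmed by the source): `weights` are relative shares
    of traffic and `min_samples` is a per-variant minimum.
    """
    total_weight = sum(weights)
    # The variant with the smallest share is the last to reach min_samples,
    # so it determines the total number of runs required.
    return math.ceil(min_samples * total_weight / min(weights))


# Values taken from the experiments block above.
print(runs_needed(min_samples=121, weights=[34, 33, 33]))  # 367
```

Under these assumptions the total is slightly above 3 × 121 = 363, because the two variants weighted at 33 receive a little less than a third of the runs.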
Variant descriptions:
- small: use the smallest supported model for lowest cost; expect faster, cheaper runs but more classification errors on nuanced tests.
- medium: use a mid-tier model as a likely cost/quality balance point.
- large: use the strongest available model; expect best rubric adherence and explanation quality at higher cost.
Workflow Changes Required
The experiment is implemented in the workflow markdown with handlebars conditional blocks. Each conditional compares against a specific variant value, using the syntax {{#if experiments.<name> == "<variant>" }}...{{else}}...{{/if}}. The compiler expands experiments.<name> references automatically at compile time, so the internal env-var form (__GH_AW_EXPERIMENTS__<NAME>___<variant>) should never be written directly.
Concrete diff:
--- a/.github/workflows/test-quality-sentinel.md
+++ b/.github/workflows/test-quality-sentinel.md
@@
 engine:
   id: copilot
+  model: {{#if experiments.model_size == "small" }}small{{else}}{{#if experiments.model_size == "medium" }}medium{{else}}large{{/if}}{{/if}}
   max-continuations: 15
@@
-You are the Test Quality Sentinel. Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.
+You are the Test Quality Sentinel. {{#if experiments.model_size == "small" }}Use the shortest possible reasoning trace: classify only the highest-signal tests first, keep explanations terse, and avoid re-reading files unless required to score accurately.{{else}}{{#if experiments.model_size == "medium" }}Use standard reasoning depth with concise justifications and one verification pass for flagged tests.{{else}}Use deeper comparative reasoning: verify flagged tests against the rubric, cross-check edge-case coverage carefully, and produce more specific fix guidance.{{/if}}{{/if}} Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.
This keeps the experimental treatment narrow: model choice is primary, while prompt nudges help each model operate within an intentionally matched cost envelope.
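To make the branching in the diff easier to reason about, here is a minimal Python sketch of the same nested if/else chain. It is only an illustration of the template logic, not the gh-aw compiler's behavior. It assumes the comparison is plain string equality, and the function name `resolve_variant` and the short labels for each prompt nudge are hypothetical.

```python
def resolve_variant(variant: str) -> dict[str, str]:
    """Mirror the nested {{#if}}/{{else}} chain from the diff.

    Illustrative only: assumes string equality on the variant name.
    """
    if variant == "small":
        return {"model": "small", "reasoning": "shortest possible trace"}
    elif variant == "medium":
        return {"model": "medium", "reasoning": "standard depth, one verification pass"}
    else:
        # The template's final {{else}} is a catch-all, so any value other
        # than "small" or "medium" lands on the large branch.
        return {"model": "large", "reasoning": "deeper comparative reasoning"}


for name in ["small", "medium", "large", "unexpected"]:
    print(name, "->", resolve_variant(name)["model"])
```

One consequence of this shape is that an unassigned or misspelled variant would fall through to the large branch, which may be worth keeping in mind when reading per-variant cost figures.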
Success Metrics
| Metric | Type | Target |
|---|---|---|
| review_usefulness_acceptance_rate | Primary | +15 percentage points vs small |
| run_success_rate | Secondary | No worse than baseline |
| false_positive_request_changes_rate | Guardrail | Must stay <= 10% |
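The targets in the table can be expressed as a simple check, sketched below. This assumes the small variant is the baseline for both the primary and the secondary metric. The names `VariantResult` and `meets_targets` are hypothetical, and the numbers are made up for illustration rather than taken from any run.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Observed rates for one variant, as fractions in [0, 1]."""
    review_usefulness_acceptance_rate: float
    run_success_rate: float
    false_positive_request_changes_rate: float


def meets_targets(candidate: VariantResult, small: VariantResult,
                  min_lift: float = 0.15,
                  guardrail_max: float = 0.10) -> dict[str, bool]:
    """Evaluate the three table targets for a candidate against small."""
    lift = (candidate.review_usefulness_acceptance_rate
            - small.review_usefulness_acceptance_rate)
    return {
        "primary": lift >= min_lift,
        "secondary": candidate.run_success_rate >= small.run_success_rate,
        "guardrail": candidate.false_positive_request_changes_rate <= guardrail_max,
    }


# Made-up numbers for illustration only; not results from this experiment.
small = VariantResult(0.55, 0.97, 0.09)
large = VariantResult(0.72, 0.98, 0.08)
print(meets_targets(large, small))
```

A check like this only compares point estimates; whether an observed lift is statistically distinguishable from noise is a separate question.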
Statistical Design
- Variants: small, medium, large
- Assignment: Round-robin via gh-aw experiments runtime (cache-based)
- Minimum runs per variant: 121
- Expected experiment duration: ~363 days if the workflow runs about once per day, since each of the three variants needs 121 runs; faster if slash-command usage is frequent
- Analysis approach: proportion test on binary review usefulness outcomes, with descriptive summaries for secondary metrics
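The proportion test named in the analysis approach could look roughly like the following. This is a sketch of a standard one-sided two-proportion z-test with a pooled standard error, using only the Python standard library. It is not necessarily the analysis gh-aw performs, and it tests for any improvement rather than for the full 15-percentage-point margin. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """One-sided z-test for H1: p_b > p_a, using the pooled standard error.

    Returns (z, p_value), where p_value is the upper-tail probability.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; all outcomes are identical")
    z = (p_b - p_a) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration: useful reviews out of 121 runs per variant.
z, p = two_proportion_z_test(success_a=60, n_a=121, success_b=80, n_b=121)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With three variants, comparing both medium and large against small means two tests, so some correction for multiple comparisons may be appropriate.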
Implementation Steps
- Add the experiments: section to frontmatter
- Add conditional blocks using {{#if experiments.model_size == "<variant>" }} (value-comparison form; never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
- Run gh aw compile test-quality-sentinel to regenerate the lock file
- Inspect /tmp/gh-aw/agent/experiments/state.json
References
- A/B Testing in gh-aw
- Workflow file: .github/workflows/test-quality-sentinel.md
- Recent runs reviewed: 10
Notes
The infrastructure side quest found that analysis_type, tags, and notify are already implemented in both the compiler and pick_experiment.cjs, so the schema gate for a second issue was not met. Future work should focus on richer outcome artifacts, dashboards, and audit integration rather than adding those frontmatter fields.
Generated by 🧪 Daily A/B Testing Advisor · 16.9 AIC · ⌖ 26.1 AIC · ⊞ 5.7K