[ab-advisor] Experiment campaign for test-quality-sentinel: A/B test model_size

### 🧪 Experiment Campaign: test-quality-sentinel

**Workflow file**: `.github/workflows/test-quality-sentinel.md`
**Selected dimension**: model_size
**Triggered by**: `ab-testing-advisor` on 2026-07-05

---

### Background

`test-quality-sentinel` reviews pull requests that add or modify tests, scores the quality of those tests, and decides whether to approve or request changes. I chose `model_size` because this workflow mixes structured rubric application, code-diff interpretation, and review writing, making it a good candidate to test whether a smaller model can preserve decision quality at lower cost.

### Hypothesis

Null hypothesis: a larger model does not improve the review usefulness acceptance rate compared to the `small` baseline.

Alternative hypothesis: a larger reasoning-capable model improves review usefulness acceptance rate by at least 15 percentage points versus a smaller model, without materially increasing false-positive change requests.

<details><summary>View Details</summary>

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):

```yaml
experiments:
  model_size:
    variants: [small, medium, large]
    description: "Measures whether test-quality analysis quality justifies larger-model cost for PR review decisions."
    hypothesis: "H0: no change in review_usefulness_acceptance_rate. H1: medium or large improves review_usefulness_acceptance_rate by >=15 percentage points versus small."
    metric: review_usefulness_acceptance_rate
    secondary_metrics: [run_success_rate, median_comment_length]
    guardrail_metrics:
      - name: false_positive_request_changes_rate
        direction: min
        threshold: 0.10
    min_samples: 121
    weight: [34, 33, 33]
    start_date: "2026-07-05"
    issue: "#aw_campaign1"  # quoted so YAML does not parse the value as a comment
```

**Variant descriptions**:
- `small`: use the smallest supported model for lowest cost; expect faster, cheaper runs but more classification errors on nuanced tests.
- `medium`: use a mid-tier model as a likely cost/quality balance point.
- `large`: use the strongest available model; expect best rubric adherence and explanation quality at higher cost.

### Workflow Changes Required

The following changes implement the experiment in the workflow markdown using handlebars conditional blocks. **Always compare against a specific variant value**: the correct syntax is `{{#if experiments.<name> == "<variant>" }}...{{else}}...{{/if}}`. The compiler automatically expands `experiments.<name>` references at compile time; never write the internal env-var form (`__GH_AW_EXPERIMENTS__<NAME>___<variant>`) directly.

Concrete diff:

```diff
--- a/.github/workflows/test-quality-sentinel.md
+++ b/.github/workflows/test-quality-sentinel.md
@@
 engine:
   id: copilot
+  model: {{#if experiments.model_size == "small" }}small{{else}}{{#if experiments.model_size == "medium" }}medium{{else}}large{{/if}}{{/if}}
   max-continuations: 15
@@
-You are the Test Quality Sentinel. Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.
+You are the Test Quality Sentinel. {{#if experiments.model_size == "small" }}Use the shortest possible reasoning trace: classify only the highest-signal tests first, keep explanations terse, and avoid re-reading files unless required to score accurately.{{else}}{{#if experiments.model_size == "medium" }}Use standard reasoning depth with concise justifications and one verification pass for flagged tests.{{else}}Use deeper comparative reasoning: verify flagged tests against the rubric, cross-check edge-case coverage carefully, and produce more specific fix guidance.{{/if}}{{/if}} Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.
```

This keeps the experimental treatment narrow: model choice is primary, while prompt nudges help each model operate within an intentionally matched cost envelope.
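The nested conditionals in the diff resolve to one of three branches, with `large` as the fallthrough. The Python sketch below only mirrors that branching so it is easier to reason about; `resolve_variant` and `PROMPT_NUDGES` are hypothetical names, not part of gh-aw, and the nudge strings are abbreviated from the diff. The real expansion is done by the compiler.

```python
# Illustrative only: mirrors the nested {{#if}}/{{else}} chain from the diff.
# `resolve_variant` is a hypothetical helper, not a gh-aw API.

PROMPT_NUDGES = {
    "small": "Use the shortest possible reasoning trace.",
    "medium": "Use standard reasoning depth with concise justifications.",
    "large": "Use deeper comparative reasoning.",
}


def resolve_variant(assigned: str) -> tuple[str, str]:
    """Return (model, prompt_nudge) the way the nested conditionals would."""
    if assigned == "small":
        model = "small"
    elif assigned == "medium":
        model = "medium"
    else:
        # The final {{else}} catches every other value, so an unset or
        # misspelled variant falls through to "large".
        model = "large"
    return model, PROMPT_NUDGES[model]
```

One consequence of this shape is that any unexpected variant value lands on the most expensive branch, so it may be worth confirming the assigned variant in each run's experiment artifact.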

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| review_usefulness_acceptance_rate | Primary | +15 percentage points vs small |
| run_success_rate | Secondary | No worse than small |
| false_positive_request_changes_rate | Guardrail | Must stay <= 10% |
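The primary target and the guardrail in the table can be checked together once per-variant counts are available. The sketch below is one possible way to do that. `VariantOutcome` and `meets_targets` are hypothetical names, and it assumes the false-positive rate is measured per run, which the issue does not define.

```python
from dataclasses import dataclass


@dataclass
class VariantOutcome:
    """Per-variant counts; field names are illustrative, not a gh-aw schema."""
    runs: int
    useful_reviews: int
    false_positive_change_requests: int

    @property
    def usefulness_rate(self) -> float:
        return self.useful_reviews / self.runs

    @property
    def false_positive_rate(self) -> float:
        # ASSUMPTION: the denominator is all runs of the variant.
        return self.false_positive_change_requests / self.runs


def meets_targets(candidate: VariantOutcome, small: VariantOutcome,
                  min_lift: float = 0.15, guardrail_max: float = 0.10) -> bool:
    """True if the candidate beats `small` by min_lift and stays under the guardrail."""
    lift = candidate.usefulness_rate - small.usefulness_rate
    return lift >= min_lift and candidate.false_positive_rate <= guardrail_max
```

This checks only the point estimates against the table's thresholds. It does not test statistical significance, and it leaves out the secondary `run_success_rate` comparison.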

### Statistical Design

- **Variants**: small, medium, large
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Minimum runs per variant**: 121
- **Expected experiment duration**: ~363 days (3 variants × 121 runs each) if the workflow runs about once per day; faster if slash-command usage is frequent
- **Analysis approach**: proportion test on binary review usefulness outcomes, with descriptive summaries for secondary metrics
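The sample-size and proportion-test items above can be sketched with the standard two-proportion z-test formulas. This is a minimal illustration using only the Python standard library, not the analysis gh-aw itself performs. The 0.70 baseline acceptance rate, the 0.05 significance level, and the 0.80 power are assumptions; the issue states none of them and does not say how its figure of 121 was derived.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def samples_per_variant(p_base: float, lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs needed per variant to detect `lift` with a two-sided z-test."""
    p_alt = p_base + lift
    p_bar = (p_base + p_alt) / 2
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    spread = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_beta * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt)))
    return ceil((spread / lift) ** 2)


def two_proportion_z_test(useful_a: int, runs_a: int,
                          useful_b: int, runs_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for rate_b - rate_a using a pooled SE."""
    rate_a, rate_b = useful_a / runs_a, useful_b / runs_b
    pooled = (useful_a + useful_b) / (runs_a + runs_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    z = (rate_b - rate_a) / std_err
    return z, 2 * (1 - STD_NORMAL.cdf(abs(z)))


# ASSUMPTION: a 0.70 baseline acceptance rate for `small`; the issue gives none.
n_per_variant = samples_per_variant(p_base=0.70, lift=0.15)
total_runs = 3 * 121  # three variants, min_samples from the config
```

Under those assumed inputs the formula returns 121 runs per variant, and three variants at 121 runs each come to 363 runs in total. The z-test here checks for any difference between two variants; whether the lift reaches 15 percentage points is a separate comparison on the observed rates.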

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter
- [ ] Add conditional blocks to workflow prompt body using `{{#if experiments.model_size == "<variant>" }}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- [ ] Run `gh aw compile test-quality-sentinel` to regenerate lock file
- [ ] Monitor the experiment state artifact uploaded on each run (`/tmp/gh-aw/agent/experiments/state.json`)
- [ ] After sufficient runs, analyze variant distribution via workflow run artifacts
- [ ] Document findings and promote winning variant
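The "after sufficient runs" step can be made concrete with a small progress check. The sketch below assumes the assigned variant of each run has already been collected into a plain list of labels. The schema of `state.json` is not shown in this issue, so parsing it is left out, and `variant_progress` and `ready_for_analysis` are hypothetical names.

```python
from collections import Counter

VARIANTS = ("small", "medium", "large")
MIN_SAMPLES = 121  # min_samples from the experiment config


def variant_progress(assignments: list[str]) -> dict[str, dict[str, int]]:
    """Count runs per variant and how many more each needs to reach MIN_SAMPLES."""
    counts = Counter(assignments)
    return {
        variant: {
            "runs": counts[variant],
            "remaining": max(0, MIN_SAMPLES - counts[variant]),
        }
        for variant in VARIANTS
    }


def ready_for_analysis(progress: dict[str, dict[str, int]]) -> bool:
    """True once every variant has at least MIN_SAMPLES runs."""
    return all(entry["remaining"] == 0 for entry in progress.values())
```

A strongly uneven `runs` count across variants would also be a hint that assignment is not behaving as configured, which is worth checking before comparing outcomes.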

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/test-quality-sentinel.md`
- Recent runs reviewed: 10

### Notes

The infrastructure side quest found that `analysis_type`, `tags`, and `notify` are already implemented in both the compiler and `pick_experiment.cjs`, so the schema gate for a second issue was not met. Future work should focus on richer outcome artifacts, dashboards, and audit integration rather than adding those frontmatter fields.

</details>

> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/28738613974) · 16.9 AIC · ⌖ 26.1 AIC · ⊞ 5.7K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on Jul 19, 2026, 3:11 AM UTC-08:00
