[ab-advisor] Experiment campaign for test-quality-sentinel: A/B test model_size #43530

Description

@github-actions

🧪 Experiment Campaign: test-quality-sentinel

Workflow file: .github/workflows/test-quality-sentinel.md
Selected dimension: model_size
Triggered by: ab-testing-advisor on 2026-07-05


Background

test-quality-sentinel reviews pull requests that add or modify tests, scores the quality of those tests, and decides whether to approve or request changes. I chose model_size because this workflow mixes structured rubric application, code-diff interpretation, and review writing, making it a good candidate to test whether a smaller model can preserve decision quality at lower cost.

Hypothesis

Null hypothesis: neither the medium nor the large variant improves the review usefulness acceptance rate relative to the small variant.

Alternative hypothesis: a larger reasoning-capable model improves review usefulness acceptance rate by at least 15 percentage points versus a smaller model, without materially increasing false-positive change requests.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):

experiments:
  model_size:
    variants: [small, medium, large]
    description: "Measures whether test-quality analysis quality justifies larger-model cost for PR review decisions."
    hypothesis: "H0: no change in review_usefulness_acceptance_rate. H1: medium or large improves review_usefulness_acceptance_rate by >=15 percentage points versus small."
    metric: review_usefulness_acceptance_rate
    secondary_metrics: [run_success_rate, median_comment_length]
    guardrail_metrics:
      - name: false_positive_request_changes_rate
        direction: min
        threshold: 0.10
    min_samples: 121
    weight: [34, 33, 33]
    start_date: "2026-07-05"
    issue: "#aw_campaign1"

Variant descriptions:

  • small: use the smallest supported model for lowest cost; expect faster, cheaper runs but more classification errors on nuanced tests.
  • medium: use a mid-tier model as a likely cost/quality balance point.
  • large: use the strongest available model; expect best rubric adherence and explanation quality at higher cost.
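
Before compiling, the block above can be sanity-checked offline. The sketch below mirrors a few of its fields in a plain Python dict and checks that they agree with each other. It is not part of gh-aw: the function names are hypothetical, and the rule that weights sum to 100 is an assumption based on the values 34/33/33, not a documented requirement.

```python
# Plain-dict mirror of selected fields from the experiments block (values copied from the issue).
EXPERIMENT = {
    "variants": ["small", "medium", "large"],
    "weight": [34, 33, 33],
    "min_samples": 121,
    "guardrail_threshold": 0.10,
}


def check_experiment(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the fields are consistent."""
    problems = []
    variants, weights = cfg["variants"], cfg["weight"]
    if len(set(variants)) != len(variants):
        problems.append("variant names must be unique")
    if len(weights) != len(variants):
        problems.append("weight needs exactly one entry per variant")
    if any(w <= 0 for w in weights):
        problems.append("weights must be positive")
    if sum(weights) != 100:  # ASSUMPTION: weights are percentages
        problems.append(f"weights sum to {sum(weights)}, expected 100")
    if cfg["min_samples"] <= 0:
        problems.append("min_samples must be positive")
    if not 0.0 <= cfg["guardrail_threshold"] <= 1.0:
        problems.append("guardrail threshold must be a rate in [0, 1]")
    return problems


def expected_share(cfg: dict) -> dict[str, float]:
    """Fraction of runs each variant should receive under the configured weights."""
    total = sum(cfg["weight"])
    return {v: w / total for v, w in zip(cfg["variants"], cfg["weight"])}
```

Calling `check_experiment(EXPERIMENT)` returns an empty list for the values in this issue, and `expected_share(EXPERIMENT)` gives the target split that the observed variant distribution can later be compared against.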

Workflow Changes Required

The changes below implement the experiment in the workflow file using handlebars conditional blocks. Always compare against a specific variant value: the correct syntax is {{#if experiments.<name> == "<variant>" }}...{{else}}...{{/if}}. The compiler automatically expands experiments.<name> references at compile time; never write the internal env-var form (__GH_AW_EXPERIMENTS__<NAME>___<variant>) directly.

Concrete diff:

--- a/.github/workflows/test-quality-sentinel.md
+++ b/.github/workflows/test-quality-sentinel.md
@@
 engine:
   id: copilot
+  model: {{#if experiments.model_size == "small" }}small{{else}}{{#if experiments.model_size == "medium" }}medium{{else}}large{{/if}}{{/if}}
   max-continuations: 15
@@
-You are the Test Quality Sentinel. Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.
+You are the Test Quality Sentinel. {{#if experiments.model_size == "small" }}Use the shortest possible reasoning trace: classify only the highest-signal tests first, keep explanations terse, and avoid re-reading files unless required to score accurately.{{else}}{{#if experiments.model_size == "medium" }}Use standard reasoning depth with concise justifications and one verification pass for flagged tests.{{else}}Use deeper comparative reasoning: verify flagged tests against the rubric, cross-check edge-case coverage carefully, and produce more specific fix guidance.{{/if}}{{/if}} Analyze new and changed tests in this PR to produce a **Test Quality Score** (0–100) and flag tests that create false comfort without genuine behavioral coverage.

This keeps the experimental treatment narrow: model choice is primary, while prompt nudges help each model operate within an intentionally matched cost envelope.
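
The nested conditionals in the diff are easier to audit when written as ordinary branching. The sketch below assumes the handlebars blocks behave like a plain if/else chain. The function name is hypothetical, and the returned strings are the placeholder values from the diff, not real model identifiers.

```python
def resolve_model(variant: str) -> str:
    """Mirror the nested {{#if}}/{{else}} chain on the engine.model line."""
    if variant == "small":
        return "small"
    if variant == "medium":
        return "medium"
    # Final {{else}} branch: every other value lands here, including
    # an empty or misspelled variant name.
    return "large"


def resolve_reasoning_nudge(variant: str) -> str:
    """Same chain for the prompt text, reduced to a short label per branch."""
    if variant == "small":
        return "shortest reasoning trace"
    if variant == "medium":
        return "standard depth, one verification pass"
    return "deeper comparative reasoning"
```

One consequence of this shape is that any unexpected variant value falls through to the large branch, which is the most expensive one. If that is not the intended default, the last branch could compare against "large" explicitly.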

Success Metrics

| Metric | Type | Target |
|---|---|---|
| review_usefulness_acceptance_rate | Primary | +15 percentage points vs small |
| run_success_rate | Secondary | No worse than baseline |
| false_positive_request_changes_rate | Guardrail | Must stay <= 10% |
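
The three targets can be combined into a single promotion check. The following is a minimal sketch, assuming per-variant rates have already been computed as fractions in [0, 1] and that "baseline" means the small variant. The function name and the sample numbers are illustrative, not measured results.

```python
def evaluate_targets(small: dict, candidate: dict,
                     min_lift: float = 0.15,
                     guardrail_max: float = 0.10) -> dict:
    """Compare a candidate variant against small using the targets in the table."""
    lift = (candidate["review_usefulness_acceptance_rate"]
            - small["review_usefulness_acceptance_rate"])
    checks = {
        # Primary: absolute lift in percentage points, expressed as a fraction.
        "primary": lift >= min_lift,
        # Secondary: ASSUMPTION that "baseline" is the small variant.
        "secondary": candidate["run_success_rate"] >= small["run_success_rate"],
        # Guardrail: false-positive change requests must stay at or below 10%.
        "guardrail": candidate["false_positive_request_changes_rate"] <= guardrail_max,
    }
    checks["promote"] = all(checks.values())
    return checks


# Illustrative inputs only; these are not observed data.
small_rates = {
    "review_usefulness_acceptance_rate": 0.55,
    "run_success_rate": 0.97,
    "false_positive_request_changes_rate": 0.06,
}
large_rates = {
    "review_usefulness_acceptance_rate": 0.72,
    "run_success_rate": 0.98,
    "false_positive_request_changes_rate": 0.08,
}
result = evaluate_targets(small_rates, large_rates)
```

This only compares point estimates. Whether an observed lift is statistically distinguishable from noise is a separate question for the analysis step.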

Statistical Design

  • Variants: small, medium, large
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 121
  • Expected experiment duration: ~363 days if the workflow runs about once per day (3 variants × 121 runs each); faster if slash-command usage is frequent
  • Analysis approach: proportion test on binary review usefulness outcomes, with descriptive summaries for secondary metrics
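
The proportion test named above can be sketched with the Python standard library. This is one common form (a pooled, one-sided two-proportion z-test), not necessarily what the gh-aw tooling computes. It tests whether the larger variant's rate exceeds the small variant's at all, not whether the lift reaches the 15-point margin. The counts in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """One-sided test of H1: rate_b > rate_a, using the pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; all outcomes are identical")
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


def total_runs_needed(n_variants: int, min_samples: int) -> int:
    """Runs required for every variant to reach min_samples under an even split."""
    return n_variants * min_samples


# Illustrative counts: small accepted 60 of 121 reviews, large accepted 85 of 121.
z, p = two_proportion_z_test(60, 121, 85, 121)
runs = total_runs_needed(3, 121)
```

With three variants, comparing both medium and large against small means two tests, so a multiple-comparison correction (for example, halving the significance level) may be worth considering.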

Implementation Steps

  • Add experiments: section to frontmatter
  • Add conditional blocks to workflow prompt body using {{#if experiments.model_size == "<variant>" }} (value-comparison form — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
  • Run gh aw compile test-quality-sentinel to regenerate lock file
  • Monitor the experiment state artifact uploaded for each run (written at /tmp/gh-aw/agent/experiments/state.json)
  • After sufficient runs, analyze variant distribution via workflow run artifacts
  • Document findings and promote winning variant
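
The monitoring steps above can be supported by a small tally over downloaded artifacts. The sketch below assumes each run's state.json is a JSON object that maps the experiment name to the assigned variant. That schema is a guess for illustration and is not documented in this issue; adjust `variant_from_state` to the real layout. All function names are hypothetical.

```python
import json
from collections import Counter

VARIANTS = ["small", "medium", "large"]
MIN_SAMPLES = 121


def variant_from_state(state_text: str, experiment: str = "model_size") -> str:
    """ASSUMPTION: state.json looks like {"model_size": "<variant>"}."""
    return json.loads(state_text)[experiment]


def runs_remaining(state_texts, variants=VARIANTS, min_samples=MIN_SAMPLES) -> dict:
    """Count runs per variant and report how many each still needs."""
    counts = Counter(variant_from_state(text) for text in state_texts)
    return {v: max(0, min_samples - counts[v]) for v in variants}


# Illustrative input: contents of state.json from 161 hypothetical runs.
states = ['{"model_size": "small"}'] * 121 + ['{"model_size": "large"}'] * 40
remaining = runs_remaining(states)
```

A variant whose remaining count stays high while the others reach zero points to an assignment imbalance that should be investigated before analysis.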

References

  • A/B Testing in gh-aw
  • Workflow file: .github/workflows/test-quality-sentinel.md
  • Recent runs reviewed: 10

Notes

The infrastructure side quest found that analysis_type, tags, and notify are already implemented in both the compiler and pick_experiment.cjs, so the schema gate for a second issue was not met. Future work should focus on richer outcome artifacts, dashboards, and audit integration rather than adding those frontmatter fields.

Generated by 🧪 Daily A/B Testing Advisor · 16.9 AIC · ⌖ 26.1 AIC · ⊞ 5.7K ·

  • expires on Jul 19, 2026, 3:11 AM UTC-08:00
