feat: gstack-inspired AI PR review pipeline with HuggingFace triage, Claude review, and auto-merge#305

Open
schneidermr wants to merge 1 commit into garrytan:main from bitkaio:feat/gstack-pr-review-pipeline

Conversation

@schneidermr

Summary

Adds a 4-step AI-powered PR review pipeline to GitHub Actions, inspired by
gstack's /review and /plan-eng-review
skill files. The pipeline classifies, reviews, scores, and optionally merges PRs
— fully automated, with structured JSON output and a complete audit trail.

Motivation

AI-assisted code generation is accelerating faster than teams can review it.
Anthropic's Claude Code Review solves this at the enterprise tier ($15–25/review,
20 min, Teams/Enterprise only). This pipeline brings comparable structured
review to any GitHub repo at a fraction of the cost ($0.10–0.30/review)
by combining a lightweight open-source triage model with Claude's reasoning
capabilities and gstack's battle-tested review principles.

Architecture

Step 1: Triage      → Qwen2.5-3B (HuggingFace Inference API)
Step 2: Compile     → Claude Sonnet (prompt compilation)
Step 3: Review      → Claude Sonnet/Opus (structured code review)
Step 4: Route       → Deterministic bash (approve/reject/comment/merge)

Step 1 — Triage (HuggingFace, ~2-5s, ~free)

Gathers full PR context via GitHub API: diff, inline review comments,
conversation thread, and linked issues. Classifies the PR by type, risk,
size, and review depth using Qwen2.5-3B-Instruct. Falls back to
rule-based heuristics if the HF API is unavailable.
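For reference, the fallback could be as simple as the sketch below. The thresholds, labels, and field names here are illustrative assumptions, not the actual triage.py logic:

```python
# Hypothetical sketch of the rule-based triage fallback; thresholds,
# labels, and field names are illustrative, not the real triage.py.

def heuristic_triage(diff: str, changed_files: list[str]) -> dict:
    """Classify a PR by size and risk when the HF API is unreachable."""
    # Count added/removed lines, skipping the +++/--- file headers.
    lines_changed = sum(
        1 for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    size = "small" if lines_changed < 50 else "medium" if lines_changed < 400 else "large"

    # Touching CI, auth, or dependency files is treated as higher risk.
    risky_markers = (".github/", "auth", "requirements", "package.json", "Dockerfile")
    risk = "high" if any(m in f for f in changed_files for m in risky_markers) else "low"

    return {
        "type": "unknown",  # the LLM normally fills this in
        "size": size,
        "risk": risk,
        "review_depth": "deep" if risk == "high" or size == "large" else "standard",
        "source": "heuristic-fallback",
    }
```

The point of the fallback is that it only needs to be directionally right: a misclassified size costs a slightly longer review prompt, not a wrong merge decision.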

Step 2 — Prompt Compilation (Claude Sonnet, ~10-15s, ~$0.01-0.03)

Reads the actual gstack skill files (review/SKILL.md and
plan-eng-review/SKILL.md) from the main branch, plus the triage
output. Following the compile-instructions.md meta-prompt, Claude
strips interactive patterns, extracts the review principles, and
compiles a tailored single-pass review prompt optimized for this
specific PR's type, risk level, and review context.

Step 3 — Deep Review (Claude Sonnet, ~30-90s, ~$0.05-0.30)

Executes the compiled prompt against the actual PR diff. Produces a
structured JSON result with 5-dimension scores (design, security,
performance, test coverage, completeness), severity-classified findings
with file/line references, and a verdict.
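The contract might look roughly like the example below. The exact field names live in review-schema.json, so treat this shape as a guess at the schema, not a copy of it:

```python
import json

# Illustrative example of the structured review output; the real field
# names are defined in review-schema.json, so this shape is an assumption.
review = {
    "scores": {  # 1-10 per dimension
        "design": 8,
        "security": 9,
        "performance": 7,
        "test_coverage": 6,
        "completeness": 8,
    },
    "findings": [
        {
            "severity": "major",  # critical | major | minor | nit
            "file": "src/api/client.py",  # hypothetical path
            "line": 42,
            "message": "HTTP call has no timeout; a hung upstream blocks the worker.",
        }
    ],
    "verdict": "comment",  # approve | comment | request_changes
    "confidence": 0.82,
}

# Step 4 consumes this as plain JSON, so it must round-trip cleanly.
payload = json.dumps(review)
assert json.loads(payload)["verdict"] == "comment"
```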

Step 4 — Action Routing (bash, ~1-2s, free)

Pure deterministic logic — reads the review JSON and triage output,
then calls GitHub API to:

  • Approve, request changes, or post review comments
  • Add/remove labels (ai-approved, needs-work, security-review-needed, etc.)
  • Enable auto-merge for qualifying low-risk PRs (gated by AUTO_MERGE_ENABLED repo variable)
  • Escalate critical security findings

Decision Matrix

| Condition | Action |
| --- | --- |
| Score ≥ 9, no critical/major findings, auto-mergeable | Approve + auto-merge (if enabled) |
| Score ≥ 7, no critical findings | Approve + ai-review-passed |
| Moderate issues, non-blocking | Comment only + needs-human-review |
| Critical or multiple major findings | Request changes + needs-work |
| Confidence < 0.7 | Comment only + escalate to human |
| Any critical security finding | Always block + security-review-needed |
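Because the routing step is deterministic, the matrix above reduces to a pure function. This Python sketch restates the matrix as described; the real logic lives in route-action.sh, so treat this as a readable restatement, not the source:

```python
# Python restatement of the Step 4 decision matrix (the actual routing
# is route-action.sh). Branch order matters: security blocks and low
# confidence are checked before any approval path.

def route(score: float, findings: list[dict], confidence: float,
          auto_mergeable: bool, auto_merge_enabled: bool) -> tuple[str, list[str]]:
    """Return (action, labels) for one review result."""
    severities = [f["severity"] for f in findings]
    security_critical = any(
        f["severity"] == "critical" and f.get("category") == "security"
        for f in findings
    )
    if security_critical:                                   # always block
        return ("request_changes", ["security-review-needed"])
    if confidence < 0.7:                                    # escalate to human
        return ("comment", ["needs-human-review"])
    if "critical" in severities or severities.count("major") >= 2:
        return ("request_changes", ["needs-work"])
    if score >= 9 and "major" not in severities and auto_mergeable:
        action = "approve_and_merge" if auto_merge_enabled else "approve"
        return (action, ["ai-approved"])
    if score >= 7:
        return ("approve", ["ai-review-passed"])
    return ("comment", ["needs-human-review"])
```

Keeping this branch in plain code (rather than asking the LLM to choose an action) is what makes the audit trail meaningful: the same JSON always produces the same action.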

Key Design Decisions

  • Multi-model pipeline: Uses a cheap, fast model (Qwen 2.5 3B) for
    classification and an expensive, capable model (Claude) for reasoning.
    This keeps costs ~50-100x lower than Claude Code Review.
  • gstack skill adaptation: Review principles are read directly from the
    actual review/SKILL.md and plan-eng-review/SKILL.md on the main branch
    — not static copies. A compile-instructions.md meta-prompt tells Claude how
    to strip interactive patterns and compile them into a headless CI review prompt.
    When gstack updates its skills, the pipeline automatically picks up the changes.
  • Structured JSON contract: The review schema (review-schema.json)
    is the formal interface between the LLM and the automation layer.
    Every review decision is a downloadable artifact for audit.
  • Auto-merge toggle: AUTO_MERGE_ENABLED repo variable (default: false)
    provides a global kill switch. Even when enabled, auto-merge requires:
    score ≥ 9, no critical/major findings, triage approval, AND all other
    CI checks passing.
  • Graceful degradation: If HuggingFace is down, triage falls back to
    heuristics. The pipeline never fails silently at classification.

Files

.github/
├── workflows/
│   └── gstack-pr-review.yml         # Main 4-step workflow
└── gstack-review/
    ├── triage.py                      # Step 1: HF Qwen triage classifier
    ├── route-action.sh                # Step 4: Deterministic action routing
    ├── compile-instructions.md        # Step 2: Meta-prompt for skill compilation
    └── review-schema.json             # Review output JSON schema

Note: The pipeline reads review/SKILL.md and plan-eng-review/SKILL.md
from the main branch at runtime. These are the actual gstack skill files,
not copies. When the skills are updated, the pipeline automatically picks
up the changes.

Setup Required

Secrets

  • ANTHROPIC_API_KEY (required) — Claude API access
  • HF_TOKEN (recommended) — HuggingFace Inference API

Variables

  • AUTO_MERGE_ENABLED — set to "true" to enable auto-merge, default "false"

Optional (for merge/approve authority)

  • APP_ID (variable) + APP_PRIVATE_KEY (secret) — GitHub App credentials

Triggers

  • pull_request: [opened, synchronize, ready_for_review]
  • issue_comment containing @gstack-review (manual re-trigger)

Cost per Review

| Step | Model | Cost |
| --- | --- | --- |
| Triage | Qwen2.5-3B (HF) | ~free |
| Prompt compile | Claude Sonnet | ~$0.01–0.03 |
| Deep review | Claude Sonnet | ~$0.05–0.30 |
| Routing | None | free |
| Total | | ~$0.06–0.33 |

Compare: Anthropic Claude Code Review = $15-25/review (Teams/Enterprise only).

@schneidermr
Author

schneidermr commented Mar 21, 2026

🤖 gstack as a GitHub Action — your skills, running on every PR, automatically

What if /review and /plan-eng-review ran on every PR without anyone typing a slash command?

This PR turns gstack's review skills into a fully automated GitHub Actions pipeline. It reads the actual review/SKILL.md and plan-eng-review/SKILL.md from main, compiles them into a headless review prompt, and uses Claude to score, comment, approve, or reject PRs — with auto-merge for the safe stuff.


The Pipeline

PR opened
    │
    ▼
┌─ Step 1: TRIAGE ─────────────────────────────────────────────┐
│  Qwen 2.5 3B (HuggingFace) • ~2s • ~free                     │
│  Reads: diff, review comments, conversation, linked issues   │
│  → classifies type, risk, review depth                       │
└──────────────────────────────────┬───────────────────────────┘
                                   ▼
┌─ Step 2: PROMPT COMPILATION ─────────────────────────────────┐
│  Claude Sonnet • ~10s • ~$0.02                               │
│  Reads: review/SKILL.md + plan-eng-review/SKILL.md (main)    │
│  → strips interactive patterns, compiles CI review prompt    │
└──────────────────────────────────┬───────────────────────────┘
                                   ▼
┌─ Step 3: DEEP REVIEW ───────────────────────────────────────┐
│  Claude Sonnet • ~30-60s • ~$0.10-0.30                      │
│  → 5-dimension scores, severity-ranked findings, verdict    │
└──────────────────────────────────┬──────────────────────────┘
                                   ▼
┌─ Step 4: ACTION ─────────────────────────────────────────────┐
│  Deterministic bash • ~1s • free                             │
│  → approve / request changes / comment / label / merge       │
└──────────────────────────────────────────────────────────────┘

Total: ~$0.10–0.33 per PR. Under 2 minutes.


Why this matters

gstack's superpower is its opinionated review skills — the paranoid staff engineer persona, the architecture heuristics, the severity classification framework. But right now those only fire when someone manually runs /review in Claude Code.

This pipeline makes them fire on every PR, automatically, without changing a single line of the skill files. Step 2 reads the real SKILL.md files from main and compiles them on the fly — so when gstack's skills get better, the automated reviews get better too. Zero manual sync.


The scoring and routing

Every PR gets a structured JSON review with 5-dimension scores:

| Dimension | What it checks |
| --- | --- |
| Design | Architecture fit, abstraction quality, readability |
| Security | OWASP patterns, injection, auth, secrets |
| Performance | N+1 queries, resource leaks, missing timeouts |
| Test Coverage | New paths tested, edge cases, regression tests |
| Completeness | Does the diff match the PR description? |

The key differentiator isn't just cost — it's that the review knowledge comes from this repo's own skill files. Not a generic prompt. Not Anthropic's internal review framework. The same opinionated, battle-tested staff engineer persona from /review, compiled for CI.

Claude Code Review is a great product for enterprises that want zero-config depth. This is for people who already have gstack and want it running continuously.


What's in the PR

.github/
├── workflows/
│   └── gstack-pr-review.yml        # The 4-step workflow
└── gstack-review/
    ├── triage.py                    # HuggingFace Qwen classifier
    ├── compile-instructions.md      # Meta-prompt for skill compilation
    ├── review-schema.json           # JSON output contract
    └── route-action.sh              # Deterministic action routing

Five files. The skill files are untouched — read at runtime from main.

Setup: Two secrets (ANTHROPIC_API_KEY required, HF_TOKEN recommended), one variable (AUTO_MERGE_ENABLED, default "false"), and optionally a GitHub App for merge authority.


tl;dr

gstack already has the best review skills. This makes them run on every PR without anyone lifting a finger — 50–100x cheaper and 10x faster than the managed alternative. The skills stay in your control, the triage is transparent, and the automation is configurable from "comment only" all the way to "auto-merge".

Every review produces a downloadable JSON artifact — full audit trail of what was scored, what was found, and what action was taken.
