Audit-first AI evaluation. Six composable, agent-native, Apache-2.0 repos that let any engineering org prove — not just claim — that an AI change works, with a cryptographically signed receipt anyone can verify.
Measure the behavior. Sign the result. Gate the ship.
This repo is the umbrella for the Intent Eval Platform. Each component is its own independently developed and released repository; this repo is the map — what they are, what they do, how they converge, and why it matters. No application code lives here.
AI evaluation has a trust gap. Anyone can publish a benchmark number. The question that matters when something ships — to a compliance officer, an on-call engineer, a customer's security team — isn't "what's the score?" It's "how do I know that score is real, that it wasn't quietly edited after the run, and that it came from the code you actually shipped?"
Most eval tooling answers the first question and skips the second. A number is not evidence. Evidence is a number you can independently verify.
The category optimizes one axis: produce a score. We compete on a different one: produce signed, reproducible evidence — and gate the ship on it.
| Capability | Eval frameworks (OpenAI evals · Braintrust · Langfuse) | Memory-eval boards (gbrain-evals · LongMemEval) | Intent Eval Platform |
|---|---|---|---|
| Behavioral scoring | ✅ | ✅ | ✅ |
| Binary, non-laundered verdicts (no aggregate PASS%) | ◑ | ◑ | ✅ |
| Canonical cross-tool contract (one schema, many emitters) | ❌ | ❌ | ✅ Evidence Bundle |
| Cryptographic receipt — sigstore-signed, Rekor-anchored | ❌ | ❌ | ✅ public transparency log |
| Reproducible-by-signature (not just by transparency) | ◑ | ✅ transparency | ✅ + signature |
| Ship/no-ship gate consuming the evidence | ◑ | ❌ | ✅ rollout gate |
| Deterministic test-policy enforcement (AI-proof) | ❌ | ❌ | ✅ audit-harness |
✅ first-class · ◑ partial / varies · ❌ not in the architecture. An architectural contrast, not a feature-by-feature audit.
The differentiator in one line: every validator in this platform emits the same Evidence Bundle, and an Evidence Bundle can be signed into the public sigstore transparency log — so a third party verifies the result without trusting us. That's a layer a scoreboard of unsigned numbers structurally cannot offer.
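To make the "binary, non-laundered verdicts" row concrete, here is a minimal sketch of a verdict that refuses to average. The type names and the `binaryVerdict` function are hypothetical and exist only for illustration; the real contract lives in intent-eval-core and may look different.

```typescript
// Hypothetical shapes for illustration only; not the platform's actual schema.
type CriterionResult = { id: string; passed: boolean };
type Verdict = { verdict: "PASS" | "FAIL"; failed: string[] };

// A change passes only if every criterion passes. No percentage is computed,
// so many passes cannot average away a single failure.
function binaryVerdict(results: CriterionResult[]): Verdict {
  const failed = results.filter((r) => !r.passed).map((r) => r.id);
  // An empty result set is treated as FAIL: no evidence is not a pass.
  const verdict = results.length > 0 && failed.length === 0 ? "PASS" : "FAIL";
  return { verdict, failed };
}

const example = binaryVerdict([
  { id: "package-integrity", passed: true },
  { id: "trigger-quality", passed: true },
  { id: "regression", passed: false },
]);
console.log(example); // { verdict: 'FAIL', failed: [ 'regression' ] }
```

An aggregate score would report this run as 67% passing. The binary form reports FAIL and names the criterion that failed.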
| Repo | Role | What it does |
|---|---|---|
| intent-eval-core (@intentsolutions/core) | Contracts kernel | The canonical schema everything converges on: TypeScript types, JSON Schemas, Zod validators, and state machines for the platform's canonical entities (incl. the Evidence Bundle + gate-result/v1 predicate). No runtime, no judges, just the contract. Published to npm with sigstore provenance. |
| intent-eval-lab | Methodology + specs | The constitution: vendor-neutral evaluation methodology, normative spec modules, Decision Records, the canonical glossary. Where the why and the rules live. |
| intent-audit-harness | Deterministic gates | AI-proof test-policy enforcement. Hash-pins engineer-owned testing config so AI-proposed threshold-weakening is blocked at pre-commit. Ships escape-scan, CRAP, architecture, bias, and Gherkin-lint gates — each emits Evidence Bundle rows. |
| j-rig-skill-binary-eval | Behavioral eval | Binary-criteria evaluation for Claude skills (extending to plugins, agents, MCP servers). Scores every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression, baseline value, model variance, rollout safety. Never gradients. |
| intent-rollout-gate | Ship decision | A GitHub Action that consumes an Evidence Bundle + a repo policy and decides ship / no-ship / advisory. The platform's user-facing CI gate. |
| intent-eval-dashboard | Public surface | The reports dashboard at labs.intentsolutions.io — eval-set browser + the Evidence Bench scorecard where signed results are published and independently verifiable. |
All six are Apache-2.0.
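The hash-pinning idea behind intent-audit-harness can be sketched in a few lines. This is an illustration of the general technique, not the harness's actual code: `sha256Hex`, `checkPinned`, and the `coverageThreshold` config key are hypothetical names, and the only real API used is Node's built-in `crypto.createHash`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helpers; the real intent-audit-harness interface may differ.
function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Compare the current config text against the digest recorded when an
// engineer last approved it. Any edit, however small, changes the digest.
function checkPinned(
  configText: string,
  pinnedSha256: string,
): { ok: boolean; actual: string } {
  const actual = sha256Hex(configText);
  return { ok: actual === pinnedSha256, actual };
}

const approved = '{"coverageThreshold": 90}'; // engineer-owned config (made up)
const pin = sha256Hex(approved);              // stored alongside the repo
const weakened = '{"coverageThreshold": 50}'; // a proposed threshold-weakening edit

console.log(checkPinned(approved, pin).ok); // true
console.log(checkPinned(weakened, pin).ok); // false
```

Run from a pre-commit hook, a `false` result would stop the commit until an engineer re-pins the config on purpose.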
The repos don't merge into a monolith — they compose at the schema layer. One fact's journey from a code change to a signed, ship-gating verdict:
%%{init: {'theme':'base','themeVariables':{
'primaryColor':'#2c5282','primaryBorderColor':'#1e3a5f','primaryTextColor':'#ffffff',
'lineColor':'#4a7fb5','signalColor':'#4a7fb5','signalTextColor':'#e5e7eb',
'actorBkg':'#2c5282','actorBorder':'#1e3a5f','actorTextColor':'#ffffff',
'noteBkgColor':'#1e3a5f','noteTextColor':'#e5e7eb'}}}%%
sequenceDiagram
participant Chg as A change (skill / plugin / agent)
participant J as j-rig + audit-harness
participant Core as @intentsolutions/core
participant Sig as sigstore + Rekor
participant Dash as labs.intentsolutions.io
participant Gate as intent-rollout-gate
Chg->>J: evaluate (behavioral + deterministic gates)
J->>Core: emit Evidence Bundle (canonical schema)
Core->>Sig: sign blob → public transparency log
Sig-->>Dash: publish signed row (verifiable receipt)
Core->>Gate: bundle + policy → ship / no-ship
Note over Dash,Gate: Anyone can run `cosign verify-blob` and check the result themselves.
The shared Evidence Bundle (defined in intent-eval-core) is the convergence point.
Every validator emits one; the dashboard renders them; the rollout gate decides on them;
sigstore signs them. Add a new emitter and it plugs into the whole platform for free.
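A simplified sketch of what "the rollout gate decides on them" could look like. Everything here is an assumption made for illustration: the `EvidenceRow`, `EvidenceBundle`, and `Policy` shapes, the `requiredEmitters` field, and the rule that separates `advisory` from `no-ship`. The canonical schema is defined in intent-eval-core, and the real policy format belongs to intent-rollout-gate.

```typescript
// Hypothetical, simplified shapes; not the canonical Evidence Bundle schema.
type EvidenceRow = { emitter: string; check: string; passed: boolean };
type EvidenceBundle = { subject: string; rows: EvidenceRow[] };
type Policy = { requiredEmitters: string[] };
type Decision = "ship" | "no-ship" | "advisory";

function decide(bundle: EvidenceBundle, policy: Policy): Decision {
  for (const emitter of policy.requiredEmitters) {
    const rows = bundle.rows.filter((r) => r.emitter === emitter);
    // Missing or failing required evidence blocks the ship.
    if (rows.length === 0 || rows.some((r) => !r.passed)) return "no-ship";
  }
  // Failures from emitters the policy does not require are surfaced, not enforced.
  return bundle.rows.some((r) => !r.passed) ? "advisory" : "ship";
}

const bundle: EvidenceBundle = {
  subject: "example-skill@abc123", // made-up identifier
  rows: [
    { emitter: "j-rig", check: "regression", passed: true },
    { emitter: "audit-harness", check: "escape-scan", passed: true },
    { emitter: "new-emitter", check: "style", passed: false },
  ],
};

console.log(decide(bundle, { requiredEmitters: ["j-rig", "audit-harness"] })); // advisory
console.log(decide(bundle, { requiredEmitters: ["j-rig", "new-emitter"] }));   // no-ship
```

In this sketch a new emitter only appends rows in the shared shape. The gate needs no code change, only a policy entry if its rows should become blocking.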
A real signed result is published right now:
- Scorecard: labs.intentsolutions.io/eval-sets/j-rig-bench/
- Plain-English walkthrough (what we did, how, and the proof): /eval-sets/j-rig-bench/phase-a0/
- The public receipt (raw Rekor transparency-log entry): rekor.sigstore.dev · logIndex 1689291334
You don't have to trust any of it — that's the point. Verify it yourself.
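As a starting point for checking the receipt, the sketch below builds the lookup URL for a Rekor entry by log index. It assumes the public Rekor instance's v1 REST path (`/api/v1/log/entries?logIndex=`); confirm against the current Rekor documentation before relying on it. `rekorEntryUrl` is a hypothetical helper name.

```typescript
// Build the lookup URL for a Rekor transparency-log entry by its index.
// Assumption: the public instance exposes the v1 REST API at this path.
function rekorEntryUrl(
  logIndex: number,
  base: string = "https://rekor.sigstore.dev",
): string {
  if (!Number.isInteger(logIndex) || logIndex < 0) {
    throw new RangeError("logIndex must be a non-negative integer");
  }
  return `${base}/api/v1/log/entries?logIndex=${logIndex}`;
}

console.log(rekorEntryUrl(1689291334));
// https://rekor.sigstore.dev/api/v1/log/entries?logIndex=1689291334
```

Fetching that URL returns the raw log entry only. Checking the signature against the published blob is a separate step, done with `cosign verify-blob`.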
Intent Solutions · intentsolutions.io · Apache-2.0