bench: run searches through SearXNG, not DuckDuckGo (#97) #102

Merged: askalf merged 1 commit into master from feat/bench-searxng on Jun 15, 2026

@askalf askalf commented Jun 15, 2026

Owner

What

Switch the quality bench (bench/run.mjs) to run its general-web searches through SearXNG instead of DuckDuckGo. The product default stays DuckDuckGo — this is a bench-only override.

Why (closes the diagnosis in #97)

DuckDuckGo HTML rate-limits the test box's IP and silently degrades to the wikipedia fallback. For a question Wikipedia can't answer (e.g. factual-lookup's "current rate limits for the Claude API"), the fallback returns Wikipedia articles about LLMs that contain no rate-limit tables, and the synth model correctly refuses to cite them. That refusal reads exactly like an LLM "hedge," which is what made #97's prompt-tuning effort unmeasurable.

Re-running the gate on 2026-06-14 nailed it down:

| backend | result |
| --- | --- |
| DuckDuckGo (stock) | 1/3 cite; 2 runs logged "primary search produced nothing — retrying the round via wikipedia", 100% Wikipedia sources |
| SearXNG | 3/3 cite; support 1.00 / 0.81 / 1.00, zero fallback |

So the bench's nondeterminism was the search layer, not claude-sonnet-4-6. SearXNG egresses via the box's ProtonVPN, not the DDG-rate-limited IP, so it's stable.

How

  • resolveBackend(env): an explicit DEEPDIVE_SEARCH wins (set DEEPDIVE_SEARCH=duckduckgo to reproduce the old behavior); otherwise the bench uses SearXNG, which needs DEEPDIVE_SEARXNG_URL. If no backend is resolvable, the harness exits 2 with guidance instead of silently running the flaky DDG path.
  • questionArgs(q, web) — swaps the duckduckgo leg of multi-adapter questions (academic, niche-ops) for the chosen backend and injects --search=<web> when a question sets none. Domain adapters (arxiv/openalex/stackexchange) and --since are untouched. questions.json stays the single source of truth.
  • Scoreboards now record the backend (· search: `searxng`); the URL is logged to stderr only, never into committed boards (it's a private endpoint).
  • CLAUDE.md bench instructions updated.
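
For readers skimming without the diff open, the resolution order in the first bullet can be sketched roughly as follows. This is an illustrative reconstruction, not the code in bench/run.mjs: the return shape ({ ok, backend, url, message }) and the wording of the guidance message are assumptions; only the two environment variable names and the precedence come from the description above.

```javascript
// Hypothetical sketch of the precedence described above; NOT the PR's code.
// Assumed return shape: { ok, backend, url } on success, { ok, message } on failure.
function resolveBackend(env) {
  const explicit = (env.DEEPDIVE_SEARCH || "").trim();
  const url = (env.DEEPDIVE_SEARXNG_URL || "").trim();

  // 1. An explicit non-SearXNG choice wins, e.g. DEEPDIVE_SEARCH=duckduckgo.
  if (explicit && explicit !== "searxng") {
    return { ok: true, backend: explicit, url: null };
  }

  // 2. Otherwise SearXNG, which only resolves when its URL is configured.
  if (url) {
    return { ok: true, backend: "searxng", url };
  }

  // 3. Nothing resolvable: hand guidance back instead of falling through to DDG.
  //    (Message wording is made up for this sketch.)
  return {
    ok: false,
    message:
      "No search backend resolvable: set DEEPDIVE_SEARXNG_URL, " +
      "or DEEPDIVE_SEARCH=duckduckgo to reproduce the old behavior.",
  };
}
```

Under these assumptions, a caller would print `message` to stderr and exit with status 2 whenever `ok` is false, and skip the call entirely for no-LLM paths such as --list.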

This keeps the project's "don't bundle SearXNG, ship it as an adapter" principle — nothing changes for users; only the dev bench points at a reliable backend so prompt deltas become measurable again.
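
The questionArgs rewrite from the list above could look something like the sketch below. It is a guess at the shape, not the merged implementation: it assumes each entry in questions.json carries a CLI-style `args` array and that a multi-adapter question lists its adapters as a comma-separated `--search=` value, neither of which is shown in this PR.

```javascript
// Hypothetical sketch; NOT the code in bench/run.mjs.
// Assumptions: q.args is an array of CLI flags, and multi-adapter questions
// use a comma-separated list such as --search=arxiv,openalex,duckduckgo.
function questionArgs(q, web) {
  const args = Array.isArray(q.args) ? q.args : [];
  let sawSearch = false;

  const out = args.map((arg) => {
    // Flags other than --search (for example --since) pass through untouched.
    if (!arg.startsWith("--search=")) return arg;
    sawSearch = true;
    const adapters = arg.slice("--search=".length).split(",");
    // Swap only the duckduckgo leg; domain adapters keep their position.
    const swapped = adapters.map((a) => (a === "duckduckgo" ? web : a));
    return `--search=${swapped.join(",")}`;
  });

  // A question that sets no backend gets the bench-wide one injected.
  if (!sawSearch) out.push(`--search=${web}`);
  return out;
}
```

Because the function returns a new array and never mutates `q`, questions.json can stay the single source of truth while the backend is chosen per run.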

Validation

  • npm test — 56 suites pass, incl. new coverage for resolveBackend, questionArgs, and the scoreboard backend line.
  • npm run typecheck — clean.
  • No-LLM control paths: --list exits 0 without a backend; missing backend → guided error + exit 2.
  • End-to-end: factual-lookup through the harness via SearXNG → PASS, 10 sources, 32/32 citations (1.00); confirmed the board records search: searxng and the URL does not leak into it.
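
The "URL does not leak" property checked in the last bullet comes down to keeping two output channels apart. A minimal sketch, assuming a resolved-backend object with `backend` and `url` fields (a hypothetical shape) and a made-up stderr log prefix; only the `search: searxng` board text is taken from this PR:

```javascript
// Hypothetical sketch; the real scoreboard layout is not shown in this PR.
// Builds the fragment that goes into the committed board (backend name only).
function scoreboardBackendLine(backend) {
  return "· search: `" + backend + "`";
}

// Logs the private URL through `log` (stderr by default) and returns only the
// board-safe fragment. The "[bench]" prefix is an invented example.
function reportBackend(resolved, log = console.error) {
  if (resolved.url) log("[bench] search backend url: " + resolved.url);
  return scoreboardBackendLine(resolved.backend);
}
```

Keeping the URL out of the function's return value means whatever writes the board file never sees the endpoint at all.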

Follow-up (not in this PR)

Unblocks resuming the feat/synth-grounding comparison (#97) against a stable SearXNG baseline. Separately worth considering: stop the wikipedia fallback from masking a primary-search failure as a confident "sources don't answer this" on non-recency factual questions — that's a real user-facing UX gap this exposed.

DuckDuckGo HTML rate-limits the test box's IP and silently degrades to
the wikipedia fallback. For questions Wikipedia can't answer (e.g.
factual-lookup's "Claude API rate limits"), that turns a search failure
into a confident "the sources don't contain it" answer that reads like
an LLM hedge — making prompt deltas unmeasurable. Stock factual-lookup
scored 1/3 on DDG vs 3/3 on SearXNG with the same prompt (issue #97).

The bench now routes its general-web leg through SearXNG via
DEEPDIVE_SEARXNG_URL. The product default stays DuckDuckGo (CLAUDE.md:
"ship SearXNG as an adapter, don't bundle it") — this override is
bench-only. resolveBackend() lets an explicit DEEPDIVE_SEARCH win (set
=duckduckgo to reproduce the old behavior) and errors with guidance when
no backend is resolvable. questionArgs() swaps the duckduckgo leg of
multi-adapter questions (academic, niche-ops) for the chosen backend and
injects --search when absent, leaving domain adapters and --since
untouched. Scoreboards record the backend (URL kept to stderr, never
committed).

Tests cover resolveBackend + questionArgs + the scoreboard line.
Validated e2e: factual-lookup PASS via SearXNG, 32/32 citations (1.00).
@askalf askalf merged commit c5160ab into master Jun 15, 2026
5 checks passed
@askalf askalf deleted the feat/bench-searxng branch June 15, 2026 01:29