bench: run searches through SearXNG, not DuckDuckGo (#97) #102
Merged
DuckDuckGo HTML rate-limits the test box's IP and silently degrades to the wikipedia fallback. For questions Wikipedia can't answer (e.g. factual-lookup's "Claude API rate limits"), that turns a search failure into a confident "the sources don't contain it" answer that reads like an LLM hedge, which makes prompt deltas unmeasurable. Stock factual-lookup scored 1/3 on DDG vs 3/3 on SearXNG with the same prompt (issue #97).

The bench now routes its general-web leg through SearXNG via DEEPDIVE_SEARXNG_URL. The product default stays DuckDuckGo (CLAUDE.md: "ship SearXNG as an adapter, don't bundle it"); this override is bench-only.

resolveBackend() lets an explicit DEEPDIVE_SEARCH win (set =duckduckgo to reproduce the old behavior) and errors with guidance when no backend is resolvable. questionArgs() swaps the duckduckgo leg of multi-adapter questions (academic, niche-ops) for the chosen backend and injects --search when absent, leaving domain adapters and --since untouched. Scoreboards record the backend (URL kept to stderr, never committed).

Tests cover resolveBackend + questionArgs + the scoreboard line. Validated e2e: factual-lookup PASS via SearXNG, 32/32 citations (1.00).
What
Switch the quality bench (`bench/run.mjs`) to run its general-web searches through SearXNG instead of DuckDuckGo. The product default stays DuckDuckGo; this is a bench-only override.

Why (closes the diagnosis in #97)
DuckDuckGo HTML rate-limits the test box's IP and silently degrades to the wikipedia fallback. For a question Wikipedia can't answer (e.g. `factual-lookup`'s "current rate limits for the Claude API"), the fallback returns Wikipedia LLM articles with no rate-limit tables, and the synth model correctly refuses to cite them. That refusal reads exactly like an LLM "hedge," which is what made #97's prompt-tuning effort unmeasurable.

Re-running the gate 2026-06-14 nailed it down:
primary search produced nothing — retrying the round via wikipedia, 100% Wikipedia sources

So the bench's nondeterminism was the search layer, not `claude-sonnet-4-6`. SearXNG egresses via the box's ProtonVPN, not the DDG-rate-limited IP, so it's stable.

How
- `resolveBackend(env)`: an explicit `DEEPDIVE_SEARCH` wins (set `=duckduckgo` to reproduce old behavior); otherwise SearXNG, which needs `DEEPDIVE_SEARXNG_URL`. No backend resolvable → the harness exits 2 with guidance (instead of silently running the flaky DDG path).
- `questionArgs(q, web)`: swaps the `duckduckgo` leg of multi-adapter questions (`academic`, `niche-ops`) for the chosen backend and injects `--search=<web>` when a question sets none. Domain adapters (arxiv/openalex/stackexchange) and `--since` are untouched. `questions.json` stays the single source of truth.
- Scoreboards record the backend (`· search: searxng`); the URL is logged to stderr only, never into committed boards (it's a private endpoint).
- `CLAUDE.md` bench instructions updated.

This keeps the project's "don't bundle SearXNG, ship it as an adapter" principle: nothing changes for users; only the dev bench points at a reliable backend so prompt deltas become measurable again.
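For reviewers who want the rules above in one place, here is a minimal sketch of how the resolution and argument rewriting could look. It is not the code in `bench/run.mjs`. The return shape of `resolveBackend`, the `q.args` array, the comma-separated `--search=a,b` flag format, the guidance wording, and the `boardBackendLine` helper are all assumptions made for illustration.

```javascript
// Illustrative sketch only; the real bench/run.mjs may be structured differently.

// Pick the bench's general-web backend from the environment.
function resolveBackend(env) {
  // An explicit DEEPDIVE_SEARCH always wins, e.g. "duckduckgo" to reproduce the old behavior.
  if (env.DEEPDIVE_SEARCH) {
    return { ok: true, backend: env.DEEPDIVE_SEARCH };
  }
  // Otherwise default to SearXNG, which needs an endpoint URL.
  if (env.DEEPDIVE_SEARXNG_URL) {
    return { ok: true, backend: "searxng" };
  }
  // Nothing resolvable: the caller is expected to print the guidance and exit 2.
  return {
    ok: false,
    exitCode: 2,
    message:
      "No search backend: set DEEPDIVE_SEARXNG_URL, or DEEPDIVE_SEARCH=duckduckgo for the old path.",
  };
}

// Rewrite one question's CLI args for the chosen web backend.
// ASSUMPTION: q.args is an array of flags, and adapters are comma-separated in --search=...
function questionArgs(q, web) {
  const args = q.args ?? [];
  const prefix = "--search=";
  // No --search flag on the question: inject one for the chosen backend.
  if (!args.some((a) => a.startsWith(prefix))) {
    return [...args, prefix + web];
  }
  // Otherwise swap only the duckduckgo leg; domain adapters and other flags pass through.
  return args.map((a) => {
    if (!a.startsWith(prefix)) return a;
    const legs = a.slice(prefix.length).split(",");
    return prefix + legs.map((leg) => (leg === "duckduckgo" ? web : leg)).join(",");
  });
}

// Scoreboard line (hypothetical helper): takes the backend name only,
// so the private endpoint URL cannot end up in a committed board.
function boardBackendLine(backend) {
  return `· search: ${backend}`;
}
```

Under these assumptions, `questionArgs({ args: ["--search=duckduckgo,arxiv", "--since=2024"] }, "searxng")` yields `["--search=searxng,arxiv", "--since=2024"]`. Keeping both functions pure (environment and question in, plain values out) is what makes them easy to cover with unit tests without a live backend.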
Validation
- `npm test`: 56 suites pass, incl. new coverage for `resolveBackend`, `questionArgs`, and the scoreboard backend line.
- `npm run typecheck`: clean.
- `--list` exits 0 without a backend; missing backend → guided error + exit 2.
- `factual-lookup` through the harness via SearXNG → PASS, 10 sources, 32/32 citations (1.00); confirmed the board records `search: searxng` and the URL does not leak into it.

Follow-up (not in this PR)
Unblocks resuming the `feat/synth-grounding` comparison (#97) against a stable SearXNG baseline. Separately worth considering: stop the wikipedia fallback from masking a primary-search failure as a confident "sources don't answer this" on non-recency factual questions. That's a real user-facing UX gap this exposed.