bench: run searches through SearXNG, not DuckDuckGo (#97) by askalf · Pull Request #102 · askalf/deepdive

askalf · 2026-06-15T01:26:51Z

What

Switch the quality bench (bench/run.mjs) to run its general-web searches through SearXNG instead of DuckDuckGo. The product default stays DuckDuckGo — this is a bench-only override.

Why (closes the diagnosis in #97)

DuckDuckGo HTML rate-limits the test box's IP and silently degrades to the wikipedia fallback. For a question Wikipedia can't answer — e.g. factual-lookup's "current rate limits for the Claude API" — the fallback returns Wikipedia LLM articles with no rate-limit tables, and the synth model correctly refuses to cite them. That refusal reads exactly like an LLM "hedge," which is what made #97's prompt-tuning effort unmeasurable.

Re-running the gate on 2026-06-14 nailed it down:

| backend | result |
| --- | --- |
| DuckDuckGo (stock) | 1/3 cite; 2 runs logged `primary search produced nothing — retrying the round via wikipedia`, 100% Wikipedia sources |
| SearXNG | 3/3 cite; support 1.00 / 0.81 / 1.00, zero fallback |

So the bench's nondeterminism was the search layer, not claude-sonnet-4-6. SearXNG egresses via the box's ProtonVPN, not the DDG-rate-limited IP, so it's stable.
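
The diagnosis above hinges on telling a genuine no-cite answer apart from a run that silently fell back to Wikipedia. A minimal sketch of that check follows, assuming each bench run is available as an object with its captured log text and a cited flag; the `summarizeRuns` name and the run shape are hypothetical, and only the log substring comes from the runs quoted above.

```javascript
// Substring of the fallback log line quoted above.
const FALLBACK_MARKER = "retrying the round via wikipedia";

// Hypothetical helper: runs is an array of { log: string, cited: boolean }.
// Counts how many runs cited sources and how many hit the wikipedia fallback.
function summarizeRuns(runs) {
  const fellBack = runs.filter((r) => r.log.includes(FALLBACK_MARKER)).length;
  const cited = runs.filter((r) => r.cited).length;
  return { total: runs.length, cited, fellBack };
}
```

A nonzero `fellBack` count marks the gate result as a search-layer problem rather than a model or prompt problem.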

How

- resolveBackend(env): explicit DEEPDIVE_SEARCH wins (set =duckduckgo to reproduce old behavior); otherwise SearXNG, which needs DEEPDIVE_SEARXNG_URL. No backend resolvable → the harness exits 2 with guidance (instead of silently running the flaky DDG path).
- questionArgs(q, web): swaps the duckduckgo leg of multi-adapter questions (academic, niche-ops) for the chosen backend and injects --search=<web> when a question sets none. Domain adapters (arxiv/openalex/stackexchange) and --since are untouched. questions.json stays the single source of truth.
- Scoreboards now record the backend (· search: `searxng`); the URL is logged to stderr only, never into committed boards (it's a private endpoint).
- CLAUDE.md bench instructions updated.
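
The two helpers described above might look roughly like the sketch below. This is not the code in bench/run.mjs: the return shape of `resolveBackend`, the `q.args` array, and the comma-separated `--search=` value for multi-adapter questions are all assumptions made for illustration.

```javascript
// Hypothetical sketch; the real bench/run.mjs may differ in shapes and wording.

// Pick the bench's general-web backend from environment variables.
function resolveBackend(env) {
  // An explicit DEEPDIVE_SEARCH always wins (e.g. "duckduckgo").
  if (env.DEEPDIVE_SEARCH) {
    return { web: env.DEEPDIVE_SEARCH, error: null };
  }
  // Otherwise default to SearXNG, which needs an endpoint URL.
  if (env.DEEPDIVE_SEARXNG_URL) {
    return { web: "searxng", error: null };
  }
  // Nothing resolvable: the caller would print this and exit with code 2.
  return {
    web: null,
    error:
      "No search backend resolvable: set DEEPDIVE_SEARXNG_URL, " +
      "or DEEPDIVE_SEARCH=duckduckgo to reproduce the old behavior.",
  };
}

// Build CLI args for one question (assumed shape: { args: string[] }).
function questionArgs(q, web) {
  const args = q.args ?? [];
  let sawSearch = false;
  const out = args.map((arg) => {
    // Leave everything except --search (e.g. --since) untouched.
    if (!arg.startsWith("--search=")) return arg;
    sawSearch = true;
    // Assumed format: comma-separated adapter names.
    const adapters = arg.slice("--search=".length).split(",");
    // Swap only the duckduckgo leg; domain adapters stay as they are.
    const swapped = adapters.map((a) => (a === "duckduckgo" ? web : a));
    return `--search=${swapped.join(",")}`;
  });
  // Inject the backend when the question sets no --search at all.
  return sawSearch ? out : [`--search=${web}`, ...out];
}
```

Returning an error value instead of exiting inside `resolveBackend` keeps the helper easy to unit-test; the harness entry point can turn a non-null `error` into the guided message and exit code 2.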

This keeps the project's "don't bundle SearXNG, ship it as an adapter" principle — nothing changes for users; only the dev bench points at a reliable backend so prompt deltas become measurable again.

Validation

- npm test: 56 suites pass, incl. new coverage for resolveBackend, questionArgs, and the scoreboard backend line.
- npm run typecheck: clean.
- No-LLM control paths: --list exits 0 without a backend; missing backend → guided error + exit 2.
- End-to-end: factual-lookup through the harness via SearXNG → PASS, 10 sources, 32/32 citations (1.00); confirmed the board records search: searxng and the URL does not leak into it.
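
The "records the backend, never the URL" property checked in the last item can be illustrated with a small sketch. The function names, the header layout, and the stderr message wording are hypothetical; only the `· search:` label and the stderr-only rule come from the description above.

```javascript
// Hypothetical: header line for a committed scoreboard. It takes only the
// backend name, so the private endpoint URL cannot end up in the board.
function renderBoardHeader(title, web) {
  return `${title} · search: \`${web}\``;
}

// Hypothetical: report the backend (and URL, if any) to stderr only.
// The writer is injectable so tests can capture output.
function reportBackend(web, url, write = (s) => process.stderr.write(s)) {
  const suffix = url ? ` (${url})` : "";
  write(`bench search backend: ${web}${suffix}\n`);
}
```

Keeping the URL out of the board renderer's parameters makes a leak a structural impossibility rather than something a reviewer has to watch for.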

Follow-up (not in this PR)

Unblocks resuming the feat/synth-grounding comparison (#97) against a stable SearXNG baseline. Separately worth considering: stop the wikipedia fallback from masking a primary-search failure as a confident "sources don't answer this" on non-recency factual questions — that's a real user-facing UX gap this exposed.

DuckDuckGo HTML rate-limits the test box's IP and silently degrades to the wikipedia fallback. For questions Wikipedia can't answer (e.g. factual-lookup's "Claude API rate limits"), that turns a search failure into a confident "the sources don't contain it" answer that reads like an LLM hedge, making prompt deltas unmeasurable. Stock factual-lookup scored 1/3 on DDG vs 3/3 on SearXNG with the same prompt (issue #97).

The bench now routes its general-web leg through SearXNG via DEEPDIVE_SEARXNG_URL. The product default stays DuckDuckGo (CLAUDE.md: "ship SearXNG as an adapter, don't bundle it"); this override is bench-only.

resolveBackend() lets an explicit DEEPDIVE_SEARCH win (set =duckduckgo to reproduce the old behavior) and errors with guidance when no backend is resolvable. questionArgs() swaps the duckduckgo leg of multi-adapter questions (academic, niche-ops) for the chosen backend and injects --search when absent, leaving domain adapters and --since untouched. Scoreboards record the backend (URL kept to stderr, never committed).

Tests cover resolveBackend + questionArgs + the scoreboard line. Validated e2e: factual-lookup PASS via SearXNG, 32/32 citations (1.00).

askalf merged commit c5160ab into master Jun 15, 2026
5 checks passed

askalf deleted the feat/bench-searxng branch June 15, 2026 01:29

askalf merged 1 commit into master from feat/bench-searxng
