From 8125798e683fc9bc500c1eb5edcc559b79d9b940 Mon Sep 17 00:00:00 2001
From: ymstar <122004928+ymstar@users.noreply.github.com>
Date: Wed, 20 May 2026 06:17:50 +0800
Subject: [PATCH 1/5] docs: add star history into README (#166)

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 910d7801..49cf8d54 100644
--- a/README.md
+++ b/README.md
@@ -492,6 +492,16 @@ The `.codegraph/config.json` file controls indexing:
 
 **Missing symbols** — The MCP server auto-syncs on save (wait a couple seconds). Run `codegraph sync` manually if needed. Check that the file's language is supported and isn't excluded by config patterns.
 
+## Star History
+
+
+
+
+
+ Star History Chart
+
+
+
 ## License
 
 MIT

From 37cf566dcfccc6a2c8e109953a4612dba8bcae19 Mon Sep 17 00:00:00 2001
From: Colby Mchenry
Date: Tue, 19 May 2026 22:10:31 -0500
Subject: [PATCH 2/5] feat(mcp): steer agents to explore-first; fix Kotlin/Swift test detection (#191)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(mcp): steer agents to explore-first; fix Kotlin/Swift test detection

Two changes from diagnosing why Claude Code's Explore agent wasn't using
codegraph_explore on a benchmark run (37 calls / ~90k tokens via
search+Read+grep, vs a general-purpose agent that led with explore:
13 calls / ~55k tokens for the same question).

1. Tool guidance reframed across server-instructions.ts,
   instructions-template.ts, and .cursor/rules/codegraph.mdc (+ the
   explore/search tool descriptions): codegraph_explore is the workhorse
   for understanding/architecture/"how does X work" questions. Seed it
   with the key symbol names (a quick search/context first if the
   question names nothing concrete), read its output, and fill gaps with
   node/Read — instead of searching then Reading each file.
   The old "search first to find names, then explore" wording was
   short-circuiting: agents searched, got file:line locations, and Read
   them, never reaching explore.

2. isTestFile now recognizes Kotlin (*Test.kt, jvmTest/commonTest/
   androidTest source sets), Swift (*Tests.swift), and other camelCase
   test conventions, so test code is deprioritized in explore/context
   ranking. Previously only Java/JS/Python were known, letting tests
   dominate Kotlin/Swift exploration (OkHttp "trace a request" went from
   8/9 test files to surfacing
   Call.kt/OkHttpClient.kt/Request.kt/Response.kt). Capital-led matching
   keeps latest.kt/manifest.kt unflagged.

An IDF common-term down-weighting was prototyped for the cold-query case
but dropped — it was a measured no-op (the "common" terms weren't
actually common in the test indexes); the test-detection gap was the
real cause.

Co-Authored-By: Claude Opus 4.7 (1M context)

* chore(agent-eval): add agent-behavior eval harness for codegraph MCP usage

Tooling to measure how a Claude Code agent actually uses the codegraph
MCP tools on a real repo — does it lead with codegraph_explore, how many
Read/Grep follow-ups, token cost — for validating tool-guidance changes
(server-instructions, tool descriptions) against real agent behavior.

- itrun.sh drives the real interactive TUI via tmux (the faithful Explore
  path). Hardened for unattended runs: type-and-verify prompt delivery
  (the ❯ glyph is drawn ~6s before the input accepts keys), auto-accepts
  the "trust this folder" dialog, busy-detection keys on the universal
  "(Ns · …)" spinner so the pre-stream thinking phase counts as busy,
  and fails loudly instead of capturing an empty pane.
- parse-session.mjs reports the tool breakdown + token accounting (gen /
  fresh-in / cached-in / billable) from the session and subagent logs,
  consistent across main-thread and subagent runs; counts main-thread
  Bash in the grep verdict.
- run-agent.sh / parse-run.mjs are the headless stream-json complement
  (exact per-tool tokens/cost via claude -p).
- run-interactive-test.md documents how to run it and how completion is
  detected.

Co-Authored-By: Claude Opus 4.7 (1M context)

---------

Co-authored-by: Claude Opus 4.7 (1M context)
---
 .cursor/rules/codegraph.mdc            |   6 +-
 CHANGELOG.md                           |  20 ++++
 __tests__/is-test-file.test.ts         |  53 ++++++++++
 run-interactive-test.md                | 131 +++++++++++++++++++++++++
 scripts/agent-eval/itrun.sh            | 107 ++++++++++++++++++++
 scripts/agent-eval/parse-run.mjs       |  45 +++++++++
 scripts/agent-eval/parse-session.mjs   |  93 ++++++++++++++++++
 scripts/agent-eval/run-agent.sh        |  34 +++++++
 src/installer/instructions-template.ts |   6 +-
 src/mcp/server-instructions.ts         |  12 +--
 src/mcp/tools.ts                       |   6 +-
 src/search/query-utils.ts              |  65 ++++++------
 12 files changed, 534 insertions(+), 44 deletions(-)
 create mode 100644 __tests__/is-test-file.test.ts
 create mode 100644 run-interactive-test.md
 create mode 100755 scripts/agent-eval/itrun.sh
 create mode 100644 scripts/agent-eval/parse-run.mjs
 create mode 100644 scripts/agent-eval/parse-session.mjs
 create mode 100755 scripts/agent-eval/run-agent.sh

diff --git a/.cursor/rules/codegraph.mdc b/.cursor/rules/codegraph.mdc
index dac86b3a..4b6073b4 100644
--- a/.cursor/rules/codegraph.mdc
+++ b/.cursor/rules/codegraph.mdc
@@ -13,22 +13,22 @@ Use codegraph for **structural** questions — what calls what, what would break
 
 | Question | Tool |
 |---|---|
+| "How does X work? / trace X / explain a system / architecture" | `codegraph_explore` (seed with symbol names) |
 | "Where is X defined?" / "Find symbol named X" | `codegraph_search` |
 | "What calls function Y?" | `codegraph_callers` |
 | "What does Y call?" | `codegraph_callees` |
 | "What would break if I changed Z?"
| `codegraph_impact` |
 | "Show me Y's signature / source / docstring" | `codegraph_node` |
 | "Give me focused context for a task/area" | `codegraph_context` |
-| "Survey an unfamiliar module/topic" | `codegraph_explore` |
 | "What files exist under path/" | `codegraph_files` |
 | "Is the index healthy?" | `codegraph_status` |
 
 ### Rules of thumb
 
+- **`codegraph_explore` is the workhorse for understanding questions** ("how does X work", "trace…", "explain the Y system"). Feed it the key symbol/file names and read its output (line-numbered source from many files in one call). If the question names nothing concrete, do one quick `codegraph_search`/`codegraph_context` to surface the names, then explore with them. Fill gaps with `codegraph_node`/Read — don't grep-and-read your way through; that's the loop explore replaces.
+- **Delegating exploration to a subagent?** Tell it to call `codegraph_explore` first and trust the result. A generic "explore"-style agent defaults to grep+Read and treats codegraph as just a search index, throwing away the token savings.
 - **Trust codegraph results.** They come from a full AST parse. Do NOT re-verify them with grep — that's slower, less accurate, and wastes context.
 - **Don't grep first** when looking up a symbol by name. `codegraph_search` is faster and returns kind + location + signature in one call.
-- **Don't chain `codegraph_search` + `codegraph_node`** when you just want context — `codegraph_context` is one call.
-- **`codegraph_explore` is the heavy hitter** for unfamiliar areas — it returns full source from all relevant files in one call, but is token-heavy. If your harness supports parallel subagents (e.g., Claude Code's Task tool), spawn one for explore-class questions to keep main session context clean.
 - **Index lag**: the file watcher debounces ~500ms behind writes; don't re-query immediately after editing a file in the same turn.
 ### If `.codegraph/` doesn't exist
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7c32c152..57f19200 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -44,6 +44,17 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
   VS Code ~12%. Agent-trust floor still holds — the Relationships section,
   scored cluster selection, and structured-source output are all retained.
   Thanks to [@essopsp](https://github.com/essopsp) for the repro.
+- **MCP / tool guidance**: the tool descriptions and installed instructions
+  now steer agents to treat `codegraph_explore` as the workhorse for
+  understanding/architecture/"how does X work" questions — seed it with the
+  key symbol names (a quick `codegraph_search`/`codegraph_context` first if
+  the question names nothing concrete) and read its output, rather than
+  searching and then Reading each file. Diagnosed from a benchmark run where
+  Claude Code's Explore agent used `codegraph_search` + Read + grep (37 tool
+  calls, ~90k tokens) and never called `codegraph_explore`, vs a
+  general-purpose agent that led with explore (13 calls, ~55k tokens) for the
+  same VS Code question. Updated in lockstep across `server-instructions.ts`,
+  `instructions-template.ts`, and `.cursor/rules/codegraph.mdc`.
 
 ### Fixed
 - **MCP**: source-omission markers in `codegraph_explore` and
@@ -51,6 +62,15 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
   `... (trimmed) ...`, `... (truncated) ...`) instead of C-style `//`
   comments, which were misleading inside Python, Ruby, and other non-C
   fenced source blocks.
+- **Search/explore ranking**: test-file detection now recognizes Kotlin
+  (`*Test.kt`, `jvmTest/`/`commonTest/`/`androidTest/` source sets), Swift
+  (`*Tests.swift`), and other camelCase test conventions, so test code is
+  properly deprioritized in `codegraph_explore` / `codegraph_context`
+  results.
Previously only Java/JS/Python conventions were known, which let
+  test files dominate exploration of Kotlin/Swift codebases (e.g. an OkHttp
+  "trace a request" query returned 8/9 test files; now it surfaces
+  `Call.kt`, `OkHttpClient.kt`, `Request.kt`, `Response.kt`). Capital-led
+  matching keeps production files like `latest.kt` / `manifest.kt` unflagged.
 
 ## [0.7.10] - 2026-05-19
 
diff --git a/__tests__/is-test-file.test.ts b/__tests__/is-test-file.test.ts
new file mode 100644
index 00000000..e3fc6d03
--- /dev/null
+++ b/__tests__/is-test-file.test.ts
@@ -0,0 +1,53 @@
+/**
+ * isTestFile heuristic — test-file detection used to deprioritize test code in
+ * search/explore ranking.
+ *
+ * Regression coverage for the cold-query fix: the heuristic previously only
+ * knew Java/JS/Python conventions, so Kotlin (`*Test.kt`, `jvmTest/`), Swift
+ * (`*Tests.swift`), and camelCase test source-set dirs slipped through — which
+ * let OkHttp's tests flood `codegraph_explore` results on a plain-language
+ * query. The false-positive guards matter just as much: `latest.kt` /
+ * `manifest.kt` / a `RealCall.kt` production file must NOT be flagged.
+ */
+import { describe, it, expect } from 'vitest';
+import { isTestFile } from '../src/search/query-utils';
+
+describe('isTestFile', () => {
+  it('flags Kotlin test files and source sets', () => {
+    expect(isTestFile('okhttp/src/jvmTest/kotlin/okhttp3/CallTest.kt')).toBe(true);
+    expect(isTestFile('okhttp/src/commonTest/kotlin/okhttp3/CompressionInterceptorTest.kt')).toBe(true);
+    expect(isTestFile('app/src/androidTest/java/com/example/FooTest.kt')).toBe(true);
+    expect(isTestFile('module/src/integrationTest/kotlin/BarSpec.kt')).toBe(true);
+  });
+
+  it('flags Swift test files', () => {
+    expect(isTestFile('Tests/SessionTests.swift')).toBe(true);
+    expect(isTestFile('Sources/FooTest.swift')).toBe(true);
+  });
+
+  it('still flags the previously-supported conventions', () => {
+    expect(isTestFile('foo/test_bar.py')).toBe(true);
+    expect(isTestFile('pkg/bar_test.go')).toBe(true);
+    expect(isTestFile('src/foo.test.ts')).toBe(true);
+    expect(isTestFile('src/foo.spec.ts')).toBe(true);
+    expect(isTestFile('com/example/FooTest.java')).toBe(true);
+    expect(isTestFile('com/example/FooTestCase.java')).toBe(true);
+    expect(isTestFile('project/__tests__/foo.ts')).toBe(true);
+    expect(isTestFile('project/tests/foo.rb')).toBe(true);
+  });
+
+  it('does NOT flag production files that merely contain "test" lowercase', () => {
+    // The fix is capital-led so camelCase boundaries distinguish these.
+    expect(isTestFile('src/latest/loader.kt')).toBe(false);
+    expect(isTestFile('lib/manifest.kt')).toBe(false);
+    expect(isTestFile('okhttp/src/jvmMain/kotlin/okhttp3/internal/connection/RealCall.kt')).toBe(false);
+    expect(isTestFile('src/contestEntry.ts')).toBe(false);
+    expect(isTestFile('pkg/greatest.go')).toBe(false);
+  });
+
+  it('does NOT flag ordinary production source', () => {
+    expect(isTestFile('src/flask/app.py')).toBe(false);
+    expect(isTestFile('src/vs/workbench/api/common/extensionHostMain.ts')).toBe(false);
+    expect(isTestFile('okhttp/src/commonJvmAndroid/kotlin/okhttp3/OkHttpClient.kt')).toBe(false);
+  });
+});
diff --git a/run-interactive-test.md b/run-interactive-test.md
new file mode 100644
index 00000000..448c9e62
--- /dev/null
+++ b/run-interactive-test.md
@@ -0,0 +1,131 @@
+# Running the agent-behavior test (how agents actually use codegraph)
+
+This explains how to measure **how a Claude Code agent uses the codegraph MCP
+tools** on a real repo — which tools it calls (does it lead with
+`codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token
+cost. Use it when changing tool guidance (`server-instructions.ts`,
+`instructions-template.ts`, tool descriptions) or retrieval, to verify the
+change actually shifts agent behavior.
+
+Scripts live in `scripts/agent-eval/`.
+
+## Why two harnesses (read this first)
+
+| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
+|---|---|---|
+| Drives | the real TUI via tmux | `claude -p` print mode |
+| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
+| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
+| Cost | Claude Max subscription | API $ (`total_cost_usd`) |
+
+**Headless `claude -p` does NOT reproduce what users see** — it silently picks
+the general-purpose subagent, while interactive sessions delegate to the
+read-first **Explore** subagent.
So for "what does my session actually do," use
+the interactive harness. For a clean per-tool/token breakdown in one shot, use
+headless (and ask for the Explore subagent in the prompt if you want that path).
+
+## Prerequisites
+
+- **tmux 3.0+**
+- A logged-in `claude` CLI (Claude Max or API).
+- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
+  The interactive harness uses your global config, so it runs whatever
+  `codegraph` resolves to — point that at your dev build (`npm link` / the
+  symlinked global) to test local changes.
+- A target repo, cloned and indexed:
+  ```bash
+  git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
+  cd /tmp/corpus/okhttp && codegraph init -i
+  ```
+  Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
+  OkHttp (~640), VS Code (~10k).
+
+## Interactive test (the faithful one)
+
+```bash
+scripts/agent-eval/itrun.sh