From 8125798e683fc9bc500c1eb5edcc559b79d9b940 Mon Sep 17 00:00:00 2001
From: ymstar <122004928+ymstar@users.noreply.github.com>
Date: Wed, 20 May 2026 06:17:50 +0800
Subject: [PATCH 1/5] docs: add star history into README (#166)
---
README.md | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/README.md b/README.md
index 910d7801..49cf8d54 100644
--- a/README.md
+++ b/README.md
@@ -492,6 +492,16 @@ The `.codegraph/config.json` file controls indexing:
**Missing symbols** — The MCP server auto-syncs on save (wait a couple seconds). Run `codegraph sync` manually if needed. Check that the file's language is supported and isn't excluded by config patterns.
+## Star History
+
+
+
+
+
+
+
+
+
## License
MIT
From 37cf566dcfccc6a2c8e109953a4612dba8bcae19 Mon Sep 17 00:00:00 2001
From: Colby Mchenry
Date: Tue, 19 May 2026 22:10:31 -0500
Subject: [PATCH 2/5] feat(mcp): steer agents to explore-first; fix Kotlin/Swift test detection (#191)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* feat(mcp): steer agents to explore-first; fix Kotlin/Swift test detection

Two changes came out of diagnosing why Claude Code's Explore agent wasn't
using codegraph_explore on a benchmark run (37 calls / ~90k tokens via
search+Read+grep, vs a general-purpose agent that led with explore: 13
calls / ~55k tokens for the same question).

1. Tool guidance reframed across server-instructions.ts,
   instructions-template.ts, and .cursor/rules/codegraph.mdc (+ the
   explore/search tool descriptions): codegraph_explore is the workhorse
   for understanding/architecture/"how does X work" questions. Seed it with
   the key symbol names (a quick search/context first if the question names
   nothing concrete), read its output, and fill gaps with node/Read,
   instead of searching then Reading each file. The old "search first to
   find names, then explore" wording was short-circuiting: agents searched,
   got file:line locations, and Read them, never reaching explore.

2. isTestFile now recognizes Kotlin (*Test.kt, jvmTest/commonTest/
   androidTest source sets), Swift (*Tests.swift), and other camelCase test
   conventions, so test code is deprioritized in explore/context ranking.
   Previously only Java/JS/Python were known, letting tests dominate
   Kotlin/Swift exploration (OkHttp "trace a request" went from 8/9 test
   files to surfacing Call.kt/OkHttpClient.kt/Request.kt/Response.kt).
   Capital-led matching keeps latest.kt/manifest.kt unflagged.

An IDF common-term down-weighting was prototyped for the cold-query case
but dropped: it was a measured no-op (the "common" terms weren't actually
common in the test indexes); the test-detection gap was the real cause.
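
For readers skimming the message, the capital-led matching described in item 2 can be sketched roughly as below. This is only an illustration under stated assumptions: the name `isTestFileSketch`, the directory set, and the exact regular expressions are hypothetical and are not taken from `src/search/query-utils.ts`; they are chosen only to agree with the cases listed in `__tests__/is-test-file.test.ts`.

```typescript
// Illustrative sketch only: isTestFileSketch and its patterns are assumptions,
// not the code in src/search/query-utils.ts.

// Directory names that mark test code regardless of case.
const TEST_DIRS = new Set(["test", "tests", "__tests__", "spec", "specs"]);

// Capital-led suffixes: a lowercase letter or digit followed by a capitalized
// Test/Tests/TestCase/Spec. The camelCase boundary is what keeps lowercase
// names such as "latest" or "manifest" from matching.
const CAMEL_FILE = /[a-z0-9](Tests?|TestCase|Spec)$/;
const CAMEL_DIR = /[a-z0-9]Tests?$/;

function isTestFileSketch(filePath: string): boolean {
  const segments = filePath.split("/");
  const fileName = segments[segments.length - 1] ?? "";
  const dirs = segments.slice(0, -1);

  // Directory conventions: tests/, __tests__/, and camelCase source sets
  // such as jvmTest/, commonTest/, androidTest/.
  for (const dir of dirs) {
    if (TEST_DIRS.has(dir.toLowerCase())) return true;
    if (CAMEL_DIR.test(dir)) return true;
  }

  // Strip only the final extension, so "foo.test.ts" becomes "foo.test".
  const stem = fileName.replace(/\.[^.]+$/, "");

  // Lowercase conventions: test_bar.py, bar_test.go, foo.test.ts, foo.spec.ts.
  if (/^test_/.test(stem) || /_test$/.test(stem)) return true;
  if (/\.(test|spec)$/.test(stem)) return true;

  // Capital-led conventions: CallTest.kt, SessionTests.swift, FooTestCase.java.
  return CAMEL_FILE.test(stem);
}
```

Under these assumed patterns, `okhttp/src/jvmTest/kotlin/okhttp3/CallTest.kt` and `Sources/FooTest.swift` are flagged, while `lib/manifest.kt`, `src/latest/loader.kt`, and `src/contestEntry.ts` are not, because none of them has a capitalized `Test` after a lowercase letter.
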
Co-Authored-By: Claude Opus 4.7 (1M context)
* chore(agent-eval): add agent-behavior eval harness for codegraph MCP usage

Tooling to measure how a Claude Code agent actually uses the codegraph
MCP tools on a real repo (does it lead with codegraph_explore, how many
Read/Grep follow-ups, token cost), for validating tool-guidance changes
(server-instructions, tool descriptions) against real agent behavior.

- itrun.sh drives the real interactive TUI via tmux (the faithful
  Explore path). Hardened for unattended runs: type-and-verify prompt
  delivery (the ❯ glyph is drawn ~6s before the input accepts keys),
  auto-accepts the "trust this folder" dialog, busy-detection keys on
  the universal "(Ns · …)" spinner so the pre-stream thinking phase
  counts as busy, and fails loudly instead of capturing an empty pane.
- parse-session.mjs reports the tool breakdown + token accounting
  (gen / fresh-in / cached-in / billable) from the session and subagent
  logs, consistent across main-thread and subagent runs; counts
  main-thread Bash in the grep verdict.
- run-agent.sh / parse-run.mjs are the headless stream-json complement
  (exact per-tool tokens/cost via claude -p).
- run-interactive-test.md documents how to run it and how completion is
  detected.
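
The kind of tool breakdown described above might be summarized along the following lines. This is a hedged sketch, not what `parse-session.mjs` does: the function `summarizeToolCalls`, the `ToolSummary` shape, and the assumption that the input is an ordered list of tool names already extracted from a session log are all hypothetical. "Led with explore" is interpreted here as "an explore call occurred before any Read or Grep call", which still permits the quick search/context step the guidance allows.

```typescript
// Hypothetical sketch: names and input shape are assumptions, not the
// actual parse-session.mjs implementation or log schema.

interface ToolSummary {
  counts: Record<string, number>; // calls per tool name
  ledWithExplore: boolean;        // explore seen before any Read/Grep
  readGrepFollowUps: number;      // Read/Grep calls after the first explore
}

// Suffix match so a prefixed tool name would also be recognized.
const isExplore = (name: string): boolean => name.endsWith("codegraph_explore");
const isReadOrGrep = (name: string): boolean => name === "Read" || name === "Grep";

function summarizeToolCalls(toolNames: string[]): ToolSummary {
  const counts: Record<string, number> = {};
  for (const name of toolNames) {
    counts[name] = (counts[name] ?? 0) + 1;
  }

  const firstExplore = toolNames.findIndex(isExplore);
  const firstReadGrep = toolNames.findIndex(isReadOrGrep);

  const ledWithExplore =
    firstExplore !== -1 && (firstReadGrep === -1 || firstExplore < firstReadGrep);

  // Follow-ups are only meaningful once an explore call has happened.
  const readGrepFollowUps =
    firstExplore === -1
      ? 0
      : toolNames.slice(firstExplore + 1).filter(isReadOrGrep).length;

  return { counts, ledWithExplore, readGrepFollowUps };
}
```

A run that goes search, Read, Grep and never reaches explore would report `ledWithExplore: false`, which is the short-circuit pattern the guidance change is meant to remove.
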
Co-Authored-By: Claude Opus 4.7 (1M context)
---------
Co-authored-by: Claude Opus 4.7 (1M context)
---
.cursor/rules/codegraph.mdc | 6 +-
CHANGELOG.md | 20 ++++
__tests__/is-test-file.test.ts | 53 ++++++++++
run-interactive-test.md | 131 +++++++++++++++++++++++++
scripts/agent-eval/itrun.sh | 107 ++++++++++++++++++++
scripts/agent-eval/parse-run.mjs | 45 +++++++++
scripts/agent-eval/parse-session.mjs | 93 ++++++++++++++++++
scripts/agent-eval/run-agent.sh | 34 +++++++
src/installer/instructions-template.ts | 6 +-
src/mcp/server-instructions.ts | 12 +--
src/mcp/tools.ts | 6 +-
src/search/query-utils.ts | 65 ++++++------
12 files changed, 534 insertions(+), 44 deletions(-)
create mode 100644 __tests__/is-test-file.test.ts
create mode 100644 run-interactive-test.md
create mode 100755 scripts/agent-eval/itrun.sh
create mode 100644 scripts/agent-eval/parse-run.mjs
create mode 100644 scripts/agent-eval/parse-session.mjs
create mode 100755 scripts/agent-eval/run-agent.sh
diff --git a/.cursor/rules/codegraph.mdc b/.cursor/rules/codegraph.mdc
index dac86b3a..4b6073b4 100644
--- a/.cursor/rules/codegraph.mdc
+++ b/.cursor/rules/codegraph.mdc
@@ -13,22 +13,22 @@ Use codegraph for **structural** questions — what calls what, what would break
| Question | Tool |
|---|---|
+| "How does X work? / trace X / explain a system / architecture" | `codegraph_explore` (seed with symbol names) |
| "Where is X defined?" / "Find symbol named X" | `codegraph_search` |
| "What calls function Y?" | `codegraph_callers` |
| "What does Y call?" | `codegraph_callees` |
| "What would break if I changed Z?" | `codegraph_impact` |
| "Show me Y's signature / source / docstring" | `codegraph_node` |
| "Give me focused context for a task/area" | `codegraph_context` |
-| "Survey an unfamiliar module/topic" | `codegraph_explore` |
| "What files exist under path/" | `codegraph_files` |
| "Is the index healthy?" | `codegraph_status` |
### Rules of thumb
+- **`codegraph_explore` is the workhorse for understanding questions** ("how does X work", "trace…", "explain the Y system"). Feed it the key symbol/file names and read its output (line-numbered source from many files in one call). If the question names nothing concrete, do one quick `codegraph_search`/`codegraph_context` to surface the names, then explore with them. Fill gaps with `codegraph_node`/Read — don't grep-and-read your way through; that's the loop explore replaces.
+- **Delegating exploration to a subagent?** Tell it to call `codegraph_explore` first and trust the result. A generic "explore"-style agent defaults to grep+Read and treats codegraph as just a search index, throwing away the token savings.
- **Trust codegraph results.** They come from a full AST parse. Do NOT re-verify them with grep — that's slower, less accurate, and wastes context.
- **Don't grep first** when looking up a symbol by name. `codegraph_search` is faster and returns kind + location + signature in one call.
-- **Don't chain `codegraph_search` + `codegraph_node`** when you just want context — `codegraph_context` is one call.
-- **`codegraph_explore` is the heavy hitter** for unfamiliar areas — it returns full source from all relevant files in one call, but is token-heavy. If your harness supports parallel subagents (e.g., Claude Code's Task tool), spawn one for explore-class questions to keep main session context clean.
- **Index lag**: the file watcher debounces ~500ms behind writes; don't re-query immediately after editing a file in the same turn.
### If `.codegraph/` doesn't exist
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7c32c152..57f19200 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -44,6 +44,17 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
VS Code ~12%. Agent-trust floor still holds — the Relationships section,
scored cluster selection, and structured-source output are all retained.
Thanks to [@essopsp](https://github.com/essopsp) for the repro.
+- **MCP / tool guidance**: the tool descriptions and installed instructions
+ now steer agents to treat `codegraph_explore` as the workhorse for
+ understanding/architecture/"how does X work" questions — seed it with the
+ key symbol names (a quick `codegraph_search`/`codegraph_context` first if
+ the question names nothing concrete) and read its output, rather than
+ searching and then Reading each file. Diagnosed from a benchmark run where
+ Claude Code's Explore agent used `codegraph_search` + Read + grep (37 tool
+ calls, ~90k tokens) and never called `codegraph_explore`, vs a
+ general-purpose agent that led with explore (13 calls, ~55k tokens) for the
+ same VS Code question. Updated in lockstep across `server-instructions.ts`,
+ `instructions-template.ts`, and `.cursor/rules/codegraph.mdc`.
### Fixed
- **MCP**: source-omission markers in `codegraph_explore` and
@@ -51,6 +62,15 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
`... (trimmed) ...`, `... (truncated) ...`) instead of C-style `//`
comments, which were misleading inside Python, Ruby, and other non-C
fenced source blocks.
+- **Search/explore ranking**: test-file detection now recognizes Kotlin
+ (`*Test.kt`, `jvmTest/`/`commonTest/`/`androidTest/` source sets), Swift
+ (`*Tests.swift`), and other camelCase test conventions, so test code is
+ properly deprioritized in `codegraph_explore` / `codegraph_context`
+ results. Previously only Java/JS/Python conventions were known, which let
+ test files dominate exploration of Kotlin/Swift codebases (e.g. an OkHttp
+ "trace a request" query returned 8/9 test files; now it surfaces
+ `Call.kt`, `OkHttpClient.kt`, `Request.kt`, `Response.kt`). Capital-led
+ matching keeps production files like `latest.kt` / `manifest.kt` unflagged.
## [0.7.10] - 2026-05-19
diff --git a/__tests__/is-test-file.test.ts b/__tests__/is-test-file.test.ts
new file mode 100644
index 00000000..e3fc6d03
--- /dev/null
+++ b/__tests__/is-test-file.test.ts
@@ -0,0 +1,53 @@
+/**
+ * isTestFile heuristic — test-file detection used to deprioritize test code in
+ * search/explore ranking.
+ *
+ * Regression coverage for the cold-query fix: the heuristic previously only
+ * knew Java/JS/Python conventions, so Kotlin (`*Test.kt`, `jvmTest/`), Swift
+ * (`*Tests.swift`), and camelCase test source-set dirs slipped through — which
+ * let OkHttp's tests flood `codegraph_explore` results on a plain-language
+ * query. The false-positive guards matter just as much: `latest.kt` /
+ * `manifest.kt` / a `RealCall.kt` production file must NOT be flagged.
+ */
+import { describe, it, expect } from 'vitest';
+import { isTestFile } from '../src/search/query-utils';
+
+describe('isTestFile', () => {
+ it('flags Kotlin test files and source sets', () => {
+ expect(isTestFile('okhttp/src/jvmTest/kotlin/okhttp3/CallTest.kt')).toBe(true);
+ expect(isTestFile('okhttp/src/commonTest/kotlin/okhttp3/CompressionInterceptorTest.kt')).toBe(true);
+ expect(isTestFile('app/src/androidTest/java/com/example/FooTest.kt')).toBe(true);
+ expect(isTestFile('module/src/integrationTest/kotlin/BarSpec.kt')).toBe(true);
+ });
+
+ it('flags Swift test files', () => {
+ expect(isTestFile('Tests/SessionTests.swift')).toBe(true);
+ expect(isTestFile('Sources/FooTest.swift')).toBe(true);
+ });
+
+ it('still flags the previously-supported conventions', () => {
+ expect(isTestFile('foo/test_bar.py')).toBe(true);
+ expect(isTestFile('pkg/bar_test.go')).toBe(true);
+ expect(isTestFile('src/foo.test.ts')).toBe(true);
+ expect(isTestFile('src/foo.spec.ts')).toBe(true);
+ expect(isTestFile('com/example/FooTest.java')).toBe(true);
+ expect(isTestFile('com/example/FooTestCase.java')).toBe(true);
+ expect(isTestFile('project/__tests__/foo.ts')).toBe(true);
+ expect(isTestFile('project/tests/foo.rb')).toBe(true);
+ });
+
+ it('does NOT flag production files that merely contain "test" lowercase', () => {
+ // The fix is capital-led so camelCase boundaries distinguish these.
+ expect(isTestFile('src/latest/loader.kt')).toBe(false);
+ expect(isTestFile('lib/manifest.kt')).toBe(false);
+ expect(isTestFile('okhttp/src/jvmMain/kotlin/okhttp3/internal/connection/RealCall.kt')).toBe(false);
+ expect(isTestFile('src/contestEntry.ts')).toBe(false);
+ expect(isTestFile('pkg/greatest.go')).toBe(false);
+ });
+
+ it('does NOT flag ordinary production source', () => {
+ expect(isTestFile('src/flask/app.py')).toBe(false);
+ expect(isTestFile('src/vs/workbench/api/common/extensionHostMain.ts')).toBe(false);
+ expect(isTestFile('okhttp/src/commonJvmAndroid/kotlin/okhttp3/OkHttpClient.kt')).toBe(false);
+ });
+});
diff --git a/run-interactive-test.md b/run-interactive-test.md
new file mode 100644
index 00000000..448c9e62
--- /dev/null
+++ b/run-interactive-test.md
@@ -0,0 +1,131 @@
+# Running the agent-behavior test (how agents actually use codegraph)
+
+This document explains how to measure **how a Claude Code agent uses the codegraph MCP
+tools** on a real repo — which tools it calls (does it lead with
+`codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token
+cost. Use it when changing tool guidance (`server-instructions.ts`,
+`instructions-template.ts`, tool descriptions) or retrieval, to verify the
+change actually shifts agent behavior.
+
+Scripts live in `scripts/agent-eval/`.
+
+## Why two harnesses (read this first)
+
+| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
+|---|---|---|
+| Drives | the real TUI via tmux | `claude -p` print mode |
+| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
+| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
+| Cost | Claude Max subscription | API $ (`total_cost_usd`) |
+
+**Headless `claude -p` does NOT reproduce what users see** — it silently picks
+the general-purpose subagent, while interactive sessions delegate to the
+read-first **Explore** subagent. So for "what does my session actually do," use
+the interactive harness. For a clean per-tool/token breakdown in one shot, use
+headless (and ask for the Explore subagent in the prompt if you want that path).
+
+## Prerequisites
+
+- **tmux 3.0+**
+- A logged-in `claude` CLI (Claude Max or API).
+- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
+ The interactive harness uses your global config, so it runs whatever
+ `codegraph` resolves to — point that at your dev build (`npm link` / the
+ symlinked global) to test local changes.
+- A target repo, cloned and indexed:
+ ```bash
+ git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
+ cd /tmp/corpus/okhttp && codegraph init -i
+ ```
+ Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
+ OkHttp (~640), VS Code (~10k).
+
+## Interactive test (the faithful one)
+
+```bash
+scripts/agent-eval/itrun.sh