diff --git a/docs/diagrams/architecture_overview.drawio b/docs/diagrams/architecture_overview.drawio
new file mode 100644
index 0000000..0997fa9
--- /dev/null
+++ b/docs/diagrams/architecture_overview.drawio
@@ -0,0 +1,279 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/architecture_overview.md b/docs/diagrams/architecture_overview.md
new file mode 100644
index 0000000..343df10
--- /dev/null
+++ b/docs/diagrams/architecture_overview.md
@@ -0,0 +1,458 @@
+# KnowCode — System Architecture
+
+> Textual narration of [`architecture_overview.drawio`](architecture_overview.drawio).
+> Every component, relationship, and label in the draw.io file is described here in full.
+
+---
+
+## Overview
+
+KnowCode is a code intelligence system that parses a codebase into a semantic knowledge graph, indexes it with hybrid BM25 + vector search, and exposes that intelligence through four distinct interfaces: a CLI, a REST API, an MCP server, and an Agent Gateway. The system is structured into five horizontal layers plus a separately deployable Agent Gateway microservice.
+
+---
+
+## Layer 0 — User Interfaces
+
+All user-facing entry points sit in this layer. Every interface ultimately delegates to the Service Layer beneath it.
+
+### CLI (`cli.py`, click framework)
+
+The command-line interface exposes eleven commands:
+
+| Command | Purpose |
+|---|---|
+| `analyze` | Scan a directory, build knowledge graph, and auto-build semantic index |
+| `index` | (Re)build the semantic index from an existing graph |
+| `query` | Lexical query: callers, callees, dependencies, or search |
+| `context` | Generate a task-aware context bundle for an entity |
+| `semantic-search` | Natural-language search over embeddings |
+| `export` | Export the knowledge graph as Markdown documentation |
+| `stats` | Print entity and relationship counts |
+| `server` | Start the FastAPI REST server (optionally with `--watch`) |
+| `history` | Show git commit history or entity change history |
+| `ask` | Answer a question using the LLM Agent |
+| `mcp-server` | Start the MCP server over STDIO |
+
+### FastAPI REST API (`:8000`, uvicorn)
+
+Eleven endpoints grouped by rate-limit tier:
+
+**Standard (60 req/min):**
+- `GET /api/v1/health` — liveness check
+- `GET /api/v1/stats` — entity/relationship counts
+- `GET /api/v1/search?q=` — lexical entity search
+- `GET /api/v1/context?target=&task_type=` — context bundle for a named entity
+- `GET /api/v1/entities/{entity_id}` — raw entity detail
+- `GET /api/v1/callers/{entity_id}` — direct callers
+- `GET /api/v1/callees/{entity_id}` — direct callees
+- `POST /api/v1/context/query` — semantic query with retrieval orchestration
+- `POST /api/v1/reload` — reload KnowledgeStore from disk
+
+**Expensive (10 req/min):**
+- `GET /api/v1/trace_calls/{entity_id}?direction=&depth=` — multi-hop BFS traversal
+- `GET /api/v1/impact/{entity_id}?max_depth=` — transitive impact analysis
+
+### MCP Server (STDIO, JSON-RPC 2.0)
+
+Used by Claude Desktop and compatible IDEs. Exposes four tools:
+
+1. `search_codebase(query, limit=10)`
+2. `get_entity_context(entity_id, task_type, max_tokens)`
+3. `trace_calls(entity_id, direction, depth)`
+4. `retrieve_context_for_query(query, task_type, max_tokens, limit_entities, expand_deps, verbosity)`
+
+### Agent Gateway (FastAPI `:8081`)
+
+A separately deployable microservice (in `apps/agent-gateway/`) that proxies to the KnowCode REST API and wraps it in an LLM-driven tool-use loop. Its own endpoints:
+
+- `GET /health` — gateway liveness
+- `GET /ready` — checks KnowCode + LiteLLM connectivity
+- `GET /api/v1/config` — current gateway configuration
+- `GET /api/v1/tools` — list of available tools (from OpenAPI translation)
+- `POST /api/v1/chat` — submit a message; returns answer + tool execution records
+
+### API Rate Limiter (`rate_limit.py`, slowapi, IP-keyed)
+
+Attached to the FastAPI app as middleware. Two tiers:
+- **Standard:** 60 requests/minute — all endpoints except trace and impact
+- **Expensive:** 10 requests/minute — `trace_calls`, `impact`
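+
+A minimal sketch of how a two-tier slowapi setup like this is typically wired (decorator placement and handler registration in `rate_limit.py` may differ):
+
+```python
+from fastapi import FastAPI, Request
+from slowapi import Limiter, _rate_limit_exceeded_handler
+from slowapi.errors import RateLimitExceeded
+from slowapi.util import get_remote_address
+
+limiter = Limiter(key_func=get_remote_address)   # IP-keyed
+app = FastAPI()
+app.state.limiter = limiter
+app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
+
+@app.get("/api/v1/search")
+@limiter.limit("60/minute")                      # standard tier
+async def search(request: Request, q: str):
+    ...
+
+@app.get("/api/v1/impact/{entity_id}")
+@limiter.limit("10/minute")                      # expensive tier
+async def impact(request: Request, entity_id: str, max_depth: int | None = None):
+    ...
+```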
+
+---
+
+## Layer 1 — Service Layer
+
+### `KnowCodeService` (`service.py`)
+
+The single central orchestrator. All interfaces call this class. Key public methods:
+
+| Method | What it does |
+|---|---|
+| `analyze(dir, output, temporal, coverage)` | Builds knowledge graph via `GraphBuilder` → saves JSON → auto-calls `_build_index()`. Returns stats dict. |
+| `ensure_store()` / `ensure_index()` | Build store or index only if not already present on disk |
+| `get_indexer()` | Lazy-init `Indexer(embedding_provider)`, optionally load existing index |
+| `get_search_engine()` | Lazy-init `SearchEngine(chunk_repo, embedding_provider, HybridIndex, store)` |
+| `retrieve_context_for_query(query, max_tokens, task_type, limit_entities, expand_deps, verbosity)` | Delegates to `RetrievalOrchestrator` |
+| `search(pattern)` | Lexical entity search on `KnowledgeStore` |
+| `get_context(target, max_tokens, task_type)` | Single-entity context bundle via `ContextSynthesizer` |
+| `get_callers(id)` / `get_callees(id)` | Graph traversal shortcuts |
+| `get_entity_details(id)` | Raw entity dict |
+| `get_stats()` | Entity/relationship/chunk/vector counts |
+| `reload()` | Clears in-memory `_store`, re-reads from disk on next access |
+
+The `store` property is lazy: it loads `KnowledgeStore` from disk on first access and caches it as `_store`.
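+
+In sketch form (the stub `KnowledgeStore` stands in for the real class; `service.py` itself differs in detail):
+
+```python
+import json
+from pathlib import Path
+from typing import Optional
+
+class KnowledgeStore:
+    """Stub standing in for storage/knowledge_store.py."""
+    def __init__(self, data: dict):
+        self.data = data
+
+    @classmethod
+    def load(cls, path: Path) -> "KnowledgeStore":
+        return cls(json.loads(path.read_text()))
+
+class KnowCodeService:
+    def __init__(self, store_path: Path):
+        self.store_path = store_path
+        self._store: Optional[KnowledgeStore] = None   # filled on first access
+
+    @property
+    def store(self) -> KnowledgeStore:
+        if self._store is None:                        # lazy load + cache
+            self._store = KnowledgeStore.load(self.store_path)
+        return self._store
+
+    def reload(self) -> None:
+        self._store = None                             # next .store access re-reads disk
+```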
+
+---
+
+## Layer 2 — Core Processing Pipelines
+
+### Indexing Pipeline
+
+Three components form a linear chain: **GraphBuilder → Chunker → Indexer**.
+
+**`GraphBuilder` (`graph_builder.py`)**
+
+- `build_from_directory(root_dir, additional_ignores, analyze_temporal, coverage_path)` — orchestrates the full scan:
+ 1. Calls `Scanner.scan(root_dir)` to discover files (applying `.gitignore` via pathspec)
+ 2. For each `FileInfo`, calls `_parse_file()` which selects the correct parser by language
+ 3. Accumulates `ParseResult` objects via `_merge_result()`
+ 4. After all files: calls `_resolve_references()` to wire cross-file `CALLS`, `IMPORTS`, `INHERITS` relationships
+ 5. Optionally runs `TemporalAnalyzer` (git history) and `CoverageProcessor` (Cobertura XML)
+- Exposes: `get_entity()`, `get_entities_by_kind()`, `search_entities()`, `stats()`
+
+**`Chunker` (`chunker.py`)**
+
+- `process_parse_result(result)` — splits each entity into overlapping `CodeChunk` objects:
+ - Module header chunks (file-level docstring + metadata)
+ - Import block chunk
+ - Per-entity chunks (signature + docstring + body)
+ - Each chunk carries BM25 tokenized `tokens[]` list
+
+**`Indexer` (`indexer.py`)**
+
+- `index_directory(directory)` — runs its own internal scan+parse+chunk+embed pipeline end-to-end
+- `index_file(file_path)` — incremental re-index of a single file (used by `BackgroundIndexer`)
+- `save(index_path)` — writes `chunks.json`, `vectors.index`, `vectors.json` under `knowcode_index/`
+- `load(index_path)` — restores from disk
+
+> **Note:** `KnowCodeService.analyze()` calls `GraphBuilder` for the knowledge graph, then separately calls `_build_index()` which creates a new `Indexer` that scans again. Both pipelines run during `knowcode analyze`.
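+
+The standalone pipeline can also be driven directly with the calls named above; a sketch (the import paths are assumptions about the package layout, not verified module names):
+
+```python
+# Hypothetical import paths -- adjust to the real package layout.
+from knowcode.config import AppConfig
+from knowcode.llm.embedding import create_embedding_provider
+from knowcode.indexing.indexer import Indexer
+
+config = AppConfig.load()
+provider = create_embedding_provider(config)
+
+indexer = Indexer(provider)
+indexer.index_directory("./src")         # internal scan -> parse -> chunk -> embed
+indexer.save("./src/knowcode_index")     # chunks.json, vectors.index, vectors.json
+```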
+
+### Retrieval Pipeline
+
+Five components: **QueryClassifier → HybridIndex → SearchEngine → Reranker → expand_dependencies**.
+
+**`QueryClassifier` (`query_classifier.py`)**
+
+- `classify_query(query)` → `(TaskType, confidence: float)`
+- Uses regex pattern matching with weighted scoring across five task types: `EXPLAIN`, `DEBUG`, `EXTEND`, `REVIEW`, `LOCATE`
+- Returns `GENERAL` with confidence 0.0 when no patterns match
+- Also provides `get_prompt_template(task_type)` — task-specific LLM system prompt strings
+
+**`HybridIndex` (`hybrid_index.py`)**
+
+- `search(query_text, query_vector, limit)` → `list[(CodeChunk, score)]`
+- Combines:
+ - BM25 lexical search on `ChunkRepository` token lists
+ - FAISS dense similarity search on `VectorStore` (cosine via `IndexFlatIP` with normalized vectors)
+ - Merges and normalizes scores from both retrieval modes
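+
+The exact merge formula is not spelled out here; a reasonable reading of "merges and normalizes scores" is min-max normalization per mode followed by summation, sketched below:
+
+```python
+def merge_scores(bm25: dict, dense: dict) -> list[tuple[str, float]]:
+    """Combine two {chunk_id: score} dicts into one normalized ranking.
+
+    Sketch only -- hybrid_index.py may weight the two modes differently.
+    """
+    def normalize(scores: dict) -> dict:
+        if not scores:
+            return {}
+        lo, hi = min(scores.values()), max(scores.values())
+        span = (hi - lo) or 1.0
+        return {k: (v - lo) / span for k, v in scores.items()}
+
+    bm25_n, dense_n = normalize(bm25), normalize(dense)
+    merged = {cid: bm25_n.get(cid, 0.0) + dense_n.get(cid, 0.0)
+              for cid in set(bm25_n) | set(dense_n)}
+    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
+```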
+
+**`SearchEngine` (`search_engine.py`)**
+
+- `search_scored(query, limit, expand_deps)` → `list[ScoredChunk]` — the full pipeline:
+ 1. `embedding_provider.embed_single(query)` → query vector
+ 2. `hybrid_index.search(query, query_vector, limit×2)`
+ 3. `reranker.rerank(query, results, top_k=limit)`
+ 4. `expand_dependencies()` for each top result
+- `search(query, limit, expand_deps)` → `list[CodeChunk]` (strips scores)
+- `ScoredChunk` carries `{chunk, score, source: "retrieved"|"dependency"}`
+
+**`Reranker` (`reranker.py`)**
+
+- `rerank(query, chunks, top_k)` → `list[(CodeChunk, score)]`
+- **Primary path:** VoyageAI cross-encoder (`rerank-2.5` model via `voyage_client.rerank()`)
+- **Fallback path** (if VoyageAI unavailable): signal-based scoring:
+ - `boost_documented`: ×1.2 if chunk has docstring
+ - `boost_recent`: ×1.1 if last-modified within 7 days
+ - Query-in-content: ×1.5 if query string appears in chunk text
+ - Exact kind match: ×2.0
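+
+A sketch of the fallback scorer using the multipliers listed above (field names on `chunk` are illustrative, not the real `CodeChunk` attributes; the "exact kind match" test is one interpretation):
+
+```python
+from datetime import datetime, timedelta
+
+def fallback_score(chunk: dict, query: str, base_score: float) -> float:
+    score = base_score
+    if chunk.get("docstring"):
+        score *= 1.2                                            # boost_documented
+    modified = chunk.get("last_modified")
+    if modified and datetime.now() - modified < timedelta(days=7):
+        score *= 1.1                                            # boost_recent
+    if query.lower() in chunk.get("content", "").lower():
+        score *= 1.5                                            # query appears in content
+    kind = chunk.get("kind", "")
+    if kind and kind.lower() in query.lower():                  # exact kind match
+        score *= 2.0
+    return score
+```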
+
+**`expand_dependencies` (`completeness.py`)**
+
+- Takes a `CodeChunk` and expands to include its callees (up to `max_depth=1`)
+- Uses `chunk_repo.get_by_entity()` + `knowledge_store.get_callees()`
+- Marks expanded chunks with `source="dependency"`
+
+### `RetrievalOrchestrator` (`retrieval/orchestrator.py`)
+
+Coordinates the full end-to-end retrieval flow:
+
+1. Validate store + index exist
+2. `classify_query()` → resolve task type (override if caller specified one)
+3. `get_search_engine()` → validate index compatibility (embedding dimension + model)
+4. `engine.search_scored()` → semantic retrieval (falls back to lexical on any exception)
+5. For each selected entity: `get_context()` → `ContextSynthesizer`
+6. Assemble `context_text`, compute average `sufficiency_score`
+7. Filter response fields based on `verbosity`:
+ - `minimal` → `{context_text, sufficiency_score, total_tokens, reduction_summary}`
+ - `standard` → + `query, task_type, task_confidence, retrieval_mode, max_tokens, truncated`
+ - `verbose` → + `evidence[]`
+ - `diagnostic` → full dict with all fields and `errors[]`
+
+---
+
+## Layer 2b — LLM Agent
+
+### `ContextSynthesizer` (`analysis/context_synthesizer.py`)
+
+Generates token-budget-aware context bundles for individual entities.
+
+- `synthesize(entity_id, summarize)` — default synthesis: header + docstring + signature + source_code + parent + callers + callees + children (in priority order, stopping at token budget)
+- `synthesize_with_task(entity_id, task_type, summarize)` — task-prioritized synthesis using `TASK_TEMPLATES`:
+
+| TaskType | Priority order | Boosts |
+|---|---|---|
+| `DEBUG` | source_code, callers, callees, signature, docstring | source_code ×2.0, callers ×1.5 |
+| `EXTEND` | signature, docstring, children, parent, source_code | signature ×1.5, children ×1.3 |
+| `REVIEW` | source_code, callers, callees, signature | callers ×1.5, callees ×1.5 |
+| `EXPLAIN` | docstring, signature, source_code, callees, parent | docstring ×1.5, callees ×1.3 |
+| `LOCATE` | signature, docstring, parent | none |
+| `GENERAL` | docstring, signature, source_code, parent, callers, callees | none |
+
+- `_calculate_sufficiency(task_type, content_included, entity, text)` → `float 0.0–1.0`
+ - Weighted sum over priority sections (weight = 1/(rank+1))
+ - Bonus: +0.2 if source_code included; +0.1 if long docstring present
+ - Penalty: ×0.5 if total context < 100 chars
+- Returns `ContextBundle {target_entity, context_text, included_entities, total_tokens, truncated, task_type, sufficiency_score}`
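+
+A sketch of that scoring rule (the normalization that keeps the result in 0.0–1.0 is an assumption; only the 1/(rank+1) weights, bonuses, and penalty are stated above):
+
+```python
+def calculate_sufficiency(priority: list[str], included: set[str],
+                          context_text: str, has_long_docstring: bool) -> float:
+    total = sum(1.0 / (rank + 1) for rank in range(len(priority)))
+    covered = sum(1.0 / (rank + 1)
+                  for rank, section in enumerate(priority) if section in included)
+    score = covered / total if total else 0.0
+    if "source_code" in included:
+        score += 0.2                          # bonus: source code present
+    if has_long_docstring:
+        score += 0.1                          # bonus: long docstring present
+    if len(context_text) < 100:
+        score *= 0.5                          # penalty: very short context
+    return min(score, 1.0)
+```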
+
+### `Agent` (`llm/agent.py`)
+
+Answers codebase questions using configured LLM providers.
+
+- `answer(query)` — always invokes an LLM:
+ 1. `service.retrieve_context_for_query(query)` → context bundle
+ 2. `get_prompt_template(task_type)` → system instructions
+ 3. Iterate configured models with RPM/RPD rate-limit check:
+ - Google: `client.models.generate_content(model, prompt)`
+ - OpenAI-compatible: `client.chat.completions.create(model, messages)` (with `HTTP-Referer` header for OpenRouter)
+ - On `ResourceExhausted` or error: try next model
+ 4. `rate_limiter.record_usage(model.name)` → `~/.knowcode/usage_stats.json`
+
+- `smart_answer(query, force_llm=False)` — local-first:
+ 1. Retrieve context and check `sufficiency_score ≥ config.sufficiency_threshold` (default 0.8)
+ 2. If sufficient: `_format_local_answer()` — returns context-only answer (zero LLM tokens)
+ 3. If insufficient or `force_llm=True`: delegates to `answer()`
+ 4. Returns `{answer, source: "local"|"llm", task_type, sufficiency_score, context, llm_tokens_saved}`
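+
+The local-first branch, as a sketch (collaborators are passed in explicitly; attribute and key names follow the description above but are not verified):
+
+```python
+def smart_answer(service, agent, threshold: float, query: str,
+                 force_llm: bool = False) -> dict:
+    bundle = service.retrieve_context_for_query(query)
+    score = bundle["sufficiency_score"]
+    if score >= threshold and not force_llm:
+        # Context alone is good enough: answer locally, spend zero LLM tokens.
+        return {"answer": bundle["context_text"], "source": "local",
+                "task_type": bundle["task_type"], "sufficiency_score": score}
+    # Insufficient context (or forced): fall through to the LLM path.
+    return {"answer": agent.answer(query), "source": "llm",
+            "task_type": bundle["task_type"], "sufficiency_score": score}
+```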
+
+---
+
+## Layer 3 — Storage Layer
+
+### `KnowledgeStore` (`storage/knowledge_store.py`)
+
+- In-memory semantic graph: `entities: dict[str, Entity]` + `relationships: list[Relationship]`
+- Persistence: `knowcode_knowledge.json` (schema v2)
+- Core factory: `from_graph_builder(builder)` — transfers parsed data into the store
+- Persistence: `save(path)` / `load(path)` / `_migrate_schema()` (handles v1→v2 upgrade)
+- Graph queries: `search()`, `get_entity()`, `get_callers()`, `get_callees()`, `get_children()`, `get_parent()`, `get_dependencies()`, `get_dependents()`, `trace_calls()`, `get_impact()`, `list_by_kind()`
+
+### `VectorStore` (`storage/vector_store.py`)
+
+- Wraps FAISS `IndexFlatIP` with L2-normalized embeddings (equivalent to cosine similarity)
+- Default embedding dimension: 1024 (voyage-code-3)
+- Persistence: `knowcode_index/vectors.index` (FAISS binary) + `knowcode_index/vectors.json` (metadata)
+- API: `add(chunks, embeddings)`, `search(query_vector, k)`, `save()`, `load()`, `clear()`, `_validate_and_migrate_metadata()`
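+
+The FAISS usage reduces to the following pattern (a sketch; the real `VectorStore` adds metadata tracking and schema migration on top):
+
+```python
+import faiss
+import numpy as np
+
+DIM = 1024                                        # voyage-code-3 embedding size
+
+index = faiss.IndexFlatIP(DIM)                    # inner-product index
+
+vectors = np.random.rand(8, DIM).astype("float32")
+faiss.normalize_L2(vectors)                       # normalized IP == cosine similarity
+index.add(vectors)
+
+query = np.random.rand(1, DIM).astype("float32")
+faiss.normalize_L2(query)
+scores, ids = index.search(query, 3)              # top-3 nearest chunks
+
+faiss.write_index(index, "vectors.index")         # the binary file kept in knowcode_index/
+```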
+
+### `ChunkRepository` (`storage/chunk_repository.py`)
+
+- `InMemoryChunkRepository` implementation
+- Stores `CodeChunk` objects indexed by `chunk_id` and `entity_id`
+- Persistence: `knowcode_index/chunks.json`
+- API: `add(chunk)`, `get(chunk_id)`, `get_by_entity(entity_id)`, `search_by_tokens(tokens)` (BM25 candidate lookup), `clear()`
+
+---
+
+## Layer 4 — Infrastructure / Plugins
+
+### Parsers (`parsers/`)
+
+Language-specific parser implementations, all extending `TreeSitterParser` (base class):
+
+| Parser | Language |
+|---|---|
+| `PythonParser` | Python |
+| `JavaScriptParser` | JavaScript |
+| `TypeScriptParser` | TypeScript |
+| `JavaParser` | Java |
+| `RustParser` | Rust |
+| `VueParser` | Vue SFCs |
+| `MarkdownParser` | Markdown (docs) |
+| `YAMLParser` | YAML configs |
+
+Each implements `_extract_entities()` and returns `ParseResult {entities[], relationships[], errors[]}`. The base class handles Tree-sitter `parse_file()`, `_get_text()`, `_get_location()`, `_create_entity()`.
+
+### EmbeddingProviders (`llm/embedding.py`)
+
+Abstract base `EmbeddingProvider` with `embed(texts[])` and `embed_single(text)` methods.
+
+- `VoyageAIEmbeddingProvider` — uses `voyage-code-3` (dim=1024), distinguishes `input_type=document` (indexing) vs `input_type=query` (search)
+- `OpenAIEmbeddingProvider` — supports `text-embedding-3-small` (1536-dim) and `text-embedding-3-large` (3072-dim)
+- `create_embedding_provider(app_config)` factory: tries each configured embedding model in order, checks API key availability, falls back to VoyageAI default
+
+### LLM Clients (`llm/agent.py`)
+
+- `_create_google_client(api_key)` → `google.genai.Client`
+- `_create_openai_client(api_key, base_url)` → `openai.OpenAI` (with optional base_url override for OpenRouter/Mistral)
+- Model failover order defined in `AppConfig.models` (loaded from `aimodels.yaml`)
+
+### Scanner (`indexing/scanner.py`)
+
+- `scan(root_dir)` → `list[FileInfo]` — discovers all non-ignored files
+- `_load_gitignore()` — reads `.gitignore` via pathspec
+- `_should_ignore(path)` — applies gitignore rules + extension filter
+- `FileInfo`: `{path, size, modified, language}` — language auto-detected from extension
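+
+A sketch of that filter chain with `pathspec` (the extension set is illustrative):
+
+```python
+from pathlib import Path
+import pathspec
+
+SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".java", ".rs", ".vue", ".md", ".yaml", ".yml"}
+
+def scan(root: Path) -> list[Path]:
+    """Discover non-ignored, supported files under root."""
+    gitignore = root / ".gitignore"
+    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
+    spec = pathspec.PathSpec.from_lines("gitwildmatch", lines)
+
+    files = []
+    for path in root.rglob("*"):
+        if not path.is_file():
+            continue
+        if path.suffix not in SUPPORTED_EXTENSIONS:
+            continue                                     # extension filter
+        if spec.match_file(path.relative_to(root).as_posix()):
+            continue                                     # gitignored
+        files.append(path)
+    return files
+```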
+
+### FileMonitor + BackgroundIndexer
+
+**`FileMonitor` (`indexing/monitor.py`)**
+- Wraps watchdog `Observer`
+- `IndexingHandler.on_modified(event)` + `on_created(event)` → `_handle_change(path)` → extension filter → `bg_indexer.queue_file(path)`
+- `start()` / `stop()`
+
+**`BackgroundIndexer` (`indexing/background_indexer.py`)**
+- Daemon thread + `queue.Queue`
+- `queue_file(path)` — enqueues a file path for re-indexing
+- `_worker()` — blocking dequeue loop, calls `indexer.index_file(path)` for each entry
+- `start()` / `stop()`
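+
+The watchdog-plus-queue wiring, in sketch form (`index_file` is a stand-in for `Indexer.index_file`):
+
+```python
+import queue
+import threading
+from watchdog.events import FileSystemEventHandler
+from watchdog.observers import Observer
+
+class IndexingHandler(FileSystemEventHandler):
+    def __init__(self, work_queue: queue.Queue, extensions: set[str]):
+        self.work_queue = work_queue
+        self.extensions = extensions
+
+    def on_modified(self, event):
+        self._handle_change(event.src_path)
+
+    def on_created(self, event):
+        self._handle_change(event.src_path)
+
+    def _handle_change(self, path: str):
+        if any(path.endswith(ext) for ext in self.extensions):
+            self.work_queue.put(path)             # hand off to the background worker
+
+def worker(work_queue: queue.Queue, index_file) -> None:
+    while True:                                   # daemon loop: blocking dequeue
+        path = work_queue.get()
+        index_file(path)                          # incremental re-index of one file
+        work_queue.task_done()
+
+work_queue: queue.Queue = queue.Queue()
+threading.Thread(target=worker, args=(work_queue, print), daemon=True).start()
+
+observer = Observer()
+observer.schedule(IndexingHandler(work_queue, {".py"}), path=".", recursive=True)
+observer.start()                                  # a real service would block / join here
+```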
+
+### TemporalAnalyzer + CoverageProcessor
+
+**`TemporalAnalyzer` (`analysis/temporal.py`)**
+- `analyze_history(limit=100)` — uses GitPython to parse commit log
+- Creates `COMMIT` and `AUTHOR` entities with `AUTHORED`, `MODIFIED`, `CHANGED_BY` relationships
+- Stores `insertions`, `deletions` metadata on `MODIFIED` relationships
+
+**`CoverageProcessor` (`analysis/signals.py`)**
+- `process_cobertura(xml_path)` — parses Cobertura XML coverage report
+- Creates `COVERAGE_REPORT` entity and `COVERS` relationships linking the report to covered modules
+
+### Config (`config.py`)
+
+- `AppConfig.load()` — priority: explicit path → `./aimodels.yaml` → `~/.aimodels.yaml` → defaults
+- `ModelConfig {name, provider, api_key_env, rpm_free_tier_limit=10, rpd_free_tier_limit=1000}`
+- Defaults: NL models = `[gemini-2.0-flash-lite, gemini-1.5-flash, gemini-1.5-pro]`; embedding = `voyage-code-3`; `sufficiency_threshold = 0.8`
+
+---
+
+## Agent Gateway (Separate Microservice)
+
+Located in `apps/agent-gateway/`. Can be moved to an independent repository without code changes.
+
+### `GatewaySettings` (`settings.py`)
+
+Frozen dataclass loaded from environment variables via `from_env()`:
+
+| Setting | Default |
+|---|---|
+| `knowcode_api_base_url` | `http://127.0.0.1:8000` |
+| `litellm_base_url` | `http://127.0.0.1:4000` |
+| `litellm_api_key` | `sk-local-proxy` |
+| `default_model` | `gemini/gemini-3-flash-preview` |
+| `max_tool_rounds` | `4` |
+| `tool_timeout_seconds` | `30.0` |
+| `openapi_cache_ttl_seconds` | `300` |
+| `allowed_tool_names` | `{query_context, search, get_context, trace_calls}` |
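+
+A sketch of the settings object (environment-variable names are assumed to be the upper-cased field names; `allowed_tool_names` is omitted for brevity):
+
+```python
+import os
+from dataclasses import dataclass, fields
+
+@dataclass(frozen=True)
+class GatewaySettings:
+    knowcode_api_base_url: str = "http://127.0.0.1:8000"
+    litellm_base_url: str = "http://127.0.0.1:4000"
+    litellm_api_key: str = "sk-local-proxy"
+    default_model: str = "gemini/gemini-3-flash-preview"
+    max_tool_rounds: int = 4
+    tool_timeout_seconds: float = 30.0
+    openapi_cache_ttl_seconds: int = 300
+
+    @classmethod
+    def from_env(cls) -> "GatewaySettings":
+        kwargs = {}
+        for f in fields(cls):
+            raw = os.environ.get(f.name.upper())   # e.g. MAX_TOOL_ROUNDS (assumed naming)
+            if raw is not None:
+                kwargs[f.name] = f.type(raw)       # cast via the annotated builtin type
+        return cls(**kwargs)
+```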
+
+### `AgentOrchestrator` (`orchestrator.py`)
+
+- `run(ChatRequest)` → `ChatResponse` — the main agentic loop:
+ 1. `_pick_tool_names(request)` → `select_tool_names(message)` (keyword heuristics)
+ 2. Fetch tool schemas from `OpenAPIToolRegistry`
+ 3. Loop ≤ `max_tool_rounds`:
+ - `LiteLLMClient.create_chat_completion(messages, tools)`
+ - `_first_choice(response)` extracts `tool_call`
+ - `_execute_tool_call(tool_call, timeout)` → `KnowCodeClient.execute_tool()`
+ - Append tool result to messages, record `ToolExecutionRecord`
+ 4. Build and return `ChatResponse`
+- `list_tools()` → available tool names
+- `readiness()` → checks KnowCode + LiteLLM health
+
+### `ToolSelector` (`tool_selector.py`)
+
+- `select_tool_names(message)` — keyword heuristics on the user message text
+- Returns a subset of `allowed_tool_names` based on detected intent
+
+### `LiteLLMClient` (`litellm_client.py`)
+
+- `create_chat_completion(messages, tools, model, temperature)` → sends to LiteLLM proxy `:4000`
+- `check_health()` — pings LiteLLM
+- `_extract_response_cost(response)` — extracts cost metadata
+
+### `KnowCodeClient` (`knowcode_client.py`)
+
+- `execute_tool(tool_name, args)` — dispatches to KnowCode REST API:
+ - `query_context` → `POST /api/v1/context/query`
+ - `search` → `GET /api/v1/search?q=...`
+ - `get_context` → `GET /api/v1/context?target=...`
+ - `trace_calls` → `GET /api/v1/trace_calls/{entity_id}?direction=...&depth=...`
+- `check_health()` — pings KnowCode `/api/v1/health`
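+
+In sketch form, assuming an `httpx` client (the HTTP library actually used by the gateway is not stated here):
+
+```python
+import httpx
+
+class KnowCodeClient:
+    """Sketch of the tool-name to REST-endpoint dispatch described above."""
+
+    def __init__(self, base_url: str, timeout: float = 30.0):
+        self.http = httpx.Client(base_url=base_url, timeout=timeout)
+
+    def execute_tool(self, tool_name: str, args: dict) -> dict:
+        if tool_name == "query_context":
+            resp = self.http.post("/api/v1/context/query", json=args)
+        elif tool_name == "search":
+            resp = self.http.get("/api/v1/search", params={"q": args["query"]})
+        elif tool_name == "get_context":
+            resp = self.http.get("/api/v1/context", params=args)
+        elif tool_name == "trace_calls":
+            resp = self.http.get(
+                f"/api/v1/trace_calls/{args['entity_id']}",
+                params={k: v for k, v in args.items() if k != "entity_id"},
+            )
+        else:
+            raise ValueError(f"unknown tool: {tool_name}")
+        resp.raise_for_status()
+        return resp.json()
+```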
+
+### `OpenAPIToolRegistry` + `OpenAPIToolTranslator` (`openapi_tools.py`)
+
+- `fetch_openapi_spec(url)` → fetches `/openapi.json` from KnowCode
+- `OpenAPIToolTranslator` converts OpenAPI operation objects into OpenAI-compatible tool schema dicts
+- Results cached for `openapi_cache_ttl_seconds = 300` seconds
+
+### LiteLLM Proxy (`:4000`)
+
+- Configured via `litellm.config.yaml`
+- Accepts OpenAI-compatible requests and proxies to configured upstream LLMs (Google Gemini, others)
+- Manages rate-limit passthrough
+
+---
+
+## Key Data Models
+
+### `Entity`
+```
+id: "file_path::qualified_name"
+kind: EntityKind (MODULE|CLASS|FUNCTION|METHOD|VARIABLE|DOCUMENT|SECTION|CONFIG_KEY|COMMIT|AUTHOR|TEST_RUN|COVERAGE_REPORT)
+name, qualified_name, location: Location{file_path, line_start, line_end, column_start, column_end}
+docstring, signature, source_code, metadata: dict
+```
+
+### `Relationship`
+```
+source_id, target_id, kind: RelationshipKind, metadata: dict
+RelationshipKind: CALLS|IMPORTS|CONTAINS|INHERITS|IMPLEMENTS|USES_TYPE|REFERENCES
+ CHANGED_BY|AUTHORED|MODIFIED (temporal)
+ COVERS|EXECUTED_BY (runtime)
+```
+
+### `CodeChunk`
+```
+id: "entity_id::chunk_index"
+entity_id, content, tokens: list[str], embedding: list[float] | None, metadata: dict
+```
+
+### `EmbeddingConfig` (default)
+```
+provider: "voyageai", model_name: "voyage-code-3", dimension: 1024, batch_size: 100, normalize: True
+```
+
+---
+
+## Cross-Layer Arrows Summary
+
+| From | To | Nature |
+|---|---|---|
+| CLI / REST API / MCP | KnowCodeService | synchronous call |
+| Agent Gateway | KnowCode REST API | HTTP (dashed) |
+| KnowCodeService | GraphBuilder / SearchEngine / RetrievalOrchestrator / ContextSynthesizer / Agent | delegation |
+| Indexer | KnowledgeStore / VectorStore / ChunkRepository | writes |
+| SearchEngine | VectorStore / ChunkRepository | reads (dashed) |
+| ContextSynthesizer | KnowledgeStore | reads (dashed) |
+| GraphBuilder | Scanner / Parsers | uses |
+| Indexer | EmbeddingProviders | uses |
+| Agent | LLM Clients | calls |
+| Reranker | EmbeddingProviders (VoyageAI) | uses (dashed) |
+| REST API | API Rate Limiter | uses (dashed) |
+| REST API | FileMonitor | triggers (dashed, watch mode) |
diff --git a/docs/diagrams/seq_agent_gateway.drawio b/docs/diagrams/seq_agent_gateway.drawio
new file mode 100644
index 0000000..bdd8a8c
--- /dev/null
+++ b/docs/diagrams/seq_agent_gateway.drawio
@@ -0,0 +1,94 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/seq_agent_gateway.md b/docs/diagrams/seq_agent_gateway.md
new file mode 100644
index 0000000..6435530
--- /dev/null
+++ b/docs/diagrams/seq_agent_gateway.md
@@ -0,0 +1,289 @@
+# Sequence Diagram — Agent Gateway Workflow
+
+> Textual narration of [`seq_agent_gateway.drawio`](seq_agent_gateway.drawio).
+> Every participant, message, and note in the draw.io file is described here in full.
+
+**Located in:** `apps/agent-gateway/`
+**Startup:** `local_up.sh`
+**Request entry:** `POST /api/v1/chat`
+**Smoke test:** `scripts/smoke_e2e.py`
+
+---
+
+## Participants
+
+| Participant | File | Role |
+|---|---|---|
+| User / IDE | — | Sends chat requests to the Gateway |
+| Gateway FastAPI `:8081` | `app.py` | HTTP server — validates requests, delegates to orchestrator |
+| AgentOrchestrator | `orchestrator.py` | Agentic tool-use loop (max 4 rounds) |
+| ToolSelector | `tool_selector.py` | Keyword heuristics — selects tool subset from user message |
+| OpenAPIToolRegistry | `openapi_tools.py` | Caches OpenAI-compatible tool schemas derived from KnowCode OpenAPI spec |
+| LiteLLMClient | `litellm_client.py` | Sends chat completion requests to LiteLLM proxy |
+| LiteLLM Proxy `:4000` | external | Normalizes to upstream LLMs (Gemini, Mistral, …) |
+| KnowCodeClient | `knowcode_client.py` | Dispatches tool calls to KnowCode REST API |
+| KnowCode REST API `:8000` | `src/knowcode/api/api.py` | The main KnowCode service API |
+
+---
+
+## Startup — `local_up.sh`
+
+### Step 1 — Start dependencies
+
+```
+local_up.sh:
+ → start KnowCode REST API on :8000
+ → start LiteLLM proxy on :4000
+```
+
+### Step 2 — Load settings
+
+```
+Gateway: GatewaySettings.from_env()
+```
+
+Settings loaded (frozen dataclass, all from environment variables):
+
+| Setting | Default |
+|---|---|
+| `knowcode_api_base_url` | — (required) |
+| `litellm_base_url` | — (required) |
+| `default_model` | — (required) |
+| `max_tool_rounds` | `4` |
+| `tool_timeout_seconds` | `30` |
+| `openapi_cache_ttl_seconds` | `300` |
+
+### Step 3 — Fetch OpenAPI spec
+
+```
+Gateway → KnowCode REST API:
+ GET {knowcode_api_base_url}/openapi.json
+```
+
+```
+KnowCode REST API → OpenAPIToolRegistry: OpenAPI spec JSON
+```
+
+### Step 4 — Translate to tool schemas
+
+```
+OpenAPIToolRegistry:
+ OpenAPIToolTranslator.translate(openapi_spec)
+ → OpenAI-compatible tool schema list (cached for 300 s)
+```
+
+### Step 5 — Gateway ready
+
+```
+Gateway: listening on :8081
+```
+
+---
+
+## Agentic Request — `POST /api/v1/chat`
+
+### Step 6 — Receive chat request
+
+```
+User / IDE → Gateway:
+ POST /api/v1/chat
+ ChatRequest{
+ message,
+ conversation[],
+ model,
+ tags,
+ tool_names,
+ temperature
+ }
+```
+
+### Step 7 — Delegate to orchestrator
+
+```
+Gateway → AgentOrchestrator: orchestrator.run(chat_request)
+```
+
+### Step 8 — Select tools
+
+```
+AgentOrchestrator → ToolSelector:
+ _pick_tool_names(request) → select_tool_names(message)
+```
+
+Keyword heuristics (not ML):
+
+| Keyword pattern | Tool selected |
+|---|---|
+| `explain`, `what is`, `describe` | `get_context` |
+| `find`, `search`, `where` | `search` |
+| `trace`, `who calls`, `callers` | `trace_calls` |
+| (default) | all four tools |
+
+Returns: subset of `{query_context, search, get_context, trace_calls}`.
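+
+A sketch of those heuristics (the exact keyword lists live in `tool_selector.py` and may differ):
+
+```python
+def select_tool_names(message: str) -> set[str]:
+    text = message.lower()
+    selected = set()
+    if any(kw in text for kw in ("explain", "what is", "describe")):
+        selected.add("get_context")
+    if any(kw in text for kw in ("find", "search", "where")):
+        selected.add("search")
+    if any(kw in text for kw in ("trace", "who calls", "callers")):
+        selected.add("trace_calls")
+    # Default: no keyword matched, offer every allowed tool.
+    return selected or {"query_context", "search", "get_context", "trace_calls"}
+```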
+
+### Step 9 — Fetch tool schemas
+
+```
+AgentOrchestrator → OpenAPIToolRegistry:
+ get tool schemas for selected tools
+```
+
+Returns: list of OpenAI-compatible tool schema dicts.
+
+---
+
+## Tool-Use Loop — up to `max_tool_rounds=4` iterations
+
+### Step 10 — LLM completion with tools
+
+```
+AgentOrchestrator → LiteLLMClient:
+ litellm_client.create_chat_completion(
+ messages, tools=tool_schemas, model, temperature
+ )
+```
+
+### Step 11 — Forward to LiteLLM proxy
+
+```
+LiteLLMClient → LiteLLM Proxy:
+ POST http://litellm_base_url/chat/completions
+```
+
+### Step 12 — Upstream LLM call
+
+```
+LiteLLM Proxy: proxy → upstream LLM (Gemini / Mistral / …)
+```
+
+### Step 13 — Receive completion response
+
+```
+LiteLLM Proxy → LiteLLMClient:
+ ChatCompletion{
+ choices[0].finish_reason,
+ choices[0].message.tool_calls[]
+ }
+```
+
+### Step 14 — Extract tool call
+
+```
+LiteLLMClient → AgentOrchestrator:
+ _first_choice(response) → tool_call{id, name, arguments}
+```
+
+---
+
+### [if `finish_reason == "tool_calls"`] — Execute tool call (timeout = 30 s)
+
+### Step 15 — Dispatch to KnowCodeClient
+
+```
+AgentOrchestrator → KnowCodeClient:
+ _execute_tool_call(tool_call) → knowcode_client.execute_tool(name, args)
+```
+
+### Step 16 — KnowCodeClient dispatches to REST API
+
+KnowCodeClient maps tool names to REST endpoints:
+
+| Tool name | HTTP call |
+|---|---|
+| `query_context` | `POST /api/v1/context/query {query, limit, task_type}` |
+| `search` | `GET /api/v1/search?q=...` |
+| `get_context` | `GET /api/v1/context?target=...&task_type=...` |
+| `trace_calls` | `GET /api/v1/trace_calls/{entity_id}?direction=...&depth=...` |
+
+### Step 17 — API result returned
+
+```
+KnowCode REST API → KnowCodeClient: result JSON
+```
+
+### Step 18 — Record execution
+
+```
+KnowCodeClient → AgentOrchestrator:
+ ToolExecutionRecord{
+ tool_name,
+ tool_call_id,
+ arguments,
+ success,
+ latency_ms
+ }
+```
+
+### Step 19 — Append result and continue loop
+
+```
+AgentOrchestrator:
+ append tool_result to messages[]
+ → continue loop
+```
+
+---
+
+**Loop exits when:** `finish_reason == "stop"` OR `max_tool_rounds` reached.
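+
+The loop shape, as a sketch over OpenAI-style response dicts (`llm` and `execute_tool` are stand-ins for `LiteLLMClient.create_chat_completion` and `KnowCodeClient.execute_tool`):
+
+```python
+import json
+
+def run_tool_loop(llm, execute_tool, messages: list[dict],
+                  tools: list[dict], max_tool_rounds: int = 4) -> dict:
+    records = []
+    response = {"choices": [{"finish_reason": "stop", "message": {"content": ""}}]}
+    for _ in range(max_tool_rounds):
+        response = llm(messages=messages, tools=tools)      # one chat completion round
+        choice = response["choices"][0]
+        if choice["finish_reason"] != "tool_calls":
+            break                                           # model produced a final answer
+        messages.append(choice["message"])                  # assistant turn with tool_calls
+        for call in choice["message"]["tool_calls"]:
+            args = json.loads(call["function"]["arguments"])  # OpenAI-style JSON string
+            result = execute_tool(call["function"]["name"], args)
+            records.append({"tool_name": call["function"]["name"],
+                            "tool_call_id": call["id"], "success": True})
+            messages.append({"role": "tool",
+                             "tool_call_id": call["id"],
+                             "content": str(result)})
+    answer = response["choices"][0]["message"].get("content") or ""
+    return {"answer": answer, "tool_executions": records}
+```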
+
+---
+
+## Final Response
+
+### Step 20 — Build ChatResponse
+
+```
+AgentOrchestrator → Gateway:
+ ChatResponse{
+ answer,
+ model,
+ usage{},
+ response_cost,
+ finish_reason,
+ selected_tools[],
+ tool_executions[]
+ }
+```
+
+### Step 21 — Return to caller
+
+```
+Gateway → User / IDE: ChatResponse
+```
+
+---
+
+## Smoke E2E — `scripts/smoke_e2e.py`
+
+Used in CI post-deploy validation or run manually.
+
+### Step 22 — Health check
+
+```
+smoke_e2e.py → Gateway: GET /health
+smoke_e2e.py: assert response == {status: "ok"}
+```
+
+### Step 23 — Tools check
+
+```
+smoke_e2e.py → Gateway: GET /api/v1/tools
+smoke_e2e.py: assert ≥ 1 tool available
+```
+
+### Step 24 — Chat round-trip
+
+```
+smoke_e2e.py → Gateway:
+ POST /api/v1/chat
+ {message: "Use query_context and get_context to find search logic..."}
+```
+
+### Step 25 — Validate response
+
+```
+smoke_e2e.py:
+ assert answer != ''
+ assert len(tool_executions) ≥ SmokeConfig.min_tool_calls
+ [optional: filter by specific tool_name]
+```
diff --git a/docs/diagrams/seq_file_watch.drawio b/docs/diagrams/seq_file_watch.drawio
new file mode 100644
index 0000000..eefa9b7
--- /dev/null
+++ b/docs/diagrams/seq_file_watch.drawio
@@ -0,0 +1,83 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/seq_file_watch.md b/docs/diagrams/seq_file_watch.md
new file mode 100644
index 0000000..e191577
--- /dev/null
+++ b/docs/diagrams/seq_file_watch.md
@@ -0,0 +1,242 @@
+# Sequence Diagram — File Watch / Hot-Reload Workflow
+
+> Textual narration of [`seq_file_watch.drawio`](seq_file_watch.drawio).
+> Every participant, message, and note in the draw.io file is described here in full.
+
+**Triggered by:** `knowcode server --watch`
+**Effect:** Every file save triggers an incremental re-index of only that file — no full re-scan needed.
+
+---
+
+## Participants
+
+| Participant | File | Role |
+|---|---|---|
+| Developer | — | Saves source files in the project |
+| CLI | `cli/cli.py` | Parses `server --watch` flag, starts service |
+| KnowCodeService | `service.py` | Wires together indexer, monitor, and FastAPI app |
+| FileMonitor | `indexing/monitor.py` | watchdog `Observer` — watches filesystem for events |
+| IndexingHandler | `indexing/monitor.py` | watchdog event handler — filters and enqueues paths |
+| BackgroundIndexer | `indexing/background_indexer.py` | Daemon thread with `Queue` — dequeues and re-indexes |
+| Indexer | `indexing/indexer.py` | Parses, chunks, embeds a single file |
+| EmbeddingProvider | `llm/embedding.py` | VoyageAI / OpenAI embeddings API |
+| KnowledgeStore + VectorStore + ChunkRepo | `storage/` | In-memory stores updated atomically per file |
+
+---
+
+## Startup
+
+### Step 1 — Launch with `--watch`
+
+```
+Developer → CLI: knowcode server --watch
+```
+
+### Step 2 — Initialize service
+
+```
+CLI → KnowCodeService: KnowCodeService(store_path, strict_config=True)
+```
+
+### Step 3 — Load indexer
+
+```
+KnowCodeService:
+ service.get_indexer()
+ → Indexer(embedding_provider)
+ + load(knowcode_index/) [if existing index found on disk]
+```
+
+### Step 4 — Start BackgroundIndexer
+
+```
+KnowCodeService → BackgroundIndexer:
+ BackgroundIndexer(indexer).start()
+ → daemon thread started
+ + Queue() initialized
+```
+
+### Step 5 — Start FileMonitor
+
+```
+KnowCodeService → FileMonitor:
+ FileMonitor(watch_root, bg_indexer).start()
+ → watchdog Observer.start() [uses inotify / FSEvents / kqueue per OS]
+```
+
+### Step 6 — Server ready
+
+```
+KnowCodeService: FastAPI + Uvicorn listening on :8000
+```
+
+---
+
+## File Change Event
+
+Triggered by OS filesystem notifications forwarded through watchdog.
+
+### Step 7 — Developer saves a file
+
+```
+Developer → FileMonitor: save src/foo.py (write to filesystem)
+```
+
+### Step 8 — Watchdog fires event
+
+```
+FileMonitor → IndexingHandler:
+ watchdog OS event → IndexingHandler.on_modified(FileModifiedEvent)
+```
+
+> `on_created` follows the same path: `IndexingHandler.on_created → _handle_change(path)`
+
+### Step 9 — Dispatch to handler
+
+```
+IndexingHandler: _handle_change(event.src_path)
+```
+
+### Step 10 — Filter
+
+```
+IndexingHandler:
+ filter: file extension in SUPPORTED_EXTENSIONS + not gitignored
+ [path is silently dropped if filter fails]
+```
+
+### Step 11 — Enqueue
+
+```
+IndexingHandler → BackgroundIndexer:
+ bg_indexer.queue_file(file_path) → Queue.put(file_path)
+```
+
+---
+
+## Background Re-Indexing — `_worker` daemon thread
+
+### Step 12 — Dequeue
+
+```
+BackgroundIndexer: Queue.get(file_path) [blocking dequeue]
+```
+
+### Step 13 — Invoke incremental indexer
+
+```
+BackgroundIndexer → Indexer: indexer.index_file(file_path)
+```
+
+### Step 14 — Parse file
+
+```
+Indexer:
+ parse file with appropriate language parser (Tree-sitter)
+ → ParseResult{entities[], relationships[]}
+```
+
+### Step 15 — Chunk entities
+
+```
+Indexer:
+ Chunker.process_parse_result()
+ → CodeChunks[] {id, entity_id, content, tokens[], metadata}
+ (module header chunk + import block chunk + entity chunks with BM25 tokens)
+```
+
+### Step 16 — Embed chunks
+
+```
+Indexer → EmbeddingProvider:
+ embedding_provider.embed(chunk_texts[])
+ → VoyageAI / OpenAI Embeddings API call
+```
+
+```
+EmbeddingProvider → Indexer:
+ vectors (list[list[float]], L2-normalized)
+```
+
+### Step 17 — Update ChunkRepository
+
+```
+Indexer → ChunkRepository:
+ remove old chunks for entity
+ add new chunks
+```
+
+### Step 18 — Update VectorStore
+
+```
+Indexer → VectorStore:
+ remove old vectors for entity
+ add new vectors
+ → rebuild FAISS IndexFlatIP
+```
+
+### Step 19 — Update KnowledgeStore
+
+```
+Indexer → KnowledgeStore:
+ update entities + relationships for the changed file
+```
+
+### Step 20 — Persist to disk
+
+```
+Indexer:
+ indexer.save(index_path)
+ atomic write:
+ → chunks.json (all CodeChunk objects)
+ → vectors.index (FAISS binary index)
+ → vectors.json (metadata: schema version, dimension, model name)
+```
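+
+Whether `indexer.save()` uses exactly this pattern is not stated; the standard way to get the atomic write described above is a temp-file-plus-rename, sketched here:
+
+```python
+import json
+import os
+import tempfile
+
+def atomic_write_json(path: str, payload: dict) -> None:
+    """Write to a temp file, then rename over the target, so readers never
+    see a half-written chunks.json / vectors.json."""
+    directory = os.path.dirname(path) or "."
+    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
+    with os.fdopen(fd, "w") as fh:
+        json.dump(payload, fh)
+    os.replace(tmp_path, path)       # atomic rename on POSIX and Windows
+```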
+
+### Step 21 — Re-index complete
+
+```
+BackgroundIndexer: ✓ re-index complete
+ next API request sees fresh data (no server restart needed)
+```
+
+---
+
+## Manual Reload — `POST /api/v1/reload`
+
+This is a separate mechanism that clears the in-memory **knowledge graph** cache (not the semantic index).
+
+### Step 22 — POST reload
+
+```
+Developer → KnowCodeService: POST /api/v1/reload
+```
+
+### Step 23 — Clear cache
+
+```
+KnowCodeService:
+ service.reload() → _store = None [clears in-memory KnowledgeStore cache]
+```
+
+### Step 24 — Lazy reload on next access
+
+```
+KnowCodeService → KnowledgeStore:
+ next access to service.store
+ → KnowledgeStore.load(store_path) reads knowcode_knowledge.json from disk
+```
+
+```
+KnowCodeService → Developer: {status: "reloaded"}
+```
+
+---
+
+## Contrast: Incremental vs Full Reload
+
+| Mechanism | Scope | Triggered by |
+|---|---|---|
+| `FileMonitor → BackgroundIndexer` (steps 7–21) | **Incremental**: re-indexes only the single changed file; updates `ChunkRepo`, `VectorStore`, and `KnowledgeStore` in memory | File save detected by watchdog |
+| `POST /api/v1/reload` (steps 22–24) | **Cache clear only**: discards in-memory `KnowledgeStore`; reloads from `knowcode_knowledge.json` | Manual API call |
+| `knowcode analyze` (separate command) | **Full rebuild**: GraphBuilder re-scans all files, rebuilds knowledge graph, then Indexer re-scans for semantic index | CLI command |
diff --git a/docs/diagrams/seq_indexing.drawio b/docs/diagrams/seq_indexing.drawio
new file mode 100644
index 0000000..d8333d6
--- /dev/null
+++ b/docs/diagrams/seq_indexing.drawio
@@ -0,0 +1,188 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/seq_indexing.md b/docs/diagrams/seq_indexing.md
new file mode 100644
index 0000000..5ff55e1
--- /dev/null
+++ b/docs/diagrams/seq_indexing.md
@@ -0,0 +1,240 @@
+# Sequence Diagram — Indexing / Analysis Workflow
+
+> Textual narration of [`seq_indexing.drawio`](seq_indexing.drawio).
+> Every participant, message, and note in the draw.io file is described here in full.
+
+**Triggered by:** `knowcode analyze`
+**Side effect:** automatically builds the semantic index (no separate `knowcode index` call needed after analyze)
+
+---
+
+## Participants
+
+| Participant | File | Role |
+|---|---|---|
+| User / CI | — | Invokes `knowcode analyze` |
+| CLI | `cli/cli.py` | Parses arguments, calls service |
+| KnowCodeService | `service.py` | Central orchestrator |
+| GraphBuilder | `indexing/graph_builder.py` | Parses codebase into entity/relationship graph |
+| Scanner | `indexing/scanner.py` | File discovery with gitignore filtering |
+| Parser (one per language) | `parsers/` | Language-specific AST extraction |
+| KnowledgeStore | `storage/knowledge_store.py` | In-memory graph + JSON persistence |
+| Indexer | `indexing/indexer.py` | Full scan→chunk→embed pipeline |
+| Chunker | `indexing/chunker.py` | Splits entities into BM25-tokenized code chunks |
+| EmbeddingProvider | `llm/embedding.py` | Converts text to dense vectors |
+| VectorStore + ChunkRepo | `storage/vector_store.py`, `storage/chunk_repository.py` | Persists vectors (FAISS) and chunks (JSON) |
+
+---
+
+## Phase 1 — Knowledge Graph Construction
+
+### Step 1 — User invokes analyze
+
+```
+User → CLI: knowcode analyze ./src [--temporal] [--coverage=report.xml]
+```
+
+Optional flags:
+- `--temporal` — enables git history analysis
+- `--coverage=` — enables Cobertura XML coverage ingestion
+
+### Step 2 — CLI delegates to service
+
+```
+CLI → KnowCodeService: service.analyze(directory, output, ignore, temporal, coverage)
+```
+
+`output` defaults to the same directory as `directory`, producing `knowcode_knowledge.json` in place.
+
+### Step 3 — GraphBuilder instantiated and scan begins
+
+```
+KnowCodeService → GraphBuilder: GraphBuilder()
+ builder.build_from_directory(root_dir, additional_ignores, analyze_temporal, coverage_path)
+```
+
+`build_from_directory` is the top-level entry point for the knowledge graph pipeline.
+
+### Step 4 — Scanner discovers files
+
+```
+GraphBuilder → Scanner: Scanner.scan(root_dir)
+Scanner returns: list[FileInfo] {path, size, modified, language}
+```
+
+The scanner:
+- Loads `.gitignore` rules via `pathspec`
+- Applies `_should_ignore(path)` filter (extension list + gitignore patterns)
+- Returns one `FileInfo` per qualifying file, with language auto-detected from extension
+
+### Step 5 — [Loop] Parse each file
+
+For each `FileInfo` in the discovered list:
+
+```
+GraphBuilder → Parser: _parse_file(file_info) → select parser by language
+Parser: parse_file(file_path, source) → AST traversal
+Parser returns: ParseResult {entities[], relationships[], errors[]}
+```
+
+Language-specific parsers (Python, JavaScript, TypeScript, Java, Rust, Vue, Markdown, YAML) extend `TreeSitterParser`. Each parser:
+- Parses source with Tree-sitter
+- Extracts entities (functions, classes, methods, variables, modules)
+- Records intra-file relationships (CALLS, IMPORTS, CONTAINS, INHERITS)
+
+```
+GraphBuilder: _merge_result(parse_result) → accumulate entities + relationships into internal collections
+```
+
+### Step 6 — End of file loop
+
+### Step 7 — Resolve cross-file references
+
+```
+GraphBuilder: _resolve_references()
+```
+
+After all files are parsed, GraphBuilder resolves cross-file relationships:
+- CALLS edges: function calls resolved by qualified name across modules
+- IMPORTS edges: import statements linked to the imported module entity
+- INHERITS edges: class inheritance resolved by name lookup
+
+### Step 8 — Optional temporal analysis
+
+```
+GraphBuilder: [if --temporal] TemporalAnalyzer.analyze_history(limit=100)
+```
+
+- Uses GitPython to read commit log
+- Creates `COMMIT` and `AUTHOR` entities
+- Creates `AUTHOR→AUTHORED→COMMIT`, `COMMIT→MODIFIED→MODULE`, `MODULE→CHANGED_BY→COMMIT` relationships
+- Stores `insertions`, `deletions` as relationship metadata
+
+### Step 9 — Optional coverage analysis
+
+```
+GraphBuilder: [if --coverage] CoverageProcessor.process_cobertura(xml_path)
+```
+
+- Parses Cobertura XML report
+- Creates `COVERAGE_REPORT` entity
+- Creates `COVERAGE_REPORT→COVERS→MODULE` relationships with `line_rate` metadata
+
+### Step 10 — Build and save KnowledgeStore
+
+```
+KnowCodeService → KnowledgeStore: KnowledgeStore.from_graph_builder(builder)
+KnowledgeStore: store.save(output_path) → writes knowcode_knowledge.json (schema v2)
+KnowledgeStore returns to KnowCodeService: KnowledgeStore instance (cached as service._store)
+```
+
+The JSON file structure:
+```json
+{
+ "schema_version": 2,
+ "version": "1.0",
+ "metadata": {"stats": {…}, "errors": []},
+ "entities": {"entity_id": {…Entity…}},
+ "relationships": [{…Relationship…}]
+}
+```
+
+---
+
+## Phase 2 — Semantic Index Build
+
+Called automatically by `service.analyze()` immediately after saving the knowledge store. Can also be called independently via `knowcode index`.
+
+### Step 11 — Build index invoked
+
+```
+KnowCodeService: service._build_index(directory, index_path)
+```
+
+`index_path` defaults to `knowcode_index/` inside the analyzed directory.
+
+### Step 12 — Create embedding provider
+
+```
+KnowCodeService: create_embedding_provider(app_config)
+```
+
+Factory logic:
+1. Try each model in `app_config.embedding_models` in order
+2. Check API key is set in environment
+3. Return `VoyageAIEmbeddingProvider(voyage-code-3, dim=1024)` (default) or `OpenAIEmbeddingProvider`
+
+### Step 13 — Indexer runs full scan
+
+```
+KnowCodeService → Indexer: Indexer(embedding_provider)
+ indexer.index_directory(directory)
+```
+
+The Indexer **runs its own internal scan + parse + chunk pipeline** (independent of the GraphBuilder scan above). This means files are scanned twice during `knowcode analyze` — once for the knowledge graph and once for the semantic index.
+
+Internally, `index_directory` uses Scanner + GraphBuilder to re-parse, then hands results to Chunker.
+
+### Step 14 — Chunker produces code chunks
+
+```
+Indexer → Chunker: Chunker.process_parse_result(result)
+Chunker returns: CodeChunk[] {id, entity_id, content, tokens[], metadata}
+```
+
+For each parsed entity, the Chunker produces:
+- A **module header chunk**: file path, docstring, top-level summary
+- An **import block chunk**: all import statements concatenated
+- **Entity chunks** (overlapping if the entity is large): signature + docstring + body, with configurable `max_chunk_size=1000` and `overlap=100` tokens
+
+Each chunk carries BM25-tokenized `tokens[]` for lexical search.
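+
+A sketch of the overlapping split with the sizes quoted above (the real Chunker works on source text and metadata, not bare token lists):
+
+```python
+def split_with_overlap(tokens: list[str], max_chunk_size: int = 1000,
+                       overlap: int = 100) -> list[list[str]]:
+    """Split a long entity into windows that overlap by `overlap` tokens."""
+    if len(tokens) <= max_chunk_size:
+        return [tokens]
+    step = max(1, max_chunk_size - overlap)
+    chunks, start = [], 0
+    while start < len(tokens):
+        chunks.append(tokens[start:start + max_chunk_size])
+        start += step
+    return chunks
+```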
+
+### Step 15 — [Loop] Embed chunks in batches
+
+```
+Indexer → EmbeddingProvider: EmbeddingProvider.embed(texts[]) → VoyageAI / OpenAI API call
+EmbeddingProvider returns: list[list[float]] (dim=1024, L2-normalized)
+```
+
+Batching: `batch_size=100` chunks per API call. Embeddings are L2-normalized to enable cosine similarity via FAISS `IndexFlatIP`.
+
+### Step 16 — Store chunks and vectors
+
+```
+Indexer → ChunkRepository: ChunkRepository.add(chunks)
+Indexer → VectorStore: VectorStore.add(chunks, embeddings)
+VectorStore: builds FAISS IndexFlatIP (inner product on normalized = cosine)
+```
+
+### Step 17 — Persist index to disk
+
+```
+Indexer: indexer.save(index_path)
+ → chunks.json (all CodeChunk objects)
+ → vectors.index (FAISS binary index)
+ → vectors.json (metadata: schema version, embedding dimension, model name)
+```
+
+### Step 18 — Return stats to CLI
+
+```
+Indexer returns to KnowCodeService: indexed_chunks count
+KnowCodeService returns to CLI: stats dict {entities, relationships, indexed_chunks, index_path, [index_error]}
+CLI → User: print summary (entity counts, relationship types, index size)
+```
+
+If `_build_index()` raises an exception (e.g., missing API key), `index_error` is included in stats but the overall `analyze` command still succeeds (knowledge graph was saved).
+
+---
+
+## Optional: File Watch Mode
+
+When `knowcode server --watch` is running:
+
+- `FileMonitor` (watchdog `Observer`) watches the project directory
+- On file save: `IndexingHandler.on_modified()` or `on_created()` → `_handle_change(path)` → extension filter → `bg_indexer.queue_file(path)`
+- `BackgroundIndexer._worker()` (daemon thread): dequeues paths, calls `indexer.index_file(path)`
+- `index_file(path)` re-runs steps 14–17 for the single changed file only (incremental, not full re-scan)
+- After re-index: the next API request automatically sees fresh data (no server restart needed)
+
+`POST /api/v1/reload` clears the in-memory `KnowledgeStore` cache; on next access it re-reads `knowcode_knowledge.json` from disk.
diff --git a/docs/diagrams/seq_mcp.drawio b/docs/diagrams/seq_mcp.drawio
new file mode 100644
index 0000000..484f042
--- /dev/null
+++ b/docs/diagrams/seq_mcp.drawio
@@ -0,0 +1,82 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/seq_mcp.md b/docs/diagrams/seq_mcp.md
new file mode 100644
index 0000000..0859756
--- /dev/null
+++ b/docs/diagrams/seq_mcp.md
@@ -0,0 +1,257 @@
+# Sequence Diagram — MCP Server Workflow
+
+> Textual narration of [`seq_mcp.drawio`](seq_mcp.drawio).
+> Every participant, message, and note in the draw.io file is described here in full.
+
+**Triggered by:** `knowcode mcp-server`
+**Transport:** STDIO / JSON-RPC 2.0
+**Clients:** Claude Desktop, VS Code, JetBrains, any MCP-compatible IDE
+
+---
+
+## Participants
+
+| Participant | File | Role |
+|---|---|---|
+| IDE / Claude Desktop | — | MCP client — sends `tools/call` JSON-RPC requests |
+| KnowCodeMCPServer | `mcp/server.py` | MCP server — routes tool calls, formats results |
+| KnowCodeService | `service.py` | Central orchestrator — performs all actual work |
+| KnowledgeStore | `storage/knowledge_store.py` | In-memory knowledge graph (entity/relationship data) |
+| ContextSynthesizer | `analysis/context_synthesizer.py` | Builds task-prioritized context bundles |
+| RetrievalOrchestrator | `retrieval/orchestrator.py` | Full hybrid retrieval pipeline (Tool 4 only) |
+
+---
+
+## Startup
+
+### Step 1 — Launch MCP server
+
+```
+User → KnowCodeMCPServer: knowcode mcp-server
+```
+
+### Step 2 — Start async runtime
+
+```
+KnowCodeMCPServer: run_server() → asyncio.run(run_server_async())
+```
+
+### Step 3 — Open STDIO transport
+
+```
+KnowCodeMCPServer: stdio_server(KnowCodeMCPServer)
+ → STDIO transport (stdin/stdout pipes)
+```
+
+### Step 4 — MCP initialize handshake
+
+```
+IDE / Claude Desktop → KnowCodeMCPServer:
+ MCP initialize (JSON-RPC 2.0)
+```
+
+### Step 5 — Advertise tools
+
+```
+KnowCodeMCPServer → IDE / Claude Desktop:
+ tools/list response → 4 tools with full JSON schemas
+```
+
+### Step 6 — Lazy service initialization
+
+```
+KnowCodeMCPServer → KnowCodeService:
+ KnowCodeService(store_path, strict_config=False)
+ [initialized on the first tool call, not at startup]
+```
+
+---
+
+## Tool 1 — `search_codebase`
+
+**Signature:** `search_codebase(query: str, limit: int = 10)`
+
+### Invocation
+
+```
+IDE → KnowCodeMCPServer:
+ tools/call {name: "search_codebase", arguments: {query, limit}}
+```
+
+### Execution
+
+```
+KnowCodeMCPServer → KnowCodeService: service.search(query)
+KnowCodeService → KnowledgeStore: knowledge_store.search(query)
+```
+
+`knowledge_store.search()` uses substring and token matching on entity `name` and `qualified_name` fields.
+
+### Response
+
+```
+KnowledgeStore → IDE:
+ [{id, name, qualified_name, kind, file_path, line_start}] top limit results
+```
+
+---
+
+## Tool 2 — `get_entity_context`
+
+**Signature:** `get_entity_context(entity_id: str, task_type: str = "general", max_tokens: int = 2000)`
+
+### Invocation
+
+```
+IDE → KnowCodeMCPServer:
+ tools/call {name: "get_entity_context", arguments: {entity_id, task_type, max_tokens}}
+```
+
+### Execution
+
+```
+KnowCodeMCPServer → KnowCodeService:
+ service.get_context(entity_id, task_type, max_tokens)
+
+KnowCodeService → KnowledgeStore:
+ entity = store.get_entity(entity_id) [fallback to store.search() if not found by ID]
+
+KnowCodeService → ContextSynthesizer:
+ synthesizer.synthesize_with_task(entity_id, task_type)
+ → applies TASK_TEMPLATES priority order + per-section boost multipliers
+```
+
+ContextSynthesizer fetches related nodes from KnowledgeStore:
+- `parent` entity
+- `callers[]` (entities that call this one)
+- `callees[]` (entities this one calls)
+- `children[]` (nested entities)
+
+```
+ContextSynthesizer:
+ _calculate_sufficiency(task_type, content_included, entity, text) → float 0.0–1.0
+```
+
+### Response
+
+```
+ContextSynthesizer → IDE:
+ {entity_id, qualified_name, context_text, total_tokens, sufficiency_score, task_type}
+```
+
+---
+
+## Tool 3 — `trace_calls`
+
+**Signature:** `trace_calls(entity_id: str, direction: str = "callees", depth: int = 1)`
+
+Valid direction values: `callers` | `callees`. Valid depth range: 1–5.
+
+### Invocation
+
+```
+IDE → KnowCodeMCPServer:
+ tools/call {name: "trace_calls", arguments: {entity_id, direction, depth}}
+```
+
+### Execution
+
+```
+KnowCodeMCPServer → KnowCodeService:
+ service.store.trace_calls(entity_id, direction, depth, max_results=50)
+```
+
+```
+KnowledgeStore:
+ BFS traversal on relationship graph
+ (CALLS / IMPORTED_BY edges, up to `depth` levels, max_results=50 nodes)
+```
+
+### Response
+
+```
+KnowledgeStore → IDE:
+ [{id, name, qualified_name, kind, file_path, line_start, call_depth}]
+```
+
+---
+
+## Tool 4 — `retrieve_context_for_query`
+
+**Signature:**
+```
+retrieve_context_for_query(
+ query: str,
+ task_type: str = "auto",
+ max_tokens: int = 6000,
+ limit_entities: int = 3,
+ expand_deps: bool = True,
+ verbosity: str = "minimal"
+)
+```
+
+### Invocation
+
+```
+IDE → KnowCodeMCPServer:
+ tools/call {
+ name: "retrieve_context_for_query",
+ arguments: {query, task_type, max_tokens, limit_entities, expand_deps, verbosity}
+ }
+```
+
+### Execution
+
+```
+KnowCodeMCPServer → KnowCodeService:
+ service.retrieve_context_for_query(…)
+
+KnowCodeService → RetrievalOrchestrator:
+ full hybrid pipeline:
+ classify → embed → BM25+FAISS → rerank → expand_dependencies → synthesize
+```
+
+> This is the same pipeline described in `seq_query_retrieval.drawio` — steps 4 through 14 apply in full.
+
+```
+RetrievalOrchestrator → KnowCodeMCPServer:
+ {context_text, sufficiency_score, total_tokens,
+ [+ query, task_type, retrieval_mode, evidence[] per verbosity level]}
+```
+
+### Result formatting
+
+```
+KnowCodeMCPServer:
+ format_result()
+ → MCP content block {type: "text", text: json.dumps(result)}
+
+KnowCodeMCPServer → IDE:
+ tools/call response
+```
+
+---
+
+## Error Handling
+
+All tool handler exceptions are caught at the server level. On error the server returns:
+
+```json
+{
+ "isError": true,
+ "content": [{"type": "text", "text": ""}]
+}
+```
+
+No unhandled exception propagates through the STDIO transport.
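+
+A sketch of that guard (the actual handler wiring in `mcp/server.py` differs, but the shape of the success and error payloads follows the description above):
+
+```python
+import json
+
+def call_tool_safely(handler, arguments: dict) -> dict:
+    try:
+        result = handler(**arguments)
+        # Success: same shape as format_result(), a single text content block.
+        return {"content": [{"type": "text", "text": json.dumps(result)}]}
+    except Exception as exc:
+        # Failure: flagged with isError so the client sees a tool error, not a crash.
+        return {"isError": True,
+                "content": [{"type": "text", "text": f"Error: {exc}"}]}
+```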
+
+---
+
+## Tool Summary
+
+| Tool | Arguments | Internal call | Returns |
+|---|---|---|---|
+| `search_codebase` | `query`, `limit=10` | `knowledge_store.search()` — substring + token match | `[{id, name, qualified_name, kind, file_path, line_start}]` top limit |
+| `get_entity_context` | `entity_id`, `task_type=general`, `max_tokens=2000` | `synthesize_with_task()` + `_calculate_sufficiency()` | `{entity_id, qualified_name, context_text, total_tokens, sufficiency_score, task_type}` |
+| `trace_calls` | `entity_id`, `direction=callees`, `depth=1` | BFS on relationship graph (max\_results=50) | `[{id, name, qualified_name, kind, file_path, line_start, call_depth}]` |
+| `retrieve_context_for_query` | `query`, `task_type=auto`, `max_tokens=6000`, `limit_entities=3`, `expand_deps=true`, `verbosity=minimal` | Full hybrid pipeline (steps 4–14 of seq\_query\_retrieval) | `{context_text, sufficiency_score, total_tokens, …per verbosity}` |
diff --git a/docs/diagrams/seq_query_retrieval.drawio b/docs/diagrams/seq_query_retrieval.drawio
new file mode 100644
index 0000000..3a302a6
--- /dev/null
+++ b/docs/diagrams/seq_query_retrieval.drawio
@@ -0,0 +1,132 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/diagrams/seq_query_retrieval.md b/docs/diagrams/seq_query_retrieval.md
new file mode 100644
index 0000000..90b6914
--- /dev/null
+++ b/docs/diagrams/seq_query_retrieval.md
@@ -0,0 +1,273 @@
+# Sequence Diagram — Query / Retrieval Workflow
+
+> Textual narration of [`seq_query_retrieval.drawio`](seq_query_retrieval.drawio).
+> Every participant, message, and note in the draw.io file is described here in full.
+
+**Triggered by:** `knowcode context` · `knowcode ask` · REST `POST /api/v1/context/query` · MCP `retrieve_context_for_query`
+
+---
+
+## Participants
+
+| Participant | File | Role |
+|---|---|---|
+| User / Agent | — | Issues query or question |
+| CLI / REST / MCP | `cli/cli.py`, `api/api.py`, `mcp/server.py` | Entry point — routes to KnowCodeService |
+| KnowCodeService | `service.py` | Central orchestrator |
+| RetrievalOrchestrator | `retrieval/orchestrator.py` | Validates, classifies, retrieves, synthesizes |
+| QueryClassifier | `llm/query_classifier.py` | Detects task type via regex pattern matching |
+| SearchEngine | `retrieval/search_engine.py` | Embeds query, calls HybridIndex, reranks |
+| HybridIndex | `retrieval/hybrid_index.py` | Merges BM25 (lexical) + FAISS (dense) results |
+| Reranker | `retrieval/reranker.py` | Cross-encoder reranking (VoyageAI primary, signal fallback) |
+| expand\_dependencies | `retrieval/completeness.py` | Expands callee context for top-ranked chunks |
+| ContextSynthesizer | `analysis/context_synthesizer.py` | Builds ContextBundle; computes sufficiency score |
+| Agent / LLM (ask cmd) | `llm/agent.py` | Generates natural language answer (Alt B only) |
+
+---
+
+## Step 1 — User invokes query entry point
+
+```
+User → CLI/REST/MCP: query / question / entity_id
+```
+
+The caller uses one of four entry points:
+- `knowcode context <target>` — CLI, returns structured context
+- `knowcode ask <question>` — CLI, returns an LLM-generated answer
+- `POST /api/v1/context/query` — REST API (`QueryRequest`)
+- `retrieve_context_for_query` — MCP tool call
+
+## Step 2 — Entry point calls service
+
+```
+CLI/REST/MCP → KnowCodeService:
+ service.retrieve_context_for_query(
+ query, max_tokens=6000, task_type,
+ limit_entities=3, expand_deps, verbosity
+ )
+```
+
+## Step 3 — Service delegates to orchestrator
+
+```
+KnowCodeService → RetrievalOrchestrator:
+ orchestrator.retrieve_context_for_query(…)
+```
+
+## Step 4 — Validate preconditions
+
+```
+RetrievalOrchestrator: _assert_store_exists() + _assert_index_exists()
+```
+
+Raises HTTP 412 if the knowledge store or semantic index has not been built yet.
+
+## Step 5 — Classify query
+
+```
+RetrievalOrchestrator → QueryClassifier: classify_query(query)
+```
+
+The classifier uses five sets of weighted regex patterns (one per `TaskType`):
+- `EXPLAIN`, `DEBUG`, `EXTEND`, `REVIEW`, `LOCATE` (with `GENERAL` returned when nothing matches)
+
+Returns: `(TaskType, confidence)`.
+
+`resolved_task_type = task_type override (if caller supplied) OR detected task_type`
+
+## Step 6 — Lazy-init search engine
+
+```
+RetrievalOrchestrator:
+ service.get_search_engine()
+ → HybridIndex(chunk_repository, vector_store) [created once, cached]
+```
+
+## Step 7 — Validate index compatibility
+
+```
+RetrievalOrchestrator:
+ _validate_index_compatibility(index_path)
+ → checks embedding dimension + model name match
+ → raises on mismatch
+```
+
+## Step 8 — Search: retrieve scored chunks
+
+```
+RetrievalOrchestrator → SearchEngine:
+ engine.search_scored(query, limit=max(10, limit_entities×5), expand_deps)
+```
+
+### Step 9 — Embed query
+
+```
+SearchEngine:
+ embedding_provider.embed_single(query) → query_vector (dim=1024)
+```
+
+### Step 10 — Hybrid search
+
+```
+SearchEngine → HybridIndex: hybrid_index.search(query, query_vec, limit=limit×2)
+```
+
+Internally HybridIndex executes three sub-steps:
+
+- **10a** — BM25 search on `ChunkRepository` token lists (lexical)
+- **10b** — FAISS similarity search on `VectorStore` (`IndexFlatIP`, cosine similarity via L2-normalized inner product)
+- **10c** — Merge + normalize scores → `list[(CodeChunk, score)]`
+
+Returns: top `limit×2` candidates back to SearchEngine.
+
+### Step 11 — Rerank
+
+```
+SearchEngine → Reranker: reranker.rerank(query, results, top_k=limit)
+```
+
+- **Primary**: VoyageAI `rerank-2.5` cross-encoder
+- **Fallback** (if VoyageAI unavailable): signal-based scoring:
+ - `boost_documented × 1.2`
+ - `boost_recent × 1.1`
+ - query text found in content: `× 1.5`
+ - exact entity kind match: `× 2.0`
+
+Returns: `list[(CodeChunk, score)]` top\_k reranked.
+
+### Step 12 — Expand dependencies
+
+```
+SearchEngine → expand_dependencies(chunk, chunk_repo, store, max_depth=1)
+```
+
+For each top-ranked chunk (when `expand_deps=True`):
+- `chunk_repo.get_by_entity(entity_id)` — fetch all chunks for the entity
+- `store.get_callees(entity_id)` — walk CALLS relationships one level deep
+
+Returns: `list[ScoredChunk]` with `source` field: `retrieved` (original result) or `dependency` (callee).
+
+SearchEngine returns `List[ScoredChunk]` to RetrievalOrchestrator.
+
+---
+
+> **Note — Semantic fallback**: If semantic retrieval raises an exception,
+> RetrievalOrchestrator falls back to lexical search:
+> `store.search(query)` + keyword expansion.
+
+---
+
+## Step 13 — [Loop] Synthesize context per entity
+
+For each selected `entity_id` (top `limit_entities` unique entities from the evidence list):
+
+```
+RetrievalOrchestrator → ContextSynthesizer:
+ service.get_context(
+ entity_id, task_type,
+ per_entity_max_tokens,
+ summarize=(verbosity == 'minimal')
+ )
+```
+
+Internally:
+
+- **13a** — `synthesize_with_task(entity_id, task_type)` — applies `TASK_TEMPLATES` priority order and per-section boost multipliers for the resolved task type
+- **13b** — `_calculate_sufficiency(task_type, content_included, entity, text)` → float `0.0–1.0`
+
+Returns:
+```
+{
+ context_text,
+ total_tokens,
+ truncated,
+ included_entities,
+ task_type,
+ sufficiency_score
+}
+```
+
+## Step 14 — Assemble final response
+
+```
+RetrievalOrchestrator:
+ context_text = '\n---\n'.join(context_parts)
+ sufficiency = avg(sufficiency_scores)
+ apply verbosity filter
+```
+
+### Verbosity filter
+
+| Level | Fields returned |
+|---|---|
+| `minimal` | `context_text`, `sufficiency_score`, `total_tokens`, `reduction_summary` |
+| `standard` | + `query`, `task_type`, `task_confidence`, `retrieval_mode`, `max_tokens`, `truncated` |
+| `verbose` | + `evidence[]` (`rank`, `chunk_id`, `entity_id`, `score`, `source`) |
+| `diagnostic` | full dict — all fields + `errors[]` |
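+
+The filter is cumulative; a sketch:
+
+```python
+VERBOSITY_FIELDS = {
+    "minimal": {"context_text", "sufficiency_score", "total_tokens", "reduction_summary"},
+    "standard": {"query", "task_type", "task_confidence", "retrieval_mode",
+                 "max_tokens", "truncated"},
+    "verbose": {"evidence"},
+    # "diagnostic" returns the full dict, errors[] included.
+}
+
+def apply_verbosity(full: dict, verbosity: str) -> dict:
+    if verbosity == "diagnostic":
+        return full
+    allowed = set()
+    for level in ("minimal", "standard", "verbose"):
+        allowed |= VERBOSITY_FIELDS[level]
+        if level == verbosity:
+            break
+    return {k: v for k, v in full.items() if k in allowed}
+```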
+
+---
+
+## Alt A — Return context to caller
+
+**Applies to:** `CLI context` · `REST /api/v1/context/query` · `MCP retrieve_context_for_query`
+
+```
+Step 15a:
+ KnowCodeService → CLI/REST/MCP: QueryResponse / ContextResponse
+ CLI/REST/MCP → User: structured context dict
+```
+
+---
+
+## Alt B — Ask command: pass to Agent / LLM
+
+**Applies to:** `CLI ask`
+
+### Step 15b — Invoke Agent
+
+```
+CLI → Agent: agent.answer(query) OR agent.smart_answer(query, force_llm)
+```
+
+### smart\_answer sufficiency check
+
+```
+Agent: check sufficiency_score ≥ threshold (default 0.8, from AppConfig)
+```
+
+- **If sufficient**: `_format_local_answer()` — returns context-only answer; no LLM tokens consumed.
+- **If insufficient or `force_llm=True`**: proceed to LLM call below.
+
+### Step 16 — Build prompt
+
+```
+Agent:
+ get_prompt_template(task_type) + context_text + question
+```
+
+### Step 17 — LLM failover loop
+
+```
+[ loop ] for each model in config.models order (RPM + RPD rate-limit check per model)
+```
+
+- **17a** — Google Gemini: `client.models.generate_content(model, prompt)`
+- **17b** — OpenAI-compatible (OpenRouter / Mistral): `client.chat.completions.create(model, messages)`
+- **17c** — `rate_limiter.record_usage(model.name)` → `~/.knowcode/usage_stats.json`
+- **17d** — On `ResourceExhausted` or other error → try next model in list
+
+```
+[ end loop ]
+```
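+
+In sketch form (the rate limiter and client helpers are stand-ins; the per-provider call shapes are the ones listed in 17a/17b above):
+
+```python
+def answer_with_failover(models: list, build_client, prompt: str,
+                         rate_limiter) -> str:
+    """Try each configured model in order, skip models over their RPM/RPD
+    budget, record usage on success, and fall through to the next on error."""
+    last_error = None
+    for model in models:
+        if not rate_limiter.allow(model.name):        # RPM + RPD check (assumed helper)
+            continue
+        try:
+            client = build_client(model)
+            text = client.complete(prompt)             # provider-specific call
+            rate_limiter.record_usage(model.name)      # ~/.knowcode/usage_stats.json
+            return text
+        except Exception as exc:                       # e.g. ResourceExhausted
+            last_error = exc
+            continue                                   # try the next model
+    raise RuntimeError(f"all models failed: {last_error}")
+```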
+
+### Step 18 — Return answer
+
+```
+Agent → CLI: answer text
+```
+
+### Step 19 — CLI returns to User
+
+```
+CLI → User: {answer, source=llm|local, task_type, sufficiency_score}
+```