feat(qwen35moe): pooled chunked prefill + snapshot/restore over KVFlash #430
dusterbloom wants to merge 5 commits into
6 issues found across 16 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_target_graph.cpp">
<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
+ // Restore snapshot (skip KV copy when pooled; pager handles KV separately).
+ const PrefixSnapshot & snap_ref = prefix_snapshots_[slot];
+ const bool snap_pooled = snap_ref.is_pooled;
+ restore_target_cache(snap_ref, cache_, snap_pooled);
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 899:
<comment>restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</comment>
<file context>
@@ -851,16 +893,29 @@ GenerateResult Qwen35Backend::restore_and_generate_impl(int slot,
+ // Restore snapshot (skip KV copy when pooled; pager handles KV separately).
+ const PrefixSnapshot & snap_ref = prefix_snapshots_[slot];
+ const bool snap_pooled = snap_ref.is_pooled;
+ restore_target_cache(snap_ref, cache_, snap_pooled);
+
+ // Pooled restore: rebuild pager from blob so KV rows are accessible.
</file context>
- restore_target_cache(snap_ref, cache_, snap_pooled);
+ if (!restore_target_cache(snap_ref, cache_, snap_pooled)) {
+     result.error = "restore";
+     out_io.emit(-1);
+     return result;
+ }
+ kill -0 "$pid" 2>/dev/null || break
+ sleep 2
+ done
+ curl -fsS "http://$HOST:$PORT/v1/chat/completions" -H 'Content-Type: application/json' \
P2: Don't use || true to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_kvflash_moe_paged.sh, line 61:
<comment>Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</comment>
<file context>
@@ -0,0 +1,83 @@
+ kill -0 "$pid" 2>/dev/null || break
+ sleep 2
+ done
+ curl -fsS "http://$HOST:$PORT/v1/chat/completions" -H 'Content-Type: application/json' \
+ --data @"$REQ" 2>/dev/null \
+ | python3 -c 'import sys,json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])' \
</file context>
+ // qwen3.6-35B-A3B-like budget on a 24 GiB card:
+ // ~80 KiB/token KV (5 GiB @ 65536, 10 GiB @ 131072)
+ // experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB present.
+ const uint64_t MiB = 1024ull * 1024;
P3: Missing #include <cstdint> for uint64_t. Test file relies on transitive include from the header kvflash_placement.h, which makes it fragile against future header cleanup.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_kvflash_placement.cpp, line 26:
<comment>Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</comment>
<file context>
@@ -0,0 +1,85 @@
+ // qwen3.6-35B-A3B-like budget on a 24 GiB card:
+ // ~80 KiB/token KV (5 GiB @ 65536, 10 GiB @ 131072)
+ // experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB present.
+ const uint64_t MiB = 1024ull * 1024;
+ const uint64_t GiB = 1024ull * MiB;
+ const uint64_t kv_per_tok = 80 * 1024; // bytes/token
</file context>
  // Persistent pipelined state (initialized once, reused across requests)
  std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+ std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
P3: New private members are unused dead code (hybrid_spec_graph_cache_, spec_microbench_done_). Drop them until the cache/microbench path is actually implemented.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.h, line 111:
<comment>New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</comment>
<file context>
@@ -83,13 +96,20 @@ class Qwen35MoeBackend : public Qwen35Backend {
// Persistent pipelined state (initialized once, reused across requests)
std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+ std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
+ bool spec_microbench_done_ = false;
bool ensure_pipe_state(int kv_start);
</file context>
12 issues found across 53 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_target_graph.cpp">
<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
+ while not log_path.exists() and time.time() < deadline:
+     time.sleep(1)
+
+ cache_off = done_off = spec_off = ar_off = pflash_off = survival_off = 0
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/replay_harness.py, line 723:
<comment>Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</comment>
<file context>
@@ -0,0 +1,1361 @@
+ while not log_path.exists() and time.time() < deadline:
+ time.sleep(1)
+
+ cache_off = done_off = spec_off = ar_off = pflash_off = survival_off = 0
+
+ results = []
</file context>
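One way to avoid re-parsing old log lines is to initialise the offsets once, before the repeat loop, and advance them after every read. The sketch below is illustrative only: `read_new_lines` and `demo` are hypothetical names, and it tracks a single byte offset, whereas the harness keeps several (`cache_off`, `done_off`, and so on) that would each be carried forward the same way.

```python
import os
import tempfile


def read_new_lines(log_path, offset):
    """Return (complete lines appended since byte `offset`, new byte offset)."""
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Consume only complete lines so a half-written line is re-read next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def demo():
    """Simulate three repeats appending to one shared log file."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "server.log")
        offset = 0  # initialised once, outside the repeat loop
        seen = []
        for repeat in range(3):
            with open(log_path, "a", encoding="utf-8") as fh:
                fh.write(f"done repeat={repeat}\n")
            # The offset returned here is reused on the next repeat.
            lines, offset = read_new_lines(log_path, offset)
            seen.append(lines)
        return seen
```

With the offset carried across iterations, each repeat sees only the lines written during that repeat.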
+ // If N changed from default 5, the IDs were definitely set by
+ // early-read and should be respected.
+ const bool was_early_read = (N != DFLASH27B_DRAFT_N_TARGET_LAYERS);
+ if (was_early_read) {
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/gguf_target_loader.cpp, line 480:
<comment>Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</comment>
<file context>
@@ -463,12 +463,41 @@ bool load_target_gguf_partial(const std::string & path,
+ // If N changed from default 5, the IDs were definitely set by
+ // early-read and should be respected.
+ const bool was_early_read = (N != DFLASH27B_DRAFT_N_TARGET_LAYERS);
+ if (was_early_read) {
+ std::printf("[loader] using drafter-specified capture layers (%d)\n", N);
+ } else {
</file context>
+ obj["extra_body"]["session_id"] = self.session_id
+ if self.force_temperature is not None:
+     obj["temperature"] = self.force_temperature
+ if self.think_budget and path.startswith("/v1/messages"):
P2: think_budget uses truthiness, so 0 is treated as "unset" and skips thinking injection for /v1/messages.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/session_inject_proxy.py, line 125:
<comment>`think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) </comment>
<file context>
@@ -99,14 +102,28 @@ def do_POST(self):
+ obj["extra_body"]["session_id"] = self.session_id
+ if self.force_temperature is not None:
+ obj["temperature"] = self.force_temperature
+ if self.think_budget and path.startswith("/v1/messages"):
+ obj["thinking"] = {"type": "enabled", "budget_tokens": self.think_budget}
body = json.dumps(obj).encode("utf-8")
</file context>
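A minimal sketch of the explicit check, mirroring the `force_temperature is not None` test two lines above it in the hunk. `inject_thinking` is a hypothetical helper name, and the sketch assumes an unset budget is represented as `None`. Whether a zero budget should be sent as `enabled` with `budget_tokens: 0` or mapped to some other payload is a decision for the author. The sketch only shows that `0` is no longer conflated with "unset".

```python
def inject_thinking(obj, path, think_budget):
    """Add a `thinking` block whenever a budget was configured, including 0."""
    # `is not None` keeps a configured 0; plain truthiness would drop it.
    if think_budget is not None and path.startswith("/v1/messages"):
        obj["thinking"] = {"type": "enabled", "budget_tokens": think_budget}
    return obj
```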
  --model "$MODEL_ID" \
- --tools "$CLAUDE_TOOLS" \
- --permission-mode dontAsk \
+ --dangerously-skip-permissions \
P2: CLAUDE_TOOLS config is now ignored because --tools was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/run_claude_code.sh, line 79:
<comment>`CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</comment>
<file context>
@@ -69,9 +76,9 @@ timeout "${CLAUDE_TIMEOUT}s" "$CLAUDE_BIN" \
--model "$MODEL_ID" \
- --tools "$CLAUDE_TOOLS" \
- --permission-mode dontAsk \
+ --dangerously-skip-permissions \
--no-session-persistence \
+ "${CLAUDE_EXTRA[@]}" \
</file context>
- --dangerously-skip-permissions \
+ --tools "$CLAUDE_TOOLS" \
+ --dangerously-skip-permissions \
+ str(SERVER_BIN),
+ str(TGT),
+ "--host", HOST,
+ "--port", str(PORT),
P2: Configured --port is ignored when launching the server; server and client can target different ports.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/replay_harness.py, line 514:
<comment>Configured `--port` is ignored when launching the server; server and client can target different ports.</comment>
<file context>
@@ -0,0 +1,1361 @@
+ str(SERVER_BIN),
+ str(TGT),
+ "--host", HOST,
+ "--port", str(PORT),
+ "--max-ctx", str(MAX_CTX),
+ "--cache-type-k", ctk,
</file context>
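One possible shape for a fix is to derive both the server command and the client URL from the same parsed arguments, so they cannot drift apart. This is a sketch under assumptions: `parse_args`, `build_server_cmd`, and `base_url` are hypothetical names, the default port is a placeholder, and only the `--host`, `--port`, and `--max-ctx` server flags are taken from the quoted hunk.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Placeholder defaults for this sketch, not the harness's real values.
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8080)
    return parser.parse_args(argv)


def build_server_cmd(server_bin, target, args, max_ctx):
    """Build the launch command from the same parsed args the client uses."""
    return [
        str(server_bin),
        str(target),
        "--host", args.host,
        "--port", str(args.port),
        "--max-ctx", str(max_ctx),
    ]


def base_url(args):
    """URL the client should target; shares host and port with the server."""
    return f"http://{args.host}:{args.port}"
```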
+| f16 | 18.0s | 174 | 76.8 | 12.86 |
+| q4_0 | 18.0s | 167 | 76.8 | 12.86 |
+| q8_0 | 18.1s | 143 | 66.4 | 11.25 |
+| tq3_0 | 23.6s | 109 | 76.8 | 12.86 |
P3: Truncated sentence in KV precision sweep analysis — f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4 cuts off mid-thought with no closing paren or wrap-up for the section.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/NOTES.md, line 51:
<comment>Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</comment>
<file context>
@@ -0,0 +1,56 @@
+| f16 | 18.0s | 174 | 76.8 | 12.86 |
+| q4_0 | 18.0s | 167 | 76.8 | 12.86 |
+| q8_0 | 18.1s | 143 | 66.4 | 11.25 |
+| tq3_0 | 23.6s | 109 | 76.8 | 12.86 |
+f16 best; q4_0 EQUAL (free VRAM saver, no accept/AL cost); q8_0 ANOMALOUS (lower accept 66.4
+## KVFlash added to the 35B agentic config?
</file context>
- if not args.session_id:
-     print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True)
+ if not args.session_id and args.force_temperature is None:
P3: Startup warning is inaccurate when only THINK_BUDGET is configured. It can mislead debugging because proxy is not pass-through in that mode.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/session_inject_proxy.py, line 143:
<comment>Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</comment>
<file context>
@@ -120,19 +137,23 @@ def main():
- if not args.session_id:
- print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True)
+ if not args.session_id and args.force_temperature is None:
+ print("[session-proxy] WARNING: no session_id or force_temperature set; proxy is pass-through only", flush=True)
</file context>
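A sketch of a warning condition that covers every injection mode. It assumes the think budget is available as a separate value that is `None` when unset. `passthrough_warning` is a hypothetical helper and the message wording is illustrative.

```python
def passthrough_warning(session_id, force_temperature, think_budget):
    """Return a warning string only when the proxy injects nothing at all."""
    injects_something = (
        bool(session_id)
        or force_temperature is not None
        or think_budget is not None
    )
    if injects_something:
        return None
    return (
        "[session-proxy] WARNING: no session_id, force_temperature or "
        "think_budget set; proxy is pass-through only"
    )
```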
+ - ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table).
+ - ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md).
+
+ See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/RECIPE.md, line 123:
<comment>Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</comment>
<file context>
@@ -0,0 +1,124 @@
+- ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table).
+- ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md).
+
+See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and
+`ctxsweep/NOTES.md` for the supporting measurements.
</file context>
1 issue found across 13 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>
+ "mirror_cap": 40960,
+ "prompt": "needle_06k",
+ "status": "OK",
+ "accept_pct": 92.7,
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json, line 89:
<comment>Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</comment>
<file context>
@@ -0,0 +1,130 @@
+ "mirror_cap": 40960,
+ "prompt": "needle_06k",
+ "status": "OK",
+ "accept_pct": 92.7,
+ "avg_commit": 14.83,
+ "decode_tps_spec": 220.57,
</file context>
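The invariant described in this comment could be checked mechanically before results are committed. The sketch below assumes the results can be treated as a list of row dictionaries, which is not confirmed by the quoted hunk. The field names come from the comment, and `inconsistent_rows` is a hypothetical helper.

```python
SPEC_FIELDS = ("accept_pct", "avg_commit", "decode_tps_spec")


def inconsistent_rows(rows):
    """Return 1-based indices of slow-gated rows that still carry spec metrics."""
    bad = []
    for index, row in enumerate(rows, start=1):
        slow_gated = row.get("gate_floor") == "slow"
        has_spec = any(row.get(field) is not None for field in SPEC_FIELDS)
        if slow_gated and has_spec:
            bad.append(index)
    return bad
```

A check like this only flags the contradiction. Deciding whether `gate_floor` or the spec fields are wrong for a given row still requires looking at the run's log.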
9b501bd to fc90d1e
33 issues found across 22 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: The health check is not tied to the spawned server process, so the benchmark could run against an unrelated server listening on the same fixed port.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: `kill_server` sends SIGKILL without reaping the child; add `proc.wait()` to avoid accumulating zombie processes.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">
<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>
<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at the claimed context tiers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>
P1: Health check is not tied to the spawned server process, so the benchmark could run against an unrelated server on the same fixed port.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:
<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>
<file context>
@@ -0,0 +1,328 @@
+ proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+ print(f"Server PID: {proc.pid}")
+
+ healthy = wait_healthy()
+ if not healthy:
+ print("ERROR: Server did not become healthy within timeout")
</file context>
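One way to bind the health check to the spawned process is sketched below. It is a minimal illustration, not the script's actual code: the `wait_healthy(proc, url, ...)` signature, the timeouts, and the use of `urllib` are assumptions. The check only confirms that the child is still alive when the port answers; it does not prove the child owns the port, so a stricter variant would also pick a free port per run.

```python
import subprocess
import sys
import time
import urllib.error
import urllib.request


def wait_healthy(proc, url, timeout_s=300.0, poll_s=1.0):
    """Return True once `url` answers 200 while `proc` is still running.

    Returns False if the child exits first or the deadline passes.
    (Hypothetical signature; the original wait_healthy() takes no arguments.)
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # If the spawned server has exited, any answer on the port
        # must come from some other process, so fail fast.
        if proc.poll() is not None:
            return False
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200 and proc.poll() is None:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    return False
```

Passing `proc` in also shortens failure detection: a server that crashes during load is reported immediately instead of after the full timeout.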
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:
<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:
<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>
<file context>
@@ -0,0 +1,586 @@
+ return cmd
+
+
+def launch_server(log_path):
+ """Spawn the server in a child process. Returns (proc, log_fh)."""
+ env = os.environ.copy()
</file context>
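A possible shape for a single, lock-holding launch path is sketched below. It assumes the lock is taken in-process with `fcntl.flock` (the convention another comment in this review attributes to neighboring scripts) rather than through the `flock` command-line wrapper. The `gpu_lock` helper and the two-argument `launch_server(cmd, log_path)` signature are hypothetical.

```python
import fcntl
import os
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def gpu_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def launch_server(cmd, log_path):
    """Single launch path: every caller spawns the server through here."""
    log_fh = open(log_path, "w")
    proc = subprocess.Popen(cmd, stdout=log_fh, stderr=subprocess.STDOUT)
    return proc, log_fh


# Intended use (lock path as an assumption):
#   with gpu_lock("/tmp/lucebox_gpu.lock"):
#       proc, log_fh = launch_server(cmd, log_path)
#       ... run the benchmark, then stop and reap proc ...
```

Keeping the lock outside `launch_server` means it covers the whole benchmark run, not just process startup.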
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:
<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>
<file context>
@@ -0,0 +1,408 @@
+
+ for line in lines:
+ line = line.strip()
+ if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+ result["oom"] = True
+ if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>
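The mismatch and one possible fix can be shown in isolation. This is a small sketch: the helper name `line_signals_gpu_failure` is hypothetical, and it keeps the original behavior of treating all three markers as a single condition.

```python
def line_signals_gpu_failure(line):
    """Return True if a log line carries any of the three failure markers."""
    lowered = line.lower()
    return (
        "out of memory" in lowered
        or "OOM" in line            # case-sensitive, as in the original check
        or "cuda error" in lowered  # literal is lowercase to match the lowered text
    )


sample = "CUDA error: an illegal memory access was encountered"

# The original comparison: an uppercase literal can never occur in lowered text.
print("CUDA error" in sample.lower())    # False
print(line_signals_gpu_failure(sample))  # True
```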
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:
<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>
<file context>
@@ -0,0 +1,408 @@
+ deadline = time.time() + timeout
+ while time.time() < deadline:
+ try:
+ result = subprocess.run(
+ ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+ capture_output=True, text=True, timeout=5
</file context>
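A sketch of the kind of check the comment asks for follows. The names `run_json_command` and `RequestFailed` are hypothetical, and it assumes the request command prints a JSON body on stdout, which the excerpt does not confirm.

```python
import json
import subprocess
import sys


class RequestFailed(RuntimeError):
    """Raised when the request command fails or returns an unusable body."""


def run_json_command(cmd, timeout_s=600):
    """Run `cmd`, require exit code 0, and parse stdout as JSON."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if result.returncode != 0:
        # With `curl -f`, HTTP errors surface here as a non-zero exit code.
        raise RequestFailed(f"exit {result.returncode}: {result.stderr.strip()[:200]}")
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RequestFailed(f"non-JSON response: {result.stdout[:200]!r}") from exc
```

A caller such as `run_cell` can then catch `RequestFailed` and mark the cell as failed instead of extracting metrics from an empty response.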
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py, line 283:
<comment>`is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</comment>
<file context>
@@ -0,0 +1,437 @@
+ wall_s = parse_wall_s(parsed["server_done"])
+ prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+ gate_line = parsed["spec_gate"]
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+
+ gate_floor_reason = "N/A"
</file context>
Suggested change:
- is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is not None
P2: Help text example for `--extra-server-arg` uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 358:
<comment>Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</comment>
<file context>
@@ -0,0 +1,452 @@
+ action="append",
+ default=[],
+ metavar="ARG",
+ help="Extra arg to pass to BOTH server binaries (repeatable). "
+ "E.g. --extra-server-arg --cache-type-k --extra-server-arg f16",
+ )
</file context>
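The parse failure comes from argparse treating a value that begins with `--` as a new option rather than as the argument of the preceding one. A minimal sketch of the `=` form that avoids this is shown below. The `build_parser` helper is hypothetical and reproduces only the one option from the excerpt.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--extra-server-arg",
        action="append",
        default=[],
        metavar="ARG",
        help="Extra arg to pass to the server (repeatable). "
             "E.g. --extra-server-arg=--cache-type-k --extra-server-arg=f16",
    )
    return parser


# Joining option and value with '=' lets a value that starts with '--' through.
args = build_parser().parse_args(
    ["--extra-server-arg=--cache-type-k", "--extra-server-arg=f16"]
)
print(args.extra_server_arg)  # ['--cache-type-k', 'f16']
```

The space-separated form from the original help text, `--extra-server-arg --cache-type-k`, exits with "expected one argument".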
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at the claimed context tiers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 61:
<comment>Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</comment>
<file context>
@@ -0,0 +1,452 @@
+SEED = 42
+N_GEN = 128 # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens
+
+CTXSWEEP_DIR = os.path.dirname(os.path.abspath(__file__))
</file context>
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md, line 131:
<comment>Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</comment>
<file context>
@@ -0,0 +1,143 @@
+
+| Bench | Blog Target | This Run | Status |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5 | — | e9cb2790bb8ede64 | — |
+| HumanEval mean tok/s | 129.52 | **110.21** | FAIL -19.3 tok/s |
+| HumanEval mean AL | 8.31 | **11.04** | PASS +2.73 |
</file context>
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At thoughts/shared/plans/cuda_graph_replay_team_plan.md, line 20:
<comment>Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</comment>
<file context>
@@ -0,0 +1,32 @@
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
+Total ~970K tokens.
+
</file context>
| proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path) | ||
| print(f"Server PID: {proc.pid}") | ||
|
|
||
| healthy = wait_healthy() |
There was a problem hiding this comment.
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:
<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>
<file context>
@@ -0,0 +1,328 @@
+ proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+ print(f"Server PID: {proc.pid}")
+
+ healthy = wait_healthy()
+ if not healthy:
+ print("ERROR: Server did not become healthy within timeout")
</file context>
|
|
||
| ## Verdict | ||
|
|
||
| **The 15% gap is PRIMARILY THE MODEL, not the config.** |
There was a problem hiding this comment.
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:
<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>
| return cmd | ||
|
|
||
|
|
||
| def launch_server(log_path): |
There was a problem hiding this comment.
P1: --run-server path omits the documented flock GPU lock because launch logic is duplicated and inconsistent between launch_server_cmd() and launch_server(). This can cause GPU contention and corrupt benchmark validity.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:
<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>
<file context>
@@ -0,0 +1,586 @@
+ return cmd
+
+
+def launch_server(log_path):
+ """Spawn the server in a child process. Returns (proc, log_fh)."""
+ env = os.environ.copy()
</file context>
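One way to keep the two launch paths from drifting apart is to take the GPU lock in a single shared helper that both call. The sketch below is an assumption, not the script's actual code: `gpu_lock` is a hypothetical name, and the lock path is the `/tmp/lucebox_gpu.lock` convention this review cites for the neighboring scripts.

```python
import contextlib
import fcntl
import os

# Path cited in this review for neighboring scripts; adjust if the scripts differ.
GPU_LOCK_PATH = "/tmp/lucebox_gpu.lock"


@contextlib.contextmanager
def gpu_lock(path=GPU_LOCK_PATH, blocking=True):
    """Hold an exclusive flock on `path` for the duration of the with-block.

    Hypothetical helper. With blocking=False it raises BlockingIOError when
    another holder already owns the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)
        yield fd
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

For the lock to protect the measurement, the `with gpu_lock():` block would need to span the whole server lifetime (launch, benchmark traffic, shutdown), not just the spawn call.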
+
+ for line in lines:
+ line = line.strip()
+ if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
P1: CUDA error detection is broken due to a case mismatch: line.lower() is checked against the mixed-case literal "CUDA error", so that branch can never match and CUDA errors may be missed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:
<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>
<file context>
@@ -0,0 +1,408 @@
+
+ for line in lines:
+ line = line.strip()
+ if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+ result["oom"] = True
+ if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>
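A minimal sketch of the corrected check, assuming the intent is a case-insensitive match for the two phrases and a case-sensitive match for the `OOM` token. The function name `line_signals_gpu_failure` is hypothetical.

```python
def line_signals_gpu_failure(line: str) -> bool:
    """Return True when a log line looks like an OOM or CUDA failure.

    Needles compared against the lowercased line are themselves lowercase,
    which is what the original "CUDA error" literal got wrong.
    """
    lowered = line.lower()
    return (
        "out of memory" in lowered
        or "OOM" in line  # kept case-sensitive so words like "room" do not match
        or "cuda error" in lowered
    )
```

Lowercasing the line once and keeping every lowercase needle next to it makes this kind of mismatch easier to spot in review.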
+ deadline = time.time() + timeout
+ while time.time() < deadline:
+ try:
+ result = subprocess.run(
P1: Request failures are silently ignored; send_request does not check result.returncode, and run_cell never validates the response before extracting metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:
<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>
<file context>
@@ -0,0 +1,408 @@
+ deadline = time.time() + timeout
+ while time.time() < deadline:
+ try:
+ result = subprocess.run(
+ ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+ capture_output=True, text=True, timeout=5
</file context>
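The checking pattern the comment asks for could look like the sketch below. The real `send_request` signature is not shown in this review, so `run_request` and `RequestFailed` are hypothetical names, and the assumption that the response body is JSON is mine.

```python
import json
import subprocess


class RequestFailed(RuntimeError):
    """Raised when the request command fails or returns unusable output."""


def run_request(cmd, timeout=600):
    """Run a request command (e.g. a curl argv list) and return parsed JSON.

    Hypothetical helper: only the returncode/validation pattern is illustrated.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RequestFailed(
            f"command exited with {result.returncode}: {result.stderr.strip()}"
        )
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RequestFailed(f"response is not valid JSON: {exc}") from exc
```

A caller such as `run_cell` could then catch `RequestFailed` and record the cell as failed instead of extracting metrics from an empty response.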
+ wall_s = parse_wall_s(parsed["server_done"])
+ prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+ gate_line = parsed["spec_gate"]
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
P2: is_ar classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py, line 283:
<comment>`is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</comment>
<file context>
@@ -0,0 +1,437 @@
+ wall_s = parse_wall_s(parsed["server_done"])
+ prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+ gate_line = parsed["spec_gate"]
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+
+ gate_floor_reason = "N/A"
</file context>
Suggested change:
- is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is not None
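The suggested condition distinguishes three cases rather than two. A small sketch, assuming `parsed` maps `"spec_decode"` and `"ar_decode"` to a log line or `None` as in the snippet under review; the function name and return labels are illustrative only.

```python
def classify_decode(parsed: dict) -> str:
    """Classify a run from its parsed telemetry lines.

    Illustrative labels: "spec", "ar_floor", "no_telemetry".
    """
    has_spec = parsed.get("spec_decode") is not None
    has_ar = parsed.get("ar_decode") is not None
    if has_spec:
        return "spec"
    if has_ar:
        return "ar_floor"  # AR telemetry present and no spec telemetry
    return "no_telemetry"  # nothing parsed: a log/parse problem, not AR floor
```

Reporting the third case separately keeps a missing log line from being counted as an AR floor event.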
+ action="append",
+ default=[],
+ metavar="ARG",
+ help="Extra arg to pass to BOTH server binaries (repeatable). "
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 358:
<comment>Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</comment>
<file context>
@@ -0,0 +1,452 @@
+ action="append",
+ default=[],
+ metavar="ARG",
+ help="Extra arg to pass to BOTH server binaries (repeatable). "
+ "E.g. --extra-server-arg --cache-type-k --extra-server-arg f16",
+ )
</file context>
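The parse failure comes from `argparse` treating a value that starts with `-` as another option. Joining the option and its value with `=` avoids it. The sketch below rebuilds only the one argument from the snippet; `build_parser` is a hypothetical wrapper and the help text shows one possible rewording.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--extra-server-arg",
        action="append",
        default=[],
        metavar="ARG",
        help="Extra arg to pass to BOTH server binaries (repeatable). "
             "E.g. --extra-server-arg=--cache-type-k --extra-server-arg=f16",
    )
    return parser


# The '=' form passes option-like values through unchanged.
args = build_parser().parse_args(
    ["--extra-server-arg=--cache-type-k", "--extra-server-arg=f16"]
)
print(args.extra_server_arg)  # ['--cache-type-k', 'f16']
```

The space-separated form from the original help text (`--extra-server-arg --cache-type-k`) exits with "expected one argument".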
+SEED = 42
+N_GEN = 128 # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 61:
<comment>Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</comment>
<file context>
@@ -0,0 +1,452 @@
+SEED = 42
+N_GEN = 128 # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens
+
+CTXSWEEP_DIR = os.path.dirname(os.path.abspath(__file__))
</file context>
+
+| Bench | Blog Target | This Run | Status |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5 | — | e9cb2790bb8ede64 | — |
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md, line 131:
<comment>Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</comment>
<file context>
@@ -0,0 +1,143 @@
+
+| Bench | Blog Target | This Run | Status |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5 | — | e9cb2790bb8ede64 | — |
+| HumanEval mean tok/s | 129.52 | **110.21** | FAIL -19.3 tok/s |
+| HumanEval mean AL | 8.31 | **11.04** | PASS +2.73 |
</file context>
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses GRAPHS=ON but the actual CMake flag and the rest of the plan use GGML_CUDA_GRAPHS=ON. This could cause implementers to invoke the wrong build toggle.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At thoughts/shared/plans/cuda_graph_replay_team_plan.md, line 20:
<comment>Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</comment>
<file context>
@@ -0,0 +1,32 @@
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
+Total ~970K tokens.
+
</file context>
7 issues found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>
<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">
<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>
<file name="bench/bitplane_lsh_experiment.py">
<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
</file>
+ print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.")
+ print("The feature mirror is NOT the cause (both arms use AR without draft).")
+ sys.exit(1)
+ if c0_self and c0_c1:
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/phase3_gate_intraproc.py, line 220:
<comment>Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</comment>
<file context>
@@ -0,0 +1,231 @@
+ print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.")
+ print("The feature mirror is NOT the cause (both arms use AR without draft).")
+ sys.exit(1)
+ if c0_self and c0_c1:
+ print(f"GATE: PASS (AR mode) — C0 self-consistent AND C1 identical to C0.")
+ print(f"Phase 3 KV+SSM seam is correct. Warm-prefill speedup: {p0:.3f}s -> {p1:.3f}s ({speedup:.1f}x)")
</file context>
@@ -0,0 +1,179 @@
+# Fast FlashAttention for very-low-bit (3-bit / ternary) KV cache — prior art
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 5:
<comment>External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</comment>
<file context>
@@ -0,0 +1,179 @@
+
+**Problem:** `tq3_0` KV in llama.cpp/ggml-cuda decodes ~2× slower than `q4_0`/`f16` because there is no fast tensor-core FlashAttention kernel for it. This document surveys how the community (llama.cpp maintainers, research literature, production engines) handles fast attention over sub-4-bit KV.
+
+Research date: 2026-06-22.
+
+---
</file context>
+ break
+
+ # Spearman rank correlation
+ from scipy.stats import spearmanr
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 335:
<comment>scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</comment>
<file context>
@@ -0,0 +1,392 @@
+ break
+
+ # Spearman rank correlation
+ from scipy.stats import spearmanr
+ rho_1bit, _ = spearmanr(s_true, s_1bit)
+ rho_2bit, _ = spearmanr(s_true, s_2bit)
</file context>
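The simplest fix is to move the `scipy` import to the top of the module so a missing package fails immediately. If late imports are kept for other reasons, a preflight check at the start of the run gives the same fail-fast behavior. The sketch below is an assumption about how that could look; `require_modules` is a hypothetical helper and expects top-level module names only.

```python
import importlib.util


def require_modules(*names: str) -> None:
    """Fail fast if any named top-level module cannot be found.

    Intended to be called before the long-running computation starts.
    """
    missing = [name for name in names if importlib.util.find_spec(name) is None]
    if missing:
        raise ImportError("missing required modules: " + ", ".join(missing))
```

Calling `require_modules("scipy")` as the first statement of the experiment's entry point would surface the missing dependency before any compute is spent. Declaring `scipy` in the project's dependency list is still needed separately.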
@@ -0,0 +1,277 @@
+# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 273:
<comment>External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</comment>
<file context>
@@ -0,0 +1,277 @@
+## Source URLs (all fetched 2026-06-22)
+
+- Repo: https://github.com/Indras-Mirror/llama.cpp-turboq-mtp
+- Kernel: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4.cuh
+- Launcher: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4-launch.cuh
+- Centroids/WHT: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/tbq4-cuda.cuh
</file context>
@@ -0,0 +1,277 @@
+# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 253:
<comment>MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</comment>
<file context>
@@ -0,0 +1,277 @@
+ or the visible commit list (see Caveats). The mechanism that *produces* that result — fused
+ dequant, no HBM FP16 KV — is confirmed in code.
+
+## License / attribution
+
+- **MIT** (llama.cpp upstream license; fork shows an MIT badge). Reusing the kernel is permitted
</file context>
+ if (n < expected) return false;
+
+ // Read ledger into a temp buffer before reset() clears state.
+ std::vector<uint8_t> ledger_was_res(nc, 1u); // default: treat as resident
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided nc before using it to allocate ledger/host buffers and resize chunks_. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/kvflash_pager.h, line 589:
<comment>deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</comment>
<file context>
@@ -515,42 +530,79 @@ class KvFlashPager {
+ if (n < expected) return false;
+
+ // Read ledger into a temp buffer before reset() clears state.
+ std::vector<uint8_t> ledger_was_res(nc, 1u); // default: treat as resident
+ std::vector<float> ledger_scores(nc, -std::numeric_limits<float>::infinity());
+ if (has_led) {
</file context>
+**Verdict: PARTIAL-REFUTES Momus.**
+
+1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
+But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md, line 6:
<comment>Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</comment>
<file context>
@@ -0,0 +1,100 @@
+**Verdict: PARTIAL-REFUTES Momus.**
+
+1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
+But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.
+2-bit (magnitude only, no sign) = worse than random at count-recall. 3-bit ≈ full (ρ=0.97).
+
</file context>
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check is not tied to the spawned server process, so the benchmark could run against an unrelated server on the same fixed port.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: `kill_server` sends SIGKILL without reaping the child; add `proc.wait()` to avoid zombie accumulation.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">
<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>
<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>
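One way the GPU lock could be acquired at startup, sketched with the standard `fcntl.flock` call. The lock path is the one named in the finding; the `gpu_lock` helper name and the context-manager shape are assumptions, not the convention the neighboring scripts necessarily use.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def gpu_lock(path="/tmp/lucebox_gpu.lock"):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        # Blocks until no other process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Unlocking an fd that never got the lock is harmless.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Wrapping the whole benchmark run in `with gpu_lock():` would serialize concurrent GPU runs on the same machine; the lock is advisory, so it only helps if every script that touches the GPU takes it.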
<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>
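A sketch of a process-bound health wait for the `wait_for_server()` finding above. It assumes the caller passes the `subprocess.Popen` handle it launched plus a `probe` callable that performs the actual HTTP health check; both parameter names are hypothetical.

```python
import subprocess
import sys
import time


def wait_for_server(proc, probe, timeout_s=60.0, interval_s=0.5):
    """Wait until probe() succeeds, failing fast if the launched process has exited."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            # Our server is gone, so a healthy response on the port would
            # have to come from some other process.
            raise RuntimeError(f"server exited early with code {rc}")
        if probe():
            return
        time.sleep(interval_s)
    raise TimeoutError("server did not become healthy in time")
```

Checking `proc.poll()` before trusting the probe covers the common case where the launch fails (for example, the port is already taken) while a stale server keeps answering. It does not prove that the responder is the launched process; that would need an identity check in the health response.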
<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>
<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">
<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>
<file name="bench/bitplane_lsh_experiment.py">
<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
</file>
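A small sketch of an up-front dependency check for the late scipy import above, using only the standard library. The helper name `require_modules` is hypothetical; it is meant for top-level module names only.

```python
import importlib.util


def require_modules(names):
    """Exit before any long-running work if a top-level module is not installed."""
    missing = [n for n in names if importlib.util.find_spec(n) is None]
    if missing:
        raise SystemExit("missing required modules: " + ", ".join(missing))
```

Calling `require_modules(["scipy"])` as the first statement of the experiment would surface the missing dependency in seconds rather than after all computation has finished. Declaring scipy as a project dependency is still the more durable fix.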
e39d636 to 81ee43a
…loop, kvflash_qk OOB guard, moe park UAF, atoi-UB, re-prefill fallback, pin_range half-open, gguf arr-type guards) Adapted for pr/kvflash-moe-prefill-snapshot (Luce-Org#430): seeded_scores_ptr/n_chunks calls in score_chunks omitted (seeded ledger API not yet in this branch); park() UAF fix delegates to base class (moe_hybrid_logits_sg_ not cached persistently in this branch so no additional teardown needed).
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged KV in logical chunk order; header-validated against layout) and a factored for_each_segment() helper. serde uses synchronous get/set and adapts to the pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning (pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the ensure_free_block + reselect protections; empty by default (byte-identical non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard reject, pinning, deadlock guard, reset.
…r KVFlash Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction automatic in alloc_span); the restore residual delta routes through the same chunked path. Pooled snapshot save/restore serializes the pager into the prefix snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn 128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385) and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data. GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool sizes; cross-turn disk restore round-trips losslessly.
…gment
Three complexity cuts, no behavior change (GPU sink-recall gate + serde/placement unit tests green):
- merge restore residual's identical snap_pooled/else chunk loops into one (the else ct ternary already subsumes the pooled case)
- extract chunked_prefill() shared by generate_impl kvf_paged + restore residual
- inline single-caller for_each_segment template into serialize
net -25 lines (54 ins / 79 del).
…uard removal core 71371d8 removed the !layout_known_ short-circuit; cold_prefix_boundary now returns the last eligible boundary. Updates the stale ==0 expectation. CI: test_server_unit.cpp.
…loop, kvflash_qk OOB guard, moe park UAF, atoi-UB, re-prefill fallback, pin_range half-open, gguf arr-type guards) Adapted for pr/kvflash-moe-prefill-snapshot (Luce-Org#430): seeded_scores_ptr/n_chunks calls in score_chunks omitted (seeded ledger API not yet in this branch); park() UAF fix delegates to base class (moe_hybrid_logits_sg_ not cached persistently in this branch so no additional teardown needed).
7 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
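A sketch of carrying a log offset across repeats, relevant to the per-repeat offset finding above. It assumes the harness parses one append-only server log; the helper name `read_new_lines` is hypothetical.

```python
def read_new_lines(path, offset):
    """Return (lines appended since `offset`, new offset) for an append-only log."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Offsets are tracked in bytes, so decode only after measuring the length.
    text = data.decode("utf-8", errors="replace")
    return text.splitlines(), offset + len(data)
```

Each repeat would start from the offset returned by the previous one instead of zero, so metrics are parsed only from lines written during that repeat.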
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
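A sketch of the zero-preserving check for the `think_budget` finding above. The function name `inject_thinking` and the shape of the injected `thinking` payload are assumptions for illustration; the proxy's real request shape may differ.

```python
def inject_thinking(body, think_budget):
    """Return a copy of the request body with a thinking budget attached when one is set."""
    out = dict(body)
    # `is not None` keeps an explicit 0; a bare `if think_budget:` would drop it.
    if think_budget is not None:
        out["thinking"] = {"budget_tokens": think_budget}  # hypothetical payload shape
    return out
```

The same `is not None` test applies to the summary table finding in `run_isolation_2x2.py`, where a valid `0.0` TPS value is dropped by a truthiness check.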
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check is not tied to the spawned server process, so the benchmark could run against an unrelated server on the same fixed port.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: `kill_server` sends SIGKILL without reaping the child; add `proc.wait()` to avoid zombie accumulation.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
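A sketch of a kill-and-reap helper for the `kill_server` finding above, assuming the server is held as a `subprocess.Popen` handle. The `reap_timeout_s` parameter is hypothetical.

```python
import subprocess
import sys


def kill_server(proc, reap_timeout_s=10.0):
    """SIGKILL the server if it is still running, then reap it."""
    if proc.poll() is None:
        proc.kill()
    # wait() collects the exit status so the child does not linger as a zombie.
    return proc.wait(timeout=reap_timeout_s)
```

`Popen.wait` raises `subprocess.TimeoutExpired` if the child has not exited within the timeout, which is worth surfacing rather than swallowing in a benchmark loop.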
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:129">
P2: `read_draft_capture_config` should validate `capture_ids` and `max_ids` before writing to the caller-supplied buffer; otherwise a null/negative input can cause invalid memory writes.</violation>
</file>
<file name="server/src/common/kvflash_qk.h">
<violation number="1" location="server/src/common/kvflash_qk.h:53">
P2: Default `seeded_n = -1` creates an out-of-bounds read risk when a caller passes a `seeded` buffer shorter than `n_chunks` without setting `seeded_n`.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1080">
P2: Complex dummy-slot normalization logic is duplicated between the cached fast path and the inline rebuild path in the same function. This increases maintenance risk: a future bug fix or behavioral tweak to one loop can be missed in the other, producing path-dependent behavior for the same routing inputs. Extract a shared helper and call it from both paths.</violation>
</file>
float missing_score = -2.0f,
const float * seeded = nullptr,
float seeded_sentinel = -std::numeric_limits<float>::infinity(),
int seeded_n = -1) {
P2: Default seeded_n = -1 creates an out-of-bounds read risk when a caller passes a seeded buffer shorter than n_chunks without setting seeded_n.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/kvflash_qk.h, line 53:
<comment>Default `seeded_n = -1` creates an out-of-bounds read risk when a caller passes a `seeded` buffer shorter than `n_chunks` without setting `seeded_n`.</comment>
<file context>
@@ -47,7 +47,10 @@ inline void kvflash_qk_chunk_scores(
+ float missing_score = -2.0f,
+ const float * seeded = nullptr,
+ float seeded_sentinel = -std::numeric_limits<float>::infinity(),
+ int seeded_n = -1) {
const int group = d.n_q_heads / d.n_kv_heads;
const int n_chunks = (int)pooled_keys.size();
</file context>
// [0, n_hot_init) is already taken by another slot we break and
// keep `next` as-is (duplicate), which is safe — the zero-weight
// slot is ignored by ids_to_sorted_host anyway.
int tries = 0;
P2: Complex dummy-slot normalization logic is duplicated between the cached fast path and the inline rebuild path in the same function. This increases maintenance risk: a future bug fix or behavioral tweak to one loop can be missed in the other, producing path-dependent behavior for the same routing inputs. Extract a shared helper and call it from both paths.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.cpp, line 1080:
<comment>Complex dummy-slot normalization logic is duplicated between the cached fast path and the inline rebuild path in the same function. This increases maintenance risk: a future bug fix or behavioral tweak to one loop can be missed in the other, producing path-dependent behavior for the same routing inputs. Extract a shared helper and call it from both paths.</comment>
<file context>
@@ -1073,8 +1073,16 @@ static bool eval_moe_hybrid_ffn_batched_core(
+ // [0, n_hot_init) is already taken by another slot we break and
+ // keep `next` as-is (duplicate), which is safe — the zero-weight
+ // slot is ignored by ids_to_sorted_host anyway.
+ int tries = 0;
+ while (tries < n_hot_init &&
+ [&]{ for (int k=0; k<n_used; ++k) if (k!=s && hot_sel[base+k]==next) return true; return false; }()) {
</file context>
int & n_capture,
int * capture_ids,
int max_ids) {
n_capture = 0;
P2: read_draft_capture_config should validate capture_ids and max_ids before writing to the caller-supplied buffer; otherwise a null/negative input can cause invalid memory writes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 129:
<comment>`read_draft_capture_config` should validate `capture_ids` and `max_ids` before writing to the caller-supplied buffer; otherwise a null/negative input can cause invalid memory writes.</comment>
<file context>
@@ -117,6 +117,76 @@ int count_swa_layers(const DraftWeights & w) {
+ int & n_capture,
+ int * capture_ids,
+ int max_ids) {
+ n_capture = 0;
+ gguf_init_params gip{};
+ gip.no_alloc = true;
</file context>
Suggested change
- n_capture = 0;
+ n_capture = 0;
+ if (!capture_ids || max_ids <= 0) return false;
bd8bc7a to bffe297
What
Pooled chunked prefill for qwen35moe (Qwen3.6-35B-A3B) over KVFlash: when the
prompt exceeds the resident pool, prefill loops hybrid_forward_batch over
chunk-sized slices with live eviction instead of refusing. Plus pooled
snapshot/restore (save/restore the bounded pool across requests) and a
complexity-only refactor (dedup the two identical restore chunk loops, extract
chunked_prefill, inline a single-caller helper; net −25 LOC, behaviour-identical).
Stacking
This is the tip of the KVFlash-MoE stack and depends on:
Until those merge this PR's diff includes their commits; rebasing after they land
leaves only the prefill-snapshot + refactor commits.
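For readers new to the design, the pooled chunked prefill described under What can be pictured with a toy model. This is only an illustration under loud assumptions: it tracks chunk ids rather than KV data, it evicts the oldest unprotected chunk (the real pager's victim selection is not modelled here), and the function name and parameters are hypothetical.

```python
def toy_chunked_prefill(n_tokens, chunk, pool_chunks, protected=1):
    """Toy model: return the chunk ids left resident after prefilling n_tokens."""
    if pool_chunks <= protected:
        raise ValueError("pool must have room for at least one unprotected chunk")
    resident = []                    # chunk ids, oldest first
    n_chunks = -(-n_tokens // chunk)  # ceiling division
    for cid in range(n_chunks):
        if len(resident) == pool_chunks:
            # Pool is full: evict the oldest chunk that is not protected.
            # Chunk ids below `protected` model the pinned sink chunks.
            victim = next(c for c in resident if c >= protected)
            resident.remove(victim)
        resident.append(cid)         # stands in for one forward pass over the slice
    return resident
```

With ten chunks and a three-chunk pool, `toy_chunked_prefill(40, 4, 3)` leaves `[0, 8, 9]` resident: the protected first chunk survives while the middle is evicted, which is the situation the sink-recall gate under Tests exercises.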
Tests
test_kvflash_moe_paged.sh: GPU silent-corruption gate. A sink fact in the first
(protected) chunk is recalled after the middle is evicted, and the greedy
(temp-0) answer is identical across two pool sizes. Green on RTX 3090 / Q3_K_M.