Conversation
"max_tokens" is deprecated in favor of "max_completion_tokens", which sets the upper bound for reasoning+output tokens. Closes: #13700
* model : Update label for LFM2-24B-A2B

  ```
  ❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0
  | model                          |       size |     params | backend    | threads |             test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | ---------------: | -------------------: |
  | lfm2moe 24B.A2B Q4_0           |  12.54 GiB |    23.84 B | CPU        |      10 |              pp1 |         30.35 ± 2.49 |
  | lfm2moe 8B.A1B Q4_0            |   4.41 GiB |     8.34 B | CPU        |      10 |              pp1 |         49.24 ± 1.93 |
  ```

* Remove extra line
* gguf : prevent integer overflow for ggml_context mem size
* ggml : fix int overflows in ggml_new_object()
* gguf : prevent string exhaustion
* gguf : prevent array elements exhaustion
* ggml : fix negative tensor type oob
* py : assert that alignment is non-zero power of 2
* ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object
* gguf-py : error on duplicate keys when reading
* py : restore tensor_fields
* enforce proper alignment in add_custom_alignment
* gguf : better name
* gguf : fix ctx size for no_alloc == true
* gguf : minor print fix
* ggml : print values when overflow
* ggml : remove deprecated ggml_type_sizef()
* ggml : relax ggml_type asserts to debug-only
* gguf : add mem_size overflow test
* gguf : add file size check for arrays
* ggml : relax asserts for ggml_get_type_traits()
* flake8 fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…outer mode (#19854)

* server: fix query params lost when proxying requests in multi-model router mode
* server: re-encode query params using httplib::encode_query_component in proxy
This patch adds consistency checks in the ggml-virtgpu-backend (running on the host side) to ensure that the data received from the guest is consistent (valid pointers, valid sizes and offsets).
…thods

```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```

These three methods are optional in the GGML interface. `get_alloc_size` was already properly defaulted, but backend `synchronize` and buft `get_max_size` would have segfaulted the backend if not implemented.
…lrenderer APIR (still unmerged)
… in a host/guest shared file
…64 GB with libkrun
…iles

Compilation breaks when ordered alphabetically.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Walkthrough

This pull request introduces per-device backend instance management with explicit device handles and backend IDs for the VirtGPU APIR system; adds comprehensive input validation and overflow protection across GGML core, GGUF, and buffer operations; adds a new LLM_TYPE_24B_A2B model type; updates model implementations with explicit graph expansions; and enhances server HTTP handling with query string support.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Frontend App
    participant BackendInit as Backend Init Flow
    participant DeviceMgr as Device Manager
    participant BackendInst as Backend Instance
    participant DispatchAPI as APIR Dispatch

    Client->>BackendInit: apir_backend_initialize(reg_fn)
    BackendInit->>DeviceMgr: ensure_device_extension(dev)
    DeviceMgr->>DeviceMgr: check device registry
    DeviceMgr->>BackendInst: create_backend_instance(dev)
    BackendInst->>BackendInst: allocate & validate magic
    BackendInit->>DispatchAPI: backend_dispatch_initialize()
    DispatchAPI->>BackendInst: initialize backend via device iface
    BackendInit->>Client: return device_handle + backend_id

    Note over Client,BackendInst: Later: Graph Compute
    Client->>DispatchAPI: apir_backend_graph_compute(device_handle, backend_id, cgraph)
    DispatchAPI->>DeviceMgr: get_backend_instance(device_handle, backend_id)
    DeviceMgr->>BackendInst: retrieve instance & validate
    BackendInst->>DispatchAPI: execute graph_compute
    DispatchAPI->>Client: return status

    Note over Client,BackendInst: Cleanup
    Client->>DispatchAPI: apir_backend_cleanup(device_handle, backend_id)
    DispatchAPI->>DeviceMgr: cleanup_backend_instance()
    DeviceMgr->>BackendInst: teardown & dealloc
    DispatchAPI->>Client: cleanup complete
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
/test topsail remoting_mac
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp (1)
47-71: ⚠️ Potential issue | 🟠 Major

Add null/fatal checks after `apir_decode_ggml_tensor` calls in three functions.

`apir_decode_ggml_tensor` can return NULL if the RPC tensor data is malformed (early return when `apir_decode_apir_rpc_tensor_inplace` fails). Currently, no NULL check is performed before passing the decoded tensor to buffer interface functions, allowing a NULL or invalid pointer to be dereferenced with potential host-side crash or memory corruption from a malicious guest.

This pattern is present in three functions:

- `backend_buffer_set_tensor` (line 49): tensor used at line 71 without check
- `backend_buffer_get_tensor` (line 90): tensor used at line 111 without check
- `backend_buffer_cpy_tensor` (lines 129–130): src and dst tensors used at line 132 without check

Other functions like `backend_buffer_clear` correctly validate decoded handles with `if (!buffer || apir_decoder_get_fatal(dec))` after decoding. Apply the same pattern after tensor decodes.

Proposed fix

```diff
 tensor = (ggml_tensor *) (uintptr_t) apir_decode_ggml_tensor(dec);
+if (!tensor || apir_decoder_get_fatal(dec)) {
+    GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Invalid tensor handle from guest\n", __func__);
+    return 1;
+}
+
 uint32_t shmem_res_id;
```

Apply analogously after each `apir_decode_ggml_tensor` call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp` around lines 47 - 71, After calling apir_decode_ggml_tensor in each of backend_buffer_set_tensor, backend_buffer_get_tensor, and backend_buffer_cpy_tensor, add a null/fatal check and return an error if decoding failed: verify the decoded pointer (e.g., tensor, src, dst) is non-NULL and that apir_decoder_get_fatal(dec) is false before using it; if the check fails, log/return the same error path used elsewhere (return 1) to avoid passing NULL into buffer->iface.set_tensor, buffer->iface.get_tensor, or buffer->iface.cpy_tensor. Ensure checks are added immediately after each apir_decode_ggml_tensor call and use the same pattern as the existing buffer/decoder validation (if (!tensor || apir_decoder_get_fatal(dec)) { return 1; }).

ggml/src/ggml-virtgpu/virtgpu.cpp (1)
124-146: ⚠️ Potential issue | 🔴 Critical

Bug: Duplicate condition — `APIR_LOAD_LIBRARY_ENV_VAR_MISSING` is checked twice; second branch is dead code.

Line 125 checks `ret == APIR_LOAD_LIBRARY_ENV_VAR_MISSING`, so the `else if` at line 136 with the same condition can never be reached. Based on the error message at line 138 ("some symbols are missing"), this was likely meant to be `APIR_LOAD_LIBRARY_SYMBOL_MISSING`.

🐛 Proposed fix

```diff
-    } else if (ret == APIR_LOAD_LIBRARY_ENV_VAR_MISSING) {
+    } else if (ret == APIR_LOAD_LIBRARY_SYMBOL_MISSING) {
         GGML_ABORT(GGML_VIRTGPU "%s: could not load the backend library, some symbols are missing. "
                    "Make sure virglrenderer is correctly configured by the hypervisor. (%s) ",
                    __func__, apir_load_library_error(ret));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/virtgpu.cpp` around lines 124 - 146, The duplicate conditional checks for APIR_LOAD_LIBRARY_ENV_VAR_MISSING cause the "some symbols are missing" branch to be unreachable; in the virtgpu.cpp block that examines ret (the APIR load result) replace the second else if condition currently using APIR_LOAD_LIBRARY_ENV_VAR_MISSING with the correct constant APIR_LOAD_LIBRARY_SYMBOL_MISSING so the branch that calls GGML_ABORT(... "some symbols are missing" ..., apir_load_library_error(ret)) is reachable; ensure you reference the same apir_load_library_error(ret), __func__, and GGML_ABORT usage as the other branches.
🧹 Nitpick comments (17)
gguf-py/gguf/gguf_reader.py (1)
208-212: Move the fallback comments above the `raise` to avoid dead-code confusion.

Lines 211-212 are comment-only, but they appear in the dead-code region after an unconditional `raise`. Readers must mentally discount the `raise` to understand what the comments mean. Moving them (and the `# TODO`) directly above the `raise` makes the intent clearer without any behavior change.

♻️ Suggested restructuring

```diff
 if field.name in self.fields:
+    # TODO: add option to make this a warning and accept duplicate keys:
+    # logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
+    # self.fields[field.name + '_{}'.format(field.offset)] = field
     raise KeyError(f'Duplicate {field.name} already in list at offset {field.offset}')
-
-# TODO: add option to make this a warning and accept duplicate keys like below
-# logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
-# self.fields[field.name + '_{}'.format(field.offset)] = field
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gguf-py/gguf/gguf_reader.py` around lines 208 - 212, Move the fallback-comment lines and the TODO so they appear immediately before the unconditional raise that raises KeyError; specifically, relocate the lines "# TODO: add option to make this a warning and accept duplicate keys like below" and the two commented fallback alternatives ("# logger.warning(...)" and "# self.fields[...]= field") to just above the raise statement that references field.name and field.offset (the duplicate-key handling in gguf_reader.py where the KeyError is raised), leaving behavior unchanged but avoiding dead-code confusion.

src/models/qwen35moe.cpp (1)
32-33: Same graph-split anchor — same optional comment suggestion applies.

See the note in `qwen35.cpp`. Identical change and identical rationale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/qwen35moe.cpp` around lines 32 - 33, The ggml_build_forward_expand(gf, cur); call in qwen35moe.cpp needs the same graph-split anchor/comment adjustment applied in qwen35.cpp: update the line with the identical optional comment or anchor marker used in qwen35.cpp so the graph-splitting logic is clear and consistent (locate the ggml_build_forward_expand call and mirror the comment/anchor change from qwen35.cpp).

src/models/kimi-linear.cpp (1)
119-120: Same graph-split anchor as the other models — consider adding the explanatory comment.

Identical pattern to the change in `qwen35.cpp`. `cur` is already reachable via the main terminal expand at line 391; the pre-anchor is meaningful only for VirtGPU graph-split scenarios. A brief comment here (as suggested for the other files) would keep intent clear.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/kimi-linear.cpp` around lines 119 - 120, Add a brief explanatory comment above the call to ggml_build_forward_expand(gf, cur) clarifying that the pre-anchor is redundant for normal execution because cur is already reachable via the main terminal expand, but is retained intentionally to support VirtGPU graph-split scenarios; reference ggml_build_forward_expand and cur in the comment so future readers understand the intent and parity with the change made in qwen35.cpp.

src/models/qwen3next.cpp (1)
24-25: Same graph-split anchor — same optional comment suggestion applies.

Consistent with the other three models. Consider adding a brief comment explaining the VirtGPU graph-split rationale, as this call is otherwise non-obvious to future readers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/qwen3next.cpp` around lines 24 - 25, Add a brief inline comment above the ggml_build_forward_expand(gf, cur) call in qwen3next.cpp explaining that this is the VirtGPU "graph-split" anchor (same rationale used in the other three models): state that this call intentionally marks the forward-graph split point for VirtGPU execution, why it’s needed (e.g., separating CPU/GPU stages or enabling virtualized GPU graph replay), and that it must remain to preserve graph partitioning behavior; reference the ggml_build_forward_expand function name so reviewers can locate and align the wording with the comments in the other model files.

src/models/qwen35.cpp (1)
32-33: `ggml_build_forward_expand` on `attn_norm` — verify intent for all execution paths.

`cur` (attn_norm) is always consumed by either `build_layer_attn_linear` or `build_layer_attn` immediately below, so it would be reachable from the terminal `ggml_build_forward_expand(gf, cur)` at line 84 via normal backwards traversal. The only scenario where this early anchor is required is when the compute graph is split across backends (e.g., VirtGPU guest/host boundary) and the split point falls between the norm and the downstream ops.

If that is the intent, a comment explaining why this pre-anchor is needed would help future readers distinguish it from an accidental duplicate. No functional concern.

```diff
+    // Anchor attn_norm in the graph before the recurrent/attn branch split so that
+    // backend graph-split (VirtGPU) sees it as a completed node on this side.
     ggml_build_forward_expand(gf, cur);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/models/qwen35.cpp` around lines 32 - 33, The call to ggml_build_forward_expand(gf, cur) on attn_norm appears duplicate because cur is immediately consumed by build_layer_attn_linear or build_layer_attn; either document intent for cross-backend split points or remove the early anchor. Update the code around ggml_build_forward_expand(gf, cur) to either (a) add a concise comment explaining this pre-anchor is intentional to ensure the compute graph root is preserved when the graph may be split across backends (e.g., VirtGPU guest/host boundary), referencing attn_norm and the downstream build_layer_attn_linear/build_layer_attn calls, or (b) if no cross-backend split is intended, remove the early ggml_build_forward_expand(gf, cur) to avoid the apparent duplicate and rely on the terminal ggml_build_forward_expand(gf, cur) reachable via normal traversal.

tests/test-gguf.cpp (3)
424-426: Magic number `32` should have a comment.

The value is non-obvious without context — any value from 1 to ~`GGUF_DEFAULT_ALIGNMENT` would suffice. A one-liner clarifying that it deliberately writes fewer bytes than any single tensor requires would make the intent clear:

💬 Suggested annotation

```diff
+// Write far fewer bytes than the declared tensor data requires,
+// so the parser detects the overflow before attempting any allocation.
 if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
     nbytes = 32;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test-gguf.cpp` around lines 424 - 426, Replace the unexplained magic literal 32 in the HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch by adding a one-line comment next to the assignment to nbytes (or replace with a named constant) clarifying that this value is intentionally chosen to be smaller than the minimum bytes any single tensor would require (e.g., any value in 1..GGUF_DEFAULT_ALIGNMENT would work) so the test deliberately writes fewer bytes than a tensor; reference the symbols hft, HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, nbytes and GGUF_DEFAULT_ALIGNMENT in the comment for clarity.
410-410: The `(uint64_t)` cast on `alignment` is a necessary correctness fix — approve.

`GGML_PAD` expands to `(((x) + (n) - 1) & ~((n) - 1))`. When `x` is `size_t` (64-bit) and `n` is `uint32_t`, `~((n) - 1)` is evaluated as `uint32_t` and then zero-extended to 64 bits when AND-ed with the 64-bit `x`. This silently zeros the upper 32 bits of the result — for a row size like `0x7FFFFFFFFFFFFFC0` the computed pad would be completely wrong. Casting `alignment` to `uint64_t` forces the entire expression to 64-bit arithmetic and fixes the mask.

The same pattern without the cast appears at line 640 in `handcrafted_check_tensors`:

```cpp
expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);
```

That site is not exercised for the overflow case (the function is guarded by `expect_context_not_null`), so it's not currently broken, but it should be made consistent.

♻️ Apply the same cast at line 640 for consistency

```diff
-    expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);
+    expected_offset += GGML_PAD(ggml_row_size(type, ne), (uint64_t) alignment);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test-gguf.cpp` at line 410, The pad calculation incorrectly uses a 32-bit mask when alignment is a smaller integer; update the second occurrence to match the first by casting alignment to uint64_t so the GGML_PAD expression uses 64-bit arithmetic: change the call in handcrafted_check_tensors that updates expected_offset (expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);) to cast alignment to uint64_t, mirroring the fix applied where offset is updated; ensure you reference GGML_PAD, ggml_row_size, alignment, expected_offset, and handcrafted_check_tensors when making the change.
201-206: Consider adding a comment explaining the choice of `0x7FFFFFFFFFFFFFC0`.

The magic value is non-obvious. A brief comment noting it's the largest `int64_t` value already aligned to `GGUF_DEFAULT_ALIGNMENT` (64 bytes), so that two such tensors bring the combined data size to ~`0xFFFFFFFFFFFFFF80`, would aid future readers without misleading them.

💬 Suggested annotation

```diff
 if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
     tensor_configs.resize(2);
+    // 0x7FFFFFFFFFFFFFC0 == INT64_MAX rounded down to the nearest 64-byte boundary.
+    // Two such I8 tensors produce a combined data size of ~UINT64_MAX, triggering
+    // a memory-size overflow in the GGUF parser.
     tensor_configs[0] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
     tensor_configs[1] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test-gguf.cpp` around lines 201 - 206, Add a clarifying comment above the HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch explaining the magic constant 0x7FFFFFFFFFFFFFC0: state that it is the largest int64_t value already aligned to GGUF_DEFAULT_ALIGNMENT (64 bytes), and that creating two tensors with that size causes the combined data size to overflow (approx 0xFFFFFFFFFFFFFF80). Place this comment next to the tensor_configs assignments (symbols: hft, HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, tensor_configs, GGML_TYPE_I8, GGUF_DEFAULT_ALIGNMENT) so future readers understand why that specific value was chosen.

ggml/src/ggml-virtgpu/ggml-remoting.h (1)
39-40: Initialize new context fields defensively.

Consider default-initializing these fields to deterministic values to avoid accidental use of uninitialized state on partial-init paths.

💡 Proposed refactor

```diff
-    uintptr_t device_handle; // Device handle returned from backend initialization
-    uint32_t backend_id;     // Backend ID within the device
+    uintptr_t device_handle = 0; // Device handle returned from backend initialization
+    uint32_t backend_id = 0;     // Backend ID within the device
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/ggml-remoting.h` around lines 39 - 40, The struct fields device_handle and backend_id are not given deterministic defaults and can be read uninitialized on partial init; update the context type that declares device_handle and backend_id to explicitly initialize them (e.g., set device_handle = 0 or UINTPTR_MAX and backend_id = 0 or UINT32_MAX as your sentinel) and ensure the allocation/creation function (the context constructor/creator) assigns those defaults before any early-return paths so any partially-initialized context has known values; reference and update the symbols device_handle and backend_id in the struct definition and in the context creation/initialization function.

ggml/src/ggml-virtgpu/ggml-backend.cpp (1)
11-18: Consider nullifying `ctx->device_handle` and `ctx->backend_id` after cleanup.

After `apir_backend_cleanup` returns, the device context `ctx` still holds the now-stale `device_handle` and `backend_id` values. Since `ctx` is owned by the device (not the backend), it survives `delete backend`. A subsequent `device_init` on the same device would overwrite them, but defensive zeroing avoids accidental use of stale handles.

Suggested improvement

```diff
 apir_backend_cleanup(gpu, ctx->device_handle, ctx->backend_id);
+ctx->device_handle = 0;
+ctx->backend_id = 0;
+
 delete backend;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/ggml-backend.cpp` around lines 11 - 18, In ggml_backend_remoting_free, after calling apir_backend_cleanup(gpu, ctx->device_handle, ctx->backend_id) clear the stale handles on the surviving ggml_backend_remoting_device_context by setting ctx->device_handle and ctx->backend_id to their null/zero values so the device context (accessible via backend->context) cannot accidentally use stale values after delete backend; keep the cleanup call and then nullify those members defensively.

ggml/src/ggml-virtgpu/virtgpu.cpp (1)
500-502: Integer division truncates sub-millisecond precision in timeout calculation.

`(end_time - start_time) / 1000000` performs integer division (both operands are `long long`/`int`). The result assigned to `float duration_ms` will always be a whole number of milliseconds. For short durations this could report 0 ms when the actual elapsed time is e.g. 0.5 ms. This is unlikely to cause a real timeout bug since the comparison is `> max_wait_ms`, but it does mean sub-ms elapsed time is invisible.

Suggested fix

```diff
-    float duration_ms = (end_time - start_time) / 1000000;
+    float duration_ms = (float)(end_time - start_time) / 1000000.0f;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/virtgpu.cpp` around lines 500 - 502, The calculation for duration_ms uses integer division so sub-millisecond precision is lost; change the division to floating-point (e.g., divide by 1000000.0f or cast (end_time - start_time) to float/double) so duration_ms preserves fractional milliseconds — update the expression that computes duration_ms (using ts_end, end_time, start_time, duration_ms) to perform floating-point division.

ggml/src/ggml-virtgpu/ggml-backend-reg.cpp (1)
41-77: Consider extracting the duplicated name-prefixing pattern.

The "[virtgpu] " prefix logic at lines 41-56 and 62-77 is nearly identical. A small helper would reduce duplication:

Example helper

```cpp
static char * make_prefixed_name(const char * rmt_name) {
    size_t len = strlen(rmt_name) + 11; // "[virtgpu] " + null
    char * prefixed = (char *) malloc(len);
    if (!prefixed) {
        free((void *) rmt_name);
        return nullptr;
    }
    snprintf(prefixed, len, "[virtgpu] %s", rmt_name);
    return prefixed;
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/ggml-backend-reg.cpp` around lines 41 - 77, The code duplicates the "[virtgpu] " name-prefixing logic for apir_device_get_name and apir_buffer_type_get_name; extract this into a helper (e.g., static char * make_prefixed_name(const char *rmt_name)) that allocates the prefixed string, returns nullptr on allocation failure, and frees the original rmt_name on failure so callers (the blocks that set gpu->cached_device_info.name and gpu->cached_buffer_type.name) can simply call the helper and GGML_ABORT if it returns nullptr; replace both duplicated blocks with calls to make_prefixed_name and assign the results to gpu->cached_device_info.name and gpu->cached_buffer_type.name respectively.

ggml/src/ggml-virtgpu/backend/backend.cpp (1)
84-88: Dead code: `library_reg` can never be null here.

`library_reg` is assigned at line 67 with a fallback to `GGML_DEFAULT_BACKEND_REG` (a non-null string literal), so this branch is unreachable. Consider removing it to avoid misleading future readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/backend/backend.cpp` around lines 84 - 88, The branch checking for library_reg being null is dead code because library_reg is assigned earlier with a fallback to the non-null GGML_DEFAULT_BACKEND_REG; remove the unreachable if-block that logs GGML_VIRTGPU_BCK and returns APIR_LOAD_LIBRARY_ENV_VAR_MISSING. Update any surrounding logic in the function in backend.cpp (where library_reg is set and used) to rely on the guaranteed non-null value and delete the check referencing library_reg and APIR_LLAMA_CPP_GGML_LIBRARY_REG_ENV to avoid misleading future readers.

ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp (2)
58-82: Minor TOCTOU between `ensure_device_extension` and `get_device_extension`.

`ensure_device_extension` (line 59) acquires and releases `device_extensions_mutex`, then `get_device_extension` (line 60) acquires it again. If `cleanup_device_extension` is called from another thread between these two calls, the extension could be removed, causing `get_device_extension` to return `nullptr`.

In practice this would require concurrent create + cleanup on the same device, which is a caller misuse, but you could eliminate the gap by combining the ensure + get into a single lock scope, or by having `ensure_device_extension` return the pointer directly.

♻️ Proposed consolidation

```diff
-// Create new backend instance for device
-uintptr_t create_backend_instance(ggml_backend_dev_t device) {
-    ensure_device_extension(device);
-    apir_device_extension * ext = get_device_extension(device);
-    if (ext == nullptr) {
-        return 0; // Failed
-    }
+// Ensure device extension exists and return it (caller must not hold device_extensions_mutex)
+apir_device_extension * ensure_and_get_device_extension(ggml_backend_dev_t device) {
+    std::lock_guard<std::mutex> lock(device_extensions_mutex);
+    auto it = device_extensions.find(device);
+    if (it == device_extensions.end()) {
+        apir_device_extension * ext = new apir_device_extension();
+        ext->next_backend_id = 1;
+        ggml_backend_dev_props props;
+        device->iface.get_props(device, &props);
+        ext->async_backend = props.caps.async;
+        ext->magic = APIR_DEVICE_EXTENSION_MAGIC;
+        device_extensions[device] = ext;
+        return ext;
+    }
+    apir_device_extension * ext = it->second;
+    return (ext->magic == APIR_DEVICE_EXTENSION_MAGIC) ? ext : nullptr;
+}
+
+// Create new backend instance for device
+uintptr_t create_backend_instance(ggml_backend_dev_t device) {
+    apir_device_extension * ext = ensure_and_get_device_extension(device);
+    if (ext == nullptr) {
+        return 0; // Failed
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp` around lines 58 - 82, The TOCTOU arises because create_backend_instance calls ensure_device_extension() then get_device_extension(), allowing cleanup_device_extension to remove the extension in between; to fix this, change the flow so the extension pointer is obtained under a single lock: either modify ensure_device_extension to return the apir_device_extension* (or add a new ensure_and_get_device_extension(device) that returns the pointer) and use that pointer directly in create_backend_instance instead of calling get_device_extension(), or hold the device_extensions_mutex across both operations so the extension cannot be removed; update create_backend_instance to use the returned apir_device_extension* (or the single-locked getter) and remove the separate get_device_extension call.
157-182: Truncation of `backend_id` from `uintptr_t` to `uint32_t`.

Line 179 casts `backend_id` (a `uintptr_t`, which is 64-bit on LP64 platforms) to `uint32_t`. If `next_backend_id` ever exceeds `UINT32_MAX`, this silently truncates. Since IDs start at 1 and increment, this is practically unreachable, but an assert would make the assumption explicit.

🛡️ Optional: add an assertion

```diff
+    GGML_ASSERT(backend_id <= UINT32_MAX && "backend_id exceeds uint32_t range");
     *out_backend_id = (uint32_t) backend_id;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp` around lines 157 - 182, The function backend_dispatch_initialize casts a uintptr_t backend_id to uint32_t causing silent truncation; update backend_dispatch_initialize to validate the returned backend_id from create_backend_instance() before assigning to *out_backend_id: check backend_id != 0 and that backend_id <= UINT32_MAX (or use an assert/GGML_LOG_ERROR and return APIR_BACKEND_INITIALIZE_BACKEND_INIT_FAILED if it doesn't fit), referencing the create_backend_instance call and the backend_id/out_backend_id variables so the failure is detected and logged instead of silently truncating.

ggml/src/gguf.cpp (1)
327-347: `nbytes_remain()` — consider the performance impact on repeated calls.

This helper seeks to EOF and back on every invocation. It's called once per vector read and once per individual string read (including each element when reading a string array). For very large string arrays on slow/remote filesystems, the overhead of `N+1` seek-pairs could add up.

A minor optimization would be to cache the file size at the start of `gguf_init_from_file_impl` and compute remaining bytes from `file_size - gguf_ftell(file)`, avoiding the repeated seeks. Not blocking, but worth considering if large GGUF files become a performance concern.
Verify each finding against the current code and only fix it if needed. In `@ggml/src/gguf.cpp` around lines 327 - 347, The nbytes_remain() helper currently seeks to EOF and back on every call, which is expensive for many small reads; compute and cache the total file size once at the start of gguf_init_from_file_impl (e.g., call gguf_fseek/gguf_ftell once to get file_size) and change nbytes_remain() (or callers) to return file_size - gguf_ftell(file) using that cached file_size instead of performing repeated seeks; store the cached size in a field or local context accessible to nbytes_remain() and ensure initialization still validates/sets the cached size before any reads occur (referencing nbytes_remain and gguf_init_from_file_impl).

ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp (1)
125-128: Don’t silently ignore backend cleanup RPC failures.

At lines 125-127, `ret` is dropped. At minimum, log a failure so leaked backend instances are diagnosable.

Proposed improvement

```diff
     REMOTE_CALL(gpu, encoder, decoder, ret);
+    if (ret != 0) {
+        GGML_LOG_ERROR(GGML_VIRTGPU "%s: Backend cleanup failed with ret=%d\n", __func__, ret);
+    }
+
     remote_call_finish(gpu, encoder, decoder);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp` around lines 125 - 128, The RPC return value stored in ret after REMOTE_CALL(gpu, encoder, decoder, ret) is being ignored; modify the cleanup path around REMOTE_CALL and remote_call_finish in virtgpu-forward-backend.cpp to check ret for errors and at minimum log a descriptive error via your logging facility (include function/context like REMOTE_CALL(gpu, encoder, decoder, ret) and remote_call_finish(gpu, encoder, decoder) in the message) so backend cleanup RPC failures are recorded and diagnosable; ensure the code still calls remote_call_finish even on error and propagate or handle the error appropriately per project conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp`:
- Around line 143-154: The code currently decodes device_handle and backend_id
with apir_decode_uintptr_t and apir_decode_uint32_t and immediately calls
cleanup_backend_instance(device, backend_id); update this to validate the decode
results before use: capture and check the return/error codes from
apir_decode_uintptr_t(dec, &device_handle) and apir_decode_uint32_t(dec,
&backend_id), only cast to ggml_backend_dev_t and call
cleanup_backend_instance(device, backend_id) when both decodes succeed, and
otherwise avoid calling cleanup_backend_instance and handle/log the decode
failure appropriately.
- Around line 44-47: The code currently ignores shmem_res_id (GGML_UNUSED)
leaving cgraph_size untrusted and allowing potential OOB reads when decoding;
replace the placeholder by retrieving the shared-memory buffer length for the
given shmem_res_id (or use an API that returns the mapped size) and immediately
validate that cgraph_size <= that shared-memory length, returning an
error/failing the request if it is larger; until you add the new Virgl->APIR
callback, implement an immediate mitigation that rejects any payload where
cgraph_size exceeds a conservative max or the known mapped size to prevent OOB
access in the decoder that reads using cgraph_size.
- Around line 16-24: The decode of function_ptr via apir_decode_uintptr_t may
fail and leave function_ptr uninitialized before calling
backend_dispatch_initialize; update the flow in this block to initialize
function_ptr (e.g., to 0), capture and check the return value of
apir_decode_uintptr_t, and bail out or return an error (or log) if the decode
fails instead of continuing to cast and pass ggml_backend_reg_fct_p into
backend_dispatch_initialize; ensure the function uses the checked/initialized
ggml_backend_reg_fct_p only after a successful decode.
- Around line 111-117: The loop that checks node support uses an undefined
variable `dev`; replace uses of `dev` with the correct `device` variable: update
the call to the backend support check from dev->iface.supports_op(dev, op) to
device->iface.supports_op(device, op) so the check uses the declared `device`
object (the loop iterating over cgraph->n_nodes and calling
ggml_graph_node/ggml_op_desc should remain unchanged).
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.h`:
- Around line 30-33: The comment on struct apir_backend_instance is wrong:
update the inline magic comment for apir_backend_instance to match the canonical
constant APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA) or vice-versa; specifically,
locate struct apir_backend_instance and the macro APIR_BACKEND_INSTANCE_MAGIC
and make the descriptive comment use the same hex value as the macro to avoid
mismatch (ensure only the comment is changed unless you intend to change the
macro value).
In `@ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h`:
- Around line 109-110: The global backend_buffers set is accessed concurrently
without protection; introduce a std::mutex (e.g., backend_buffers_lock) and use
scoped locking (std::lock_guard<std::mutex> or std::unique_lock<std::mutex>)
around every access to backend_buffers including the validation check that uses
backend_buffers.find(buffer) and the mutating calls in apir_track_backend_buffer
(insert) and apir_untrack_backend_buffer (erase) so all reads/writes are
serialized and thread-safe.
In `@ggml/src/ggml-virtgpu/ggml-backend-device.cpp`:
- Around line 76-77: The code currently uses a process-wide static cache (static
std::atomic<bool> initialized and static ggml_backend_buffer_type buft) which
causes buft to hold the first device's device/context for all subsequent
devices; change the cache to be device-scoped by moving ggml_backend_buffer_type
buft (and the per-device initialized flag) into the per-device context struct
ggml_backend_remoting_device_context and initialize buft once per device when
accessing it (replace the function-local static buft/initialized usage with
reads/writes to the fields on ggml_backend_remoting_device_context so each
device gets its own buffer type initialized with the correct device/context).
In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp`:
- Around line 31-36: The code currently treats ret == 0 as success before
verifying that decoding succeeded; after calling apir_decode_uintptr_t(decoder,
&device_handle) and apir_decode_uint32_t(decoder, &backend_id) you must check
the decoder's fatal/error state and only consider initialization successful if
decoding produced no errors. Update the block that currently checks ret to call
the decoder error/fatal-check API (e.g., apir_decoder_fatal(decoder) or
equivalent) after the two apir_decode_* calls, set ret to a non-zero error code
and log an error if decoding failed, and only log and return success when the
decoder indicates no fatal error (use the same symbols: decoder, device_handle,
backend_id, apir_decode_uintptr_t, apir_decode_uint32_t).
In `@ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp`:
- Line 176: The format string in the GGML_ABORT call that uses GGML_VIRTGPU and
__func__ incorrectly prints the size_t variable size with "%ld"; update the
format specifier to "%zu" so the call GGML_ABORT(GGML_VIRTGPU "%s: Couldn't
allocate %zub of guest-host shared buffer", __func__, size) correctly and
portably formats the size_t value (no cast needed).
In `@ggml/src/ggml.c`:
- Around line 902-903: Replace the standard C assert() calls in the type
accessor functions with GGML_ASSERT() so checks remain active in Release builds:
change assertions that validate "type >= 0" and "type < GGML_TYPE_COUNT" (and
any similar checks around type_traits access) to GGML_ASSERT(type >= 0 && type <
GGML_TYPE_COUNT). Also update ggml_type_name() to perform a runtime bounds check
against GGML_TYPE_COUNT before indexing type_traits and return a safe fallback
string (e.g., "unknown") when the check fails to avoid out-of-bounds reads.
---
Outside diff comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp`:
- Around line 47-71: After calling apir_decode_ggml_tensor in each of
backend_buffer_set_tensor, backend_buffer_get_tensor, and
backend_buffer_cpy_tensor, add a null/fatal check and return an error if
decoding failed: verify the decoded pointer (e.g., tensor, src, dst) is non-NULL
and that apir_decoder_get_fatal(dec) is false before using it; if the check
fails, log/return the same error path used elsewhere (return 1) to avoid passing
NULL into buffer->iface.set_tensor, buffer->iface.get_tensor, or
buffer->iface.cpy_tensor. Ensure checks are added immediately after each
apir_decode_ggml_tensor call and use the same pattern as the existing
buffer/decoder validation (if (!tensor || apir_decoder_get_fatal(dec)) { return
1; }).
In `@ggml/src/ggml-virtgpu/virtgpu.cpp`:
- Around line 124-146: The duplicate conditional checks for
APIR_LOAD_LIBRARY_ENV_VAR_MISSING cause the "some symbols are missing" branch to
be unreachable; in the virtgpu.cpp block that examines ret (the APIR load
result) replace the second else if condition currently using
APIR_LOAD_LIBRARY_ENV_VAR_MISSING with the correct constant
APIR_LOAD_LIBRARY_SYMBOL_MISSING so the branch that calls GGML_ABORT(... "some
symbols are missing" ..., apir_load_library_error(ret)) is reachable; ensure you
reference the same apir_load_library_error(ret), __func__, and GGML_ABORT usage
as the other branches.
---
Nitpick comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp`:
- Around line 58-82: The TOCTOU arises because create_backend_instance calls
ensure_device_extension() then get_device_extension(), allowing
cleanup_device_extension to remove the extension in between; to fix this, change
the flow so the extension pointer is obtained under a single lock: either modify
ensure_device_extension to return the apir_device_extension* (or add a new
ensure_and_get_device_extension(device) that returns the pointer) and use that
pointer directly in create_backend_instance instead of calling
get_device_extension(), or hold the device_extensions_mutex across both
operations so the extension cannot be removed; update create_backend_instance to
use the returned apir_device_extension* (or the single-locked getter) and remove
the separate get_device_extension call.
- Around line 157-182: The function backend_dispatch_initialize casts a
uintptr_t backend_id to uint32_t causing silent truncation; update
backend_dispatch_initialize to validate the returned backend_id from
create_backend_instance() before assigning to *out_backend_id: check backend_id
!= 0 and that backend_id <= UINT32_MAX (or use an assert/GGML_LOG_ERROR and
return APIR_BACKEND_INITIALIZE_BACKEND_INIT_FAILED if it doesn't fit),
referencing the create_backend_instance call and the backend_id/out_backend_id
variables so the failure is detected and logged instead of silently truncating.
In `@ggml/src/ggml-virtgpu/backend/backend.cpp`:
- Around line 84-88: The branch checking for library_reg being null is dead code
because library_reg is assigned earlier with a fallback to the non-null
GGML_DEFAULT_BACKEND_REG; remove the unreachable if-block that logs
GGML_VIRTGPU_BCK and returns APIR_LOAD_LIBRARY_ENV_VAR_MISSING. Update any
surrounding logic in the function in backend.cpp (where library_reg is set and
used) to rely on the guaranteed non-null value and delete the check referencing
library_reg and APIR_LLAMA_CPP_GGML_LIBRARY_REG_ENV to avoid misleading future
readers.
In `@ggml/src/ggml-virtgpu/ggml-backend-reg.cpp`:
- Around line 41-77: The code duplicates the "[virtgpu] " name-prefixing logic
for apir_device_get_name and apir_buffer_type_get_name; extract this into a
helper (e.g., static char * make_prefixed_name(const char *rmt_name)) that
allocates the prefixed string, returns nullptr on allocation failure, and frees
the original rmt_name on failure so callers (the blocks that set
gpu->cached_device_info.name and gpu->cached_buffer_type.name) can simply call
the helper and GGML_ABORT if it returns nullptr; replace both duplicated blocks
with calls to make_prefixed_name and assign the results to
gpu->cached_device_info.name and gpu->cached_buffer_type.name respectively.
In `@ggml/src/ggml-virtgpu/ggml-backend.cpp`:
- Around line 11-18: In ggml_backend_remoting_free, after calling
apir_backend_cleanup(gpu, ctx->device_handle, ctx->backend_id) clear the stale
handles on the surviving ggml_backend_remoting_device_context by setting
ctx->device_handle and ctx->backend_id to their null/zero values so the device
context (accessible via backend->context) cannot accidentally use stale values
after delete backend; keep the cleanup call and then nullify those members
defensively.
In `@ggml/src/ggml-virtgpu/ggml-remoting.h`:
- Around line 39-40: The struct fields device_handle and backend_id are not
given deterministic defaults and can be read uninitialized on partial init;
update the context type that declares device_handle and backend_id to explicitly
initialize them (e.g., set device_handle = 0 or UINTPTR_MAX and backend_id = 0
or UINT32_MAX as your sentinel) and ensure the allocation/creation function (the
context constructor/creator) assigns those defaults before any early-return
paths so any partially-initialized context has known values; reference and
update the symbols device_handle and backend_id in the struct definition and in
the context creation/initialization function.
In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp`:
- Around line 125-128: The RPC return value stored in ret after REMOTE_CALL(gpu,
encoder, decoder, ret) is being ignored; modify the cleanup path around
REMOTE_CALL and remote_call_finish in virtgpu-forward-backend.cpp to check ret
for errors and at minimum log a descriptive error via your logging facility
(include function/context like REMOTE_CALL(gpu, encoder, decoder, ret) and
remote_call_finish(gpu, encoder, decoder) in the message) so backend cleanup RPC
failures are recorded and diagnosable; ensure the code still calls
remote_call_finish even on error and propagate or handle the error appropriately
per project conventions.
In `@ggml/src/ggml-virtgpu/virtgpu.cpp`:
- Around line 500-502: The calculation for duration_ms uses integer division so
sub-millisecond precision is lost; change the division to floating-point (e.g.,
divide by 1000000.0f or cast (end_time - start_time) to float/double) so
duration_ms preserves fractional milliseconds — update the expression that
computes duration_ms (using ts_end, end_time, start_time, duration_ms) to
perform floating-point division.
In `@ggml/src/gguf.cpp`:
- Around line 327-347: The nbytes_remain() helper currently seeks to EOF and
back on every call, which is expensive for many small reads; compute and cache
the total file size once at the start of gguf_init_from_file_impl (e.g., call
gguf_fseek/gguf_ftell once to get file_size) and change nbytes_remain() (or
callers) to return file_size - gguf_ftell(file) using that cached file_size
instead of performing repeated seeks; store the cached size in a field or local
context accessible to nbytes_remain() and ensure initialization still
validates/sets the cached size before any reads occur (referencing nbytes_remain
and gguf_init_from_file_impl).
In `@gguf-py/gguf/gguf_reader.py`:
- Around line 208-212: Move the fallback-comment lines and the TODO so they
appear immediately before the unconditional raise that raises KeyError;
specifically, relocate the lines "# TODO: add option to make this a warning and
accept duplicate keys like below" and the two commented fallback alternatives
("# logger.warning(...)" and "# self.fields[...]= field") to just above the
raise statement that references field.name and field.offset (the duplicate-key
handling in gguf_reader.py where the KeyError is raised), leaving behavior
unchanged but avoiding dead-code confusion.
In `@src/models/kimi-linear.cpp`:
- Around line 119-120: Add a brief explanatory comment above the call to
ggml_build_forward_expand(gf, cur) clarifying that the pre-anchor is redundant
for normal execution because cur is already reachable via the main terminal
expand, but is retained intentionally to support VirtGPU graph-split scenarios;
reference ggml_build_forward_expand and cur in the comment so future readers
understand the intent and parity with the change made in qwen35.cpp.
In `@src/models/qwen35.cpp`:
- Around line 32-33: The call to ggml_build_forward_expand(gf, cur) on attn_norm
appears duplicate because cur is immediately consumed by build_layer_attn_linear
or build_layer_attn; either document intent for cross-backend split points or
remove the early anchor. Update the code around ggml_build_forward_expand(gf,
cur) to either (a) add a concise comment explaining this pre-anchor is
intentional to ensure the compute graph root is preserved when the graph may be
split across backends (e.g., VirtGPU guest/host boundary), referencing attn_norm
and the downstream build_layer_attn_linear/build_layer_attn calls, or (b) if no
cross-backend split is intended, remove the early ggml_build_forward_expand(gf,
cur) to avoid the apparent duplicate and rely on the terminal
ggml_build_forward_expand(gf, cur) reachable via normal traversal.
In `@src/models/qwen35moe.cpp`:
- Around line 32-33: The ggml_build_forward_expand(gf, cur); call in
qwen35moe.cpp needs the same graph-split anchor/comment adjustment applied in
qwen35.cpp: update the line with the identical optional comment or anchor marker
used in qwen35.cpp so the graph-splitting logic is clear and consistent (locate
the ggml_build_forward_expand call and mirror the comment/anchor change from
qwen35.cpp).
In `@src/models/qwen3next.cpp`:
- Around line 24-25: Add a brief inline comment above the
ggml_build_forward_expand(gf, cur) call in qwen3next.cpp explaining that this is
the VirtGPU "graph-split" anchor (same rationale used in the other three
models): state that this call intentionally marks the forward-graph split point
for VirtGPU execution, why it’s needed (e.g., separating CPU/GPU stages or
enabling virtualized GPU graph replay), and that it must remain to preserve
graph partitioning behavior; reference the ggml_build_forward_expand function
name so reviewers can locate and align the wording with the comments in the
other model files.
In `@tests/test-gguf.cpp`:
- Around line 424-426: Replace the unexplained magic literal 32 in the
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch by adding a one-line comment next to
the assignment to nbytes (or replace with a named constant) clarifying that this
value is intentionally chosen to be smaller than the minimum bytes any single
tensor would require (e.g., any value in 1..GGUF_DEFAULT_ALIGNMENT would work)
so the test deliberately writes fewer bytes than a tensor; reference the symbols
hft, HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, nbytes and GGUF_DEFAULT_ALIGNMENT in
the comment for clarity.
- Line 410: The pad calculation incorrectly uses a 32-bit mask when alignment is
a smaller integer; update the second occurrence to match the first by casting
alignment to uint64_t so the GGML_PAD expression uses 64-bit arithmetic: change
the call in handcrafted_check_tensors that updates expected_offset
(expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);) to cast
alignment to uint64_t, mirroring the fix applied where offset is updated; ensure
you reference GGML_PAD, ggml_row_size, alignment, expected_offset, and
handcrafted_check_tensors when making the change.
- Around line 201-206: Add a clarifying comment above the
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch explaining the magic constant
0x7FFFFFFFFFFFFFC0: state that it is the largest int64_t value already aligned
to GGUF_DEFAULT_ALIGNMENT (64 bytes), and that creating two tensors with that
size causes the combined data size to overflow (approx 0xFFFFFFFFFFFFFF80).
Place this comment next to the tensor_configs assignments (symbols: hft,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, tensor_configs, GGML_TYPE_I8,
GGUF_DEFAULT_ALIGNMENT) so future readers understand why that specific value was
chosen.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (47)
- docs/backend/VirtGPU.md
- ggml/include/ggml.h
- ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
- ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
- ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
- ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
- ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
- ggml/src/ggml-virtgpu/backend/backend-dispatched.h
- ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h
- ggml/src/ggml-virtgpu/backend/backend.cpp
- ggml/src/ggml-virtgpu/backend/shared/api_remoting.h
- ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h
- ggml/src/ggml-virtgpu/backend/shared/apir_backend.h
- ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
- ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
- ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h
- ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
- ggml/src/ggml-virtgpu/ggml-backend-device.cpp
- ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
- ggml/src/ggml-virtgpu/ggml-backend.cpp
- ggml/src/ggml-virtgpu/ggml-remoting.h
- ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
- ggml/src/ggml-virtgpu/include/apir_hw.h
- ggml/src/ggml-virtgpu/regenerate_remoting.py
- ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
- ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
- ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
- ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
- ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
- ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
- ggml/src/ggml-virtgpu/virtgpu.cpp
- ggml/src/ggml-virtgpu/virtgpu.h
- ggml/src/ggml.c
- ggml/src/gguf.cpp
- gguf-py/gguf/gguf_reader.py
- gguf-py/gguf/gguf_writer.py
- src/llama-model.cpp
- src/llama-model.h
- src/models/kimi-linear.cpp
- src/models/qwen35.cpp
- src/models/qwen35moe.cpp
- src/models/qwen3next.cpp
- tests/test-gguf.cpp
- tools/server/server-http.cpp
- tools/server/server-http.h
- tools/server/server-models.cpp
- tools/server/server-task.cpp
💤 Files with no reviewable changes (1)
- ggml/include/ggml.h
```cpp
    uintptr_t function_ptr;
    apir_decode_uintptr_t(dec, &function_ptr);
    void * ggml_backend_reg_fct_p = (void *) function_ptr;

    // Call the actual initialization
    uintptr_t device_handle = 0;
    uint32_t backend_id = 0;
    uint32_t result = backend_dispatch_initialize(ggml_backend_reg_fct_p, &device_handle, &backend_id);
```
Guard initialize flow against decoder failure before dispatching.
function_ptr on Line 16 is uninitialized before decode. If decode fails, Line 23 may pass garbage into backend_dispatch_initialize.
Proposed fix

```diff
-    uintptr_t function_ptr;
+    uintptr_t function_ptr = 0;
     apir_decode_uintptr_t(dec, &function_ptr);
+    if (apir_decoder_get_fatal(dec)) {
+        apir_decoder_set_fatal(dec);
+        return 1;
+    }
     void * ggml_backend_reg_fct_p = (void *) function_ptr;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines 16
- 24, The decode of function_ptr via apir_decode_uintptr_t may fail and leave
function_ptr uninitialized before calling backend_dispatch_initialize; update
the flow in this block to initialize function_ptr (e.g., to 0), capture and
check the return value of apir_decode_uintptr_t, and bail out or return an error
(or log) if the decode fails instead of continuing to cast and pass
ggml_backend_reg_fct_p into backend_dispatch_initialize; ensure the function
uses the checked/initialized ggml_backend_reg_fct_p only after a successful
decode.
```cpp
    // place-holder: validate that the size of shmem_res_id is <= cgraph_size
    // need to add another method in the Virgl->APIR callback interface
    GGML_UNUSED(shmem_res_id);
```
cgraph_size is still not validated against actual shared-memory length.
The placeholder on Line 44-46 means cgraph_size remains untrusted. That flows into decoder bounds at Line 94, which is a potential OOB-read vector.
Immediate mitigation (fail oversized payloads until exact shmem-size check exists)

```diff
 static uint32_t validate_graph_operation(size_t cgraph_size, uint32_t shmem_res_id, const char * operation) {
+    static constexpr size_t APIR_MAX_CGRAPH_SIZE = 1ull << 30; // 1 GiB hard safety cap
     if (cgraph_size == 0) {
         GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Zero-size computation graph\n", operation);
         return 1;
     }
+    if (cgraph_size > APIR_MAX_CGRAPH_SIZE) {
+        GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Oversized computation graph (%zu)\n", operation, cgraph_size);
+        return 1;
+    }
     // place-holder: validate that the size of shmem_res_id is <= cgraph_size
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines 44
- 47, The code currently ignores shmem_res_id (GGML_UNUSED) leaving cgraph_size
untrusted and allowing potential OOB reads when decoding; replace the
placeholder by retrieving the shared-memory buffer length for the given
shmem_res_id (or use an API that returns the mapped size) and immediately
validate that cgraph_size <= that shared-memory length, returning an
error/failing the request if it is larger; until you add the new Virgl->APIR
callback, implement an immediate mitigation that rejects any payload where
cgraph_size exceeds a conservative max or the known mapped size to prevent OOB
access in the decoder that reads using cgraph_size.
```diff
     for (int idx = 0; idx < cgraph->n_nodes; idx++) {
         ggml_tensor * op = ggml_graph_node(cgraph, idx);
         if (dev->iface.supports_op(dev, op)) {
             continue;
         }
-        GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Graph node %d (%s) not supported by the backend\n", idx, ggml_op_desc(op));
+        GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Graph node %d (%s) not supported by the backend\n", __func__, idx,
+                       ggml_op_desc(op));
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "supports_op call-site context:"
rg -nP -C3 '\bsupports_op\s*\(' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
echo
echo "ggml_backend_dev_t declarations in this file:"
rg -nP -C2 '\bggml_backend_dev_t\s+\w+' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
echo
echo "APIR_BACKEND_CHECK_SUPPORTS_OP definitions/usages:"
rg -nP '\bAPIR_BACKEND_CHECK_SUPPORTS_OP\b' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp ggml/src/ggml-virtgpu/backend
```

Repository: crc-org/llama.cpp

🏁 Script executed:

```bash
rg -nP '\bdev\b' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp | head -30
```
dev is undefined; use device instead.
On line 113, the call uses dev, but the device variable is declared as device on line 69. This will cause a compile error when APIR_BACKEND_CHECK_SUPPORTS_OP == 1.
Proposed fix

```diff
-        if (dev->iface.supports_op(dev, op)) {
+        if (device->iface.supports_op(device, op)) {
             continue;
         }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines
111 - 117, The loop that checks node support uses an undefined variable `dev`;
replace uses of `dev` with the correct `device` variable: update the call to the
backend support check from dev->iface.supports_op(dev, op) to
device->iface.supports_op(device, op) so the check uses the declared `device`
object (the loop iterating over cgraph->n_nodes and calling
ggml_graph_node/ggml_op_desc should remain unchanged).
```cpp
    // Decode device handle first
    uintptr_t device_handle;
    apir_decode_uintptr_t(dec, &device_handle);
    ggml_backend_dev_t device = (ggml_backend_dev_t) device_handle;

    // Decode backend ID second
    uint32_t backend_id;
    apir_decode_uint32_t(dec, &backend_id);

    // Cleanup specific backend instance
    cleanup_backend_instance(device, backend_id);
```
Validate decoded cleanup identifiers before using them.
Cleanup currently decodes and immediately uses IDs. If decode fails, Line 153 can run with invalid values.
Proposed fix

```diff
-    uintptr_t device_handle;
+    uintptr_t device_handle = 0;
     apir_decode_uintptr_t(dec, &device_handle);
     ggml_backend_dev_t device = (ggml_backend_dev_t) device_handle;

     // Decode backend ID second
-    uint32_t backend_id;
+    uint32_t backend_id = 0;
     apir_decode_uint32_t(dec, &backend_id);
+    if (apir_decoder_get_fatal(dec)) {
+        apir_decoder_set_fatal(dec);
+        return 1;
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines
143 - 154, The code currently decodes device_handle and backend_id with
apir_decode_uintptr_t and apir_decode_uint32_t and immediately calls
cleanup_backend_instance(device, backend_id); update this to validate the decode
results before use: capture and check the return/error codes from
apir_decode_uintptr_t(dec, &device_handle) and apir_decode_uint32_t(dec,
&backend_id), only cast to ggml_backend_dev_t and call
cleanup_backend_instance(device, backend_id) when both decodes succeed, and
otherwise avoid calling cleanup_backend_instance and handle/log the decode
failure appropriately.
```cpp
struct apir_backend_instance {
    ggml_backend_t bck;   // The actual backend
    uint32_t       magic; // For validation: 0xAB1234CD
};
```
Misleading comment: magic value doesn't match the actual macro.
The comment on line 32 says 0xAB1234CD but APIR_BACKEND_INSTANCE_MAGIC (line 47) is defined as 0xCD4321BA.
Fix

```diff
 struct apir_backend_instance {
     ggml_backend_t bck; // The actual backend
-    uint32_t magic; // For validation: 0xAB1234CD
+    uint32_t magic; // For validation: APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA)
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.h` around lines 30 - 33, The
comment on struct apir_backend_instance is wrong: update the inline magic
comment for apir_backend_instance to match the canonical constant
APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA) or vice-versa; specifically, locate
struct apir_backend_instance and the macro APIR_BACKEND_INSTANCE_MAGIC and make
the descriptive comment use the same hex value as the macro to avoid mismatch
(ensure only the comment is changed unless you intend to change the macro
value).
```cpp
    extern std::unordered_set<ggml_backend_buffer_t> backend_buffers;
    if (backend_buffers.find(buffer) == backend_buffers.end()) {
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Locate global set declaration and all call sites touching tracked buffers.
rg -n -C3 '\bbackend_buffers\b|apir_track_backend_buffer\s*\(|apir_untrack_backend_buffer\s*\(|apir_get_track_backend_buffers\s*\(' ggml/src/ggml-virtgpu
# Inspect whether accesses are guarded by mutex/lock primitives nearby.
rg -n -C3 'std::mutex|std::lock_guard|std::scoped_lock|std::shared_mutex|std::unique_lock' ggml/src/ggml-virtgpu
```

🏁 Script executed:

```bash
# Verify the implementation of apir_cs_ggml-rpc-back.cpp to see full context
head -50 ggml/src/ggml-virtgpu/backend/apir_cs_ggml-rpc-back.cpp
```

🏁 Script executed:

```bash
# Check if backend_buffers is used in headers that might be included in multi-threaded contexts
rg -l 'apir_track_backend_buffer|apir_untrack_backend_buffer|apir_get_track_backend_buffers' ggml/src/ggml-virtgpu
```
Add synchronization to protect `backend_buffers` from concurrent access.

The global `backend_buffers` set in `apir_cs_ggml-rpc-back.cpp` has no mutex or lock protection. All accesses, including the `find()` at lines 109-110, the `insert()` in `apir_track_backend_buffer()`, and the `erase()` in `apir_untrack_backend_buffer()`, operate on unprotected shared state. Since `std::unordered_set` is not thread-safe, concurrent track/untrack calls can race with the validation check, potentially invalidating the security boundary or causing undefined behavior. Wrap all accesses with a mutex (e.g., `std::mutex backend_buffers_lock`) or use a thread-safe container.
```cpp
static std::atomic<bool> initialized = false;
static ggml_backend_buffer_type buft;
```
Static singleton buffer-type cache is not device-safe.

Lines 76 and 99 use one process-wide static `buft` per function. After the first initialization, later devices get the first device's `.device`/`.context`, which breaks per-device correctness.

Please make this cache device-scoped (e.g., store the buffer types in `ggml_backend_remoting_device_context` and initialize them once per device).

Also applies to: 99-100
```cpp
if (ret == 0) {
    // Success - decode the response
    apir_decode_uintptr_t(decoder, &device_handle);
    apir_decode_uint32_t(decoder, &backend_id);
    GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__, (void *) device_handle,
                  backend_id);
```
Check the decoder's fatal state before returning initialize success.

On line 31, success is determined only by `ret == 0`. If decoding on lines 33-34 fails, the function can still return success with zeroed IDs.

Proposed fix

```diff
 if (ret == 0) {
     // Success - decode the response
     apir_decode_uintptr_t(decoder, &device_handle);
     apir_decode_uint32_t(decoder, &backend_id);
-    GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__, (void *) device_handle,
-                  backend_id);
+    if (apir_decoder_get_fatal(decoder)) {
+        GGML_LOG_ERROR(GGML_VIRTGPU "%s: Malformed initialize response payload\n", __func__);
+        ret = 1;
+        device_handle = 0;
+        backend_id = 0;
+    } else {
+        GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__,
+                      (void *) device_handle, backend_id);
+    }
 } else {
```
```diff
 if (virtgpu_shmem_create(gpu, size, &buffer_context.shmem)) {
-    GGML_ABORT(GGML_VIRTGPU "Couldn't allocate the guest-host shared buffer");
+    GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %ldb of guest-host shared buffer", __func__, size);
```
Use `%zu` instead of `%ld` for the `size_t` parameter in the format string.

Line 176 formats the `size_t` variable `size` with `%ld`, which is not portable on platforms where `size_t` differs in width from `long`. Use `%zu`, the standard format specifier for `size_t` in C99 and later.

Proposed fix

```diff
-    GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %ldb of guest-host shared buffer", __func__, size);
+    GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %zuB of guest-host shared buffer", __func__, size);
```
```cpp
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
```
Replace `assert()` with `GGML_ASSERT()` in the type accessor functions: debug-only checks vanish in Release builds.

The type bounds checks at lines 902-903 and 1269-1295 use C's standard `assert()`, which is disabled when `NDEBUG` is defined (the default in Release builds). This leaves `type_traits[type]` open to out-of-bounds reads if an invalid enum value is passed. `ggml_type_name()` also dropped its fallback and now unconditionally indexes the array.

Replace `assert()` with `GGML_ASSERT()`, which is unaffected by `NDEBUG` and aborts on failure. For `ggml_type_name()`, use a runtime check with a fallback to avoid OOB reads:

Proposed fix

```diff
 const struct ggml_type_traits * ggml_get_type_traits(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return &type_traits[type];
 }

 int64_t ggml_blck_size(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].blck_size;
 }

 size_t ggml_type_size(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].type_size;
 }

 size_t ggml_row_size(enum ggml_type type, int64_t ne) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     assert(ne % ggml_blck_size(type) == 0);
     return ggml_type_size(type)*ne/ggml_blck_size(type);
 }

 const char * ggml_type_name(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
-    return type_traits[type].type_name;
+    if ((unsigned) type >= GGML_TYPE_COUNT) {
+        return "NONE";
+    }
+    return type_traits[type].type_name;
 }

 bool ggml_is_quantized(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].is_quantized;
 }
```
🔴 Test of 'mac_ai test pre_cleanup_ci' failed after 00 hours 00 minutes 27 seconds. 🔴 • Link to the test results. • No reports index generated... Test configuration: Failure indicator: Empty. (See run.log) |
/test topsail remoting_mac |
🟢 Test of 'mac_ai test test_ci' succeeded after 00 hours 27 minutes 28 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
/test topsail remoting_linux |
🔴 Test of 'mac_ai test test_ci' failed after 00 hours 03 minutes 46 seconds. 🔴 • Link to the test results. • Link to the reports index. Test configuration: |
/test topsail remoting_mac |
🟢 Test of 'mac_ai test test_ci' succeeded after 00 hours 26 minutes 06 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
Make sure to read the contributing guidelines before submitting a PR
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Deprecations
- Removed the deprecated `ggml_type_sizef` API