
Add backend object management#29

Open
kpouget wants to merge 26 commits into crc-org:main from
kpouget:full2

Conversation

@kpouget
Collaborator

@kpouget kpouget commented Feb 25, 2026

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for new LLM model type (24B.A2B)
    • Enhanced server with query string handling in HTTP requests
  • Bug Fixes

    • Hardened memory validation with overflow detection and bounds checking
    • Improved buffer operation validation to prevent invalid memory access
    • Enhanced error handling and reporting across backend operations
  • Documentation

    • Updated VirtGPU documentation with shared-memory size constraints (64 GB limit)
  • Deprecations

    • Removed deprecated ggml_type_sizef API

rgerganov and others added 26 commits February 24, 2026 10:30
"max_tokens" is deprectated in favor of "max_completion_tokens" which
sets the upper bound for reasoning+output token.

Closes: #13700
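The fallback behavior described above can be sketched as follows (function and parameter names are illustrative, not the actual server code):

```cpp
#include <cstdint>

// Illustrative sketch only: prefer the new "max_completion_tokens" field,
// fall back to the deprecated "max_tokens", then to a server default.
// A value of -1 means "not provided in the request".
int32_t resolve_n_predict(int32_t max_completion_tokens,
                          int32_t max_tokens,
                          int32_t server_default) {
    if (max_completion_tokens >= 0) {
        return max_completion_tokens;
    }
    if (max_tokens >= 0) {  // deprecated, kept for backward compatibility
        return max_tokens;
    }
    return server_default;
}
```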
* model : Update label for LFM2-24B-A2B

```
❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2moe 24B.A2B Q4_0           |  12.54 GiB |    23.84 B | CPU        |      10 |             pp1 |         30.35 ± 2.49 |
| lfm2moe 8B.A1B Q4_0            |   4.41 GiB |     8.34 B | CPU        |      10 |             pp1 |         49.24 ± 1.93 |
```

* Remove extra line
* gguf : prevent integer overflow for ggml_context mem size

* ggml : fix int overflows in ggml_new_object()

* gguf : prevent string exhaustion

* gguf : prevent array elements exhaustion

* ggml : fix negative tensor type oob

* py : assert that alignment is non-zero power of 2

* ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object

* gguf-py : error on duplicate keys when reading

* py : restore tensor_fields

* enforce proper alignment in add_custom_alignment

* gguf : better name

* gguf : fix ctx size for no_alloc == true

* gguf : minor print fix

* ggml : print values when overflow

* ggml : remove deprecated ggml_type_sizef()

* ggml : relax ggml_type asserts to debug-only

* gguf : add mem_size overflow test

* gguf : add file size check for arrays

* ggml : relax asserts for ggml_get_type_traits()

* flake8 fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
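Two of the hardening patterns listed above (overflow detection in size arithmetic, and asserting that alignment is a non-zero power of two) boil down to checks like the following sketch; the names are illustrative, not the actual ggml/gguf code:

```cpp
#include <cstddef>
#include <cstdint>

// Would cur_end + obj_size wrap around? Checked before performing the
// addition, in the spirit of the ggml_new_object hardening above.
bool size_add_overflows(size_t cur_end, size_t obj_size) {
    return obj_size > SIZE_MAX - cur_end;
}

// A GGUF alignment must be a non-zero power of two.
bool is_valid_alignment(uint64_t align) {
    return align != 0 && (align & (align - 1)) == 0;
}
```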
…outer mode (#19854)

* server: fix query params lost when proxying requests in multi-model router mode

* server: re-encode query params using httplib::encode_query_component in proxy
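The fix can be pictured as a helper along these lines (hypothetical name; per the commit message, the actual patch re-encodes the parameters with httplib::encode_query_component when rebuilding the target URL):

```cpp
#include <string>

// Hypothetical helper: re-append the query string to the proxied path so
// the downstream model server receives the full original request target.
std::string proxied_target(const std::string & path, const std::string & query_string) {
    if (query_string.empty()) {
        return path;
    }
    return path + "?" + query_string;
}
```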
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
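A typical consistency check of this kind validates a guest-supplied (offset, size) pair against the shared buffer without letting the addition wrap (sketch; names are illustrative):

```cpp
#include <cstdint>

// Reject guest-supplied ranges that fall outside the buffer. Comparing
// size against buf_size - offset avoids computing offset + size, which
// could overflow for hostile inputs.
bool guest_range_is_valid(uint64_t buf_size, uint64_t offset, uint64_t size) {
    if (offset > buf_size) {
        return false;
    }
    return size <= buf_size - offset;
}
```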
…thods

```
  1. bck->iface.synchronize(bck)
  2. buft->iface.get_alloc_size(buft, op)
  3. buft->iface.get_max_size(buft)
```

these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend synchronize` and `buft
get_alloc_size` would have segfaulted the backend if not implemented.
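Defaulting an optional interface method amounts to a null check on the function pointer before the call, roughly like this sketch (stand-in types, not the real GGML interface structs):

```cpp
#include <cstddef>

struct backend_stub;  // stand-in for the real backend/buffer-type types

// Optional methods are function pointers that a backend may leave NULL.
struct iface_sketch {
    void   (*synchronize)(backend_stub *);                           // optional
    size_t (*get_alloc_size)(backend_stub *, size_t tensor_nbytes);  // optional
};

size_t safe_get_alloc_size(const iface_sketch & iface, backend_stub * b, size_t tensor_nbytes) {
    if (iface.get_alloc_size == nullptr) {
        return tensor_nbytes;  // sensible default instead of a segfault
    }
    return iface.get_alloc_size(b, tensor_nbytes);
}
```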
…iles

Compilation breaks when ordered alphabetically.
@openshift-ci

openshift-ci bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Walkthrough

This pull request introduces per-device backend instance management with explicit device handles and backend IDs for the VirtGPU APIR system; adds comprehensive input validation and overflow protection across GGML core, GGUF, and buffer operations; adds a new LLM_TYPE_24B_A2B model type; updates model implementations with explicit graph expansions; and enhances server HTTP handling with query string support.

Changes

Cohort / File(s) Summary
VirtGPU Backend Initialization & Dispatch Refactoring
ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp, ggml/src/ggml-virtgpu/backend/backend-dispatched.h, ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h, ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp, ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp, ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
Introduced per-device backend instance management with explicit backend lifecycle handling (initialize/cleanup), added device handle and backend ID tracking and validation, refactored backend-dispatched handlers with new validation helpers and error paths, enhanced input validation for buffer operations with overflow detection.
APIR Remote Backend API & Dispatch
ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h, ggml/src/ggml-virtgpu/backend/shared/apir_backend.h, ggml/src/ggml-virtgpu/backend/shared/api_remoting.h, ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h
Added new APIR command types (BACKEND_INITIALIZE=22, BACKEND_CLEANUP=24) with updated dispatch table count, introduced apir_dispatch_command_name inline function for command type stringification, expanded apir_backend_dispatcher signature with additional encoder/decoder parameters, added APIR_FORWARD_FAILED_TO_SYNC_STREAMS error code, and updated error handling paths.
VirtGPU Forward Backend Implementation
ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp, ggml/src/ggml-virtgpu/virtgpu-forward.gen.h, ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
Introduced apir_backend_initialize and apir_backend_cleanup functions with device/backend handle management, refactored apir_backend_graph_compute to accept and transmit device_handle and backend_id parameters, enhanced REMOTE_CALL_PREPARE and REMOTE_CALL macros with command-name tracking and error reporting.
VirtGPU Core & Registration
ggml/src/ggml-virtgpu/ggml-backend.cpp, ggml/src/ggml-virtgpu/ggml-backend-reg.cpp, ggml/src/ggml-virtgpu/ggml-remoting.h, ggml/src/ggml-virtgpu/include/apir_hw.h, ggml/src/ggml-virtgpu/backend.cpp, ggml/src/ggml-virtgpu/regenerate_remoting.py
Updated backend initialization to use APIR lifecycle (initialize/cleanup), added device_handle and backend_id to device context, introduced prefixed naming for device and buffer types, reworked code generation for apir_dispatch_command_name, minor formatting adjustments.
GGML Core Safety Hardening
ggml/src/ggml.c
Added bounds checking via assertions for type enum values in type trait accessors, introduced overflow-safe memory allocation checks in ggml_new_object detecting potential overflows in size calculations, added assertion in ggml_new_tensor_impl ensuring GGML_TENSOR_SIZE safety, strengthened object allocation with explicit overflow detection.
GGUF Reader/Writer Safety
ggml/src/gguf.cpp, gguf-py/gguf/gguf_reader.py, gguf-py/gguf/gguf_writer.py
Added bounds validation for vector and string reads with maximum element checks, enhanced overflow-aware memory sizing for context allocation, introduced remaining byte tracking and file position validation, added alignment power-of-two and positivity validation, enforced strict duplicate field detection raising KeyError.
VirtGPU Test & Utility Updates
tests/test-gguf.cpp, ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp, ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp, ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp, ggml/src/ggml-virtgpu/virtgpu.cpp, ggml/src/ggml-virtgpu/virtgpu.h
Added MEM_SIZE_OVERFLOW test case with extreme tensor shapes, cast alignment to uint64_t for overflow protection, improved error messages with function names and sizes, minor formatting adjustments and log message reflowing.
LLM Model Type Addition
src/llama-model.h, src/llama-model.cpp
Added new LLM_TYPE_24B_A2B enum value with 40-layer mapping and type name declaration.
Model Graph Expansion Updates
src/models/kimi-linear.cpp, src/models/qwen35.cpp, src/models/qwen35moe.cpp, src/models/qwen3next.cpp
Added explicit ggml_build_forward_expand calls after attn_norm computation; removed conv_states callback logging in linear attention paths.
Server HTTP Query String Support
tools/server/server-http.h, tools/server/server-http.cpp, tools/server/server-models.cpp, tools/server/server-task.cpp
Added query_string field to server_http_req struct, introduced build_query_string helper to extract query parameters, updated proxy request construction to append query string to path, enhanced timeout parameter passing to proxy, reworked n_predict fallback logic to check max_completion_tokens.
Documentation & Deprecated API Removal
docs/backend/VirtGPU.md, ggml/include/ggml.h
Added shared-memory size limitation note for libkrun hypervisor (64 GB limit); removed deprecated ggml_type_sizef function declaration.
VirtGPU Formatting & Includes
ggml/src/ggml-virtgpu/backend/shared/apir_cs.h, ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h, ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h, ggml/src/ggml-virtgpu/ggml-backend-device.cpp, ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
Added security hardening in ggml buffer decoding with validation against tracked backend buffers, introduced header guards and formatting controls (clang-format off/on), added runtime buffer handle validation, minor signature formatting and comment adjustments.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Frontend App
    participant BackendInit as Backend Init Flow
    participant DeviceMgr as Device Manager
    participant BackendInst as Backend Instance
    participant DispatchAPI as APIR Dispatch

    Client->>BackendInit: apir_backend_initialize(reg_fn)
    BackendInit->>DeviceMgr: ensure_device_extension(dev)
    DeviceMgr->>DeviceMgr: check device registry
    DeviceMgr->>BackendInst: create_backend_instance(dev)
    BackendInst->>BackendInst: allocate & validate magic
    BackendInit->>DispatchAPI: backend_dispatch_initialize()
    DispatchAPI->>BackendInst: initialize backend via device iface
    BackendInit->>Client: return device_handle + backend_id
    
    Note over Client,BackendInst: Later: Graph Compute
    Client->>DispatchAPI: apir_backend_graph_compute(device_handle, backend_id, cgraph)
    DispatchAPI->>DeviceMgr: get_backend_instance(device_handle, backend_id)
    DeviceMgr->>BackendInst: retrieve instance & validate
    BackendInst->>DispatchAPI: execute graph_compute
    DispatchAPI->>Client: return status
    
    Note over Client,BackendInst: Cleanup
    Client->>DispatchAPI: apir_backend_cleanup(device_handle, backend_id)
    DispatchAPI->>DeviceMgr: cleanup_backend_instance()
    DeviceMgr->>BackendInst: teardown & dealloc
    DispatchAPI->>Client: cleanup complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 Device handles flow, backends initialize with care,
Safety checks bloom where overflows dared,
Per-device instances dance in the virtual air,
Cleanup paths mirror the startup's declare,
Code now guards against the overflow's snare! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description contains only the boilerplate reminder to read the contributing guidelines, with no substantive details about changes, objectives, or testing. Resolution: add a detailed description explaining the backend object management changes, the rationale for lifecycle handling, key modifications to the APIR layers, new error codes, and testing performed.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 13.41%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title 'Add backend object management' directly summarizes the main focus of the changeset: introducing backend object lifecycle management across the ggml-virtgpu layers.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@kpouget
Collaborator Author

kpouget commented Feb 25, 2026

/test topsail remoting_mac
/cluster mac


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp (1)

47-71: ⚠️ Potential issue | 🟠 Major

Add null/fatal checks after apir_decode_ggml_tensor calls in three functions.

apir_decode_ggml_tensor can return NULL if the RPC tensor data is malformed (early return when apir_decode_apir_rpc_tensor_inplace fails). Currently, no NULL check is performed before passing the decoded tensor to buffer interface functions, allowing a NULL or invalid pointer to be dereferenced with potential host-side crash or memory corruption from a malicious guest.

This pattern is present in three functions:

  • backend_buffer_set_tensor (line 49): tensor used at line 71 without check
  • backend_buffer_get_tensor (line 90): tensor used at line 111 without check
  • backend_buffer_cpy_tensor (lines 129–130): src and dst tensors used at line 132 without checks

Other functions like backend_buffer_clear correctly validate decoded handles with if (!buffer || apir_decoder_get_fatal(dec)) after decoding. Apply the same pattern after tensor decodes.

Proposed fix
    tensor = (ggml_tensor *) (uintptr_t) apir_decode_ggml_tensor(dec);

+    if (!tensor || apir_decoder_get_fatal(dec)) {
+        GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Invalid tensor handle from guest\n", __func__);
+        return 1;
+    }
+
    uint32_t shmem_res_id;

Apply analogously after each apir_decode_ggml_tensor call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp` around lines 47
- 71, After calling apir_decode_ggml_tensor in each of
backend_buffer_set_tensor, backend_buffer_get_tensor, and
backend_buffer_cpy_tensor, add a null/fatal check and return an error if
decoding failed: verify the decoded pointer (e.g., tensor, src, dst) is non-NULL
and that apir_decoder_get_fatal(dec) is false before using it; if the check
fails, log/return the same error path used elsewhere (return 1) to avoid passing
NULL into buffer->iface.set_tensor, buffer->iface.get_tensor, or
buffer->iface.cpy_tensor. Ensure checks are added immediately after each
apir_decode_ggml_tensor call and use the same pattern as the existing
buffer/decoder validation (if (!tensor || apir_decoder_get_fatal(dec)) { return
1; }).
ggml/src/ggml-virtgpu/virtgpu.cpp (1)

124-146: ⚠️ Potential issue | 🔴 Critical

Bug: Duplicate condition — APIR_LOAD_LIBRARY_ENV_VAR_MISSING is checked twice; second branch is dead code.

Line 125 checks ret == APIR_LOAD_LIBRARY_ENV_VAR_MISSING, so the else if at line 136 with the same condition can never be reached. Based on the error message at line 138 ("some symbols are missing"), this was likely meant to be APIR_LOAD_LIBRARY_SYMBOL_MISSING.

🐛 Proposed fix
-        } else if (ret == APIR_LOAD_LIBRARY_ENV_VAR_MISSING) {
+        } else if (ret == APIR_LOAD_LIBRARY_SYMBOL_MISSING) {
             GGML_ABORT(GGML_VIRTGPU
                        "%s: could not load the backend library, some symbols are missing. "
                        "Make sure virglrenderer is correctly configured by the hypervisor. (%s) ",
                        __func__, apir_load_library_error(ret));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/virtgpu.cpp` around lines 124 - 146, The duplicate
conditional checks for APIR_LOAD_LIBRARY_ENV_VAR_MISSING cause the "some symbols
are missing" branch to be unreachable; in the virtgpu.cpp block that examines
ret (the APIR load result) replace the second else if condition currently using
APIR_LOAD_LIBRARY_ENV_VAR_MISSING with the correct constant
APIR_LOAD_LIBRARY_SYMBOL_MISSING so the branch that calls GGML_ABORT(... "some
symbols are missing" ..., apir_load_library_error(ret)) is reachable; ensure you
reference the same apir_load_library_error(ret), __func__, and GGML_ABORT usage
as the other branches.
🧹 Nitpick comments (17)
gguf-py/gguf/gguf_reader.py (1)

208-212: Move the fallback comments above the raise to avoid dead-code confusion.

Lines 211-212 are comment-only, but they appear in the dead-code region after an unconditional raise. Readers must mentally discount the raise to understand what the comments mean. Moving them (and the # TODO) directly above the raise makes the intent clearer without any behavior change.

♻️ Suggested restructuring
 if field.name in self.fields:
+    # TODO: add option to make this a warning and accept duplicate keys:
+    # logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
+    # self.fields[field.name + '_{}'.format(field.offset)] = field
     raise KeyError(f'Duplicate {field.name} already in list at offset {field.offset}')
-
-    # TODO: add option to make this a warning and accept duplicate keys like below
-    # logger.warning(f'Duplicate key {field.name} at offset {field.offset}')
-    # self.fields[field.name + '_{}'.format(field.offset)] = field
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gguf-py/gguf/gguf_reader.py` around lines 208 - 212, Move the
fallback-comment lines and the TODO so they appear immediately before the
unconditional raise that raises KeyError; specifically, relocate the lines "#
TODO: add option to make this a warning and accept duplicate keys like below"
and the two commented fallback alternatives ("# logger.warning(...)" and "#
self.fields[...]= field") to just above the raise statement that references
field.name and field.offset (the duplicate-key handling in gguf_reader.py where
the KeyError is raised), leaving behavior unchanged but avoiding dead-code
confusion.
src/models/qwen35moe.cpp (1)

32-33: Same graph-split anchor — same optional comment suggestion applies.

See the note in qwen35.cpp. Identical change and identical rationale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/qwen35moe.cpp` around lines 32 - 33, The
ggml_build_forward_expand(gf, cur); call in qwen35moe.cpp needs the same
graph-split anchor/comment adjustment applied in qwen35.cpp: update the line
with the identical optional comment or anchor marker used in qwen35.cpp so the
graph-splitting logic is clear and consistent (locate the
ggml_build_forward_expand call and mirror the comment/anchor change from
qwen35.cpp).
src/models/kimi-linear.cpp (1)

119-120: Same graph-split anchor as the other models — consider adding the explanatory comment.

Identical pattern to the change in qwen35.cpp. cur is already reachable via the main terminal expand at Line 391; the pre-anchor is meaningful only for VirtGPU graph-split scenarios. A brief comment here (as suggested for the other files) would keep intent clear.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/kimi-linear.cpp` around lines 119 - 120, Add a brief explanatory
comment above the call to ggml_build_forward_expand(gf, cur) clarifying that the
pre-anchor is redundant for normal execution because cur is already reachable
via the main terminal expand, but is retained intentionally to support VirtGPU
graph-split scenarios; reference ggml_build_forward_expand and cur in the
comment so future readers understand the intent and parity with the change made
in qwen35.cpp.
src/models/qwen3next.cpp (1)

24-25: Same graph-split anchor — same optional comment suggestion applies.

Consistent with the other three models. Consider adding a brief comment explaining the VirtGPU graph-split rationale, as this call is otherwise non-obvious to future readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/qwen3next.cpp` around lines 24 - 25, Add a brief inline comment
above the ggml_build_forward_expand(gf, cur) call in qwen3next.cpp explaining
that this is the VirtGPU "graph-split" anchor (same rationale used in the other
three models): state that this call intentionally marks the forward-graph split
point for VirtGPU execution, why it’s needed (e.g., separating CPU/GPU stages or
enabling virtualized GPU graph replay), and that it must remain to preserve
graph partitioning behavior; reference the ggml_build_forward_expand function
name so reviewers can locate and align the wording with the comments in the
other model files.
src/models/qwen35.cpp (1)

32-33: ggml_build_forward_expand on attn_norm — verify intent for all execution paths.

cur (attn_norm) is always consumed by either build_layer_attn_linear or build_layer_attn immediately below, so it would be reachable from the terminal ggml_build_forward_expand(gf, cur) at Line 84 via normal backwards traversal. The only scenario where this early anchor is required is when the compute graph is split across backends (e.g., VirtGPU guest/host boundary) and the split point falls between the norm and the downstream ops.

If that is the intent, the comment explaining why this pre-anchor is needed would help future readers distinguish it from an accidental duplicate. No functional concern.

+        // Anchor attn_norm in the graph before the recurrent/attn branch split so that
+        // backend graph-split (VirtGPU) sees it as a completed node on this side.
         ggml_build_forward_expand(gf, cur);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/qwen35.cpp` around lines 32 - 33, The call to
ggml_build_forward_expand(gf, cur) on attn_norm appears duplicate because cur is
immediately consumed by build_layer_attn_linear or build_layer_attn; either
document intent for cross-backend split points or remove the early anchor.
Update the code around ggml_build_forward_expand(gf, cur) to either (a) add a
concise comment explaining this pre-anchor is intentional to ensure the compute
graph root is preserved when the graph may be split across backends (e.g.,
VirtGPU guest/host boundary), referencing attn_norm and the downstream
build_layer_attn_linear/build_layer_attn calls, or (b) if no cross-backend split
is intended, remove the early ggml_build_forward_expand(gf, cur) to avoid the
apparent duplicate and rely on the terminal ggml_build_forward_expand(gf, cur)
reachable via normal traversal.
tests/test-gguf.cpp (3)

424-426: Magic number 32 should have a comment.

The value is non-obvious without context — any value from 1 to ~GGUF_DEFAULT_ALIGNMENT would suffice. A one-liner clarifying it deliberately writes fewer bytes than any single tensor requires would make the intent clear:

💬 Suggested annotation
+        // Write far fewer bytes than the declared tensor data requires,
+        // so the parser detects the overflow before attempting any allocation.
         if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
             nbytes = 32;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test-gguf.cpp` around lines 424 - 426, Replace the unexplained magic
literal 32 in the HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch by adding a one-line
comment next to the assignment to nbytes (or replace with a named constant)
clarifying that this value is intentionally chosen to be smaller than the
minimum bytes any single tensor would require (e.g., any value in
1..GGUF_DEFAULT_ALIGNMENT would work) so the test deliberately writes fewer
bytes than a tensor; reference the symbols hft,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, nbytes and GGUF_DEFAULT_ALIGNMENT in the
comment for clarity.

410-410: The (uint64_t) cast on alignment is a necessary correctness fix — approve.

GGML_PAD expands to (((x) + (n) - 1) & ~((n) - 1)). When x is size_t (64-bit) and n is uint32_t, ~((n) - 1) is evaluated as uint32_t and then zero-extended to 64 bits when AND-ed with the 64-bit x. This silently zeros the upper 32 bits of the result — for a row size like 0x7FFFFFFFFFFFFFC0 the computed pad would be completely wrong. Casting alignment to uint64_t forces the entire expression to 64-bit arithmetic and fixes the mask.

The same pattern without the cast appears at line 640 in handcrafted_check_tensors:

expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);

That site is not exercised for the overflow case (the function is guarded by expect_context_not_null), so it's not currently broken, but it should be made consistent.

♻️ Apply the same cast at line 640 for consistency
-        expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);
+        expected_offset += GGML_PAD(ggml_row_size(type, ne), (uint64_t) alignment);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test-gguf.cpp` at line 410, The pad calculation incorrectly uses a
32-bit mask when alignment is a smaller integer; update the second occurrence to
match the first by casting alignment to uint64_t so the GGML_PAD expression uses
64-bit arithmetic: change the call in handcrafted_check_tensors that updates
expected_offset (expected_offset += GGML_PAD(ggml_row_size(type, ne),
alignment);) to cast alignment to uint64_t, mirroring the fix applied where
offset is updated; ensure you reference GGML_PAD, ggml_row_size, alignment,
expected_offset, and handcrafted_check_tensors when making the change.
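The mask-widening effect described in this comment can be reproduced in isolation (`pad32` and `pad64` are illustrative stand-ins, not the real GGML_PAD macro):

```cpp
#include <cstdint>

// pad32 mimics GGML_PAD with a 32-bit n: ~(n - 1) is evaluated as uint32_t,
// then zero-extended, so the mask wipes the upper 32 bits of the result.
uint64_t pad32(uint64_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// pad64 applies the fix from the review: widen n before building the mask.
uint64_t pad64(uint64_t x, uint32_t n) {
    return (x + n - 1) & ~((uint64_t) n - 1);
}
```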

201-206: Consider adding a comment explaining the choice of 0x7FFFFFFFFFFFFFC0.

The magic value is non-obvious. A brief comment noting it's the largest int64_t value already aligned to GGUF_DEFAULT_ALIGNMENT (64 bytes), so that two such tensors bring the combined data size to ~0xFFFFFFFFFFFFFF80, would aid future readers without misleading them.

💬 Suggested annotation
 if (hft == HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW) {
     tensor_configs.resize(2);
+    // 0x7FFFFFFFFFFFFFC0 == INT64_MAX rounded down to the nearest 64-byte boundary.
+    // Two such I8 tensors produce a combined data size of ~UINT64_MAX, triggering
+    // a memory-size overflow in the GGUF parser.
     tensor_configs[0] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
     tensor_configs[1] = { GGML_TYPE_I8, { 0x7FFFFFFFFFFFFFC0, 1, 1, 1 } };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test-gguf.cpp` around lines 201 - 206, Add a clarifying comment above
the HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch explaining the magic constant
0x7FFFFFFFFFFFFFC0: state that it is the largest int64_t value already aligned
to GGUF_DEFAULT_ALIGNMENT (64 bytes), and that creating two tensors with that
size causes the combined data size to overflow (approx 0xFFFFFFFFFFFFFF80).
Place this comment next to the tensor_configs assignments (symbols: hft,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, tensor_configs, GGML_TYPE_I8,
GGUF_DEFAULT_ALIGNMENT) so future readers understand why that specific value was
chosen.
ggml/src/ggml-virtgpu/ggml-remoting.h (1)

39-40: Initialize new context fields defensively.

Consider default-initializing these fields to deterministic values to avoid accidental use of uninitialized state on partial-init paths.

💡 Proposed refactor
-    uintptr_t device_handle;  // Device handle returned from backend initialization
-    uint32_t  backend_id;     // Backend ID within the device
+    uintptr_t device_handle = 0;  // Device handle returned from backend initialization
+    uint32_t  backend_id    = 0;  // Backend ID within the device
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/ggml-remoting.h` around lines 39 - 40, The struct
fields device_handle and backend_id are not given deterministic defaults and can
be read uninitialized on partial init; update the context type that declares
device_handle and backend_id to explicitly initialize them (e.g., set
device_handle = 0 or UINTPTR_MAX and backend_id = 0 or UINT32_MAX as your
sentinel) and ensure the allocation/creation function (the context
constructor/creator) assigns those defaults before any early-return paths so any
partially-initialized context has known values; reference and update the symbols
device_handle and backend_id in the struct definition and in the context
creation/initialization function.
ggml/src/ggml-virtgpu/ggml-backend.cpp (1)

11-18: Consider nullifying ctx->device_handle and ctx->backend_id after cleanup.

After apir_backend_cleanup returns, the device context ctx still holds the now-stale device_handle and backend_id values. Since ctx is owned by the device (not the backend), it survives delete backend. A subsequent device_init on the same device would overwrite them, but defensive zeroing avoids accidental use of stale handles.

Suggested improvement
     apir_backend_cleanup(gpu, ctx->device_handle, ctx->backend_id);
 
+    ctx->device_handle = 0;
+    ctx->backend_id    = 0;
+
     delete backend;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/ggml-backend.cpp` around lines 11 - 18, In
ggml_backend_remoting_free, after calling apir_backend_cleanup(gpu,
ctx->device_handle, ctx->backend_id) clear the stale handles on the surviving
ggml_backend_remoting_device_context by setting ctx->device_handle and
ctx->backend_id to their null/zero values so the device context (accessible via
backend->context) cannot accidentally use stale values after delete backend;
keep the cleanup call and then nullify those members defensively.
ggml/src/ggml-virtgpu/virtgpu.cpp (1)

500-502: Integer division truncates sub-millisecond precision in timeout calculation.

(end_time - start_time) / 1000000 performs integer division (both operands are long long / int). The result assigned to float duration_ms will always be a whole number of milliseconds. For short durations this could report 0ms when actual elapsed time is e.g. 0.5ms. This is unlikely to cause a real timeout bug since the comparison is > max_wait_ms, but it does mean sub-ms elapsed time is invisible.

Suggested fix
-            float     duration_ms = (end_time - start_time) / 1000000;
+            float     duration_ms = (float)(end_time - start_time) / 1000000.0f;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/virtgpu.cpp` around lines 500 - 502, The calculation
for duration_ms uses integer division so sub-millisecond precision is lost;
change the division to floating-point (e.g., divide by 1000000.0f or cast
(end_time - start_time) to float/double) so duration_ms preserves fractional
milliseconds — update the expression that computes duration_ms (using ts_end,
end_time, start_time, duration_ms) to perform floating-point division.
ggml/src/ggml-virtgpu/ggml-backend-reg.cpp (1)

41-77: Consider extracting the duplicated name-prefixing pattern.

The "[virtgpu] " prefix logic at lines 41-56 and 62-77 is nearly identical. A small helper would reduce duplication:

Example helper
static char * make_prefixed_name(const char * rmt_name) {
    size_t len = strlen(rmt_name) + 11;  // 10 for "[virtgpu] " + 1 for the terminating NUL
    char * prefixed = (char *) malloc(len);
    if (!prefixed) {
        free((void *) rmt_name);
        return nullptr;
    }
    snprintf(prefixed, len, "[virtgpu] %s", rmt_name);
    return prefixed;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/ggml-backend-reg.cpp` around lines 41 - 77, The code
duplicates the "[virtgpu] " name-prefixing logic for apir_device_get_name and
apir_buffer_type_get_name; extract this into a helper (e.g., static char *
make_prefixed_name(const char *rmt_name)) that allocates the prefixed string,
returns nullptr on allocation failure, and frees the original rmt_name on
failure so callers (the blocks that set gpu->cached_device_info.name and
gpu->cached_buffer_type.name) can simply call the helper and GGML_ABORT if it
returns nullptr; replace both duplicated blocks with calls to make_prefixed_name
and assign the results to gpu->cached_device_info.name and
gpu->cached_buffer_type.name respectively.
ggml/src/ggml-virtgpu/backend/backend.cpp (1)

84-88: Dead code: library_reg can never be null here.

library_reg is assigned at line 67 with a fallback to GGML_DEFAULT_BACKEND_REG (a non-null string literal), so this branch is unreachable. Consider removing it to avoid misleading future readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend.cpp` around lines 84 - 88, The branch
checking for library_reg being null is dead code because library_reg is assigned
earlier with a fallback to the non-null GGML_DEFAULT_BACKEND_REG; remove the
unreachable if-block that logs GGML_VIRTGPU_BCK and returns
APIR_LOAD_LIBRARY_ENV_VAR_MISSING. Update any surrounding logic in the function
in backend.cpp (where library_reg is set and used) to rely on the guaranteed
non-null value and delete the check referencing library_reg and
APIR_LLAMA_CPP_GGML_LIBRARY_REG_ENV to avoid misleading future readers.
ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp (2)

58-82: Minor TOCTOU between ensure_device_extension and get_device_extension.

ensure_device_extension (line 59) acquires and releases device_extensions_mutex, then get_device_extension (line 60) acquires it again. If cleanup_device_extension is called from another thread between these two calls, the extension could be removed, causing get_device_extension to return nullptr.

In practice this would require concurrent create + cleanup on the same device which is a caller misuse, but you could eliminate the gap by combining the ensure + get into a single lock scope, or by having ensure_device_extension return the pointer directly.

♻️ Proposed consolidation
-// Create new backend instance for device
-uintptr_t create_backend_instance(ggml_backend_dev_t device) {
-    ensure_device_extension(device);
-    apir_device_extension * ext = get_device_extension(device);
-    if (ext == nullptr) {
-        return 0;  // Failed
-    }
+// Ensure device extension exists and return it (caller must not hold device_extensions_mutex)
+apir_device_extension * ensure_and_get_device_extension(ggml_backend_dev_t device) {
+    std::lock_guard<std::mutex> lock(device_extensions_mutex);
+    auto it = device_extensions.find(device);
+    if (it == device_extensions.end()) {
+        apir_device_extension * ext = new apir_device_extension();
+        ext->next_backend_id        = 1;
+        ggml_backend_dev_props props;
+        device->iface.get_props(device, &props);
+        ext->async_backend = props.caps.async;
+        ext->magic                = APIR_DEVICE_EXTENSION_MAGIC;
+        device_extensions[device] = ext;
+        return ext;
+    }
+    apir_device_extension * ext = it->second;
+    return (ext->magic == APIR_DEVICE_EXTENSION_MAGIC) ? ext : nullptr;
+}
+
+// Create new backend instance for device
+uintptr_t create_backend_instance(ggml_backend_dev_t device) {
+    apir_device_extension * ext = ensure_and_get_device_extension(device);
+    if (ext == nullptr) {
+        return 0;  // Failed
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp` around lines 58 - 82,
The TOCTOU arises because create_backend_instance calls
ensure_device_extension() then get_device_extension(), allowing
cleanup_device_extension to remove the extension in between; to fix this, change
the flow so the extension pointer is obtained under a single lock: either modify
ensure_device_extension to return the apir_device_extension* (or add a new
ensure_and_get_device_extension(device) that returns the pointer) and use that
pointer directly in create_backend_instance instead of calling
get_device_extension(), or hold the device_extensions_mutex across both
operations so the extension cannot be removed; update create_backend_instance to
use the returned apir_device_extension* (or the single-locked getter) and remove
the separate get_device_extension call.

157-182: Truncation of backend_id from uintptr_t to uint32_t.

Line 179 casts backend_id (a uintptr_t, which is 64-bit on LP64 platforms) to uint32_t. If next_backend_id ever exceeds UINT32_MAX, this silently truncates. Since IDs start at 1 and increment, this is practically unreachable, but an assert would make the assumption explicit.

🛡️ Optional: add an assertion
+    GGML_ASSERT(backend_id <= UINT32_MAX && "backend_id exceeds uint32_t range");
     *out_backend_id = (uint32_t) backend_id;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp` around lines 157 - 182,
The function backend_dispatch_initialize casts a uintptr_t backend_id to
uint32_t causing silent truncation; update backend_dispatch_initialize to
validate the returned backend_id from create_backend_instance() before assigning
to *out_backend_id: check backend_id != 0 and that backend_id <= UINT32_MAX (or
use an assert/GGML_LOG_ERROR and return
APIR_BACKEND_INITIALIZE_BACKEND_INIT_FAILED if it doesn't fit), referencing the
create_backend_instance call and the backend_id/out_backend_id variables so the
failure is detected and logged instead of silently truncating.
ggml/src/gguf.cpp (1)

327-347: nbytes_remain() — consider the performance impact on repeated calls.

This helper seeks to EOF and back on every invocation. It's called once per vector read and once per individual string read (including each element when reading a string array). For very large string arrays on slow/remote filesystems, the overhead of N+1 seek-pairs could add up.

A minor optimization would be to cache the file size at the start of gguf_init_from_file_impl and compute remaining bytes from file_size - gguf_ftell(file), avoiding the repeated seeks. Not blocking, but worth considering if large GGUF files become a performance concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/gguf.cpp` around lines 327 - 347, The nbytes_remain() helper
currently seeks to EOF and back on every call, which is expensive for many small
reads; compute and cache the total file size once at the start of
gguf_init_from_file_impl (e.g., call gguf_fseek/gguf_ftell once to get
file_size) and change nbytes_remain() (or callers) to return file_size -
gguf_ftell(file) using that cached file_size instead of performing repeated
seeks; store the cached size in a field or local context accessible to
nbytes_remain() and ensure initialization still validates/sets the cached size
before any reads occur (referencing nbytes_remain and gguf_init_from_file_impl).
ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp (1)

125-128: Don’t silently ignore backend cleanup RPC failures.

On lines 125-127, ret is dropped. At minimum, log the failure so leaked backend instances are diagnosable.

Proposed improvement
     REMOTE_CALL(gpu, encoder, decoder, ret);
 
+    if (ret != 0) {
+        GGML_LOG_ERROR(GGML_VIRTGPU "%s: Backend cleanup failed with ret=%d\n", __func__, ret);
+    }
+
     remote_call_finish(gpu, encoder, decoder);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp` around lines 125 - 128,
The RPC return value stored in ret after REMOTE_CALL(gpu, encoder, decoder, ret)
is being ignored; modify the cleanup path around REMOTE_CALL and
remote_call_finish in virtgpu-forward-backend.cpp to check ret for errors and at
minimum log a descriptive error via your logging facility (include
function/context like REMOTE_CALL(gpu, encoder, decoder, ret) and
remote_call_finish(gpu, encoder, decoder) in the message) so backend cleanup RPC
failures are recorded and diagnosable; ensure the code still calls
remote_call_finish even on error and propagate or handle the error appropriately
per project conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp`:
- Around line 143-154: The code currently decodes device_handle and backend_id
with apir_decode_uintptr_t and apir_decode_uint32_t and immediately calls
cleanup_backend_instance(device, backend_id); update this to validate the decode
results before use: capture and check the return/error codes from
apir_decode_uintptr_t(dec, &device_handle) and apir_decode_uint32_t(dec,
&backend_id), only cast to ggml_backend_dev_t and call
cleanup_backend_instance(device, backend_id) when both decodes succeed, and
otherwise avoid calling cleanup_backend_instance and handle/log the decode
failure appropriately.
- Around line 44-47: The code currently ignores shmem_res_id (GGML_UNUSED)
leaving cgraph_size untrusted and allowing potential OOB reads when decoding;
replace the placeholder by retrieving the shared-memory buffer length for the
given shmem_res_id (or use an API that returns the mapped size) and immediately
validate that cgraph_size <= that shared-memory length, returning an
error/failing the request if it is larger; until you add the new Virgl->APIR
callback, implement an immediate mitigation that rejects any payload where
cgraph_size exceeds a conservative max or the known mapped size to prevent OOB
access in the decoder that reads using cgraph_size.
- Around line 16-24: The decode of function_ptr via apir_decode_uintptr_t may
fail and leave function_ptr uninitialized before calling
backend_dispatch_initialize; update the flow in this block to initialize
function_ptr (e.g., to 0), capture and check the return value of
apir_decode_uintptr_t, and bail out or return an error (or log) if the decode
fails instead of continuing to cast and pass ggml_backend_reg_fct_p into
backend_dispatch_initialize; ensure the function uses the checked/initialized
ggml_backend_reg_fct_p only after a successful decode.
- Around line 111-117: The loop that checks node support uses an undefined
variable `dev`; replace uses of `dev` with the correct `device` variable: update
the call to the backend support check from dev->iface.supports_op(dev, op) to
device->iface.supports_op(device, op) so the check uses the declared `device`
object (the loop iterating over cgraph->n_nodes and calling
ggml_graph_node/ggml_op_desc should remain unchanged).

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.h`:
- Around line 30-33: The comment on struct apir_backend_instance is wrong:
update the inline magic comment for apir_backend_instance to match the canonical
constant APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA) or vice-versa; specifically,
locate struct apir_backend_instance and the macro APIR_BACKEND_INSTANCE_MAGIC
and make the descriptive comment use the same hex value as the macro to avoid
mismatch (ensure only the comment is changed unless you intend to change the
macro value).

In `@ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h`:
- Around line 109-110: The global backend_buffers set is accessed concurrently
without protection; introduce a std::mutex (e.g., backend_buffers_lock) and use
scoped locking (std::lock_guard<std::mutex> or std::unique_lock<std::mutex>)
around every access to backend_buffers including the validation check that uses
backend_buffers.find(buffer) and the mutating calls in apir_track_backend_buffer
(insert) and apir_untrack_backend_buffer (erase) so all reads/writes are
serialized and thread-safe.

In `@ggml/src/ggml-virtgpu/ggml-backend-device.cpp`:
- Around line 76-77: The code currently uses a process-wide static cache (static
std::atomic<bool> initialized and static ggml_backend_buffer_type buft) which
causes buft to hold the first device's device/context for all subsequent
devices; change the cache to be device-scoped by moving ggml_backend_buffer_type
buft (and the per-device initialized flag) into the per-device context struct
ggml_backend_remoting_device_context and initialize buft once per device when
accessing it (replace the function-local static buft/initialized usage with
reads/writes to the fields on ggml_backend_remoting_device_context so each
device gets its own buffer type initialized with the correct device/context).

In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp`:
- Around line 31-36: The code currently treats ret == 0 as success before
verifying that decoding succeeded; after calling apir_decode_uintptr_t(decoder,
&device_handle) and apir_decode_uint32_t(decoder, &backend_id) you must check
the decoder's fatal/error state and only consider initialization successful if
decoding produced no errors. Update the block that currently checks ret to call
the decoder error/fatal-check API (e.g., apir_decoder_fatal(decoder) or
equivalent) after the two apir_decode_* calls, set ret to a non-zero error code
and log an error if decoding failed, and only log and return success when the
decoder indicates no fatal error (use the same symbols: decoder, device_handle,
backend_id, apir_decode_uintptr_t, apir_decode_uint32_t).

In `@ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp`:
- Line 176: The format string in the GGML_ABORT call that uses GGML_VIRTGPU and
__func__ incorrectly prints the size_t variable size with "%ld"; update the
format specifier to "%zu" so the call GGML_ABORT(GGML_VIRTGPU "%s: Couldn't
allocate %zub of guest-host shared buffer", __func__, size) correctly and
portably formats the size_t value (no cast needed).

In `@ggml/src/ggml.c`:
- Around line 902-903: Replace the standard C assert() calls in the type
accessor functions with GGML_ASSERT() so checks remain active in Release builds:
change assertions that validate "type >= 0" and "type < GGML_TYPE_COUNT" (and
any similar checks around type_traits access) to GGML_ASSERT(type >= 0 && type <
GGML_TYPE_COUNT). Also update ggml_type_name() to perform a runtime bounds check
against GGML_TYPE_COUNT before indexing type_traits and return a safe fallback
string (e.g., "unknown") when the check fails to avoid out-of-bounds reads.

---

Outside diff comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp`:
- Around line 47-71: After calling apir_decode_ggml_tensor in each of
backend_buffer_set_tensor, backend_buffer_get_tensor, and
backend_buffer_cpy_tensor, add a null/fatal check and return an error if
decoding failed: verify the decoded pointer (e.g., tensor, src, dst) is non-NULL
and that apir_decoder_get_fatal(dec) is false before using it; if the check
fails, log/return the same error path used elsewhere (return 1) to avoid passing
NULL into buffer->iface.set_tensor, buffer->iface.get_tensor, or
buffer->iface.cpy_tensor. Ensure checks are added immediately after each
apir_decode_ggml_tensor call and use the same pattern as the existing
buffer/decoder validation (if (!tensor || apir_decoder_get_fatal(dec)) { return
1; }).

In `@ggml/src/ggml-virtgpu/virtgpu.cpp`:
- Around line 124-146: The duplicate conditional checks for
APIR_LOAD_LIBRARY_ENV_VAR_MISSING cause the "some symbols are missing" branch to
be unreachable; in the virtgpu.cpp block that examines ret (the APIR load
result) replace the second else if condition currently using
APIR_LOAD_LIBRARY_ENV_VAR_MISSING with the correct constant
APIR_LOAD_LIBRARY_SYMBOL_MISSING so the branch that calls GGML_ABORT(... "some
symbols are missing" ..., apir_load_library_error(ret)) is reachable; ensure you
reference the same apir_load_library_error(ret), __func__, and GGML_ABORT usage
as the other branches.

---

Nitpick comments:
In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp`:
- Around line 58-82: The TOCTOU arises because create_backend_instance calls
ensure_device_extension() then get_device_extension(), allowing
cleanup_device_extension to remove the extension in between; to fix this, change
the flow so the extension pointer is obtained under a single lock: either modify
ensure_device_extension to return the apir_device_extension* (or add a new
ensure_and_get_device_extension(device) that returns the pointer) and use that
pointer directly in create_backend_instance instead of calling
get_device_extension(), or hold the device_extensions_mutex across both
operations so the extension cannot be removed; update create_backend_instance to
use the returned apir_device_extension* (or the single-locked getter) and remove
the separate get_device_extension call.
- Around line 157-182: The function backend_dispatch_initialize casts a
uintptr_t backend_id to uint32_t causing silent truncation; update
backend_dispatch_initialize to validate the returned backend_id from
create_backend_instance() before assigning to *out_backend_id: check backend_id
!= 0 and that backend_id <= UINT32_MAX (or use an assert/GGML_LOG_ERROR and
return APIR_BACKEND_INITIALIZE_BACKEND_INIT_FAILED if it doesn't fit),
referencing the create_backend_instance call and the backend_id/out_backend_id
variables so the failure is detected and logged instead of silently truncating.

In `@ggml/src/ggml-virtgpu/backend/backend.cpp`:
- Around line 84-88: The branch checking for library_reg being null is dead code
because library_reg is assigned earlier with a fallback to the non-null
GGML_DEFAULT_BACKEND_REG; remove the unreachable if-block that logs
GGML_VIRTGPU_BCK and returns APIR_LOAD_LIBRARY_ENV_VAR_MISSING. Update any
surrounding logic in the function in backend.cpp (where library_reg is set and
used) to rely on the guaranteed non-null value and delete the check referencing
library_reg and APIR_LLAMA_CPP_GGML_LIBRARY_REG_ENV to avoid misleading future
readers.

In `@ggml/src/ggml-virtgpu/ggml-backend-reg.cpp`:
- Around line 41-77: The code duplicates the "[virtgpu] " name-prefixing logic
for apir_device_get_name and apir_buffer_type_get_name; extract this into a
helper (e.g., static char * make_prefixed_name(const char *rmt_name)) that
allocates the prefixed string, returns nullptr on allocation failure, and frees
the original rmt_name on failure so callers (the blocks that set
gpu->cached_device_info.name and gpu->cached_buffer_type.name) can simply call
the helper and GGML_ABORT if it returns nullptr; replace both duplicated blocks
with calls to make_prefixed_name and assign the results to
gpu->cached_device_info.name and gpu->cached_buffer_type.name respectively.

In `@ggml/src/ggml-virtgpu/ggml-backend.cpp`:
- Around line 11-18: In ggml_backend_remoting_free, after calling
apir_backend_cleanup(gpu, ctx->device_handle, ctx->backend_id) clear the stale
handles on the surviving ggml_backend_remoting_device_context by setting
ctx->device_handle and ctx->backend_id to their null/zero values so the device
context (accessible via backend->context) cannot accidentally use stale values
after delete backend; keep the cleanup call and then nullify those members
defensively.

In `@ggml/src/ggml-virtgpu/ggml-remoting.h`:
- Around line 39-40: The struct fields device_handle and backend_id are not
given deterministic defaults and can be read uninitialized on partial init;
update the context type that declares device_handle and backend_id to explicitly
initialize them (e.g., set device_handle = 0 or UINTPTR_MAX and backend_id = 0
or UINT32_MAX as your sentinel) and ensure the allocation/creation function (the
context constructor/creator) assigns those defaults before any early-return
paths so any partially-initialized context has known values; reference and
update the symbols device_handle and backend_id in the struct definition and in
the context creation/initialization function.


In `@gguf-py/gguf/gguf_reader.py`:
- Around line 208-212: Move the fallback-comment lines and the TODO so they
appear immediately before the unconditional raise that raises KeyError;
specifically, relocate the lines "# TODO: add option to make this a warning and
accept duplicate keys like below" and the two commented fallback alternatives
("# logger.warning(...)" and "# self.fields[...]= field") to just above the
raise statement that references field.name and field.offset (the duplicate-key
handling in gguf_reader.py where the KeyError is raised), leaving behavior
unchanged but avoiding dead-code confusion.

In `@src/models/kimi-linear.cpp`:
- Around line 119-120: Add a brief explanatory comment above the call to
ggml_build_forward_expand(gf, cur) clarifying that the pre-anchor is redundant
for normal execution because cur is already reachable via the main terminal
expand, but is retained intentionally to support VirtGPU graph-split scenarios;
reference ggml_build_forward_expand and cur in the comment so future readers
understand the intent and parity with the change made in qwen35.cpp.

In `@src/models/qwen35.cpp`:
- Around line 32-33: The call to ggml_build_forward_expand(gf, cur) on attn_norm
appears duplicate because cur is immediately consumed by build_layer_attn_linear
or build_layer_attn; either document intent for cross-backend split points or
remove the early anchor. Update the code around ggml_build_forward_expand(gf,
cur) to either (a) add a concise comment explaining this pre-anchor is
intentional to ensure the compute graph root is preserved when the graph may be
split across backends (e.g., VirtGPU guest/host boundary), referencing attn_norm
and the downstream build_layer_attn_linear/build_layer_attn calls, or (b) if no
cross-backend split is intended, remove the early ggml_build_forward_expand(gf,
cur) to avoid the apparent duplicate and rely on the terminal
ggml_build_forward_expand(gf, cur) reachable via normal traversal.

In `@src/models/qwen35moe.cpp`:
- Around line 32-33: The ggml_build_forward_expand(gf, cur); call in
qwen35moe.cpp needs the same graph-split anchor/comment adjustment applied in
qwen35.cpp: update the line with the identical optional comment or anchor marker
used in qwen35.cpp so the graph-splitting logic is clear and consistent (locate
the ggml_build_forward_expand call and mirror the comment/anchor change from
qwen35.cpp).

In `@src/models/qwen3next.cpp`:
- Around line 24-25: Add a brief inline comment above the
ggml_build_forward_expand(gf, cur) call in qwen3next.cpp explaining that this is
the VirtGPU "graph-split" anchor (same rationale used in the other three
models): state that this call intentionally marks the forward-graph split point
for VirtGPU execution, why it’s needed (e.g., separating CPU/GPU stages or
enabling virtualized GPU graph replay), and that it must remain to preserve
graph partitioning behavior; reference the ggml_build_forward_expand function
name so reviewers can locate and align the wording with the comments in the
other model files.

In `@tests/test-gguf.cpp`:
- Around line 424-426: Replace the unexplained magic literal 32 in the
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch by adding a one-line comment next to
the assignment to nbytes (or replace with a named constant) clarifying that this
value is intentionally chosen to be smaller than the minimum bytes any single
tensor would require (e.g., any value in 1..GGUF_DEFAULT_ALIGNMENT would work)
so the test deliberately writes fewer bytes than a tensor; reference the symbols
hft, HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, nbytes and GGUF_DEFAULT_ALIGNMENT in
the comment for clarity.
- Line 410: The pad calculation incorrectly uses a 32-bit mask when alignment is
a smaller integer; update the second occurrence to match the first by casting
alignment to uint64_t so the GGML_PAD expression uses 64-bit arithmetic: change
the call in handcrafted_check_tensors that updates expected_offset
(expected_offset += GGML_PAD(ggml_row_size(type, ne), alignment);) to cast
alignment to uint64_t, mirroring the fix applied where offset is updated; ensure
you reference GGML_PAD, ggml_row_size, alignment, expected_offset, and
handcrafted_check_tensors when making the change.
- Around line 201-206: Add a clarifying comment above the
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW branch explaining the magic constant
0x7FFFFFFFFFFFFFC0: state that it is the largest int64_t value already aligned
to GGUF_DEFAULT_ALIGNMENT (64 bytes), and that creating two tensors with that
size causes the combined data size to overflow (approx 0xFFFFFFFFFFFFFF80).
Place this comment next to the tensor_configs assignments (symbols: hft,
HANDCRAFTED_DATA_MEM_SIZE_OVERFLOW, tensor_configs, GGML_TYPE_I8,
GGUF_DEFAULT_ALIGNMENT) so future readers understand why that specific value was
chosen.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa6f918 and b8c61b4.

📒 Files selected for processing (47)
  • docs/backend/VirtGPU.md
  • ggml/include/ggml.h
  • ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
  • ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
  • ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
  • ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
  • ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
  • ggml/src/ggml-virtgpu/backend/backend-dispatched.h
  • ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h
  • ggml/src/ggml-virtgpu/backend/backend.cpp
  • ggml/src/ggml-virtgpu/backend/shared/api_remoting.h
  • ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h
  • ggml/src/ggml-virtgpu/backend/shared/apir_backend.h
  • ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
  • ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
  • ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h
  • ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
  • ggml/src/ggml-virtgpu/ggml-backend-device.cpp
  • ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
  • ggml/src/ggml-virtgpu/ggml-backend.cpp
  • ggml/src/ggml-virtgpu/ggml-remoting.h
  • ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
  • ggml/src/ggml-virtgpu/include/apir_hw.h
  • ggml/src/ggml-virtgpu/regenerate_remoting.py
  • ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
  • ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
  • ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
  • ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
  • ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
  • ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
  • ggml/src/ggml-virtgpu/virtgpu.cpp
  • ggml/src/ggml-virtgpu/virtgpu.h
  • ggml/src/ggml.c
  • ggml/src/gguf.cpp
  • gguf-py/gguf/gguf_reader.py
  • gguf-py/gguf/gguf_writer.py
  • src/llama-model.cpp
  • src/llama-model.h
  • src/models/kimi-linear.cpp
  • src/models/qwen35.cpp
  • src/models/qwen35moe.cpp
  • src/models/qwen3next.cpp
  • tests/test-gguf.cpp
  • tools/server/server-http.cpp
  • tools/server/server-http.h
  • tools/server/server-models.cpp
  • tools/server/server-task.cpp
💤 Files with no reviewable changes (1)
  • ggml/include/ggml.h

Comment on lines +16 to +24
uintptr_t function_ptr;
apir_decode_uintptr_t(dec, &function_ptr);
void * ggml_backend_reg_fct_p = (void *) function_ptr;

// Call the actual initialization
uintptr_t device_handle = 0;
uint32_t backend_id = 0;
uint32_t result = backend_dispatch_initialize(ggml_backend_reg_fct_p, &device_handle, &backend_id);


⚠️ Potential issue | 🟠 Major

Guard initialize flow against decoder failure before dispatching.

function_ptr on Line 16 is uninitialized before decode. If decode fails, Line 23 may pass garbage into backend_dispatch_initialize.

Proposed fix
-    uintptr_t function_ptr;
+    uintptr_t function_ptr = 0;
     apir_decode_uintptr_t(dec, &function_ptr);
+    if (apir_decoder_get_fatal(dec)) {
+        apir_decoder_set_fatal(dec);
+        return 1;
+    }
     void * ggml_backend_reg_fct_p = (void *) function_ptr;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines 16
- 24, The decode of function_ptr via apir_decode_uintptr_t may fail and leave
function_ptr uninitialized before calling backend_dispatch_initialize; update
the flow in this block to initialize function_ptr (e.g., to 0), capture and
check the return value of apir_decode_uintptr_t, and bail out or return an error
(or log) if the decode fails instead of continuing to cast and pass
ggml_backend_reg_fct_p into backend_dispatch_initialize; ensure the function
uses the checked/initialized ggml_backend_reg_fct_p only after a successful
decode.
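The fail-safe decode pattern suggested above (zero-initialize, decode, check the latched error flag before use) can be sketched in isolation. Everything here (`decoder_t`, `decode_u64`, `dispatch_initialize`) is an illustrative stand-in, not the real `apir_*` API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative minimal decoder: reads fixed-width values from a byte
// buffer and latches a fatal flag on underflow, like the apir decoder.
struct decoder_t {
    const uint8_t *buf;
    size_t         len;
    size_t         pos;
    bool           fatal;
};

static void decode_u64(decoder_t *dec, uint64_t *out) {
    if (dec->fatal || dec->len - dec->pos < sizeof(*out)) {
        dec->fatal = true;  // latch the error; *out is left untouched
        return;
    }
    std::memcpy(out, dec->buf + dec->pos, sizeof(*out));
    dec->pos += sizeof(*out);
}

// Returns 1 on malformed input instead of dispatching with garbage.
static int dispatch_initialize(decoder_t *dec) {
    uint64_t function_ptr = 0;  // zero-init: a failed decode cannot leak stack garbage
    decode_u64(dec, &function_ptr);
    if (dec->fatal) {
        return 1;  // bail out before casting or calling through the value
    }
    // ... cast and dispatch only after a successful decode ...
    return 0;
}
```

A truncated 3-byte payload trips the guard, while an 8-byte payload decodes cleanly.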

Comment on lines +44 to +47
// place-holder: validate that the size of shmem_res_id is <= cgraph_size
// need to add another method in the Virgl->APIR callback interface
GGML_UNUSED(shmem_res_id);


⚠️ Potential issue | 🔴 Critical

cgraph_size is still not validated against actual shared-memory length.

The placeholder on Line 44-46 means cgraph_size remains untrusted. That flows into decoder bounds at Line 94, which is a potential OOB-read vector.

Immediate mitigation (fail oversized payloads until exact shmem-size check exists)
 static uint32_t validate_graph_operation(size_t cgraph_size, uint32_t shmem_res_id, const char * operation) {
+    static constexpr size_t APIR_MAX_CGRAPH_SIZE = 1ull << 30; // 1 GiB hard safety cap
     if (cgraph_size == 0) {
         GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Zero-size computation graph\n", operation);
         return 1;
     }
+    if (cgraph_size > APIR_MAX_CGRAPH_SIZE) {
+        GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Oversized computation graph (%zu)\n", operation, cgraph_size);
+        return 1;
+    }
 
     // place-holder: validate that the size of shmem_res_id is <= cgraph_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines 44
- 47, The code currently ignores shmem_res_id (GGML_UNUSED) leaving cgraph_size
untrusted and allowing potential OOB reads when decoding; replace the
placeholder by retrieving the shared-memory buffer length for the given
shmem_res_id (or use an API that returns the mapped size) and immediately
validate that cgraph_size <= that shared-memory length, returning an
error/failing the request if it is larger; until you add the new Virgl->APIR
callback, implement an immediate mitigation that rejects any payload where
cgraph_size exceeds a conservative max or the known mapped size to prevent OOB
access in the decoder that reads using cgraph_size.

Comment on lines 111 to +117
 for (int idx = 0; idx < cgraph->n_nodes; idx++) {
     ggml_tensor * op = ggml_graph_node(cgraph, idx);
     if (dev->iface.supports_op(dev, op)) {
         continue;
     }
-    GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Graph node %d (%s) not supported by the backend\n", idx, ggml_op_desc(op));
+    GGML_LOG_ERROR(GGML_VIRTGPU_BCK "%s: Graph node %d (%s) not supported by the backend\n", __func__, idx,
+                   ggml_op_desc(op));

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "supports_op call-site context:"
rg -nP -C3 '\bsupports_op\s*\(' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp

echo
echo "ggml_backend_dev_t declarations in this file:"
rg -nP -C2 '\bggml_backend_dev_t\s+\w+' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp

echo
echo "APIR_BACKEND_CHECK_SUPPORTS_OP definitions/usages:"
rg -nP '\bAPIR_BACKEND_CHECK_SUPPORTS_OP\b' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp ggml/src/ggml-virtgpu/backend

Repository: crc-org/llama.cpp

Length of output: 1260


🏁 Script executed:

rg -nP '\bdev\b' ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp | head -30

Repository: crc-org/llama.cpp

Length of output: 111


dev is undefined; use device instead.

On line 113 the call uses dev, but the variable is declared as device on line 69. This causes a compile error when APIR_BACKEND_CHECK_SUPPORTS_OP == 1.

Proposed fix
-        if (dev->iface.supports_op(dev, op)) {
+        if (device->iface.supports_op(device, op)) {
             continue;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines
111 - 117, The loop that checks node support uses an undefined variable `dev`;
replace uses of `dev` with the correct `device` variable: update the call to the
backend support check from dev->iface.supports_op(dev, op) to
device->iface.supports_op(device, op) so the check uses the declared `device`
object (the loop iterating over cgraph->n_nodes and calling
ggml_graph_node/ggml_op_desc should remain unchanged).

Comment on lines +143 to +154
// Decode device handle first
uintptr_t device_handle;
apir_decode_uintptr_t(dec, &device_handle);
ggml_backend_dev_t device = (ggml_backend_dev_t) device_handle;

// Decode backend ID second
uint32_t backend_id;
apir_decode_uint32_t(dec, &backend_id);

// Cleanup specific backend instance
cleanup_backend_instance(device, backend_id);


⚠️ Potential issue | 🟠 Major

Validate decoded cleanup identifiers before using them.

Cleanup currently decodes and immediately uses IDs. If decode fails, Line 153 can run with invalid values.

Proposed fix
-    uintptr_t device_handle;
+    uintptr_t device_handle = 0;
     apir_decode_uintptr_t(dec, &device_handle);
     ggml_backend_dev_t device = (ggml_backend_dev_t) device_handle;
 
     // Decode backend ID second
-    uint32_t backend_id;
+    uint32_t backend_id = 0;
     apir_decode_uint32_t(dec, &backend_id);
+    if (apir_decoder_get_fatal(dec)) {
+        apir_decoder_set_fatal(dec);
+        return 1;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp` around lines
143 - 154, The code currently decodes device_handle and backend_id with
apir_decode_uintptr_t and apir_decode_uint32_t and immediately calls
cleanup_backend_instance(device, backend_id); update this to validate the decode
results before use: capture and check the return/error codes from
apir_decode_uintptr_t(dec, &device_handle) and apir_decode_uint32_t(dec,
&backend_id), only cast to ggml_backend_dev_t and call
cleanup_backend_instance(device, backend_id) when both decodes succeed, and
otherwise avoid calling cleanup_backend_instance and handle/log the decode
failure appropriately.

Comment on lines +30 to +33
struct apir_backend_instance {
ggml_backend_t bck; // The actual backend
uint32_t magic; // For validation: 0xAB1234CD
};

⚠️ Potential issue | 🟡 Minor

Misleading comment: magic value doesn't match the actual macro.

The comment on line 32 says 0xAB1234CD but APIR_BACKEND_INSTANCE_MAGIC (line 47) is defined as 0xCD4321BA.

Fix
 struct apir_backend_instance {
     ggml_backend_t bck;    // The actual backend
-    uint32_t       magic;  // For validation: 0xAB1234CD
+    uint32_t       magic;  // For validation: APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA)
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/backend-dispatched.h` around lines 30 - 33, The
comment on struct apir_backend_instance is wrong: update the inline magic
comment for apir_backend_instance to match the canonical constant
APIR_BACKEND_INSTANCE_MAGIC (0xCD4321BA) or vice-versa; specifically, locate
struct apir_backend_instance and the macro APIR_BACKEND_INSTANCE_MAGIC and make
the descriptive comment use the same hex value as the macro to avoid mismatch
(ensure only the comment is changed unless you intend to change the macro
value).

Comment on lines +109 to +110
extern std::unordered_set<ggml_backend_buffer_t> backend_buffers;
if (backend_buffers.find(buffer) == backend_buffers.end()) {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate global set declaration and all call sites touching tracked buffers.
rg -n -C3 '\bbackend_buffers\b|apir_track_backend_buffer\s*\(|apir_untrack_backend_buffer\s*\(|apir_get_track_backend_buffers\s*\(' ggml/src/ggml-virtgpu

# Inspect whether accesses are guarded by mutex/lock primitives nearby.
rg -n -C3 'std::mutex|std::lock_guard|std::scoped_lock|std::shared_mutex|std::unique_lock' ggml/src/ggml-virtgpu

Repository: crc-org/llama.cpp

Length of output: 15753


🏁 Script executed:

# Verify the implementation of apir_cs_ggml-rpc-back.cpp to see full context
head -50 ggml/src/ggml-virtgpu/backend/apir_cs_ggml-rpc-back.cpp

Repository: crc-org/llama.cpp

Length of output: 1860


🏁 Script executed:

# Check if backend_buffers is used in headers that might be included in multi-threaded contexts
rg -l 'apir_track_backend_buffer|apir_untrack_backend_buffer|apir_get_track_backend_buffers' ggml/src/ggml-virtgpu

Repository: crc-org/llama.cpp

Length of output: 394


Add synchronization to protect backend_buffers from concurrent access.

The global backend_buffers set in apir_cs_ggml-rpc-back.cpp lacks any mutex or lock protection. All accesses—including find() at lines 109-110, insert() in apir_track_backend_buffer(), and erase() in apir_untrack_backend_buffer()—operate on an unprotected shared state. Since std::unordered_set is not thread-safe, concurrent calls to track/untrack buffers can race with the validation check, potentially invalidating the security boundary or causing undefined behavior. Wrap all accesses with a mutex (e.g., std::mutex backend_buffers_lock) or use a thread-safe container.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h` around lines 109 - 110,
The global backend_buffers set is accessed concurrently without protection;
introduce a std::mutex (e.g., backend_buffers_lock) and use scoped locking
(std::lock_guard<std::mutex> or std::unique_lock<std::mutex>) around every
access to backend_buffers including the validation check that uses
backend_buffers.find(buffer) and the mutating calls in apir_track_backend_buffer
(insert) and apir_untrack_backend_buffer (erase) so all reads/writes are
serialized and thread-safe.
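The serialization the comment asks for is a few lines of standard C++. This sketch uses illustrative names (`tracked_buffers`, `track_buffer`, `is_tracked`) rather than the actual `apir_*` symbols:

```cpp
#include <cassert>
#include <mutex>
#include <unordered_set>

// Illustrative stand-ins for the backend-buffer tracking helpers.
static std::unordered_set<void *> tracked_buffers;
static std::mutex                 tracked_buffers_lock;

static void track_buffer(void *buf) {
    std::lock_guard<std::mutex> lock(tracked_buffers_lock);
    tracked_buffers.insert(buf);
}

static void untrack_buffer(void *buf) {
    std::lock_guard<std::mutex> lock(tracked_buffers_lock);
    tracked_buffers.erase(buf);
}

// The validation check must take the same lock as the mutators,
// otherwise a concurrent erase() can race with find().
static bool is_tracked(void *buf) {
    std::lock_guard<std::mutex> lock(tracked_buffers_lock);
    return tracked_buffers.find(buf) != tracked_buffers.end();
}
```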

Comment on lines +76 to 77
static std::atomic<bool> initialized = false;
static ggml_backend_buffer_type buft;

⚠️ Potential issue | 🟠 Major

Static singleton buffer-type cache is not device-safe.

Line 76 and Line 99 use one process-wide static buft per function. After first initialization, later devices get the first device’s .device/.context, which breaks per-device correctness.

Please make this cache device-scoped (e.g., store buffer types in ggml_backend_remoting_device_context and initialize once per device).

Also applies to: 99-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/ggml-backend-device.cpp` around lines 76 - 77, The code
currently uses a process-wide static cache (static std::atomic<bool> initialized
and static ggml_backend_buffer_type buft) which causes buft to hold the first
device's device/context for all subsequent devices; change the cache to be
device-scoped by moving ggml_backend_buffer_type buft (and the per-device
initialized flag) into the per-device context struct
ggml_backend_remoting_device_context and initialize buft once per device when
accessing it (replace the function-local static buft/initialized usage with
reads/writes to the fields on ggml_backend_remoting_device_context so each
device gets its own buffer type initialized with the correct device/context).
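Moving the cache into the per-device context can be sketched with stand-in types; `device_context` and `buffer_type` below are illustrative, not the real `ggml_backend_remoting_device_context`:

```cpp
#include <cassert>

// Illustrative buffer-type record bound to one device.
struct buffer_type {
    void *device  = nullptr;
    void *context = nullptr;
};

// Per-device context: the buffer type and its init flag live here
// instead of in a function-local static, so each device gets a buft
// bound to its own device pointer.
struct device_context {
    bool        buft_initialized = false;
    buffer_type buft;
};

static buffer_type *get_buft(device_context *ctx, void *device) {
    if (!ctx->buft_initialized) {
        ctx->buft.device      = device;  // bound to *this* device, not the first caller's
        ctx->buft.context     = ctx;
        ctx->buft_initialized = true;
    }
    return &ctx->buft;
}
```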

Comment on lines +31 to +36
if (ret == 0) {
// Success - decode the response
apir_decode_uintptr_t(decoder, &device_handle);
apir_decode_uint32_t(decoder, &backend_id);
GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__, (void *) device_handle,
backend_id);

⚠️ Potential issue | 🟠 Major

Check decoder fatal state before returning initialize success.

On Line 31, success is determined only by ret == 0. If decoding on Line 33-34 fails, this can still return success with zeroed IDs.

Proposed fix
     if (ret == 0) {
         // Success - decode the response
         apir_decode_uintptr_t(decoder, &device_handle);
         apir_decode_uint32_t(decoder, &backend_id);
-        GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__, (void *) device_handle,
-                      backend_id);
+        if (apir_decoder_get_fatal(decoder)) {
+            GGML_LOG_ERROR(GGML_VIRTGPU "%s: Malformed initialize response payload\n", __func__);
+            ret = 1;
+            device_handle = 0;
+            backend_id = 0;
+        } else {
+            GGML_LOG_INFO(GGML_VIRTGPU "%s: Success - device_handle=%p, backend_id=%u\n", __func__,
+                          (void *) device_handle, backend_id);
+        }
     } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp` around lines 31 - 36, The
code currently treats ret == 0 as success before verifying that decoding
succeeded; after calling apir_decode_uintptr_t(decoder, &device_handle) and
apir_decode_uint32_t(decoder, &backend_id) you must check the decoder's
fatal/error state and only consider initialization successful if decoding
produced no errors. Update the block that currently checks ret to call the
decoder error/fatal-check API (e.g., apir_decoder_fatal(decoder) or equivalent)
after the two apir_decode_* calls, set ret to a non-zero error code and log an
error if decoding failed, and only log and return success when the decoder
indicates no fatal error (use the same symbols: decoder, device_handle,
backend_id, apir_decode_uintptr_t, apir_decode_uint32_t).


 if (virtgpu_shmem_create(gpu, size, &buffer_context.shmem)) {
-    GGML_ABORT(GGML_VIRTGPU "Couldn't allocate the guest-host shared buffer");
+    GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %ldb of guest-host shared buffer", __func__, size);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, verify the exact code at line 176
sed -n '170,180p' ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp

Repository: crc-org/llama.cpp

Length of output: 443


🏁 Script executed:

# Check the function signature to confirm size parameter type
sed -n '1,200p' ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp | grep -B20 'Couldn.t allocate.*shared buffer'

Repository: crc-org/llama.cpp

Length of output: 810


🏁 Script executed:

# Scan for %ld format specifiers in virtgpu code
rg '%ld' ggml/src/ggml-virtgpu/ -n

Repository: crc-org/llama.cpp

Length of output: 220


🏁 Script executed:

# Search for %ld patterns in all virtgpu files
rg '%ld' ggml/src/ggml-virtgpu/ --type cpp --type h -n

Repository: crc-org/llama.cpp

Length of output: 220


Use %zu instead of %ld for size_t parameter in format string.

Line 176 formats the size_t value size with %ld, which expects a signed long and is not portable to platforms where size_t and long differ in width. Use %zu, the standard format specifier for size_t since C99.

Proposed fix
-        GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %ldb of guest-host shared buffer", __func__, size);
+        GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %zuB of guest-host shared buffer", __func__, size);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp` at line 176, The format
string in the GGML_ABORT call that uses GGML_VIRTGPU and __func__ incorrectly
prints the size_t variable size with "%ld"; update the format specifier to "%zu"
so the call GGML_ABORT(GGML_VIRTGPU "%s: Couldn't allocate %zub of guest-host
shared buffer", __func__, size) correctly and portably formats the size_t value
(no cast needed).
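A minimal demonstration of the portable specifier, using `snprintf` so the result can be checked; `format_size` is a hypothetical helper, not part of the codebase:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// %zu is the standard length modifier for size_t (C99/C++11); %ld
// expects a signed long, whose width can differ from size_t
// (e.g. 32-bit long vs 64-bit size_t on LLP64 Windows).
static void format_size(char *out, size_t out_len, size_t size) {
    std::snprintf(out, out_len, "allocating %zuB", size);
}
```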

Comment on lines +902 to +903
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the specific functions mentioned in the review
rg -n -A 5 'ggml_get_type_traits|ggml_blck_size|ggml_type_size|ggml_row_size|ggml_type_name|ggml_is_quantized' ggml/src/ggml.c | head -150

Repository: crc-org/llama.cpp

Length of output: 5637


🏁 Script executed:

# Check if GGML_ASSERT is defined and how it differs from assert
rg -n 'define GGML_ASSERT|define ASSERT' ggml/src/ggml.c ggml/src/ggml.h

Repository: crc-org/llama.cpp

Length of output: 189


🏁 Script executed:

# Check for NDEBUG usage in the codebase
rg -n 'NDEBUG|DNDEBUG' --type-list | head -20

Repository: crc-org/llama.cpp

Length of output: 450


🏁 Script executed:

# Check the build configuration files
fd 'CMakeLists|Makefile|build' | head -20

Repository: crc-org/llama.cpp

Length of output: 616


🏁 Script executed:

# Search for GGML_ASSERT definition more broadly
rg -n 'define GGML_ASSERT' ggml/src/

Repository: crc-org/llama.cpp

Length of output: 194


🏁 Script executed:

# Check the header files in ggml directory
fd -e h ggml/

Repository: crc-org/llama.cpp

Length of output: 386


🏁 Script executed:

# Look in ggml.h for GGML_ASSERT
cat ggml/include/ggml.h | head -200

Repository: crc-org/llama.cpp

Length of output: 7332


🏁 Script executed:

# Check CMakeLists.txt for NDEBUG or release configurations
rg -n 'NDEBUG|CMAKE_BUILD_TYPE|Release' CMakeLists.txt

Repository: crc-org/llama.cpp

Length of output: 345


🏁 Script executed:

# Verify which assert macro is actually used - check includes
head -100 ggml/src/ggml.c | grep -E '^#include|^#define'

Repository: crc-org/llama.cpp

Length of output: 1085


🏁 Script executed:

# Search entire repo for GGML_ASSERT usage (not just definition)
rg -n 'GGML_ASSERT\(' --type c --type h | head -50

Repository: crc-org/llama.cpp

Length of output: 3427


🏁 Script executed:

# Check if GGML_ASSERT is maybe defined in a common header
cat ggml/include/ggml.h | grep -A5 'define GGML_ASSERT'

Repository: crc-org/llama.cpp

Length of output: 313


🏁 Script executed:

# Look for any assert-related definitions in ggml headers
rg -n 'define.*ASSERT' ggml/include/

Repository: crc-org/llama.cpp

Length of output: 158


🏁 Script executed:

# Check ggml-impl.h and ggml-backend.h
rg -n 'define.*ASSERT|GGML_ASSERT' ggml/src/ggml-impl.h ggml/src/ggml-backend.h 2>/dev/null | head -20

Repository: crc-org/llama.cpp

Length of output: 268


🏁 Script executed:

# Check git history for ggml_type_name to see if it had a fallback
git log -p --all -S 'ggml_type_name' -- ggml/src/ggml.c | head -100

Repository: crc-org/llama.cpp

Length of output: 43


🏁 Script executed:

# Check what GGML_ABORT does
rg -n 'define GGML_ABORT' ggml/include/ggml.h -A 3

Repository: crc-org/llama.cpp

Length of output: 311


Replace assert() with GGML_ASSERT() in type accessor functions—debug-only checks vanish in Release builds

Type bounds checks at lines 902–903, 1269–1295 use C's standard assert(), which is disabled when NDEBUG is defined (default in Release builds). This leaves type_traits[type] open to out-of-bounds reads if invalid enum values are passed. ggml_type_name() also dropped its fallback and now unconditionally indexes the array.

Replace assert() with GGML_ASSERT(), which is not affected by NDEBUG and will abort on failure. For ggml_type_name(), use a runtime check with a fallback to avoid OOB reads:

Proposed fix
 const struct ggml_type_traits * ggml_get_type_traits(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return &type_traits[type];
 }

 int64_t ggml_blck_size(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].blck_size;
 }

 size_t ggml_type_size(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].type_size;
 }

 size_t ggml_row_size(enum ggml_type type, int64_t ne) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     assert(ne % ggml_blck_size(type) == 0);
     return ggml_type_size(type)*ne/ggml_blck_size(type);
 }

 const char * ggml_type_name(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
-    return type_traits[type].type_name;
+    if ((unsigned) type >= GGML_TYPE_COUNT) {
+        return "NONE";
+    }
+    return type_traits[type].type_name;
 }

 bool ggml_is_quantized(enum ggml_type type) {
-    assert(type >= 0);
-    assert(type < GGML_TYPE_COUNT);
+    GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT);
     return type_traits[type].is_quantized;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ggml/src/ggml.c` around lines 902 - 903, Replace the standard C assert()
calls in the type accessor functions with GGML_ASSERT() so checks remain active
in Release builds: change assertions that validate "type >= 0" and "type <
GGML_TYPE_COUNT" (and any similar checks around type_traits access) to
GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT). Also update ggml_type_name()
to perform a runtime bounds check against GGML_TYPE_COUNT before indexing
type_traits and return a safe fallback string (e.g., "unknown") when the check
fails to avoid out-of-bounds reads.
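The difference between the two assertion macros, plus the bounds-checked name lookup, can be shown with stand-ins (`ALWAYS_ASSERT` models `GGML_ASSERT`, whose real definition lives in ggml.h; the `demo_*` names are hypothetical):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-in for GGML_ASSERT: unlike plain assert(), it is not compiled
// out when NDEBUG is defined, so the check survives Release builds.
#define ALWAYS_ASSERT(x)                                       \
    do {                                                       \
        if (!(x)) {                                            \
            std::fprintf(stderr, "assert failed: %s\n", #x);   \
            std::abort();                                      \
        }                                                      \
    } while (0)

enum demo_type { DEMO_TYPE_A, DEMO_TYPE_B, DEMO_TYPE_COUNT };

static const char *type_names[DEMO_TYPE_COUNT] = { "A", "B" };

// Bounds-checked lookup with a safe fallback, mirroring the
// ggml_type_name() change proposed above: never index the table
// with an unvalidated enum value.
static const char *demo_type_name(int type) {
    if ((unsigned) type >= DEMO_TYPE_COUNT) {
        return "unknown";
    }
    return type_names[type];
}
```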

@topsail-bot
Copy link

topsail-bot bot commented Feb 25, 2026

🔴 Test of 'mac_ai test pre_cleanup_ci' failed after 00 hours 00 minutes 27 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: topsail
PR_POSITIONAL_ARG_0: topsail
PR_POSITIONAL_ARG_1: remoting_mac

Failure indicator: Empty. (See run.log)

@kpouget
Copy link
Collaborator Author

kpouget commented Feb 25, 2026

/test topsail remoting_mac
/cluster mac


@topsail-bot
Copy link

topsail-bot bot commented Feb 25, 2026

🟢 Test of 'mac_ai test test_ci' succeeded after 00 hours 27 minutes 28 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: topsail
PR_POSITIONAL_ARG_0: topsail
PR_POSITIONAL_ARG_1: remoting_mac

@kpouget
Copy link
Collaborator Author

kpouget commented Feb 25, 2026

/test topsail remoting_linux
/cluster ibm_vm

@topsail-bot
Copy link

topsail-bot bot commented Feb 25, 2026

🔴 Test of 'mac_ai test test_ci' failed after 00 hours 03 minutes 46 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: topsail
PR_POSITIONAL_ARG_0: topsail
PR_POSITIONAL_ARG_1: remoting_linux

Failure indicator:

/tmp/topsail_202602251772056035/000__podman_ramalama_remoting_test/002__mac_ai__remote_ramalama_run_model/FAILURE | [002__mac_ai__remote_ramalama_run_model] ./run_toolbox.py mac_ai remote_ramalama_run_model --base_work_dir=/home/topsail_ci --path=/home/topsail_ci/ramalama-ai/ramalama-0.17.0-apir.0.1.4-rc10/bin/ramalama --device=/dev/dri --env=PYTHONPATH=/home/topsail_ci/ramalama-ai/ramalama-0.17.0-apir.0.1.4-rc10 RAMALAMA_CONTAINER_ENGINE=podman HOME=/home/topsail_ci LD_LIBRARY_PATH=/home/topsail_ci/virglrenderer/v1.2.0-remoting-0.3.6-linux/build/src VIRGL_ROUTE_VENUS_TO_APIR=1 RENDER_SERVER_EXEC_PATH=/usr/libexec/virgl_render_server --model_name=ollama://llama3.2 --image=localhost/remoting:v0.17.0-apir.0.1.4-rc10 --unload=False --port=11434 --oci_runtime=krun --> 2
/tmp/topsail_202602251772056035/000__podman_ramalama_remoting_test/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/tmp/topsail_202602251772056035/000__podman_ramalama_remoting_test" ./run_toolbox.py mac_ai remote_ramalama_run_model --base_work_dir='/home/topsail_ci' --path='/home/topsail_ci/ramalama-ai/ramalama-0.17.0-apir.0.1.4-rc10/bin/ramalama' --device='/dev/dri' --env='PYTHONPATH='/home/topsail_ci/ramalama-ai/ramalama-0.17.0-apir.0.1.4-rc10' RAMALAMA_CONTAINER_ENGINE='podman' HOME='/home/topsail_ci' LD_LIBRARY_PATH='/home/topsail_ci/virglrenderer/v1.2.0-remoting-0.3.6-linux/build/src' VIRGL_ROUTE_VENUS_TO_APIR='1' RENDER_SERVER_EXEC_PATH='/usr/libexec/virgl_render_server'' --model_name='ollama://llama3.2' --image='localhost/remoting:v0.17.0-apir.0.1.4-rc10' --unload='False' --port='11434' --oci_runtime='krun'' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/mac_ai/testing/test_mac_ai.py", line 148, in test_all_platforms
    test_inference(platform)
  File "/opt/topsail/src/projects/mac_ai/testing/test_mac_ai.py", line 262, in test_inference
    model_id = platform.inference_server_mod.run_model(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/topsail/src/projects/mac_ai/testing/ramalama.py", line 253, in run_model
    _run_from_toolbox(

[...]

@kpouget
Copy link
Collaborator Author

kpouget commented Feb 26, 2026

/test topsail remoting_mac
/cluster mac

@topsail-bot
Copy link

topsail-bot bot commented Feb 26, 2026

🟢 Test of 'mac_ai test test_ci' succeeded after 00 hours 26 minutes 06 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: topsail
PR_POSITIONAL_ARG_0: topsail
PR_POSITIONAL_ARG_1: remoting_mac
