ModelEngine-Group · dante159753 · Jun 4, 2026 · Jun 5, 2026
@@ -2,21 +2,24 @@
 
 UCM exports metrics through the vLLM `/metrics` endpoint. The metrics are
 registered from `examples/metrics/metrics_configs.yaml`, accumulated inside UCM,
-and exposed through `prometheus_client` in Prometheus multiprocess mode.
+and fanned out to the enabled Python-side consumers.
 
 ## How Metrics Flow
 
 1. `metrics_configs.yaml` defines counters, gauges, and histograms.
-2. `PrometheusStatsLogger` creates matching `prometheus_client` metrics with
-   `model_name` and `worker_id` labels.
-3. Histogram bucket boundaries are taken from the Python Prometheus histogram
+2. The Python metrics dispatcher drains the C++ metrics snapshot once and fans
+   it out to the enabled `multiproc` and `vllm_connector` consumers.
+3. `multiproc` creates `prometheus_client` metrics with `model_name` and
+   `worker_id` labels. `vllm_connector` creates vLLM KV connector metrics with
+   `model_name`, `engine`, and `worker_rank` labels.
+4. Histogram bucket boundaries are taken from the Python Prometheus histogram
    and registered into the C++ metrics library.
-4. UCM code calls `UpdateStats()` on the hot path.
-5. The C++ metrics library records counter, gauge, and histogram bucket deltas in
+5. UCM code calls `UpdateStats()` on the hot path.
+6. The C++ metrics library records counter, gauge, and histogram bucket deltas in
    per-thread double buffers.
-6. Every `log_interval` seconds, the observability thread calls
-   `get_all_stats_and_clear()` and applies the deltas to `prometheus_client`.
-7. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
+7. The dispatcher applies the deltas to every enabled Python consumer; no
+   consumer clears another consumer's accumulated snapshot.
+8. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
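The drain-once, fan-out behaviour described in steps 2 and 7 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not UCM's dispatcher: `Consumer`, `dispatch`, and `fake_drain` are hypothetical names, and only counter deltas are modelled.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]  # counter name -> delta since the previous drain


class Consumer:
    """Hypothetical consumer that keeps its own cumulative totals."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.totals: Dict[str, float] = {}

    def apply(self, deltas: Snapshot) -> None:
        # Add each delta to this consumer's private running total.
        for name, delta in deltas.items():
            key = self.prefix + name
            self.totals[key] = self.totals.get(key, 0.0) + delta


def dispatch(drain: Callable[[], Snapshot], consumers: List[Consumer]) -> None:
    """Drain the source exactly once, then hand the same deltas to everyone."""
    deltas = drain()  # single destructive read of the snapshot
    for consumer in consumers:
        consumer.apply(deltas)


# Stand-in for the C++ snapshot: each call returns and clears pending deltas.
pending = [{"cache_load_bytes_total": 1024.0}, {"cache_load_bytes_total": 512.0}]
drain_calls = []


def fake_drain() -> Snapshot:
    drain_calls.append(1)
    return pending.pop(0)


multiproc = Consumer("ucm_multiproc:")
connector = Consumer("ucm:")
for _ in range(2):
    dispatch(fake_drain, [multiproc, connector])
```

Because the snapshot is read once per interval and then shared, both consumers end up with the same cumulative totals under their own prefixes. If each consumer drained the source itself, the first reader would clear the deltas before the second one saw them.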
 
 Histograms are bucketed at update time. UCM no longer stores raw histogram
 sample vectors, so there is no `histogram_max_length` setting and no histogram
@@ -83,14 +86,14 @@ vllm bench serve \
     --ignore-eos
 ```
 
-Check that UCM metrics are present:
+Check that UCM vLLM connector metrics are present:
 
 ```bash
-curl http://<vllm-worker-ip>:8000/metrics | grep ucm:
+curl http://<vllm-worker-ip>:8000/metrics | grep 'ucm:'
 ```
 
-Prometheus multiprocess `.db` files should also appear in
-`$PROMETHEUS_MULTIPROC_DIR`.
+If the `multiproc` consumer is enabled, Prometheus multiprocess `.db` files
+should also appear in `$PROMETHEUS_MULTIPROC_DIR`.
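The same check can be scripted. The sketch below filters Prometheus text exposition output for series whose name starts with the configured prefix. `ucm_series` is a hypothetical helper, and the sample text is invented for illustration; it is not real output from a vLLM worker.

```python
def ucm_series(exposition_text: str, prefix: str = "ucm:") -> set:
    """Return the names of all sample series that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The series name ends at the label block or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


# Invented sample in the Prometheus text format (values are placeholders).
SAMPLE = """\
# HELP ucm:load_speed Speed of loading from UCM in GB/s.
# TYPE ucm:load_speed histogram
ucm:load_speed_bucket{le="1.0",model_name="m"} 3
ucm:load_speed_count{model_name="m"} 3
vllm:num_requests_running 1.0
"""

found = ucm_series(SAMPLE)
```

In practice the input would be the body returned by the `/metrics` endpoint shown above; an empty result suggests that the `vllm_connector` consumer is disabled or that a different prefix is configured.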
 
 ### 2. Start Prometheus and Grafana
 
@@ -158,25 +161,30 @@ dashboards while preserving the time range and `model_name` value.
 Each dashboard has a `job` selector. It defaults to **All** and uses regex
 matching, so dashboards also work for metrics that do not carry a `job` label.
 
-The UCM dashboards also have a `View` selector and a `worker_id` selector:
+The UCM dashboards also have a `View` selector and a `worker_rank` selector:
 
 - **Aggregated**: default service-level view. Worker labels are collapsed.
-- **Per Worker**: split panels by `worker_id` for worker-specific diagnosis.
-- **worker_id**: defaults to **All**. Select a specific worker ID to filter all
+- **Per Worker**: split panels by `worker_rank` for worker-specific diagnosis.
+- **worker_rank**: defaults to **All**. Select a specific worker rank to filter all
   UCM panels to that worker only.
 
 Heatmap panels and panels grouped by another dimension may ignore the `View`
 selector because their grouping is already defined by the panel. They still use
-the `worker_id` filter.
+the `worker_rank` filter.
 
 ## Metrics Configuration
 
 Metrics are configured in `examples/metrics/metrics_configs.yaml`:
 
 ```yaml
 log_interval: 5
-multiproc_dir: "/vllm-workspace"
-metric_prefix: "ucm:"
+# multiproc_dir: "/vllm-workspace"
+# multiproc_prefix: "ucm_multiproc:"
+vllm_connector_prefix: "ucm:"
+
+consumers:
+  # multiproc: true
+  vllm_connector: true
 
 counter:
   - name: "cache_load_bytes_total"
@@ -193,17 +201,22 @@ histogram:
     buckets: [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
 ```
 
-Metric names are exported with the configured prefix. For example,
-`cache_load_duration_ms` becomes `ucm:cache_load_duration_ms`. Prometheus also
-exports histogram helper series such as `_bucket`, `_sum`, and `_count`.
+Metric names are exported per consumer. For example, `cache_load_duration_ms`
+is exported as `ucm:cache_load_duration_ms` by the default `vllm_connector`
+consumer. If `multiproc` is also enabled, give it a separate prefix such as
+`ucm_multiproc:` so that the two consumers never register the same Prometheus
+metric name. Prometheus also exports histogram helper series such as
+`_bucket`, `_sum`, and `_count`.
 
 Counter values are increments. Gauge values replace the current value.
 Histogram values are observations that are immediately assigned to configured
 buckets in the C++ metrics library.
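The three update semantics can be illustrated with a small Python model. It is a sketch only: the real bucketing happens in the C++ metrics library, `Metrics` is a hypothetical class, and the bucket rule assumes Prometheus-style inclusive upper bounds (`le`) with a final `+Inf` slot.

```python
from bisect import bisect_left

# Bucket upper bounds copied from the example configuration above.
BOUNDS = [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]


class Metrics:
    """Toy model of counter, gauge, and histogram updates."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.counters = {}
        self.gauges = {}
        self.bucket_counts = {}  # name -> per-bucket counts, last slot is +Inf

    def update(self, kind, name, value):
        if kind == "counter":
            # Counter values are increments.
            self.counters[name] = self.counters.get(name, 0) + value
        elif kind == "gauge":
            # Gauge values replace the current value.
            self.gauges[name] = value
        elif kind == "histogram":
            # The observation is bucketed immediately; the raw value is dropped.
            counts = self.bucket_counts.setdefault(
                name, [0] * (len(self.bounds) + 1)
            )
            # First bound with bound >= value; past the end means +Inf.
            counts[bisect_left(self.bounds, value)] += 1
        else:
            raise ValueError(f"unknown metric kind: {kind}")


m = Metrics(BOUNDS)
m.update("counter", "cache_load_bytes_total", 100)
m.update("counter", "cache_load_bytes_total", 50)
m.update("gauge", "example_gauge", 3)  # hypothetical gauge name
m.update("gauge", "example_gauge", 7)
for sample in (0.5, 7.0, 9999.0):
    m.update("histogram", "cache_load_duration_ms", sample)
```

Only per-bucket counts are retained for histograms, which is why no raw sample vector has to be stored between export intervals.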
 
 ## Available Metrics
 
-The default metrics configuration contains the following UCM metrics.
+The default metrics configuration contains the following UCM metric names. The
+tables below use the default `vllm_connector_prefix` (`ucm:`). UCM duration
+metrics are exported in milliseconds.
 
 ### Counters
 
@@ -240,30 +253,45 @@ The default metrics configuration contains the following UCM metrics.
 | `ucm:load_speed` | Speed of loading from UCM in GB/s. |
 | `ucm:save_requests_num` | Number of requests saved to UCM. |
 | `ucm:save_blocks_num` | Number of blocks saved to UCM. |
-| `ucm:save_duration` | Time to save to UCM in milliseconds. |
-| `ucm:save_speed` | Speed of saving to UCM in GB/s. |
+| `ucm:save_duration` | Total asynchronous time from UCM connector dump submission to completion, in milliseconds. |
+| `ucm:save_speed` | Speed of saving to UCM in GB/s, based on the asynchronous submit-to-completion time. |
+| `ucm:save_completion_wait_duration` | Time spent blocked while confirming asynchronous UCM connector dump completion, in milliseconds. |
+| `ucm:save_completion_wait_speed` | Speed of saving to UCM in GB/s, based on the blocking wait for asynchronous completion. |
 | `ucm:interval_lookup_hit_rates` | Hit rates of UCM lookup requests. |
 | `ucm:cache_lookup_duration_ms` | Cache buffer lookup wall-clock time. |
 | `ucm:cache_lookup_backend_duration_ms` | Backend lookup wall-clock time when descending due to no buffer or buffer miss. |
 | `ucm:cache_load_duration_ms` | End-to-end Cache stage load task duration in milliseconds. |
 | `ucm:cache_dump_duration_ms` | End-to-end Cache stage dump task duration in milliseconds. |
-| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load bandwidth in GB/s. |
-| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump bandwidth in GB/s. |
+| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load throughput in GB/s over the whole task lifetime, including queue/backend waits. Not a DMA bandwidth (see `cache_h2d_bandwidth_gbps`). |
+| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump throughput in GB/s over the whole task lifetime, including queue and compute-event waits. Not a DMA bandwidth (see `cache_d2h_bandwidth_gbps`). |
 | `ucm:cache_load_queue_wait_duration_ms` | Time a Cache load task spent queued before dispatch worker pickup. |
 | `ucm:cache_dump_queue_wait_duration_ms` | Time a Cache dump task spent queued before dispatch worker pickup. |
-| `ucm:cache_load_dispatch_duration_ms` | Cache load dispatch cost: buffer allocation plus backend submission. |
+| `ucm:cache_load_backend_submit_duration_ms` | Cache load backend submit duration: buffer allocation plus backend load submission. |
 | `ucm:cache_shard_backend_wait_ms` | Cache load per-shard `WaitBackendTaskReady()` duration. |
-| `ucm:cache_shard_h2d_ms` | Cache load per-shard H2D async submit duration. |
-| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration. |
-| `ucm:cache_d2h_duration_ms` | Cache dump D2H stream sync phase duration. |
+| `ucm:cache_h2d_submit_ms` | Cache load per-shard H2D async submit CPU cost. Submission only, not the transfer. |
+| `ucm:cache_h2d_sync_ms` | Cache load residual H2D stream drain after the last shard submit. Large values mean H2D copy is the bottleneck. |
+| `ucm:cache_h2d_bandwidth_gbps` | Cache load pure H2D copy bandwidth, directly comparable to memcpy microbenchmarks. |
+| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration (buffer allocation/reuse plus D2H async submit). |
+| `ucm:cache_dump_prereq_wait_ms` | Cache dump wait for the prerequisite compute event before D2H can start. Large values mean dump is compute-gated. |
+| `ucm:cache_d2h_duration_ms` | Cache dump pure D2H copy drain, compute-event wait excluded. |
+| `ucm:cache_d2h_bandwidth_gbps` | Cache dump pure D2H copy bandwidth, directly comparable to memcpy microbenchmarks. |
 | `ucm:cache_dump_backend_submit_duration_ms` | Cache dump synchronous backend submit duration. |
+| `ucm:cache_dump_backend_wait_duration_ms` | Cache dump wait for the lower tier to finish writing. Large values mean storage write is the bottleneck. |
 | `ucm:posix_load_task_duration_ms` | End-to-end Posix load task duration. |
 | `ucm:posix_dump_task_duration_ms` | End-to-end Posix dump task duration. |
 | `ucm:posix_s2h_bandwidth_gbps` | Posix stage read bandwidth per task in GB/s. |
 | `ucm:posix_h2s_bandwidth_gbps` | Posix stage write bandwidth per task in GB/s. |
 | `ucm:posix_load_queue_wait_duration_ms` | Time a Posix load task spent queued before first worker pickup. |
 | `ucm:posix_dump_queue_wait_duration_ms` | Time a Posix dump task spent queued before first worker pickup. |
 | `ucm:layerwise_batch_total_ms` | Layerwise batch wall-clock time from `start_load_kv()` entry to `wait_for_save()` return. |
+| `ucm:layerwise_batch_total_load_only_ms` | Layerwise load-only batch wall-clock time. |
+| `ucm:layerwise_batch_total_save_only_ms` | Layerwise save-only batch wall-clock time. |
+| `ucm:layerwise_batch_total_load_save_ms` | Layerwise load-and-save batch wall-clock time. |
+| `ucm:layerwise_batch_total_no_transfer_ms` | Layerwise batch wall-clock time with neither load nor save work. |
+| `ucm:layerwise_batch_load_wait_total_load_only_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-only layerwise batch. |
+| `ucm:layerwise_batch_load_wait_total_load_save_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-and-save layerwise batch. |
+| `ucm:layerwise_batch_save_tail_save_only_ms` | `wait_for_save()` tail duration within one save-only layerwise batch. |
+| `ucm:layerwise_batch_save_tail_load_save_ms` | `wait_for_save()` tail duration within one load-and-save layerwise batch. |
 | `ucm:layerwise_wait_blocking_ms` | Time `wait_for_layer_load()` blocked before returning. |
 | `ucm:layerwise_wait_tasks_count` | Number of per-request load tasks awaited in a single layer wait. |
 | `ucm:layerwise_inter_wait_interval_ms` | Interval between consecutive `wait_for_layer_load()` calls. |