27 commits
5cdcf55
[Feat] Add Python-side UCM metrics dispatcher
dante159753 Jun 4, 2026
74e6f97
[Feat] Update Grafana dashboards for vLLM UCM metrics
dante159753 Jun 4, 2026
749d197
Add layerwise batch breakdown metrics
dante159753 Jun 5, 2026
3e373d0
Merge branch 'layerwise-batch-breakdown' into ucm-vllm-connector-metrics
dante159753 Jun 5, 2026
954a4a8
[Feat] Align vLLM UCM metrics dashboards
dante159753 Jun 5, 2026
052a683
[Feat] Add direct connector task breakdown metrics
dante159753 Jun 5, 2026
1209a79
[Feat] Isolate vLLM UCM Grafana dashboards
dante159753 Jun 5, 2026
33ab6f1
[Feat] Add engine filters to UCM dashboards
dante159753 Jun 8, 2026
1805a06
[Feat] Restore aggregated UCM dashboard view
dante159753 Jun 8, 2026
b53e42b
[Feat] Refine UCM dashboard view legends
dante159753 Jun 8, 2026
246fcbc
[Feat] Aggregate UCM dashboards across engines
dante159753 Jun 8, 2026
2446f9b
Merge remote-tracking branch 'origin/develop' into ucm-vllm-connector…
dante159753 Jun 9, 2026
9576aac
[Feat] Default UCM metrics to vLLM connector
dante159753 Jun 9, 2026
7f8fd50
Merge branch 'develop' into ucm-vllm-connector-metrics
dante159753 Jun 9, 2026
1ecf751
[Feat] Add vLLM cache hit rate panels
dante159753 Jun 9, 2026
a67f375
[Feat] Show direct connector breakdown latency
dante159753 Jun 9, 2026
3799ee3
[Feat] Keep combined vLLM cache hit rate panel
dante159753 Jun 9, 2026
082a70e
[Feat] Disable UCM connector CLI summary
dante159753 Jun 9, 2026
129fdb0
[Feat] Preserve legacy UCM metric names
dante159753 Jun 10, 2026
ee8fcb1
Merge branch 'develop' into ucm-vllm-connector-metrics
dante159753 Jun 11, 2026
d5f7185
[Feat] Logger redesign (#1015)
dante159753 Jun 11, 2026
edc4680
[Feat] Split cache H2D D2H metric phases
dante159753 Jun 11, 2026
ab03186
[Feat] Add cache H2D bandwidth metrics
dante159753 Jun 12, 2026
84f396a
[Feat] Adapt UCM metrics to async dump completion
dante159753 Jun 15, 2026
af4046f
[Feat] Add HMA connector metrics
dante159753 Jun 15, 2026
680d3a5
[Refactor] Simplify HMA metrics completion tracking
dante159753 Jun 17, 2026
0dc784e
Merge remote-tracking branch 'origin/develop' into ucm-vllm-connector…
dante159753 Jun 17, 2026
90 changes: 59 additions & 31 deletions docs/source/user-guide/metrics/metrics.md
@@ -2,21 +2,24 @@

UCM exports metrics through the vLLM `/metrics` endpoint. The metrics are
registered from `examples/metrics/metrics_configs.yaml`, accumulated inside UCM,
and exposed through `prometheus_client` in Prometheus multiprocess mode.
and fanned out to the enabled Python-side consumers.

## How Metrics Flow

1. `metrics_configs.yaml` defines counters, gauges, and histograms.
2. `PrometheusStatsLogger` creates matching `prometheus_client` metrics with
`model_name` and `worker_id` labels.
3. Histogram bucket boundaries are taken from the Python Prometheus histogram
2. The Python metrics dispatcher drains the C++ metrics snapshot once and fans
it out to the enabled `multiproc` and `vllm_connector` consumers.
3. `multiproc` creates `prometheus_client` metrics with `model_name` and
`worker_id` labels. `vllm_connector` creates vLLM KV connector metrics with
`model_name`, `engine`, and `worker_rank` labels.
4. Histogram bucket boundaries are taken from the Python Prometheus histogram
and registered into the C++ metrics library.
4. UCM code calls `UpdateStats()` on the hot path.
5. The C++ metrics library records counter, gauge, and histogram bucket deltas in
5. UCM code calls `UpdateStats()` on the hot path.
6. The C++ metrics library records counter, gauge, and histogram bucket deltas in
per-thread double buffers.
6. Every `log_interval` seconds, the observability thread calls
`get_all_stats_and_clear()` and applies the deltas to `prometheus_client`.
7. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
7. The dispatcher applies deltas to each enabled Python consumer without one
consumer clearing the other's accumulated snapshot.
8. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
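
The drain-once, fan-out step described above can be sketched as follows. This is a minimal illustration, not the UCM implementation: `Dispatcher`, `CumulativeConsumer`, and `make_fake_drain` are hypothetical names, the drain callable stands in for the C++ snapshot call, and only counter-style deltas are modeled.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]  # metric name -> delta since the previous drain


class CumulativeConsumer:
    """Hypothetical consumer that folds deltas into cumulative totals."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}

    def apply(self, deltas: Snapshot) -> None:
        for name, delta in deltas.items():
            self.totals[name] = self.totals.get(name, 0.0) + delta
        deltas.clear()  # clears only this consumer's private copy


class Dispatcher:
    """Drains the snapshot once per tick and hands a copy to each consumer."""

    def __init__(self, drain: Callable[[], Snapshot]) -> None:
        self._drain = drain  # assumed stand-in for the C++ snapshot drain
        self._consumers: List[CumulativeConsumer] = []

    def register(self, consumer: CumulativeConsumer) -> None:
        self._consumers.append(consumer)

    def tick(self) -> None:
        snapshot = self._drain()  # single drain per interval
        for consumer in self._consumers:
            consumer.apply(dict(snapshot))  # copy, so consumers stay isolated


def make_fake_drain(pending: Snapshot) -> Callable[[], Snapshot]:
    """Build a drain that returns the pending deltas and then resets them."""

    def drain() -> Snapshot:
        snapshot = dict(pending)
        pending.clear()
        return snapshot

    return drain
```

Because each consumer receives its own copy, a consumer that clears or mutates its input cannot starve the other one, which is the property step 7 describes.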

Histograms are bucketed at update time. UCM no longer stores raw histogram
sample vectors, so there is no `histogram_max_length` setting and no histogram
@@ -83,14 +86,14 @@ vllm bench serve \
--ignore-eos
```

Check that UCM metrics are present:
Check that UCM vLLM connector metrics are present:

```bash
curl http://<vllm-worker-ip>:8000/metrics | grep ucm:
curl http://<vllm-worker-ip>:8000/metrics | grep 'ucm:'
```
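
The same check can be scripted. The sketch below is an assumption-laden helper, not part of UCM: `fetch_metrics` and `ucm_sample_lines` are hypothetical names, and the URL must be replaced with the real worker address. Unlike the `grep` above, it keeps only sample lines and drops `# HELP` and `# TYPE` comments.

```python
from typing import List
from urllib.request import urlopen


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the Prometheus text exposition from a /metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def ucm_sample_lines(exposition_text: str, prefix: str = "ucm:") -> List[str]:
    """Return sample lines whose metric name starts with the given prefix."""
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(prefix)  # comment lines start with '#', so they drop out
    ]


if __name__ == "__main__":
    # Placeholder address; substitute the actual vLLM worker host and port.
    text = fetch_metrics("http://127.0.0.1:8000/metrics")
    for line in ucm_sample_lines(text):
        print(line)
```

An empty result suggests that no UCM consumer is exporting under that prefix, or that the prefix in the configuration differs from `ucm:`.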

Prometheus multiprocess `.db` files should also appear in
`$PROMETHEUS_MULTIPROC_DIR`.
If the `multiproc` consumer is enabled, Prometheus multiprocess `.db` files
should also appear in `$PROMETHEUS_MULTIPROC_DIR`.

### 2. Start Prometheus and Grafana

@@ -158,25 +161,30 @@ dashboards while preserving the time range and `model_name` value.
Each dashboard has a `job` selector. It defaults to **All** and uses regex
matching, so dashboards also work for metrics that do not carry a `job` label.

The UCM dashboards also have a `View` selector and a `worker_id` selector:
The UCM dashboards also have a `View` selector and a `worker_rank` selector:

- **Aggregated**: default service-level view. Worker labels are collapsed.
- **Per Worker**: split panels by `worker_id` for worker-specific diagnosis.
- **worker_id**: defaults to **All**. Select a specific worker ID to filter all
- **Per Worker**: split panels by `worker_rank` for worker-specific diagnosis.
- **worker_rank**: defaults to **All**. Select a specific worker rank to filter all
UCM panels to that worker only.

Heatmap panels and panels grouped by another dimension may ignore the `View`
selector because their grouping is already defined by the panel. They still use
the `worker_id` filter.
the `worker_rank` filter.

## Metrics Configuration

Metrics are configured in `examples/metrics/metrics_configs.yaml`:

```yaml
log_interval: 5
multiproc_dir: "/vllm-workspace"
metric_prefix: "ucm:"
# multiproc_dir: "/vllm-workspace"
# multiproc_prefix: "ucm_multiproc:"
vllm_connector_prefix: "ucm:"

consumers:
# multiproc: true
vllm_connector: true

counter:
- name: "cache_load_bytes_total"
@@ -193,17 +201,22 @@ histogram:
buckets: [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
```

Metric names are exported with the configured prefix. For example,
`cache_load_duration_ms` becomes `ucm:cache_load_duration_ms`. Prometheus also
exports histogram helper series such as `_bucket`, `_sum`, and `_count`.
Metric names are exported per consumer. For example, `cache_load_duration_ms`
is exported as `ucm:cache_load_duration_ms` by the default `vllm_connector`
consumer. If `multiproc` is also enabled, use a separate prefix such as
`ucm_multiproc:` so both consumers do not register the same Prometheus metric.
Prometheus also exports histogram helper series such as `_bucket`, `_sum`,
and `_count`.
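
As a rough illustration of the naming rule, the sketch below derives the series names a histogram would produce under each prefix and checks that two prefixes do not collide. `histogram_series` and `prefixes_collide` are hypothetical helpers written for this example; they are not UCM or `prometheus_client` functions.

```python
from typing import List

# Suffixes of the helper series Prometheus exposes for every histogram.
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def histogram_series(prefix: str, name: str) -> List[str]:
    """Return the exported series names for one histogram under a prefix."""
    full_name = f"{prefix}{name}"
    return [full_name + suffix for suffix in HISTOGRAM_SUFFIXES]


def prefixes_collide(prefix_a: str, prefix_b: str, name: str) -> bool:
    """True when two consumers would register the same series names."""
    return bool(
        set(histogram_series(prefix_a, name)) & set(histogram_series(prefix_b, name))
    )


print(histogram_series("ucm:", "cache_load_duration_ms"))
print(prefixes_collide("ucm:", "ucm_multiproc:", "cache_load_duration_ms"))
```

With distinct prefixes the two name sets are disjoint, which is why enabling both consumers with the same prefix is the case to avoid.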

Counter values are increments. Gauge values replace the current value.
Histogram values are observations that are immediately assigned to configured
buckets in the C++ metrics library.
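
The three update semantics can be sketched in a few lines of Python. This is an illustration only: the `Deltas` class is hypothetical, and the bucket rule assumes the Prometheus convention that each boundary is an inclusive upper bound, with one extra overflow bucket for `+Inf`. The C++ library may differ in detail.

```python
from bisect import bisect_left
from typing import Dict, List

# Bucket boundaries copied from the example configuration above.
BUCKETS = [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]


def bucket_index(value: float, buckets: List[float] = BUCKETS) -> int:
    """Index of the first bucket whose upper bound is >= value.

    An index equal to len(buckets) means the overflow (+Inf) bucket.
    """
    return bisect_left(buckets, value)


class Deltas:
    """Hypothetical per-interval accumulator for one metric of each kind."""

    def __init__(self) -> None:
        self.counters: Dict[str, float] = {}
        self.gauges: Dict[str, float] = {}
        self.bucket_counts: List[int] = [0] * (len(BUCKETS) + 1)

    def inc(self, name: str, amount: float) -> None:
        # Counter values are increments.
        self.counters[name] = self.counters.get(name, 0.0) + amount

    def set(self, name: str, value: float) -> None:
        # Gauge values replace the current value.
        self.gauges[name] = value

    def observe(self, value: float) -> None:
        # Histogram values are bucketed immediately; the raw sample is dropped.
        self.bucket_counts[bucket_index(value)] += 1
```

Storing only per-bucket counts keeps memory use fixed regardless of how many observations arrive within an interval.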

## Available Metrics

The default metrics configuration contains the following UCM metrics.
The default metrics configuration contains the following UCM metric names. The
table uses the default `vllm_connector_prefix`. UCM duration metrics are
exported in milliseconds.

### Counters

@@ -240,30 +253,45 @@ The default metrics configuration contains the following UCM metrics.
| `ucm:load_speed` | Speed of loading from UCM in GB/s. |
| `ucm:save_requests_num` | Number of requests saved to UCM. |
| `ucm:save_blocks_num` | Number of blocks saved to UCM. |
| `ucm:save_duration` | Time to save to UCM in milliseconds. |
| `ucm:save_speed` | Speed of saving to UCM in GB/s. |
| `ucm:save_duration` | Async total time from UCM connector dump submit to completion in milliseconds. |
| `ucm:save_speed` | Speed of saving to UCM based on async submit-to-completion time in GB/s. |
| `ucm:save_completion_wait_duration` | Time spent blocked while confirming async UCM connector dump completion in milliseconds. |
| `ucm:save_completion_wait_speed` | Speed of saving to UCM based on async completion blocking wait time in GB/s. |
| `ucm:interval_lookup_hit_rates` | Hit rates of UCM lookup requests. |
| `ucm:cache_lookup_duration_ms` | Cache buffer lookup wall-clock time. |
| `ucm:cache_lookup_backend_duration_ms` | Backend lookup wall-clock time when descending due to no buffer or buffer miss. |
| `ucm:cache_load_duration_ms` | End-to-end Cache stage load task duration in milliseconds. |
| `ucm:cache_dump_duration_ms` | End-to-end Cache stage dump task duration in milliseconds. |
| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load bandwidth in GB/s. |
| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump bandwidth in GB/s. |
| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load throughput in GB/s over the whole task lifetime, including queue/backend waits. Not a DMA bandwidth (see `cache_h2d_bandwidth_gbps`). |
| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump throughput in GB/s over the whole task lifetime, including queue and compute-event waits. Not a DMA bandwidth (see `cache_d2h_bandwidth_gbps`). |
| `ucm:cache_load_queue_wait_duration_ms` | Time a Cache load task spent queued before dispatch worker pickup. |
| `ucm:cache_dump_queue_wait_duration_ms` | Time a Cache dump task spent queued before dispatch worker pickup. |
| `ucm:cache_load_dispatch_duration_ms` | Cache load dispatch cost: buffer allocation plus backend submission. |
| `ucm:cache_load_backend_submit_duration_ms` | Cache load backend submit duration: buffer allocation plus backend load submission. |
| `ucm:cache_shard_backend_wait_ms` | Cache load per-shard `WaitBackendTaskReady()` duration. |
| `ucm:cache_shard_h2d_ms` | Cache load per-shard H2D async submit duration. |
| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration. |
| `ucm:cache_d2h_duration_ms` | Cache dump D2H stream sync phase duration. |
| `ucm:cache_h2d_submit_ms` | Cache load per-shard H2D async submit CPU cost. Submission only, not the transfer. |
| `ucm:cache_h2d_sync_ms` | Cache load residual H2D stream drain after the last shard submit. Large values mean H2D copy is the bottleneck. |
| `ucm:cache_h2d_bandwidth_gbps` | Cache load pure H2D copy bandwidth, directly comparable to memcpy microbenchmarks. |
| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration (buffer allocation/reuse plus D2H async submit). |
| `ucm:cache_dump_prereq_wait_ms` | Cache dump wait for the prerequisite compute event before D2H can start. Large values mean dump is compute-gated. |
| `ucm:cache_d2h_duration_ms` | Cache dump pure D2H copy drain, compute-event wait excluded. |
| `ucm:cache_d2h_bandwidth_gbps` | Cache dump pure D2H copy bandwidth, directly comparable to memcpy microbenchmarks. |
| `ucm:cache_dump_backend_submit_duration_ms` | Cache dump synchronous backend submit duration. |
| `ucm:cache_dump_backend_wait_duration_ms` | Cache dump wait for the lower tier to finish writing. Large values mean storage write is the bottleneck. |
| `ucm:posix_load_task_duration_ms` | End-to-end Posix load task duration. |
| `ucm:posix_dump_task_duration_ms` | End-to-end Posix dump task duration. |
| `ucm:posix_s2h_bandwidth_gbps` | Posix stage read bandwidth per task in GB/s. |
| `ucm:posix_h2s_bandwidth_gbps` | Posix stage write bandwidth per task in GB/s. |
| `ucm:posix_load_queue_wait_duration_ms` | Time a Posix load task spent queued before first worker pickup. |
| `ucm:posix_dump_queue_wait_duration_ms` | Time a Posix dump task spent queued before first worker pickup. |
| `ucm:layerwise_batch_total_ms` | Layerwise batch wall-clock time from `start_load_kv()` entry to `wait_for_save()` return. |
| `ucm:layerwise_batch_total_load_only_ms` | Layerwise load-only batch wall-clock time. |
| `ucm:layerwise_batch_total_save_only_ms` | Layerwise save-only batch wall-clock time. |
| `ucm:layerwise_batch_total_load_save_ms` | Layerwise load-and-save batch wall-clock time. |
| `ucm:layerwise_batch_total_no_transfer_ms` | Layerwise batch wall-clock time with neither load nor save work. |
| `ucm:layerwise_batch_load_wait_total_load_only_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-only layerwise batch. |
| `ucm:layerwise_batch_load_wait_total_load_save_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-and-save layerwise batch. |
| `ucm:layerwise_batch_save_tail_save_only_ms` | `wait_for_save()` tail duration within one save-only layerwise batch. |
| `ucm:layerwise_batch_save_tail_load_save_ms` | `wait_for_save()` tail duration within one load-and-save layerwise batch. |
| `ucm:layerwise_wait_blocking_ms` | Time `wait_for_layer_load()` blocked before returning. |
| `ucm:layerwise_wait_tasks_count` | Number of per-request load tasks awaited in a single layer wait. |
| `ucm:layerwise_inter_wait_interval_ms` | Interval between consecutive `wait_for_layer_load()` calls. |