[Feat] Add opt-in reshape cache event sync#1025
Conversation
a37355f to
8167f53
Compare
| return 0 | ||
| self.borrowed_events.append(event) | ||
| return int(handle) | ||
| except Exception as e: |
There was a problem hiding this comment.
Using except Exception catches all exceptions including system exceptions like KeyboardInterrupt, SystemExit. This could mask critical errors.
Recommended specific exceptions:
except (AttributeError, TypeError, ValueError) as e:This applies to both CudaDevice (line 128) and NpuDevice (line 279) implementations.
| return 0 | ||
| self.borrowed_events.append(event) | ||
| return int(handle) | ||
| except Exception as e: |
There was a problem hiding this comment.
Same issue as CudaDevice - using except Exception is too broad. Should catch specific exceptions like AttributeError, TypeError, ValueError.
| self.events.clear() | ||
| self.borrowed_events.clear() | ||
|
|
||
| def destroy_event_handle(self, handle: int): |
There was a problem hiding this comment.
The destroy_event_handle method only removes events from self.events dict but does not remove corresponding entries from self.borrowed_events.
If individual event handles are destroyed (rather than batch clearing via destroy_event_handles), borrowed event references could accumulate, potentially causing memory growth over time.
Consider either:
- Adding logic to track handle-to-event mapping for borrowed events
- Documenting that
destroy_event_handleis only for UCM-created events, not borrowed ones - Using a different structure for
borrowed_eventswith proper tracking
| class Device(ABC): | ||
| def __init__(self): | ||
| self.events = {} | ||
| self.borrowed_events = [] |
There was a problem hiding this comment.
💡 Suggestion - Thread safety consideration
The borrowed_events list is accessed without synchronization. If this device is used in a multi-threaded context (e.g., multiple workers sharing the same device instance), there could be race conditions when:
- Multiple threads call
get_event_handle_from_eventconcurrently (append) - One thread clears while another is appending
Consider adding:
- A lock for
borrowed_eventsoperations - Documentation about thread-safety assumptions (single-threaded per device)
- A thread-safe collection if multi-threaded access is expected
| self.tp_rank = self._vllm_config.parallel_config.rank | ||
| self.block_size = self._vllm_config.cache_config.block_size | ||
| self.is_mla = self._vllm_config.model_config.is_deepseek_mla | ||
| self.use_mla = getattr(self._vllm_config.model_config, "use_mla", self.is_mla) |
There was a problem hiding this comment.
💡 Suggestion - Unused variable
The variable self.use_mla is set here but never used in the visible code changes. If this is intentional (for future use), consider adding a comment explaining the purpose. If not needed, it should be removed to avoid confusion.
Note: self.is_mla is already used elsewhere in the code.
| def _get_reshape_cache_event( | ||
| self, layer_name: Optional[str], attn_metadata: Optional["AttentionMetadata"] | ||
| ) -> tuple[Optional[Any], str]: | ||
| if attn_metadata is None: | ||
| return None, "none" | ||
|
|
||
| if layer_name: | ||
| try: | ||
| layer_metadata = attn_metadata[layer_name] | ||
| event = getattr(layer_metadata, "reshape_cache_event", None) | ||
| if event is not None: | ||
| return event, "layer" | ||
| except (KeyError, TypeError, AttributeError): | ||
| pass | ||
|
|
||
| event = getattr(attn_metadata, "reshape_cache_event", None) | ||
| if event is not None: | ||
| return event, "direct" | ||
| return None, "direct_missing" |
There was a problem hiding this comment.
When enable_reshape_cache_event_sync is True but the reshape cache event is missing (event_source = "none" or "direct_missing"), and we fall back to current stream or device sync, there is no specific metric tracking this "expected-but-missing" case.
This creates an observability gap - operators cannot distinguish between:
- Intentional fallback (feature disabled)
- Expected reshape event missing (vLLM version/model mismatch)
Consider adding metrics like:
dump_event_reshape_cache_none_total(attn_metadata is None)dump_event_reshape_cache_missing_total(event_source = "direct_missing")
| if self.enable_reshape_cache_event_sync: | ||
| reshape_cache_event, event_source = self._get_reshape_cache_event( | ||
| layer_name, attn_metadata | ||
| ) | ||
| event_handle = ( | ||
| self.device.get_event_handle_from_event(reshape_cache_event) | ||
| if reshape_cache_event is not None | ||
| else 0 | ||
| ) | ||
| if event_handle != 0: | ||
| if event_source == "direct": | ||
| ucmmetrics.update_stats( | ||
| "dump_event_reshape_cache_direct_used_total", 1.0 | ||
| ) | ||
| elif event_source == "layer": | ||
| ucmmetrics.update_stats( | ||
| "dump_event_reshape_cache_layer_used_total", 1.0 | ||
| ) | ||
| return event_handle |
There was a problem hiding this comment.
When enable_reshape_cache_event_sync is True and we successfully get a reshape_cache_event but get_event_handle_from_event returns 0 (handle extraction failed), we fall back to the current stream path and track "dump_event_current_stream_used_total".
The metric doesn't reflect that we attempted reshape cache sync first but failed at handle extraction. Consider:
- Adding a metric for "dump_event_reshape_cache_handle_failed_total"
- Or updating the existing metric logic to distinguish this fallback path
Summary
This PR splits the cache dump reshape-event synchronization optimization into its own branch, without the dump queue pipeline refactor from
cache-dump-event-optimize.Changes:
enable_reshape_cache_event_syncconfig, defaulting tofalseto preserve current behavior.reshape_cache_eventfromattn_metadata[layer_name], then falls back toattn_metadata.reshape_cache_event.Deviceso raw event handles remain valid until UCM clears event handles.dump_event_reshape_cache_direct_used_totaldump_event_reshape_cache_layer_used_totaldump_event_current_stream_used_totaldump_event_sync_fallback_used_totalWhy
For layerwise cache dump, vLLM-Ascend may already record a
reshape_cache_eventimmediately after KV cache writes are ready. Waiting on that event lets UCM's C++ cache copy stream synchronize at the KV-ready point instead of recording a later current-stream event in Python.The switch is opt-in because model/version behavior differs. The default remains the pre-existing path.
Scope
This PR intentionally does not include dump queue optimization or
DumpQueuechanges. The diff is limited to connector/device Python code plus config, metric config, and docs.Verification
Ran locally on Windows worktree
cache-dump-event-sync-only:git diff --check origin/develop...HEADpython -m py_compile ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.pypython -m black --check ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.pypython -m isort --check-only --profile=black ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.pycodespell docs/source/user-guide/metrics/metrics.md docs/source/user-guide/prefix-cache/pipeline_store.md examples/metrics/metrics_configs.yaml examples/ucm_config_example.yaml ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.py --skip ucm/csrc/**,ucm.egg-info/**,.github/**,ucm/sparse/gsa_on_device/csrc/** -L CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,Collet,DElt,re-useexamples/metrics/metrics_configs.yamlandexamples/ucm_config_example.yamlwith PyYAML.Note: full local
pre-commit run --files ...could not complete on this machine because the actionlint hook attempted to download Go fromgo.devand hit a local SSL/network failure. The changed files do not include GitHub Actions files; black/isort/codespell were run directly with matching configured versions.