Skip to content

[Feat] Add opt-in reshape cache event sync#1025

Open
dante159753 wants to merge 4 commits into
ModelEngine-Group:developfrom
dante159753:cache-dump-event-sync-only
Open

[Feat] Add opt-in reshape cache event sync#1025
dante159753 wants to merge 4 commits into
ModelEngine-Group:developfrom
dante159753:cache-dump-event-sync-only

Conversation

@dante159753

Copy link
Copy Markdown
Contributor

Summary

This PR splits the cache dump reshape-event synchronization optimization into its own branch, without the dump queue pipeline refactor from cache-dump-event-optimize.

Changes:

  • Add opt-in enable_reshape_cache_event_sync config, defaulting to false to preserve current behavior.
  • When enabled, layerwise dump first tries to reuse vLLM/vLLM-Ascend reshape_cache_event from attn_metadata[layer_name], then falls back to attn_metadata.reshape_cache_event.
  • If no usable reshape event exists, UCM keeps the existing current-stream event fallback; if event creation fails, it falls back to device synchronization.
  • Preserve borrowed torch event references in Device so raw event handles remain valid until UCM clears event handles.
  • Add focused metrics for event-source selection:
    • dump_event_reshape_cache_direct_used_total
    • dump_event_reshape_cache_layer_used_total
    • dump_event_current_stream_used_total
    • dump_event_sync_fallback_used_total
  • Update config example and docs.

Why

For layerwise cache dump, vLLM-Ascend may already record a reshape_cache_event immediately after KV cache writes are ready. Waiting on that event lets UCM's C++ cache copy stream synchronize at the KV-ready point instead of recording a later current-stream event in Python.

The switch is opt-in because model/version behavior differs. The default remains the pre-existing path.

Scope

This PR intentionally does not include dump queue optimization or DumpQueue changes. The diff is limited to connector/device Python code plus config, metric config, and docs.

Verification

Ran locally on Windows worktree cache-dump-event-sync-only:

  • git diff --check origin/develop...HEAD
  • python -m py_compile ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.py
  • python -m black --check ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.py
  • python -m isort --check-only --profile=black ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.py
  • codespell docs/source/user-guide/metrics/metrics.md docs/source/user-guide/prefix-cache/pipeline_store.md examples/metrics/metrics_configs.yaml examples/ucm_config_example.yaml ucm/integration/vllm/device.py ucm/integration/vllm/ucm_connector.py --skip ucm/csrc/**,ucm.egg-info/**,.github/**,ucm/sparse/gsa_on_device/csrc/** -L CANN,cann,NNAL,nnal,ASCEND,ascend,EnQue,CopyIn,Collet,DElt,re-use
  • Parsed examples/metrics/metrics_configs.yaml and examples/ucm_config_example.yaml with PyYAML.
  • Confirmed PR diff does not include dump queue files.

Note: full local pre-commit run --files ... could not complete on this machine because the actionlint hook attempted to download Go from go.dev and hit a local SSL/network failure. The changed files do not include GitHub Actions files; black/isort/codespell were run directly with matching configured versions.

@dante159753 dante159753 force-pushed the cache-dump-event-sync-only branch from a37355f to 8167f53 Compare June 12, 2026 09:11
return 0
self.borrowed_events.append(event)
return int(handle)
except Exception as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning - Exception handling too broad

Using except Exception catches all exceptions including system exceptions like KeyboardInterrupt, SystemExit. This could mask critical errors.

Recommended specific exceptions:

except (AttributeError, TypeError, ValueError) as e:

This applies to both CudaDevice (line 128) and NpuDevice (line 279) implementations.

return 0
self.borrowed_events.append(event)
return int(handle)
except Exception as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning - Exception handling too broad

Same issue as CudaDevice - using except Exception is too broad. Should catch specific exceptions like AttributeError, TypeError, ValueError.

self.events.clear()
self.borrowed_events.clear()

def destroy_event_handle(self, handle: int):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning - Inconsistent cleanup in destroy_event_handle

The destroy_event_handle method only removes events from self.events dict but does not remove corresponding entries from self.borrowed_events.

If individual event handles are destroyed (rather than batch clearing via destroy_event_handles), borrowed event references could accumulate, potentially causing memory growth over time.

Consider either:

  1. Adding logic to track handle-to-event mapping for borrowed events
  2. Documenting that destroy_event_handle is only for UCM-created events, not borrowed ones
  3. Using a different structure for borrowed_events with proper tracking

class Device(ABC):
def __init__(self):
self.events = {}
self.borrowed_events = []

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion - Thread safety consideration

The borrowed_events list is accessed without synchronization. If this device is used in a multi-threaded context (e.g., multiple workers sharing the same device instance), there could be race conditions when:

  • Multiple threads call get_event_handle_from_event concurrently (append)
  • One thread clears while another is appending

Consider adding:

  1. A lock for borrowed_events operations
  2. Documentation about thread-safety assumptions (single-threaded per device)
  3. A thread-safe collection if multi-threaded access is expected

self.tp_rank = self._vllm_config.parallel_config.rank
self.block_size = self._vllm_config.cache_config.block_size
self.is_mla = self._vllm_config.model_config.is_deepseek_mla
self.use_mla = getattr(self._vllm_config.model_config, "use_mla", self.is_mla)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion - Unused variable

The variable self.use_mla is set here but never used in the visible code changes. If this is intentional (for future use), consider adding a comment explaining the purpose. If not needed, it should be removed to avoid confusion.

Note: self.is_mla is already used elsewhere in the code.

Comment on lines +833 to +851
def _get_reshape_cache_event(
self, layer_name: Optional[str], attn_metadata: Optional["AttentionMetadata"]
) -> tuple[Optional[Any], str]:
if attn_metadata is None:
return None, "none"

if layer_name:
try:
layer_metadata = attn_metadata[layer_name]
event = getattr(layer_metadata, "reshape_cache_event", None)
if event is not None:
return event, "layer"
except (KeyError, TypeError, AttributeError):
pass

event = getattr(attn_metadata, "reshape_cache_event", None)
if event is not None:
return event, "direct"
return None, "direct_missing"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning - Metric gap for missing reshape cache events

When enable_reshape_cache_event_sync is True but the reshape cache event is missing (event_source = "none" or "direct_missing"), and we fall back to current stream or device sync, there is no specific metric tracking this "expected-but-missing" case.

This creates an observability gap - operators cannot distinguish between:

  • Intentional fallback (feature disabled)
  • Expected reshape event missing (vLLM version/model mismatch)

Consider adding metrics like:

  • dump_event_reshape_cache_none_total (attn_metadata is None)
  • dump_event_reshape_cache_missing_total (event_source = "direct_missing")

Comment on lines +862 to +880
if self.enable_reshape_cache_event_sync:
reshape_cache_event, event_source = self._get_reshape_cache_event(
layer_name, attn_metadata
)
event_handle = (
self.device.get_event_handle_from_event(reshape_cache_event)
if reshape_cache_event is not None
else 0
)
if event_handle != 0:
if event_source == "direct":
ucmmetrics.update_stats(
"dump_event_reshape_cache_direct_used_total", 1.0
)
elif event_source == "layer":
ucmmetrics.update_stats(
"dump_event_reshape_cache_layer_used_total", 1.0
)
return event_handle

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning - Metric inconsistency in handle extraction failure

When enable_reshape_cache_event_sync is True and we successfully get a reshape_cache_event but get_event_handle_from_event returns 0 (handle extraction failed), we fall back to the current stream path and track "dump_event_current_stream_used_total".

The metric doesn't reflect that we attempted reshape cache sync first but failed at handle extraction. Consider:

  1. Adding a metric for "dump_event_reshape_cache_handle_failed_total"
  2. Or updating the existing metric logic to distinguish this fallback path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants