Add Ascend Cache IO aggregation copy path #1038

Open
NaganooMei wants to merge 10 commits into ModelEngine-Group:develop from NaganooMei:io-aggregation-upstream-pr-v2

@NaganooMei

Contributor

Purpose

This PR adds an optional Ascend CacheStore IO aggregation path for small H2D/D2H shard copies.

The default CacheStore path submits one async copy per tensor fragment. For workloads with many small fragments per shard, the submission overhead can become noticeable. This PR introduces an Ascend-only aggregation stream that treats one CacheStore shard as one IO object, stages the contiguous host buffer once, and uses FFTS SDMA descriptors to scatter/gather between the staging buffer and device tensor addresses.
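
The staging-plus-descriptor idea described above can be sketched in a few lines of Python. This is a host-only simulation under stated assumptions, not the PR's implementation: `CopyDescriptor`, `build_descriptors`, `h2d_aggregated`, and `d2h_aggregated` are hypothetical names, device memory is modelled as a `bytearray`, and the descriptor layout is illustrative rather than the real FFTS SDMA format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CopyDescriptor:
    """One scatter/gather entry (hypothetical layout, not the FFTS SDMA format)."""
    staging_offset: int  # byte offset inside the contiguous staging buffer
    device_addr: int     # byte offset inside the simulated device memory
    length: int          # fragment size in bytes


def build_descriptors(device_addrs: List[int], fragment_size: int) -> List[CopyDescriptor]:
    # Assumption: fragments sit back to back in the shard, so fragment i
    # starts at i * fragment_size in the staging buffer.
    return [
        CopyDescriptor(i * fragment_size, addr, fragment_size)
        for i, addr in enumerate(device_addrs)
    ]


def h2d_aggregated(host_shard: bytes, device: bytearray, descs: List[CopyDescriptor]) -> None:
    # Stage the whole shard once instead of submitting one copy per fragment.
    staging = bytearray(host_shard)
    # This loop stands in for a single batched descriptor launch (scatter).
    for d in descs:
        src = staging[d.staging_offset:d.staging_offset + d.length]
        device[d.device_addr:d.device_addr + d.length] = src


def d2h_aggregated(device: bytearray, descs: List[CopyDescriptor]) -> bytes:
    # Gather the device fragments into one staging buffer, then hand the
    # contiguous result back as the host shard.
    staging = bytearray(sum(d.length for d in descs))
    for d in descs:
        src = device[d.device_addr:d.device_addr + d.length]
        staging[d.staging_offset:d.staging_offset + d.length] = src
    return bytes(staging)


shard = bytes(range(12))                      # one shard made of three 4-byte fragments
device_mem = bytearray(64)                    # simulated device memory
descs = build_descriptors([40, 8, 24], 4)     # non-contiguous device addresses
h2d_aggregated(shard, device_mem, descs)
round_trip = d2h_aggregated(device_mem, descs)
```

In this sketch the shard costs one staging copy plus one descriptor batch, regardless of how many fragments it contains, which is the submission overhead the aggregation path is meant to reduce.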

The feature is disabled by default and is available only when building with PLATFORM=ascend / RUNTIME_ENVIRONMENT=ascend.

Modifications

  • Add StreamOptions and MakeSharedStream(options) so CacheStore can request a feature-specific Ascend stream while keeping the original stream interface available.
  • Keep CopyStream as the outer stream pool owner, but allow special CacheStore streams to manage their own internal lanes.
  • Move CacheStore H2D/D2H fragment submission from explicit LoadQueue/DumpQueue tensor loops into the trans stream multi-fragment async path.
  • Add AscendIoAggregationStream and AscendShardIOAggregator.
    • H2D: host shard -> staging buffer -> device tensor fragments.
    • D2H: device tensor fragments -> staging buffer -> host shard.
    • Internal lanes and pipeline depth are controlled by CacheStore config.
  • Add FftsSdmaDispatcher for building and launching FFTS SDMA copy descriptors.
  • Add config:
    • cache_io_aggregation
    • cache_io_aggregation_pipeline_depth
    • cache_io_aggregation_max_ready_lanes
  • Add runtime/build isolation:
    • PLATFORM=ascend enables the ordinary Ascend backend plus the IO aggregation code.
    • Other runtimes do not compile or allow this feature.
  • Add config validation so invalid or unsupported combinations fail early instead of silently falling back.
  • Preserve the default behavior:
    • cache_io_aggregation: false keeps the original Ascend copy path.
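
The fail-early validation listed above might look roughly like the following sketch. The three config keys come from this PR; the `CacheIoConfig` class, the `validate` function, the default values for pipeline depth and lane count, and the specific checks are assumptions made for illustration and may differ from the actual rules.

```python
from dataclasses import dataclass


@dataclass
class CacheIoConfig:
    cache_io_aggregation: bool = False              # disabled by default, as stated in the PR
    cache_io_aggregation_pipeline_depth: int = 2    # assumed default, not taken from the PR
    cache_io_aggregation_max_ready_lanes: int = 1   # assumed default, not taken from the PR


def validate(cfg: CacheIoConfig, platform: str) -> None:
    """Raise ValueError on unsupported combinations instead of falling back silently."""
    if not cfg.cache_io_aggregation:
        return  # the original copy path needs no extra checks
    if platform != "ascend":
        raise ValueError("cache_io_aggregation requires PLATFORM=ascend")
    if cfg.cache_io_aggregation_pipeline_depth < 1:
        raise ValueError("cache_io_aggregation_pipeline_depth must be >= 1")
    if cfg.cache_io_aggregation_max_ready_lanes < 1:
        raise ValueError("cache_io_aggregation_max_ready_lanes must be >= 1")


def is_rejected(cfg: CacheIoConfig, platform: str) -> bool:
    # Small helper for demonstration: True when validation fails.
    try:
        validate(cfg, platform)
    except ValueError:
        return True
    return False
```

With these assumed rules, the default config passes on any platform, while enabling aggregation on a non-Ascend build or with a zero pipeline depth is rejected up front.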

Test

  • Verified build with PLATFORM=ascend.
  • Verified default CacheStore load/dump path with cache_io_aggregation=false.
  • Verified the CacheStore load (H2D) and dump (D2H) paths with cache_io_aggregation=true.
  • Verified the feature is controlled by config and remains disabled by default.
