feat: Turbomind linear gdn prefix caching#4465

Open
lapy wants to merge 5 commits into InternLM:main from
lapy:turbomind-linear-gdn-prefix-caching

Conversation


@lapy lapy commented Mar 25, 2026

Qwen3.5 Hybrid Prefix Caching in TurboMind

Summary

This change adds prefix caching support for Qwen3.5 hybrid-attention models in TurboMind.

  • Full-attention layers keep using the existing KV prefix cache.
  • Gated DeltaNet linear-attention layers now store checkpointed recurrent state at configurable KV-block boundaries.
  • Prefix matches for hybrid models restore both:
    • shared KV blocks for full-attention layers
    • the closest compatible GDN checkpoint state

The implementation keeps quant_policy scoped to KV cache quantization. GDN prefix checkpoints remain in the model/state dtypes in this version.

User-Facing Changes

  • Added linear_prefix_cache_interval_blocks to TurbomindEngineConfig
  • Added --linear-prefix-cache-interval-blocks to the TurboMind CLI surface
  • Default interval is 2 KV blocks
  • Validation rejects values < 1
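The configuration rules above can be sketched as a small validation helper. This is an illustrative Python sketch, not the actual lmdeploy implementation; the function name and default constant are hypothetical, while the field name, the default of 2, and the "< 1 is rejected" rule come from this PR.

```python
# Hypothetical sketch of the interval validation described above.
# Only the field name, default (2), and rejection rule mirror the PR text.
DEFAULT_LINEAR_PREFIX_CACHE_INTERVAL_BLOCKS = 2

def resolve_interval(interval_blocks=None):
    """Return the effective checkpoint interval, rejecting values < 1."""
    if interval_blocks is None:
        return DEFAULT_LINEAR_PREFIX_CACHE_INTERVAL_BLOCKS
    if interval_blocks < 1:
        raise ValueError("linear_prefix_cache_interval_blocks must be >= 1")
    return interval_blocks
```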

Runtime Design

Hybrid cache structure

  • Existing KV prefix cache is unchanged for full-attention layers.
  • A second cache family stores GDN prefix checkpoints.
  • Each checkpoint stores:
    • convolution state
    • recurrent state
  • Checkpoints are attached to trie nodes at the configured interval.
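The structure above can be sketched in Python for clarity; the real implementation is the C++ code in BlockTrie.* and SequenceManager.*, and all names below are illustrative.

```python
# Illustrative sketch of the hybrid cache structure described above;
# the actual implementation is in C++ (BlockTrie.*, SequenceManager.*).
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class GdnCheckpoint:
    """Checkpointed linear-attention state at a KV-block boundary."""
    conv_state: bytes       # convolution state snapshot
    recurrent_state: bytes  # recurrent state snapshot

@dataclass
class TrieNode:
    """One cached KV block in the prefix trie. A GDN checkpoint is only
    attached at multiples of the configured interval."""
    block_id: int
    depth: int  # number of KV blocks from the root
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    gdn_checkpoint: Optional[GdnCheckpoint] = None

def should_checkpoint(depth: int, interval_blocks: int) -> bool:
    """Attach a GDN checkpoint every `interval_blocks` KV blocks."""
    return depth > 0 and depth % interval_blocks == 0
```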

Prefix matching

On a prefix hit:

  • TurboMind matches normal KV blocks as before.
  • For hybrid models it also finds the deepest trie node with a valid GDN checkpoint.
  • Reusable prefix length is clamped to the deepest compatible linear checkpoint.
  • The matched GDN state is restored into the live per-sequence GDN buffers before decode continues.
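The clamping rule above can be illustrated with a short sketch: walk the matched path from the leaf back toward the root and reuse the prefix only up to the deepest node that carries a GDN checkpoint. The data layout here is simplified (depth/checkpoint pairs) and is not the C++ implementation.

```python
# Illustrative sketch of clamping the reusable prefix to the deepest
# matched trie node that holds a GDN checkpoint. Not the real C++ code.
def clamp_to_checkpoint(matched_path):
    """matched_path: list of (depth_in_blocks, gdn_checkpoint_or_None)
    pairs along the matched KV prefix, ordered root to leaf.
    Returns (reusable_blocks, checkpoint); (0, None) means no linear
    checkpoint is available and the prefix must be recomputed."""
    for depth, ckpt in reversed(matched_path):
        if ckpt is not None:
            return depth, ckpt  # reuse exactly up to this checkpoint
    return 0, None
```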

Cache maintenance

  • GDN checkpoint slots are released when trie nodes are invalidated.
  • When KV cached blocks are freed or evicted, the trie is verified immediately so the corresponding GDN checkpoints are pruned in the same path.
  • If the GDN checkpoint pool is exhausted, TurboMind skips storing deeper checkpoints instead of aborting the request.
  • Warm-up requests never allocate GDN prefix checkpoint staging.
  • Large real batches that cannot afford checkpoint staging continue to run; they simply skip storing new GDN checkpoints for that batch.

Main Code Areas

  • CLI and engine config
    • lmdeploy/messages.py
    • lmdeploy/cli/utils.py
    • lmdeploy/cli/cli.py
    • lmdeploy/cli/serve.py
    • src/turbomind/turbomind.cc
    • src/turbomind/models/llama/llama_params.h
    • src/turbomind/engine/engine.cc
  • Core hybrid prefix-cache logic
    • src/turbomind/models/llama/BlockTrie.h
    • src/turbomind/models/llama/BlockTrie.cc
    • src/turbomind/models/llama/SequenceManager.h
    • src/turbomind/models/llama/SequenceManager.cc
    • src/turbomind/models/llama/GatedDeltaNetLayer.h
    • src/turbomind/models/llama/GatedDeltaNetLayer.cc
    • src/turbomind/models/llama/gated_delta_net_kernels.h
    • src/turbomind/models/llama/gated_delta_net_kernels.cu

Test Coverage Added

Python tests

  • tests/test_lmdeploy/test_turbomind/test_engine_config.py
    • default interval value
    • validation for invalid interval values
    • explicit override handling
  • tests/test_lmdeploy/test_turbomind/test_api_server.py
    • TurboMind api_server forwards hybrid prefix-cache options into TurbomindEngineConfig
    • TurboMind api_server uses the normal default CUDA max_batch_size when the user does not set one explicitly

Test commands run

python -m pytest -q tests/test_lmdeploy/test_turbomind/test_engine_config.py tests/test_lmdeploy/test_turbomind/test_api_server.py
python -m pytest -q tests/test_lmdeploy/test_turbomind/test_converter.py
cmake --build /root/lmdeploy/build --target _turbomind -j4

Observed results

  • test_engine_config.py + test_api_server.py: 5 passed
  • test_converter.py: 5 passed
  • _turbomind rebuilt successfully

Real-Model Validation

Model:

  • QuantTrio/Qwen3.5-27B-AWQ

Command used:

TM_LOG_LEVEL=INFO CUDA_VISIBLE_DEVICES=1,2 lmdeploy serve api_server \
  QuantTrio/Qwen3.5-27B-AWQ \
  --tp 2 \
  --server-port 23335 \
  --reasoning-parser qwen-qwq \
  --tool-call-parser qwen3coder \
  --quant-policy 8 \
  --enable-prefix-caching

Observed startup details:

  • Server reached full Uvicorn startup successfully.
  • TurboMind reported max cached tokens: 533248.
  • Warm-up completed successfully through 8320 tokens.

Observed hybrid prefix-cache hit on repeated request:

[TM][INFO] [SeqMgr][match] ID 2, hit blocks 8, linear_cache_len 512, cache_len 0
[TM][INFO] [SeqMgr][match] ID 2, after matching, blocks 8, cache_len 512

Request details:

  • request 1: prompt_tokens=626, completion_tokens=24
  • request 2: prompt_tokens=626, completion_tokens=24

This confirms that the second request reused both normal cached KV blocks and a compatible linear-attention checkpoint.

Notes

  • quant_policy remains KV-only in this PR.

lapy added 5 commits March 25, 2026 08:26

lapy commented Mar 25, 2026

TurboMind now treats Qwen3.5 hybrid attention as two cache families:

  • standard KV prefix cache for full-attention layers
  • checkpointed Gated DeltaNet (GDN) state for linear-attention layers

On a prefix hit, TurboMind restores both:

  • shared KV blocks for the matched full-attention prefix
  • the deepest compatible cached GDN checkpoint for the matched linear-attention prefix

Key changes

  • Changed the default linear_prefix_cache_interval_blocks from 2 to 64.
  • Updated the CLI/help text to describe the tradeoff more clearly: larger values reduce GDN checkpoint memory usage but increase recompute after a prefix hit.

New findings

The important new finding is that hybrid prefix caching must budget for three separate memory buckets:

  • KV cache blocks
  • live per-sequence GDN state
  • cached GDN prefix checkpoints

Before this change, TurboMind only effectively budgeted KV blocks plus live GDN state. The cached GDN checkpoint pool was allocated lazily and not included in the initial capacity estimate.

For Qwen3.5-27B AWQ on tp=2 with quant_policy=8, a single cached GDN checkpoint is much larger than it first appears:

  • one int8 KV block of 64 items: about 1.016 MiB
  • one GDN checkpoint snapshot slot: about 37.9 MiB per rank

So while a single live GDN state is constant-size, dense GDN checkpointing grows linearly with cached prefix length and, once its memory is reserved up front, can significantly reduce available context capacity.

linear_prefix_cache_interval_blocks controls how many KV blocks elapse between GDN snapshots.

Example: with the default interval of 64, a GDN snapshot (37.9 MiB) is saved for every 64 int8 KV blocks (about 65 MiB), a reasonable tradeoff between reuse and memory footprint.
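The tradeoff can be checked with simple arithmetic using the per-unit sizes quoted in this thread (1.016 MiB per int8 KV block, 37.9 MiB per GDN checkpoint slot per rank); the helper name is illustrative.

```python
# Figures taken from the numbers quoted in this thread (Qwen3.5-27B AWQ,
# tp=2, quant_policy=8); helper name is illustrative.
KV_BLOCK_MIB = 1.016  # one int8 KV block of 64 items
CKPT_MIB = 37.9       # one GDN checkpoint snapshot slot, per rank

def checkpoint_overhead(interval_blocks):
    """Fraction of prefix-cache memory spent on GDN checkpoints per rank."""
    kv_mib = interval_blocks * KV_BLOCK_MIB
    return CKPT_MIB / (kv_mib + CKPT_MIB)

# interval 2:   checkpoints consume ~95% of the cache budget (far too dense)
# interval 64:  ~37% (the new default)
# interval 128: ~23%
```

This mirrors the measured capacity numbers below: denser checkpointing leaves far fewer tokens' worth of KV cache.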

Real-model observations

Validated on QuantTrio/Qwen3.5-27B-AWQ with TurboMind and real repeated requests.

Observed hybrid prefix hit on repeated prompt:

  • hit blocks 8
  • linear_cache_len 512
  • after matching, blocks 8, cache_len 512

Observed context-capacity impact when GDN checkpoint memory is included in the budget:

  • interval 2: max cached tokens = 27200
  • interval 64: max cached tokens = 337600
  • interval 128: max cached tokens = 413952

This showed that interval 2 is far too dense for this model on 32GB V100s, while 64 and 128 are both practical. Based on these results, the default was changed to 64.

Runtime behavior

  • Huge prompts can still run even if new GDN checkpoints cannot be stored.
  • If GDN checkpoint staging would exceed the per-batch budget, TurboMind skips storing new GDN checkpoints for that batch instead of aborting the request.
  • If the GDN checkpoint slot pool is exhausted, deeper checkpoints are skipped until cached entries are evicted.
  • Prefix caching remains an opportunistic acceleration rather than a hard requirement for forward progress.
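The "skip, never abort" policy above can be sketched as follows; the pool and slot names are hypothetical and do not correspond to actual TurboMind symbols.

```python
# Illustrative sketch of the opportunistic checkpoint-storage policy
# described above; class and function names are hypothetical.
class CheckpointPool:
    """Fixed-size pool of GDN checkpoint slots."""
    def __init__(self, capacity):
        self.free = list(range(capacity))

    def try_acquire(self):
        return self.free.pop() if self.free else None

    def release(self, slot):
        self.free.append(slot)

def maybe_store_checkpoint(pool, store_fn):
    """Store a new checkpoint if a slot is available; otherwise skip.
    The request always continues either way."""
    slot = pool.try_acquire()
    if slot is None:
        return False  # pool exhausted: skip storing, never abort
    store_fn(slot)
    return True
```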

Validation

  • QuantTrio/Qwen3.5-27B-AWQ
  • CUDA_VISIBLE_DEVICES=1,2
  • tp=2
  • quant_policy=8
  • prefix caching enabled

@lvhan028 lvhan028 requested a review from lzhangzz March 26, 2026 09:56