feat: Turbomind linear gdn prefix caching #4465
…ntation; add related tests. This change enhances memory management for hybrid models by increasing the checkpoint interval, which may reduce memory usage but requires more recompute after prefix hits.
TurboMind now treats Qwen3.5 hybrid attention as two cache families:

- regular attention layers, cached as paged KV blocks
- linear-attention (GDN) layers, cached as periodic recurrent-state checkpoints

On a prefix hit, TurboMind restores both:

- the cached KV blocks covering the matched prefix
- the nearest GDN checkpoint at or before the match point, recomputing GDN state over the remainder
Key changes

- New `linear_prefix_cache_interval_blocks` option controlling GDN checkpoint density
- GDN checkpoint memory is now included in the initial cache-capacity budget
New findings

The important new finding is that hybrid prefix caching must budget for three separate memory buckets:

- paged KV cache blocks
- the live GDN state of running sequences
- the cached GDN checkpoint pool used for prefix reuse
Before this change, TurboMind only effectively budgeted KV blocks plus the live GDN state. The cached GDN checkpoint pool was allocated lazily and was not included in the initial capacity estimate. For Qwen3.5-27B AWQ on
So while a single live GDN state is constant-size, dense GDN checkpointing grows linearly with cached prefix length and can significantly reduce available context capacity once it is budgeted conservatively. `linear_prefix_cache_interval_blocks` controls how many KV blocks elapse between GDN snapshots. Example: with the default interval of 64, a GDN snapshot (37.9 MiB) is saved every 64 int8 KV blocks (65 MiB), a reasonable tradeoff between reuse granularity and memory footprint.

Real-model observations

Validated on QuantTrio/Qwen3.5-27B-AWQ. Observed hybrid prefix hit on repeated prompt:
Observed context-capacity impact when GDN checkpoint memory is included in the budget:
This showed that the interval setting directly trades checkpoint memory against available context capacity.

Runtime behavior
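Taking the figures above at face value (one 37.9 MiB GDN snapshot per 64 int8 KV blocks, and reading the 65 MiB as the combined size of those 64 blocks — an interpretation, since the text does not spell out per-block size), the checkpoint overhead and its sensitivity to the interval can be sketched:

```python
SNAPSHOT_MIB = 37.9          # GDN snapshot size reported in this PR
KV_PER_INTERVAL_MIB = 65.0   # assumed total size of 64 int8 KV blocks
DEFAULT_INTERVAL = 64        # linear_prefix_cache_interval_blocks default

def checkpoint_fraction(interval_blocks: int) -> float:
    """Fraction of hybrid-cache memory consumed by GDN checkpoints,
    assuming snapshot density scales inversely with the interval."""
    # halving the interval doubles the snapshots over the same KV span
    snap_mib = SNAPSHOT_MIB * (DEFAULT_INTERVAL / interval_blocks)
    return snap_mib / (snap_mib + KV_PER_INTERVAL_MIB)

frac_64 = checkpoint_fraction(64)    # default interval
frac_32 = checkpoint_fraction(32)    # denser checkpoints, more memory
frac_128 = checkpoint_fraction(128)  # sparser, more recompute after hits
```

Under this reading, checkpoints are a substantial slice of cache memory at the default interval, which lines up with the capacity impact noted above.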
Validation
Qwen3.5 Hybrid Prefix Caching in TurboMind
Summary
This change adds prefix caching support for Qwen3.5 hybrid-attention models in TurboMind.
The implementation keeps `quant_policy` scoped to KV cache quantization. GDN prefix checkpoints remain in the model/state dtypes in this version.

User-Facing Changes
- Adds `linear_prefix_cache_interval_blocks` to `TurbomindEngineConfig`
- Adds `--linear-prefix-cache-interval-blocks` to the TurboMind CLI surface
- 2 KV blocks
- < 1

Runtime Design
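As a stand-in for the new user-facing surface (a sketch only — not lmdeploy's actual CLI wiring; only the flag name and its default of 64 come from this PR):

```python
import argparse

# Illustrative mirror of the new TurboMind flag; in the real code the
# option lives in lmdeploy/cli and feeds TurbomindEngineConfig.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--linear-prefix-cache-interval-blocks",
    type=int,
    default=64,  # default interval stated in the PR discussion
    help="Save one GDN snapshot every N KV blocks; larger N trades "
         "checkpoint memory for extra recompute after prefix hits.")

args = parser.parse_args(["--linear-prefix-cache-interval-blocks", "32"])
engine_opts = {"linear_prefix_cache_interval_blocks":
               args.linear_prefix_cache_interval_blocks}
```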
Hybrid cache structure
Prefix matching
On a prefix hit:

- matched KV blocks are reused directly
- GDN state is restored from the nearest checkpoint at or before the match point and recomputed forward over the remaining blocks
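The hit path above can be modeled with a small sketch (simplified logic in Python, not TurboMind's actual C++; block-granular bookkeeping and the function name are illustrative):

```python
def plan_prefix_restore(matched_kv_blocks: int, interval_blocks: int = 64):
    """For a prefix hit covering `matched_kv_blocks` cached KV blocks,
    choose the newest GDN checkpoint at or before the match point and
    count the KV blocks to recompute when rebuilding GDN state."""
    # checkpoints exist at multiples of the interval
    ckpt_block = (matched_kv_blocks // interval_blocks) * interval_blocks
    recompute_blocks = matched_kv_blocks - ckpt_block
    return ckpt_block, recompute_blocks

# 150 matched blocks with the default interval of 64: restore the
# snapshot taken at block 128, recompute over the remaining 22 blocks
plan = plan_prefix_restore(150, 64)
```

Larger intervals shift cost from checkpoint memory to this recompute step, matching the tradeoff described in the summary.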
Cache maintenance
Main Code Areas

- lmdeploy/messages.py
- lmdeploy/cli/utils.py
- lmdeploy/cli/cli.py
- lmdeploy/cli/serve.py
- src/turbomind/turbomind.cc
- src/turbomind/models/llama/llama_params.h
- src/turbomind/engine/engine.cc
- src/turbomind/models/llama/BlockTrie.h
- src/turbomind/models/llama/BlockTrie.cc
- src/turbomind/models/llama/SequenceManager.h
- src/turbomind/models/llama/SequenceManager.cc
- src/turbomind/models/llama/GatedDeltaNetLayer.h
- src/turbomind/models/llama/GatedDeltaNetLayer.cc
- src/turbomind/models/llama/gated_delta_net_kernels.h
- src/turbomind/models/llama/gated_delta_net_kernels.cu

Test Coverage Added
Python tests

- tests/test_lmdeploy/test_turbomind/test_engine_config.py
- tests/test_lmdeploy/test_turbomind/test_api_server.py
- api_server forwards hybrid prefix-cache options into TurbomindEngineConfig
- api_server uses the normal default CUDA max_batch_size when the user does not set one explicitly

Test commands run
Observed results

- test_engine_config.py + test_api_server.py: 5 passed
- test_converter.py: 5 passed
- _turbomind rebuilt successfully

Real-Model Validation
Model: QuantTrio/Qwen3.5-27B-AWQ

Command used:
Observed startup details:

- max cached tokens: 533248
- 8320 tokens

Observed hybrid prefix-cache hit on repeated request:
Request details:

- first request: prompt_tokens=626, completion_tokens=24
- second request: prompt_tokens=626, completion_tokens=24

This confirms that the second request reused both normal cached KV blocks and a compatible linear-attention checkpoint.
Notes
`quant_policy` remains KV-only in this PR.