feat: Turbomind linear gdn prefix caching #4465
…ntation; add related tests. This change enhances memory management for hybrid models by increasing the checkpoint interval, which may reduce memory usage but requires more recompute after prefix hits.
TurboMind now treats Qwen3.5 hybrid attention as two cache families:

- regular attention layers, cached as paged KV blocks
- linear-attention (GDN) layers, cached as periodic recurrent-state checkpoints

On a prefix hit, TurboMind restores both:

- the cached KV blocks covering the matched prefix
- the nearest GDN checkpoint at or before the match point, recomputing GDN state over the remainder
Key changes

- New `linear_prefix_cache_interval_blocks` option controlling GDN checkpoint density
- GDN checkpoint memory is now included in the initial cache-capacity budget
New findings

The important new finding is that hybrid prefix caching must budget for three separate memory buckets:

- paged KV cache blocks
- the live GDN state of running sequences
- the cached GDN checkpoint pool used for prefix reuse
Before this change, TurboMind only effectively budgeted KV blocks plus the live GDN state. The cached GDN checkpoint pool was allocated lazily and was not included in the initial capacity estimate. For Qwen3.5-27B AWQ on
So while a single live GDN state is constant-size, dense GDN checkpointing grows linearly with cached prefix length and can significantly reduce available context capacity once it is budgeted conservatively. `linear_prefix_cache_interval_blocks` controls how many KV blocks elapse between GDN snapshots. Example: with the default interval of 64, a GDN snapshot (37.9 MiB) is saved every 64 int8 KV blocks (65 MiB), a reasonable tradeoff between reuse granularity and memory footprint.

Real-model observations

Validated on QuantTrio/Qwen3.5-27B-AWQ. Observed hybrid prefix hit on repeated prompt:
Observed context-capacity impact when GDN checkpoint memory is included in the budget:
This showed that the interval setting directly trades checkpoint memory against available context capacity.

Runtime behavior
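Taking the figures above at face value (one 37.9 MiB GDN snapshot per 64 int8 KV blocks, and reading the 65 MiB as the combined size of those 64 blocks — an interpretation, since the text does not spell out per-block size), the checkpoint overhead and its sensitivity to the interval can be sketched:

```python
SNAPSHOT_MIB = 37.9          # GDN snapshot size reported in this PR
KV_PER_INTERVAL_MIB = 65.0   # assumed total size of 64 int8 KV blocks
DEFAULT_INTERVAL = 64        # linear_prefix_cache_interval_blocks default

def checkpoint_fraction(interval_blocks: int) -> float:
    """Fraction of hybrid-cache memory consumed by GDN checkpoints,
    assuming snapshot density scales inversely with the interval."""
    # halving the interval doubles the snapshots over the same KV span
    snap_mib = SNAPSHOT_MIB * (DEFAULT_INTERVAL / interval_blocks)
    return snap_mib / (snap_mib + KV_PER_INTERVAL_MIB)

frac_64 = checkpoint_fraction(64)    # default interval
frac_32 = checkpoint_fraction(32)    # denser checkpoints, more memory
frac_128 = checkpoint_fraction(128)  # sparser, more recompute after hits
```

Under this reading, checkpoints are a substantial slice of cache memory at the default interval, which lines up with the capacity impact noted above.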
Validation
Qwen3.5 Hybrid Prefix Caching in TurboMind
Summary
This change adds prefix caching support for Qwen3.5 hybrid-attention models in TurboMind.
The implementation keeps `quant_policy` scoped to KV cache quantization. GDN prefix checkpoints remain in the model/state dtypes in this version.

User-Facing Changes
- Adds `linear_prefix_cache_interval_blocks` to `TurbomindEngineConfig`
- Adds `--linear-prefix-cache-interval-blocks` to the TurboMind CLI surface
- 2 KV blocks
- < 1

Runtime Design
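As a stand-in for the new user-facing surface (a sketch only — not lmdeploy's actual CLI wiring; only the flag name and its default of 64 come from this PR):

```python
import argparse

# Illustrative mirror of the new TurboMind flag; in the real code the
# option lives in lmdeploy/cli and feeds TurbomindEngineConfig.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--linear-prefix-cache-interval-blocks",
    type=int,
    default=64,  # default interval stated in the PR discussion
    help="Save one GDN snapshot every N KV blocks; larger N trades "
         "checkpoint memory for extra recompute after prefix hits.")

args = parser.parse_args(["--linear-prefix-cache-interval-blocks", "32"])
engine_opts = {"linear_prefix_cache_interval_blocks":
               args.linear_prefix_cache_interval_blocks}
```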
Hybrid cache structure
Prefix matching
On a prefix hit:

- matched KV blocks are reused directly
- GDN state is restored from the nearest checkpoint at or before the match point and recomputed forward over the remaining blocks
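The hit path above can be modeled with a small sketch (simplified logic in Python, not TurboMind's actual C++; block-granular bookkeeping and the function name are illustrative):

```python
def plan_prefix_restore(matched_kv_blocks: int, interval_blocks: int = 64):
    """For a prefix hit covering `matched_kv_blocks` cached KV blocks,
    choose the newest GDN checkpoint at or before the match point and
    count the KV blocks to recompute when rebuilding GDN state."""
    # checkpoints exist at multiples of the interval
    ckpt_block = (matched_kv_blocks // interval_blocks) * interval_blocks
    recompute_blocks = matched_kv_blocks - ckpt_block
    return ckpt_block, recompute_blocks

# 150 matched blocks with the default interval of 64: restore the
# snapshot taken at block 128, recompute over the remaining 22 blocks
plan = plan_prefix_restore(150, 64)
```

Larger intervals shift cost from checkpoint memory to this recompute step, matching the tradeoff described in the summary.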
Cache maintenance
Main Code Areas

- lmdeploy/messages.py
- lmdeploy/cli/utils.py
- lmdeploy/cli/cli.py
- lmdeploy/cli/serve.py
- src/turbomind/turbomind.cc
- src/turbomind/models/llama/llama_params.h
- src/turbomind/engine/engine.cc
- src/turbomind/models/llama/BlockTrie.h
- src/turbomind/models/llama/BlockTrie.cc
- src/turbomind/models/llama/SequenceManager.h
- src/turbomind/models/llama/SequenceManager.cc
- src/turbomind/models/llama/GatedDeltaNetLayer.h
- src/turbomind/models/llama/GatedDeltaNetLayer.cc
- src/turbomind/models/llama/gated_delta_net_kernels.h
- src/turbomind/models/llama/gated_delta_net_kernels.cu

Test Coverage Added
Python tests

- tests/test_lmdeploy/test_turbomind/test_engine_config.py
- tests/test_lmdeploy/test_turbomind/test_api_server.py
- api_server forwards hybrid prefix-cache options into TurbomindEngineConfig
- api_server uses the normal default CUDA max_batch_size when the user does not set one explicitly

Test commands run
Observed results

- test_engine_config.py + test_api_server.py: 5 passed
- test_converter.py: 5 passed
- _turbomind rebuilt successfully

Real-Model Validation
Model: QuantTrio/Qwen3.5-27B-AWQ

Command used:
Observed startup details:

- max cached tokens: 533248
- 8320 tokens

Observed hybrid prefix-cache hit on repeated request:
Request details:

- first request: prompt_tokens=626, completion_tokens=24
- second request: prompt_tokens=626, completion_tokens=24

This confirms that the second request reused both normal cached KV blocks and a compatible linear-attention checkpoint.
Notes
`quant_policy` remains KV-only in this PR.