[BugFix] lookup for all ranks#1033
Conversation
|
|
||
| min_hit_idx = len(external_block_ids) - 1 | ||
| for rank, rank_hasher in enumerate(self._rank_hashers): | ||
| rank_block_ids = ( |
There was a problem hiding this comment.
This loop increases the time spent on Lookup operations. In scenarios with high TP rank, will the performance test pass?
| if role == KVConnectorRole.SCHEDULER: | ||
| self.request_hasher = RequestHasher(vllm_config, 0) | ||
| self._rank_hashers = [ | ||
| RequestHasher(vllm_config, rank % self.tp_size) |
There was a problem hiding this comment.
💡 Suggestion: The _rank_hashers creation uses rank % self.tp_size, but this is redundant since rank is already in range(1, self.tp_size). Consider simplifying to for rank in range(1, self.tp_size) directly, unless there's a specific reason for the modulo operation.
| for block_idx in range(len(candidate_block_ids)): | ||
| begin = block_idx * num_other_ranks | ||
| end = begin + num_other_ranks | ||
| if not all(founds[begin:end]): |
There was a problem hiding this comment.
all() check iterates through founds[begin:end]. If founds is empty or the slice is empty, all([]) returns True, which might not be the intended behavior. Consider adding an explicit check for empty slices.
|
|
||
| try { | ||
| results = std::make_shared<std::vector<uint8_t>>(num, false); | ||
| status = std::make_shared<std::atomic<int32_t>>(ok); |
There was a problem hiding this comment.
prefixLookupSrv_.NWorker() returns 0, waiter->Set(0) would cause waiter->Wait() to return immediately without doing any lookup. The results would remain all false. Consider adding a check for zero workers.
|
|
||
| if role == KVConnectorRole.SCHEDULER: | ||
| self.request_hasher = RequestHasher(vllm_config, 0) | ||
| rank_ids = list( |
There was a problem hiding this comment.
💡 Suggestion: The rank_ids calculation uses dict.fromkeys to deduplicate. Consider documenting the expected behavior when dcp_world_size > tp_size - this could produce unexpected results.
| const auto index = res.Value(); | ||
| for (ssize_t i = 0; i <= index; ++i) { results[i] = true; } | ||
| return results; | ||
| if (num == 0) { return std::vector<uint8_t>{}; } |
There was a problem hiding this comment.
Lookup implementation uses shared_ptr for thread-safe access. However, if prefixLookupSrv_.NWorker() returns 0, waiter->Set(0) would return immediately. Consider adding: if (nWorker == 0) { return std::vector<uint8_t>(num, false); }
Purpose
Fix GQA cache lookup correctness by checking cached block availability across all TP ranks instead of only rank 0. This prevents scheduler-side false hits where rank 0 files exist but other rank-specific files have been removed by GC, which can later cause worker load failures.
Modifications
Test
Tested with 2tp/8tp/2dp 2tp/4tp 2dcp/2tp 2pcp with QwQ-32B or Qwen2.5-1.5B