APITest test fix: fix stale checkpoint entries left behind when a run is force-killed externally #646
Open
cangtianhuang wants to merge 4 commits into
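Background
When engineV2 / engineV4 of PaddleAPITest run API cases in batch, the checkpoint used to be written early, shortly after a worker started executing a case. This produces inconsistent state in some abnormal-interruption scenarios:
After a worker is killed by SIGKILL/SIGTERM, the next resume incorrectly skips that case.
This PR changes the checkpoint write semantics: a case is written to the checkpoint only after the main process has confirmed that its result has been archived/classified. If an external kill occurs, nothing is written to the checkpoint, so the unfinished case can be retried.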
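The ordering described above can be illustrated with a minimal sketch. It is not the PR's actual code: `classify`, `finalize_case`, `pending_cases`, the one-case-per-line file layout, and the `timed_out` flag are all hypothetical names and assumptions. The only library behavior relied on is that Python's `multiprocessing.Process.exitcode` is `-N` when the child was terminated by signal `N`.

```python
import signal
import tempfile
from pathlib import Path

# Hypothetical sketch; names and file layout are assumptions, not PaddleAPITest code.
EXTERNAL_KILL_SIGNALS = {int(signal.SIGKILL), int(signal.SIGTERM)}

def classify(exitcode, timed_out=False):
    """Map a finished worker's exit code to a terminal status, or None."""
    if timed_out:
        return "timeout"   # the engine itself stopped the worker
    if exitcode == 0:
        return "done"
    if -exitcode in EXTERNAL_KILL_SIGNALS:
        return None        # external kill: not a terminal state
    if exitcode < 0:
        return "crashed"   # died from another signal, e.g. SIGSEGV
    return "error"

def finalize_case(case, exitcode, result_dir, timed_out=False):
    """Main-process side: archive the result first, checkpoint last."""
    status = classify(exitcode, timed_out)
    if status is None:
        return False       # nothing is recorded, so resume retries the case
    result_dir = Path(result_dir)
    with open(result_dir / f"{status}.txt", "a") as f:
        f.write(case + "\n")
    with open(result_dir / "checkpoint.txt", "a") as f:
        f.write(case + "\n")
    return True

def pending_cases(all_cases, result_dir):
    """Cases that a resume still has to run."""
    ckpt = Path(result_dir) / "checkpoint.txt"
    done = set(ckpt.read_text().splitlines()) if ckpt.exists() else set()
    return [c for c in all_cases if c not in done]

with tempfile.TemporaryDirectory() as d:
    finalize_case("case_a", 0, d)
    finalize_case("case_b", -int(signal.SIGKILL), d)   # killed from outside
    finalize_case("case_c", -int(signal.SIGSEGV), d)   # genuine crash
    todo = pending_cases(["case_a", "case_b", "case_c"], d)
```

In this sketch only `case_b` remains pending after the run, because the externally killed case never reaches the checkpoint. How the real engines tell an external kill apart from their own timeout kill is not shown in this description; the `timed_out` flag is a stand-in for that decision.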
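Main changes
1. Adjust engineV2 checkpoint write timing
Remove the checkpoint write at the start of run_test_case().
Handle the scenario where the worker is killed by SIGKILL/SIGTERM as a special case.
2. Adjust engineV4 checkpoint and external kill handling
Remove the checkpoint write at the start of run_test_case().
Write the checkpoint in one place, after the done/timeout/crashed/error classification.
Handle the SIGKILL/SIGTERM scenario as a special case.
3. Improve engineV4 compute-sanitizer crash semantics
crash_source="child".
4. Add stale result log cleanup logic
Add cleanup_uncheckpointed_result_logs() to tester/api_config/log_writer.py. It cleans up:
.tmp logs;
terminal-state log lines for cases not recorded in checkpoint.txt.
The cleanup covers every kind of result log other than the checkpoint itself, for example pass, paddle error, accuracy error, timeout, crash, OOM, and CUDA error.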
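As a rough illustration of this kind of cleanup, the sketch below prunes result logs against a checkpoint file. It is not the implementation of cleanup_uncheckpointed_result_logs(): the function name `prune_uncheckpointed`, the single flat log directory, the `*.txt` / `*.tmp` naming, and the one-case-per-line format are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

def prune_uncheckpointed(log_dir, checkpoint_name="checkpoint.txt"):
    """Hypothetical sketch: drop log entries whose case is not checkpointed.

    Returns the number of log lines removed.
    """
    log_dir = Path(log_dir)
    ckpt = log_dir / checkpoint_name
    done = set(ckpt.read_text().splitlines()) if ckpt.exists() else set()

    # Partial per-worker logs are never trusted after an interruption.
    for tmp in list(log_dir.glob("*.tmp")):
        tmp.unlink()

    removed = 0
    for log in list(log_dir.glob("*.txt")):
        if log.name == checkpoint_name:
            continue  # the checkpoint itself is the source of truth
        lines = log.read_text().splitlines()
        kept = [ln for ln in lines if ln in done]
        removed += len(lines) - len(kept)
        log.write_text("".join(ln + "\n" for ln in kept))
    return removed

with tempfile.TemporaryDirectory() as d:
    d = Path(d)
    (d / "checkpoint.txt").write_text("case_a\n")
    (d / "pass.txt").write_text("case_a\ncase_b\n")   # case_b was never checkpointed
    (d / "worker0.tmp").write_text("case_b\n")
    removed = prune_uncheckpointed(d)
    remaining = (d / "pass.txt").read_text()
    tmp_left = list(d.glob("*.tmp"))
```

Treating the checkpoint as the single source of truth means a retried case cannot end up with two result lines, one from the interrupted run and one from the retry.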
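Behavior changes
Checkpoint semantics change
Before:
After:
This prevents an unfinished case from being incorrectly skipped after its worker is killed externally.
External kill recovery strategy
Result log recovery strategy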