Fix misleading data_time / batch_time / data_wait metrics on CUDA (async bleed-through)#749
DLemming wants to merge 7 commits into
mrpositron
left a comment
Thanks for the contribution! I left a few comments.
Thanks for the feedback. I implemented your suggestions. The progress bar now reads the profiling metrics from trainer.callback_metrics.

One honest limitation worth noting: when the GPU has not reached the end event by the time the next step starts, we skip that step's update rather than block. As a result, batch_time/data_time can go stale, and the metrics are slightly biased toward less compute-bound steps. I think that is an acceptable trade-off against the old behavior, which reported flat-out wrong numbers (~60% data_wait instead of ~10%).

In practice I have not been able to trigger the stale path once, even on clearly compute-bound runs (e.g. DINOv2), so something in the loop seems to synchronize often enough that the end event is usually ready by the next step start anyway. Happy to look into that edge case later, but for now this seems like a solid improvement over the currently available version.
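For readers following the thread, the skip-instead-of-block behavior can be sketched roughly as below. This is a minimal illustration, not the code in this PR: StepTimer, FakeEvent, gpu_time and the hook names on_batch_start / on_batch_end are hypothetical. The only real API assumed is the torch.cuda.Event interface (record(), query(), and elapsed_time() returning milliseconds), which FakeEvent imitates so the example runs without a GPU.

```python
import time


class StepTimer:
    """Hypothetical helper (not the PR's code): GPU time and data time per step."""

    def __init__(self, make_event, clock=time.perf_counter):
        # make_event must return objects with record(), query() and
        # elapsed_time(end) in milliseconds, like torch.cuda.Event(enable_timing=True).
        self.make_event = make_event
        self.clock = clock
        self.pending = None    # [wall_start, start_event, end_event] of the last step
        self.gpu_time = None   # seconds; keeps its old value when an update is skipped
        self.data_time = None

    def on_batch_start(self):
        now = self.clock()
        if self.pending is not None and self.pending[2] is not None:
            wall_start, start_ev, end_ev = self.pending
            if end_ev.query():  # True once the GPU has passed the end event
                gpu_s = start_ev.elapsed_time(end_ev) / 1000.0
                self.gpu_time = gpu_s
                # wall gap (start -> start) minus GPU duration
                self.data_time = max((now - wall_start) - gpu_s, 0.0)
            # Otherwise skip this update instead of blocking on the GPU.
        start_ev = self.make_event()
        start_ev.record()
        self.pending = [now, start_ev, None]

    def on_batch_end(self):
        end_ev = self.make_event()
        end_ev.record()
        self.pending[2] = end_ev


class FakeEvent:
    """CPU-only stand-in so the sketch runs without a GPU."""

    def __init__(self, stamp_ms, ready=True):
        self.stamp_ms, self.ready = stamp_ms, ready

    def record(self):
        pass  # the timestamp is fixed at construction in this fake

    def query(self):
        return self.ready

    def elapsed_time(self, end_event):
        return end_event.stamp_ms - self.stamp_ms


# Simulated run with invented numbers: GPU events at 0 ms and 80 ms,
# step starts 100 ms apart on the wall clock.
stamps = iter([0.0, 80.0, 100.0])
wall = iter([0.0, 0.1])
timer = StepTimer(lambda: FakeEvent(next(stamps)), clock=lambda: next(wall))
timer.on_batch_start()
timer.on_batch_end()
timer.on_batch_start()
print(timer.gpu_time, timer.data_time)
```

On a real CUDA run the factory would be something like lambda: torch.cuda.Event(enable_timing=True). The query() guard is what makes the read non-blocking, and it is also what leaves the previous values in place when the end event is not ready yet.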
What has changed and why?
On CUDA, on_train_batch_end fires before the GPU finishes the backward pass and optimizer step. The previous implementation recorded batch_end_time at that point, so the remaining async GPU work bled into the next step's data_time, making all data/batch-time related metrics wildly inaccurate.

The result: data_wait reported ~60% when the true value was ~10%. This is actively misleading: users would chase data-loading bottlenecks that don't exist while the GPU was fully utilized the entire time.
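To make the bleed-through concrete, here is a small back-of-the-envelope model. All timings are invented for illustration and are not measurements from this PR. The function name, the assumption that the host synchronizes with the GPU before fetching the next batch, and the definition of data_wait as data time divided by the start-to-start wall gap are assumptions of this sketch.

```python
def data_wait_fractions(data_load, host_launch, gpu_work):
    """Return (old, new) data_wait fractions for one step; inputs in seconds."""
    start = 0.0
    host_end = start + host_launch   # when on_train_batch_end fires on the host
    gpu_done = start + gpu_work      # when the GPU actually finishes the step
    # Assumed: the host waits for the GPU, then fetches the next batch.
    next_start = max(host_end, gpu_done) + data_load
    wall_gap = next_start - start    # start -> start

    old_data_time = next_start - host_end  # host timestamps only: includes the GPU tail
    new_data_time = wall_gap - gpu_work    # wall gap minus GPU duration
    return old_data_time / wall_gap, new_data_time / wall_gap


# Invented example: 10 ms data loading, 40 ms to enqueue kernels, 90 ms of GPU work.
old, new = data_wait_fractions(data_load=0.01, host_launch=0.04, gpu_work=0.09)
print(f"old data_wait: {old:.0%}, new data_wait: {new:.0%}")
```

With these made-up inputs the host-timestamp version attributes 60% of the step to data loading, while subtracting the GPU duration gives 10%, even though only the 10 ms fetch was real data time.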
- _callbacks/tqdm_progress_bar.py: reads profiling/data_time and profiling/batch_time from trainer.callback_metrics.
- _methods/method.py: records a torch.cuda.Event at on_train_batch_start / on_train_batch_end, measuring real GPU time.
- data_time = wall_gap (start → start) − GPU duration (from events)
- Non-CUDA runs keep the time.perf_counter() approach, which remains accurate when compute is synchronous.

How has it been tested?
The changes were tested on CUDA with a SimCLR pretraining run (bs=1232, res=224px) and a DINOv2 pretraining run (bs=128, global_crop_res=196px). data_wait dropped from ~60% to ~10% for SimCLR, and increased from ~0.8% to ~2.5% for DINOv2.

Did you update CHANGELOG.md?
Did you update the documentation?