[ET-VK] Fix missing memory barrier for first-use writes on aliased tensors #17309
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17309
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 3 Unrelated Failures as of commit 0c048bb with merge base ba89c69.
NEW FAILURES: the following jobs have failed.
FLAKY: the following job failed, but likely due to flakiness present on trunk.
BROKEN TRUNK: the following jobs failed, but the failures were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Merged commit d78828d into gh/SS-JIA/412/base.
Stack from ghstack (oldest at bottom):
Tensors sharing physical memory via SharedObject each track their own
`last_access_` independently. When a tensor's first access is a write, `prev_stage` is `NO_STAGE`, causing `transition()` to use `TOP_OF_PIPE_BIT` as `srcStageMask` with no `srcAccessMask`, effectively a no-op barrier. If the same physical memory was previously written
through a different aliased tensor handle, this creates a WAW hazard
where the new write may execute before or concurrently with the prior
write, producing non-deterministic results.
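To make the hazard concrete, here is a minimal sketch of the pre-fix situation. The struct layout, field names, and flag values are hypothetical stand-ins (local constants instead of the real Vulkan `VkPipelineStageFlagBits`), chosen only to show why a first-use write through a second aliased handle produces a no-op barrier source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the Vulkan stage bits involved, so the sketch
// compiles and runs without a Vulkan SDK.
constexpr uint32_t NO_STAGE = 0;
constexpr uint32_t TOP_OF_PIPE_BIT = 0x1;
constexpr uint32_t COMPUTE_SHADER_BIT = 0x800;

// Minimal model of the aliasing setup: each tensor handle tracks its own
// last-access stage, even when two handles share one SharedObject
// allocation (i.e. the same physical memory).
struct TensorHandle {
  int shared_object_id;                 // same id => same physical memory
  uint32_t last_access_stage = NO_STAGE;
};

// Pre-fix behavior: the barrier's source stage comes only from this
// handle's own history, so a first-use access sees NO_STAGE and falls back
// to TOP_OF_PIPE_BIT, which waits on nothing and flushes nothing.
uint32_t src_stage_pre_fix(const TensorHandle& t) {
  return t.last_access_stage == NO_STAGE ? TOP_OF_PIPE_BIT
                                         : t.last_access_stage;
}
```

Running this against two handles that alias the same allocation shows the gap: a write through the second handle sees `NO_STAGE` and emits a no-op barrier, even though a compute dispatch already wrote that memory through the first handle.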
This was observed as non-deterministic q8ta_conv2d output in ResNet50:
running the model twice with the same input produced slightly different
quantized int8 values. Adding a debug print shader after each conv2d
dispatch masked the issue because the print node's read-after-write
barrier serialized GPU work.
The fix: when `prev_stage` is `NO_STAGE` and the current access is a write, use `COMPUTE_SHADER_BIT` with `SHADER_WRITE_BIT` instead of `TOP_OF_PIPE_BIT` with no access flags. This ensures all prior compute shader work completes and its writes are made visible before the new write begins.
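The corrected selection logic can be sketched as follows. This is not the actual `transition()` implementation; the function name, struct, and flag constants are illustrative stand-ins, and the non-first-use branch is simplified to keep the sketch self-contained.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical local stand-ins for the Vulkan stage/access bits, so the
// sketch compiles and runs without a Vulkan SDK.
constexpr uint32_t NO_STAGE = 0;
constexpr uint32_t TOP_OF_PIPE_BIT = 0x1;
constexpr uint32_t COMPUTE_SHADER_BIT = 0x800;
constexpr uint32_t NO_ACCESS = 0;
constexpr uint32_t SHADER_WRITE_BIT = 0x40;

struct BarrierSrc {
  uint32_t stage_mask;
  uint32_t access_mask;
};

// Corrected source-side selection: a first-use *write* must assume prior
// compute work may have written the same (aliased) memory, so it waits on
// compute shaders and makes their writes available. A first-use read keeps
// the TOP_OF_PIPE source, which has nothing to wait on.
BarrierSrc select_barrier_src(uint32_t prev_stage, bool cur_access_is_write) {
  if (prev_stage == NO_STAGE) {
    if (cur_access_is_write) {
      return {COMPUTE_SHADER_BIT, SHADER_WRITE_BIT};  // the fix
    }
    return {TOP_OF_PIPE_BIT, NO_ACCESS};  // previous behavior, still fine for reads
  }
  return {prev_stage, SHADER_WRITE_BIT};  // normal case (simplified here)
}
```

With this selection, the first write through a freshly tracked aliased handle is ordered after all prior compute shader writes, closing the WAW hazard while leaving first-use reads unchanged.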
Authored with Claude.
Differential Revision: D92715369