
Conversation

@SS-JIA (Contributor) commented Feb 9, 2026

Stack from ghstack (oldest at bottom):

Tensors sharing physical memory via `SharedObject` each track their own
`last_access_` independently. When a tensor's first access is a write,
`prev_stage` is `NO_STAGE`, causing `transition()` to use
`TOP_OF_PIPE_BIT` as `srcStageMask` with no `srcAccessMask`, which is
effectively a no-op barrier. If the same physical memory was previously
written through a different aliased tensor handle, this creates a WAW
(write-after-write) hazard where the new write may execute before or
concurrently with the prior write, producing non-deterministic results.
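
As an illustration, here is a minimal C++ sketch of the aliasing scenario. The types `LastAccess` and `Tensor` are simplified stand-ins for the purposes of this sketch; the actual ExecuTorch Vulkan declarations differ in names and detail:

```cpp
#include <vulkan/vulkan.h>

// Per-handle access tracking, as described above. Illustrative
// stand-ins, not the real ExecuTorch declarations.
struct LastAccess {
  VkPipelineStageFlags stage = 0;  // 0 plays the role of NO_STAGE
  VkAccessFlags access = 0;
};

struct Tensor {
  VkBuffer buffer;          // bound to the SharedObject's VkDeviceMemory
  LastAccess last_access_;  // tracked per tensor handle, not per allocation
};

// Tensors A and B alias the same memory. A shader writes through A,
// updating only A.last_access_. When B's first access is a write,
// B.last_access_.stage is still 0 (NO_STAGE), so the barrier emitted
// for B uses TOP_OF_PIPE_BIT with srcAccessMask == 0. Nothing orders
// B's write against A's earlier write: a WAW race.
```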

This was observed as non-deterministic `q8ta_conv2d` output in ResNet50:
running the model twice with the same input produced slightly different
quantized int8 values. Adding a debug print shader after each conv2d
dispatch masked the issue because the print node's read-after-write
barrier serialized GPU work.

The fix: when `prev_stage` is `NO_STAGE` and the current access is a
write, use `COMPUTE_SHADER_BIT` with `SHADER_WRITE_BIT` instead of
`TOP_OF_PIPE_BIT` with no access flags. This ensures all prior compute
shader work completes and its writes are made visible before the new
write begins.
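
A minimal sketch of the corrected barrier selection, reusing the `LastAccess` stand-in from the sketch above and assuming a simplified `transition()` signature; the real ExecuTorch implementation differs in structure:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical simplified transition(): emits a memory barrier moving a
// tensor from its previously recorded access to the new one.
void transition(
    VkCommandBuffer cmd,
    LastAccess& prev,  // the tensor's last_access_
    VkPipelineStageFlags cur_stage,
    VkAccessFlags cur_access) {
  VkPipelineStageFlags src_stage = prev.stage;
  VkAccessFlags src_access = prev.access;

  if (src_stage == 0) {  // first recorded access through this handle
    if ((cur_access & VK_ACCESS_SHADER_WRITE_BIT) != 0) {
      // The fix: aliased handles may have written this memory already,
      // so wait for prior compute work and make its writes available.
      src_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
      src_access = VK_ACCESS_SHADER_WRITE_BIT;
    } else {
      // Non-write first access keeps the previous behavior.
      src_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
      src_access = 0;
    }
  }

  const VkMemoryBarrier barrier{
      VK_STRUCTURE_TYPE_MEMORY_BARRIER, nullptr, src_access, cur_access};
  vkCmdPipelineBarrier(
      cmd,
      src_stage,
      cur_stage,
      /*dependencyFlags=*/0,
      /*memoryBarrierCount=*/1,
      &barrier,
      0, nullptr,   // no buffer memory barriers
      0, nullptr);  // no image memory barriers

  prev = {cur_stage, cur_access};
}
```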

Authored with Claude.

Differential Revision: [D92715369](https://our.internmc.facebook.com/intern/diff/D92715369/)


pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17309

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 3 Unrelated Failures

As of commit 0c048bb with merge base ba89c69:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed, likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label Feb 9, 2026
SS-JIA pushed a commit that referenced this pull request Feb 9, 2026
…nsors

ghstack-source-id: 339490477
Pull Request resolved: #17309

github-actions bot commented Feb 9, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA pushed a commit that referenced this pull request Feb 9, 2026
…nsors

Pull Request resolved: #17309
ghstack-source-id: 339541957
SS-JIA pushed a commit that referenced this pull request Feb 10, 2026
…nsors

Pull Request resolved: #17309
ghstack-source-id: 339884030
meta-codesync bot merged commit d78828d into gh/SS-JIA/412/base Feb 10, 2026
179 of 189 checks passed
meta-codesync bot deleted the gh/SS-JIA/412/head branch February 10, 2026 18:38