Summary
The plan executor in execplan.go can produce a reconciliation plan that contains both a "delete Deployment" and an "update Deployment" step for the same k8s Deployment resource. The delete succeeds, then the update fails on a stale UID precondition. The controller then retries the same plan indefinitely without recomputing it from current state.
Environment
- Controller image: v1.5.x (exact TBD)
- Kubernetes: EKS 1.30
- Temporal: self-hosted (Cassandra-backed)
What happened
A helm release changed the desired image tag while an existing version was being sunset. The controller's reconcile plan included:
1. deleting deployment for account-temporal-default-worker-17-235-1-9755 ✅
2. deleted worker resource on version sunset (HPA) ✅
3. updating deployment for account-temporal-default-worker-17-232-4-559b ✅
4. updating deployment for account-temporal-default-worker-17-235-1-9755 ❌
Step 4 failed:
StorageError: invalid object, Code: 4, Key: /registry/deployments/.../account-temporal-default-worker-17-235-1-9755,
ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: be213a24-..., UID in object meta: ""
The controller retried this exact plan for ~3 hours. Restarting the controller produced zero reconcile activity for this TWD — it didn't re-enqueue or recompute.
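The step 4 failure can be reproduced in miniature. The sketch below is a toy in-memory model, not the API server's or the controller's actual code: the `store` type and its `Delete` and `Update` methods are hypothetical names chosen for illustration. It assumes, consistent with the empty UID and `ResourceVersion: 0` in the error above, that a missing object reads as an empty UID when the precondition is checked.

```go
package main

import "fmt"

// store is a toy stand-in for the storage layer (hypothetical, for illustration).
// It maps an object key to the UID of the live object.
type store map[string]string

// Delete removes the object stored at key.
func (s store) Delete(key string) { delete(s, key) }

// Update succeeds only if the live object's UID matches wantUID.
// A missing key reads as the empty UID, so the check fails after a delete.
func (s store) Update(key, wantUID string) error {
	if got := s[key]; got != wantUID {
		return fmt.Errorf("precondition failed: UID in precondition: %s, UID in object meta: %q", wantUID, got)
	}
	return nil
}
```

With `s := store{"worker": "uid-1"}`, calling `s.Delete("worker")` and then `s.Update("worker", "uid-1")` returns the precondition error every time, which is why retrying the unchanged plan can never succeed.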
Impact
The Worker Deployment's current build (17.235.1-9755) had its k8s Deployment deleted, and the Deployment was never recreated. Zero pods existed for the current version, so all new workflows routed to it were stranded.
Proposed fix
- Plan generator should not emit both delete and update for the same Deployment UID in a single plan.
- On plan execution failure, recompute the plan from current state rather than retrying the stale plan.