Skip to content

[Bug] set-ramping-version and set-current-version return success without persisting state when backing system workflow is damaged #354

@tristal

Description

@tristal

Summary

When a Worker Deployment's backing system workflow (temporal-sys-worker-deployment:*) is in a damaged or divergent state, CLI operations (set-ramping-version --delete, set-current-version --unversioned, delete-version) return success messages but don't actually persist state changes. describe shows identical values and unchanged timestamps before and after.

Note: This may be a Temporal server bug rather than a controller bug, but filing here since it was discovered in the context of the controller's reconcile failures and affects the same recovery workflow.

Environment

  • Temporal CLI: latest (as of 2026-05-27)
  • Temporal server: self-hosted (Cassandra-backed)
  • Controller image: v1.5.x

What happened

After the controller's plan executor bug left a Worker Deployment in a corrupted state (see: plan executor delete+update bug), we attempted manual cleanup:

$ temporal worker deployment set-ramping-version --deployment-name '...' --build-id '17.233.1-497c' --delete --yes

Successfully set the ramping worker deployment version

$ temporal worker deployment describe --name '...' --namespace account

RampingVersionBuildID: 17.233.1-497c    # unchanged
RampingVersionPercentage: 50            # unchanged
RampingVersionPercentageChangedTime: 4 days ago  # unchanged

This pattern reproduced across multiple operations:

  • set-ramping-version --delete → "Successfully set" → no change
  • set-ramping-version --percentage 0 → "Successfully set" → no change
  • set-current-version --unversioned → "Successfully set" → no change
  • delete-version → "Successfully deleted" → version still present in describe

The deployment had no active pollers (workers had been sunset). The --allow-no-pollers and --ignore-missing-task-queues flags were used.

The system workflow (temporal-sys-worker-deployment:*) existed in the namespace and was reachable via temporal workflow describe, but its internal state was divergent from what deployment describe reported.

Recovery

Deleted the backing system workflow directly:

temporal workflow delete 
  --workflow-id 'temporal-sys-worker-deployment:'
  --namespace account

This scrubbed the deployment from worker deployment list and allowed clean re-registration.

Expected behavior

Operations should either persist the change or return an error. Returning success without mutation is dangerous during incident recovery.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions