Deployment stuck in "running" forever after the build OOMs/saturates the host - no way to cancel or clear the queue without rebooting #4461

@hunterxp

To Reproduce

Environment: a multi-server setup - Dokploy panel on the manager node, app deployed to a remote server. Small nodes, no swap.

  1. Create a Docker Compose application whose docker-compose.yml has TWO services that both build: from the SAME Dockerfile (a common pattern: a Next.js app with a web service and a worker service sharing one image).
  2. Deploy it. Dokploy runs docker compose ... up -d --build, and Compose/BuildKit builds BOTH services in parallel - i.e. two simultaneous npm ci + next build from the same Dockerfile.
  3. On a resource-constrained host with no swap, the two parallel builds exhaust CPU/RAM. The host goes to ~100% CPU and becomes unreachable (SSH times out during banner exchange), and the build process is killed.
  4. Regain access (only possible via a hard reboot from the hosting dashboard).
  5. Observe the deployment in the Dokploy UI: it is stuck in "running" state and never finishes.

Under the hood after the crash:

  • Postgres deployment table has a row stuck at status = running.
  • Redis still has the BullMQ job for it: bull:deployments:<id> present, listed in bull:deployments:active, with a leftover bull:deployments:<id>:lock. The worker that held it was killed, so the job never completes and is not recovered.

There is no documented UI/CLI way to cancel or clear this. The only workarounds I found:

  • reboot the server, or
  • manually: UPDATE deployment SET status='error' WHERE status='running' in the dokploy Postgres, delete the bull:deployments:* keys in the dokploy Redis, then docker service update --force dokploy.
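
The manual fix above amounts to a reconciliation pass: any deployment row still marked running whose queue job is no longer held by a live worker gets flipped to error. Below is a minimal sketch of just that decision logic. The `DeploymentRow` shape, the `findOrphaned` helper, and the idea of passing in the set of job ids that still have a live worker are all hypothetical and not taken from Dokploy's code; reading the rows from Postgres and the job state from Redis is deliberately left out.

```typescript
// Hypothetical shapes for illustration only; not Dokploy's actual schema.
type DeploymentStatus = "running" | "done" | "error";

interface DeploymentRow {
  id: string;
  status: DeploymentStatus;
}

// Returns the ids of deployments that claim to be running but whose
// queue job is not held by any live worker.
function findOrphaned(
  rows: DeploymentRow[],
  liveJobIds: ReadonlySet<string>,
): string[] {
  return rows
    .filter((row) => row.status === "running" && !liveJobIds.has(row.id))
    .map((row) => row.id);
}

const rows: DeploymentRow[] = [
  { id: "a", status: "running" }, // worker was killed mid-build
  { id: "b", status: "running" }, // still building on a live worker
  { id: "c", status: "done" },
];

// Only "a" qualifies: it is running in the table but has no live worker.
const orphaned = findOrphaned(rows, new Set(["b"]));
console.log(orphaned);
```

The ids returned here are the ones that would be set to error in the deployment table and removed from the queue, which is what the hand-written UPDATE and key deletion do in bulk.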

Current vs. Expected behavior

Current: After a build is killed because it exhausted host resources, the deployment stays "running" indefinitely. The BullMQ job remains active+locked and is never marked stalled/failed. There is no working "Cancel" path in the UI, so the queue is wedged until the server is rebooted or Redis/Postgres are edited by hand. New deploys queue behind the stuck one.

Expected:

  1. An orphaned/stalled deployment whose worker died should be auto-recovered (marked failed via BullMQ stalled-job handling) and/or there should be a reliable "Cancel deployment" button in the UI that works even when the build process/worker is gone.
  2. Builds should have guardrails so a single deploy cannot take down the host: e.g. build an image only ONCE when multiple compose services share the same build/Dockerfile (instead of building every service in parallel), and/or a configurable build concurrency / memory limit.
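
To illustrate the "build once" guardrail in point 2, here is a rough sketch of how services could be grouped so that each unique build is run a single time. The `ServiceBuild` and `BuildPlan` types and the `planBuilds` function are hypothetical names made up for this example; they assume the compose file has already been parsed into a list of (service, context, dockerfile) entries, and they do not reflect how Dokploy or Compose currently work.

```typescript
// Hypothetical, simplified view of a compose service's build section.
interface ServiceBuild {
  service: string;
  context: string;
  dockerfile: string;
}

interface BuildPlan {
  context: string;
  dockerfile: string;
  services: string[]; // every service that can reuse this one image
}

// Groups services by (context, dockerfile) so each unique pair is built once.
function planBuilds(builds: ServiceBuild[]): BuildPlan[] {
  const plans = new Map<string, BuildPlan>();
  for (const b of builds) {
    const key = JSON.stringify([b.context, b.dockerfile]);
    const existing = plans.get(key);
    if (existing) {
      existing.services.push(b.service);
    } else {
      plans.set(key, {
        context: b.context,
        dockerfile: b.dockerfile,
        services: [b.service],
      });
    }
  }
  return Array.from(plans.values());
}

// The case from this issue: web and worker share one Dockerfile.
const plan = planBuilds([
  { service: "web", context: ".", dockerfile: "Dockerfile" },
  { service: "worker", context: ".", dockerfile: "Dockerfile" },
]);
console.log(plan.length, plan[0].services); // one build, shared by web and worker
```

A real implementation would likely need more in the grouping key (build args, target stage, and so on), since two services can share a Dockerfile and still produce different images.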

Provide environment information

Operating System:
OS: Ubuntu 24.04 (both manager node and target node)
Arch: x86_64
Dokploy version: 0.29.4
VPS Provider: Hetzner (manager) + Another Provider (target)
Manager node: 4 vCPU / 7.6 GB RAM / no swap
Target node:  2 vCPU / 1.9 GB RAM / no swap

Which area(s) are affected? (Select all that apply)

Remote server, Application

Are you deploying the applications where Dokploy is installed or on a remote server?

Remote server

Additional context

  • The host became unreachable specifically during the build step of a redeploy; CPU hit 100% and SSH timed out at banner exchange.
  • The trigger appears to be parallel builds of two compose services that share one Dockerfile (BuildKit builds both at once). Changing the compose file so the image builds once (one service keeps build:, the other reuses the same image: tag) greatly reduces the build footprint.
  • Neither /docs/core nor /docs/core/troubleshooting documents how to cancel/clear a stuck deployment, which is why this is hard to recover from.

Will you send a PR to fix it?

No
