To Reproduce
Environment: a multi-server setup - Dokploy panel on the manager node, app deployed to a remote server. Small nodes, no swap.
- Create a Docker Compose application whose docker-compose.yml has TWO services that both build: from the SAME Dockerfile (a common pattern: a Next.js app with a web service and a worker service sharing one image).
- Deploy it. Dokploy runs docker compose ... up -d --build, and Compose/BuildKit builds BOTH services in parallel, i.e. two simultaneous npm ci + next build runs from the same Dockerfile.
- On a resource-constrained host with no swap, the two parallel builds exhaust CPU/RAM. The host goes to ~100% CPU and becomes unreachable (SSH times out during banner exchange), and the build process is killed.
- Regain access (only possible via a hard reboot from the hosting dashboard).
- Observe the deployment in the Dokploy UI: it is stuck in "running" state and never finishes.
Under the hood after the crash:
- Postgres: the deployment table has a row stuck at status = running.
- Redis still has the BullMQ job for it: bull:deployments:<id> is present and listed in bull:deployments:active, with a leftover bull:deployments:<id>:lock. The worker that held it was killed, so the job never completes and is not recovered.
There is no documented UI/CLI way to cancel or clear this. The only workarounds I found:
- reboot the server, or
- manually: run UPDATE deployment SET status='error' WHERE status='running' in the dokploy Postgres, delete the bull:deployments:* keys in the dokploy Redis, then run docker service update --force dokploy.
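For anyone hitting the same thing, the manual workaround above can be sketched as a small shell function run on the manager node. This is only a sketch under assumptions: the container name filters (dokploy-postgres, dokploy-redis), the Postgres user and database (both dokploy), and a Redis without a password are guesses about a default install, not something confirmed here, so they are overridable through environment variables. Note that it deletes every bull:deployments:* key, including jobs that are merely queued.

```shell
# Sketch of the manual recovery described above. Assumed, not confirmed:
# container name filters, Postgres user/database, Redis without auth.
clear_stuck_deployments() {
  pg_filter="${PG_FILTER:-dokploy-postgres}"
  redis_filter="${REDIS_FILTER:-dokploy-redis}"
  pg_user="${PG_USER:-dokploy}"
  pg_db="${PG_DB:-dokploy}"

  # Resolve the running containers by name filter.
  pg_id=$(docker ps -q -f "name=$pg_filter" | head -n 1)
  redis_id=$(docker ps -q -f "name=$redis_filter" | head -n 1)
  if [ -z "$pg_id" ] || [ -z "$redis_id" ]; then
    echo "could not find the Postgres or Redis container" >&2
    return 1
  fi

  # 1. Mark every deployment still flagged as running as errored.
  docker exec "$pg_id" psql -U "$pg_user" -d "$pg_db" \
    -c "UPDATE deployment SET status='error' WHERE status='running';" || return 1

  # 2. Delete the BullMQ deployment keys (jobs, active list, locks).
  docker exec "$redis_id" sh -c \
    'redis-cli --scan --pattern "bull:deployments:*" | while read -r key; do redis-cli del "$key"; done' || return 1

  # 3. Restart the dokploy service so it starts from a clean queue.
  docker service update --force dokploy
}

# Usage: clear_stuck_deployments
```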
Current vs. Expected behavior
Current: After a build is killed because it exhausted host resources, the deployment stays "running" indefinitely. The BullMQ job remains active+locked and is never marked stalled/failed. There is no working "Cancel" path in the UI, so the queue is wedged until the server is rebooted or Redis/Postgres are edited by hand. New deploys queue behind the stuck one.
Expected:
- An orphaned/stalled deployment whose worker died should be auto-recovered (marked failed via BullMQ stalled-job handling) and/or there should be a reliable "Cancel deployment" button in the UI that works even when the build process/worker is gone.
- Builds should have guardrails so a single deploy cannot take down the host: e.g. build an image only ONCE when multiple compose services share the same build/Dockerfile (instead of building every service in parallel), and/or a configurable build concurrency / memory limit.
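To illustrate the kind of build-concurrency guardrail meant above, here is a minimal sketch that builds the compose services one at a time instead of in parallel. It is not how Dokploy works today and the function name is made up for the example; it only uses docker compose config --services and docker compose build. When both services share the same Dockerfile and build context, the second build should mostly be served from the layer cache.

```shell
# Hypothetical helper: build each compose service sequentially
# (an effective build concurrency of 1), stopping at the first failure.
build_services_sequentially() {
  for svc in $(docker compose config --services); do
    echo "building $svc"
    docker compose build "$svc" || return 1
  done
}

# Usage, in the directory containing docker-compose.yml:
#   build_services_sequentially && docker compose up -d
```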
Provide environment information
Operating System:
OS: <manager node: Ubuntu 24.04; target node: Ubuntu 24.04>
Arch: x86_64
Dokploy version: 0.29.4
VPS Provider: Hetzner (manager) + Another Provider (target)
Manager node: 4 vCPU / 7.6 GB RAM / no swap
Target node: 2 vCPU / 1.9 GB RAM / no swap
Which area(s) are affected? (Select all that apply)
Remote server, Application
Are you deploying the applications where Dokploy is installed or on a remote server?
Remote server
Additional context
- The host became unreachable specifically during the build step of a redeploy; CPU hit 100% and SSH timed out at banner exchange.
- Trigger appears to be parallel builds of two compose services that share one Dockerfile (BuildKit builds both at once). Switching the compose file so the image builds once (one service has build:, the other reuses the same image: tag) greatly reduces the build footprint.
- Neither /docs/core nor /docs/core/troubleshooting documents how to cancel/clear a stuck deployment, which is why this is hard to recover from.
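For reference, the single-build layout mentioned above (one service with build:, the other reusing the same image: tag) could look roughly like the file written by this sketch. The tag myapp:local, the npm script names, and the helper function name are placeholders for illustration, not taken from my actual project.

```shell
# Hypothetical helper: write a compose file in which only "web" builds
# the image and "worker" reuses it by tag.
write_single_build_compose() {
  cat > "${1:-docker-compose.yml}" <<'EOF'
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    image: myapp:local
    command: ["npm", "run", "start"]
  worker:
    image: myapp:local
    command: ["npm", "run", "worker"]
    depends_on:
      - web
EOF
}

# Usage: write_single_build_compose docker-compose.yml
```

Depending on the Compose version, the worker service may try to pull the tag before web has built it; running docker compose build web before docker compose up -d avoids that.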
Will you send a PR to fix it?
No