fix(setup.sh): set Temporal namespace retention on create#2675
Conversation
Summary by CodeRabbit
WalkthroughThe ChangesTemporal Namespace Retention Configuration
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 632-638: The kubectl exec commands for temporal namespace creation
(lines 633, 635, 638, and line 766) are unconditionally suppressing all stderr
output and errors with `2>/dev/null || true`, which masks real setup failures
like TLS, authentication, or connectivity issues. Instead of blanket error
suppression, implement idempotent namespace creation by first checking if each
namespace already exists before attempting to create it, or by using a
conditional approach that only ignores "already exists" errors while allowing
actual failures to be logged and propagated. This maintains idempotency while
preserving fail-fast behavior and enabling clear error diagnostics as required
by the helm-prereqs guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a21fe10a-5f06-450e-92fd-0775dc5e8e04
📒 Files selected for processing (1)
helm-prereqs/setup.sh
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| # flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent. | ||
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n flow --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true |
There was a problem hiding this comment.
Do not suppress all namespace-creation failures.
--retention 72h is a good fix, but Line 633, Line 635, Line 638, and Line 766 still unconditionally hide failures (2>/dev/null || true). This can mask real setup faults (TLS/auth/connectivity/CLI errors) and breaks fail-fast behavior.
Suggested hardening (idempotent + fail-fast)
+create_temporal_namespace() {
+ local ns="$1"
+ if kubectl exec -n temporal deploy/temporal-admintools -- \
+ sh -c "temporal operator namespace describe -n '${ns}' --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" \
+ >/dev/null 2>&1; then
+ echo "Temporal namespace '${ns}' already exists"
+ return 0
+ fi
+
+ kubectl exec -n temporal deploy/temporal-admintools -- \
+ sh -c "temporal operator namespace create -n '${ns}' --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}"
+}
+
-kubectl exec -n temporal deploy/temporal-admintools -- \
- sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
- sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
- sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+create_temporal_namespace "cloud"
+create_temporal_namespace "site"
+create_temporal_namespace "flow"
...
-kubectl exec -n temporal deploy/temporal-admintools -- \
- sh -c "temporal operator namespace create -n '${NICO_SITE_UUID}' --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+create_temporal_namespace "${NICO_SITE_UUID}"As per coding guidelines, "**/*.sh: Review shell scripts for ... error propagation ..." and "helm-prereqs/**: ... idempotency, and clear failure messages."
Also applies to: 765-766
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm-prereqs/setup.sh` around lines 632 - 638, The kubectl exec commands for
temporal namespace creation (lines 633, 635, 638, and line 766) are
unconditionally suppressing all stderr output and errors with `2>/dev/null ||
true`, which masks real setup failures like TLS, authentication, or connectivity
issues. Instead of blanket error suppression, implement idempotent namespace
creation by first checking if each namespace already exists before attempting to
create it, or by using a conditional approach that only ignores "already exists"
errors while allowing actual failures to be logged and propagated. This
maintains idempotency while preserving fail-fast behavior and enabling clear
error diagnostics as required by the helm-prereqs guidelines.
Source: Coding guidelines
There was a problem hiding this comment.
@coderabbitai can you create issues for those things, thanks.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
gh issue create \
--repo "NVIDIA/infra-controller" \
--title "helm-prereqs/setup.sh: Temporal namespace creation silently suppresses all errors" \
--assignee "ajf" \
--body "## Summary
In \`helm-prereqs/setup.sh\`, the four \`temporal operator namespace create\` invocations for the \`cloud\`, \`site\`, \`flow\`, and per-site UUID namespaces unconditionally suppress all stderr output and exit codes via \`2>/dev/null || true\`:
\`\`\`sh
kubectl exec -n temporal deploy/temporal-admintools -- \\
sh -c \"temporal operator namespace create -n cloud --retention 72h ...\" 2>/dev/null || true
\`\`\`
This pattern masks real setup faults such as TLS misconfiguration, authentication failures, and connectivity issues, breaking fail-fast behaviour and producing no actionable diagnostics. The script may proceed as if namespaces were created when they were not, causing downstream worker pods to enter a CrashLoop with \`Namespace '<ns>' is not found\`.
## Affected lines
- \`helm-prereqs/setup.sh\` lines 633, 635, 638 (cloud / site / flow namespaces)
- \`helm-prereqs/setup.sh\` line 766 (per-site \`NICO_SITE_UUID\` namespace)
## Expected behaviour
Namespace creation commands should:
1. Be idempotent — skip creation if the namespace already exists (via \`temporal operator namespace describe\`).
2. Propagate genuine errors (TLS, auth, connectivity) rather than suppressing them.
3. Emit a clear diagnostic message on failure, consistent with the \`helm-prereqs/**\` coding guidelines requiring *idempotency* and *clear failure messages*.
## References
- PR: https://github.com/NVIDIA/infra-controller/pull/2675
- Review comment: https://github.com/NVIDIA/infra-controller/pull/2675#discussion_r3428992931
Raised by `@ajf`."Length of output: 1876
@ajf, the issue has been opened and assigned to you:
helm-prereqs/setup.sh: Temporal namespace creation silently suppresses all errors — #2768
It covers all four affected lines (the cloud, site, flow, and per-site UUID namespace creation commands) and references this PR and review comment for full traceability.
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| # flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent. | ||
| kubectl exec -n temporal deploy/temporal-admintools -- \ | ||
| sh -c "temporal operator namespace create -n flow --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true | ||
| sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true |
There was a problem hiding this comment.
@coderabbitai can you create issues for those things, thanks.
|
/ok to test d199af0 |
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
thossain-nv
left a comment
There was a problem hiding this comment.
Thanks for the improvement.
The Temporal namespaces (cloud, site, flow, <site-uuid>) are created with `temporal operator namespace create -n <ns> --address ... 2>/dev/null || true` with no --retention. Depending on the Temporal server's default-retention policy the create can be rejected/incomplete, and the `2>/dev/null || true` hides the failure. The namespace then never exists, and nico-rest cloud/site/ flow workers CrashLoop with 'Namespace "<ns>" is not found'. Set an explicit --retention (72h) on each namespace create so it succeeds deterministically regardless of server defaults. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: kirson-git <ekirson@nvidia.com>
a309480 to
6161dce
Compare
|
/ok to test 6161dce |
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2675.docs.buildwithfern.com/infra-controller |
Problem
helm-prereqs/setup.shcreates the Temporal namespaces (cloud,site,flow,<site-uuid>) with:No
--retentionis passed, and2>/dev/null || trueswallows any failure. Depending on the Temporal server's default-retention policy the create can be rejected/incomplete; the namespace then never exists and the nico-rest cloud/site/flow workers CrashLoop withNamespace "<ns>" is not found. The silent|| truemakes this invisible during install.Fix
Pass an explicit
--retention 72hon each namespace create so it succeeds deterministically regardless of server defaults. (Reproduced on a clean v0.10.3 install; the same code is still present onmain.)🤖 Generated with Claude Code