Skip to content

fix(setup.sh): set Temporal namespace retention on create#2675

Merged
osu merged 2 commits into
NVIDIA:mainfrom
kirson-git:fix/temporal-namespace-retention
Jun 25, 2026
Merged

fix(setup.sh): set Temporal namespace retention on create#2675
osu merged 2 commits into
NVIDIA:mainfrom
kirson-git:fix/temporal-namespace-retention

Conversation

@kirson-git

Copy link
Copy Markdown
Contributor

Problem

helm-prereqs/setup.sh creates the Temporal namespaces (cloud, site, flow, <site-uuid>) with:

temporal operator namespace create -n <ns> --address ... 2>/dev/null || true

No --retention is passed, and 2>/dev/null || true swallows any failure. Depending on the Temporal server's default-retention policy the create can be rejected/incomplete; the namespace then never exists and the nico-rest cloud/site/flow workers CrashLoop with Namespace "<ns>" is not found. The silent || true makes this invisible during install.

Fix

Pass an explicit --retention 72h on each namespace create so it succeeds deterministically regardless of server defaults. (Reproduced on a clean v0.10.3 install; the same code is still present on main.)

🤖 Generated with Claude Code

@kirson-git kirson-git requested review from a team and shayan1995 as code owners June 17, 2026 14:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • Chores
    • Updated deployment prerequisites script to configure Temporal namespaces with standardized 72-hour retention policies. This change ensures that cloud, site, flow, and per-site Temporal namespace instances consistently retain data for 72 hours, establishing uniform and predictable data lifecycle behavior across all environment configurations during setup.

Walkthrough

The helm-prereqs/setup.sh script is updated to append --retention 72h to all temporal operator namespace create invocations executed via kubectl exec. This affects the cloud, site, and flow namespaces (lines 633–638) and the per-site namespace identified by NICO_SITE_UUID (line 766).

Changes

Temporal Namespace Retention Configuration

Layer / File(s) Summary
Add --retention 72h to all Temporal namespace create commands
helm-prereqs/setup.sh
Appends --retention 72h to the temporal operator namespace create command strings for the cloud, site, flow, and NICO_SITE_UUID namespaces within kubectl exec invocations.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding explicit Temporal namespace retention policy on creation.
Description check ✅ Passed The description comprehensively explains the problem (missing retention policy causing silent failures), the fix (adding --retention 72h), and provides context from a clean installation test.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 632-638: The kubectl exec commands for temporal namespace creation
(lines 633, 635, 638, and line 766) are unconditionally suppressing all stderr
output and errors with `2>/dev/null || true`, which masks real setup failures
like TLS, authentication, or connectivity issues. Instead of blanket error
suppression, implement idempotent namespace creation by first checking if each
namespace already exists before attempting to create it, or by using a
conditional approach that only ignores "already exists" errors while allowing
actual failures to be logged and propagated. This maintains idempotency while
preserving fail-fast behavior and enabling clear error diagnostics as required
by the helm-prereqs guidelines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a21fe10a-5f06-450e-92fd-0775dc5e8e04

📥 Commits

Reviewing files that changed from the base of the PR and between bd899c6 and d199af0.

📒 Files selected for processing (1)
  • helm-prereqs/setup.sh

Comment thread helm-prereqs/setup.sh
Comment on lines 632 to +638
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
# flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent.
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n flow --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true

@coderabbitai coderabbitai Bot Jun 17, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not suppress all namespace-creation failures.

--retention 72h is a good fix, but Line 633, Line 635, Line 638, and Line 766 still unconditionally hide failures (2>/dev/null || true). This can mask real setup faults (TLS/auth/connectivity/CLI errors) and breaks fail-fast behavior.

Suggested hardening (idempotent + fail-fast)
+create_temporal_namespace() {
+    local ns="$1"
+    if kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace describe -n '${ns}' --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" \
+        >/dev/null 2>&1; then
+        echo "Temporal namespace '${ns}' already exists"
+        return 0
+    fi
+
+    kubectl exec -n temporal deploy/temporal-admintools -- \
+        sh -c "temporal operator namespace create -n '${ns}' --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}"
+}
+
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+create_temporal_namespace "cloud"
+create_temporal_namespace "site"
+create_temporal_namespace "flow"
...
-kubectl exec -n temporal deploy/temporal-admintools -- \
-    sh -c "temporal operator namespace create -n '${NICO_SITE_UUID}' --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
+create_temporal_namespace "${NICO_SITE_UUID}"

As per coding guidelines, "**/*.sh: Review shell scripts for ... error propagation ..." and "helm-prereqs/**: ... idempotency, and clear failure messages."

Also applies to: 765-766

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 632 - 638, The kubectl exec commands for
temporal namespace creation (lines 633, 635, 638, and line 766) are
unconditionally suppressing all stderr output and errors with `2>/dev/null ||
true`, which masks real setup failures like TLS, authentication, or connectivity
issues. Instead of blanket error suppression, implement idempotent namespace
creation by first checking if each namespace already exists before attempting to
create it, or by using a conditional approach that only ignores "already exists"
errors while allowing actual failures to be logged and propagated. This
maintains idempotency while preserving fail-fast behavior and enabling clear
error diagnostics as required by the helm-prereqs guidelines.

Source: Coding guidelines

@ajf ajf Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai can you create issues for those things, thanks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

gh issue create \
  --repo "NVIDIA/infra-controller" \
  --title "helm-prereqs/setup.sh: Temporal namespace creation silently suppresses all errors" \
  --assignee "ajf" \
  --body "## Summary

In \`helm-prereqs/setup.sh\`, the four \`temporal operator namespace create\` invocations for the \`cloud\`, \`site\`, \`flow\`, and per-site UUID namespaces unconditionally suppress all stderr output and exit codes via \`2>/dev/null || true\`:

\`\`\`sh
kubectl exec -n temporal deploy/temporal-admintools -- \\
    sh -c \"temporal operator namespace create -n cloud --retention 72h ...\" 2>/dev/null || true
\`\`\`

This pattern masks real setup faults such as TLS misconfiguration, authentication failures, and connectivity issues, breaking fail-fast behaviour and producing no actionable diagnostics. The script may proceed as if namespaces were created when they were not, causing downstream worker pods to enter a CrashLoop with \`Namespace '<ns>' is not found\`.

## Affected lines

- \`helm-prereqs/setup.sh\` lines 633, 635, 638 (cloud / site / flow namespaces)
- \`helm-prereqs/setup.sh\` line 766 (per-site \`NICO_SITE_UUID\` namespace)

## Expected behaviour

Namespace creation commands should:
1. Be idempotent — skip creation if the namespace already exists (via \`temporal operator namespace describe\`).
2. Propagate genuine errors (TLS, auth, connectivity) rather than suppressing them.
3. Emit a clear diagnostic message on failure, consistent with the \`helm-prereqs/**\` coding guidelines requiring *idempotency* and *clear failure messages*.

## References

- PR: https://github.com/NVIDIA/infra-controller/pull/2675
- Review comment: https://github.com/NVIDIA/infra-controller/pull/2675#discussion_r3428992931

Raised by `@ajf`."

Length of output: 1876



@ajf, the issue has been opened and assigned to you:

helm-prereqs/setup.sh: Temporal namespace creation silently suppresses all errors#2768

It covers all four affected lines (the cloud, site, flow, and per-site UUID namespace creation commands) and references this PR and review comment for full traceability.

@ajf ajf left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok to test d199af0

Comment thread helm-prereqs/setup.sh
Comment on lines 632 to +638
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n cloud --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n site --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
# flow Temporal namespace — required by NICo Flow workers; pod panics on startup if absent.
kubectl exec -n temporal deploy/temporal-admintools -- \
sh -c "temporal operator namespace create -n flow --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
sh -c "temporal operator namespace create -n flow --retention 72h --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true

@ajf ajf Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai can you create issues for those things, thanks.

@ajf

ajf commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

/ok to test d199af0

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 265 6 24 98 7 130
machine-validation-runner 717 32 188 267 36 194
machine_validation 717 32 188 267 36 194
machine_validation-aarch64 717 32 188 267 36 194
nvmetal-carbide 717 32 188 267 36 194
TOTAL 3139 134 776 1172 151 906

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@thossain-nv thossain-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement.

kirson-git and others added 2 commits June 24, 2026 21:23
The Temporal namespaces (cloud, site, flow, <site-uuid>) are created with
`temporal operator namespace create -n <ns> --address ... 2>/dev/null || true`
with no --retention. Depending on the Temporal server's default-retention
policy the create can be rejected/incomplete, and the `2>/dev/null || true`
hides the failure. The namespace then never exists, and nico-rest cloud/site/
flow workers CrashLoop with 'Namespace "<ns>" is not found'.

Set an explicit --retention (72h) on each namespace create so it succeeds
deterministically regardless of server defaults.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: kirson-git <ekirson@nvidia.com>
@osu osu force-pushed the fix/temporal-namespace-retention branch from a309480 to 6161dce Compare June 25, 2026 04:23
@osu

osu commented Jun 25, 2026

Copy link
Copy Markdown
Member

/ok to test 6161dce

@github-actions

Copy link
Copy Markdown

@osu osu merged commit 2041ccf into NVIDIA:main Jun 25, 2026
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants