Conversation

> **Note (Reviews paused):** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds four new documentation guides describing quorum rules and structured recovery procedures for a pgEdge Control Plane with embedded etcd, covering partial, majority, and total quorum loss scenarios, step-by-step recovery workflows, commands, verification steps, and operator checklists.
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents

In `docs/Quorum Recovery.md`:

- Line 634: Replace the incorrect step label "Step 11d" with "Step 13d" in the rejoining client-mode hosts section (the line currently showing "###### Step 11d: Get join token from the recovery host"); update any matching subheading text to read "Step 13d" so it correctly falls under the Step 13 sequence for rejoining client-mode hosts.
🧹 Nitpick comments (1)

docs/Quorum Recovery.md (1)

118-120: Optional: Consider minor rephrasing for variety (not blocking). Static analysis flags repetitive sentence beginnings at lines 118-120 ("You will...") and lines 314-316 ("Rejoin..."). While these are stylistically flagged, they are acceptable in technical documentation contexts: the first is in an assumptions list, and the second is in a procedural step list where parallel structure aids clarity.

Also applies to: lines 314-316
Referenced lines (`docs/Quorum Recovery.md`, around line 634):

````markdown
```bash
sleep 10
```

###### Step 11d: Get join token from the recovery host
````
Fix incorrect step numbering.
Line 634 shows "Step 11d" but should be "Step 13d" (this is within the Step 13 section for rejoining client-mode hosts). This appears to be a copy-paste error.
📝 Proposed fix

```diff
-###### Step 11d: Get join token from the recovery host
+###### Step 13d: Get join token from the recovery host
```
Actionable comments posted: 4
🤖 Fix all issues with AI agents

In `docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md`:

- Line 1: The directory name contains a trailing space ("Disaster recovery "), which breaks tooling; rename the directory and update the file MAJORITY_QUORUM_LOSS_RECOVERY.md to live under a normalized directory name (e.g., "Disaster_Recovery" or "disaster-recovery") with no spaces, and ensure any references or links to MAJORITY_QUORUM_LOSS_RECOVERY.md are updated accordingly.
- Around lines 121-132: After extracting RESP and TASK_ID, validate that TASK_ID is non-empty and abort with a clear error message if it is empty (refer to RESP and TASK_ID) to avoid polling an invalid endpoint; in the polling loop, inspect STATUS (derived from the tasks endpoint) and, if it equals "failed", print/log the task details from RESP or the task query, exit with a non-zero code, and avoid continuing recovery as if deletion succeeded; ensure both error paths provide actionable messages including the task id and status so operators can debug.
- Around lines 182-185: The etcd credentials are passed on the command line (ETCD_USER/ETCD_PASS in the etcdctl invocation), which exposes them; instead, export them as environment variables and call etcdctl without the --user/--password flags so the secret never appears in the process list: export ETCDCTL_USER as "${ETCD_USER}" and ETCDCTL_PASSWORD as "${ETCD_PASS}" before running the etcdctl endpoint status command that populates SERVER_COUNT (and apply the same change to the similar etcdctl call in Step 11).
- Line 101: The docs claim Control Plane auto-detects restored snapshots and handles the --force-new-cluster flag, but the code paths IsInitialized(), start(), and embedConfig() lack any snapshot detection or flag handling; either remove that note from the docs or implement detection: update IsInitialized()/start() to inspect snapshot metadata vs WAL presence (detect a restored snapshot with a missing WAL), and modify embedConfig() (or the etcd start logic) to set the equivalent ForceNewCluster behavior or inject the --force-new-cluster setting when a restored snapshot is detected so etcd starts correctly after snapshot restore.
🧹 Nitpick comments (4)
docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md (4)
6-8: Clarify the emoji indicators. Lines 6-8 use ❌ (a negative indicator) but describe capabilities that "may be accessible" or "cannot proceed", which creates inconsistency. The ❌ typically means "not available" or "blocked", but line 6 says the API "may be accessible", suggesting partial availability rather than complete failure. Consider using ⚠️ for partial/uncertain states, or rewriting to clearly indicate whether these are always blocked or conditionally available.
91-91: Provide guidance for the optional production flags. The optional flags `--bump-revision` and `--mark-compacted` are mentioned as "recommended for production" but lack guidance on what value to use for `<revision-bump-value>`. Operators following this guide won't know what value is appropriate. Consider adding a note explaining how to determine the revision bump value (e.g., "use a large value like 1000000 to ensure new revisions are higher than client watch revisions").
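For illustration, a sketch of the command such a note could accompany, assuming a recent `etcdutl` release that supports these flags; the snapshot path, data directory, and the 1000000 bump value are placeholders:

```bash
# Sketch only: restore with a bumped revision so post-restore revisions stay
# ahead of any revision clients may have already observed through watches
etcdutl snapshot restore "${SNAPSHOT_PATH}" \
  --data-dir "${PGEDGE_DATA_DIR}/etcd" \
  --bump-revision 1000000 \
  --mark-compacted
```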
321-321: Use safer process termination sequence. The command `sudo pkill -9 etcd` forcefully kills all etcd processes with SIGKILL, preventing graceful shutdown and potentially leaving corrupted state. It may also affect unrelated etcd installations on the same system.

⚡ Proposed safer sequence

````diff
-Kill any remaining etcd processes: `sudo pkill -9 etcd`
+Check for and gracefully terminate any remaining etcd processes:
+```bash
+# Check if etcd processes are running
+pgrep -a etcd
+
+# Try graceful termination first
+sudo pkill -TERM etcd
+sleep 5
+
+# Force kill only if still running
+if pgrep etcd > /dev/null; then
+  echo "⚠️ Etcd processes still running, forcing termination..."
+  sudo pkill -9 etcd
+fi
+```
````
39-55: Add snapshot verification to prerequisites. The guide assumes the snapshot exists and is valid but doesn't include verification steps. A corrupted or stale snapshot will cause recovery to fail, potentially after significant progress through the procedure.

📋 Proposed snapshot verification steps

Add after line 54 (before "## Recovery Steps"):

````markdown
### Verify Snapshot

Before starting recovery, verify the snapshot is valid and recent:

```bash
# Check snapshot exists
ls -lh "${SNAPSHOT_PATH}"

# Verify snapshot integrity
etcdutl snapshot status "${SNAPSHOT_PATH}" -w table

# Check snapshot age (optional)
SNAPSHOT_DATE=$(stat -c %y "${SNAPSHOT_PATH}" 2>/dev/null || stat -f %Sm "${SNAPSHOT_PATH}")
echo "Snapshot created: ${SNAPSHOT_DATE}"
```

**Decision point:** Confirm the snapshot age is acceptable for your data loss tolerance before proceeding.
````
Referenced lines (`docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md`, new file):

```diff
@@ -0,0 +1,339 @@
+# Majority Quorum Loss Recovery Guide
```
Fix directory name with trailing space.
The file path `docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md` contains a directory name with a trailing space ("Disaster recovery "). This can cause issues with shell scripts, tab completion, and automated tooling.
Rename to `docs/Disaster_Recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md` or `docs/disaster-recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md` (following common naming conventions with underscores or hyphens, no spaces).
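If the directory is tracked in git, one possible way to perform the rename; the target name is illustrative, and the trailing space in the old path must be quoted:

```bash
# Note the trailing space inside the quotes of the old path
git mv "docs/Disaster recovery " docs/disaster-recovery

# Find Markdown files that still reference the old directory name
git grep -l "Disaster recovery" -- "*.md"
```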
Referenced lines (around line 101):

````markdown
```bash
sleep 10
```

**Note:** Control Plane automatically detects the restored snapshot and handles `--force-new-cluster` internally. You don't need to run etcd manually.
````
🧩 Analysis chain

🏁 Scripts executed against the `pgEdge/control-plane` repository:

```bash
rg -n -C3 "force-new-cluster|forceNewCluster" --type=go

# Also search for snapshot detection and recovery logic
rg -n -C3 "snapshot.*detect|detect.*snapshot|MAJORITY_QUORUM" --type=go

# Search for etcd initialization and cluster startup patterns
rg -n -C3 "cluster.*init|initial.*cluster|new-cluster" --type=go

# Search for etcd initialization and where command flags are built
rg -n "start.*etcd|StartEtcd|cmd.*etcd|--force" --type=go -A5 | head -200

# Look for recovery or snapshot detection logic
rg -n -i "snapshot|recovery|restore" --type=go server/internal/etcd | head -100

# Search for where etcd is started in the actual code
find . -name "*.go" -exec rg -l "etcd.*start|start.*etcd" {} \; | head -20

cat -n server/internal/etcd/embedded.go | head -150

# Look for the embedConfig function that builds etcd config
sed -n '200,400p' server/internal/etcd/embedded.go

# Search for ForceNewCluster or InitialCluster configuration
rg -n "ForceNewCluster|InitialCluster|initial_cluster" --type=go

# Find initializationConfig and embedConfig functions
rg -n "func initializationConfig|func embedConfig" --type=go -A 20

# Search for snapshot, restore, or WAL related operations
rg -n -i "snapshot|restore.*etcd|wal.*restore" --type=go | head -50

# Check the disaster recovery documentation itself to see what it claims
cat docs/Disaster\ recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md | head -120

# Find the disaster recovery documentation
find . -name "*MAJORITY_QUORUM*" -o -name "*Disaster*recovery*" 2>/dev/null

# List files in docs directory to understand structure
ls -la docs/ 2>/dev/null || ls -la "docs/" 2>/dev/null || find docs -maxdepth 2 -type f -name "*.md" 2>/dev/null | head -20

# Search for content matching the review comment
rg -n "automatically detects|force-new-cluster" --type=md

cat "docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md" | sed -n '90,110p'

# Check the start() function for snapshot or force-new-cluster handling
sed -n '150,250p' server/internal/etcd/embedded.go

# Check the embedConfig function more thoroughly
sed -n '626,695p' server/internal/etcd/embedded.go
```
The automatic force-new-cluster detection claim is not implemented in the Control Plane codebase.
The documentation states that Control Plane "automatically detects the restored snapshot and handles --force-new-cluster internally," but the code does not support this claim. The IsInitialized() method only checks for the existence of the WAL directory to determine if etcd has been started before. The start() and embedConfig() functions create a standard etcd configuration without any snapshot detection logic or --force-new-cluster flag handling. When etcd starts on a restored snapshot, it will not automatically apply --force-new-cluster unless the WAL directory is missing—meaning the recovery procedure as documented will not work as described.
Either remove this note from the documentation, or implement the snapshot detection and automatic flag handling in the Control Plane code.
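As a rough shell-level illustration of the condition described above (not the Control Plane's actual logic), a restored-but-never-started data directory could be recognized by a `db` file without a WAL directory, assuming the standard embedded etcd layout:

```bash
# Hypothetical probe; the data dir path and layout are assumptions
ETCD_DATA_DIR="${PGEDGE_DATA_DIR}/etcd"
if [ -f "${ETCD_DATA_DIR}/member/snap/db" ] && [ ! -d "${ETCD_DATA_DIR}/member/wal" ]; then
  echo "Restored snapshot without WAL detected; force-new-cluster handling would be needed"
fi
```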
Referenced lines (121-132):

```bash
RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')

# Wait for completion
while true; do
  STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
  if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
    break
  fi
  sleep 5
done
```
🛠️ Refactor suggestion | 🟠 Major
Add error handling for task failures and missing task IDs.
The script has two error handling gaps:
- Empty TASK_ID: If all jq paths fail to extract a task ID (line 122), `TASK_ID` will be empty, causing invalid API calls in the polling loop.
- Failed tasks: The loop exits on "failed" status (line 127) but doesn't handle the failure, so recovery continues as if deletion succeeded.
🛡️ Proposed fix to add validation and failure handling

```diff
 RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
+# Validate task ID exists
+if [ -z "${TASK_ID}" ]; then
+  echo "❌ Failed to get task ID from response: ${RESP}"
+  exit 1
+fi
+
 # Wait for completion
 while true; do
   STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
   if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
     break
   fi
   sleep 5
 done
+
+# Check if task failed
+if [ "${STATUS}" = "failed" ]; then
+  echo "❌ Host removal task failed for ${LOST_HOST_ID}"
+  exit 1
+fi
```
Referenced lines (182-185):

```bash
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
```
Security risk: Credentials exposed in process list.
Lines 182-184 pass etcd credentials via command-line arguments (--user and --password), which are visible in the process list (ps aux) and shell history. This exposes sensitive credentials to any user on the system.
🔒 Proposed fix using environment variables

```diff
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+export ETCDCTL_USER="${ETCD_USER}"
+export ETCDCTL_PASSWORD="${ETCD_PASS}"
+SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status \
+  --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+  --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+  --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+  --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+  -w json | jq 'length')
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
```

Note: Apply the same pattern to lines 260-270 in Step 11.
Actionable comments posted: 5
🤖 Fix all issues with AI agents

In `docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md`:

- Around lines 140-152: After capturing RESP and deriving TASK_ID, validate that TASK_ID is non-empty before entering the wait loop: check RESP for an error or missing task fields (the curl response stored in RESP and the extraction into TASK_ID) and, if TASK_ID is empty, echo RESP with a clear error message and exit nonzero (or skip waiting) to avoid hanging; otherwise proceed to the existing wait loop using TASK_ID. Ensure the validation references the RESP variable and the TASK_ID extraction logic so unexpected DELETE failure formats are handled consistently.
- Around lines 59-100: The recovery steps lack error handling and proper quoting: quote ${PGEDGE_DATA_DIR} in the rm command (target: generated.config.json removal), wrap SERVER_HOST and CLIENT_HOST expansions in quotes when calling curl, capture the join token into JOIN_TOKEN and validate that it is non-empty and not "null" before proceeding, replace the blind sleep after `docker service scale control-plane_<lost-client-host-id>=1` with a readiness loop that polls the cluster health endpoint (e.g., ${CLIENT_HOST}/v1/health) and only continues when healthy, and ensure any failing curl or jq commands cause the script to exit with an error message so failures are not silent.

In `docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md`:

- Line 1: Rename the file TOTAL_QUORUM_LOSS_RECOVERY.md.md to remove the duplicate extension (TOTAL_QUORUM_LOSS_RECOVERY.md) and update any internal references or links that point to TOTAL_QUORUM_LOSS_RECOVERY.md.md so they reference TOTAL_QUORUM_LOSS_RECOVERY.md instead.
- Around lines 199-206: The one-liner that exposes ETCD_PASS on the command line and is hard to read should be refactored: read ETCD_USER/ETCD_PASS from generated.config.json (already present), export a combined ETCDCTL_USER (or another env var) for credentials and set ETCDCTL_API=3, call etcdctl endpoint status across multiple lines (using --endpoints, --cacert, --cert, --key) without passing --user/--password, capture the JSON length into SERVER_COUNT, then immediately unset the credential env var; update the check that compares SERVER_COUNT to QUORUM_THRESHOLD (use the existing SERVER_COUNT, ETCD_USER, ETCD_PASS, ETCDCTL_USER, and ETCDCTL_API symbols to locate and change the code).
- Around lines 280-290: The etcdctl invocation exposes ETCD_PASS on the command line; instead, export the secret into an environment variable (e.g., export ETCDCTL_PASSWORD="${ETCD_PASS}") and remove the --password "${ETCD_PASS}" argument from the etcdctl endpoint status call (keep ETCDCTL_API, --endpoints, --cacert, --cert, --key, and --user as-is), then immediately unset ETCDCTL_PASSWORD after the command to avoid leaving credentials in the process environment.
🧹 Nitpick comments (3)
docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md (1)
322-322: Hyphenate compound modifier for clarity. The phrase "zero downtime add node" functions as a compound modifier and should be hyphenated for grammatical correctness.

📝 Proposed fix

```diff
-**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero downtime add node.
+**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero-downtime-add-node.
```

docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md (2)

86-90: Clarify how to verify certificate identity match. The instruction "Ensure the host ID matches the certificate identity" is vague and lacks actionable verification steps. During a critical outage, operators need explicit commands to validate this requirement.

📋 Suggested addition for verification steps

Add a verification subsection:

````markdown
**Verify certificate identity:**

```bash
# Extract Common Name from certificate
CERT_CN=$(openssl x509 -in "${PGEDGE_DATA_DIR}/certificates/etcd-server.crt" -noout -subject | grep -oP 'CN\s*=\s*\K[^,]+')

# Compare with host ID
if [ "${CERT_CN}" != "${RECOVERY_HOST_ID}" ]; then
  echo "ERROR: Certificate CN (${CERT_CN}) does not match host ID (${RECOVERY_HOST_ID})"
  exit 1
fi
echo "✅ Certificate identity matches host ID"
```
````

341-344: Recommend graceful shutdown before SIGKILL. Using `pkill -9` (SIGKILL) immediately prevents etcd from cleanly shutting down. In a troubleshooting scenario, attempting a graceful shutdown first reduces risk of data corruption.

🔧 Proposed improvement

```diff
 **Fix:**
 - Stop all Control Plane services (Step 1)
-- Kill any remaining etcd processes: `sudo pkill -9 etcd`
+- Try graceful shutdown first: `sudo pkill etcd` (wait 10 seconds)
+- If processes remain, force kill: `sudo pkill -9 etcd`
 - Verify certificate files match the snapshot identity
```
Referenced lines (`docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md`, 59-100):

````markdown
```bash
PGEDGE_DATA_DIR="<path-to-data-dir>"
API_PORT=<api-port>
SERVER_HOST="http://<server-mode-host-ip>:${API_PORT}"
CLIENT_HOST="http://<client-mode-host-ip>:${API_PORT}"
```

#### Step 1: Stop the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
```

#### Step 2: Remove generated.config.json

```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
```

#### Step 3: Start the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=1
sleep 10
```

#### Step 4: Request a Join Token

```bash
JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster

```bash
curl -sS -X POST ${CLIENT_HOST}/v1/cluster/join \
  -H "Content-Type: application/json" \
  -d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"
```
````
Add error handling and quote variable expansions.
The recovery commands lack error handling, which could cause silent failures during a critical outage. Additionally, unquoted path variables could break if paths contain spaces.
Issues:
- Line 91: `curl` could fail but the script continues
- Line 77: Unquoted `${PGEDGE_DATA_DIR}` in the `rm` command
- Lines 85-86: `sleep 10` doesn't verify service readiness
🛡️ Proposed improvements for safety and reliability

Add error handling for API calls:

````diff
 #### Step 4: Request a Join Token
 ```bash
-JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
+JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")
+if [ -z "${JOIN_TOKEN}" ] || [ "${JOIN_TOKEN}" = "null" ]; then
+  echo "ERROR: Failed to obtain join token"
+  exit 1
+fi
 ```
````

Quote path variables:

````diff
 #### Step 2: Remove generated.config.json
 ```bash
 # On lost client-mode host node
-rm -f ${PGEDGE_DATA_DIR}/generated.config.json
+rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
 ```
````

Add service readiness check:

````diff
 #### Step 3: Start the Control Plane Service
 ```bash
 # On Swarm manager node
 docker service scale control-plane_<lost-client-host-id>=1
-sleep 10
+
+# Wait for service to be ready
+until curl -sS "${CLIENT_HOST}/v1/health" >/dev/null 2>&1; do
+  sleep 2
+done
 ```
````
Referenced lines (140-152):

````markdown
#### Step 1: Remove the Lost Host Record

**Run from a node with API access:**

```bash
RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
echo "${RESP}" | jq '.'
```

#### Step 2: Stop the Host Service

**Run on a Swarm manager node. Skip this step if the service name does not exist.**
````
Validate task ID before waiting for completion.
If the DELETE request fails or returns an unexpected format, TASK_ID could be empty, causing the wait loop to fail or behave unexpectedly during a critical recovery operation.
🛡️ Proposed fix to validate task ID

```diff
 RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
-echo "${RESP}" | jq '.'
+
+if [ -z "${TASK_ID}" ] || [ "${TASK_ID}" = "null" ]; then
+  echo "ERROR: Failed to delete host or obtain task ID"
+  echo "${RESP}" | jq '.'
+  exit 1
+fi
+
+echo "Task ID: ${TASK_ID}"
```
Referenced lines (`docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md`, new file):

```diff
@@ -0,0 +1,366 @@
+# Total Quorum Loss Recovery Guide
```
Fix double file extension in filename.
The file is named TOTAL_QUORUM_LOSS_RECOVERY.md.md with a duplicate .md extension. This should be corrected to TOTAL_QUORUM_LOSS_RECOVERY.md.
Referenced lines (199-206):

```bash
TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
```
Refactor long command and avoid password on command line.
Line 204 contains a 300+ character command that is difficult to read and exposes credentials on the command line. During a critical outage, operators need clear, debuggable commands.
Issues:
- Security: Password visible in process listings (`ps aux`)
- Readability: Extremely long one-liner is hard to verify
- Maintainability: Difficult to modify or troubleshoot
♻️ Proposed refactor for clarity and security
````diff
 #### 8f. Check Quorum Status
 ```bash
 TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
 QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
+
+# Read credentials from config
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+
+# Use environment variables to avoid the password on the command line
+export ETCDCTL_USER="${ETCD_USER}:${ETCD_PASS}"
+export ETCDCTL_API=3
+
+# Check server count
+SERVER_COUNT=$(etcdctl endpoint status \
+  --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+  --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+  --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+  --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+  -w json | jq 'length')
+
+# Clear credential from environment
+unset ETCDCTL_USER
+
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
 ```
````
Force-pushed: 356fb56 to 38fbae8 (Compare)
Referenced lines:

```bash
rm -rf "${PGEDGE_DATA_DIR}/etcd"

# Copy etcd data from backup directory
cp -r "${BACKUP_ETCD_DIR}" "${PGEDGE_DATA_DIR}/etcd"
```
This will not be sufficient if the cluster previously had more than one node. You'll need to run etcdutl snapshot restore to reset the cluster membership like we did on the call yesterday.
Thanks, I'll update Step 4 to use `etcdutl snapshot restore` to reset cluster membership to a single node, which matches the approach we discussed.
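For reference, a minimal sketch of what the updated Step 4 might look like; `etcdutl snapshot restore` writes a fresh data directory whose membership contains only the member named in `--initial-cluster`. `SNAPSHOT_PATH` and `ETCD_PEER_PORT` are placeholders, and the target data directory must not exist before the restore:

```bash
# Sketch only: restore the snapshot into a new data dir with single-member membership
etcdutl snapshot restore "${SNAPSHOT_PATH}" \
  --name "${RECOVERY_HOST_ID}" \
  --initial-cluster "${RECOVERY_HOST_ID}=https://${RECOVERY_HOST_IP}:${ETCD_PEER_PORT}" \
  --initial-advertise-peer-urls "https://${RECOVERY_HOST_IP}:${ETCD_PEER_PORT}" \
  --data-dir "${PGEDGE_DATA_DIR}/etcd"
```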
Referenced lines (prerequisites):

```markdown
- etcd data backup directory (from before outage)
- Certificate files and `generated.config.json` from a server-mode host (from before outage)
```
I think we can combine these first two items and just say "A snapshot or other backup of the Control Plane data volume from before the outage". We're going to have to recommend some type of backups in our installation guide, and volume-level snapshots will be the easiest to recommend and easiest for customers to implement.
Thanks! I'll update it to combine those into a single volume-level backup requirement. The restore process will still use etcdutl snapshot restore to reset cluster membership for multi-node clusters.
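As one low-tech illustration of a volume-level backup (the service name, backup path, and tar approach are placeholders; filesystem or cloud volume snapshots would serve the same purpose):

```bash
# Sketch only: quiesce the service, archive the entire data volume, then resume
docker service scale control-plane_<host-id>=0
tar -czf "/backups/control-plane-$(date +%Y%m%d).tar.gz" -C "${PGEDGE_DATA_DIR}" .
docker service scale control-plane_<host-id>=1
```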
Referenced lines:

```markdown
Copy certificate files and `generated.config.json` from before outage to:
- `${PGEDGE_DATA_DIR}/certificates/`
- `${PGEDGE_DATA_DIR}/generated.config.json`
```
I think users should always restore the entire data volume, right? Otherwise, they're missing their Postgres instance data. If you've encountered scenarios where that doesn't work properly, you'll need to be specific here and describe when they should or shouldn't restore the entire directory/volume.
I'll add specific guidance on when to restore the entire volume vs. selective restoration. The document will clarify that full volume restoration is recommended in most cases, with selective restoration only for specific scenarios like recovering only Control Plane state without Postgres data.
Referenced lines:

```bash
# Verify API is accessible
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" | jq '.'
```
Could you please remove jq from these commands? It is a nice suggestion and it's my preference as well, but we can't assume that it's installed on every system.
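A possible jq-free variant of the check, using the same endpoint; the pretty-printing line is only an optional convenience for systems that happen to have Python 3:

```bash
# Verify API is accessible (raw JSON output, no jq required)
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster"

# Optional pretty-printing where Python 3 is available
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" | python3 -m json.tool
```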
Referenced lines:

```markdown
- One host should show `status: "reachable"` and `etcd_mode: "server"`
- Server count should be **1**

**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**.
```
Could you please remove this:
Quorum is NOT YET RESTORED.
This process resets the cluster membership to one server, so you do have quorum. What's more accurate is to say that you need to re-add the other Control Plane servers to the cluster.
Referenced lines:

```bash
# Check databases
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases" | jq '.[] | {id, state}'

# Check etcd cluster
```
I'm hesitant to recommend that users try to do anything with our internal etcd other than using etcdutl to restore the instance. The host check above is already sufficient for validating that etcd is working properly, so could you please remove these etcdctl instructions?
Referenced lines:

```markdown
**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**. Continue to restore quorum.

### Step 6: Remove Lost Host Records
```
This step onward is identical to the partial loss process, right? Do we need to duplicate all of those instructions in multiple places? I'm now wondering if all of this could be combined into a single document.
I’ll wait until all the documentation is fully accurate and finalized. Then @tsivaprasad and I will collaborate to merge his content and mine into a single, well-structured document.
Referenced lines:

````markdown
```bash
JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster
````
I think I made this same comment on the Google Doc version: this will break unless you remove the host first, because we validate that host IDs are unique (#217). The steps for restoring a client-mode host should be identical to restoring a server-mode host, so I don't think we need two sets of instructions for these.
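A sketch of the remove-then-join sequence under that constraint, reusing the endpoints quoted in these guides; `LOST_HOST_ID` stands in for the host being restored:

```bash
# 1. Remove the stale host record so the host ID is free again
curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true"

# 2. Request a join token and rejoin (same flow for client- and server-mode hosts)
JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")
curl -sS -X POST "${CLIENT_HOST}/v1/cluster/join" \
  -H "Content-Type: application/json" \
  -d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"
```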
Referenced lines:

```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
```
What's the scenario that you're thinking of where the lost host still has data?
I'll consolidate client-mode and server-mode recovery into a single unified process, and add a step to remove the host record before joining to prevent duplicate host ID errors.
Referenced lines:

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
```
What's the scenario that you're thinking of where the control plane is still running on the lost host?
The host is accessible, but Control Plane has lost track of it or it's in a bad state. The Docker service may still be running from before the loss. We stop it, clear state, and restart to rejoin cleanly.
If the host is truly unreachable, you can't SSH to run Step 3, so this recovery process doesn't apply.
- PARTIAL_QUORUM_RECOVERY.md: Recovery when quorum remains intact
- MAJORITY_QUORUM_LOSS_RECOVERY.md: Recovery when the majority of server-mode hosts are lost
- TOTAL_QUORUM_LOSS_RECOVERY.md.md: Recovery when all server-mode hosts are lost

Key improvements: unified recovery processes, etcdutl snapshot restore, removed jq/etcdctl deps
PR Review: Disaster Recovery Documentation

Overview

This PR introduces three comprehensive disaster recovery guides for the pgEdge Control Plane, covering different quorum loss scenarios. The documentation provides step-by-step recovery procedures for operators to restore the Control Plane cluster after various failure scenarios.

Documents Reviewed

Document 1: TOTAL_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for the most severe disaster scenario: all server-mode hosts are lost (100% server-mode host loss). This is the starting point when no server-mode hosts are online and the Control Plane API is completely inaccessible.

What's Included

Prerequisites:
- `generated.config.json` from a server-mode host

Recovery Steps (6 steps)

Key Features:
- `MAJORITY_QUORUM_LOSS_RECOVERY.md` for next steps

Assessment

✅ Strengths:

Recommendation

✅ Approve - Well-structured, focused guide that correctly restores one server-mode host from backup.

Document 2: MAJORITY_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for when quorum is lost but at least one server-mode host is still online (>50% but not 100% of server-mode hosts offline). This guide is used after TOTAL_QUORUM_LOSS_RECOVERY.md has restored one host, or when some hosts remain online but quorum is lost.

What's Included

Prerequisites:

Recovery Steps (11 steps):
- `etcdutl snapshot restore` to restore etcd state

Key Features:
- Standard tooling (`etcdutl`)

Assessment

✅ Strengths:
- Mentions the `--bump-revision` and `--mark-compacted` flags but doesn't show them in the command

Recommendation

✅ Approve with minor suggestions:
- Add `--bump-revision` and `--mark-compacted` to the restore command if recommended for production

Document 3: PARTIAL_QUORUM_RECOVERY.md

Purpose

Recovery guide for when quorum remains intact but one or more hosts are lost. This is the simplest recovery scenario since the etcd cluster can still accept writes and elect leaders.

What's Included

Two Main Scenarios:

Scenario 1: Client-Mode Host Recovery
- `generated.config.json`

Scenario 2: Server-Mode Host Recovery (Quorum Intact)

Key Features:

Assessment

✅ Strengths:

✅ Notable Features:
- `source_node` for database restoration

Recommendation

✅ Approve - Excellent documentation that covers the simpler recovery scenarios comprehensively. Well-structured and practical.

Overall Assessment

Documentation Quality

✅ Strengths:

Consistency Observations
- `TOTAL_QUORUM_LOSS_RECOVERY.md.md` uses the backup directory approach
- `MAJORITY_QUORUM_LOSS_RECOVERY.md` uses the snapshot restore approach

Question: Is this intentional? Should both use the same approach, or are these different valid recovery methods?

Suggestions for Improvement
- Cross-References: Add clearer navigation between documents
- Consistency Check: Clarify when to use the backup directory vs snapshot restore
- Prerequisites Section
- Example Values (e.g., `PGEDGE_DATA_DIR="/data/control-plane"`)
- Error Handling (e.g., `|| exit 1` for critical steps)

Minor Issues
- File Naming: `TOTAL_QUORUM_LOSS_RECOVERY.md.md` has a double extension - consider renaming to `TOTAL_QUORUM_LOSS_RECOVERY.md`
- Step Numbering: In MAJORITY_QUORUM_LOSS_RECOVERY, Step 7 has sub-steps (7a-7f), which is good, but Step 8 references "Steps 7a-7e" - could be clearer
- Variable Consistency: Some documents use `${VAR}` and others use `$VAR` - consider standardizing

Summary

What's Included

This PR adds three comprehensive disaster recovery guides:

Key Highlights

Recommendation

✅ Approve - This is excellent documentation that fills a critical gap in operational procedures. The guides are well-structured, comprehensive, and practical.

Suggested Follow-ups:
- Rename `TOTAL_QUORUM_LOSS_RECOVERY.md.md` to remove the double extension

Plat-314