Create Quorum Recovery.md #262

Open
moizpgedge wants to merge 6 commits into main from Docs/PLAT-304/Recovery-approach-when-etcd-quorum-is-lost

Conversation

@moizpgedge
Contributor

@moizpgedge moizpgedge commented Feb 4, 2026

PR Review: Disaster Recovery Documentation

Overview

This PR introduces three comprehensive disaster recovery guides for the pgEdge Control Plane, covering different quorum loss scenarios. The documentation provides step-by-step recovery procedures for operators to restore the Control Plane cluster after various failure scenarios.

Documents Reviewed

  1. TOTAL_QUORUM_LOSS_RECOVERY.md - Recovery when all server-mode hosts are lost
  2. MAJORITY_QUORUM_LOSS_RECOVERY.md - Recovery when majority of server-mode hosts are lost (but at least one remains)
  3. PARTIAL_QUORUM_RECOVERY.md - Recovery when quorum remains intact but hosts are lost

Document 1: TOTAL_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for the most severe disaster scenario: all server-mode hosts are lost (100% server-mode host loss). This is the starting point when no server-mode hosts are online and the Control Plane API is completely inaccessible.

What's Included

Prerequisites:

  • etcd data backup directory (from before outage)
  • Certificate files and generated.config.json from a server-mode host
  • Recovery host with matching certificates/config

Recovery Steps (6 steps):

  1. Stop All Control Plane Services - Ensures clean state before recovery
  2. Prepare Recovery Host - Copies certificates and config files
  3. Backup Existing etcd Data - Safety measure to backup any existing data before restore
  4. Restore from Backup - Copies etcd backup directory to recovery host
  5. Start Control Plane on Recovery Host - Starts only the recovery host (one server-mode host)
  6. Verify Recovery Host - Confirms API accessibility and host status
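
To make the middle of this sequence concrete, here is a minimal sketch of what Steps 3-4 can look like on the recovery host, using the variable names from the guide's own snippets (the values shown are placeholders, not recommendations). Note that a reviewer later points out in the comment thread below that copying the directory alone is not sufficient if the cluster previously had more than one node; etcdutl snapshot restore is needed to reset membership.

```bash
# Minimal sketch only; variable names follow the guide, values are examples.
PGEDGE_DATA_DIR="/data/control-plane"        # example value
BACKUP_ETCD_DIR="/backups/etcd-pre-outage"   # example value

# Step 3: preserve any etcd data already present on the recovery host
if [ -d "${PGEDGE_DATA_DIR}/etcd" ]; then
    mv "${PGEDGE_DATA_DIR}/etcd" "${PGEDGE_DATA_DIR}/etcd.bak.$(date +%s)"
fi

# Step 4: copy the pre-outage etcd backup directory into place
cp -r "${BACKUP_ETCD_DIR}" "${PGEDGE_DATA_DIR}/etcd"
```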

Key Features:

  • Uses backup directory (not snapshot) for recovery
  • Restores one server-mode host only
  • Clear verification steps to confirm one host is online
  • References MAJORITY_QUORUM_LOSS_RECOVERY.md for next steps

Assessment

Strengths:

  • Concise and focused (108 lines)
  • Clear step-by-step instructions
  • Proper safety measures (backup before restore)
  • Good verification steps
  • Clear handoff to majority quorum loss recovery guide

⚠️ Observations:

  • Document uses backup directory approach (not snapshot restore)
  • Step 3 backs up existing data, which may not exist if all hosts are truly lost (but good safety practice)
  • Could benefit from a note about where to get the backup directory if not already available

Recommendation

Approve - Well-structured, focused guide that correctly restores one server-mode host from backup.


Document 2: MAJORITY_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for when quorum is lost but at least one server-mode host is still online (>50% but not 100% of server-mode hosts offline). This guide is used after TOTAL_QUORUM_LOSS_RECOVERY.md has restored one host, or when some hosts remain online but quorum is lost.

What's Included

Prerequisites:

  • etcd snapshot (taken before outage)
  • Recovery host: One of the remaining online server-mode hosts (already has certificates/config)

Recovery Steps (11 steps):

  1. Stop All Control Plane Services - Ensures clean state
  2. Backup Existing etcd Data - Safety backup
  3. Restore Snapshot - Uses etcdutl snapshot restore to restore etcd state
  4. Start Control Plane - Starts recovery host
  5. Verify Recovery Host - Confirms one host is online
  6. Remove Lost Host Records - Removes stale host records one at a time
  7. Rejoin Server-Mode Hosts Until Quorum Restored - Critical step-by-step process:
    • Stop service → Clear state → Start service → Get join token → Join host → Check quorum status
    • Repeats until quorum threshold is reached
  8. Rejoin Remaining Server-Mode Hosts - Adds redundancy after quorum restored
  9. Rejoin Client-Mode Hosts - Only after quorum is restored
  10. Restart All Hosts - Final restart cycle
  11. Final Verification - Comprehensive checks (hosts, databases, etcd cluster)
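
To illustrate Step 3, a hedged sketch of an etcdutl-based restore (SNAPSHOT_PATH and the data-directory layout are assumptions borrowed from the snippets quoted later in this review; check the guide for the exact flags it uses):

```bash
# Illustrative sketch only; confirm paths and flags against the guide.
PGEDGE_DATA_DIR="/data/control-plane"       # example value
SNAPSHOT_PATH="/backups/etcd-snapshot.db"   # example value

# Sanity-check the snapshot before restoring
etcdutl snapshot status "${SNAPSHOT_PATH}" -w table

# Restore into a fresh data directory (the target must not already exist);
# this resets cluster membership to a single member
etcdutl snapshot restore "${SNAPSHOT_PATH}" --data-dir "${PGEDGE_DATA_DIR}/etcd"
```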

Key Features:

  • Uses snapshot restore (via etcdutl)
  • Detailed quorum calculation and verification
  • Step-by-step host rejoining with quorum checking
  • Comprehensive troubleshooting section
  • Best practices included
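
The "rejoin until quorum" loop in Step 7 is the heart of this guide. A rough sketch of the pattern, treated as pseudocode (the join endpoints and payload come from the snippets quoted elsewhere in this review; host addresses, the port, and the jq dependency are examples, and the guide's per-host stop/clear/start steps are abbreviated here):

```bash
# Illustrative only; adapt host lists, ports, and endpoints to your cluster.
TOTAL_SERVER_HOSTS=5
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
SERVER_HOST="http://10.0.0.10:8080"   # restored recovery host API, example value
JOINED=1                              # the recovery host itself

for HOST in "http://10.0.0.11:8080" "http://10.0.0.12:8080"; do   # example hosts
    # (The guide also stops the service and clears state on each host first.)

    # Get a fresh join token from the recovery host
    JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")

    # Ask the lost host to rejoin the cluster
    curl -sS -X POST "${HOST}/v1/cluster/join" \
        -H "Content-Type: application/json" \
        -d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"

    JOINED=$(( JOINED + 1 ))
    if [ "${JOINED}" -ge "${QUORUM_THRESHOLD}" ]; then
        echo "✅ Quorum threshold (${QUORUM_THRESHOLD}) reached; remaining hosts add redundancy"
        break
    fi
done
```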

Assessment

Strengths:

  • Comprehensive and detailed (340 lines)
  • Excellent quorum calculation explanation
  • Clear decision points (when to stop rejoining server-mode hosts)
  • Good separation between server-mode and client-mode recovery
  • Strong troubleshooting section
  • Emphasizes working one host at a time

⚠️ Observations:

  • Uses snapshot restore (different from TOTAL_QUORUM_LOSS which uses backup directory)
  • Step 3 mentions optional --bump-revision and --mark-compacted flags but doesn't show them in the command
  • Could clarify the relationship with TOTAL_QUORUM_LOSS_RECOVERY (this is the continuation)

Recommendation

Approve with minor suggestions:

  • Consider adding --bump-revision and --mark-compacted to the restore command if recommended for production
  • Add a note at the beginning clarifying this can be used standalone (if one host remains) or as a continuation of TOTAL_QUORUM_LOSS_RECOVERY
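
For reference, if those flags are adopted the restore command would look roughly like this; both flags are real etcdutl snapshot restore options, but the bump value shown is only an illustrative placeholder:

```bash
# Illustrative only; choose a bump value larger than any revision clients may have observed.
etcdutl snapshot restore "${SNAPSHOT_PATH}" \
    --data-dir "${PGEDGE_DATA_DIR}/etcd" \
    --bump-revision 1000000000 \
    --mark-compacted
```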

Document 3: PARTIAL_QUORUM_RECOVERY.md

Purpose

Recovery guide for when quorum remains intact but one or more hosts are lost. This is the simplest recovery scenario since the etcd cluster can still accept writes and elect a leader.

What's Included

Two Main Scenarios:

Scenario 1: Client-Mode Host Recovery

  • When: One or more client-mode hosts are lost, quorum intact
  • Impact: Lost hosts unreachable, database instances offline, no quorum impact
  • Recovery Steps (5 steps):
    1. Stop Control Plane service
    2. Remove generated.config.json
    3. Start Control Plane service
    4. Request join token
    5. Join host to cluster
  • Repeatable: Can be done for each lost client-mode host independently

Scenario 2: Server-Mode Host Recovery (Quorum Intact)

  • When: One server-mode host lost, but enough others remain for quorum
  • Impact: Lost host unreachable, database instances offline, quorum remains intact
  • Recovery Steps (8 steps):
    1. Remove lost host record
    2. Stop host service
    3. Clear server-mode state (etcd, certificates, config)
    4. Start host service (or redeploy stack)
    5. Request join token
    6. Join lost host
    7. Verify host health
    8. Restore database instances (with ZODAN explanation)
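
Both scenarios assume quorum is still intact, so a quick pre-check helps confirm this guide is the right one. A hedged sketch (the /v1/hosts endpoint and the status/etcd_mode fields appear in the guides as quoted in this review, but the exact response shape is an assumption):

```bash
# Illustrative only; verify the actual /v1/hosts response shape before relying on this.
SERVER_HOST="http://10.0.0.10:8080"   # any reachable server-mode host, example value
TOTAL_SERVER_HOSTS=3                  # example value
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))

ONLINE_SERVERS=$(curl -sS "${SERVER_HOST}/v1/hosts" \
    | jq '[.[] | select(.etcd_mode == "server" and .status == "reachable")] | length')

if [ "${ONLINE_SERVERS}" -ge "${QUORUM_THRESHOLD}" ]; then
    echo "✅ Quorum intact (${ONLINE_SERVERS}/${TOTAL_SERVER_HOSTS} server-mode hosts reachable); use this guide"
else
    echo "⚠️ Quorum lost; use MAJORITY_QUORUM_LOSS_RECOVERY.md instead"
fi
```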

Key Features:

  • Two distinct scenarios with different procedures
  • Clear quorum calculation examples
  • Database instance restoration with ZODAN (Zero Downtime Add Node) explanation
  • Recovery order guidance for multiple hosts
  • Comprehensive verification checklist
  • Troubleshooting section

Assessment

Strengths:

  • Well-organized with clear scenario separation
  • Excellent explanation of ZODAN for database restoration
  • Good quorum calculation examples
  • Clear recovery order guidance
  • Comprehensive verification checklist
  • Practical troubleshooting

Notable Features:

  • Explains that Control Plane API remains accessible (key advantage)
  • Documents ZODAN automatic behavior
  • Shows how to specify source_node for database restoration
  • Clear distinction between client-mode and server-mode recovery

Recommendation

Approve - Excellent documentation that covers the simpler recovery scenarios comprehensively. Well-structured and practical.


Overall Assessment

Documentation Quality

Strengths:

  1. Clear Progression: The three documents form a logical progression from worst-case (total loss) → majority loss → partial loss
  2. Comprehensive Coverage: All major disaster scenarios are covered
  3. Practical Commands: All steps include executable commands with proper variable usage
  4. Safety First: Backup steps included before destructive operations
  5. Verification: Each guide includes verification steps
  6. Troubleshooting: Common issues and solutions documented
  7. Best Practices: Guidance on recovery order and procedures

Consistency Observations

⚠️ Inconsistency:

  • TOTAL_QUORUM_LOSS_RECOVERY.md.md uses the backup directory approach
  • MAJORITY_QUORUM_LOSS_RECOVERY.md uses the snapshot restore approach

Question: Is this intentional? Should both use the same approach, or are these different valid recovery methods?

Suggestions for Improvement

  1. Cross-References: Add clearer navigation between documents

    • TOTAL → MAJORITY (already done ✅)
    • Consider adding links from PARTIAL to MAJORITY if quorum is lost during recovery
  2. Consistency Check: Clarify when to use backup directory vs snapshot restore

    • Document the difference or standardize on one approach
  3. Prerequisites Section:

    • Consider adding a "Before You Begin" section explaining:
      • How to identify which guide to use
      • How to verify current cluster state
      • How to take snapshots/backups proactively
  4. Example Values:

    • Consider adding example values for variables (e.g., PGEDGE_DATA_DIR="/data/control-plane")
  5. Error Handling:

    • Add more explicit error handling in commands (e.g., || exit 1 for critical steps)
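
To make suggestions 4 and 5 concrete, a hedged sketch of what an "example values plus fail-fast" preamble could look like in the guides (all values are placeholders, not recommendations):

```bash
# Example values only; substitute your own environment's paths, ports, and addresses.
set -euo pipefail                      # fail fast on any error or unset variable

PGEDGE_DATA_DIR="/data/control-plane"
API_PORT=8080
RECOVERY_HOST_IP="10.0.0.10"

# Explicit error handling on a critical step
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" > /dev/null \
    || { echo "ERROR: Control Plane API is not reachable on the recovery host"; exit 1; }
```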

Minor Issues

  1. File Naming: TOTAL_QUORUM_LOSS_RECOVERY.md.md has double extension - consider renaming to TOTAL_QUORUM_LOSS_RECOVERY.md

  2. Step Numbering: In MAJORITY_QUORUM_LOSS_RECOVERY, Step 7 has sub-steps (7a-7f) which is good, but Step 8 references "Steps 7a-7e" - could be clearer

  3. Variable Consistency: Some documents use ${VAR} and others use $VAR - consider standardizing


Summary

What's Included

This PR adds three comprehensive disaster recovery guides:

  1. Total Quorum Loss Recovery - Restores one server-mode host from backup when all are lost
  2. Majority Quorum Loss Recovery - Restores quorum by rejoining hosts after snapshot restore
  3. Partial Quorum Recovery - Recovers lost hosts when quorum remains intact

Key Highlights

  • Complete Coverage: All disaster scenarios from total loss to partial loss
  • Step-by-Step: Detailed, executable commands for each scenario
  • Safety First: Backup procedures before destructive operations
  • Verification: Comprehensive checks at each stage
  • Troubleshooting: Common issues and solutions documented
  • Best Practices: Recovery order and procedural guidance

Recommendation

Approve - This is excellent documentation that fills a critical gap in operational procedures. The guides are well-structured, comprehensive, and practical.

Suggested Follow-ups:

  1. Clarify backup directory vs snapshot restore approach
  2. Add cross-references and navigation improvements
  3. Consider renaming TOTAL_QUORUM_LOSS_RECOVERY.md.md to remove double extension
  4. Add a "Quick Reference" or "Decision Tree" to help operators choose the right guide
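
For example, the decision tree could be as simple as (illustrative wording only):

  • All server-mode hosts lost, Control Plane API completely down → TOTAL_QUORUM_LOSS_RECOVERY.md
  • Majority of server-mode hosts lost, at least one still online → MAJORITY_QUORUM_LOSS_RECOVERY.md
  • Hosts lost but quorum (floor(N/2)+1 server-mode hosts) still intact → PARTIAL_QUORUM_RECOVERY.md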

Plat-314

@coderabbitai

coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

📝 Walkthrough

Adds four new documentation guides describing quorum rules and structured recovery procedures for a pgEdge Control Plane with embedded etcd, covering partial, majority, and total quorum loss scenarios, step-by-step recovery workflows, commands, verification steps, and operator checklists.

Changes

Cohort / File(s) and Summary:

Quorum reference (docs/Quorum Recovery.md)
New guide defining server-mode vs client-mode hosts, quorum calculation (floor(N/2)+1), failure scenarios matrix, and targeted recovery actions with examples and verification steps.

Disaster recovery workflows (docs/Disaster recovery/PARTIAL_QUORUM_RECOVERY.md, docs/Disaster recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md, docs/Disaster recovery/TOTAL_QUORUM_LOSS_RECOVERY.md.md)
Three new recovery guides: partial-quorum procedures when quorum remains; majority-quorum-loss workflow restoring server-mode hosts from snapshots; total-quorum-loss workflow for full server-mode outage including recovery-host steps, etcd snapshot/restore, and transition to majority recovery. Each includes prerequisites, commands, verification, and operator checklist.

Poem

🐇 I nibble logs beneath the moon,

When votes are sparse and hearts are strewn,
One hop, one host, we mend the ring,
Snapshots hum and joiners sing,
A rabbit bounces—quorum's boon.

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title 'Create Quorum Recovery.md' clearly and specifically describes the main change: addition of a new Quorum Recovery documentation file.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check: ✅ Passed. The PR description is comprehensive and well-structured, providing detailed reviews of three disaster recovery guides with clear assessment and recommendations.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `docs/Quorum Recovery.md`:
- Line 634: Replace the incorrect step label "Step 11d" with "Step 13d" in the
rejoining client-mode hosts section (the line currently showing "###### Step
11d: Get join token from the recovery host"); update any matching subheading
text to read "Step 13d" so it correctly falls under the Step 13 sequence for
rejoining client-mode hosts.
🧹 Nitpick comments (1)
docs/Quorum Recovery.md (1)

118-120: Optional: Consider minor rephrasing for variety (not blocking).

Static analysis flags repetitive sentence beginnings at lines 118-120 ("You will...") and lines 314-316 ("Rejoin..."). While these are stylistically flagged, they're actually acceptable in technical documentation contexts—the first is in an assumptions list, and the second is in a procedural step list where parallel structure aids clarity.

Also applies to: 314-316

sleep 10
```

###### Step 11d: Get join token from the recovery host

⚠️ Potential issue | 🟡 Minor

Fix incorrect step numbering.

Line 634 shows "Step 11d" but should be "Step 13d" (this is within the Step 13 section for rejoining client-mode hosts). This appears to be a copy-paste error.

📝 Proposed fix
-###### Step 11d: Get join token from the recovery host
+###### Step 13d: Get join token from the recovery host

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md`:
- Line 1: The directory name contains a trailing space ("Disaster recovery ")
which breaks tooling; rename the directory and update the file
MAJORITY_QUORUM_LOSS_RECOVERY.md to live under a normalized directory name
(e.g., "Disaster_Recovery" or "disaster-recovery") with no spaces, and ensure
any references or links to MAJORITY_QUORUM_LOSS_RECOVERY.md are updated
accordingly.
- Around line 121-132: After extracting RESP and TASK_ID, validate that TASK_ID
is non-empty and abort with a clear error message if empty (refer to RESP and
TASK_ID) to avoid polling an invalid endpoint; in the polling loop inspect
STATUS (derived from the tasks endpoint) and if it equals "failed" print/log the
task details from RESP or the task query, exit with a non-zero code, and avoid
continuing recovery as if deletion succeeded; ensure both error paths provide
actionable messages including task id and status so operators can debug.
- Around line 182-185: The etcd credentials are being passed on the command line
(ETCD_USER/ETCD_PASS in the etcdctl invocation) which exposes them; instead
export them as environment variables and call etcdctl without the
--user/--password flags so the secret never appears in the process
list—set/export ETCDCTL_USERNAME (or an agreed env var) to "${ETCD_USER}" and
ETCDCTL_PASSWORD to "${ETCD_PASS}" before running the etcdctl endpoint status
command that populates SERVER_COUNT (and apply the same change to the similar
etcdctl call in Step 11).
- Line 101: The docs claim Control Plane auto-detects restored snapshots and
handles the --force-new-cluster flag, but the code paths IsInitialized(),
start(), and embedConfig() lack any snapshot detection or flag handling; either
remove that note from the docs or implement detection: update
IsInitialized()/start() to inspect snapshot metadata vs WAL presence (detect a
restored snapshot with missing WAL), and modify embedConfig() (or the etcd start
logic) to set the equivalent ForceNewCluster behavior or inject the
--force-new-cluster setting when a restored snapshot is detected so etcd starts
correctly after snapshot restore.
🧹 Nitpick comments (4)
docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md (4)

6-8: Clarify the emoji indicators.

Lines 6-8 use ❌ (negative indicator) but describe capabilities that "may be accessible" or "cannot proceed", which creates inconsistency. The ❌ typically means "not available" or "blocked", but line 6 says the API "may be accessible", suggesting partial availability rather than complete failure.

Consider using ⚠️ for partial/uncertain states or rewriting to clearly indicate whether these are always blocked or conditionally available.


91-91: Provide guidance for the optional production flags.

The optional flags --bump-revision and --mark-compacted are mentioned as "recommended for production" but lack guidance on what value to use for <revision-bump-value>. Operators following this guide won't know what value is appropriate.

Consider adding a note explaining how to determine the revision bump value (e.g., "use a large value like 1000000 to ensure new revisions are higher than client watch revisions").


321-321: Use safer process termination sequence.

The command sudo pkill -9 etcd forcefully kills all etcd processes with SIGKILL, preventing graceful shutdown and potentially leaving corrupted state. It may also affect unrelated etcd installations on the same system.

⚡ Proposed safer sequence
-Kill any remaining etcd processes: `sudo pkill -9 etcd`
+Check for and gracefully terminate any remaining etcd processes:
+```bash
+# Check if etcd processes are running
+pgrep -a etcd
+
+# Try graceful termination first
+sudo pkill -TERM etcd
+sleep 5
+
+# Force kill only if still running
+if pgrep etcd > /dev/null; then
+    echo "⚠️ Etcd processes still running, forcing termination..."
+    sudo pkill -9 etcd
+fi
+```

39-55: Add snapshot verification to prerequisites.

The guide assumes the snapshot exists and is valid but doesn't include verification steps. A corrupted or stale snapshot will cause recovery to fail, potentially after significant progress through the procedure.

📋 Proposed snapshot verification steps

Add after line 54 (before "## Recovery Steps"):

### Verify Snapshot

Before starting recovery, verify the snapshot is valid and recent:

```bash
# Check snapshot exists
ls -lh "${SNAPSHOT_PATH}"

# Verify snapshot integrity
etcdutl snapshot status "${SNAPSHOT_PATH}" -w table

# Check snapshot age (optional)
SNAPSHOT_DATE=$(stat -c %y "${SNAPSHOT_PATH}" 2>/dev/null || stat -f %Sm "${SNAPSHOT_PATH}")
echo "Snapshot created: ${SNAPSHOT_DATE}"

Decision point: Confirm the snapshot age is acceptable for your data loss tolerance before proceeding.


@@ -0,0 +1,339 @@
# Majority Quorum Loss Recovery Guide

⚠️ Potential issue | 🟡 Minor

Fix directory name with trailing space.

The file path docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md contains a directory name with a trailing space ("Disaster recovery "). This can cause issues with shell scripts, tab completion, and automated tooling.

Rename to: docs/Disaster_Recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md or docs/disaster-recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md (following common naming conventions with underscores or hyphens, no spaces).


sleep 10
```

**Note:** Control Plane automatically detects the restored snapshot and handles `--force-new-cluster` internally. You don't need to run etcd manually.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n -C3 "force-new-cluster|forceNewCluster" --type=go

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Also search for snapshot detection and recovery logic
rg -n -C3 "snapshot.*detect|detect.*snapshot|MAJORITY_QUORUM" --type=go

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Search for etcd initialization and cluster startup patterns
rg -n -C3 "cluster.*init|initial.*cluster|new-cluster" --type=go

Repository: pgEdge/control-plane

Length of output: 50377


🏁 Script executed:

# Search for etcd initialization and where command flags are built
rg -n "start.*etcd|StartEtcd|cmd.*etcd|--force" --type=go -A5 | head -200

Repository: pgEdge/control-plane

Length of output: 12065


🏁 Script executed:

# Look for recovery or snapshot detection logic
rg -n -i "snapshot|recovery|restore" --type=go server/internal/etcd | head -100

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Search for where etcd is started in the actual code
find . -name "*.go" -exec rg -l "etcd.*start|start.*etcd" {} \; | head -20

Repository: pgEdge/control-plane

Length of output: 160


🏁 Script executed:

cat -n server/internal/etcd/embedded.go | head -150

Repository: pgEdge/control-plane

Length of output: 5332


🏁 Script executed:

# Look for the embedConfig function that builds etcd config
sed -n '200,400p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 5422


🏁 Script executed:

# Search for ForceNewCluster or InitialCluster configuration
rg -n "ForceNewCluster|InitialCluster|initial_cluster" --type=go

Repository: pgEdge/control-plane

Length of output: 401


🏁 Script executed:

# Find initializationConfig and embedConfig functions
rg -n "func initializationConfig|func embedConfig" --type=go -A 20

Repository: pgEdge/control-plane

Length of output: 3442


🏁 Script executed:

# Search for snapshot, restore, or WAL related operations
rg -n -i "snapshot|restore.*etcd|wal.*restore" --type=go | head -50

Repository: pgEdge/control-plane

Length of output: 1879


🏁 Script executed:

# Check the disaster recovery documentation itself to see what it claims
cat docs/Disaster\ recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md | head -120

Repository: pgEdge/control-plane

Length of output: 153


🏁 Script executed:

# Find the disaster recovery documentation
find . -name "*MAJORITY_QUORUM*" -o -name "*Disaster*recovery*" 2>/dev/null

Repository: pgEdge/control-plane

Length of output: 148


🏁 Script executed:

# List files in docs directory to understand structure
ls -la docs/ 2>/dev/null || ls -la "docs/" 2>/dev/null || find docs -maxdepth 2 -type f -name "*.md" 2>/dev/null | head -20

Repository: pgEdge/control-plane

Length of output: 1228


🏁 Script executed:

# Search for content matching the review comment
rg -n "automatically detects|force-new-cluster" --type=md

Repository: pgEdge/control-plane

Length of output: 1025


🏁 Script executed:

cat "docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md" | sed -n '90,110p'

Repository: pgEdge/control-plane

Length of output: 715


🏁 Script executed:

# Check the start() function to see if there's any snapshot or force-new-cluster handling
sed -n '150,250p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2943


🏁 Script executed:

# Check the embedConfig function more thoroughly
sed -n '626,695p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2937


The automatic force-new-cluster detection claim is not implemented in the Control Plane codebase.

The documentation states that Control Plane "automatically detects the restored snapshot and handles --force-new-cluster internally," but the code does not support this claim. The IsInitialized() method only checks for the existence of the WAL directory to determine if etcd has been started before. The start() and embedConfig() functions create a standard etcd configuration without any snapshot detection logic or --force-new-cluster flag handling. When etcd starts on a restored snapshot, it will not automatically apply --force-new-cluster unless the WAL directory is missing—meaning the recovery procedure as documented will not work as described.

Either remove this note from the documentation, or implement the snapshot detection and automatic flag handling in the Control Plane code.


Comment on lines 121 to 132
RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')

# Wait for completion
while true; do
STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
break
fi
sleep 5
done
```

🛠️ Refactor suggestion | 🟠 Major

Add error handling for task failures and missing task IDs.

The script has two error handling gaps:

  1. Empty TASK_ID: If all jq paths fail to extract a task ID (line 122), TASK_ID will be empty, causing invalid API calls in the polling loop.
  2. Failed tasks: The loop exits on "failed" status (line 127) but doesn't handle the failure - recovery continues as if deletion succeeded.
🛡️ Proposed fix to add validation and failure handling
 RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
 
+# Validate task ID exists
+if [ -z "${TASK_ID}" ]; then
+    echo "❌ Failed to get task ID from response: ${RESP}"
+    exit 1
+fi
+
 # Wait for completion
 while true; do
     STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
     if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
         break
     fi
     sleep 5
 done
+
+# Check if task failed
+if [ "${STATUS}" = "failed" ]; then
+    echo "❌ Host removal task failed for ${LOST_HOST_ID}"
+    exit 1
+fi

Comment on lines 182 to 185
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"

⚠️ Potential issue | 🟠 Major

Security risk: Credentials exposed in process list.

Lines 182-184 pass etcd credentials via command-line arguments (--user and --password), which are visible in the process list (ps aux) and shell history. This exposes sensitive credentials to any user on the system.

🔒 Proposed fix using environment variables
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+export ETCDCTL_USER="${ETCD_USER}"
+export ETCDCTL_PASSWORD="${ETCD_PASS}"
+SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status \
+    --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+    --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+    --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+    --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+    -w json | jq 'length')
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️  Continue rejoining server-mode hosts"

Note: Apply the same pattern to lines 260-270 in Step 11.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md`:
- Around line 140-152: After capturing RESP and deriving TASK_ID, validate that
TASK_ID is non-empty before entering the wait loop: check RESP for an error or
missing task fields (the curl response stored in RESP and extraction into
TASK_ID) and if TASK_ID is empty, log or echo the RESP and a clear error message
and exit nonzero (or skip waiting) to avoid hanging; otherwise proceed to the
existing wait loop using TASK_ID. Ensure the validation references the RESP
variable and the TASK_ID extraction logic so you handle unexpected DELETE
failure formats consistently.
- Around line 59-100: The recovery steps lack error handling and proper quoting:
update the rm command to quote ${PGEDGE_DATA_DIR} (target: generated.config.json
removal), wrap SERVER_HOST and CLIENT_HOST expansions in quotes when calling
curl, capture the join token into JOIN_TOKEN and validate it is non-empty/not
"null" before proceeding, replace the blind sleep after docker service scale
control-plane_<lost-client-host-id>=1 with a readiness loop that polls the
cluster health endpoint (e.g., ${CLIENT_HOST}/v1/health) and only continues when
healthy, and ensure any failing curl or jq commands cause the script to exit
with an error message so failures are not silent.

In `docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md`:
- Line 1: Rename the file named TOTAL_QUORUM_LOSS_RECOVERY.md.md to remove the
duplicate extension (TOTAL_QUORUM_LOSS_RECOVERY.md) and update any internal
references or links that point to TOTAL_QUORUM_LOSS_RECOVERY.md.md so they
reference TOTAL_QUORUM_LOSS_RECOVERY.md instead.
- Around line 199-206: The one-liner exposing ETCD_PASS on the command line and
hard to read should be refactored: read ETCD_USER/ETCD_PASS from
generated.config.json (already present), export a combined ETCDCTL_USER or use
an env var for credentials and set ETCDCTL_API=3, call etcdctl endpoint status
across multiple lines (using --endpoints, --cacert, --cert, --key) without
passing --user/--password, capture the JSON length into SERVER_COUNT, then
immediately unset the credential env var; update the check that compares
SERVER_COUNT to QUORUM_THRESHOLD (the existing SERVER_COUNT, ETCD_USER,
ETCD_PASS, ETCDCTL_USER and ETCDCTL_API symbols should be used to locate and
change the code).
- Around line 280-290: The etcdctl invocation is exposing ETCD_PASS on the
command line; instead export the secret into an environment variable (e.g.,
export ETCDCTL_PASSWORD="${ETCD_PASS}") and remove the --password "${ETCD_PASS}"
argument from the etcdctl endpoint status call (keep ETCDCTL_API, --endpoints,
--cacert, --cert, --key, and --user as-is), then immediately unset
ETCDCTL_PASSWORD after the command to avoid leaving credentials in the process
environment.
🧹 Nitpick comments (3)
docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md (1)

322-322: Hyphenate compound modifier for clarity.

The phrase "zero downtime add node" functions as a compound modifier and should be hyphenated as "zero-downtime-add-node" for grammatical correctness.

📝 Proposed fix
-**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero downtime add node.
+**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero-downtime-add-node.
docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md (2)

86-90: Clarify how to verify certificate identity match.

The instruction "Ensure the host ID matches the certificate identity" is vague and lacks actionable verification steps. During a critical outage, operators need explicit commands to validate this requirement.

📋 Suggested addition for verification steps

Add a verification subsection:

**Verify certificate identity:**

```bash
# Extract Common Name from certificate
CERT_CN=$(openssl x509 -in "${PGEDGE_DATA_DIR}/certificates/etcd-server.crt" -noout -subject | grep -oP 'CN\s*=\s*\K[^,]+')

# Compare with host ID
if [ "${CERT_CN}" != "${RECOVERY_HOST_ID}" ]; then
    echo "ERROR: Certificate CN (${CERT_CN}) does not match host ID (${RECOVERY_HOST_ID})"
    exit 1
fi
echo "✅ Certificate identity matches host ID"

</details>

---

`341-344`: **Recommend graceful shutdown before SIGKILL.**

Using `pkill -9` (SIGKILL) immediately prevents etcd from cleanly shutting down. In a troubleshooting scenario, attempting a graceful shutdown first reduces risk of data corruption.



🔧 Proposed improvement

```diff
 **Fix:** 
 - Stop all Control Plane services (Step 1)
-- Kill any remaining etcd processes: `sudo pkill -9 etcd`
+- Try graceful shutdown first: `sudo pkill etcd` (wait 10 seconds)
+- If processes remain, force kill: `sudo pkill -9 etcd`
 - Verify certificate files match the snapshot identity
```

Comment on lines 59 to 100
```bash
PGEDGE_DATA_DIR="<path-to-data-dir>"
API_PORT=<api-port>
SERVER_HOST="http://<server-mode-host-ip>:${API_PORT}"
CLIENT_HOST="http://<client-mode-host-ip>:${API_PORT}"
```

#### Step 1: Stop the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
```

#### Step 2: Remove generated.config.json

```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
```

#### Step 3: Start the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=1
sleep 10
```

#### Step 4: Request a Join Token

```bash
JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster

```bash
curl -sS -X POST ${CLIENT_HOST}/v1/cluster/join \
-H "Content-Type: application/json" \
-d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"
```

⚠️ Potential issue | 🟠 Major

Add error handling and quote variable expansions.

The recovery commands lack error handling, which could cause silent failures during a critical outage. Additionally, unquoted path variables could break if paths contain spaces.

Issues:

  1. Line 91: curl could fail but the script continues
  2. Line 77: Unquoted ${PGEDGE_DATA_DIR} in rm command
  3. Lines 85-86: sleep 10 doesn't verify service readiness
🛡️ Proposed improvements for safety and reliability

Add error handling for API calls:

 #### Step 4: Request a Join Token
 
 ```bash
-JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
+JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")
+if [ -z "${JOIN_TOKEN}" ] || [ "${JOIN_TOKEN}" = "null" ]; then
+    echo "ERROR: Failed to obtain join token"
+    exit 1
+fi

**Quote path variables:**

```diff
 #### Step 2: Remove generated.config.json
 
 ```bash
 # On lost client-mode host node
-rm -f ${PGEDGE_DATA_DIR}/generated.config.json
+rm -f "${PGEDGE_DATA_DIR}/generated.config.json"

**Add service readiness check:**

```diff
 #### Step 3: Start the Control Plane Service
 
 ```bash
 # On Swarm manager node
 docker service scale control-plane_<lost-client-host-id>=1
-sleep 10
+
+# Wait for service to be ready
+until curl -sS "${CLIENT_HOST}/v1/health" >/dev/null 2>&1; do
+    sleep 2
+done


Comment on lines 140 to 152
#### Step 1: Remove the Lost Host Record

**Run from a node with API access:**

```bash
RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
echo "${RESP}" | jq '.'
```

#### Step 2: Stop the Host Service

**Run on a Swarm manager node. Skip this step if the service name does not exist.**

⚠️ Potential issue | 🟠 Major

Validate task ID before waiting for completion.

If the DELETE request fails or returns an unexpected format, TASK_ID could be empty, causing the wait loop to fail or behave unexpectedly during a critical recovery operation.

🛡️ Proposed fix to validate task ID
 RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
-echo "${RESP}" | jq '.'
+
+if [ -z "${TASK_ID}" ] || [ "${TASK_ID}" = "null" ]; then
+    echo "ERROR: Failed to delete host or obtain task ID"
+    echo "${RESP}" | jq '.'
+    exit 1
+fi
+
+echo "Task ID: ${TASK_ID}"

@@ -0,0 +1,366 @@
# Total Quorum Loss Recovery Guide

⚠️ Potential issue | 🟡 Minor

Fix double file extension in filename.

The file is named TOTAL_QUORUM_LOSS_RECOVERY.md.md with a duplicate .md extension. This should be corrected to TOTAL_QUORUM_LOSS_RECOVERY.md.


Comment on lines 199 to 206
```bash
TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
```

⚠️ Potential issue | 🟠 Major

Refactor long command and avoid password on command line.

Line 204 contains a 300+ character command that is difficult to read and exposes credentials on the command line. During a critical outage, operators need clear, debuggable commands.

Issues:

  1. Security: Password visible in process listings (ps aux)
  2. Readability: Extremely long one-liner is hard to verify
  3. Maintainability: Difficult to modify or troubleshoot
♻️ Proposed refactor for clarity and security
 #### 8f. Check Quorum Status
 ```bash
 TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
 QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
+
+# Read credentials from config
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+
+# Use environment variable to avoid password on command line
+export ETCDCTL_USER="${ETCD_USER}:${ETCD_PASS}"
+export ETCDCTL_API=3
+
+# Check server count
+SERVER_COUNT=$(etcdctl endpoint status \
+    --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+    --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+    --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+    --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+    -w json | jq 'length')
+
+# Clear credential from environment
+unset ETCDCTL_USER
+
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️  Continue rejoining server-mode hosts"


@moizpgedge moizpgedge force-pushed the Docs/PLAT-304/Recovery-approach-when-etcd-quorum-is-lost branch from 356fb56 to 38fbae8 on February 12, 2026 10:29
rm -rf "${PGEDGE_DATA_DIR}/etcd"

# Copy etcd data from backup directory
cp -r "${BACKUP_ETCD_DIR}" "${PGEDGE_DATA_DIR}/etcd"
Member

This will not be sufficient if the cluster previously had more than one node. You'll need to run etcdutl snapshot restore to reset the cluster membership like we did on the call yesterday.

Contributor Author

Thanks, I'll update Step 4 to use etcdutl snapshot restore to reset cluster membership to a single node, which will match the approach we discussed.
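
For context, a hedged sketch of the kind of command that change implies. When starting from a copied etcd data directory rather than a snapshot file, one common approach is to restore from the backup's member/snap/db file, which rewrites membership to a single member (the --skip-hash-check flag is needed because a raw db file carries no snapshot integrity hash; all paths here are assumptions, not the guide's final wording):

```bash
# Illustrative only; align with the updated Step 4 before relying on it.
BACKUP_ETCD_DIR="/backups/etcd-pre-outage"   # example value
PGEDGE_DATA_DIR="/data/control-plane"        # example value

etcdutl snapshot restore "${BACKUP_ETCD_DIR}/member/snap/db" \
    --skip-hash-check \
    --data-dir "${PGEDGE_DATA_DIR}/etcd"
```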

Comment on lines 7 to 8
- etcd data backup directory (from before outage)
- Certificate files and `generated.config.json` from a server-mode host (from before outage)
Member

I think we can combine these first two items and just say "A snapshot or other backup of the Control Plane data volume from before the outage". We're going to have to recommend some type of backups in our installation guide, and volume-level snapshots will be the easiest to recommend and easiest for customers to implement.

Contributor Author

Thanks! I'll update it to combine those into a single volume-level backup requirement. The restore process will still use etcdutl snapshot restore to reset cluster membership for multi-node clusters.

Comment on lines 34 to 37
Copy certificate files and `generated.config.json` from before outage to:
- `${PGEDGE_DATA_DIR}/certificates/`
- `${PGEDGE_DATA_DIR}/generated.config.json`

Member

I think users should always restore the entire data volume, right? Otherwise, they're missing their Postgres instance data. If you've encountered scenarios where that doesn't work properly, you'll need to be specific here and describe when they should or shouldn't restore the entire directory/volume.

Contributor Author

I'll add specific guidance on when to restore the entire volume vs. selective restoration. The document will clarify that full volume restoration is recommended in most cases, with selective restoration only for specific scenarios like recovering only Control Plane state without Postgres data.


```bash
# Verify API is accessible
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" | jq '.'
Member

Could you please remove jq from these commands? It is a nice suggestion and it's my preference as well, but we can't assume that it's installed on every system.

Contributor Author

sure

- One host should show `status: "reachable"` and `etcd_mode: "server"`
- Server count should be **1**

**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**.
Member

Could you please remove this:

Quorum is NOT YET RESTORED.

This process resets the cluster membership to one server, so you do have quorum. What's more accurate is to say that you need to re-add the other Control Plane servers to the cluster.

# Check databases
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases" | jq '.[] | {id, state}'

# Check etcd cluster
Member

I'm hesitant to recommend that users try to do anything with our internal Etcd other than using etcdutl to restore the instance. The host check above is already sufficient for validating that Etcd is working properly, so could you please remove these etcdctl instructions?


**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**. Continue to restore quorum.

### Step 6: Remove Lost Host Records
Member

This step onward is identical to the partial loss process, right? Do we need to duplicate all of those instructions in multiple places? I'm now wondering if all of this could be combined into a single document.

Contributor Author

I’ll wait until all the documentation is fully accurate and finalized. Then @tsivaprasad and I will collaborate to merge his content and mine into a single, well-structured document.

JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster
Member

I think I made this same comment on the Google Doc version: this will break unless you remove the host first, because we validate that host IDs are unique (#217). The steps for restoring a client-mode host should be identical to restoring a server-mode host, so I don't think we need two sets of instructions for these.


```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
Member

What's the scenario that you're thinking of where the lost host still has data?

Contributor Author

I'll consolidate client-mode and server-mode recovery into a single unified process, and add a step to remove the host record before joining to prevent duplicate host ID errors.


```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
Member

What's the scenario that you're thinking of where the control plane is still running on the lost host?

Contributor Author

The host is accessible, but Control Plane has lost track of it or it's in a bad state. The Docker service may still be running from before the loss. We stop it, clear state, and restart to rejoin cleanly.
If the host is truly unreachable, you can't SSH to run Step 3, so this recovery process doesn't apply.

- PARTIAL_QUORUM_RECOVERY.md: Recovery when quorum remains intact

- MAJORITY_QUORUM_LOSS_RECOVERY.md: Recovery when majority server-mode hosts lost

- TOTAL_QUORUM_LOSS_RECOVERY.md.md: Recovery when all server-mode hosts lost

Key improvements: unified recovery processes, etcdutl snapshot restore, removed jq/etcdctl deps