Create Quorum Recovery.md #262

Open
moizpgedge wants to merge 6 commits into main from Docs/PLAT-304/Recovery-approach-when-etcd-quorum-is-lost

Conversation

@moizpgedge
Contributor

@moizpgedge moizpgedge commented Feb 4, 2026

PR Review: Disaster Recovery Documentation

Overview

This PR introduces three comprehensive disaster recovery guides for the pgEdge Control Plane, covering different quorum loss scenarios. The documentation provides step-by-step recovery procedures for operators to restore the Control Plane cluster after various failure scenarios.

Documents Reviewed

  1. TOTAL_QUORUM_LOSS_RECOVERY.md - Recovery when all server-mode hosts are lost
  2. MAJORITY_QUORUM_LOSS_RECOVERY.md - Recovery when majority of server-mode hosts are lost (but at least one remains)
  3. PARTIAL_QUORUM_RECOVERY.md - Recovery when quorum remains intact but hosts are lost

Document 1: TOTAL_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for the most severe disaster scenario: all server-mode hosts are lost (100% server-mode host loss). This is the starting point when no server-mode hosts are online and the Control Plane API is completely inaccessible.

What's Included

Prerequisites:

  • etcd data backup directory (from before outage)
  • Certificate files and generated.config.json from a server-mode host
  • Recovery host with matching certificates/config

Recovery Steps (6 steps):

  1. Stop All Control Plane Services - Ensures clean state before recovery
  2. Prepare Recovery Host - Copies certificates and config files
  3. Backup Existing etcd Data - Safety measure to backup any existing data before restore
  4. Restore from Backup - Copies etcd backup directory to recovery host
  5. Start Control Plane on Recovery Host - Starts only the recovery host (one server-mode host)
  6. Verify Recovery Host - Confirms API accessibility and host status
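
To make the middle of this sequence concrete, here is a minimal sketch of what Steps 3-4 can look like on the recovery host, using the variable names from the guide's own snippets (the values shown are placeholders, not recommendations). Note that a reviewer later points out in the comment thread below that copying the directory alone is not sufficient if the cluster previously had more than one node; etcdutl snapshot restore is needed to reset membership.

```bash
# Minimal sketch only; variable names follow the guide, values are examples.
PGEDGE_DATA_DIR="/data/control-plane"        # example value
BACKUP_ETCD_DIR="/backups/etcd-pre-outage"   # example value

# Step 3: preserve any etcd data already present on the recovery host
if [ -d "${PGEDGE_DATA_DIR}/etcd" ]; then
    mv "${PGEDGE_DATA_DIR}/etcd" "${PGEDGE_DATA_DIR}/etcd.bak.$(date +%s)"
fi

# Step 4: copy the pre-outage etcd backup directory into place
cp -r "${BACKUP_ETCD_DIR}" "${PGEDGE_DATA_DIR}/etcd"
```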

Key Features:

  • Uses backup directory (not snapshot) for recovery
  • Restores one server-mode host only
  • Clear verification steps to confirm one host is online
  • References MAJORITY_QUORUM_LOSS_RECOVERY.md for next steps

Assessment

Strengths:

  • Concise and focused (108 lines)
  • Clear step-by-step instructions
  • Proper safety measures (backup before restore)
  • Good verification steps
  • Clear handoff to majority quorum loss recovery guide

⚠️ Observations:

  • Document uses backup directory approach (not snapshot restore)
  • Step 3 backs up existing data, which may not exist if all hosts are truly lost (but good safety practice)
  • Could benefit from a note about where to get the backup directory if not already available

Recommendation

Approve - Well-structured, focused guide that correctly restores one server-mode host from backup.


Document 2: MAJORITY_QUORUM_LOSS_RECOVERY.md

Purpose

Recovery guide for when quorum is lost but at least one server-mode host is still online (>50% but not 100% of server-mode hosts offline). This guide is used after TOTAL_QUORUM_LOSS_RECOVERY.md has restored one host, or when some hosts remain online but quorum is lost.

What's Included

Prerequisites:

  • etcd snapshot (taken before outage)
  • Recovery host: One of the remaining online server-mode hosts (already has certificates/config)

Recovery Steps (11 steps):

  1. Stop All Control Plane Services - Ensures clean state
  2. Backup Existing etcd Data - Safety backup
  3. Restore Snapshot - Uses etcdutl snapshot restore to restore etcd state
  4. Start Control Plane - Starts recovery host
  5. Verify Recovery Host - Confirms one host is online
  6. Remove Lost Host Records - Removes stale host records one at a time
  7. Rejoin Server-Mode Hosts Until Quorum Restored - Critical step-by-step process:
    • Stop service → Clear state → Start service → Get join token → Join host → Check quorum status
    • Repeats until quorum threshold is reached
  8. Rejoin Remaining Server-Mode Hosts - Adds redundancy after quorum restored
  9. Rejoin Client-Mode Hosts - Only after quorum is restored
  10. Restart All Hosts - Final restart cycle
  11. Final Verification - Comprehensive checks (hosts, databases, etcd cluster)
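
To illustrate Step 3, a hedged sketch of an etcdutl-based restore (SNAPSHOT_PATH and the data-directory layout are assumptions borrowed from the snippets quoted later in this review; check the guide for the exact flags it uses):

```bash
# Illustrative sketch only; confirm paths and flags against the guide.
PGEDGE_DATA_DIR="/data/control-plane"       # example value
SNAPSHOT_PATH="/backups/etcd-snapshot.db"   # example value

# Sanity-check the snapshot before restoring
etcdutl snapshot status "${SNAPSHOT_PATH}" -w table

# Restore into a fresh data directory (the target must not already exist);
# this resets cluster membership to a single member
etcdutl snapshot restore "${SNAPSHOT_PATH}" --data-dir "${PGEDGE_DATA_DIR}/etcd"
```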

Key Features:

  • Uses snapshot restore (via etcdutl)
  • Detailed quorum calculation and verification
  • Step-by-step host rejoining with quorum checking
  • Comprehensive troubleshooting section
  • Best practices included
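
The "rejoin until quorum" loop in Step 7 is the heart of this guide. A rough sketch of the pattern, treated as pseudocode (the join endpoints and payload come from the snippets quoted elsewhere in this review; host addresses, the port, and the jq dependency are examples, and the guide's per-host stop/clear/start steps are abbreviated here):

```bash
# Illustrative only; adapt host lists, ports, and endpoints to your cluster.
TOTAL_SERVER_HOSTS=5
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
SERVER_HOST="http://10.0.0.10:8080"   # restored recovery host API, example value
JOINED=1                              # the recovery host itself

for HOST in "http://10.0.0.11:8080" "http://10.0.0.12:8080"; do   # example hosts
    # (The guide also stops the service and clears state on each host first.)

    # Get a fresh join token from the recovery host
    JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")

    # Ask the lost host to rejoin the cluster
    curl -sS -X POST "${HOST}/v1/cluster/join" \
        -H "Content-Type: application/json" \
        -d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"

    JOINED=$(( JOINED + 1 ))
    if [ "${JOINED}" -ge "${QUORUM_THRESHOLD}" ]; then
        echo "✅ Quorum threshold (${QUORUM_THRESHOLD}) reached; remaining hosts add redundancy"
        break
    fi
done
```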

Assessment

Strengths:

  • Comprehensive and detailed (340 lines)
  • Excellent quorum calculation explanation
  • Clear decision points (when to stop rejoining server-mode hosts)
  • Good separation between server-mode and client-mode recovery
  • Strong troubleshooting section
  • Emphasizes working one host at a time

⚠️ Observations:

  • Uses snapshot restore (different from TOTAL_QUORUM_LOSS which uses backup directory)
  • Step 3 mentions optional --bump-revision and --mark-compacted flags but doesn't show them in the command
  • Could clarify the relationship with TOTAL_QUORUM_LOSS_RECOVERY (this is the continuation)

Recommendation

Approve with minor suggestions:

  • Consider adding --bump-revision and --mark-compacted to the restore command if recommended for production
  • Add a note at the beginning clarifying this can be used standalone (if one host remains) or as a continuation of TOTAL_QUORUM_LOSS_RECOVERY
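
For reference, if those flags are adopted the restore command would look roughly like this; both flags are real etcdutl snapshot restore options, but the bump value shown is only an illustrative placeholder:

```bash
# Illustrative only; choose a bump value larger than any revision clients may have observed.
etcdutl snapshot restore "${SNAPSHOT_PATH}" \
    --data-dir "${PGEDGE_DATA_DIR}/etcd" \
    --bump-revision 1000000000 \
    --mark-compacted
```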

Document 3: PARTIAL_QUORUM_RECOVERY.md

Purpose

Recovery guide for when quorum remains intact but one or more hosts are lost. This is the simplest recovery scenario since the etcd cluster can still accept writes and elect a leader.

What's Included

Two Main Scenarios:

Scenario 1: Client-Mode Host Recovery

  • When: One or more client-mode hosts are lost, quorum intact
  • Impact: Lost hosts unreachable, database instances offline, no quorum impact
  • Recovery Steps (5 steps):
    1. Stop Control Plane service
    2. Remove generated.config.json
    3. Start Control Plane service
    4. Request join token
    5. Join host to cluster
  • Repeatable: Can be done for each lost client-mode host independently

Scenario 2: Server-Mode Host Recovery (Quorum Intact)

  • When: One server-mode host lost, but enough others remain for quorum
  • Impact: Lost host unreachable, database instances offline, quorum remains intact
  • Recovery Steps (8 steps):
    1. Remove lost host record
    2. Stop host service
    3. Clear server-mode state (etcd, certificates, config)
    4. Start host service (or redeploy stack)
    5. Request join token
    6. Join lost host
    7. Verify host health
    8. Restore database instances (with ZODAN explanation)
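
Both scenarios assume quorum is still intact, so a quick pre-check helps confirm this guide is the right one. A hedged sketch (the /v1/hosts endpoint and the status/etcd_mode fields appear in the guides as quoted in this review, but the exact response shape is an assumption):

```bash
# Illustrative only; verify the actual /v1/hosts response shape before relying on this.
SERVER_HOST="http://10.0.0.10:8080"   # any reachable server-mode host, example value
TOTAL_SERVER_HOSTS=3                  # example value
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))

ONLINE_SERVERS=$(curl -sS "${SERVER_HOST}/v1/hosts" \
    | jq '[.[] | select(.etcd_mode == "server" and .status == "reachable")] | length')

if [ "${ONLINE_SERVERS}" -ge "${QUORUM_THRESHOLD}" ]; then
    echo "✅ Quorum intact (${ONLINE_SERVERS}/${TOTAL_SERVER_HOSTS} server-mode hosts reachable); use this guide"
else
    echo "⚠️ Quorum lost; use MAJORITY_QUORUM_LOSS_RECOVERY.md instead"
fi
```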

Key Features:

  • Two distinct scenarios with different procedures
  • Clear quorum calculation examples
  • Database instance restoration with ZODAN (Zero Downtime Add Node) explanation
  • Recovery order guidance for multiple hosts
  • Comprehensive verification checklist
  • Troubleshooting section

Assessment

Strengths:

  • Well-organized with clear scenario separation
  • Excellent explanation of ZODAN for database restoration
  • Good quorum calculation examples
  • Clear recovery order guidance
  • Comprehensive verification checklist
  • Practical troubleshooting

Notable Features:

  • Explains that Control Plane API remains accessible (key advantage)
  • Documents ZODAN automatic behavior
  • Shows how to specify source_node for database restoration
  • Clear distinction between client-mode and server-mode recovery

Recommendation

Approve - Excellent documentation that covers the simpler recovery scenarios comprehensively. Well-structured and practical.


Overall Assessment

Documentation Quality

Strengths:

  1. Clear Progression: The three documents form a logical progression from worst-case (total loss) → majority loss → partial loss
  2. Comprehensive Coverage: All major disaster scenarios are covered
  3. Practical Commands: All steps include executable commands with proper variable usage
  4. Safety First: Backup steps included before destructive operations
  5. Verification: Each guide includes verification steps
  6. Troubleshooting: Common issues and solutions documented
  7. Best Practices: Guidance on recovery order and procedures

Consistency Observations

⚠️ Inconsistency:

  • TOTAL_QUORUM_LOSS_RECOVERY.md.md uses the backup directory approach
  • MAJORITY_QUORUM_LOSS_RECOVERY.md uses the snapshot restore approach

Question: Is this intentional? Should both use the same approach, or are these different valid recovery methods?

Suggestions for Improvement

  1. Cross-References: Add clearer navigation between documents

    • TOTAL → MAJORITY (already done ✅)
    • Consider adding links from PARTIAL to MAJORITY if quorum is lost during recovery
  2. Consistency Check: Clarify when to use backup directory vs snapshot restore

    • Document the difference or standardize on one approach
  3. Prerequisites Section:

    • Consider adding a "Before You Begin" section explaining:
      • How to identify which guide to use
      • How to verify current cluster state
      • How to take snapshots/backups proactively
  4. Example Values:

    • Consider adding example values for variables (e.g., PGEDGE_DATA_DIR="/data/control-plane")
  5. Error Handling:

    • Add more explicit error handling in commands (e.g., || exit 1 for critical steps)
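
To make suggestions 4 and 5 concrete, a hedged sketch of what an "example values plus fail-fast" preamble could look like in the guides (all values are placeholders, not recommendations):

```bash
# Example values only; substitute your own environment's paths, ports, and addresses.
set -euo pipefail                      # fail fast on any error or unset variable

PGEDGE_DATA_DIR="/data/control-plane"
API_PORT=8080
RECOVERY_HOST_IP="10.0.0.10"

# Explicit error handling on a critical step
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" > /dev/null \
    || { echo "ERROR: Control Plane API is not reachable on the recovery host"; exit 1; }
```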

Minor Issues

  1. File Naming: TOTAL_QUORUM_LOSS_RECOVERY.md.md has double extension - consider renaming to TOTAL_QUORUM_LOSS_RECOVERY.md

  2. Step Numbering: In MAJORITY_QUORUM_LOSS_RECOVERY, Step 7 has sub-steps (7a-7f) which is good, but Step 8 references "Steps 7a-7e" - could be clearer

  3. Variable Consistency: Some documents use ${VAR} and others use $VAR - consider standardizing


Summary

What's Included

This PR adds three comprehensive disaster recovery guides:

  1. Total Quorum Loss Recovery - Restores one server-mode host from backup when all are lost
  2. Majority Quorum Loss Recovery - Restores quorum by rejoining hosts after snapshot restore
  3. Partial Quorum Recovery - Recovers lost hosts when quorum remains intact

Key Highlights

  • Complete Coverage: All disaster scenarios from total loss to partial loss
  • Step-by-Step: Detailed, executable commands for each scenario
  • Safety First: Backup procedures before destructive operations
  • Verification: Comprehensive checks at each stage
  • Troubleshooting: Common issues and solutions documented
  • Best Practices: Recovery order and procedural guidance

Recommendation

Approve - This is excellent documentation that fills a critical gap in operational procedures. The guides are well-structured, comprehensive, and practical.

Suggested Follow-ups:

  1. Clarify backup directory vs snapshot restore approach
  2. Add cross-references and navigation improvements
  3. Consider renaming TOTAL_QUORUM_LOSS_RECOVERY.md.md to remove double extension
  4. Add a "Quick Reference" or "Decision Tree" to help operators choose the right guide
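
For example, the decision tree could be as simple as (illustrative wording only):

  • All server-mode hosts lost, Control Plane API completely down → TOTAL_QUORUM_LOSS_RECOVERY.md
  • Majority of server-mode hosts lost, at least one still online → MAJORITY_QUORUM_LOSS_RECOVERY.md
  • Hosts lost but quorum (floor(N/2)+1 server-mode hosts) still intact → PARTIAL_QUORUM_RECOVERY.md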

Plat-314

@coderabbitai

coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

📝 Walkthrough

Adds four new documentation guides describing quorum rules and structured recovery procedures for a pgEdge Control Plane with embedded etcd, covering partial, majority, and total quorum loss scenarios, step-by-step recovery workflows, commands, verification steps, and operator checklists.

Changes

Cohort / File(s) and Summary:

Quorum reference (docs/Quorum Recovery.md)
New guide defining server-mode vs client-mode hosts, quorum calculation (floor(N/2)+1), failure scenarios matrix, and targeted recovery actions with examples and verification steps.

Disaster recovery workflows (docs/Disaster recovery/PARTIAL_QUORUM_RECOVERY.md, docs/Disaster recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md, docs/Disaster recovery/TOTAL_QUORUM_LOSS_RECOVERY.md.md)
Three new recovery guides: partial-quorum procedures when quorum remains; majority-quorum-loss workflow restoring server-mode hosts from snapshots; total-quorum-loss workflow for full server-mode outage including recovery-host steps, etcd snapshot/restore, and transition to majority recovery. Each includes prerequisites, commands, verification, and operator checklist.

Poem

🐇 I nibble logs beneath the moon,

When votes are sparse and hearts are strewn,
One hop, one host, we mend the ring,
Snapshots hum and joiners sing,
A rabbit bounces—quorum's boon.

🚥 Pre-merge checks: ✅ 3 passed

  • Title check: ✅ Passed. The title 'Create Quorum Recovery.md' clearly and specifically describes the main change: addition of a new Quorum Recovery documentation file.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check: ✅ Passed. The PR description is comprehensive and well-structured, providing detailed reviews of three disaster recovery guides with clear assessment and recommendations.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `docs/Quorum Recovery.md`:
- Line 634: Replace the incorrect step label "Step 11d" with "Step 13d" in the
rejoining client-mode hosts section (the line currently showing "###### Step
11d: Get join token from the recovery host"); update any matching subheading
text to read "Step 13d" so it correctly falls under the Step 13 sequence for
rejoining client-mode hosts.
🧹 Nitpick comments (1)
docs/Quorum Recovery.md (1)

118-120: Optional: Consider minor rephrasing for variety (not blocking).

Static analysis flags repetitive sentence beginnings at lines 118-120 ("You will...") and lines 314-316 ("Rejoin..."). While these are stylistically flagged, they're actually acceptable in technical documentation contexts—the first is in an assumptions list, and the second is in a procedural step list where parallel structure aids clarity.

Also applies to: 314-316

sleep 10
```

###### Step 11d: Get join token from the recovery host

⚠️ Potential issue | 🟡 Minor

Fix incorrect step numbering.

Line 634 shows "Step 11d" but should be "Step 13d" (this is within the Step 13 section for rejoining client-mode hosts). This appears to be a copy-paste error.

📝 Proposed fix
-###### Step 11d: Get join token from the recovery host
+###### Step 13d: Get join token from the recovery host

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md`:
- Line 1: The directory name contains a trailing space ("Disaster recovery ")
which breaks tooling; rename the directory and update the file
MAJORITY_QUORUM_LOSS_RECOVERY.md to live under a normalized directory name
(e.g., "Disaster_Recovery" or "disaster-recovery") with no spaces, and ensure
any references or links to MAJORITY_QUORUM_LOSS_RECOVERY.md are updated
accordingly.
- Around line 121-132: After extracting RESP and TASK_ID, validate that TASK_ID
is non-empty and abort with a clear error message if empty (refer to RESP and
TASK_ID) to avoid polling an invalid endpoint; in the polling loop inspect
STATUS (derived from the tasks endpoint) and if it equals "failed" print/log the
task details from RESP or the task query, exit with a non-zero code, and avoid
continuing recovery as if deletion succeeded; ensure both error paths provide
actionable messages including task id and status so operators can debug.
- Around line 182-185: The etcd credentials are being passed on the command line
(ETCD_USER/ETCD_PASS in the etcdctl invocation) which exposes them; instead
export them as environment variables and call etcdctl without the
--user/--password flags so the secret never appears in the process
list—set/export ETCDCTL_USERNAME (or an agreed env var) to "${ETCD_USER}" and
ETCDCTL_PASSWORD to "${ETCD_PASS}" before running the etcdctl endpoint status
command that populates SERVER_COUNT (and apply the same change to the similar
etcdctl call in Step 11).
- Line 101: The docs claim Control Plane auto-detects restored snapshots and
handles the --force-new-cluster flag, but the code paths IsInitialized(),
start(), and embedConfig() lack any snapshot detection or flag handling; either
remove that note from the docs or implement detection: update
IsInitialized()/start() to inspect snapshot metadata vs WAL presence (detect a
restored snapshot with missing WAL), and modify embedConfig() (or the etcd start
logic) to set the equivalent ForceNewCluster behavior or inject the
--force-new-cluster setting when a restored snapshot is detected so etcd starts
correctly after snapshot restore.
🧹 Nitpick comments (4)
docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md (4)

6-8: Clarify the emoji indicators.

Lines 6-8 use ❌ (negative indicator) but describe capabilities that "may be accessible" or "cannot proceed", which creates inconsistency. The ❌ typically means "not available" or "blocked", but line 6 says the API "may be accessible", suggesting partial availability rather than complete failure.

Consider using ⚠️ for partial/uncertain states or rewriting to clearly indicate whether these are always blocked or conditionally available.


91-91: Provide guidance for the optional production flags.

The optional flags --bump-revision and --mark-compacted are mentioned as "recommended for production" but lack guidance on what value to use for <revision-bump-value>. Operators following this guide won't know what value is appropriate.

Consider adding a note explaining how to determine the revision bump value (e.g., "use a large value like 1000000 to ensure new revisions are higher than client watch revisions").


321-321: Use safer process termination sequence.

The command sudo pkill -9 etcd forcefully kills all etcd processes with SIGKILL, preventing graceful shutdown and potentially leaving corrupted state. It may also affect unrelated etcd installations on the same system.

⚡ Proposed safer sequence
-Kill any remaining etcd processes: `sudo pkill -9 etcd`
+Check for and gracefully terminate any remaining etcd processes:
+```bash
+# Check if etcd processes are running
+pgrep -a etcd
+
+# Try graceful termination first
+sudo pkill -TERM etcd
+sleep 5
+
+# Force kill only if still running
+if pgrep etcd > /dev/null; then
+    echo "⚠️ Etcd processes still running, forcing termination..."
+    sudo pkill -9 etcd
+fi
+```

39-55: Add snapshot verification to prerequisites.

The guide assumes the snapshot exists and is valid but doesn't include verification steps. A corrupted or stale snapshot will cause recovery to fail, potentially after significant progress through the procedure.

📋 Proposed snapshot verification steps

Add after line 54 (before "## Recovery Steps"):

### Verify Snapshot

Before starting recovery, verify the snapshot is valid and recent:

```bash
# Check snapshot exists
ls -lh "${SNAPSHOT_PATH}"

# Verify snapshot integrity
etcdutl snapshot status "${SNAPSHOT_PATH}" -w table

# Check snapshot age (optional)
SNAPSHOT_DATE=$(stat -c %y "${SNAPSHOT_PATH}" 2>/dev/null || stat -f %Sm "${SNAPSHOT_PATH}")
echo "Snapshot created: ${SNAPSHOT_DATE}"

Decision point: Confirm the snapshot age is acceptable for your data loss tolerance before proceeding.


@@ -0,0 +1,339 @@
# Majority Quorum Loss Recovery Guide

⚠️ Potential issue | 🟡 Minor

Fix directory name with trailing space.

The file path docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md contains a directory name with a trailing space ("Disaster recovery "). This can cause issues with shell scripts, tab completion, and automated tooling.

Rename to: docs/Disaster_Recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md or docs/disaster-recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md (following common naming conventions with underscores or hyphens, no spaces).


sleep 10
```

**Note:** Control Plane automatically detects the restored snapshot and handles `--force-new-cluster` internally. You don't need to run etcd manually.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n -C3 "force-new-cluster|forceNewCluster" --type=go

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Also search for snapshot detection and recovery logic
rg -n -C3 "snapshot.*detect|detect.*snapshot|MAJORITY_QUORUM" --type=go

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Search for etcd initialization and cluster startup patterns
rg -n -C3 "cluster.*init|initial.*cluster|new-cluster" --type=go

Repository: pgEdge/control-plane

Length of output: 50377


🏁 Script executed:

# Search for etcd initialization and where command flags are built
rg -n "start.*etcd|StartEtcd|cmd.*etcd|--force" --type=go -A5 | head -200

Repository: pgEdge/control-plane

Length of output: 12065


🏁 Script executed:

# Look for recovery or snapshot detection logic
rg -n -i "snapshot|recovery|restore" --type=go server/internal/etcd | head -100

Repository: pgEdge/control-plane

Length of output: 46


🏁 Script executed:

# Search for where etcd is started in the actual code
find . -name "*.go" -exec rg -l "etcd.*start|start.*etcd" {} \; | head -20

Repository: pgEdge/control-plane

Length of output: 160


🏁 Script executed:

cat -n server/internal/etcd/embedded.go | head -150

Repository: pgEdge/control-plane

Length of output: 5332


🏁 Script executed:

# Look for the embedConfig function that builds etcd config
sed -n '200,400p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 5422


🏁 Script executed:

# Search for ForceNewCluster or InitialCluster configuration
rg -n "ForceNewCluster|InitialCluster|initial_cluster" --type=go

Repository: pgEdge/control-plane

Length of output: 401


🏁 Script executed:

# Find initializationConfig and embedConfig functions
rg -n "func initializationConfig|func embedConfig" --type=go -A 20

Repository: pgEdge/control-plane

Length of output: 3442


🏁 Script executed:

# Search for snapshot, restore, or WAL related operations
rg -n -i "snapshot|restore.*etcd|wal.*restore" --type=go | head -50

Repository: pgEdge/control-plane

Length of output: 1879


🏁 Script executed:

# Check the disaster recovery documentation itself to see what it claims
cat docs/Disaster\ recovery/MAJORITY_QUORUM_LOSS_RECOVERY.md | head -120

Repository: pgEdge/control-plane

Length of output: 153


🏁 Script executed:

# Find the disaster recovery documentation
find . -name "*MAJORITY_QUORUM*" -o -name "*Disaster*recovery*" 2>/dev/null

Repository: pgEdge/control-plane

Length of output: 148


🏁 Script executed:

# List files in docs directory to understand structure
ls -la docs/ 2>/dev/null || ls -la "docs/" 2>/dev/null || find docs -maxdepth 2 -type f -name "*.md" 2>/dev/null | head -20

Repository: pgEdge/control-plane

Length of output: 1228


🏁 Script executed:

# Search for content matching the review comment
rg -n "automatically detects|force-new-cluster" --type=md

Repository: pgEdge/control-plane

Length of output: 1025


🏁 Script executed:

cat "docs/Disaster recovery /MAJORITY_QUORUM_LOSS_RECOVERY.md" | sed -n '90,110p'

Repository: pgEdge/control-plane

Length of output: 715


🏁 Script executed:

# Check the start() function to see if there's any snapshot or force-new-cluster handling
sed -n '150,250p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2943


🏁 Script executed:

# Check the embedConfig function more thoroughly
sed -n '626,695p' server/internal/etcd/embedded.go

Repository: pgEdge/control-plane

Length of output: 2937


The automatic force-new-cluster detection claim is not implemented in the Control Plane codebase.

The documentation states that Control Plane "automatically detects the restored snapshot and handles --force-new-cluster internally," but the code does not support this claim. The IsInitialized() method only checks for the existence of the WAL directory to determine if etcd has been started before. The start() and embedConfig() functions create a standard etcd configuration without any snapshot detection logic or --force-new-cluster flag handling. When etcd starts on a restored snapshot, it will not automatically apply --force-new-cluster unless the WAL directory is missing—meaning the recovery procedure as documented will not work as described.

Either remove this note from the documentation, or implement the snapshot detection and automatic flag handling in the Control Plane code.


Comment on lines 121 to 132
RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')

# Wait for completion
while true; do
STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
break
fi
sleep 5
done
```

🛠️ Refactor suggestion | 🟠 Major

Add error handling for task failures and missing task IDs.

The script has two error handling gaps:

  1. Empty TASK_ID: If all jq paths fail to extract a task ID (line 122), TASK_ID will be empty, causing invalid API calls in the polling loop.
  2. Failed tasks: The loop exits on "failed" status (line 127) but doesn't handle the failure - recovery continues as if deletion succeeded.
🛡️ Proposed fix to add validation and failure handling
 RESP=$(curl -sS -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
 
+# Validate task ID exists
+if [ -z "${TASK_ID}" ]; then
+    echo "❌ Failed to get task ID from response: ${RESP}"
+    exit 1
+fi
+
 # Wait for completion
 while true; do
     STATUS=$(curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/tasks/${TASK_ID}" | jq -r '.task.status // .status // empty')
     if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
         break
     fi
     sleep 5
 done
+
+# Check if task failed
+if [ "${STATUS}" = "failed" ]; then
+    echo "❌ Host removal task failed for ${LOST_HOST_ID}"
+    exit 1
+fi

Comment on lines 182 to 185
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"

⚠️ Potential issue | 🟠 Major

Security risk: Credentials exposed in process list.

Lines 182-184 pass etcd credentials via command-line arguments (--user and --password), which are visible in the process list (ps aux) and shell history. This exposes sensitive credentials to any user on the system.

🔒 Proposed fix using environment variables
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+export ETCDCTL_USER="${ETCD_USER}"
+export ETCDCTL_PASSWORD="${ETCD_PASS}"
+SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status \
+    --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+    --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+    --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+    --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+    -w json | jq 'length')
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️  Continue rejoining server-mode hosts"

Note: Apply the same pattern to lines 260-270 in Step 11.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md`:
- Around line 140-152: After capturing RESP and deriving TASK_ID, validate that
TASK_ID is non-empty before entering the wait loop: check RESP for an error or
missing task fields (the curl response stored in RESP and extraction into
TASK_ID) and if TASK_ID is empty, log or echo the RESP and a clear error message
and exit nonzero (or skip waiting) to avoid hanging; otherwise proceed to the
existing wait loop using TASK_ID. Ensure the validation references the RESP
variable and the TASK_ID extraction logic so you handle unexpected DELETE
failure formats consistently.
- Around line 59-100: The recovery steps lack error handling and proper quoting:
update the rm command to quote ${PGEDGE_DATA_DIR} (target: generated.config.json
removal), wrap SERVER_HOST and CLIENT_HOST expansions in quotes when calling
curl, capture the join token into JOIN_TOKEN and validate it is non-empty/not
"null" before proceeding, replace the blind sleep after docker service scale
control-plane_<lost-client-host-id>=1 with a readiness loop that polls the
cluster health endpoint (e.g., ${CLIENT_HOST}/v1/health) and only continues when
healthy, and ensure any failing curl or jq commands cause the script to exit
with an error message so failures are not silent.

In `docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md`:
- Line 1: Rename the file named TOTAL_QUORUM_LOSS_RECOVERY.md.md to remove the
duplicate extension (TOTAL_QUORUM_LOSS_RECOVERY.md) and update any internal
references or links that point to TOTAL_QUORUM_LOSS_RECOVERY.md.md so they
reference TOTAL_QUORUM_LOSS_RECOVERY.md instead.
- Around line 199-206: The one-liner exposing ETCD_PASS on the command line and
hard to read should be refactored: read ETCD_USER/ETCD_PASS from
generated.config.json (already present), export a combined ETCDCTL_USER or use
an env var for credentials and set ETCDCTL_API=3, call etcdctl endpoint status
across multiple lines (using --endpoints, --cacert, --cert, --key) without
passing --user/--password, capture the JSON length into SERVER_COUNT, then
immediately unset the credential env var; update the check that compares
SERVER_COUNT to QUORUM_THRESHOLD (the existing SERVER_COUNT, ETCD_USER,
ETCD_PASS, ETCDCTL_USER and ETCDCTL_API symbols should be used to locate and
change the code).
- Around line 280-290: The etcdctl invocation is exposing ETCD_PASS on the
command line; instead export the secret into an environment variable (e.g.,
export ETCDCTL_PASSWORD="${ETCD_PASS}") and remove the --password "${ETCD_PASS}"
argument from the etcdctl endpoint status call (keep ETCDCTL_API, --endpoints,
--cacert, --cert, --key, and --user as-is), then immediately unset
ETCDCTL_PASSWORD after the command to avoid leaving credentials in the process
environment.
🧹 Nitpick comments (3)
docs/Disaster recovery /PARTIAL_QUORUM_RECOVERY.md (1)

322-322: Hyphenate compound modifier for clarity.

The phrase "zero downtime add node" functions as a compound modifier and should be hyphenated as "zero-downtime-add-node" for grammatical correctness.

📝 Proposed fix
-**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero downtime add node.
+**Solution:** Update the database spec (Step 8) to include the recovered host in the node's `host_ids` array. Control Plane will automatically use zero-downtime-add-node.
docs/Disaster recovery /TOTAL_QUORUM_LOSS_RECOVERY.md.md (2)

86-90: Clarify how to verify certificate identity match.

The instruction "Ensure the host ID matches the certificate identity" is vague and lacks actionable verification steps. During a critical outage, operators need explicit commands to validate this requirement.

📋 Suggested addition for verification steps

Add a verification subsection:

**Verify certificate identity:**

```bash
# Extract Common Name from certificate
CERT_CN=$(openssl x509 -in "${PGEDGE_DATA_DIR}/certificates/etcd-server.crt" -noout -subject | grep -oP 'CN\s*=\s*\K[^,]+')

# Compare with host ID
if [ "${CERT_CN}" != "${RECOVERY_HOST_ID}" ]; then
    echo "ERROR: Certificate CN (${CERT_CN}) does not match host ID (${RECOVERY_HOST_ID})"
    exit 1
fi
echo "✅ Certificate identity matches host ID"

</details>

---

`341-344`: **Recommend graceful shutdown before SIGKILL.**

Using `pkill -9` (SIGKILL) immediately prevents etcd from cleanly shutting down. In a troubleshooting scenario, attempting a graceful shutdown first reduces risk of data corruption.



🔧 Proposed improvement

```diff
 **Fix:** 
 - Stop all Control Plane services (Step 1)
-- Kill any remaining etcd processes: `sudo pkill -9 etcd`
+- Try graceful shutdown first: `sudo pkill etcd` (wait 10 seconds)
+- If processes remain, force kill: `sudo pkill -9 etcd`
 - Verify certificate files match the snapshot identity
```

Comment on lines 59 to 100
```bash
PGEDGE_DATA_DIR="<path-to-data-dir>"
API_PORT=<api-port>
SERVER_HOST="http://<server-mode-host-ip>:${API_PORT}"
CLIENT_HOST="http://<client-mode-host-ip>:${API_PORT}"
```

#### Step 1: Stop the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
```

#### Step 2: Remove generated.config.json

```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
```

#### Step 3: Start the Control Plane Service

```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=1
sleep 10
```

#### Step 4: Request a Join Token

```bash
JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster

```bash
curl -sS -X POST ${CLIENT_HOST}/v1/cluster/join \
-H "Content-Type: application/json" \
-d "{\"token\":\"${JOIN_TOKEN}\",\"server_url\":\"${SERVER_HOST}\"}"
```

⚠️ Potential issue | 🟠 Major

Add error handling and quote variable expansions.

The recovery commands lack error handling, which could cause silent failures during a critical outage. Additionally, unquoted path variables could break if paths contain spaces.

Issues:

  1. Line 91: curl could fail but the script continues
  2. Line 77: Unquoted ${PGEDGE_DATA_DIR} in rm command
  3. Lines 85-86: sleep 10 doesn't verify service readiness
🛡️ Proposed improvements for safety and reliability

Add error handling for API calls:

 #### Step 4: Request a Join Token
 
 ```bash
-JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
+JOIN_TOKEN=$(curl -sS "${SERVER_HOST}/v1/cluster/join-token" | jq -r ".token")
+if [ -z "${JOIN_TOKEN}" ] || [ "${JOIN_TOKEN}" = "null" ]; then
+    echo "ERROR: Failed to obtain join token"
+    exit 1
+fi

**Quote path variables:**

```diff
 #### Step 2: Remove generated.config.json
 
 ```bash
 # On lost client-mode host node
-rm -f ${PGEDGE_DATA_DIR}/generated.config.json
+rm -f "${PGEDGE_DATA_DIR}/generated.config.json"

**Add service readiness check:**

```diff
 #### Step 3: Start the Control Plane Service
 
 ```bash
 # On Swarm manager node
 docker service scale control-plane_<lost-client-host-id>=1
-sleep 10
+
+# Wait for service to be ready
+until curl -sS "${CLIENT_HOST}/v1/health" >/dev/null 2>&1; do
+    sleep 2
+done


Comment on lines 140 to 152
#### Step 1: Remove the Lost Host Record

**Run from a node with API access:**

```bash
RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
echo "${RESP}" | jq '.'
```

#### Step 2: Stop the Host Service

**Run on a Swarm manager node. Skip this step if the service name does not exist.**

⚠️ Potential issue | 🟠 Major

Validate task ID before waiting for completion.

If the DELETE request fails or returns an unexpected format, TASK_ID could be empty, causing the wait loop to fail or behave unexpectedly during a critical recovery operation.

🛡️ Proposed fix to validate task ID
 RESP=$(curl -sS -X DELETE "${SERVER_HOST}/v1/hosts/${LOST_HOST_ID}?force=true")
 TASK_ID=$(echo "${RESP}" | jq -r '.task.task_id // .task.id // .id // empty')
-echo "${RESP}" | jq '.'
+
+if [ -z "${TASK_ID}" ] || [ "${TASK_ID}" = "null" ]; then
+    echo "ERROR: Failed to delete host or obtain task ID"
+    echo "${RESP}" | jq '.'
+    exit 1
+fi
+
+echo "Task ID: ${TASK_ID}"

@@ -0,0 +1,366 @@
# Total Quorum Loss Recovery Guide

⚠️ Potential issue | 🟡 Minor

Fix double file extension in filename.

The file is named TOTAL_QUORUM_LOSS_RECOVERY.md.md with a duplicate .md extension. This should be corrected to TOTAL_QUORUM_LOSS_RECOVERY.md.


Comment on lines 199 to 206
```bash
TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
[ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️ Continue rejoining server-mode hosts"
```

⚠️ Potential issue | 🟠 Major

Refactor long command and avoid password on command line.

Line 204 contains a 300+ character command that is difficult to read and exposes credentials on the command line. During a critical outage, operators need clear, debuggable commands.

Issues:

  1. Security: Password visible in process listings (ps aux)
  2. Readability: Extremely long one-liner is hard to verify
  3. Maintainability: Difficult to modify or troubleshoot
♻️ Proposed refactor for clarity and security
 #### 8f. Check Quorum Status
 ```bash
 TOTAL_SERVER_HOSTS=<total-server-mode-hosts>
 QUORUM_THRESHOLD=$(( (TOTAL_SERVER_HOSTS / 2) + 1 ))
+
+# Read credentials from config
 ETCD_USER=$(jq -r ".etcd_username" "${PGEDGE_DATA_DIR}/generated.config.json")
 ETCD_PASS=$(jq -r ".etcd_password" "${PGEDGE_DATA_DIR}/generated.config.json")
-SERVER_COUNT=$(ETCDCTL_API=3 etcdctl endpoint status --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" --user "${ETCD_USER}" --password "${ETCD_PASS}" -w json | jq 'length')
+
+# Use environment variable to avoid password on command line
+export ETCDCTL_USER="${ETCD_USER}:${ETCD_PASS}"
+export ETCDCTL_API=3
+
+# Check server count
+SERVER_COUNT=$(etcdctl endpoint status \
+    --endpoints "https://${RECOVERY_HOST_IP}:${ETCD_CLIENT_PORT}" \
+    --cacert "${PGEDGE_DATA_DIR}/certificates/ca.crt" \
+    --cert "${PGEDGE_DATA_DIR}/certificates/etcd-user.crt" \
+    --key "${PGEDGE_DATA_DIR}/certificates/etcd-user.key" \
+    -w json | jq 'length')
+
+# Clear credential from environment
+unset ETCDCTL_USER
+
 [ "${SERVER_COUNT}" -ge "${QUORUM_THRESHOLD}" ] && echo "✅ Quorum RESTORED!" || echo "⚠️  Continue rejoining server-mode hosts"


@moizpgedge moizpgedge force-pushed the Docs/PLAT-304/Recovery-approach-when-etcd-quorum-is-lost branch from 356fb56 to 38fbae8 on February 12, 2026 10:29
rm -rf "${PGEDGE_DATA_DIR}/etcd"

# Copy etcd data from backup directory
cp -r "${BACKUP_ETCD_DIR}" "${PGEDGE_DATA_DIR}/etcd"
Member

This will not be sufficient if the cluster previously had more than one node. You'll need to run etcdutl snapshot restore to reset the cluster membership like we did on the call yesterday.

Contributor Author

Thanks, I'll update Step 4 to use etcdutl snapshot restore to reset cluster membership to a single node, which will match the approach we discussed.
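
For context, a hedged sketch of the kind of command that change implies. When starting from a copied etcd data directory rather than a snapshot file, one common approach is to restore from the backup's member/snap/db file, which rewrites membership to a single member (the --skip-hash-check flag is needed because a raw db file carries no snapshot integrity hash; all paths here are assumptions, not the guide's final wording):

```bash
# Illustrative only; align with the updated Step 4 before relying on it.
BACKUP_ETCD_DIR="/backups/etcd-pre-outage"   # example value
PGEDGE_DATA_DIR="/data/control-plane"        # example value

etcdutl snapshot restore "${BACKUP_ETCD_DIR}/member/snap/db" \
    --skip-hash-check \
    --data-dir "${PGEDGE_DATA_DIR}/etcd"
```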

Comment on lines 7 to 8
- etcd data backup directory (from before outage)
- Certificate files and `generated.config.json` from a server-mode host (from before outage)
Member

I think we can combine these first two items and just say "A snapshot or other backup of the Control Plane data volume from before the outage". We're going to have to recommend some type of backups in our installation guide, and volume-level snapshots will be the easiest to recommend and easiest for customers to implement.

Contributor Author

Thanks! I'll update it to combine those into a single volume-level backup requirement. The restore process will still use etcdutl snapshot restore to reset cluster membership for multi-node clusters.

Comment on lines 34 to 37
Copy certificate files and `generated.config.json` from before outage to:
- `${PGEDGE_DATA_DIR}/certificates/`
- `${PGEDGE_DATA_DIR}/generated.config.json`

Member

I think users should always restore the entire data volume, right? Otherwise, they're missing their Postgres instance data. If you've encountered scenarios where that doesn't work properly, you'll need to be specific here and describe when they should or shouldn't restore the entire directory/volume.

Contributor Author

I'll add specific guidance on when to restore the entire volume vs. selective restoration. The document will clarify that full volume restoration is recommended in most cases, with selective restoration only for specific scenarios like recovering only Control Plane state without Postgres data.


```bash
# Verify API is accessible
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/cluster" | jq '.'
Member

Could you please remove jq from these commands? It is a nice suggestion and it's my preference as well, but we can't assume that it's installed on every system.

Contributor Author

sure

- One host should show `status: "reachable"` and `etcd_mode: "server"`
- Server count should be **1**

**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**.
Member

Could you please remove this:

Quorum is NOT YET RESTORED.

This process resets the cluster membership to one server, so you do have quorum. What's more accurate is to say that you need to re-add the other Control Plane servers to the cluster.

# Check databases
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases" | jq '.[] | {id, state}'

# Check etcd cluster
Member

I'm hesitant to recommend that users try to do anything with our internal Etcd other than using etcdutl to restore the instance. The host check above is already sufficient for validating that Etcd is working properly, so could you please remove these etcdctl instructions?


**Status:** You now have **1 server-mode host online**. Quorum is **NOT YET RESTORED**. Continue to restore quorum.

### Step 6: Remove Lost Host Records
Member

This step onward is identical to the partial loss process, right? Do we need to duplicate all of those instructions in multiple places? I'm now wondering if all of this could be combined into a single document.

Contributor Author

I’ll wait until all the documentation is fully accurate and finalized. Then @tsivaprasad and I will collaborate to merge his content and mine into a single, well-structured document.

JOIN_TOKEN=$(curl -sS ${SERVER_HOST}/v1/cluster/join-token | jq -r ".token")
```

#### Step 5: Join the Host to the Cluster
Member

I think I made this same comment on the Google Doc version: this will break unless you remove the host first, because we validate that host IDs are unique (#217). The steps for restoring a client-mode host should be identical to restoring a server-mode host, so I don't think we need two sets of instructions for these.


```bash
# On lost client-mode host node
rm -f ${PGEDGE_DATA_DIR}/generated.config.json
Member

What's the scenario that you're thinking of where the lost host still has data?

Contributor Author

I'll consolidate client-mode and server-mode recovery into a single unified process, and add a step to remove the host record before joining to prevent duplicate host ID errors.


```bash
# On Swarm manager node
docker service scale control-plane_<lost-client-host-id>=0
Member

What's the scenario that you're thinking of where the control plane is still running on the lost host?

Contributor Author

The host is accessible, but Control Plane has lost track of it or it's in a bad state. The Docker service may still be running from before the loss. We stop it, clear state, and restart to rejoin cleanly.
If the host is truly unreachable, you can't SSH to run Step 3, so this recovery process doesn't apply.

- PARTIAL_QUORUM_RECOVERY.md: Recovery when quorum remains intact

- MAJORITY_QUORUM_LOSS_RECOVERY.md: Recovery when majority server-mode hosts lost

- TOTAL_QUORUM_LOSS_RECOVERY.md.md: Recovery when all server-mode hosts lost

Key improvements: unified recovery processes, etcdutl snapshot restore, removed jq/etcdctl deps