Skip to content

fix: RAC checkpoint resume + heap-use-after-free on shutdown#6

Merged
rophy merged 4 commits intomasterfrom
fix/rac-checkpoint-resume
Mar 23, 2026
Merged

fix: RAC checkpoint resume + heap-use-after-free on shutdown#6
rophy merged 4 commits intomasterfrom
fix/rac-checkpoint-resume

Conversation

@rophy
Copy link
Owner

@rophy rophy commented Mar 23, 2026

Summary

  • fix: per-thread archive offset for RAC checkpoint resume — archive single-thread path used global metadata->fileOffset instead of per-thread value, causing ERROR 50051 did not find lwn at offset after kill/restart
  • fix: heap-use-after-free during error shutdownWriterStream destructor didn't clear ctx->parserThread, leaving dangling pointer for Builder/Parser destructors
  • chore: remove LFS checkout from CI — no longer needed after LFS removal from repo

Test plan

  • 2-hour RAC soak test: 40 iterations, 120 kill/restart cycles, zero failures
  • SQL e2e: free-23 (36 pass), xe-21 (36 pass), RAC (54 pass)
  • Single-instance checkpoint/restart: 10 cycles pass

Summary by CodeRabbit

  • Bug Fixes

    • Prevented invalid memory access with an added null-safety check in context cleanup.
    • Eliminated dangling/stale pointer issues during stream destruction to avoid crashes.
    • Improved per-thread context synchronization for more reliable archived log replay.
  • Chores

    • Adjusted CI checkout behavior to stop enabling Git LFS during the test workflow.

rophy added 3 commits March 23, 2026 20:35
LFS is no longer used after removing static fixtures.
WriterStream sets ctx->parserThread = this, but the destructor didn't
clear it. When writers are deleted before builders/parsers in
~OpenLogReplicator(), Builder::~Builder() and Parser::~Parser() call
freeMemoryChunk(ctx->parserThread, ...) on the freed WriterStream.

Fix: null out ctx->parserThread in WriterStream destructor, and guard
freeMemoryChunk() against null thread pointer.

Found by ASAN during RAC checkpoint/restart fault tolerance testing.
Tracked: #5
The archive single-thread processing path used the global
metadata->fileOffset instead of the per-thread value from
threadStates. After RAC checkpoint resume, thread 1's parser
would seek to thread 2's offset, causing ERROR 50051
"did not find lwn at offset".

Add per-thread context switch (fileOffset + sequence) before
parser->parse() and save-back after, matching what the
multi-thread online path already does.

Validated with 2-hour soak test: 43 iterations, 129 kill/restart
cycles, zero failures.
@coderabbitai
Copy link

coderabbitai bot commented Mar 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ff854b5-4cd9-4325-8242-ace9bf4800f0

📥 Commits

Reviewing files that changed from the base of the PR and between 9554210 and ecde394.

📒 Files selected for processing (1)
  • src/common/Ctx.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/common/Ctx.cpp

📝 Walkthrough

Walkthrough

This PR removes Git LFS from a CI checkout, adds null-check guards before per-thread context updates, synchronizes per-thread fileOffset/sequence around archived log parsing, and clears stale parser thread pointers in WriterStream destructor to prevent use-after-free.

Changes

Cohort / File(s) Summary
CI Configuration
​.github/workflows/redo-log-tests.yaml
Removed lfs: true option from actions/checkout@v4 in the test job.
Memory safety / Context guards
src/common/Ctx.cpp
Guard t->contextSet(...) calls with if (likely(t != nullptr)) to avoid dereferencing null/stale thread pointers.
Destructor cleanup
src/writer/WriterStream.cpp
Clear ctx->parserThread when it points to the destroying WriterStream instance to avoid stale pointer use during later free operations.
Per-thread state sync during parsing
src/replicator/Replicator.cpp
In single-thread archived-log path, restore per-thread fileOffset/sequence into shared metadata before parser->parse(), then save updated values back into the thread's saved state after parse, ensuring correct per-thread resume behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Replicator
    participant Metadata
    participant Parser
    participant ThreadState

    Replicator->>Metadata: select bestThread
    Replicator->>ThreadState: load saved fileOffset, sequence for bestThread
    ThreadState-->>Metadata: set metadata.fileOffset, metadata.sequence
    Replicator->>Parser: parser->parse(...)  -- executes with restored metadata
    Parser-->>Metadata: update fileOffset, sequence during parse
    Replicator->>ThreadState: save metadata.fileOffset, metadata.sequence back to thread state
    Replicator->>Metadata: setFirstNextScn(...)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰 I nibbled stale pointers in the night,

Cleared a thread so memory sleeps tight,
Sync'd offsets, hopped through parsing gates,
CI chewed less LFS on its plates,
Hooray — no more bugs to give me frights! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the two main fixes in the changeset: RAC checkpoint resume bug and heap-use-after-free issue on shutdown.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rac-checkpoint-resume

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/common/Ctx.cpp (1)

531-558: ⚠️ Potential issue | 🔴 Critical

Null dereference is still possible in freeMemoryChunk.

Line 531 adds a null check, but Line 554 and Line 558 still dereference t unconditionally. With ctx->parserThread now being set to nullptr during teardown, this path can still crash.

🐛 Minimal fix
-    if (chunk != nullptr) {
-        t->contextSet(Thread::CONTEXT::OS, Thread::REASON::OS);
-        free(chunk);
-    }
-
-    t->contextSet(Thread::CONTEXT::CPU);
+    if (chunk != nullptr) {
+        if (likely(t != nullptr))
+            t->contextSet(Thread::CONTEXT::OS, Thread::REASON::OS);
+        free(chunk);
+    }
+
+    if (likely(t != nullptr))
+        t->contextSet(Thread::CONTEXT::CPU);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/Ctx.cpp` around lines 531 - 558, The code still dereferences the
Thread pointer t unconditionally later in freeMemoryChunk; wrap the two later
uses of t->contextSet (the call before free(chunk) and the final
t->contextSet(Thread::CONTEXT::CPU)) with a null check (e.g., only call
t->contextSet(...) if t != nullptr) to avoid dereferencing ctx->parserThread
after teardown; locate these calls in Ctx.cpp within the freeMemoryChunk path
and guard them the same way the early contextSet is guarded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/common/Ctx.cpp`:
- Around line 531-558: The code still dereferences the Thread pointer t
unconditionally later in freeMemoryChunk; wrap the two later uses of
t->contextSet (the call before free(chunk) and the final
t->contextSet(Thread::CONTEXT::CPU)) with a null check (e.g., only call
t->contextSet(...) if t != nullptr) to avoid dereferencing ctx->parserThread
after teardown; locate these calls in Ctx.cpp within the freeMemoryChunk path
and guard them the same way the early contextSet is guarded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0023f2be-cb1e-42e1-800c-ae0c6fe4713d

📥 Commits

Reviewing files that changed from the base of the PR and between 911ba28 and 9554210.

📒 Files selected for processing (4)
  • .github/workflows/redo-log-tests.yaml
  • src/common/Ctx.cpp
  • src/replicator/Replicator.cpp
  • src/writer/WriterStream.cpp
💤 Files with no reviewable changes (1)
  • .github/workflows/redo-log-tests.yaml

CodeRabbit review caught two more unguarded dereferences of the
Thread pointer after the null check at line 531. Lines 554 and
558 still called t->contextSet() unconditionally, which would
crash when ctx->parserThread is null during teardown.
@rophy
Copy link
Owner Author

rophy commented Mar 23, 2026

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
🤖 Prompt for all review comments with AI agents
ℹ️ Review info

@coderabbitai fixed, can you resolve the comment?

@coderabbitai
Copy link

coderabbitai bot commented Mar 23, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# List review comments made by coderabbitai on PR `#6`
gh api repos/rophy/OpenLogReplicator/pulls/6/comments --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body[0:200], path: .path}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check general PR comments by coderabbitai
gh api repos/rophy/OpenLogReplicator/issues/6/comments --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body[0:300]}'

echo "---REVIEWS---"
# Check review threads
gh api repos/rophy/OpenLogReplicator/pulls/6/reviews --jq '.[] | select(.user.login == "coderabbitai") | {id: .id, body: .body[0:300], state: .state}'

Length of output: 13


🏁 Script executed:

#!/bin/bash
# List all reviews on the PR
gh pr view 6 --repo rophy/OpenLogReplicator --json reviews,comments --jq '.'

echo "---ALL REVIEW COMMENTS---"
gh api repos/rophy/OpenLogReplicator/pulls/6/comments --jq '.[] | {id: .id, user: .user.login, body: .body[0:200], path: .path}'

Length of output: 19364


🏁 Script executed:

#!/bin/bash
# Try to find resolvable review threads via GraphQL
gh api graphql -f query='
{
  repository(owner: "rophy", name: "OpenLogReplicator") {
    pullRequest(number: 6) {
      reviewThreads(first: 20) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              author { login }
              body
            }
          }
        }
      }
    }
  }
}'

Length of output: 70


@rophy: The fix looks great — the latest commit correctly guards the remaining two t->contextSet(...) calls in freeMemoryChunk with if (likely(t != nullptr)), which is exactly what was flagged.

Unfortunately, since the comment was flagged as an "outside diff range" comment, it was embedded in the review body rather than posted as an inline thread — so there is no resolvable thread to dismiss programmatically. The review is effectively addressed by the fix itself. ✅


If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

@rophy rophy merged commit c81e190 into master Mar 23, 2026
2 checks passed
@rophy rophy deleted the fix/rac-checkpoint-resume branch March 23, 2026 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant