fix: add IDB healing mechanism for backing store corruption and connection lost errors #780

Open
leshniak wants to merge 9 commits into Expensify:main from callstack-internal:fix/idb-corruption-detect-and-heal

@leshniak
Contributor

@leshniak leshniak commented Apr 28, 2026

Details

Adds an IDB healing mechanism in createStore.ts — the IDB connection manager for IDBKeyValProvider — addressing two sibling storage error classes:

  1. Chromium backing store corruption ("UnknownError: Internal error opening backing store for indexedDB.open."): 884K errors/month, 26.3% of all storage errors, Chrome/Edge only (investigation, solution design)
  2. Safari connection lost ("UnknownError: Connection to Indexed Database server lost."): 637K errors/month, 19% of all storage errors, 100% WebKit (investigation, solution)

Both errors share the same structural problem: a stale cached dbp is never cleared, so all retries reuse the dead/corrupt connection. This is a Dexie-style heal pattern (PR1398_maxLoop).
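
The stale-cache problem can be illustrated with a minimal sketch. It is not the code from createStore.ts: `createConnectionCache`, `open`, `get`, and `drop` are hypothetical names, and the connection type is generic so the sketch runs without IndexedDB. It assumes the fix amounts to never keeping a rejected open promise, while guarding against a late rejection clearing a newer one.

```typescript
// Hypothetical sketch: cache an "open" promise, but never keep a rejected one.
function createConnectionCache<T>(open: () => Promise<T>) {
    let cached: Promise<T> | undefined;

    function get(): Promise<T> {
        if (cached) {
            return cached;
        }
        const current = open();
        cached = current;
        // Capture the reference: only clear the cache if it still holds this
        // promise, so a late rejection cannot wipe out a newer connection.
        current.catch(() => {
            if (cached === current) {
                cached = undefined;
            }
        });
        return current;
    }

    // Forces the next get() to call open() again.
    function drop(): void {
        cached = undefined;
    }

    return {get, drop};
}
```

Without the `catch` handler, every later `get()` would return the same rejected promise, which is the "all retries reuse the dead connection" behavior described above.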

What it does:

  • isBackingStoreError() — detects Chromium LevelDB corruption ('Internal error opening backing store')
  • isConnectionLostError() — detects Safari/WebKit connection termination ('connection to indexed database server lost', 'connection is closing'). WebKit #197050, #201483
  • Shared healAttemptsRemaining counter (3 attempts, reset on every successful IDB operation)
  • On backing store or connection lost error + budget > 0: decrement counter, drop cached dbp, retry executeTransaction once (forces a fresh indexedDB.open())
  • Guard against stale rejection/probe handlers clearing a newer dbp (capture reference, only clear if unchanged)
  • Clear dbp on rejection in getDB() and verifyStoreExists (fixes pre-existing bug where a cached rejected promise caused infinite failures)
  • Full diagnostic logging via Logger.logAlert/logInfo at every step: error detection, heal attempt, successful recovery, budget exhaustion, and non-recoverable errors
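
The budgeted heal-and-retry flow listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `createHealingRunner`, `run`, `dropConnection`, and `isHealable` are hypothetical names; only the counter name `healAttemptsRemaining` and the budget of 3 come from the description.

```typescript
// Hypothetical sketch of a shared heal budget; names are illustrative only.
const MAX_HEAL_ATTEMPTS = 3;

function createHealingRunner<T>(
    run: () => Promise<T>, // one attempt at the storage operation
    dropConnection: () => void, // forgets the cached connection
    isHealable: (error: unknown) => boolean, // backing store / connection lost check
) {
    let healAttemptsRemaining = MAX_HEAL_ATTEMPTS;

    return function execute(): Promise<T> {
        return run()
            .catch((error: unknown) => {
                // Non-healable errors and an exhausted budget propagate as-is.
                if (!isHealable(error) || healAttemptsRemaining <= 0) {
                    throw error;
                }
                healAttemptsRemaining -= 1;
                dropConnection();
                // Retry once; the next attempt has to reopen the connection.
                return run();
            })
            .then((result) => {
                // Any success (first try or retry) restores the full budget.
                healAttemptsRemaining = MAX_HEAL_ATTEMPTS;
                return result;
            });
    };
}
```

Because the counter is shared across calls, a permanently corrupt store costs at most three extra open attempts before errors are passed straight through.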

What it does NOT do (by design per #90636):

  • No deleteDatabase(): proven to also fail when LevelDB files are corrupt
  • No MemoryOnlyProvider degradation — cache already absorbs all writes during the session
  • No user-visible UI — session serves correctly from cache
  • No changes to storage/index.ts or OnyxUtils.ts — those are separate issues (#90632, #90633)

E/App Test PR

Expensify/App#91085 — pins react-native-onyx to this PR's HEAD commit for integration testing.

Related Issues

Expensify/App#90636
Expensify/App#87862
Expensify/App#87864

Automated Tests

19 tests in tests/unit/storage/providers/createStoreTest.ts:

InvalidStateError retry (5): retry + succeed, retry + propagate, non-InvalidState DOMException skipped, non-DOMException skipped, data integrity after retry

Diagnostic logging (1): alert with all fields on retry

Onclose/onversionchange handlers (4): log + recover for each

Backing store healing (5): mid-session heal, init-time heal, budget exhaustion, budget reset, error classification (wrong message / QuotaExceeded bypass)

Connection lost healing (4): heal + reopen, budget exhaustion, "connection is closing" variant, shared budget with backing store

All tests pass.

Manual Tests

Each test injects a monkey-patch via the browser console to simulate the error class, then triggers an Onyx write (e.g. navigate to a different chat). No code changes needed.

Test 1 — Backing store error (Chromium path):

  1. Open the app in Chrome, log in, navigate to any chat
  2. Paste in DevTools console:
    const _origTx = IDBDatabase.prototype.transaction;
    let _fired = false;
    IDBDatabase.prototype.transaction = function(...args) {
      if (!_fired) {
        _fired = true;
        IDBDatabase.prototype.transaction = _origTx;
        throw new DOMException('Internal error opening backing store', 'UnknownError');
      }
      return _origTx.apply(this, args);
    };
  3. Navigate to a different chat (triggers Onyx write)
  4. Expected log sequence:
    • [Onyx] IDB heal: backing store error detected — dropping cached connection and reopening (2 attempts left)
    • [Onyx] IDB heal: successfully recovered after backing store error
  5. ✅ App recovers — navigation works, no white screen, subsequent operations succeed

Test 2 — Connection lost (Safari path):

  1. Same setup as Test 1
  2. Paste in DevTools console:
    const _origTx = IDBDatabase.prototype.transaction;
    let _fired = false;
    IDBDatabase.prototype.transaction = function(...args) {
      if (!_fired) {
        _fired = true;
        IDBDatabase.prototype.transaction = _origTx;
        throw new DOMException('Connection to Indexed Database server lost', 'UnknownError');
      }
      return _origTx.apply(this, args);
    };
  3. Navigate to a different chat
  4. Expected log sequence:
    • [Onyx] IDB heal: connection lost error detected — dropping cached connection and reopening (2 attempts left)
    • [Onyx] IDB heal: successfully recovered after connection lost error
  5. ✅ App recovers seamlessly

Test 3 — Budget exhaustion (3 consecutive failures, then gives up):

  1. Same setup as Test 1
  2. Paste in DevTools console:
    const _origTx = IDBDatabase.prototype.transaction;
    let _count = 0;
    IDBDatabase.prototype.transaction = function(...args) {
      if (_count < 6) {
        _count++;
        throw new DOMException('Internal error opening backing store', 'UnknownError');
      }
      IDBDatabase.prototype.transaction = _origTx;
      return _origTx.apply(this, args);
    };
  3. Navigate to a different chat
  4. Expected log sequence:
    • [Onyx] IDB heal: backing store error detected — dropping cached connection and reopening (2 attempts left)
    • [Onyx] IDB heal: backing store error detected — dropping cached connection and reopening (1 attempts left)
    • [Onyx] IDB heal: backing store error detected — dropping cached connection and reopening (0 attempts left)
    • [Onyx] IDB heal: backing store error — heal budget exhausted, giving up
  5. ✅ App may show degraded behavior but should not crash or white-screen
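
The three console snippets above share one shape: throw for the first N calls, then restore the original method. A hypothetical helper that captures this shape is sketched below; `failFirstCalls` is not part of the PR, and it is typed loosely with `any` because it is meant for ad hoc testing only.

```typescript
// Hypothetical helper: make target[method] throw for the first `times` calls,
// then put the original implementation back. Returns a manual restore function.
function failFirstCalls(target: any, method: string, times: number, makeError: () => Error): () => void {
    const original = target[method];
    let remaining = times;

    const restore = (): void => {
        target[method] = original;
    };

    target[method] = function patched(this: any, ...args: any[]) {
        if (remaining > 0) {
            remaining -= 1;
            if (remaining === 0) {
                // Last injected failure: later calls go to the original again.
                restore();
            }
            throw makeError();
        }
        return original.apply(this, args);
    };

    return restore;
}
```

In the DevTools console (plain JavaScript, so without the type annotations), Test 1 would correspond to `failFirstCalls(IDBDatabase.prototype, 'transaction', 1, () => new DOMException('Internal error opening backing store', 'UnknownError'))`, and Test 3 to the same call with `6` as the count.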

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • If we are not using the full Onyx data that we loaded, I've added the proper selector in order to ensure the component only re-renders when the data it is using changes
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

This PR modifies only the IDB connection manager (createStore.ts) — a web-only storage internals file with no UI changes. Screenshots/videos of manual test console output will be attached after integration testing via E/App PR #91085.

Android: Native

N/A — IDB is web-only, no native changes.

Android: mWeb Chrome

iOS: Native

N/A — IDB is web-only, no native changes.

iOS: mWeb Safari

MacOS: Chrome / Safari

Healed (simulated error):

healed.mp4

Killed connection (simulated error):

killed.mp4

Exhausted (simulated error):

exhausted.mp4

Healed (simulated on Safari):

safari_error.mp4

Killed connection (simulated on Safari):

safari_killed.mp4

Exhausted (simulated on Safari):

safari_exhausted.mp4

@leshniak leshniak force-pushed the fix/idb-corruption-detect-and-heal branch from bd7a14f to 32a6cc8 Compare May 15, 2026 08:59
@leshniak leshniak changed the title from "fix: detect IDB backing store corruption and heal or degrade gracefully" to "fix: add IDB backing store corruption healing mechanism" May 15, 2026
@leshniak leshniak marked this pull request as ready for review May 15, 2026 13:32
@leshniak leshniak requested a review from a team as a code owner May 15, 2026 13:32
@melvin-bot melvin-bot Bot requested review from Beamanator and removed request for a team May 15, 2026 13:33
* https://github.com/Expensify/App/issues/87862
*/
function isBackingStoreError(error: unknown): boolean {
return error instanceof DOMException && error.name === 'UnknownError' && error.message.includes('Internal error opening backing store');

Suggested change
- return error instanceof DOMException && error.name === 'UnknownError' && error.message.includes('Internal error opening backing store');
+ return error instanceof Error && error.message.includes('Internal error opening backing store');

I think we can simplify to this
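
To show what the simplification changes, here is a small standalone sketch of the suggested predicate with three inputs. The constant name and the sample values are illustrative assumptions; plain `Error` objects are used so the sketch runs outside a browser. In browsers `DOMException` inherits from `Error`, so real IDB errors should still pass the `instanceof Error` check.

```typescript
const BACKING_STORE_MESSAGE = 'Internal error opening backing store';

// Simplified check from the suggestion: match on the message only.
function isBackingStoreError(error: unknown): boolean {
    return error instanceof Error && error.message.includes(BACKING_STORE_MESSAGE);
}

// A plain Error carrying the message now matches, whatever its name is.
const matchesPlainError = isBackingStoreError(new Error('Internal error opening backing store for indexedDB.open.'));

// A non-Error value with the same message is still rejected.
const matchesLookalike = isBackingStoreError({message: BACKING_STORE_MESSAGE});

// An Error with an unrelated message is rejected.
const matchesOther = isBackingStoreError(new Error('The quota has been exceeded.'));
```

The trade-off is that the looser check no longer requires `name === 'UnknownError'`, so classification rests entirely on the message substring.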

* Detects the Chromium-specific IDB backing store corruption error.
* Fires when LevelDB files backing IndexedDB are corrupted and Chrome's
* internal recovery (RepairDB -> delete -> recreate) also fails.
* https://github.com/Expensify/App/issues/87862

Suggested change
- * https://github.com/Expensify/App/issues/87862

I don't think it's necessary to link the issue here

executeTransaction(txMode, callback)
.then(resetHealBudget)
.catch((error) => {
if (error instanceof DOMException && error.name === 'InvalidStateError') {

Suggested change
- if (error instanceof DOMException && error.name === 'InvalidStateError') {
+ if (error instanceof Error && error.name === 'InvalidStateError') {

Same

@fabioh8010
Contributor

@leshniak

  1. Please provide an E/App PR in the PR description where we can test this change. You can link to your onyx PR by using this hash trick in package.json (replace <last_pr_commit_sha> with the SHA): "react-native-onyx": "git+https://github.com/Expensify/react-native-onyx.git#<last_pr_commit_sha>",
  2. Please attach recordings in all web platforms sections as evidence
  3. Would be nice if you could design some test steps to simulate this issue when testing manually

@leshniak leshniak changed the title from "fix: add IDB backing store corruption healing mechanism" to "fix: add IDB healing mechanism for backing store corruption and connection lost errors" May 19, 2026
leshniak and others added 5 commits May 19, 2026 16:05
Adds a Dexie-style heal pattern to createStore for Chromium's
Internal error opening backing store error (884K errors/month).

- isBackingStoreError() detects the Chromium-specific corruption
- Shared healAttemptsRemaining counter (3, reset on success)
- On backing store error: clear cached connection, retry once
- Clear dbp on rejection so retries get fresh indexedDB.open()
- 5 new tests: mid-session heal, init heal, budget exhaustion,
  budget reset, error classification

No deleteDatabase(), no provider swap, no UI changes.
Scoped to IDBKeyValProvider only -- SQLite provider untouched.

Ref: Expensify/App#90636

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Capture dbp reference before attaching reject handler; only clear if
  dbp hasn't been replaced by a concurrent heal/retry (prevents stale
  rejection handler from clearing a newer promise)
- Add comment documenting concurrent store() budget drain behavior
- Fix test formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ose)

The heal path clears the cached dbp and reopens via indexedDB.open(),
but does not call db.close() on the old IDBDatabase. Updated comments
and log messages from 'close + reopen' to 'drop cached connection and
reopen' to match what the code actually does.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- isBackingStoreError: use Error instead of DOMException, drop .name check
- InvalidStateError catch: same simplification
- Remove issue link from JSDoc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add isConnectionLostError() to detect 'Connection to Indexed Database
server lost' and 'Connection is closing' — Safari/WebKit errors that
fire when the browser terminates IDB connections for backgrounded tabs.

Uses the same heal-and-retry mechanism as backing store corruption:
drop cached dbp, retry once with fresh indexedDB.open(), shared budget.

Addresses Expensify/App#87864.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leshniak
Contributor Author

leshniak commented May 19, 2026

  1. E/App PR: Expensify/App#91085 ("fix: IDB backing store corruption healing (onyx PR #780 test)") pins react-native-onyx to this PR's HEAD (beb75d5f) via the hash trick. It also removes the stale react-native-onyx+3.0.71.patch (those changes already landed upstream in 3.0.72 via PR #785, "Revert PR #770 and fix useOnyx"). The branch was rebased from 3.0.69 to 3.0.75 so the versions align.

  2. Recordings — will attach shortly after running the manual tests.

  3. Manual test steps — added to both PRs. Three scenarios:

    • Transient corruption (Chrome DevTools → Clear site data mid-session) — triggers heal, app recovers
    • Safari connection lost (background tab 30s+ → foreground) — visibilitychange probe detects dead connection, drops cache before writes hit it
    • Permanent corruption (corrupt LevelDB files on disk) — heal fires 3x (budget drains), then stops — app continues from cache

@fabioh8010
Contributor

@leshniak let's handle the proactive detection feature in a separate PR, please, as this one is growing quite a bit now.

Extract module-level helpers: isInvalidStateError, isBudgetedHealError,
getBudgetedHealErrorLabel. Extract cacheOpenPromise to deduplicate
rejected-promise cleanup in getDB and verifyStoreExists.

Pure refactor — no behavior change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Heal attempts: logAlert with error type, action taken, remaining budget
- Heal success: logInfo confirming recovery after each error type
- Budget exhaustion: explicit logAlert when heal budget drains
- Non-recoverable errors: logAlert with error details
- Updated test assertions to match new log messages and levels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leshniak added a commit to callstack-internal/react-native-onyx that referenced this pull request May 20, 2026
- Probe start: logInfo when tab becomes visible and probe begins
- Probe healthy: logInfo confirming connection is healthy
- Probe stale: logAlert with error details when stale connection detected
- Heal attempts/success/exhaustion/non-recoverable: same as Expensify#780
- Updated test assertions to match new log messages and levels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@fabioh8010 fabioh8010 left a comment

@leshniak

  • I checked the E/App PR and its description looks broken. We are also still missing videos/evidence there.
  • Please add the same videos/evidence to this PR description in order for CI to pass.

function isConnectionLostError(error: unknown): boolean {
if (!(error instanceof Error)) return false;
const msg = error.message.toLowerCase();
return msg.includes('connection to indexed database server lost') || msg.includes('connection is closing');

Using msg.includes('connection is closing') means we will include other connection closing errors too, e.g.:

  • InvalidStateError: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing. (initially handled in #748, "fix the database connection is closing issue.")
  • UnknownError: Connection is closing because of: IO error:
  • UnknownError: Connection is closing because of: Failed to remove blob file.
  • UnknownError: Connection is closing.
  • UnknownError: Connection is closing because of: Force close delete origin
  • UnknownError: Connection is closing because of: Corruption: block checksum mismatch

It's intended, right?
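
To make the breadth of the match concrete, the predicate quoted above can be run against the listed messages. The sketch below closes the function so it runs standalone and uses plain `Error` objects as stand-ins for `DOMException`; the array name `sampleMessages` is illustrative. Each of the six messages contains the substring "connection is closing" after lowercasing, so every entry of `matched` is `true`.

```typescript
// Same predicate as quoted above, closed off so it runs standalone.
function isConnectionLostError(error: unknown): boolean {
    if (!(error instanceof Error)) return false;
    const msg = error.message.toLowerCase();
    return msg.includes('connection to indexed database server lost') || msg.includes('connection is closing');
}

// Messages from the list above; plain Error objects stand in for DOMException.
const sampleMessages: string[] = [
    "Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.",
    'Connection is closing because of: IO error:',
    'Connection is closing because of: Failed to remove blob file.',
    'Connection is closing.',
    'Connection is closing because of: Force close delete origin',
    'Connection is closing because of: Corruption: block checksum mismatch',
];

const matched: boolean[] = sampleMessages.map((message) => isConnectionLostError(new Error(message)));
```

Since the check ignores `error.name`, the `InvalidStateError` variant is classified the same way as the `UnknownError` ones.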

Only budget exhaustion and non-recoverable errors should trigger alerts.
Heal attempts are handled gracefully and only need informational logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>