Skip to content

Write-through sync in brain.correct() — eliminate hook+session dependency #194

@Gradata

Description

@Gradata

Write-through sync in brain.correct() — eliminate hook+session dependency

Problem

Current cloud sync only fires inside gradata.hooks.session_close, which requires:

  1. Hook config wired correctly per-runtime (fragile, silent failures)
  2. Session lifecycle to end (hermes --continue = sessions never end)
  3. GRADATA_API_KEY in subprocess env (often missing)
  4. trigger events existing since last sync (catch-22 with Review: Full codebase audit via Greptile #1)

Four independent failure points, all silent. Discovered 2026-05-14 after our 27-agent fleet hadn't synced for ~5 days despite the sync infrastructure being "operational."

Band-aid landed: cron job gradata-sync-cron runs every 60s, reads system.db, POSTs deltas to /api/v1/sync. Works today but adds another moving part.

Proposal

brain.correct(draft, final) POSTs the correction to api.gradata.ai immediately after local SQLite write. Fire-and-forget HTTP via background thread or asyncio. Local SQLite remains source of truth; cloud is a write-through replica.

Architecture

brain.correct(draft, final)
  ├── write to system.db (existing)
  ├── append to events.jsonl (existing)
  └── enqueue to sync_queue table (NEW)
       └── background drain (NEW)
            └── POST /api/v1/ingest (NEW, lightweight single-correction endpoint)
                 └── on success: mark synced_at
                 └── on failure: retry on next correct() call

Files

New

  • src/gradata/_sync_queue.py (~80 LOC) — SQLite table + drain function
  • src/gradata/_sync_worker.py (~50 LOC) — background daemon thread, 30s timer

Modified

  • src/gradata/brain.py:
    • Brain.__init__: start sync worker if GRADATA_API_KEY set
    • Brain.correct: enqueue after local write
    • Brain.end_session: drain synchronously (preserve batch compat)
  • src/gradata/_migrations/: new migration for sync_queue table
  • cloud/app/routes/sync.py: add POST /api/v1/ingest for single-correction writes
  • cloud/app/models.py: IngestRequest (single CorrectionPayload, no batching)

Schema for sync_queue table

CREATE TABLE sync_queue (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  payload_json TEXT NOT NULL,
  kind TEXT NOT NULL CHECK(kind IN ('correction','lesson','event')),
  enqueued_at REAL NOT NULL,
  synced_at REAL,
  attempts INTEGER DEFAULT 0,
  last_error TEXT
);
CREATE INDEX idx_sync_queue_pending ON sync_queue(synced_at) WHERE synced_at IS NULL;

Implementation plan (5-day target)

Day 1: Schema + queue primitives

  • Migration for sync_queue table
  • _sync_queue.py::enqueue() and _sync_queue.py::drain()
  • Unit tests: enqueue idempotency, drain marks synced_at, retry on failure

Day 2: Cloud /ingest endpoint

  • IngestRequest model (single CorrectionPayload)
  • POST /api/v1/ingest handler — same projector path as /sync but single-row
  • Rate limit: 600/minute per brain (10/sec)
  • Tests against staging Supabase

Day 3: Wire into Brain class

  • Brain.__init__ starts _sync_worker thread if API key present
  • Brain.correct enqueues after local write (non-blocking)
  • Brain.close() drains synchronously before exit
  • Backward compat: end_session() still triggers full sync

Day 4: Fleet deployment

  • Deploy SDK update to one agent first (e.g. data-engineer)
  • Verify a brain.correct() call shows up in dashboard in <5s
  • Roll out to remaining 26 agents
  • Watch error rate for 24h

Day 5: Hook deprecation

  • Remove cloud_sync_tick from session_close.py (now redundant)
  • Remove gradata-sync-cron (replaced by SDK write-through)
  • Update docs: hooks = local-only brain quality, SDK = cloud sync
  • Update gradata install --agent X to skip writing the on_session_end cloud-sync hook entry

Acceptance criteria

  • brain.correct(draft, final) returns in <50ms (no network in critical path)
  • Cloud dashboard reflects the correction within 5 seconds of the call
  • If the cloud is unreachable, the correction is queued locally and synced on next opportunity
  • Re-running brain.correct() with the same content is idempotent (existing dedup via event_id)
  • gradata doctor reports sync_queue: N pending so users can see backlog
  • Removing the hook entry from ~/.hermes/config.yaml does NOT break cloud sync

Risks

  • Background thread crashes → corrections accumulate in queue. Mitigation: queue depth metric in gradata doctor, watchdog cron alerts on backlog >1000.
  • Rate limit at /ingest → bursts of corrections (e.g. fleet restart) overflow. Mitigation: client-side batching (group corrections enqueued within 1s into one POST).
  • API key rotation → all corrections fail until key updated. Mitigation: log 401s loudly to local file, surface in gradata doctor.
  • Schema drift between SDK and cloud → 422 validation errors. Mitigation: version the IngestRequest model, server accepts N and N-1.

Rollback

If write-through causes issues:

  1. Set GRADATA_DISABLE_WRITE_THROUGH=1 env var → SDK falls back to hook+session_close path
  2. Re-enable gradata-sync-cron cron job
  3. No data loss — local SQLite has everything

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions