Write-through sync in brain.correct() — eliminate hook+session dependency

# Write-through sync in brain.correct() — eliminate hook+session dependency

## Problem

Current cloud sync only fires inside `gradata.hooks.session_close`, which requires:
1. Hook config wired correctly per-runtime (fragile, silent failures)
2. Session lifecycle to end (`hermes --continue` = sessions never end)
3. `GRADATA_API_KEY` in subprocess env (often missing)
4. trigger events existing since last sync (catch-22 with #1)

Four independent failure points, all silent. Discovered 2026-05-14 after our 27-agent fleet hadn't synced for ~5 days despite the sync infrastructure being "operational."

Band-aid landed: cron job `gradata-sync-cron` runs every 60s, reads `system.db`, POSTs deltas to `/api/v1/sync`. Works today but adds another moving part.

## Proposal

`brain.correct(draft, final)` POSTs the correction to `api.gradata.ai` immediately after local SQLite write. Fire-and-forget HTTP via background thread or asyncio. Local SQLite remains source of truth; cloud is a write-through replica.

## Architecture

```
brain.correct(draft, final)
  ├── write to system.db (existing)
  ├── append to events.jsonl (existing)
  └── enqueue to sync_queue table (NEW)
       └── background drain (NEW)
            └── POST /api/v1/ingest (NEW, lightweight single-correction endpoint)
                 └── on success: mark synced_at
                 └── on failure: retry on next correct() call
```

## Files

### New
- `src/gradata/_sync_queue.py` (~80 LOC) — SQLite table + drain function
- `src/gradata/_sync_worker.py` (~50 LOC) — background daemon thread, 30s timer

### Modified
- `src/gradata/brain.py`:
  - `Brain.__init__`: start sync worker if `GRADATA_API_KEY` set
  - `Brain.correct`: enqueue after local write
  - `Brain.end_session`: drain synchronously (preserve batch compat)
- `src/gradata/_migrations/`: new migration for sync_queue table
- `cloud/app/routes/sync.py`: add `POST /api/v1/ingest` for single-correction writes
- `cloud/app/models.py`: `IngestRequest` (single CorrectionPayload, no batching)

## Schema for sync_queue table

```sql
CREATE TABLE sync_queue (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  payload_json TEXT NOT NULL,
  kind TEXT NOT NULL CHECK(kind IN ('correction','lesson','event')),
  enqueued_at REAL NOT NULL,
  synced_at REAL,
  attempts INTEGER DEFAULT 0,
  last_error TEXT
);
CREATE INDEX idx_sync_queue_pending ON sync_queue(synced_at) WHERE synced_at IS NULL;
```

## Implementation plan (5-day target)

### Day 1: Schema + queue primitives
- [ ] Migration for sync_queue table
- [ ] `_sync_queue.py::enqueue()` and `_sync_queue.py::drain()`
- [ ] Unit tests: enqueue idempotency, drain marks synced_at, retry on failure

### Day 2: Cloud /ingest endpoint
- [ ] `IngestRequest` model (single CorrectionPayload)
- [ ] `POST /api/v1/ingest` handler — same projector path as /sync but single-row
- [ ] Rate limit: 600/minute per brain (10/sec)
- [ ] Tests against staging Supabase

### Day 3: Wire into Brain class
- [ ] `Brain.__init__` starts `_sync_worker` thread if API key present
- [ ] `Brain.correct` enqueues after local write (non-blocking)
- [ ] `Brain.close()` drains synchronously before exit
- [ ] Backward compat: `end_session()` still triggers full sync

### Day 4: Fleet deployment
- [ ] Deploy SDK update to one agent first (e.g. `data-engineer`)
- [ ] Verify a `brain.correct()` call shows up in dashboard in <5s
- [ ] Roll out to remaining 26 agents
- [ ] Watch error rate for 24h

### Day 5: Hook deprecation
- [ ] Remove `cloud_sync_tick` from `session_close.py` (now redundant)
- [ ] Remove `gradata-sync-cron` (replaced by SDK write-through)
- [ ] Update docs: hooks = local-only brain quality, SDK = cloud sync
- [ ] Update `gradata install --agent X` to skip writing the `on_session_end` cloud-sync hook entry

## Acceptance criteria

- `brain.correct(draft, final)` returns in <50ms (no network in critical path)
- Cloud dashboard reflects the correction within 5 seconds of the call
- If the cloud is unreachable, the correction is queued locally and synced on next opportunity
- Re-running `brain.correct()` with the same content is idempotent (existing dedup via `event_id`)
- `gradata doctor` reports `sync_queue: N pending` so users can see backlog
- Removing the hook entry from `~/.hermes/config.yaml` does NOT break cloud sync

## Risks

- Background thread crashes → corrections accumulate in queue. Mitigation: queue depth metric in `gradata doctor`, watchdog cron alerts on backlog >1000.
- Rate limit at /ingest → bursts of corrections (e.g. fleet restart) overflow. Mitigation: client-side batching (group corrections enqueued within 1s into one POST).
- API key rotation → all corrections fail until key updated. Mitigation: log 401s loudly to local file, surface in `gradata doctor`.
- Schema drift between SDK and cloud → 422 validation errors. Mitigation: version the IngestRequest model, server accepts N and N-1.

## Rollback

If write-through causes issues:
1. Set `GRADATA_DISABLE_WRITE_THROUGH=1` env var → SDK falls back to hook+session_close path
2. Re-enable `gradata-sync-cron` cron job
3. No data loss — local SQLite has everything

## Refs

- Office hours memo: `/home/olive/gradata-office-hours-memo.md`
- Cron band-aid: `/home/olive/.gradata/sync_cron.py`, job id `ad2bacd12fdf`
- Related: Gradata/gradata#190 (install UX epic)
- Discovered during 2026-05-14 fleet debugging session


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write-through sync in brain.correct() — eliminate hook+session dependency #194

Write-through sync in brain.correct() — eliminate hook+session dependency

Problem

Proposal

Architecture

Files

New

Modified

Schema for sync_queue table

Implementation plan (5-day target)

Day 1: Schema + queue primitives

Day 2: Cloud /ingest endpoint

Day 3: Wire into Brain class

Day 4: Fleet deployment

Day 5: Hook deprecation

Acceptance criteria

Risks

Rollback

Refs

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Write-through sync in brain.correct() — eliminate hook+session dependency #194

Description

Write-through sync in brain.correct() — eliminate hook+session dependency

Problem

Proposal

Architecture

Files

New

Modified

Schema for sync_queue table

Implementation plan (5-day target)

Day 1: Schema + queue primitives

Day 2: Cloud /ingest endpoint

Day 3: Wire into Brain class

Day 4: Fleet deployment

Day 5: Hook deprecation

Acceptance criteria

Risks

Rollback

Refs

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions