Skip to content

fix: stream group_agent_export instead of buffering the whole group#112

Open
agent-p1p wants to merge 2 commits into
masterfrom
pip/goggles-109
Open

fix: stream group_agent_export instead of buffering the whole group#112
agent-p1p wants to merge 2 commits into
masterfrom
pip/goggles-109

Conversation

@agent-p1p

@agent-p1p agent-p1p commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Closes #109

Summary

group_agent_export was the last unbounded full-group path: it materialized every valid event (including the per-event raw_context/raw_kind JSONField blobs) into a Python list, built one nested export dict, then serialized one large JSON string — all held in a single gunicorn worker's heap before the response was written. For a large group (50 MiB/file upload ceiling × many files) this could spike a worker's RSS into the hundreds of MB/GB and block or OOM it for all users — the same memory-exhaustion class the bounded landing (#17), detail (#38), tabs (#92), and timeline (#38/#94) paths were introduced to avoid.

Per the issue's design guidance, the export is the authoritative, complete whole-group artifact, so capping/truncating is wrong — completeness is the point. This streams the response instead, keeping memory roughly per-chunk + the bounded aggregate sections rather than per-group, with no cap and no truncation.

Approach

"Complete" and "buffer the whole thing in one heap" are separable. The fix splits the export into two passes:

  1. Bounded header (agent_state_export_header_for_group) — every section except the per-event array (group, summary, sources, timeline, actions, messages, sensitivity). These aggregate sections are bounded by engine/message/action cardinality, not raw event count, and — crucially — none of them read raw_context/raw_kind (only agent_event_row does). So the header is built from events fetched without the export-only JSON blobs (include_export_fields=False), i.e. the same lean rows the existing bounded views already load.

  2. Lazy per-event generator (agent_event_rows_for_group) — streams agent_event_row dicts (these do carry the raw JSON) from the DB via .iterator(chunk_size=2_000) (the chunked-iteration pattern already used in this module). Ordered in SQL with F("wall_time_ms").asc(nulls_last=True), line_number, id to match sorted_timeline_events exactly, so rows stream in the final order without a Python re-sort that would force full materialization.

group_agent_export now returns a StreamingHttpResponse whose generator serializes the header once, replaces its closing brace with a streamed "events": [...] array, and serializes each event row individually — so the body is written incrementally and the export is never one giant in-memory JSON string. Both compact and ?pretty=1 modes are preserved.

Verification (local, full CI parity except docker postgres + dep-audit)

  • uv run python manage.py test158 passed (155 existing + 3 new regression tests), no regressions
  • uv run ruff check . — clean
  • uv run ruff format --check . — clean
  • uv run python manage.py check — no issues
  • uv run python manage.py makemigrations --check --dry-runNo changes detected (no model/migration drift)

GitHub Actions additionally covers the Postgres test run (verifies nulls_last ordering on Postgres, where NULL placement differs from SQLite) and dependency audit.

New regression tests

  • test_group_agent_export_streams_events_lazily_in_chunksagent_event_rows_for_group is a lazy generator; view returns StreamingHttpResponse
  • test_group_agent_export_streams_complete_events_in_timeline_order — completeness (every valid event, no drop) + deterministic order matching the buffered sorted_timeline_events order, across two engines
  • test_group_agent_export_pretty_streams_valid_json?pretty=1 still yields valid, complete, indented JSON
  • Updated test_group_agent_export_returns_agent_readable_json to consume the streamed body

Sensitive paths

None. Touches forensics/views.py + forensics/analysis.py + forensics/tests.py only — no models, migrations, settings, auth, or token code (repo_sensitive_paths untouched; makemigrations --check confirms no DB drift).


Open in Stage

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@agent-p1p, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 38 minutes and 56 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aafda3b4-11da-4ea3-ab6c-ce5e0865f40a

📥 Commits

Reviewing files that changed from the base of the PR and between 2591fe7 and 57d292a.

📒 Files selected for processing (4)
  • config/settings.py
  • forensics/analysis.py
  • forensics/tests.py
  • forensics/views.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pip/goggles-109

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@stage-review

stage-review Bot commented Jun 24, 2026

Copy link
Copy Markdown

Ready to review this PR? Stage has broken it down into 4 individual chapters for you:

Title
1 Define export event limit setting
2 Refactor export logic into header and streaming generator
3 Implement streaming response in export view
4 Verify streaming export and regression tests
Open in Stage

Chapters generated by Stage for commit 57d292a on Jun 24, 2026 6:27am UTC.

@agent-p1p

Copy link
Copy Markdown
Contributor Author

Adversarial review

Verdict: Changes requested (1 blocking finding).

Blocking

  • forensics/views.py:426 / forensics/analysis.py:1061-1089 — the endpoint still eagerly builds a full-group header before the streaming response exists. header_events = list(valid_events_for_group(group)) loads every valid event, then agent_state_export_header_for_group() builds timeline_payload_for_group, human_action_groups_for_group, message_traces_from_events, and group_summary from that full list. timeline_payload_for_group() still includes timeline["items"] for every placed event and excluded["event_ids"] for every unplaced event, human_action_groups_for_group() can append an event_row for every human-action event, and _stream_agent_state_export() serializes that whole header with one json.dumps(header) before yielding any streamed event row. This removes raw_context/raw_kind from the header and streams the final events array, but it does not remove the unbounded full-group path the issue called out; a large group can still force the worker to hold the full event list plus full timeline/action/message structures plus the header JSON string. Fix by making the exported header bounded/streamed too, or by adding an explicit size guard/413 if the complete timeline/actions/messages header cannot be produced without O(group events) memory.

Suggestions

  • None.

Nitpicks

  • None.

Local check run during review: DJANGO_DEBUG=1 uv run python manage.py test forensics.tests.GroupDetailTimelineViewTests.test_group_agent_export_returns_agent_readable_json forensics.tests.GroupDetailTimelineViewTests.test_group_agent_export_streams_events_lazily_in_chunks forensics.tests.GroupDetailTimelineViewTests.test_group_agent_export_streams_complete_events_in_timeline_order forensics.tests.GroupDetailTimelineViewTests.test_group_agent_export_pretty_streams_valid_json — 4 passed.

agent-p1p added a commit that referenced this pull request Jun 24, 2026
The streaming response keeps the per-event array per-chunk, but the
export header (timeline items, excluded ids, human-action rows) is
inherently O(group events) and is still buffered to build the response.
Gate the endpoint on the valid-event count (a single SQL COUNT, no rows
materialized) via GOGGLES_AGENT_EXPORT_MAX_EVENTS so an oversized group
fails predictably with a 413 + JSON error instead of spiking a worker's
RSS and degrading it for all users. Completeness is preserved below the
cap; <= 0 disables the guard. Addresses adversarial review on #112.

Closes #109
@agent-p1p

Copy link
Copy Markdown
Contributor Author

Addressed blocking finding (commit 05da568)

The adversarial review correctly flagged that the export header is still O(group events): timeline.items places one item per placed event, timeline.excluded.event_ids one id per unplaced event, and actions can append an event_row per human-action event. Streaming the per-event array did not remove that path — it only moved the largest single component (full event rows + raw_context/raw_kind JSON) to per-chunk.

Per the issue's stated design tension ("completeness is the point" — a hard cap/truncation is wrong) and the review's own escape hatch ("add an explicit size guard/413 if the complete timeline/actions/messages cannot be produced without O(group events) memory"), I added a configurable event-count gate:

  • New setting GOGGLES_AGENT_EXPORT_MAX_EVENTS (default 250,000; <= 0 disables).
  • The view checks the valid-event count via a single SQL COUNT(*) (no event rows materialized) before building the header. If the group exceeds the cap, it returns a predictable 413 + JSON error ({error, event_count, max_events}) instead of letting an arbitrarily large group spike a worker's RSS.
  • Normal-sized groups (under the cap) still get the full, complete export — header built once, per-event array streamed chunk-by-chunk. No truncation below the cap.
  • Corrected the now-inaccurate agent_state_export_header_for_group docstring (the header IS O(events); the guard is what bounds the worst case).

Verification (CI-exact commands, local)

  • DJANGO_DEBUG=1 uv run python manage.py test — 160 passed (+2 new: 413-fires-over-cap, cap-disabled-streams-full).
  • uv run ruff check . — clean; uv run ruff format --check . — clean.
  • DJANGO_DEBUG=1 uv run python manage.py check — no issues; makemigrations --check --dry-run — no changes.

Note: config/settings.py is a repo-declared sensitive path (one additive config constant, no auth/token/secret logic) — flagged for the merge gate.

The agent-state export was the last unbounded full-group path: it
materialized every valid event (with its raw_context/raw_kind JSON)
into a Python list, built one nested export dict, and serialized one
large JSON string -- all resident in a single gunicorn worker's heap.
A large group could spike a worker's RSS and block/OOM it for all
users, the same memory-exhaustion class the bounded landing/detail/
tabs/timeline paths were introduced to avoid.

Stream the response instead, preserving completeness (no cap, no
truncation):

- Split the build into a bounded header (everything except the per-event
  array, built from events fetched WITHOUT the raw JSON blobs -- only
  agent_event_row reads those) and a lazy per-event generator
  (agent_event_rows_for_group) that pulls rows via .iterator(chunk_size=
  2_000), ordered in SQL with nulls_last to match sorted_timeline_events
  without a full-materialization Python re-sort.
- group_agent_export now returns a StreamingHttpResponse whose generator
  serializes the header once, then streams each event row individually,
  so memory stays roughly per-chunk + the bounded aggregate sections
  instead of per-group, and the body is written incrementally.

Adds regression tests for lazy chunked streaming, completeness +
deterministic timeline ordering, and the pretty-printed stream.

Closes #109
The streaming response keeps the per-event array per-chunk, but the
export header (timeline items, excluded ids, human-action rows) is
inherently O(group events) and is still buffered to build the response.
Gate the endpoint on the valid-event count (a single SQL COUNT, no rows
materialized) via GOGGLES_AGENT_EXPORT_MAX_EVENTS so an oversized group
fails predictably with a 413 + JSON error instead of spiking a worker's
RSS and degrading it for all users. Completeness is preserved below the
cap; <= 0 disables the guard. Addresses adversarial review on #112.

Closes #109
@agent-p1p agent-p1p marked this pull request as ready for review June 24, 2026 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

group_agent_export materializes every valid event (incl. raw_kind/raw_context JSON) into memory in one response — the last unbounded full-group path

1 participant