Joysafeter v2#141

Open
yuzzjj wants to merge 530 commits into main from joysafeter-v2

yuzzjj commented Apr 23, 2026

Collaborator

No description provided.

yuzzjj and others added 30 commits April 26, 2026 07:56
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All agents now go through AgentBuildShell; layout drops overview/builder tabs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CreateAgentDialog: only name + build method card selector
- Create success → navigate to /agents/{id}?stage=brief
- Remove edit dialog mode — use Settings page instead
- AgentCard: remove edit button, add status badge, relative time,
  hover action hint (Start building / Continue building / View usage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- agent-card: use shared formatRelativeTime from dateHelpers, inline
  getActionHint, add aria-label on delete button
- agent-form-dialog: remove unused workspaceId prop, add role=radiogroup
  and aria-checked on build method cards
- agents/page: 4-column grid layout (xl:grid-cols-4), remove workspaceId
  prop from CreateAgentDialog

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…card

- Remove mx-auto max-w-6xl that caused left gap
- 4-column grid (xl:grid-cols-4)
- AgentCard: reduce padding/spacing, merge meta+action into single row,
  fix Invalid Date with null guard, add defaultValue to all i18n keys
- Remove handleCreate/handleDelete wrappers, inline setters

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove "Business scenario" field (not needed for graph builder)
- Center form with max-w-2xl, cleaner spacing
- Add placeholder text to all fields for guidance
- Constraints changed from Textarea to Input (single line sufficient)
- Add icons to action buttons (Sparkles, Wrench)
- Tighter vertical rhythm, smaller labels

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- InspectorPanel handles both node and edge selection
- SaveManager cross-store dependency with callback injection
- Store migration strategy (compat layer → gradual replace)
- CopilotOverlay chat logic migration from StudioRightPanel
- ReactFlowProvider stays in AgentBuilder, GraphBuilderShell inside
- AgentBuilder split into init logic + layout shell

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… GraphStatusBar rewrite

- Fix BuilderSidebarTabs/StudioRightPanel paths to components/ (not root/studio/)
- Fix React Flow imports from @xyflow/react to reactflow (matches codebase)
- Task 6: rewrite existing GraphStatusBar (149 lines), not create new
- Task 10: correct rm paths for cleanup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hell

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oMode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, remove duplication

- Merge stepper + toolbar into single compact header row via toolbarSlot pattern
- Make ImportExportMenu self-contained (owns overwrite dialog, reads graphStore directly)
- Remove execution state duplication from graphStore (BuilderNode reads executionStore)
- Remove function aliases (removeNode→deleteNode, pushHistory→takeSnapshot)
- Replace require() circular-dep hack with setAutoSaveTrigger registration
- Collapse getInitialState blocks to captured-at-creation snapshots
- Fix GraphToolbar perf: subscribe to boolean selectors not full arrays
- Remove dead code: unused props, variables, redundant wrappers
- Delete obsolete: BuilderToolbar, BuilderSidebarTabs, StudioRightPanel

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Consolidates the two separate "发布" (publish) entry points into a
single Wizard-led flow with Settings as read-only history. Introduces
AgentPublishService backend orchestration layer with proper transaction
boundaries, replacing frontend-orchestrated multi-request publish.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…uilder

Address 3 issues from spec review:
1. Add AgentBuilder.tsx to rewrite list (uses useUnfreezeVersion)
2. Fix publish_release call to use CreateAgentReleaseRequest model
3. Clarify commit removal scope: only orchestrated methods, not standalone

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
11 tasks covering backend (AgentPublishService, route changes) and
frontend (new hooks, UI rewrites, cleanup). Fix publishKeys to use
correct agentKeys.all pattern and STALE_TIME import path.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
freeze_version, publish_release, activate_release, retire_release no
longer own their transaction boundary. Delete unfreeze_version entirely.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…points

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…e history

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
wengaolei1 and others added 30 commits June 11, 2026 10:03
…lope, SandboxBridge, dead code cleanup

- Task 1: Add unified OrchestratorError exception hierarchy (error.py)
- Task 4: Add VaultCipher with AES-256-GCM enc: prefix support
- Task 9: Add last_result_status/error, task_available, sandbox_id to SandboxBridge
- Task 13: Auto-assign UUIDv7 event_id, make seq Optional in EventEnvelope
- Task 15: Remove dead result/memory_sync branches in event_mapping
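
A minimal sketch of what Task 13 describes, assuming a dataclass-style envelope. The field names `type` and `payload` and the hand-rolled `uuid7()` helper are illustrative assumptions, not the project's actual code; only `event_id` and the optional `seq` come from the commit message.

```python
import os
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


def uuid7() -> uuid.UUID:
    """Build a UUIDv7: 48-bit Unix millisecond timestamp followed by random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((ts_ms & ((1 << 48) - 1)) << 80) | rand
    value &= ~(0xF << 76)  # clear the version nibble
    value |= 0x7 << 76     # set version 7
    value &= ~(0x3 << 62)  # clear the variant bits
    value |= 0x2 << 62     # set the RFC 4122 variant (binary 10)
    return uuid.UUID(int=value)


@dataclass
class EventEnvelope:
    type: str      # hypothetical field
    payload: dict  # hypothetical field
    # event_id is assigned automatically when the caller does not supply one.
    event_id: str = field(default_factory=lambda: str(uuid7()))
    # seq stays None until whatever persists the event assigns it.
    seq: Optional[int] = None


envelope = EventEnvelope(type="session.status_idle", payload={})
```

Because the timestamp occupies the most significant bits, IDs generated in different milliseconds sort by creation time, which is the usual reason to prefer UUIDv7 for event IDs.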

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ventBus flush, file injection

- Task 2: Add max_scheduling_tasks + image_for_provider() to JoySafeterConfig
- Task 3: Add provider, setup_commands, allowed/disallowed_tools, max_turns, repos to HarnessInput
- Task 7: Add SandboxStatus enum, SandboxCreateConfig, ProviderSandboxInfo to provider
- Task 14: Add flush() method to EventBus
- Task 23: Add InjectionStrategy enum and FileToInject dataclass to file_injection

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eue parity

- Task 16: Add cancel command handler to CommandListener
- Task 17: Add dispatch_cancel/dispatch_input to RedisCoordinator
- Task 18: Fix lock key namespacing with joysafeter:lock: prefix
- Task 19: Add has_pending() to queue backend
- Task 20: Add exponential backoff to CommandListener reconnect
- Task 21: Wrap publish_session_event with source_instance
- Task 25: Use config heartbeat_interval instead of hardcoded 10s
- Task 26: Add deregister_instance method to RedisCoordinator
- Task 27: Return (sandbox_id, owner) tuples from list_active_sandbox_owners
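
The reconnect backoff from Task 20 could look roughly like the sketch below. The base delay, growth factor, and cap are assumed values, and `backoff_delays` is a hypothetical helper name; the commit message does not state the real parameters.

```python
import random
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, jitter: float = 0.0) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically up to `cap` seconds."""
    delay = base
    while True:
        # Optional jitter spreads reconnects from many instances over time.
        yield min(cap, delay) + random.uniform(0, jitter)
        delay = min(cap, delay * factor)


first_five = list(islice(backoff_delays(base=1, factor=2, cap=8), 5))
```

A listener would sleep for the next value after each failed reconnect and create a fresh generator once a connection succeeds.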

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…C logging

- Task 5: Add OAuth token refresh + VaultCipher decryption to HarnessIer
- Task 6: Add setup_commands, tool allow lists, max_turns extraction to builder
- Task 8: Add reverse orphan sweep (DB→provider) to SandboxController
- Task 22: Align MemoryStoreSubscribers API (mount_path, dedup, exclude by sandbox, notify_peers_direct)
- Task 24: image_for_provider already exists in sandbox_resolver — skip
- Task 28: Log e_providers from RunnerReady in gRPC

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eanup

- Task 10: Fix grace period probe schedule to match Rust (3s/5s/10s/15s then 105s)
- Task 11: Fix reconnect path to emit session.status_idle when task_done && !got_idle
- Task 12: Fix cleanup Step 7 to query DB for pending tasks instead of in-memory list

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The upgrade() and downgrade() function bodies had an extra level of
indentation (8 spaces instead of standard 4). This is syntactically
valid Python but violates Alembic's autogenerate convention and
triggers Pyright linting errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r, tool allow lists

- C1: Update task DB status to 'cancelled' + emit session.status_idle on user cancel
- C2: Fix NameError in reconnect path (result_status/active_task_session_id → task_session_id)
- C3: Grace period already wired in _cleanup_sandbox → verified, no change needed
- C5: _parse_tool_allow_lists now reads from nested configs[] array (Rust parity)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- I1: Update sandbox DB status to 'running' on task dispatch
- I2: Publish Redis 'complete' event after task result
- I3: Unconditionally reset requires_action_pending + confirmation after each task
- I4: Emit session.status_running on reconnect before inner event loop
- I5: Fix error event type check ('session.error' not 'error')

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…der parity

- I10: _force_stop_stuck uses updated_at instead of last_used_at (Rust parity)
- I14: Skip pool claim after destroying mismatched stopped sandbox (go direct to create)
- I15: MCP tool name extraction checks both 'name' and 'mcp_server_name' fields
- I11/I12/I13/I16: Acknowledged as Rust-side differences or deliberate design choices

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- C4: Add CAS guard to SessionStateSubscriber (prevent overwriting terminal sessions)
- C6: Add memory_mounts to SandboxCreateConfig + Docker bind mount support
- I9: Already protected by queries::update_session_status CAS (verified)
- I12: Add notify_global() to TaskQueue + wake scheduler after health check cleanup
- I16: Fix Codex history to parse content as array-of-blocks (not just plain string)
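
The CAS guard in C4 can be illustrated as a conditional UPDATE that refuses to touch a session already in a terminal state. This sketch uses SQLite for self-containment; the table layout, the `update_status_cas` name, and the set of terminal statuses are assumptions.

```python
import sqlite3

# Assumed terminal statuses; the real set is not listed in the commit message.
TERMINAL = ("completed", "failed", "cancelled")


def update_status_cas(conn: sqlite3.Connection, session_id: str, new_status: str) -> bool:
    """Update the status only if the session is not already terminal."""
    placeholders = ",".join("?" for _ in TERMINAL)
    cur = conn.execute(
        f"UPDATE sessions SET status = ? WHERE id = ? AND status NOT IN ({placeholders})",
        (new_status, session_id, *TERMINAL),
    )
    conn.commit()
    # rowcount is 0 when the guard rejected the write.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO sessions VALUES ('s1', 'running')")
```

The check and the write happen in one statement, so a late subscriber event cannot overwrite a terminal status between a read and a write.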

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stead of /dashboard

Changed 4 locations:
- app/page.tsx: root route redirect for logged-in users
- components/auth/oauth-buttons.tsx: default OAuth callback URL
- app/(auth)/verify/use-verification.ts: post-verification redirect
- middleware.ts: add /managed to allowed redirect paths

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add 'Copy' button next to content header in EventDetail transcript mode
- Uses Clipboard API with both text/html and text/plain MIME types for rich text copy
- Preserves formatting (headings, lists, bold, links) when pasting into Word/Docs/Email
- Falls back to plain text copy if rich text clipboard API is unavailable
- Shows 'Copied' confirmation with green checkmark for 2 seconds
- Added common.copy/common.copied i18n keys for both zh and en locales

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause: SessionService.update_session_status acquired a FOR UPDATE row lock
on joysafeter_sessions BEFORE acquiring pg_advisory_xact_lock for event seq.
EventBatchSender._write_batch acquired advisory lock FIRST then INSERT which
needs FK ShareLock on sessions. This creates an AB-BA deadlock under concurrency.

Fixes:
1. SessionService.update_session_status: acquire advisory lock BEFORE row lock
   (matching JoySafeterSessionLifecycleService lock ordering)
2. EventBatchSender._write_batch: sort session_ids before iterating to prevent
   multi-session advisory lock ordering deadlocks between concurrent batches

The error manifested as:
  asyncpg.exceptions.DeadlockDetectedError: deadlock detected
  Process 15202 waits for ShareLock on transaction 172024; blocked by process 15227
  Process 15227 waits for ShareLock on transaction 172023; blocked by process 15202
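
Both fixes come down to one global lock order. The sketch below only models that order as a list of steps; `lock_plan` and its step labels are hypothetical, and real code would issue `pg_advisory_xact_lock` and the row-level statements instead of building strings.

```python
from typing import Iterable


def lock_plan(session_ids: Iterable[str]) -> list[str]:
    """Return the order in which a batch should take its locks."""
    steps = []
    # Sorting gives every concurrent batch the same session order.
    for sid in sorted(set(session_ids)):
        steps.append(f"advisory:{sid}")  # advisory lock for the event seq comes first
        steps.append(f"row:{sid}")       # row lock / FK-checked INSERT comes second
    return steps


batch_a = lock_plan(["s2", "s1"])
batch_b = lock_plan(["s1", "s2", "s1"])
```

When every writer follows the same plan, no two transactions can each hold a lock the other is waiting for, which removes the AB-BA cycle.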

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. Gate MockAdapter behind JOYSAFETER_MOCK_ADAPTER env var (matching Rust).
   Previously it was unconditionally registered on every startup.

2. Replace silent mock_call_llm fallback with RuntimeError.
   Previously, if langchain_openai was missing, the agent would silently
   return fake 'Mock response' output instead of failing loudly.
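
A sketch of both changes, assuming the env var is treated as a simple truthy flag. The accepted values and the helper names `mock_adapter_enabled` and `require_llm_backend` are illustrative, not taken from the codebase.

```python
import importlib.util
import os
from typing import Mapping


def mock_adapter_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Register the mock adapter only when explicitly requested."""
    # Assumed truthy spellings; the real parser may accept a different set.
    return env.get("JOYSAFETER_MOCK_ADAPTER", "").lower() in {"1", "true", "yes"}


def require_llm_backend(module_name: str = "langchain_openai") -> None:
    """Fail loudly instead of silently returning mock output."""
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"{module_name} is not installed; refusing to fall back to mock responses"
        )
```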

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…quests

Added ssrf_guard.py utility with:
- HTTPS-only scheme enforcement (configurable)
- RFC-1918 / link-local / loopback / cloud metadata IP blocking
- DNS resolution before request (mitigates DNS rebinding)
- follow_redirects=False on all patched httpx calls

Fixed 7 SSRF-vulnerable sinks:
1. CRITICAL: vault OAuth token_url (user-controlled, POSTed at runtime)
2. CRITICAL: quickstart ANTHROPIC_BASE_URL (user-writable secret)
3. HIGH: media.py Image/Audio/Video/File URL fetch (4 call sites)
4. HIGH: research_tools fetch_webpage_content
5. HIGH: A2A client agent_card_url
6. HIGH: MCP /test endpoint (no scheme validation)
7. HIGH: OIDC discovery issuer URL

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Changed SSRF guard default from HTTPS-only to allow both HTTP and HTTPS.
The primary SSRF defense is IP validation (blocking cloud metadata, private
IPs, loopback, link-local), not scheme restriction.

Production environments can set JOYSAFETER_SSRF_HTTPS_ONLY=1 to enforce
HTTPS-only when needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Private/internal IPs (10.x, 172.16.x, 192.168.x, 127.0.0.1) are now ALLOWED
by default because many legitimate services run on the internal network:
- LLM APIs (Ollama, vLLM on localhost)
- MCP servers (internal tools)
- Internal service endpoints

Only truly dangerous IPs are ALWAYS blocked:
- Cloud metadata (169.254.169.254, 100.100.100.200, fd00:ec2::254)
- Link-local range (169.254.x.x)
- Multicast addresses
- Known metadata hostnames (metadata.google.internal, metadata.goog)

Opt-in env vars for stricter policies:
- JOYSAFETER_SSRF_BLOCK_PRIVATE=1 — also block RFC-1918 private IPs
- JOYSAFETER_SSRF_HTTPS_ONLY=1 — enforce HTTPS-only scheme
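
The policy above can be sketched with the standard `ipaddress` module. `is_blocked_ip` and its `block_private` parameter are hypothetical stand-ins for however `ssrf_guard.py` reads `JOYSAFETER_SSRF_BLOCK_PRIVATE`; the metadata addresses are the ones listed in this commit message.

```python
import ipaddress

# Metadata endpoints named in the commit message.
METADATA_IPS = {
    ipaddress.ip_address("169.254.169.254"),
    ipaddress.ip_address("100.100.100.200"),
    ipaddress.ip_address("fd00:ec2::254"),
}


def is_blocked_ip(raw: str, block_private: bool = False) -> bool:
    """Return True when the address must not be contacted."""
    ip = ipaddress.ip_address(raw)
    # Always blocked: metadata endpoints, link-local, multicast.
    if ip in METADATA_IPS or ip.is_link_local or ip.is_multicast:
        return True
    # Opt-in: private and loopback ranges.
    if block_private and (ip.is_private or ip.is_loopback):
        return True
    return False
```

With the default arguments, `is_blocked_ip("10.0.0.5")` is false while `is_blocked_ip("169.254.170.2")` is true, matching the allow-by-default stance for internal services.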

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… inputs

Backend (Pydantic schemas):
- McpServerCreate/Update: @field_validator on url field
- McpTestRequest: @field_validator on url field
- CreateCredentialRequest: @field_validator on mcp_server_url
- McpServerConfig (agent): @field_validator on url field
- Shared validate_url_scheme() in ssrf_guard.py

Frontend:
- lib/utils/url-validation.ts: validateUrlScheme() + isValidUrl()
- add-mcp-dialog.tsx: validate URL before test/save
- create-agent-dialog.tsx: validate MCP URL before adding

All URL fields now enforce http:// or https:// scheme at input time,
before reaching the backend SSRF guard (which handles IP validation).
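
A scheme check of this kind needs only `urllib.parse`. The function name matches the one mentioned above, but the signature and error messages here are assumptions; a Pydantic `@field_validator` would call something like this and return the value unchanged.

```python
from urllib.parse import urlsplit


def validate_url_scheme(url: str, allowed: tuple = ("http", "https")) -> str:
    """Return the trimmed URL, or raise ValueError if the scheme is not allowed."""
    cleaned = url.strip()
    parts = urlsplit(cleaned)
    if parts.scheme not in allowed:
        raise ValueError(f"URL scheme must be one of {allowed}, got {parts.scheme!r}")
    if not parts.netloc:
        raise ValueError("URL must include a host")
    return cleaned
```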

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Frontend fixes (3 missing validation points):
- Agent edit page MCP URL: add validateUrlScheme before addMcpServer
- Vault credential dialog: add validateUrlScheme before submit
- (Graph builder A2A URLs validated at backend level)

Backend fixes:
- A2A NodeConfig: validate a2a_url and agent_card_url scheme at config load time

Now ALL user-controllable URL inputs have http/https scheme validation:
- 8 frontend form entry points (MCP create/edit/test, Agent create/edit, Vault cred, Quickstart)
- 6 backend Pydantic schemas (McpServerCreate/Update, McpTestRequest, McpServerConfig, CreateCredentialRequest)
- 2 backend config parsers (A2A NodeConfig)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SSRFError extends ValueError. In the IP validation step (step 3), the code
raised SSRFError for metadata IPs, but the subsequent 'except ValueError'
clause caught and silently swallowed it, allowing requests to cloud metadata
endpoints like 169.254.169.254.

Fix: separate ipaddress.ip_address() parsing from SSRF validation checks
so the except ValueError only catches the 'not an IP literal' case.

Test results after fix:
  ✅ http://169.254.169.254 — correctly blocked
  ✅ http://169.254.170.2 — correctly blocked
  ✅ http://100.100.100.200 — correctly blocked
  ✅ http://metadata.google.internal — correctly blocked
  ✅ ftp://evil.com — correctly blocked
  ✅ http://localhost:11434 — correctly allowed
  ✅ http://10.0.0.5:8080 — correctly allowed
  ✅ https://api.anthropic.com — correctly allowed
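
The bug pattern and its fix can be reproduced in a few lines. The function names and the reduced set of checks are illustrative; only the `SSRFError(ValueError)` relationship and the parse/validate split come from the description above.

```python
import ipaddress


class SSRFError(ValueError):
    """Raised when a target address is forbidden."""


METADATA_IP = ipaddress.ip_address("169.254.169.254")


def check_host_buggy(host: str) -> None:
    try:
        ip = ipaddress.ip_address(host)
        if ip == METADATA_IP:
            raise SSRFError("metadata IP blocked")
    except ValueError:
        # Intended for "not an IP literal", but SSRFError is a ValueError too,
        # so the block above is silently swallowed.
        pass


def check_host_fixed(host: str) -> None:
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return  # not an IP literal; hostnames are handled elsewhere
    # Validation runs outside the try block, so SSRFError propagates.
    if ip == METADATA_IP or ip.is_link_local:
        raise SSRFError(f"{ip} is blocked")
```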

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem: When viewing a running session, the frontend simultaneously ran three event sources:
1. SSE stream (real-time events via /events/stream)
2. Session status polling (every 2s)
3. Auto-load more events (useEffect triggered repeatedly)

This caused a flood of GET /events?limit=100&after_seq=X requests even though SSE was delivering the same events in real-time.

Fixes:
- Skip auto-load useEffect when sseConnected=true (events arrive via SSE)
- Increase session refetchInterval from 2s to 10s when SSE is connected
  (session status changes are also delivered through SSE events)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tus from SSE

SSE already pushes session.status_* events in real-time. No need to
poll GET /sessions/{id} at all when SSE is connected.

Changes:
- refetchInterval: false when sseConnected (was 10s, before that 2s)
- New useEffect: watches SSE events for session.status_* and updates
  the React Query cache directly via setQueryData, so UI reflects
  status changes instantly without any HTTP request
- Polling only activates as fallback when SSE is disconnected

Before: SSE connected → poll every 2s → 30 requests/min wasted
After:  SSE connected → 0 polls → status updates from SSE push

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
P1: QueueFull now sends {lagged: true} signal instead of silently dropping
    - SessionBroadcaster: on QueueFull, drain queue and inject lagged marker
    - SSE endpoint: detect lagged marker and break stream (client reconnects)
    - Frontend sse.ts lagged handling was already implemented but was dead code
      because the backend never sent the signal — now it works end-to-end

P2: DB polling fallback reduced from 15s to 2s
    - When Redis is unavailable, events were delayed up to 15 seconds
    - Now polls every 2 seconds for acceptable latency

P3: SSE stream now detects terminal session status
    - DB fallback loop checks session.status and closes stream when terminated
    - Prevents indefinite empty polling for completed sessions

P4: Heartbeat timeout increased from 15s to 30s
    - Previous 15s timeout triggered a full DB query just to send a heartbeat
    - 30s is the standard SSE heartbeat interval; reduces unnecessary DB load

Architecture summary (service-split):
  Orchestrator → Redis PUBLISH → API Redis subscriber → SSE → Frontend
  Fallback (no Redis): Orchestrator → DB → API polls DB every 2s → SSE
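
The P1 behaviour can be sketched with a bounded `asyncio.Queue`. The `offer` and `stream_until_lagged` helpers are hypothetical names; only the drain-then-inject-marker idea and the `{lagged: true}` payload come from the description above.

```python
import asyncio

LAGGED = {"lagged": True}


def offer(queue: asyncio.Queue, event: dict) -> None:
    """Enqueue an event; on overflow, drop the backlog and leave a lagged marker."""
    try:
        queue.put_nowait(event)
    except asyncio.QueueFull:
        while not queue.empty():
            queue.get_nowait()
        queue.put_nowait(LAGGED)


async def stream_until_lagged(queue: asyncio.Queue) -> list:
    """Forward events until the lagged marker appears, then stop the stream."""
    sent = []
    while True:
        item = await queue.get()
        if item.get("lagged"):
            break  # the client is expected to reconnect and refetch
        sent.append(item)
    return sent


async def demo() -> tuple:
    queue = asyncio.Queue(maxsize=2)
    for seq in range(3):  # the third put overflows the queue
        offer(queue, {"seq": seq})
    overflowed = (queue.qsize(), queue.get_nowait())
    offer(queue, {"seq": 10})
    offer(queue, LAGGED)
    return overflowed, await stream_until_lagged(queue)


overflowed, forwarded = asyncio.run(demo())
```

Draining before injecting the marker guarantees there is room for it, so a slow consumer always learns that it missed events.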

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ReferenceError: Cannot access 'sseConnected' before initialization

The useQuery refetchInterval callback referenced sseConnected (line 195)
but it was declared later (line 275). Moved useSessionStream hook above
useQuery to resolve the temporal dead zone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the EventBus ran Phase 1 (DB persist) sequentially BEFORE
Phase 2 (SSE broadcast). Every event had to wait for the DB batch
write (100ms+ delay) before being pushed to SSE.

Now persist and broadcast run in PARALLEL:
- Python: asyncio.gather(persist(), broadcast1(), broadcast2())
- Rust: tokio::spawn(persist) then broadcast immediately

Also fixed Rust SessionBroadcaster to log warnings on channel full
(matching Python P1 lagged signal fix).

Before: event → wait DB batch (100ms) → wait DB write → SSE push
After:  event → SSE push immediately + DB write in background

Expected SSE latency improvement: ~100ms+ per event eliminated.
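
A small sketch of the parallel layout on the Python side, with a sleep standing in for the DB batch write. The function names and the shared `log` list are assumptions for illustration.

```python
import asyncio


async def persist(event: dict, log: list) -> None:
    await asyncio.sleep(0.05)  # stands in for the DB batch write
    log.append(("persisted", event["seq"]))


async def broadcast(event: dict, log: list) -> None:
    log.append(("broadcast", event["seq"]))  # stands in for the SSE push


async def publish(event: dict, log: list) -> None:
    # Both coroutines start together, so the broadcast does not wait for the
    # write. return_exceptions=True keeps one failure from cancelling the other.
    results = await asyncio.gather(
        persist(event, log), broadcast(event, log), return_exceptions=True
    )
    for result in results:
        if isinstance(result, Exception):
            log.append(("error", repr(result)))


log = []
asyncio.run(publish({"seq": 1}, log))
```

After the run, the broadcast entry appears in `log` before the persisted entry, which is the ordering change the commit describes.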

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. Connection semaphore: atomic acquire_nowait() (no TOCTOU race)
2. EventBus.publish: never raises — errors logged but don't kill sessions
3. HITL task deadline reset after confirmation (matching Rust)
4. HITL event buffer bounded at 1000 (prevent OOM on long HITL pauses)
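
Items 1 and 4 can be sketched with standard-library primitives. This uses `threading.BoundedSemaphore` with a non-blocking acquire as a stand-in for whatever semaphore type the project actually uses, and a `deque` with `maxlen` for the buffer. Dropping the oldest events on overflow is an assumption; the commit message only states the bound of 1000.

```python
import threading
from collections import deque


class ConnectionLimiter:
    """Admit a connection only if a slot can be taken in one atomic step."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # A single non-blocking acquire: there is no separate "is there room?"
        # check that could race with another caller.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


# Bounded buffer for events that arrive while waiting on human confirmation.
hitl_buffer = deque(maxlen=1000)
for seq in range(1500):
    hitl_buffer.append(seq)

limiter = ConnectionLimiter(limit=1)
```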

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…edis leak, backpressure

Python fixes:
- 1.2: gRPC transport keepalive (30s ping, 10s timeout) — detect NAT/LB dead connections
- 3.1: Bridge register() cancels old bridge on reconnect (prevent double-session)
- 4.5: Redis task mapping cleanup on agent-not-found early exit

Rust fixes (disk, needs cargo build):
- 1.2: Tonic tcp_keepalive + http2_keepalive (30s/10s)
- 3.1: BridgeRegistry.register() cancels old bridge via CancellationToken
- 6.2: rescue_orphaned_tasks pushes to global queue (was log-only)
- 7.1: StartTask send with 10s timeout (prevent indefinite block)

Both:
- All outbound sends now bounded by timeout or try_send
- Bridge double-registration produces WARNING log + old session cancelled
- gRPC keepalive detects silently dropped TCP connections
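
For the Python side, the 30s/10s keepalive maps onto gRPC channel arguments roughly as below. The first two keys are the standard gRPC core options for ping interval and ping timeout; whether the project sets them on a channel or a server, and the third option, are assumptions.

```python
# Keepalive settings as gRPC channel arguments (values in milliseconds).
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 30_000),          # send a ping every 30s
    ("grpc.keepalive_timeout_ms", 10_000),       # consider the peer dead after 10s without an ack
    ("grpc.keepalive_permit_without_calls", 1),  # assumed: also ping while no RPC is active
]

# Hypothetical usage: grpc.aio.insecure_channel(target, options=KEEPALIVE_OPTIONS)
# or grpc.aio.server(options=KEEPALIVE_OPTIONS).
```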

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>