Add Copilot as a SkillOpt-Sleep model backend (CopilotCliBackend) + research-engine MCP plugin #50
Open
Dongbumlee wants to merge 5 commits into
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools (skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only stdio server, mirroring the existing SkillOpt-Sleep plugin layout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
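For orientation, a stdlib-only stdio MCP server of the kind this commit describes could be sketched roughly as follows. This is an illustration, not the plugin's code: the `handle` and `serve` functions, the tool descriptions, and the empty input schemas are assumptions. Only the three tool names come from the commit message. The sketch assumes the usual MCP stdio framing of one JSON-RPC message per line and omits the `initialize` handshake and `tools/call`.

```python
import json
import sys
from typing import Optional

# Tool names come from the commit message; descriptions and schemas are placeholders.
TOOLS = [
    {"name": name, "description": text, "inputSchema": {"type": "object", "properties": {}}}
    for name, text in [
        ("skillopt_list_configs", "List available configs (placeholder description)."),
        ("skillopt_train", "Run scripts/train.py (placeholder description)."),
        ("skillopt_eval", "Run scripts/eval_only.py (placeholder description)."),
    ]
]


def handle(request: dict) -> Optional[dict]:
    """Answer one JSON-RPC request; return None for notifications (no id)."""
    if "id" not in request:
        return None
    method = request.get("method")
    if method == "tools/list":
        return {"jsonrpc": "2.0", "id": request["id"], "result": {"tools": TOOLS}}
    # Anything this sketch does not implement gets the standard JSON-RPC error.
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "error": {"code": -32601, "message": f"Method not found: {method}"},
    }


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read newline-delimited JSON-RPC from stdin and write responses to stdout."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed input in this sketch
        response = handle(request)
        if response is not None:
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
```

Calling `serve()` from a script entry point would turn this into a process an MCP client can spawn; a real server would dispatch `tools/call` to the training and eval scripts.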
Add CopilotCliBackend that drives the GitHub Copilot CLI in non-interactive mode (copilot -p ... --output-format json) and parses the JSONL event stream for assistant.message content. Registered as the 'copilot' backend (with aliases) and wired through the CLI, config, experiment harness, and the Copilot MCP server's backend enum.
- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and custom instructions disabled, so user MCP servers are not spawned per call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.
Validated end-to-end on real held-out tasks (researcher persona: 0.42 -> 1.00 lift; gate correctly rejects non-improving edits). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
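The two mechanics in this commit, forced UTF-8 decoding and JSONL parsing, can be illustrated together with a stand-in child process instead of the real CLI. This is a hedged sketch: the event layout (`type` plus `data.content`), the newline used to join messages, and the names `parse_jsonl_response` and `run_stub_cli` are assumptions for illustration, not the Copilot CLI's documented schema or the PR's actual code.

```python
import json
import subprocess
import sys


def parse_jsonl_response(stdout: str) -> str:
    """Join the content of assistant.message events, skipping everything else."""
    parts = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # junk line (banner, log noise): skip it
        if not isinstance(event, dict) or event.get("type") != "assistant.message":
            continue
        # ASSUMPTION: the message text lives at event["data"]["content"].
        data = event.get("data")
        content = data.get("content") if isinstance(data, dict) else None
        if isinstance(content, str) and content:
            parts.append(content)
    return "\n".join(parts)  # ASSUMPTION: messages are joined with a newline


# Stand-in for the CLI: emits UTF-8 JSONL containing U+201D, whose last
# UTF-8 byte is 0x9d (undefined in Python's cp1252 codec).
STUB = r'''
import sys
lines = [
    '{"type": "session.start"}',
    'not json at all',
    '{"type": "assistant.message", "data": {"content": "He said \u201dhi\u201d"}}',
    '{"type": "assistant.message", "data": {"content": "done"}}',
]
sys.stdout.buffer.write("\n".join(lines).encode("utf-8"))
'''


def run_stub_cli() -> str:
    """Run the stub with explicit UTF-8 decoding instead of the locale default."""
    proc = subprocess.run(
        [sys.executable, "-c", STUB],
        capture_output=True,
        encoding="utf-8",   # do not rely on text=True, which uses the locale codec
        errors="replace",   # a stray bad byte becomes U+FFFD instead of raising
    )
    return proc.stdout
```

With these definitions, `parse_jsonl_response(run_stub_cli())` yields the two assistant messages and ignores the session event and the junk line, regardless of the platform's default code page.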
…d home) Covers _parse_jsonl_response (multi-message concat, junk-line skipping, empty/non-assistant events), get_backend alias resolution, and the isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
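The alias resolution and isolated-home behavior that these tests cover might look roughly like the sketch below. The alias names and environment variable names are taken from this PR; everything else is an assumption made for illustration: the helper names `canonical_backend` and `build_copilot_env`, the case-insensitive matching, the temp-directory prefix, and the exact precedence between the override variables.

```python
import os
import tempfile
from typing import Mapping, Optional

# Aliases listed in the PR description; all map to the "copilot" backend.
_COPILOT_ALIASES = {"copilot", "github_copilot", "copilot_cli", "gh_copilot"}


def canonical_backend(name: str) -> str:
    """Map any Copilot alias to 'copilot'; other names pass through unchanged.

    ASSUMPTION: matching ignores case and surrounding whitespace.
    """
    key = name.strip().lower()
    return "copilot" if key in _COPILOT_ALIASES else key


def build_copilot_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the environment for one CLI call, with an isolated COPILOT_HOME.

    ASSUMPTIONS: SKILLOPT_SLEEP_COPILOT_FULL_ENV=1 disables isolation entirely,
    and SKILLOPT_SLEEP_COPILOT_HOME, when set, replaces the temporary directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    if env.get("SKILLOPT_SLEEP_COPILOT_FULL_ENV") == "1":
        return env  # opt-out: keep the user's normal Copilot home and MCP servers
    home = env.get("SKILLOPT_SLEEP_COPILOT_HOME")
    if not home:
        # Hypothetical prefix; an empty home means no user MCP config is found.
        home = tempfile.mkdtemp(prefix="skillopt-copilot-home-")
    env["COPILOT_HOME"] = home
    return env
```

Keeping both helpers free of subprocess calls is what makes them testable as "pure logic, no CLI required".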
…shims Adds honest tool-call detection for CopilotCliBackend, mirroring the Claude/Codex backends. Writes per-tool executable shims into the work dir and detects real invocations from a calllog (not self-reported markers). The Copilot backend is Windows-validated, so shims are cross-platform: a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME, --allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix. Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware stub stands in for the CLI, runs the shim, and asserts both JSONL parsing and log-based detection. Validated live on Windows (real Copilot call) and on Linux/WSL (POSIX path). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
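The shim-plus-calllog idea can be sketched as below: each tool becomes a tiny executable that appends its name to a log, and detection reads the log rather than trusting the model's own account. This is a simplified illustration, not the PR's implementation. The function names, the log line format, and the use of a plain `/bin/sh` script (the PR describes a bash shim) are assumptions.

```python
import os
import stat


def write_tool_shim(work_dir: str, tool: str, calllog: str) -> str:
    """Write an executable shim for `tool` that records each call in `calllog`."""
    if os.name == "nt":
        path = os.path.join(work_dir, tool + ".cmd")
        # Batch shim: append "<tool> <args>" to the log.
        body = f'@echo off\r\n>>"{calllog}" echo {tool} %*\r\n'
    else:
        path = os.path.join(work_dir, tool)
        # POSIX shim: same log line, written by a shell one-liner.
        body = f'#!/bin/sh\necho "{tool} $*" >> "{calllog}"\n'
    with open(path, "w", newline="") as fh:  # newline="" keeps \r\n verbatim
        fh.write(body)
    if os.name != "nt":
        # Equivalent of chmod u+x so the shim can be executed directly.
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def tools_called(calllog: str) -> set:
    """Return the set of tool names that actually ran, according to the log."""
    if not os.path.exists(calllog):
        return set()  # no log file means no shim was ever executed
    with open(calllog, encoding="utf-8", errors="replace") as fh:
        return {line.split()[0] for line in fh if line.strip()}
```

Because the log is only written when a shim process really runs, a response that merely claims to have used a tool leaves `tools_called` empty.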
… azure_openai) The advertised backend choices in scripts/train.py use 'azure_openai', not 'openai'; align the inputSchema description hint accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What this adds (and how it differs from existing Copilot support)
The repo already ships a Copilot MCP client plugin (`plugins/copilot/mcp_server.py`) that lets the Copilot CLI trigger a SkillOpt-Sleep cycle. Until now, however, the cycle's actual LLM work could only run on the `mock`, `claude`, or `codex` backends. There was no `CopilotCliBackend`, so Copilot could orchestrate a sleep cycle but not run one on itself.
This PR adds Copilot as a model backend (`backend="copilot"`), closing that loop: the cycle's backend choices were previously `mock | claude | codex` only and now include `backend="copilot"`.
It is also the only Sleep CLI backend that is explicitly Windows-safe (UTF-8 decoding + cross-platform tool shims), where `claude`/`codex` are Unix-oriented.
Summary
Adds first-class GitHub Copilot support to SkillOpt, in two independent pieces:
- `CopilotCliBackend` for SkillOpt-Sleep: lets the self-evolution engine run its attempt / judge / reflect loop on the GitHub Copilot CLI, alongside the existing Claude and Codex CLI backends.
- Research-engine MCP plugin (`plugins/copilot/skillopt/`): exposes `skillopt_list_configs` / `skillopt_train` / `skillopt_eval` to Copilot. This is a new, separate plugin from the existing SkillOpt-Sleep MCP server (`plugins/copilot/mcp_server.py`): it drives the research training/eval scripts rather than the Sleep cycle.
Why
SkillOpt-Sleep already supported `claude` and `codex` CLI backends, and a Copilot MCP client plugin already let Copilot trigger a cycle, but Copilot itself could not be the backend that runs the cycle. This extends the same `CliBackend` contract to the GitHub Copilot CLI so a Copilot user can drive validation-gated skill optimization end-to-end on Copilot, without switching tools.
What changed
Copilot backend (`skillopt_sleep/backend.py`)
- `resolve_copilot_path()` + `CopilotCliBackend`, registered in `get_backend()` with aliases `copilot`, `github_copilot`, `copilot_cli`, `gh_copilot`. (Upstream `backend.py` had no Copilot backend; `--backend` choices were `mock | claude | codex`.)
- Runs `copilot -p <prompt> --output-format json --stream off --no-color --log-level none --allow-all-tools -C <tempdir>`, parsing `assistant.message` events from the JSONL stream. (`-s`/`--silent` returns empty stdout on Windows, so JSONL parsing is required.)
- `attempt_with_tools`: honest tool-call detection mirroring the Claude/Codex backends. It writes per-tool executable shims into the work dir and detects real invocations from a calllog (not self-reported markers). Shims are cross-platform: a `.cmd` batch shim on Windows and a `chmod`'d bash shim on POSIX, with an OS-specific tool hint.
- `encoding="utf-8", errors="replace"` on the subprocess: `text=True` decodes as cp1252 on Windows and crashes on Copilot's UTF-8 output (byte `0x9d`).
- Isolated `COPILOT_HOME` (no user MCP servers spawned, which avoids recursively launching SkillOpt's own MCP servers), plus `--disable-builtin-mcps` and `--no-custom-instructions`. Auth is unaffected (stored in the OS credential store).
- Overrides: `SKILLOPT_SLEEP_COPILOT_HOME`, `SKILLOPT_SLEEP_COPILOT_MODEL`, `SKILLOPT_SLEEP_COPILOT_FULL_ENV=1`.
- `copilot` added to `--backend` choices in `__main__.py` and `experiments/run_experiment.py`; the config comment and the existing Sleep MCP server's backend enum are updated to match.
Research-engine MCP plugin (`plugins/copilot/skillopt/`)
- `mcp_server.py` driving `scripts/train.py` / `scripts/eval_only.py`, plus `mcp-config.example.json`, an instructions snippet, and a README. Same structure as the existing `plugins/copilot/` SkillOpt-Sleep server, but for the research loops. (No research-engine MCP plugin existed upstream.)
Tests (`tests/test_sleep_engine.py`)
`TestCopilotBackend` (7 tests, no real CLI required):
- `_parse_jsonl_response`: multi-message concat, junk-line skipping, empty / non-assistant events
- `get_backend` alias resolution
- `attempt_with_tools` honest detection via an OS-aware stub (runs on both Windows and Linux)
Validation
- `ruff check` reports 0 new findings (pre-existing findings on `main` are unchanged).
- `experiments/run_experiment.py` with the copilot backend: researcher persona showed a 0.42 → 1.00 lift; the programmer persona's validation gate correctly rejected non-improving edits.
- `attempt_with_tools` verified live: real Copilot call on Windows (tool actually invoked, detected from the calllog) and the POSIX path on Linux/WSL.
Notes
The repo has two backend systems: the `skillopt/model/` system (async `ModelBackend`) and the SkillOpt-Sleep `skillopt_sleep/backend.py` system (sync `CliBackend`). This PR's Copilot backend follows the Sleep `CliBackend` pattern (alongside `ClaudeCliBackend` / `CodexCliBackend`); the research-engine MCP plugin simply shells out to the existing `scripts/` entry points.
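To make the synchronous CLI invocation listed under "What changed" concrete, the command could be assembled as an argv list and run in one blocking call, roughly as sketched here. The flags are the ones quoted in this PR description. The function names `build_copilot_argv` and `run_copilot`, the flag ordering, the default `copilot_path`, and the timeout are assumptions, and the sketch does not claim to match `CopilotCliBackend`'s real signature.

```python
import subprocess
from typing import Mapping, Optional


def build_copilot_argv(prompt: str, work_dir: str, copilot_path: str = "copilot") -> list:
    """Assemble the non-interactive Copilot CLI command quoted in the PR."""
    return [
        copilot_path,
        "-p", prompt,
        "--output-format", "json",   # JSONL event stream on stdout
        "--stream", "off",
        "--no-color",
        "--log-level", "none",
        "--allow-all-tools",
        "--disable-builtin-mcps",    # startup trimming described in the PR
        "--no-custom-instructions",
        "-C", work_dir,
    ]


def run_copilot(prompt: str, work_dir: str, env: Optional[Mapping[str, str]] = None,
                timeout: float = 300.0) -> str:
    """Blocking (sync) call that returns raw stdout for later JSONL parsing.

    ASSUMPTION: the 300 s timeout is illustrative, not a value from the PR.
    """
    proc = subprocess.run(
        build_copilot_argv(prompt, work_dir),
        capture_output=True,
        encoding="utf-8",    # avoid the cp1252 default on Windows
        errors="replace",
        env=None if env is None else dict(env),
        timeout=timeout,
    )
    return proc.stdout
```

Passing the prompt as a separate argv element, rather than building a shell string, avoids quoting problems on both Windows and POSIX.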