An AI agent that plays Age of Empires 2: Definitive Edition using a two-tier LLM architecture: a Sonnet strategist reads the resource bar via local OCR and sets goals, and a Sonnet executor reads YOLO entity detections and executes actions. Both LLM tiers are text-only: no image is ever sent to Claude.
Screenshot → YOLO Detection → Entity List (text)
Screenshot → Local OCR (RapidOCR) → Resource Readings (text)
↓
Resource Readings → Strategist (Sonnet, text) → Goals
↓
Entity List + Goals + Resources → Executor (Sonnet, text) → Actions
↓
Mouse/Keyboard
Two-model design:
| Role | Model | Input | Output | Frequency |
|---|---|---|---|---|
| Strategist | claude-sonnet-4-6 | Text (resources via local OCR) + game state | Goals + resource readings | Every 10 turns, or on alarm |
| Executor | claude-sonnet-4-6 | Text only (entities, goals, resources) | Mouse/keyboard actions | Every turn |
The executor runs Sonnet (moved from Haiku for more reliable instruction-following) with a per-call effort knob (default low) for speed. Routine turns take a single-shot structured call; combat/housing turns take an agentic tool loop.
The executor never sees screenshots. All visual information comes from YOLO entity detection (text list of class/position/confidence) and the strategist's cached resource readings.
Each iteration (~3-5 seconds):
- Capture — Screenshot the game window via mss
- Detect — Run YOLO v5 on screenshot → list of entities with IDs, classes, positions
- Classify ownership — Color-based blue-dominance check on military units (own vs enemy)
- Alarm check — Scan for enemy military → inject emergency defense goals if found
- Strategist (periodic) — reads resources from the bar via local OCR (RapidOCR), then Sonnet creates/updates goals from that text
- Build context — Assemble text: entities + goals + resources + memory + game knowledge
- Execute — Sonnet reads text context, returns structured actions (Pydantic-validated)
- Act — Execute mouse clicks / keyboard presses via pyautogui
- Remember — Update memory, evaluate goal progress, compute rewards
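The iteration above can be sketched as a small loop body. This is a minimal illustration and not the repo's code: `LoopState`, `run_iteration`, and the injected `detect` / `strategist` / `executor` callables are hypothetical names, and the `owner` key on entities is an assumed field. Only the interval of 10 and the "alarm wakes the strategist early" rule come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    turn: int = 0
    goals: list = field(default_factory=list)
    log: list = field(default_factory=list)


def run_iteration(
    state: LoopState,
    detect: Callable[[], list],
    strategist: Callable[[list], list],
    executor: Callable[[str], list],
    strategist_interval: int = 10,
) -> list:
    """One turn: detect -> alarm check -> (maybe) strategist -> executor."""
    entities = detect()
    # Alarm: any enemy-owned entity forces an early strategist wake-up.
    alarm = any(e.get("owner") == "enemy" for e in entities)
    if alarm or state.turn % strategist_interval == 0:
        state.goals = strategist(entities)
    # The executor only ever receives text, never pixels.
    context = f"entities={len(entities)} goals={state.goals}"
    actions = executor(context)
    state.log.append(context)
    state.turn += 1
    return actions
```

Passing the three stages in as callables keeps the sketch runnable without a game, a detector, or an API key; a real loop would also capture the screenshot, execute the actions, and update memory.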
- Windows 10/11 with AoE2:DE installed
- Python 3.11+ (x64, not ARM64)
- Anthropic API key
The repo is a polyglot monorepo:
- Python (uv workspace) — 9 workspace members. Deployable units live in `apps/` (agent, api, arena, autoresearch, detection-server); reusable libraries live in `packages/` (core, data, detection, evaluation). One `uv sync` installs every member editable into a shared `.venv/`, plus the dev dependency-group (ruff, basedpyright, pytest, fakeredis, hypothesis, pre-commit).
- TypeScript (bun workspace) — two frontend apps (`apps/dashboard` Vite/React, `apps/landing` Astro) share a single root `bun.lock` and hoisted `node_modules/`. One `bun install` from the repo root resolves both apps.
# Install uv: https://docs.astral.sh/uv/getting-started/installation/
uv sync # everything (members + dev)
uv sync --no-dev # slim install, no dev tools
# Or run a specific package's entry point directly:
uv run --package gameplay-agent aoe2-agent
uv run --package arena aoe2-arena race
uv run --package detection-server aoe2-server --model <path>
Per-package optional extras (e.g. aoe2_yolo_v5.mlpackage builds need CoreML,
detection training needs ultralytics) are declared on each package's own
pyproject.toml. Reach them via uv sync --all-extras.
The project reads configuration from environment variables. For local development we keep
them in a gitignored .env file (loaded by docker compose automatically, and by the
agent when launched via just agent). A documented template lives at env.example.
cp env.example .env # then edit .env and fill in the values below
At minimum, set ANTHROPIC_API_KEY — that's all the gameplay agent needs. Every other
variable in env.example is for the Synthetic Arena infrastructure (Langfuse + MinIO +
ClickHouse + Redis + Postgres) and is only consumed by just arena-infra-up. If you're
not running the arena stack yet, leaving those blank is fine.
# Windows VM
set ANTHROPIC_API_KEY=your-key-here
# macOS / Linux
export ANTHROPIC_API_KEY=your-key-here
| Env Var | Default | Purpose |
|---|---|---|
| ANTHROPIC_API_KEY | — | Claude API authentication (required) |
| AOE2_MODEL | claude-sonnet-4-6 | Executor model |
| AOE2_EXECUTOR_EFFORT | low | Executor effort (low/medium/high) |
| AOE2_STRATEGIST_MODEL | claude-sonnet-4-6 | Strategist model |
| AOE2_STRATEGIST_INTERVAL | 10 | Run strategist every N turns |
| AOE2_LOOP_DELAY | 0.3 | Seconds between iterations |
| AOE2_SAVE_SCREENSHOTS | true | Save screenshots to logs/ |
| AOE2_OCR_BACKEND | rapidocr | Resource-bar OCR backend (rapidocr/template/tesseract) |
| AOE2_DETECTION_HOST | — | Remote detection server URL (e.g., http://192.168.64.1:8420) |
| AOE2_TEMPERATURE | 0.0 | Anthropic Messages API temperature (lowest variance) |
| AOE2_SEED | — | Local-RNG seed (build-retry jitter); unset = OS entropy |
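As a rough illustration of how the table's defaults could be applied, the sketch below reads the documented variables into a frozen dataclass. `AgentConfig` and `load_config` are hypothetical names, and the boolean parsing for `AOE2_SAVE_SCREENSHOTS` is an assumption; the repo's actual loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentConfig:
    api_key: str
    model: str
    executor_effort: str
    strategist_model: str
    strategist_interval: int
    loop_delay: float
    save_screenshots: bool
    ocr_backend: str
    detection_host: Optional[str]
    temperature: float
    seed: Optional[int]


def load_config(env: Optional[Mapping[str, str]] = None) -> AgentConfig:
    """Read the documented variables, falling back to the table's defaults."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is required")
    effort = env.get("AOE2_EXECUTOR_EFFORT", "low")
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unsupported effort: {effort!r}")
    seed = env.get("AOE2_SEED")
    # Assumption: common truthy spellings; the real parser may differ.
    save = env.get("AOE2_SAVE_SCREENSHOTS", "true").lower() in ("1", "true", "yes")
    return AgentConfig(
        api_key=api_key,
        model=env.get("AOE2_MODEL", "claude-sonnet-4-6"),
        executor_effort=effort,
        strategist_model=env.get("AOE2_STRATEGIST_MODEL", "claude-sonnet-4-6"),
        strategist_interval=int(env.get("AOE2_STRATEGIST_INTERVAL", "10")),
        loop_delay=float(env.get("AOE2_LOOP_DELAY", "0.3")),
        save_screenshots=save,
        ocr_backend=env.get("AOE2_OCR_BACKEND", "rapidocr"),
        detection_host=env.get("AOE2_DETECTION_HOST") or None,
        temperature=float(env.get("AOE2_TEMPERATURE", "0.0")),
        seed=int(seed) if seed else None,
    )
```

Accepting a mapping instead of reading `os.environ` directly makes the defaults easy to check in a test without touching the process environment.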
The arena runs every event through an EventBroker (see
docs/design/event-broker-architecture.md). The default in-process broker has
zero external dependencies and is fine for single-machine work — local
development, CI, and just arena-smoke all use it implicitly.
For cross-process replay (a producer in one process feeding consumers in another — e.g. the FastAPI web server live-tailing a CLI race), switch to the Redis backend:
| Env Var | Default | Purpose |
|---|---|---|
| ARENA_BROKER_BACKEND | inprocess | inprocess or redis. Read once at make_broker(); any other value raises ValueError. |
| REDIS_URL | see below | Explicit connection URL when backend is redis. Takes priority over the smart default. |
| REDIS_PASSWORD | unset | If set (and REDIS_URL is not set), the default becomes redis://:${REDIS_PASSWORD}@localhost:6379/0 — the compose-stack path. Otherwise the default is bare redis://localhost:6379/0. |
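The precedence described in the table can be written down in a few lines. This is a sketch of the documented rules only; `broker_backend` and `resolve_redis_url` are hypothetical helper names, not the functions behind `make_broker()`.

```python
from typing import Mapping


def broker_backend(env: Mapping[str, str]) -> str:
    """Return the backend name, rejecting anything other than the two documented values."""
    backend = env.get("ARENA_BROKER_BACKEND", "inprocess")
    if backend not in ("inprocess", "redis"):
        raise ValueError(f"unknown ARENA_BROKER_BACKEND: {backend!r}")
    return backend


def resolve_redis_url(env: Mapping[str, str]) -> str:
    """REDIS_URL wins; otherwise build the default, with auth if REDIS_PASSWORD is set."""
    explicit = env.get("REDIS_URL")
    if explicit:
        return explicit
    password = env.get("REDIS_PASSWORD")
    if password:
        return f"redis://:{password}@localhost:6379/0"
    return "redis://localhost:6379/0"
```

The password is interpolated as-is, which is safe only for passwords without URL-reserved characters, such as those produced by stripping `=+/` from base64 output.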
# Bring up the compose stack (provides Redis with REDIS_PASSWORD auth):
just arena-infra-up
# Point the agent at it. REDIS_PASSWORD is already in .env from the
# compose setup; the broker will auto-build the URL with it:
export ARENA_BROKER_BACKEND=redis
just arena-smoke # or any other arena CLI
# Or set REDIS_URL explicitly for non-compose Redis (managed cloud, remote
# host, etc.):
export REDIS_URL="redis://:secret@redis.example.com:6380/0"
Install the Redis client when using this backend (fakeredis covers CI tests;
real-Redis local work needs the broker-redis extra):
pip install -e ".[broker-redis]"
Determinism is asymptotic — per arxiv 2408.04667, expect ~5–12% per-decision variance even with temperature=0. Promise statistical replay over N trials, not byte-identical traces.
Three knobs to make runs as reproducible as Anthropic and the Python stack allow:
- `AOE2_TEMPERATURE=0.0` (default) is Anthropic's lowest-variance temperature. Raise it (e.g. `0.7`) for output diversity at the cost of reproducibility.
- `AOE2_SEED=<int>` seeds the local RNG used in `executor.py`'s build-retry jitter and Phase 1's `world_sim.render()` default fallback. Two runs with the same seed produce the same RNG sequence. Leave unset to get today's stochastic behavior (OS entropy). Not passed to the Anthropic API — `messages.create()` doesn't accept `seed=` as of late 2025; this is purely for the local code paths.
- Pin model snapshots. Set `AOE2_MODEL` and `AOE2_STRATEGIST_MODEL` to a dated snapshot (e.g. `claude-sonnet-4-6-2026-XX-XX`) rather than the floating family alias. Floating tags can move under you between runs.
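The local-seed behavior can be illustrated with the standard library alone. The helper names `make_rng` and `retry_jitter`, and the jitter range, are assumptions for the sake of the example; only the "same seed, same sequence; unset means OS entropy" contract comes from the text.

```python
import random
from typing import Optional


def make_rng(seed_text: Optional[str]) -> random.Random:
    """Build a private RNG from the AOE2_SEED value (a string or None)."""
    if not seed_text:
        # No seed: random.Random() seeds itself from OS entropy when available.
        return random.Random()
    return random.Random(int(seed_text))


def retry_jitter(rng: random.Random, base: float = 0.1, spread: float = 0.05) -> float:
    """Hypothetical build-retry delay: a base wait plus a small random offset."""
    return base + rng.uniform(0.0, spread)
```

Using a dedicated `random.Random` instance, instead of the module-level functions, keeps the seeded sequence from being disturbed by other code that also draws random numbers.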
Required only when bringing up the Docker stack (just arena-infra-up). The seven
generated secrets in the table below (every row whose value comes from an openssl
command) must be set to non-empty values, because Langfuse refuses to boot with empty
secrets. Never commit the populated .env (it's gitignored).
Prerequisites:
- A running Docker daemon (Docker Desktop, OrbStack, or compatible). The current `docker-compose.yml` is tested against OrbStack on macOS; if OrbStack is installed but not running, start it first with `orb start`, otherwise `just arena-infra-up` fails with `dial unix .../docker.sock: no such file or directory`.
- At least ~10 GiB of free disk before the first pull. The full stack (langfuse, postgres, clickhouse, minio, redis, otel-collector) is ~7 GB of images plus a few GB of volumes. Pulls that run out of space mid-extraction leave Docker's layer database in an inconsistent state (`failed to register layer: file exists`), which then poisons all subsequent pulls; see the troubleshooting section below.
| Env Var | How to generate | Notes |
|---|---|---|
| LANGFUSE_SALT | openssl rand -base64 32 | Password-hashing salt inside Langfuse |
| LANGFUSE_NEXTAUTH_SECRET | openssl rand -base64 32 | Session-cookie signing secret |
| LANGFUSE_ENCRYPTION_KEY | openssl rand -hex 32 | Must be 32-byte hex. Encrypts API keys at rest |
| LANGFUSE_DB_PASSWORD | openssl rand -base64 24 \| tr -d '=+/' | Postgres password (no special chars; ends up in a DATABASE_URL) |
| CLICKHOUSE_PASSWORD | openssl rand -base64 24 \| tr -d '=+/' | ClickHouse default user password |
| REDIS_PASSWORD | openssl rand -base64 24 \| tr -d '=+/' | Redis AUTH password |
| MINIO_ROOT_USER | Pick a username (default in env.example: arena) | MinIO admin user |
| MINIO_ROOT_PASSWORD | openssl rand -base64 24 \| tr -d '=+/' | Min 8 chars; MinIO rejects shorter |
| OTEL_EXPORTER_OTLP_ENDPOINT | Leave as http://localhost:4318 | Where the native agent sends OTLP spans |
One-shot generator — paste this once after copying env.example to .env, then fill the
output back into .env (or pipe directly with a script of your choosing):
{
echo "LANGFUSE_SALT=$(openssl rand -base64 32)"
echo "LANGFUSE_NEXTAUTH_SECRET=$(openssl rand -base64 32)"
echo "LANGFUSE_ENCRYPTION_KEY=$(openssl rand -hex 32)"
echo "LANGFUSE_DB_PASSWORD=$(openssl rand -base64 24 | tr -d '=+/')"
echo "CLICKHOUSE_PASSWORD=$(openssl rand -base64 24 | tr -d '=+/')"
echo "REDIS_PASSWORD=$(openssl rand -base64 24 | tr -d '=+/')"
echo "MINIO_ROOT_PASSWORD=$(openssl rand -base64 24 | tr -d '=+/')"
}
Verify the stack accepts the values:
just arena-infra-up # docker compose up -d --wait
just arena-infra-status # every service should be "healthy"
Langfuse UI lands at http://localhost:3000; MinIO console at http://localhost:9001.
- `langfuse-web` and `langfuse-worker` healthchecks target `http://$(hostname):PORT/...` rather than `http://localhost:...`. The Langfuse v3 image starts Next.js with `-H $(hostname)`, which binds Next.js to the container's external interface only — `localhost` returns `Connection refused`. Use the `CMD-SHELL` form (with `$$(hostname)` to escape compose interpolation) if you adjust these.
- `otel-collector` runs `healthcheck: disable: true` because the upstream image (`otel/opentelemetry-collector-contrib`) is distroless: no shell, no `wget`, no `busybox` — any in-container probe fails with `OCI runtime exec failed: ... no such file or directory`. The collector logs `"Everything is ready"` itself once started, and nothing in the stack `depends_on` its health.
| Symptom (from just arena-infra-up) | Likely cause | Fix |
|---|---|---|
| dial unix .../docker.sock: no such file or directory | Docker daemon not running | orb start (OrbStack) or launch Docker Desktop |
| failed to register layer: rename .../tmp/write-set-N .../sha256/<hex>: file exists | Orphan layers in layerdb from a previously-killed pull (usually caused by disk filling up mid-extraction) | First free disk: docker image prune -a -f. Then orb restart docker and retry. If errors persist on different SHAs, the daemon has multiple orphan chain-ids whose diff digest isn't referenced by any image manifest — prune won't remove them because they're unreachable from any image. The reliable fix is to compute the set of reachable chain-ids from all manifests in imagedb/content/sha256/ (fold rootfs.diff_ids as chain[i] = sha256("sha256:<chain[i-1]> sha256:<diff[i]>")) and rm -rf the unreachable entries from layerdb/sha256/ plus their cache-id overlay2 backings. Last-resort: orb delete -f docker (nukes all Docker data) |
| unexpected EOF mid-pull | Either flaky network or the Docker daemon crashed (often disk pressure) | Check df -h and orb status — if OrbStack went to Stopped, the daemon died; bring it back with orb start and free disk before retrying |
| dependency failed to start: container ... is unhealthy | A service started but its healthcheck never goes green | docker logs <container> to see if the app is actually up. If yes, the healthcheck itself is wrong (wrong port, wrong host, missing tooling in image) — inspect with docker inspect <container> --format '{{json .State.Health}}' |
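The chain-id fold mentioned in the orphan-layer fix can be sketched as follows. The first chain id equals the first diff id, and each later one hashes the previous chain id and the next diff id joined by a space. The function names and the assumption that `layerdb/sha256/` entries are bare hex digests are illustrative; the sketch only computes the unreachable set and deletes nothing.

```python
import hashlib


def chain_ids(diff_ids: list) -> list:
    """Fold rootfs.diff_ids ('sha256:<hex>' strings) into chain ids."""
    chain = []
    for diff in diff_ids:
        if not chain:
            chain.append(diff)  # chain[0] is the first diff id itself
        else:
            payload = f"{chain[-1]} {diff}".encode("ascii")
            chain.append("sha256:" + hashlib.sha256(payload).hexdigest())
    return chain


def unreachable_layers(images: list, layerdb_entries: set) -> set:
    """Return layerdb entries (bare hex) not reachable from any image's diff_ids."""
    reachable = set()
    for diff_ids in images:
        reachable.update(cid.removeprefix("sha256:") for cid in chain_ids(diff_ids))
    return layerdb_entries - reachable
```

Here `images` would be one `rootfs.diff_ids` list per image config, and `layerdb_entries` the directory names found under `layerdb/sha256/`; reviewing the returned set before removing anything is the safer workflow.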
# Run the agent
just agent
# Run N iterations
just agent --iterations 20
# Single test iteration (no action execution)
just agent --test
# Run the detection server (macOS host)
just server --model packages/detection/src/inference/models/aoe2_yolo_v5.onnx
# Frontend dev servers (workspace-wide install happens once at repo root)
bun install
just arena-ui-dev # apps/dashboard (Vite + React, dashboard SPA)
just landing-dev # apps/landing (Astro docs site)
# Autoresearch: timed experiment with metrics
uv run --package autoresearch python -m autoresearch.game_runner --time-budget 600 --description "test run"
agent/ # monorepo root (uv + bun workspaces)
├── pyproject.toml # uv workspace declaration + shared tool config
├── package.json # bun workspace declaration (apps/dashboard, apps/landing)
├── bun.lock # single root lockfile for the JS workspace
├── justfile # cross-language task runner
├── docs/ # Architecture chapters, ADRs, runbooks
├── tests/ # Cross-package integration tests
├── apps/ # Deployable units (services, CLIs, frontends)
│ ├── agent/ # Python — real-game loop + providers + scenarios (was packages/gameplay-agent)
│ ├── api/ # Python — FastAPI + SSE backend for replay/fork (was packages/arena-web)
│ ├── arena/ # Python — synthetic arena CLI (race / smoke / rank)
│ ├── autoresearch/ # Python — prompt-optimization loop
│ ├── dashboard/ # TypeScript — Vite + React arena replay UI (was ui/)
│ ├── detection-server/ # Python — macOS-hosted YOLO inference endpoint
│ └── landing/ # TypeScript — Astro docs site (was web/)
├── packages/ # Reusable libraries (imported by apps)
│ ├── core/ # Pure types: Event, Payload, WorldState, DetectedEntity
│ ├── data/ # AoE2 game-knowledge SQLite DB
│ ├── detection/ # YOLO inference, training, labeling, SLD extraction
│ └── evaluation/ # Event broker, DuckDB persister, world sim, fork
└── logs/ # Runtime: screenshots, goal logs, DuckDB event files
Each Python workspace member has its own pyproject.toml declaring deps + optional
extras. The dependency graph is one-way (apps/ → packages/, never the reverse),
enforced by uv at install time via [tool.uv.sources] in the workspace root. uv
resolves members by package name, not by directory path — so internal
dependencies like dependencies = ["core", "detection"] work regardless of which
subdirectory the member lives in.
For the full chapter-level walkthrough see docs/index.md.
The strategist creates prioritized goals (e.g., "Reach 10 population", "Advance to Feudal Age"). The executor receives these as context and follows them in priority order. Goals have:
- Type: local (complete quickly) or global (long-term)
- Metric: population, food, wood, gold, stone, age
- Priority: 1-10 (10 = most urgent)
- Progress: 0.0-1.0, auto-computed from game state
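A goal with these four attributes might look like the dataclass below. This is a sketch under assumptions: the field names, the `target` value, and the "current divided by target, clamped to 0..1" progress rule are illustrative guesses at how progress could be auto-computed, and the categorical `age` metric is assumed to be mapped to a number elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    description: str
    goal_type: str  # "local" (complete quickly) or "global" (long-term)
    metric: str     # population, food, wood, gold, stone, age
    target: float   # hypothetical: the metric value that counts as done
    priority: int   # 1-10, 10 = most urgent

    def progress(self, game_state: dict) -> float:
        """Assumed rule: current / target, clamped to the 0.0-1.0 range."""
        if self.target <= 0:
            return 1.0
        current = float(game_state.get(self.metric, 0))
        return max(0.0, min(1.0, current / self.target))


def order_goals(goals: list) -> list:
    """Highest priority first, matching the order the executor follows."""
    return sorted(goals, key=lambda g: g.priority, reverse=True)
```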
The alarm check scans YOLO detections for 21 enemy military classes. It uses color-based ownership detection (detection/inference/ownership.py) to distinguish own units (blue, Player 1) from enemy units. On enemy detection:
- Injects priority-10 "Defend base" goal
- Triggers early strategist wake-up
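The ownership test and the goal injection can be sketched together. Everything specific here is an assumption: the blue-dominance margin, the three-class `MILITARY_CLASSES` placeholder (the real list has 21 classes), and the dictionary shapes are hypothetical and not taken from `ownership.py`.

```python
from typing import Sequence

# Hypothetical subset; the real detector distinguishes 21 enemy military classes.
MILITARY_CLASSES = {"archer", "knight", "militia"}


def is_own_unit(pixels: Sequence, margin: float = 30.0) -> bool:
    """Blue-dominance check: mean blue must exceed mean red and green by a margin."""
    if not pixels:
        return False
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return b - max(r, g) >= margin


def alarm_goals(entities: list) -> list:
    """Return an emergency goal if any military entity is not blue-dominant."""
    enemies = [
        e for e in entities
        if e["class"] in MILITARY_CLASSES and not is_own_unit(e["pixels"])
    ]
    if not enemies:
        return []
    return [{"description": "Defend base", "priority": 10, "wake_strategist": True}]
```

Each entity carries the RGB pixels cropped from its bounding box; non-military classes such as sheep are ignored regardless of color.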
The detector is a 60-class YOLO v5 model with 92.2% mAP50. Entities persist across frames via IoU tracking (e.g., sheep_0 stays sheep_0). The executor supports 7 action types (click, right_click, press, drag, wait, scroll, detect) and can target entities by class (target_class: "sheep") or by ID (target_id: "sheep_0").
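To show how IoU tracking keeps an ID such as sheep_0 stable, here is a greedy matcher. It is one possible approach and not necessarily the repo's: the 0.3 threshold, the greedy (non-optimal) assignment, and the `track` signature are all assumptions.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track(prev: dict, detections: list, next_index: dict, threshold: float = 0.3) -> dict:
    """Map each (class, box) detection to an existing id or mint a new one."""
    current = {}
    unused = dict(prev)  # previous-frame ids not yet claimed this frame
    for cls, box in detections:
        best_id, best = None, threshold
        for eid, old_box in unused.items():
            if eid.rsplit("_", 1)[0] != cls:
                continue  # only match within the same class
            score = iou(box, old_box)
            if score >= best:
                best_id, best = eid, score
        if best_id is None:
            n = next_index.get(cls, 0)
            best_id = f"{cls}_{n}"
            next_index[cls] = n + 1
        else:
            del unused[best_id]
        current[best_id] = box
    return current
```

`next_index` is carried across frames so that a new entity never reuses the ID of one that has disappeared.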
The detection server offloads YOLO inference to the macOS host's Neural Engine via CoreML (~15ms per tile vs ~1.2s on VM CPU). The agent talks to it over HTTP with automatic fallback to local ONNX.
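The remote-first, local-fallback behavior could look roughly like this. The `/detect` path, the request body format, and the JSON response shape are assumptions made for the example; the real server's API is not described here. `local_detect` stands in for the local ONNX path.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def detect_with_fallback(
    image_bytes: bytes,
    host: Optional[str],
    local_detect: Callable[[bytes], list],
    timeout: float = 2.0,
) -> list:
    """Try the remote detection server first; fall back to local inference on any failure."""
    if host:
        try:
            request = urllib.request.Request(
                f"{host}/detect",  # hypothetical endpoint path
                data=image_bytes,
                headers={"Content-Type": "application/octet-stream"},
                method="POST",
            )
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return json.loads(response.read())
        except (urllib.error.URLError, OSError, ValueError):
            # Unreachable host, timeout, malformed URL, or bad JSON: use local path.
            pass
    return local_detect(image_bytes)
```

With `AOE2_DETECTION_HOST` unset, `host` is `None` and the function goes straight to the local detector.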
Action success/failure is tracked via ActionResult objects returned by the executor. Failed actions (e.g., unresolved target_id) are recorded in memory and fed back to the LLM as context for the next turn.
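The feedback path might be sketched like this. The `ActionResult` fields, the `execute` stub, and the text format fed back to the LLM are hypothetical; only the idea that an unresolved `target_id` becomes a recorded failure comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionResult:
    action: str
    success: bool
    error: Optional[str] = None


def execute(action: dict, entities: dict) -> ActionResult:
    """Resolve the target (if any) and report success or failure; no real input is sent."""
    target_id = action.get("target_id")
    if target_id is not None and target_id not in entities:
        return ActionResult(action["type"], False, f"unresolved target_id: {target_id}")
    return ActionResult(action["type"], True)


def failure_context(results: list, limit: int = 5) -> str:
    """Summarize recent failures as text for the next turn's prompt."""
    failed = [r for r in results if not r.success][-limit:]
    if not failed:
        return "No failed actions last turn."
    return "\n".join(f"FAILED {r.action}: {r.error}" for r in failed)
```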
Autoresearch is an automated experiment framework. It runs timed games, collects metrics (peak population, food gathered, survival time, action success rate), and scores performance for prompt optimization.
The world simulator projects an in-memory WorldState to the same DetectedEntity schema the real YOLO detector emits, so the agent's perception layer can be exercised against a fully deterministic, in-process world with no game, no screenshots, and no model weights. This is the first step of the synthetic-arena buildout (Langfuse + the perception projection live in the same tier).
Two API surfaces:
- `evaluation.world_sim.render(state, width, height, seed=None) -> list[DetectedEntity]` — pure projection. One `town_center` always, `state.population` villagers, one entity per `state.buildings` entry placed on a stable index-based grid. Villagers in `villager_queue` are deliberately excluded (queued for production, not yet on the map). Confidence is `1.0` (ground truth). Same `state` + same dims + same seed ⇒ identical output.
- `detection.inference.mock.mock_detect_from_world(screenshot, id_factory, world_state) -> list[DetectedEntity]` — sibling of `mock_detect()` that delegates to `render()`. Use this where the real-game tier would call `mock_detect()`.
Example:
from evaluation.world_sim import WorldState, render
state = WorldState(
food=200.0, wood=200.0, gold=0.0, stone=0.0,
population=5, pop_cap=25, age="Dark Age",
buildings=["mill"], villager_queue=[], age_up_ticks_remaining=0, turn=0,
)
entities = render(state, 1920, 1080) # list[DetectedEntity], confidence=1.0
Schema lock. tests/test_detector.py::TestSyntheticRenderSchemaContract is parametrized over both mock_detect (legacy) and mock_detect_from_world (new), asserting 10 invariants on each (id non-empty, class_name in canonical YOLO list, bbox well-ordered and within dims, center inside bbox, confidence ∈ [0,1], area > 0, to_dict() keys, sort order, id uniqueness) plus a state-sensitivity test (population=15 yields 7 more villager entities than population=8). Any future drift between the two perception surfaces fails CI.
Real-game tier impact: zero. Nothing in gameplay_agent/ was modified; the existing mock_detect() keeps its frozen Dark-Age fixture behavior. Synthetic-arena callers (Phase 2 and beyond) reach for the new function explicitly.
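A subset of the schema invariants can be expressed as a plain checker. The `Entity` dataclass below is a hypothetical stand-in for `DetectedEntity` (its real fields may differ), and the sketch covers only id, class, bbox, center, and confidence checks, not `to_dict()` keys or sort order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:  # hypothetical stand-in for DetectedEntity
    id: str
    class_name: str
    bbox: tuple      # (x1, y1, x2, y2)
    center: tuple    # (cx, cy)
    confidence: float


def check_invariants(entities: list, width: int, height: int, known_classes: set) -> list:
    """Return a list of human-readable violations; an empty list means all checks passed."""
    problems = []
    seen = set()
    for e in entities:
        x1, y1, x2, y2 = e.bbox
        if not e.id:
            problems.append("empty id")
        if e.id in seen:
            problems.append(f"duplicate id: {e.id}")
        seen.add(e.id)
        if e.class_name not in known_classes:
            problems.append(f"unknown class: {e.class_name}")
        # Well-ordered and inside the frame; strict inequalities also imply area > 0.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(f"bad bbox: {e.id}")
        if not (x1 <= e.center[0] <= x2 and y1 <= e.center[1] <= y2):
            problems.append(f"center outside bbox: {e.id}")
        if not 0.0 <= e.confidence <= 1.0:
            problems.append(f"confidence out of range: {e.id}")
    return problems
```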
See docs/index.md for the full table of contents (8 parts, 23 chapters).
Most useful entry points:
- Real-game agent: Parts 1–4 (game loop, LLM integration, detection, game knowledge).
- Synthetic arena: Part 6 (CLI / broker / ranking / world sim).
- Arena web UI: Part 7 (SSE backend + Vite/React frontend).
- Autoresearch prompt loop: Part 8.
- Why decisions were made: Architecture Decision Records (broker-first, Redis Streams, basedpyright, Bradley-Terry, Vite/React).
- Operational checklists: Runbooks (Redis ops, switching broker backend, debug a stuck fork, Windows VM bring-up).
- Deployment: deployment-guide.md (Mac + Windows VM first-time setup).
- Detection internals: detection/README.md.
MIT