A testing and evaluation framework for Model Context Protocol (MCP) servers. Write deterministic Playwright tests against your MCP tools, or run data-driven eval datasets — including LLM-based evaluation of tool discoverability.
The mcp Playwright fixture connects to your MCP server (stdio or HTTP) and exposes a high-level API for calling tools and asserting responses. Custom matchers keep assertions readable.
```typescript
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

test('read_file returns file contents', async ({ mcp }) => {
  const result = await mcp.callTool('read_file', { path: '/tmp/test.txt' });
  expect(result).toContainToolText('Hello, world');
  expect(result).not.toBeToolError();
});

test('server exposes required tools', async ({ mcp }) => {
  const tools = await mcp.listTools();
  expect(tools.map((t) => t.name)).toContain('read_file');
});
```

Playwright tests are fast, deterministic, and designed for CI. Use them for regression testing, schema validation, and protocol conformance. The framework includes built-in conformance checks for the MCP spec.
Available matchers:
| Matcher | Description |
|---|---|
| `toMatchToolResponse` | Response exactly matches expected value (deep equal) |
| `toContainToolText` | Response contains expected substrings |
| `toMatchToolSchema` | Response validates against a Zod schema |
| `toMatchToolPattern` | Response matches a regex pattern |
| `toMatchToolSnapshot` | Response matches a saved baseline |
| `toBeToolError` | Response is (or is not) an error |
| `toHaveToolResponseSize` | Response size is within bounds |
| `toSatisfyToolPredicate` | Response satisfies a custom function |
| `toHaveToolCalls` | LLM called the expected tools |
| `toHaveToolCallCount` | LLM made N tool calls |
| `toPassToolJudge` | LLM evaluates response quality against a rubric |
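To make the matcher semantics concrete, here is a hypothetical sketch (not the library's implementation) of how substring matching over an MCP tool result could work. The `ToolResult` shape reflects the MCP convention of a `content` array of parts; the `containsToolText` helper is an illustrative assumption:

```typescript
// Hypothetical sketch — not the framework's internals. MCP tool results
// carry a `content` array whose text parts hold the payload.
type ToolResult = {
  content: Array<{ type: string; text?: string }>;
  isError?: boolean;
};

// Concatenate all text parts, then require every expected substring.
function containsToolText(result: ToolResult, ...expected: string[]): boolean {
  const text = result.content
    .filter((part) => part.type === 'text')
    .map((part) => part.text ?? '')
    .join('\n');
  return expected.every((s) => text.includes(s));
}

const result: ToolResult = {
  content: [{ type: 'text', text: 'Hello, world' }],
  isError: false,
};
console.log(containsToolText(result, 'Hello')); // true
console.log(containsToolText(result, 'Goodbye')); // false
```

The real matchers additionally produce readable failure messages and negate cleanly under `.not`; this sketch only shows the core check.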
Eval datasets let you define test cases as JSON files and run them with `runEvalDataset()`. Each case specifies a tool call and one or more assertions.
```json
{
  "name": "file-ops",
  "cases": [
    {
      "id": "read-config",
      "toolName": "read_file",
      "args": { "path": "/tmp/config.json" },
      "expect": {
        "schema": "file-content",
        "containsText": ["version", "name"]
      }
    },
    {
      "id": "read-readme",
      "toolName": "read_file",
      "args": { "path": "/tmp/README.md" },
      "expect": {
        "snapshot": "readme-snapshot"
      }
    }
  ]
}
```

```typescript
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';
import { loadEvalDataset, runEvalDataset } from '@gleanwork/mcp-server-tester';
import { z } from 'zod';

test('file operations eval', async ({ mcp }, testInfo) => {
  const dataset = await loadEvalDataset('./data/evals.json', {
    schemas: { 'file-content': z.object({ content: z.string() }) },
  });
  const result = await runEvalDataset({ dataset }, { mcp, testInfo });
  expect(result.passed).toBe(result.total);
});
```

Supported assertion types:
| Type | Description |
|---|---|
| `containsText` | Response includes expected substrings |
| `schema` | Response validates against a Zod schema |
| `regex` | Response matches a pattern |
| `snapshot` | Response matches a saved baseline |
| `judge` | LLM evaluates response quality against a rubric |
| `toolsTriggered` | LLM called the expected tools (LLM host mode) |
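As a concrete illustration of the deterministic assertion types, here is a hypothetical sketch of how a runner might dispatch `containsText` and `regex` checks against a response's text. `Expectation` and `checkCase` are illustrative names, not framework internals:

```typescript
// Hypothetical sketch — illustrative names only, not the framework's API.
type Expectation = {
  containsText?: string[];
  regex?: string;
};

// Collect human-readable failure messages instead of throwing on the
// first miss, so a report can show every failed assertion per case.
function checkCase(responseText: string, expect: Expectation): string[] {
  const failures: string[] = [];
  for (const s of expect.containsText ?? []) {
    if (!responseText.includes(s)) failures.push(`missing substring: ${s}`);
  }
  if (expect.regex && !new RegExp(expect.regex).test(responseText)) {
    failures.push(`pattern not matched: ${expect.regex}`);
  }
  return failures;
}

const failures = checkCase('{"version":"1.0","name":"app"}', {
  containsText: ['version', 'name'],
  regex: '"version"\\s*:',
});
console.log(failures.length === 0); // true
```

The `schema`, `snapshot`, and LLM-based types follow the same pattern but delegate to Zod validation, baseline files, and API calls respectively.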
In LLM host mode, a real LLM receives your server's tool list and a natural language prompt, then decides which tools to call. This tests whether your tool names, descriptions, and input schemas are clear enough for autonomous use — a different question from whether the tools return correct output.
```json
{
  "id": "find-config",
  "mode": "mcp_host",
  "scenario": "Find the application config file and return its contents",
  "mcpHostConfig": {
    "provider": "anthropic",
    "model": "claude-opus-4-20250514"
  },
  "expect": {
    "toolsTriggered": {
      "calls": [{ "name": "read_file", "required": true }]
    }
  }
}
```

LLM host mode makes real API calls and produces non-deterministic results. Use `iterations` to run a case multiple times and measure pass rate rather than expecting 100% on a single run. See the LLM Host Guide for configuration and cost management.
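The pass-rate idea can be sketched in a few lines. `passRate` and `runOnce` are hypothetical helpers, not part of the framework's API; they only illustrate why thresholding beats a single pass/fail for non-deterministic cases:

```typescript
// Hypothetical sketch: run a non-deterministic check several times and
// report the fraction of passing runs, rather than failing on one miss.
async function passRate(
  runOnce: () => Promise<boolean>,
  iterations: number,
): Promise<number> {
  let passed = 0;
  for (let i = 0; i < iterations; i++) {
    if (await runOnce()) passed++;
  }
  return passed / iterations;
}

// Example: a check that passes 7 of 10 runs clears a 0.5 threshold.
let i = 0;
const flaky = async () => i++ % 10 < 7;
passRate(flaky, 10).then((rate) => {
  console.log(rate >= 0.5); // true — 7 of 10 runs pass
});
```

A threshold chosen per case (and kept well below 100%) keeps CI stable while still catching regressions in tool discoverability.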
Requires Node.js 22+.
```shell
npm install --save-dev @gleanwork/mcp-server-tester @playwright/test zod
```

The Anthropic SDK is only needed for LLM-as-judge assertions or LLM host mode with the Anthropic provider:

```shell
npm install --save-dev @anthropic-ai/sdk
```

```shell
npx mcp-server-tester init
```

The CLI wizard creates a `playwright.config.ts`, example tests, and a sample eval dataset configured for your server. See the CLI Guide for all options.
Point the framework at your MCP server in `playwright.config.ts`:

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  reporter: [['list'], ['@gleanwork/mcp-server-tester/reporters/mcpReporter']],
  projects: [
    {
      name: 'my-server',
      use: {
        mcpConfig: {
          transport: 'stdio',
          command: 'node',
          args: ['server.js'],
        },
      },
    },
  ],
});
```

For HTTP servers, set `transport: 'http'` and `serverUrl`. For servers that require OAuth, see the Transports Guide and CLI Guide for authentication setup, including CI/CD token management.
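For reference, an HTTP-transport project might look like the following sketch, assuming the same `mcpConfig` shape used for stdio; the project name and URL are placeholders for your own server:

```typescript
// Sketch of an HTTP-transport project entry — the name and serverUrl
// are illustrative placeholders, not defaults.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  projects: [
    {
      name: 'my-http-server',
      use: {
        mcpConfig: {
          transport: 'http',
          serverUrl: 'http://localhost:3000/mcp',
        },
      },
    },
  ],
});
```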
- Quick Start — detailed setup and configuration
- Expectations — all assertion types including snapshot sanitizers
- LLM Host Simulation — tool discoverability testing
- API Reference
- Transports — stdio and HTTP configuration, OAuth
- CLI Commands — init, generate, login, token
- UI Reporter — interactive web UI for test results
- Development — contributing and building
- Migration Guide (v0.12 → v1.0) — upgrading from pre-1.0 releases
The examples/ directory contains complete working examples:
- filesystem-server/ — Test suite for Anthropic's Filesystem MCP server: 5 Playwright tests, 11 eval dataset cases, Zod schema validation.
- sqlite-server/ — Test suite for a SQLite MCP server: 11 Playwright tests, 14 eval dataset cases.
- basic-playwright-usage/ — Minimal Playwright patterns.
These MCP protocol features are not currently supported. These are deliberate scope decisions, not bugs:
- MCP resources (`listResources`, `readResource`)
- MCP prompts (`listPrompts`, `getPrompt`)
- Server-to-client notifications
- Streaming tool responses (`callTool` waits for the complete response)
If any of these affect your use case, please open an issue.
MIT