@gleanwork/mcp-server-tester

A testing and evaluation framework for Model Context Protocol (MCP) servers. Write deterministic Playwright tests against your MCP tools, or run data-driven eval datasets — including LLM-based evaluation of tool discoverability.

Playwright Tests

The mcp Playwright fixture connects to your MCP server (stdio or HTTP) and exposes a high-level API for calling tools and asserting responses. Custom matchers keep assertions readable.

```ts
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

test('read_file returns file contents', async ({ mcp }) => {
  const result = await mcp.callTool('read_file', { path: '/tmp/test.txt' });
  expect(result).toContainToolText('Hello, world');
  expect(result).not.toBeToolError();
});

test('server exposes required tools', async ({ mcp }) => {
  const tools = await mcp.listTools();
  expect(tools.map((t) => t.name)).toContain('read_file');
});
```

Playwright tests are fast, deterministic, and designed for CI. Use them for regression testing, schema validation, and protocol conformance. The framework includes built-in conformance checks for the MCP spec.

Available matchers:

| Matcher | Description |
| --- | --- |
| `toMatchToolResponse` | Response exactly matches the expected value (deep equality) |
| `toContainToolText` | Response contains the expected substrings |
| `toMatchToolSchema` | Response validates against a Zod schema |
| `toMatchToolPattern` | Response matches a regex pattern |
| `toMatchToolSnapshot` | Response matches a saved baseline |
| `toBeToolError` | Response is (or, negated, is not) an error |
| `toHaveToolResponseSize` | Response size is within bounds |
| `toSatisfyToolPredicate` | Response satisfies a custom predicate function |
| `toHaveToolCalls` | The LLM called the expected tools |
| `toHaveToolCallCount` | The LLM made N tool calls |
| `toPassToolJudge` | An LLM judge evaluates response quality against a rubric |
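To make the text-oriented matchers concrete, the check a matcher like `toContainToolText` performs can be sketched as a standalone function. This is an illustrative sketch, not the framework's implementation; the `ToolResult` shape below is a hypothetical subset of an MCP tool call result:

```ts
// Hypothetical minimal shape of an MCP tool call result.
interface ToolResult {
  isError?: boolean;
  content: { type: string; text?: string }[];
}

// What a substring matcher effectively checks: every expected
// substring appears somewhere in the concatenated text content.
function containsToolText(result: ToolResult, expected: string[]): boolean {
  const text = result.content
    .filter((c) => c.type === 'text')
    .map((c) => c.text ?? '')
    .join('\n');
  return expected.every((s) => text.includes(s));
}

const result: ToolResult = {
  content: [{ type: 'text', text: 'Hello, world' }],
};
console.log(containsToolText(result, ['Hello', 'world'])); // → true
```

The same pattern generalizes to the regex and predicate matchers: extract the text content once, then apply the check.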

Eval Datasets

Eval datasets let you define test cases as JSON files and run them with `runEvalDataset()`. Each case specifies a tool call and one or more assertions.

```json
{
  "name": "file-ops",
  "cases": [
    {
      "id": "read-config",
      "toolName": "read_file",
      "args": { "path": "/tmp/config.json" },
      "expect": {
        "schema": "file-content",
        "containsText": ["version", "name"]
      }
    },
    {
      "id": "read-readme",
      "toolName": "read_file",
      "args": { "path": "/tmp/README.md" },
      "expect": {
        "snapshot": "readme-snapshot"
      }
    }
  ]
}
```

```ts
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';
import { loadEvalDataset, runEvalDataset } from '@gleanwork/mcp-server-tester';
import { z } from 'zod';

test('file operations eval', async ({ mcp }, testInfo) => {
  const dataset = await loadEvalDataset('./data/evals.json', {
    schemas: { 'file-content': z.object({ content: z.string() }) },
  });
  const result = await runEvalDataset({ dataset }, { mcp, testInfo });
  expect(result.passed).toBe(result.total);
});
```

Supported assertion types:

| Type | Description |
| --- | --- |
| `containsText` | Response includes the expected substrings |
| `schema` | Response validates against a Zod schema |
| `regex` | Response matches a pattern |
| `snapshot` | Response matches a saved baseline |
| `judge` | An LLM evaluates response quality against a rubric |
| `toolsTriggered` | The LLM called the expected tools (LLM host mode) |
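Conceptually, the runner walks each case's `expect` block and applies every assertion type that is present. A simplified standalone sketch of that dispatch, covering just the `containsText` and `regex` types (the interface below is hypothetical, not the framework's actual types):

```ts
// Hypothetical subset of an eval case's "expect" block.
interface EvalExpect {
  containsText?: string[];
  regex?: string;
}

// Apply each assertion type that is present; collect failure messages.
function checkExpectations(text: string, expected: EvalExpect): string[] {
  const failures: string[] = [];
  for (const s of expected.containsText ?? []) {
    if (!text.includes(s)) failures.push(`missing substring: ${s}`);
  }
  if (expected.regex && !new RegExp(expected.regex).test(text)) {
    failures.push(`pattern not matched: ${expected.regex}`);
  }
  return failures;
}

console.log(
  checkExpectations('version 1.0', {
    containsText: ['version'],
    regex: '\\d+\\.\\d+',
  }),
); // → []
```

An empty failure list means the case passed; a non-empty list gives one message per failed assertion.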

LLM host mode

In LLM host mode, a real LLM receives your server's tool list and a natural language prompt, then decides which tools to call. This tests whether your tool names, descriptions, and input schemas are clear enough for autonomous use — a different question from whether the tools return correct output.

```json
{
  "id": "find-config",
  "mode": "mcp_host",
  "scenario": "Find the application config file and return its contents",
  "mcpHostConfig": {
    "provider": "anthropic",
    "model": "claude-opus-4-20250514"
  },
  "expect": {
    "toolsTriggered": {
      "calls": [{ "name": "read_file", "required": true }]
    }
  }
}
```

LLM host mode makes real API calls and produces non-deterministic results. Use `iterations` to run a case multiple times and measure the pass rate rather than expecting 100% on a single run. See the LLM Host Guide for configuration and cost management.
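The pass-rate arithmetic behind this is simple enough to sketch standalone (the types below are hypothetical, for illustration only):

```ts
// Hypothetical per-iteration outcome from an LLM host mode run.
interface IterationResult {
  passed: boolean;
}

// Fraction of iterations that passed, e.g. to gate CI on a threshold.
function passRate(results: IterationResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

const runs = [true, true, true, false, true].map((passed) => ({ passed }));
console.log(passRate(runs)); // → 0.8
// A typical CI gate: require a threshold such as >= 0.8, not 1.0.
console.assert(passRate(runs) >= 0.8);
```

Gating on a threshold absorbs occasional LLM flakiness while still catching tools whose names or descriptions are genuinely unclear.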

Installation

Requires Node.js 22+.

```sh
npm install --save-dev @gleanwork/mcp-server-tester @playwright/test zod
```

The Anthropic SDK is only needed for LLM-as-judge assertions or LLM host mode with the Anthropic provider:

```sh
npm install --save-dev @anthropic-ai/sdk
```

Quick Start

```sh
npx mcp-server-tester init
```

The CLI wizard creates a playwright.config.ts, example tests, and a sample eval dataset configured for your server. See the CLI Guide for all options.

Configuration

Point the framework at your MCP server in playwright.config.ts:

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  reporter: [['list'], ['@gleanwork/mcp-server-tester/reporters/mcpReporter']],
  projects: [
    {
      name: 'my-server',
      use: {
        mcpConfig: {
          transport: 'stdio',
          command: 'node',
          args: ['server.js'],
        },
      },
    },
  ],
});
```

For HTTP servers, set `transport: 'http'` and `serverUrl`. For servers that require OAuth, see the Transports Guide and CLI Guide for authentication setup, including CI/CD token management.
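For orientation, an HTTP project entry might look like the following sketch; the field names mirror the stdio example above, the URL is a placeholder, and the exact HTTP options may differ (check the Transports Guide):

```ts
// Hypothetical HTTP project entry; serverUrl is a placeholder.
{
  name: 'my-http-server',
  use: {
    mcpConfig: {
      transport: 'http',
      serverUrl: 'http://localhost:3000/mcp',
    },
  },
}
```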

Documentation

Examples

The examples/ directory contains complete working examples:

  • filesystem-server/ — Test suite for Anthropic's Filesystem MCP server: 5 Playwright tests, 11 eval dataset cases, Zod schema validation.
  • sqlite-server/ — Test suite for a SQLite MCP server: 11 Playwright tests, 14 eval dataset cases.
  • basic-playwright-usage/ — Minimal Playwright patterns.

Known Limitations

The following MCP protocol features are not currently supported. These are deliberate scope decisions, not bugs:

  • MCP resources (listResources, readResource)
  • MCP prompts (listPrompts, getPrompt)
  • Server-to-client notifications
  • Streaming tool responses (callTool waits for the complete response)

If any of these affect your use case, please open an issue.

License

MIT
