@gleanwork/mcp-server-tester

A testing and evaluation framework for Model Context Protocol (MCP) servers. Write deterministic Playwright tests against your MCP tools, or run data-driven eval datasets — including LLM-based evaluation of tool discoverability.

Playwright Tests

The mcp Playwright fixture connects to your MCP server (stdio or HTTP) and exposes a high-level API for calling tools and asserting responses. Custom matchers keep assertions readable.

```ts
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

test('read_file returns file contents', async ({ mcp }) => {
  const result = await mcp.callTool('read_file', { path: '/tmp/test.txt' });
  expect(result).toContainToolText('Hello, world');
  expect(result).not.toBeToolError();
});

test('server exposes required tools', async ({ mcp }) => {
  const tools = await mcp.listTools();
  expect(tools.map((t) => t.name)).toContain('read_file');
});
```

Playwright tests are fast, deterministic, and designed for CI. Use them for regression testing, schema validation, and protocol conformance. The framework includes built-in conformance checks for the MCP spec.

Available matchers:

| Matcher | Description |
| --- | --- |
| `toMatchToolResponse` | Response exactly matches the expected value (deep equality) |
| `toContainToolText` | Response contains the expected substrings |
| `toMatchToolSchema` | Response validates against a Zod schema |
| `toMatchToolPattern` | Response matches a regex pattern |
| `toMatchToolSnapshot` | Response matches a saved baseline |
| `toBeToolError` | Response is (or, negated, is not) an error |
| `toHaveToolResponseSize` | Response size is within bounds |
| `toSatisfyToolPredicate` | Response satisfies a custom predicate function |
| `toHaveToolCalls` | The LLM called the expected tools |
| `toHaveToolCallCount` | The LLM made N tool calls |
| `toPassToolJudge` | An LLM judge evaluates response quality against a rubric |
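To make the text-oriented matchers concrete, the check a matcher like `toContainToolText` performs can be sketched as a standalone function. This is an illustrative sketch, not the framework's implementation; the `ToolResult` shape below is a hypothetical subset of an MCP tool call result:

```ts
// Hypothetical minimal shape of an MCP tool call result.
interface ToolResult {
  isError?: boolean;
  content: { type: string; text?: string }[];
}

// What a substring matcher effectively checks: every expected
// substring appears somewhere in the concatenated text content.
function containsToolText(result: ToolResult, expected: string[]): boolean {
  const text = result.content
    .filter((c) => c.type === 'text')
    .map((c) => c.text ?? '')
    .join('\n');
  return expected.every((s) => text.includes(s));
}

const result: ToolResult = {
  content: [{ type: 'text', text: 'Hello, world' }],
};
console.log(containsToolText(result, ['Hello', 'world'])); // → true
```

The same pattern generalizes to the regex and predicate matchers: extract the text content once, then apply the check.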

Eval Datasets

Eval datasets let you define test cases as JSON files and run them with `runEvalDataset()`. Each case specifies a tool call and one or more assertions.

```json
{
  "name": "file-ops",
  "cases": [
    {
      "id": "read-config",
      "toolName": "read_file",
      "args": { "path": "/tmp/config.json" },
      "expect": {
        "schema": "file-content",
        "containsText": ["version", "name"]
      }
    },
    {
      "id": "read-readme",
      "toolName": "read_file",
      "args": { "path": "/tmp/README.md" },
      "expect": {
        "snapshot": "readme-snapshot"
      }
    }
  ]
}
```

```ts
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';
import { loadEvalDataset, runEvalDataset } from '@gleanwork/mcp-server-tester';
import { z } from 'zod';

test('file operations eval', async ({ mcp }, testInfo) => {
  const dataset = await loadEvalDataset('./data/evals.json', {
    schemas: { 'file-content': z.object({ content: z.string() }) },
  });
  const result = await runEvalDataset({ dataset }, { mcp, testInfo });
  expect(result.passed).toBe(result.total);
});
```

Supported assertion types:

| Type | Description |
| --- | --- |
| `containsText` | Response includes the expected substrings |
| `schema` | Response validates against a Zod schema |
| `regex` | Response matches a pattern |
| `snapshot` | Response matches a saved baseline |
| `judge` | An LLM evaluates response quality against a rubric |
| `toolsTriggered` | The LLM called the expected tools (LLM host mode) |
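Conceptually, the runner walks each case's `expect` block and applies every assertion type that is present. A simplified standalone sketch of that dispatch, covering just the `containsText` and `regex` types (the interface below is hypothetical, not the framework's actual types):

```ts
// Hypothetical subset of an eval case's "expect" block.
interface EvalExpect {
  containsText?: string[];
  regex?: string;
}

// Apply each assertion type that is present; collect failure messages.
function checkExpectations(text: string, expected: EvalExpect): string[] {
  const failures: string[] = [];
  for (const s of expected.containsText ?? []) {
    if (!text.includes(s)) failures.push(`missing substring: ${s}`);
  }
  if (expected.regex && !new RegExp(expected.regex).test(text)) {
    failures.push(`pattern not matched: ${expected.regex}`);
  }
  return failures;
}

console.log(
  checkExpectations('version 1.0', {
    containsText: ['version'],
    regex: '\\d+\\.\\d+',
  }),
); // → []
```

An empty failure list means the case passed; a non-empty list gives one message per failed assertion.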

LLM host mode

In LLM host mode, a real LLM receives your server's tool list and a natural language prompt, then decides which tools to call. This tests whether your tool names, descriptions, and input schemas are clear enough for autonomous use — a different question from whether the tools return correct output.

```json
{
  "id": "find-config",
  "mode": "mcp_host",
  "scenario": "Find the application config file and return its contents",
  "mcpHostConfig": {
    "provider": "anthropic",
    "model": "claude-opus-4-20250514"
  },
  "expect": {
    "toolsTriggered": {
      "calls": [{ "name": "read_file", "required": true }]
    }
  }
}
```

LLM host mode makes real API calls and produces non-deterministic results. Use `iterations` to run a case multiple times and measure the pass rate rather than expecting 100% on a single run. See the LLM Host Guide for configuration and cost management.
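The pass-rate arithmetic behind this is simple enough to sketch standalone (the types below are hypothetical, for illustration only):

```ts
// Hypothetical per-iteration outcome from an LLM host mode run.
interface IterationResult {
  passed: boolean;
}

// Fraction of iterations that passed, e.g. to gate CI on a threshold.
function passRate(results: IterationResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

const runs = [true, true, true, false, true].map((passed) => ({ passed }));
console.log(passRate(runs)); // → 0.8
// A typical CI gate: require a threshold such as >= 0.8, not 1.0.
console.assert(passRate(runs) >= 0.8);
```

Gating on a threshold absorbs occasional LLM flakiness while still catching tools whose names or descriptions are genuinely unclear.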

Installation

Requires Node.js 22+.

```sh
npm install --save-dev @gleanwork/mcp-server-tester @playwright/test zod
```

The Anthropic SDK is only needed for LLM-as-judge assertions or LLM host mode with the Anthropic provider:

```sh
npm install --save-dev @anthropic-ai/sdk
```

Quick Start

```sh
npx mcp-server-tester init
```

The CLI wizard creates a playwright.config.ts, example tests, and a sample eval dataset configured for your server. See the CLI Guide for all options.

Configuration

Point the framework at your MCP server in playwright.config.ts:

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  reporter: [['list'], ['@gleanwork/mcp-server-tester/reporters/mcpReporter']],
  projects: [
    {
      name: 'my-server',
      use: {
        mcpConfig: {
          transport: 'stdio',
          command: 'node',
          args: ['server.js'],
        },
      },
    },
  ],
});
```

For HTTP servers, set `transport: 'http'` and `serverUrl`. For servers that require OAuth, see the Transports Guide and CLI Guide for authentication setup, including CI/CD token management.
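For orientation, an HTTP project entry might look like the following sketch; the field names mirror the stdio example above, the URL is a placeholder, and the exact HTTP options may differ (check the Transports Guide):

```ts
// Hypothetical HTTP project entry; serverUrl is a placeholder.
{
  name: 'my-http-server',
  use: {
    mcpConfig: {
      transport: 'http',
      serverUrl: 'http://localhost:3000/mcp',
    },
  },
}
```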

Documentation

Examples

The examples/ directory contains complete working examples:

  • filesystem-server/ — Test suite for Anthropic's Filesystem MCP server: 5 Playwright tests, 11 eval dataset cases, Zod schema validation.
  • sqlite-server/ — Test suite for a SQLite MCP server: 11 Playwright tests, 14 eval dataset cases.
  • basic-playwright-usage/ — Minimal Playwright patterns.

Known Limitations

The following MCP protocol features are not currently supported. These are deliberate scope decisions, not bugs:

  • MCP resources (listResources, readResource)
  • MCP prompts (listPrompts, getPrompt)
  • Server-to-client notifications
  • Streaming tool responses (callTool waits for the complete response)

If any of these affect your use case, please open an issue.

License

MIT
