Evaluate tools
In this guide, you’ll learn how to evaluate your custom tools to ensure they function correctly with the AI assistant, including defining evaluation cases and using different critics.
We’ll create evaluation cases to test our greet tool and measure its performance.
Prerequisites
- Create an MCP Server
- Install the evaluation dependencies:
uv pip install 'arcade-mcp[evals]'

Create an evaluation suite
Navigate to your server's directory:

cd my_server

Create a new Python file for your evaluations, e.g., eval_server.py.
For evals, the file name should start with eval_ and be a Python script
(using the .py extension).
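For example, if your server code lives in server.py (as in the Create an MCP Server prerequisite), the directory might look something like this (other files omitted):

my_server/
├── server.py
└── eval_server.py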
Define your evaluation cases
Open eval_server.py and add the following code:
from arcade_evals import (
    EvalSuite, tool_eval, EvalRubric,
    ExpectedToolCall, BinaryCritic
)
from arcade_core import ToolCatalog

from server import greet

# Create a catalog of tools to include in the evaluation
catalog = ToolCatalog()
catalog.add_tool(greet, "Greet")

# Create a rubric with fail and warn thresholds for scoring tool calls
rubric = EvalRubric(
    fail_threshold=0.8,
    warn_threshold=0.9,
)
@tool_eval()
def hello_eval_suite() -> EvalSuite:
    """Create an evaluation suite for the hello tool."""
    suite = EvalSuite(
        name="MCP Server Evaluation",
        catalog=catalog,
        system_message="You are a helpful assistant.",
        rubric=rubric,
    )

    suite.add_case(
        name="Simple Greeting",
        user_message="Greet Alice",
        expected_tool_calls=[
            ExpectedToolCall(
                func=greet,
                args={
                    "name": "Alice",
                },
            )
        ],
        critics=[
            BinaryCritic(critic_field="name", weight=1.0),
        ],
    )

    return suite

Run the evaluation
From the server directory, ensure you have an OpenAI API key set in the OPENAI_API_KEY environment variable. Then run:
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
arcade evals .

This command executes your evaluation suite and provides a report.
By default, the evaluation suite will use the gpt-4o model. You can specify
a different model and provider using the --models and --provider options.
If you are using a different provider, you will need to set the appropriate
API key in an environment variable, or use the --provider-api-key option.
For more information, see the Run
evaluations guide.
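For example, to run the suite against a different model (the model and provider names below are placeholders, not a recommendation; substitute whatever your setup supports):

arcade evals . --models gpt-4o-mini --provider openai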
How it works
The evaluation framework in Arcade allows you to define test cases (EvalCase) with expected tool calls and use critics to assess the AI’s performance.
By running the evaluation suite, you can measure how well the AI assistant is using your tools.
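Conceptually, each critic scores the tool-call argument it is responsible for, the scores are combined according to the critic weights, and the total is compared against the rubric thresholds. The sketch below is illustrative pseudologic only, not the actual arcade_evals implementation:

# Illustrative only: how weighted critic scores relate to the rubric
# thresholds defined earlier (not the arcade_evals source).
def combined_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(scores[field] * weights[field] for field in weights) / total_weight

score = combined_score({"name": 1.0, "emotion": 0.6}, {"name": 0.5, "emotion": 0.5})

if score < 0.8:      # below rubric.fail_threshold -> the case fails
    outcome = "failed"
elif score < 0.9:    # below rubric.warn_threshold -> the case passes with a warning
    outcome = "warning"
else:
    outcome = "passed"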
Next steps
Explore different types of critics and evaluation criteria to thoroughly test your tools.
Critic classes
Arcade provides several critic classes to evaluate different aspects of tool usage.
BinaryCritic
Checks if a parameter value matches exactly.
BinaryCritic(critic_field="name", weight=1.0)

SimilarityCritic
Evaluates the similarity between expected and actual values.
from arcade_evals import SimilarityCritic
SimilarityCritic(critic_field="message", weight=1.0)

NumericCritic
Assesses numeric values within a specified tolerance.
from arcade_evals import NumericCritic
NumericCritic(critic_field="score", tolerance=0.1, weight=1.0)

DatetimeCritic
Evaluates the closeness of datetime values within a specified tolerance.
from datetime import timedelta
from arcade_evals import DatetimeCritic
DatetimeCritic(critic_field="start_time", tolerance=timedelta(seconds=10), weight=1.0)
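Critics of different types can be combined in a single case, with weights reflecting how much each argument matters. As an illustration only (the schedule_meeting tool and its parameters below are hypothetical, not part of this guide's server):

from datetime import timedelta
from arcade_evals import BinaryCritic, NumericCritic, DatetimeCritic

# Hypothetical critic mix for an imaginary schedule_meeting tool:
# the title must match exactly, while duration and start time allow some slack.
critics = [
    BinaryCritic(critic_field="title", weight=0.5),
    NumericCritic(critic_field="duration_minutes", tolerance=5, weight=0.2),
    DatetimeCritic(critic_field="start_time", tolerance=timedelta(minutes=1), weight=0.3),
]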
Advanced evaluation cases

You can add more evaluation cases to test different scenarios.
Example: Greeting with emotion
Modify your greet tool to accept an emotion parameter:
from enum import Enum
from typing import Annotated

from arcade_mcp_server import tool

class Emotion(str, Enum):
    HAPPY = "happy"
    SLIGHTLY_HAPPY = "slightly happy"
    SAD = "sad"
    SLIGHTLY_SAD = "slightly sad"

@tool
def greet(
    name: Annotated[str, "The name of the person to greet"] = "there",
    emotion: Annotated[Emotion, "The emotion to convey"] = Emotion.HAPPY,
) -> Annotated[str, "A greeting to the user"]:
    """
    Say hello to the user with a specific emotion.
    """
    return f"Hello {name}! I'm feeling {emotion.value} today."

Add an evaluation case for this new parameter:
suite.add_case(
    name="Greeting with Emotion",
    user_message="Say hello to Bob sadly",
    expected_tool_calls=[
        ExpectedToolCall(
            func=greet,
            args={
                "name": "Bob",
                "emotion": Emotion.SAD,
            },
        )
    ],
    critics=[
        BinaryCritic(critic_field="name", weight=0.5),
        SimilarityCritic(critic_field="emotion", weight=0.5),
    ],
)

Add an evaluation case with additional conversation context:
suite.add_case(
    name="Greeting with Emotion",
    user_message="Say hello to Bob based on my current mood.",
    expected_tool_calls=[
        ExpectedToolCall(
            func=greet,
            args={
                "name": "Bob",
                "emotion": Emotion.HAPPY,
            },
        )
    ],
    critics=[
        BinaryCritic(critic_field="name", weight=0.5),
        SimilarityCritic(critic_field="emotion", weight=0.5),
    ],
    # Add some context to the evaluation case
    additional_messages=[
        {"role": "user", "content": "Hi, I'm so happy!"},
        {
            "role": "assistant",
            "content": "That's awesome! What's got you feeling so happy today?",
        },
    ],
)

Add an evaluation case with multiple expected tool calls:
suite.add_case(
    name="Greeting with Emotion",
    user_message="Say hello to Bob based on my current mood. And then say hello to Alice with slightly less of that emotion.",
    expected_tool_calls=[
        ExpectedToolCall(
            func=greet,
            args={
                "name": "Bob",
                "emotion": Emotion.HAPPY,
            },
        ),
        ExpectedToolCall(
            func=greet,
            args={
                "name": "Alice",
                "emotion": Emotion.SLIGHTLY_HAPPY,
            },
        ),
    ],
    critics=[
        BinaryCritic(critic_field="name", weight=0.5),
        SimilarityCritic(critic_field="emotion", weight=0.5),
    ],
    # Add some context to the evaluation case
    additional_messages=[
        {"role": "user", "content": "Hi, I'm so happy!"},
        {
            "role": "assistant",
            "content": "That's awesome! What's got you feeling so happy today?",
        },
    ],
)

Ensure that your greet tool and evaluation cases are updated accordingly and
that you rerun arcade evals . to test your changes.