
Evaluations & Observability – Measure What Matters

10 min read

We’ve reached the final day of Launch Week. Over the past four days, we’ve given you the tools to build production-grade AI agents:

  • Day 1: Tool Groups to eliminate context pollution
  • Day 2: Custom Tools for surgical precision
  • Day 3: Token Optimization to maximize efficiency
  • Day 4: Enterprise Integrations to break down silos

Today we’re addressing one of the top requests we’ve been hearing from customers: How do you know if your agent is working as expected?

We’re releasing: Evaluations Framework and Observability Dashboard.

The Challenge: Visibility into Agent Behavior

You’ve built an e-commerce agent. You’ve scoped it to the right tools. You’ve optimized token usage. Now you need visibility into production:

  • Which tools are actually being called?
  • Are the tools being used correctly?
  • Where are agents failing?
  • What’s your actual usage and cost?
  • How do new tool configurations impact success rates?

Without visibility, you’re flying blind. You can’t optimize what you can’t measure.

This is especially critical when you’re working with Tool Groups. When you switch from groups=ecommerce to a custom tool selection, did you accidentally break a critical workflow? You won’t know until a customer complains.

The Solution: Two-Layer Visibility

We’ve built a complete visibility stack with two complementary systems:

1. MCP Evaluations Framework (Development & Testing)

An automated testing framework, powered by mcpjam, that validates agent behavior before it reaches production.

2. Observability Dashboard (Production Monitoring)

A real-time usage analytics dashboard in Bright Data’s Control Panel that tracks every API call in production.

Let’s dive into each layer.


Layer 1: MCP Evaluations Framework

What is mcpjam?

mcpjam is the official evaluation CLI for Model Context Protocol servers. Think of it as “integration testing for AI agents.”

You write test cases as natural language queries, specify which tools should be called, and mcpjam runs your agent through the workflow automatically.

How We Use It

We’ve built a comprehensive evaluation suite for every Tool Group we shipped on Day 1. When you configure a new tool selection, you can run these evals to verify everything works before deploying.

Project Structure

mcp-evals/
├── server-configs/           # Server connection configs per tool group
│   ├── server-config.ecommerce.json
│   ├── server-config.social.json
│   ├── server-config.business.json
│   ├── server-config.browser.json
│   └── ...
├── tool-groups.json/         # Test cases per tool group
│   ├── tool-groups.ecommerce.json
│   ├── tool-groups.social.json
│   ├── tool-groups.business.json
│   ├── tool-groups.browser.json
│   └── ...
└── llms.json                 # LLM provider API keys

Each tool group gets its own test suite with real-world queries that agents should be able to handle.
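Each server config describes how to connect to a Web MCP instance scoped to that group. As a rough, illustrative sketch only (it assumes the local stdio server started via npx and the API_TOKEN environment variable; the repo’s committed configs are the source of truth for the exact keys and for how the group itself is scoped via Day 1’s groups option):

{
  "mcpServers": {
    "ecommerce-server": {
      "command": "npx",
      "args": ["-y", "@brightdata/mcp"],
      "env": {
        "API_TOKEN": "<your-bright-data-api-token>"
      }
    }
  }
}

Whatever the exact shape, the important property is that each config exposes only that group’s tools, so the evals exercise the same surface area your production agent will see.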

Example: E-commerce Eval

From mcp-evals/tool-groups.json/tool-groups.ecommerce.json:

{
  "title": "Test E-commerce - Amazon product search",
  "query": "Search for wireless headphones on Amazon and show me the top products with reviews",
  "runs": 1,
  "model": "gpt-5.1-2025-11-13",
  "provider": "openai",
  "expectedToolCalls": ["web_data_amazon_product_search"],
  "selectedServers": ["ecommerce-server"],
  "advancedConfig": {
    "instructions": "You are a shopping assistant helping users find products on Amazon",
    "temperature": 0.1,
    "maxSteps": 5,
    "toolChoice": "required"
  }
}

This test validates that:

  1. The agent correctly interprets the user query
  2. It calls the right tool (web_data_amazon_product_search)
  3. It passes appropriate parameters (product keyword, Amazon URL)
  4. It completes within the configured timeout
  5. It returns a coherent response
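
The same schema covers the other groups. Here is an illustrative sketch for the business group (the server name is assumed by analogy with ecommerce-server; the query and expected tool come from the coverage table below):

{
  "title": "Test Business - LinkedIn profile lookup",
  "query": "Find the LinkedIn profile for the CEO of Microsoft",
  "runs": 1,
  "model": "gpt-5.1-2025-11-13",
  "provider": "openai",
  "expectedToolCalls": ["web_data_linkedin_person_profile"],
  "selectedServers": ["business-server"],
  "advancedConfig": {
    "instructions": "You are a research assistant that looks up professional profiles",
    "temperature": 0.1,
    "maxSteps": 5,
    "toolChoice": "required"
  }
}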

Running Evals: Quick Start

Install mcpjam:

npm install -g @mcpjam/cli

Run e-commerce tool group tests:

mcpjam evals run \
  -t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
  -e mcp-evals/server-configs/server-config.ecommerce.json \
  -l mcp-evals/llms.json

Expected Output:

Running tests
Connected to 1 server: ecommerce-server
Found 13 total tools
Running 2 tests

Test 1: Test E-commerce - Amazon product search
Using openai:gpt-5.1-2025-11-13

run 1/1
user: Search for wireless headphones on Amazon and show me the top products with reviews
[tool-call] web_data_amazon_product_search
{
  "keyword": "wireless headphones",
  "url": "https://www.amazon.com"
}
[tool-result] web_data_amazon_product_search
{
  "content": [...]
}
assistant: Here are some of the top wireless headphones currently on Amazon...

Expected: [web_data_amazon_product_search]
Actual:   [web_data_amazon_product_search]
PASS (23.8s)
Tokens • input 20923 • output 1363 • total 22286

What Gets Tested

We’ve built eval suites for all 8 Tool Groups from Day 1:

Tool Group         | Test Coverage                                                      | Example Query
ecommerce          | Amazon, Walmart, Best Buy product searches                         | “Compare iPhone 15 prices across retailers”
social             | TikTok content, Instagram posts, Twitter trends                    | “Find trending TikTok videos about AI”
business           | LinkedIn profiles, Crunchbase funding data, Google Maps locations  | “Find the LinkedIn profile for the CEO of Microsoft”
research           | GitHub repos, Reuters news, academic sources                       | “Find Python repos for web scraping with 1k+ stars”
finance            | Stock data, market trends, financial news                          | “Get the latest stock price for NVIDIA”
app_stores         | iOS App Store, Google Play reviews & ratings                       | “Find top-rated meditation apps on iOS”
browser            | Scraping Browser automation workflows                              | “Navigate to Amazon and add an item to cart”
advanced_scraping  | Batch operations, custom scraping                                  | “Scrape product data from a custom website”

Each test suite contains 2-5 core test cases covering the most common agent workflows for that domain.

Why This Matters

Evals give you:

  1. Regression Testing: Run evals after every config change to ensure you didn’t break existing workflows
  2. Performance Benchmarking: Track token usage and latency across different LLM models
  3. Tool Validation: Verify that tool selection logic is working correctly
  4. Documentation: Test cases serve as executable examples of what your agent can do

Before Day 1’s Tool Groups, we had no systematic way to test whether switching from groups=ecommerce to groups=ecommerce,social would break agent behavior. Now we do.
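
In practice, a regression pass is just a loop over the suites you depend on. The flags are the same ones shown in the quick start; the group list here is illustrative, so trim it to whatever you actually ship:

# Re-run every relevant tool-group suite after a config change
for group in ecommerce social business browser; do
  mcpjam evals run \
    -t mcp-evals/tool-groups.json/tool-groups.$group.json \
    -e mcp-evals/server-configs/server-config.$group.json \
    -l mcp-evals/llms.json
done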


Layer 2: Observability Dashboard

Real-Time Production Monitoring

While evals handle pre-deployment testing, the Observability Dashboard gives you real-time visibility into production usage.

We’ve integrated a new MCP Usage panel into Bright Data’s Control Panel that tracks every API call made through your MCP server.

What You See

The dashboard displays a comprehensive usage table with:

Date                | Tool                             | Client Name        | URL                        | Status
2025-11-26 14:32:15 | web_data_amazon_product          | my-ecommerce-agent | https://amazon.com/…       | Success
2025-11-26 14:31:52 | search_engine                    | my-research-bot    | N/A                        | Success
2025-11-26 14:30:18 | web_data_linkedin_person_profile | lead-gen-agent     | https://linkedin.com/in/…  | Success
2025-11-26 14:29:03 | scraping_browser_navigate        | automation-agent   | https://example.com        | Failed

Key Metrics

1. Tool Usage Breakdown

See which tools are being called most frequently:

web_data_amazon_product:        1,243 calls
search_engine:                    892 calls
web_data_linkedin_person_profile: 634 calls
scrape_as_markdown:              421 calls

This tells you which datasets are most valuable to your agents. If you’re paying for unused tool groups, you’ll see it here.

2. Client Identification

Every agent instance can be tagged with a client name (via the client_name parameter in the connection URL):

npx -y @brightdata/mcp 

The dashboard groups usage by client, so you can track costs per agent/workflow.
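
The npx command above starts the local server; when you connect through the hosted endpoint instead, the tag travels as a query parameter on the connection URL. The example below is hypothetical (the endpoint and parameter placement may differ), so copy the actual connection string from your Control Panel:

https://mcp.brightdata.com/mcp?token=<YOUR_API_TOKEN>&client_name=my-ecommerce-agent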

3. Success vs. Failure Rates

Monitor agent reliability:

Total Requests:     3,190
Successful:         3,102 (97.2%)
Failed:                88 (2.8%)

Click into failed requests to see error details and debug issues.

4. URL Tracking

For dataset tools, the dashboard shows which URLs/resources were accessed. This helps you:

  • Identify rate-limiting issues (too many requests to same domain)
  • Track which specific products/profiles/pages are being scraped
  • Audit compliance (ensure agents aren’t accessing restricted sites)

How to Access

  1. Log into Bright Data Control Panel
  2. Navigate to MCP Usage (new section in the sidebar)
  3. View real-time usage data for all your MCP connections

Filters:

  • Date range (last 24 hours, 7 days, 30 days, custom)
  • Tool name (filter by specific tools)
  • Client name (filter by agent instance)
  • Status (success/failure)

Export:

Download usage data as CSV for deeper analysis or BI tool integration.
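
Because the export mirrors the usage table above, quick breakdowns don’t require a BI tool. A sketch with standard Unix tools, assuming the file is named mcp-usage.csv with Tool in the second column and Status in the fifth (adjust to the real export header):

# Calls per tool, most-used first
cut -d',' -f2 mcp-usage.csv | tail -n +2 | sort | uniq -c | sort -rn

# Failure counts grouped by tool
awk -F',' 'NR > 1 && $5 == "Failed" { fail[$2]++ } END { for (t in fail) print fail[t], t }' mcp-usage.csv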


Combined Workflow: Development → Production

Here’s how the two systems work together:

Phase 1: Development (Pre-Deployment)

  1. Configure Tool Groups using Day 1’s feature:

     npx -y @brightdata/mcp

  2. Run Evals to validate tool selection:

     mcpjam evals run \
       -t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
       -e mcp-evals/server-configs/server-config.ecommerce.json \
       -l mcp-evals/llms.json
  3. Review Results: Ensure all tests pass
    • Token usage is within budget
    • Correct tools are being called
    • Responses are accurate
  4. Iterate: If tests fail, adjust tool selection or system prompts

Phase 2: Production (Post-Deployment)

  1. Deploy Agent with client name tagging:

     npx -y @brightdata/mcp
  2. Monitor Dashboard: Check real-time usage
    • Are success rates consistent with eval results?
    • Are unexpected tools being called?
    • Any rate limiting or authentication issues?
  3. Analyze Trends: Over time, look for:
    • Usage spikes (need to scale?)
    • Failure pattern changes (tool degradation?)
    • Cost anomalies (optimize token usage)
  4. Optimize: Use dashboard insights to refine tool selection
    • Remove unused tools (reduce token costs)
    • Add missing tools (improve success rates)
    • Adjust rate limits (avoid throttling)
  5. Re-Run Evals: After any config change, run evals again to ensure no regressions

Performance Stats: Launch Week Recap

Let’s bring it all together. Here’s the cumulative impact of all 5 days:

Day 1: Tool Groups

Impact: 60% reduction in system prompt tokens
Example: Full suite (200+ tools) → Single group (25 tools)
Token Savings: ~8,000 tokens per request (system prompt)

Day 2: Custom Tools

Impact: 85% reduction vs. full suite when selecting 4 specific tools
Example: Full suite (200+ tools) → Custom (4 tools)
Token Savings: ~9,500 tokens per request (system prompt)

Day 3: Token Optimization

Impact: 30-60% reduction in tool response tokens
Example: Web scraping + dataset tools in single workflow
Token Savings: ~10,250 tokens per request (tool outputs)

Combined Effect: E-commerce Agent Workflow

Scenario: “Find top 5 Amazon headphones under $100, summarize reviews”

Configuration                | System Prompt Tokens | Tool Output Tokens | Total Tokens | Cost per Request
Full Suite (No Optimization) | 15,000               | 22,500             | 37,500       | $0.45
+ Tool Groups                | 6,000                | 22,500             | 28,500       | $0.34
+ Custom Tools               | 2,250                | 22,500             | 24,750       | $0.30
+ Token Optimization         | 2,250                | 12,250             | 14,500       | $0.17

Total Reduction: 61.3% fewer tokens, 62.2% lower cost

At 1,000 requests/day, that’s $280/day savings or $102,200/year.
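
The arithmetic behind those figures, straight from the table above:

$0.45 - $0.17 = $0.28 saved per request
$0.28 × 1,000 requests/day = $280/day
$280 × 365 days = $102,200/year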

Day 4: Enterprise Integrations

Impact: Eliminated custom ETL overhead
Time Savings: Weeks of engineering work → Minutes of configuration
Maintenance: Zero (handled by Bright Data)

Day 5: Evals + Observability

Impact: Proactive quality control + production visibility
Failure Reduction: 10-15% improvement in success rates (via early issue detection)
Cost Avoidance: Catch regressions before production (save hundreds of failed requests)


Try It Out: Get Started Today

Step 1: Run Your First Eval

# Install mcpjam
npm install -g @mcpjam/cli

# Clone The Web MCP repo
git clone https://github.com/brightdata/brightdata-mcp-sse.git
cd brightdata-mcp-sse

# Configure your API keys in mcp-evals/llms.json
# Configure your Bright Data token in server configs

# Run e-commerce evals
mcpjam evals run \
  -t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
  -e mcp-evals/server-configs/server-config.ecommerce.json \
  -l mcp-evals/llms.json

Step 2: Access the Observability Dashboard

  1. Sign up at Bright Data
  2. Navigate to MCP Usage in the Control Panel
  3. Deploy an agent and watch real-time usage data appear

Step 3: Iterate

Use evals to test configurations. Use the dashboard to monitor production. Rinse and repeat.


Resources

MCP Evaluations:

Observability Dashboard:

The Web MCP Server:

Launch Week Recap:


Launch Week: A Final Word

Five days. Five major releases. One mission: Make AI Agents Production-Ready.

We started with the insight that context pollution is the biggest bottleneck in agentic workflows. We gave you Tool Groups to scope your context.

Then we realized even groups aren’t precise enough. We shipped Custom Tools for surgical precision.

Next, we tackled the output side: token-bloated responses. We integrated markdown stripping via Strip-Markdown and intelligent payload cleaning with Parsed Light.

After that, we brought Bright Data to the platforms enterprises actually use: Google ADK, IBM watsonx, Databricks, and Snowflake.

And today, we closed the loop with evaluations and observability. Because you can’t improve what you can’t measure.

This is the full stack for production AI agents:

  • Tool Groups → Reduce context pollution
  • Custom Tools → Maximize precision
  • Token Optimization → Minimize costs
  • Enterprise Integrations → Deploy anywhere
  • Evals + Observability → Maintain quality

Thank You

To everyone who followed along this week: thank you.

To the developers building the next generation of AI agents: we can’t wait to see what you build.

To the enterprises deploying AI at scale: we’re here to make it reliable.

And to the open-source community that made MCP possible: this is just the beginning.

Let’s build the future of AI together.


Ready to get started?
Explore The Web MCP Server and start building powerful AI agents.
Read documentation View the Repo