MCP Gateway Observability Guide

This guide covers how to access and query metrics collected by the MCP Gateway metrics service.

Architecture Overview

The MCP Gateway collects comprehensive metrics through a dual-path observability system:

  1. SQLite Storage: All metrics are stored in specialized database tables for detailed querying and analysis
  2. OpenTelemetry Export: Metrics are simultaneously exported to OpenTelemetry for real-time monitoring via Prometheus and Grafana

Metrics Collection Flow

Auth Server Middleware → Metrics Service API → Dual Path:
                                               ├─> SQLite Database (detailed storage)
                                               └─> OpenTelemetry (Prometheus/Grafana)
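The dual-path write can be sketched in Python. This is a minimal illustration, not the service's actual code: the function name `record_auth_metric`, the simplified table schema, and the `StubCounter` stand-in for an OpenTelemetry Counter instrument are all hypothetical.

```python
import sqlite3

class StubCounter:
    """Stand-in for an OpenTelemetry Counter instrument (hypothetical)."""
    def __init__(self):
        self.total = 0
    def add(self, n, attributes=None):
        self.total += n

def record_auth_metric(db, otel_counter, server, success, duration_ms):
    """Hypothetical dual-path write: persist to SQLite, then export via OTel."""
    # Path 1: detailed storage in SQLite
    db.execute(
        "INSERT INTO auth_metrics (server, success, duration_ms) VALUES (?, ?, ?)",
        (server, int(success), duration_ms),
    )
    db.commit()
    # Path 2: increment an OpenTelemetry-style counter for Prometheus/Grafana
    otel_counter.add(1, {"server": server, "success": str(success)})

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE auth_metrics (server TEXT, success INTEGER, duration_ms REAL)")
counter = StubCounter()
record_auth_metric(db, counter, "mcpgw", True, 12.5)
# Both paths received the metric: one row stored, counter incremented once.
```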

Database Tables

  • auth_metrics: Authentication requests and validation
  • tool_metrics: Tool execution details (calls, methods, client info)
  • discovery_metrics: Tool discovery/search queries
  • metrics: Raw metrics data (all types)
  • api_keys: API key management for metrics service

Accessing Metrics

Access SQLite Database

The metrics database is stored in a Docker volume and accessed via the metrics-db container:

# Connect to the metrics-db container
docker compose exec metrics-db sh

# Access SQLite database
sqlite3 /var/lib/sqlite/metrics.db

# Enable better formatting
.mode column
.headers on

Alternative: Copy Database Locally

# Copy database from container to host
docker compose cp metrics-db:/var/lib/sqlite/metrics.db ./metrics.db

# Install sqlite3 locally if needed
sudo apt-get install -y sqlite3

# Query locally
sqlite3 ./metrics.db
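Once copied, the database can also be queried from Python's built-in sqlite3 module. The sketch below uses an in-memory stand-in with one synthetic row so it is self-contained; in practice, point `connect()` at `./metrics.db`.

```python
import sqlite3

# In practice: conn = sqlite3.connect("./metrics.db")
conn = sqlite3.connect(":memory:")  # stand-in so this example is self-contained
conn.execute("CREATE TABLE auth_metrics (timestamp TEXT, server TEXT, success INTEGER)")
conn.execute("INSERT INTO auth_metrics VALUES ('2025-10-02 04:43:22', 'mcpgw', 0)")

# sqlite3.Row gives name-based access, similar to `.headers on` in the CLI
conn.row_factory = sqlite3.Row
for row in conn.execute("SELECT * FROM auth_metrics ORDER BY timestamp DESC LIMIT 20"):
    print(dict(row))
```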

SQLite Database Queries

Database Overview

List All Tables

.tables

Output:

_health            auth_metrics       metrics
api_keys           discovery_metrics  tool_metrics

Count Metrics by Table

SELECT 'auth_metrics' as table_name, COUNT(*) as count FROM auth_metrics
UNION ALL
SELECT 'tool_metrics', COUNT(*) FROM tool_metrics
UNION ALL
SELECT 'discovery_metrics', COUNT(*) FROM discovery_metrics
UNION ALL
SELECT 'metrics', COUNT(*) FROM metrics;

Sample Output:

table_name         count
-----------------  -----
auth_metrics       212
tool_metrics       183
discovery_metrics  0
metrics            475

Authentication Metrics

Recent Auth Requests

SELECT
    datetime(timestamp) as time,
    server,
    success,
    method,
    duration_ms,
    user_hash,
    error_code
FROM auth_metrics
ORDER BY timestamp DESC
LIMIT 20;

Sample Output:

time                 server               success  method   duration_ms       user_hash  error_code
-------------------  -------------------  -------  -------  ----------------  ---------  ----------
2025-10-02 04:43:22  mcpgw                0        unknown  14.0132130181883             500
2025-10-02 04:43:22  currenttime          0        unknown  13.9779029996134             500
2025-10-02 04:43:22  realserverfaketools  0        unknown  12.8724499954842             500
2025-10-02 04:43:22  sre-gateway          0        unknown  8.54846101719886             500

Auth Success Rate by Server

SELECT
    server,
    COUNT(*) as total,
    SUM(success) as successful,
    ROUND(100.0 * SUM(success) / COUNT(*), 2) as success_pct,
    ROUND(AVG(duration_ms), 2) as avg_ms
FROM auth_metrics
GROUP BY server
ORDER BY total DESC;
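The success-rate query above can be exercised end to end from Python. The rows inserted here are synthetic, purely to show the shape of the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_metrics (server TEXT, success INTEGER, duration_ms REAL)")
# Synthetic data: two mcpgw requests (one failed), one successful currenttime request
conn.executemany(
    "INSERT INTO auth_metrics VALUES (?, ?, ?)",
    [("mcpgw", 1, 10.0), ("mcpgw", 0, 14.0), ("currenttime", 1, 8.0)],
)

query = """
SELECT server,
       COUNT(*) AS total,
       SUM(success) AS successful,
       ROUND(100.0 * SUM(success) / COUNT(*), 2) AS success_pct,
       ROUND(AVG(duration_ms), 2) AS avg_ms
FROM auth_metrics
GROUP BY server
ORDER BY total DESC
"""
for row in conn.execute(query):
    print(row)
# ('mcpgw', 2, 1, 50.0, 12.0)
# ('currenttime', 1, 1, 100.0, 8.0)
```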

Hourly Request Volume (Last 24 Hours)

SELECT
    strftime('%Y-%m-%d %H:00', timestamp) as hour,
    COUNT(*) as requests
FROM auth_metrics
WHERE timestamp > datetime('now', '-24 hours')
GROUP BY hour
ORDER BY hour DESC;

Tool Execution Metrics

Recent Tool Executions

SELECT
    datetime(timestamp) as time,
    tool_name,
    server_name,
    success,
    ROUND(duration_ms, 2) as dur_ms,
    method,
    client_name
FROM tool_metrics
ORDER BY timestamp DESC
LIMIT 20;

Sample Output:

time                 tool_name   server_name          success  dur_ms  method      client_name
-------------------  ----------  -------------------  -------  ------  ----------  -----------
2025-10-02 04:43:22  initialize  mcpgw                0        14.01   initialize  claude-code
2025-10-02 04:43:22  initialize  currenttime          0        13.98   initialize  claude-code
2025-10-02 04:43:22  initialize  fininfo              0        10.47   initialize  claude-code
2025-10-02 04:42:59  initialize  currenttime          0        7.61    initialize  Roo Code
2025-10-02 04:42:59  initialize  mcpgw                0        10.24   initialize  Roo Code

Tool Usage Summary

SELECT
    tool_name,
    COUNT(*) as calls,
    SUM(success) as successful,
    ROUND(AVG(duration_ms), 2) as avg_ms,
    COUNT(DISTINCT client_name) as unique_clients
FROM tool_metrics
GROUP BY tool_name
ORDER BY calls DESC;

Client Usage Statistics

SELECT
    client_name,
    client_version,
    COUNT(*) as calls,
    COUNT(DISTINCT tool_name) as unique_tools,
    COUNT(DISTINCT server_name) as unique_servers
FROM tool_metrics
WHERE client_name IS NOT NULL
GROUP BY client_name, client_version
ORDER BY calls DESC;

Slowest Tool Executions

SELECT
    tool_name,
    server_name,
    ROUND(duration_ms, 2) as duration_ms,
    datetime(timestamp) as time,
    success
FROM tool_metrics
ORDER BY duration_ms DESC
LIMIT 20;

Sample Output:

tool_name                  server_name          duration_ms  time                 success
-------------------------  -------------------  -----------  -------------------  -------
initialize                 mcpgw                637.67       2025-10-02 03:32:51  0
initialize                 fininfo              73.62        2025-10-02 03:08:40  0
initialize                 sre-gateway          45.2         2025-10-02 03:15:49  0
initialize                 sre-gateway          39.86        2025-10-02 03:42:27  0
initialize                 realserverfaketools  36.31        2025-10-02 03:42:27  0

Error Analysis

SELECT
    error_code,
    COUNT(*) as count,
    GROUP_CONCAT(DISTINCT tool_name) as affected_tools
FROM tool_metrics
WHERE success = 0 AND error_code IS NOT NULL
GROUP BY error_code
ORDER BY count DESC;

Tool Discovery Metrics

Recent Discovery Queries

SELECT
    datetime(timestamp) as time,
    query,
    results_count,
    ROUND(duration_ms, 2) as dur_ms,
    ROUND(embedding_time_ms, 2) as embed_ms,
    ROUND(faiss_search_time_ms, 2) as search_ms
FROM discovery_metrics
ORDER BY timestamp DESC
LIMIT 20;

Discovery Performance Analysis

SELECT
    COUNT(*) as total_queries,
    ROUND(AVG(results_count), 2) as avg_results,
    ROUND(AVG(duration_ms), 2) as avg_duration_ms,
    ROUND(AVG(embedding_time_ms), 2) as avg_embedding_ms,
    ROUND(AVG(faiss_search_time_ms), 2) as avg_search_ms
FROM discovery_metrics;

Advanced Queries

Tool Method Distribution

SELECT
    method,
    COUNT(*) as count,
    COUNT(DISTINCT server_name) as servers_using,
    ROUND(AVG(duration_ms), 2) as avg_ms
FROM tool_metrics
WHERE method IS NOT NULL
GROUP BY method
ORDER BY count DESC;

Daily Active Clients

SELECT
    DATE(timestamp) as date,
    COUNT(DISTINCT client_name) as unique_clients,
    COUNT(*) as total_calls
FROM tool_metrics
WHERE client_name IS NOT NULL
GROUP BY DATE(timestamp)
ORDER BY date DESC;

Server Performance Comparison

SELECT
    server_name,
    COUNT(*) as total_calls,
    SUM(success) as successful,
    ROUND(100.0 * SUM(success) / COUNT(*), 2) as success_rate,
    ROUND(AVG(duration_ms), 2) as avg_duration_ms,
    ROUND(MIN(duration_ms), 2) as min_ms,
    ROUND(MAX(duration_ms), 2) as max_ms
FROM tool_metrics
GROUP BY server_name
ORDER BY total_calls DESC;

Time-Based Performance Analysis

SELECT
    strftime('%H', timestamp) as hour_of_day,
    COUNT(*) as requests,
    ROUND(AVG(duration_ms), 2) as avg_duration_ms,
    ROUND(100.0 * SUM(success) / COUNT(*), 2) as success_rate
FROM tool_metrics
GROUP BY hour_of_day
ORDER BY hour_of_day;

OpenTelemetry Metrics

The metrics service exposes OpenTelemetry metrics through two channels:

Prometheus Endpoint

Access raw Prometheus metrics:

curl http://localhost:9465/metrics

Available Metrics:

  • mcp_auth_requests_total - Counter of authentication requests
  • mcp_auth_request_duration_seconds - Histogram of auth request durations
  • mcp_tool_executions_total - Counter of tool executions
  • mcp_tool_execution_duration_seconds - Histogram of tool execution durations
  • mcp_tool_discovery_total - Counter of discovery requests
  • mcp_tool_discovery_duration_seconds - Histogram of discovery durations
  • mcp_protocol_latency_seconds - Histogram of protocol flow latencies
  • mcp_health_checks_total - Counter of health checks
  • mcp_health_check_duration_seconds - Histogram of health check durations
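For quick scripting against the Prometheus endpoint, the text exposition format is easy to parse. This is a minimal sketch that ignores comment lines and assumes label values contain no spaces; the sample lines are illustrative, not real gateway output:

```python
# Minimal parse of the Prometheus text exposition format, as served by /metrics.
sample = """\
# HELP mcp_auth_requests_total Counter of authentication requests
# TYPE mcp_auth_requests_total counter
mcp_auth_requests_total{server="mcpgw",success="true"} 198
mcp_auth_requests_total{server="mcpgw",success="false"} 14
"""

def parse_metrics(text):
    """Return {metric_name_with_labels: float_value}, skipping # comments."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        out[name_and_labels] = float(value)
    return out

metrics = parse_metrics(sample)
print(metrics['mcp_auth_requests_total{server="mcpgw",success="true"}'])  # 198.0
```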

OTLP Export

If configured, metrics are also exported to an OTLP endpoint (e.g., OpenTelemetry Collector).

Configuration in .env:

OTEL_OTLP_ENDPOINT=http://otel-collector:4318

Configuring OpenTelemetry Collector

The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data to multiple backends (AWS CloudWatch, Datadog, New Relic, etc.).

Step 1: Add OTel Collector to Docker Compose

Add this service to your docker-compose.yml:

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./config/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4318:4318"   # OTLP HTTP receiver
      - "4317:4317"   # OTLP gRPC receiver
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
    restart: unless-stopped

Step 2: Create OTel Collector Configuration

Create config/otel-collector-config.yaml:

Basic Configuration (Prometheus Export)

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: mcp_gateway

  logging:
    loglevel: info

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]

Advanced Configuration (Multiple Backends)

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: insert
      - key: service.namespace
        value: mcp-gateway
        action: insert

  # Filter metrics if needed
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - mcp_.*

exporters:
  # Export to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: mcp_gateway

  # Export to AWS CloudWatch
  awscloudwatch:
    region: us-east-1
    namespace: MCP/Gateway
    endpoint: https://monitoring.us-east-1.amazonaws.com

  # Export to Datadog
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com

  # Export to New Relic
  otlphttp/newrelic:
    endpoint: https://otlp.nr-data.net:4318
    headers:
      api-key: ${NEW_RELIC_API_KEY}

  # Export to Grafana Cloud
  otlphttp/grafanacloud:
    endpoint: ${GRAFANA_CLOUD_OTLP_ENDPOINT}
    headers:
      authorization: Basic ${GRAFANA_CLOUD_AUTH}

  # Export to Honeycomb
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

  # Logging for debugging
  logging:
    loglevel: info

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource, filter]
      exporters: [prometheus, awscloudwatch, logging]
      # Add other exporters as needed: datadog, otlphttp/newrelic, etc.

Step 3: Update Metrics Service Configuration

Update your .env file:

# Enable OTLP export to collector
OTEL_OTLP_ENDPOINT=http://otel-collector:4318

Update docker-compose.yml metrics-service environment:

  metrics-service:
    environment:
      - OTEL_OTLP_ENDPOINT=http://otel-collector:4318
      # ... other env vars
    depends_on:
      - metrics-db
      - otel-collector  # Add dependency

Step 4: Configure Prometheus to Scrape OTel Collector

Update config/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Scrape metrics directly from metrics-service
  - job_name: 'mcp-metrics-service'
    static_configs:
      - targets: ['metrics-service:9465']

  # Scrape metrics from OTel Collector
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Step 5: Deploy and Verify

# Restart services
docker compose down
docker compose up -d

# Check OTel Collector logs
docker compose logs -f otel-collector

# Verify metrics are being received
curl http://localhost:8889/metrics | grep mcp_

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "otel-collector")'

Cloud Provider Specific Configurations

AWS CloudWatch

exporters:
  awscloudwatch:
    region: us-east-1
    namespace: MCP/Gateway
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [[service, metric_type]]
        metric_name_selectors:
          - mcp_.*

Required IAM Permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}

Environment Variables:

AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key

Datadog

exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com  # or datadoghq.eu for EU
    host_metadata:
      enabled: true
      hostname_source: config_or_system

Environment Variables:

DATADOG_API_KEY=your-datadog-api-key

Grafana Cloud

exporters:
  otlphttp/grafanacloud:
    endpoint: ${GRAFANA_CLOUD_OTLP_ENDPOINT}
    headers:
      authorization: Basic ${GRAFANA_CLOUD_AUTH}

Setup:

  1. Get the OTLP endpoint from the Grafana Cloud console
  2. Create a service account and get an API key
  3. Base64 encode the credentials: echo -n "instance_id:api_key" | base64
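The encoding step can also be done in Python. The credentials below are placeholders, not real values:

```python
import base64

# Placeholder credentials; substitute your Grafana Cloud instance ID and API key.
instance_id, api_key = "123456", "glc_example_key"
auth = base64.b64encode(f"{instance_id}:{api_key}".encode()).decode()
# `auth` matches what `echo -n "123456:glc_example_key" | base64` prints
print(auth)
```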

Environment Variables:

GRAFANA_CLOUD_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net/otlp
GRAFANA_CLOUD_AUTH=base64_encoded_credentials

New Relic

exporters:
  otlphttp/newrelic:
    endpoint: https://otlp.nr-data.net:4318
    headers:
      api-key: ${NEW_RELIC_API_KEY}

Environment Variables:

NEW_RELIC_API_KEY=your-new-relic-license-key

Troubleshooting OTel Collector

Check Collector Health

# View collector logs
docker compose logs otel-collector

# Check internal metrics
curl http://localhost:8888/metrics

# Verify receivers are active
docker compose exec otel-collector wget -qO- http://localhost:13133/

Common Issues

Metrics not flowing to backend:

# Enable debug logging in collector config
exporters:
  logging:
    loglevel: debug

# Check for export errors in logs
docker compose logs otel-collector | grep -i error

Connection refused to OTLP endpoint:

# Verify collector is reachable from metrics-service
docker compose exec metrics-service ping otel-collector

# Check port is open
docker compose exec metrics-service nc -zv otel-collector 4318

Authentication failures:

# Verify API keys are set
docker compose exec otel-collector env | grep -i key

# Test exporter authentication separately

Best Practices

  1. Use Batch Processor: Reduces network overhead

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
    

  2. Add Resource Attributes: Tag metrics with environment/deployment info

    processors:
      resource:
        attributes:
          - key: environment
            value: ${ENVIRONMENT}
            action: insert
    

  3. Filter Metrics: Only export what you need

    processors:
      filter:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - mcp_auth_.*
              - mcp_tool_.*
    

  4. Enable Health Check: Monitor collector itself

    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
    
    service:
      extensions: [health_check]
    

  5. Set Retention Policies: Configure backend retention based on use case

     • Real-time alerts: 7-30 days
     • Compliance/auditing: 1-2 years
     • General monitoring: 90 days

Example Complete Setup

See config/otel-collector-config.example.yaml for a complete production-ready configuration template.

Grafana Dashboards

Access Grafana dashboards at: http://localhost:3000

Default credentials:

  • Username: admin
  • Password: admin

Pre-configured Dashboards

The MCP Gateway includes pre-configured Grafana dashboards for:

  1. Authentication Metrics
     • Success rates by server
     • Request volume over time
     • Error code distribution
     • Average response times

  2. Tool Execution Metrics
     • Most used tools
     • Client distribution
     • Success rates
     • Performance trends

  3. Discovery Metrics
     • Search query volume
     • Result counts
     • Performance breakdown (embedding vs. FAISS search)

  4. System Health
     • Overall request volume
     • Error rates
     • Performance percentiles (p50, p95, p99)

Prometheus Queries

Access Prometheus at: http://localhost:9090

Sample PromQL Queries:

# Authentication success rate
rate(mcp_auth_requests_total{success="true"}[5m]) / rate(mcp_auth_requests_total[5m])

# Average tool execution duration by server
rate(mcp_tool_execution_duration_seconds_sum[5m]) / rate(mcp_tool_execution_duration_seconds_count[5m])

# Top 5 most used tools
topk(5, sum by (tool_name) (rate(mcp_tool_executions_total[5m])))

# 95th percentile request duration
histogram_quantile(0.95, rate(mcp_auth_request_duration_seconds_bucket[5m]))

Monitoring Best Practices

Key Metrics to Monitor

  1. Authentication Success Rate: Should be >95%
  2. Tool Execution Success Rate: Should be >90%
  3. Average Response Time: Should be <100ms for auth, <500ms for tools
  4. Error Rate: Should be <5%
  5. Discovery Query Performance: Embedding time should be <50ms
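The thresholds above can be checked programmatically. This is a minimal sketch; the observed values are illustrative, not from a live gateway, and the threshold names are made up for this example:

```python
# Thresholds from the list above (percent). success rates must exceed their
# threshold; the error rate must stay below its threshold.
THRESHOLDS = {
    "auth_success_rate": 95.0,
    "tool_success_rate": 90.0,
    "error_rate": 5.0,
}

def within_threshold(name, value):
    if name.endswith("success_rate"):
        return value > THRESHOLDS[name]
    return value < THRESHOLDS[name]

# Illustrative observed values; in practice, compute these from the SQL
# queries earlier in this guide or from Prometheus.
observed = {"auth_success_rate": 97.2, "tool_success_rate": 88.5, "error_rate": 3.1}
alerts = [name for name, value in observed.items() if not within_threshold(name, value)]
print(alerts)  # ['tool_success_rate']
```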

Setting Up Alerts

Configure alerts in Grafana or Prometheus for:

  • Authentication failure rate >10%
  • Tool execution errors >5%
  • Response time p95 >1000ms
  • Discovery query failures

Data Retention

  • SQLite database: 90 days (configurable via METRICS_RETENTION_DAYS)
  • Prometheus: 200 hours (configurable via the --storage.tsdb.retention.time flag)
  • Adjust retention based on storage capacity and compliance requirements
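A SQLite retention sweep reduces to deleting rows older than the cutoff. This is a hypothetical sketch of a METRICS_RETENTION_DAYS-style cleanup against an in-memory table; the actual service's retention job may differ:

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 90

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (timestamp TEXT, value REAL)")
now = datetime(2025, 10, 2)
conn.executemany("INSERT INTO metrics VALUES (?, ?)", [
    ((now - timedelta(days=120)).strftime("%Y-%m-%d %H:%M:%S"), 1.0),  # expired
    ((now - timedelta(days=10)).strftime("%Y-%m-%d %H:%M:%S"), 2.0),   # kept
])

# ISO-style timestamps sort lexicographically, so a string comparison works
cutoff = (now - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%d %H:%M:%S")
deleted = conn.execute("DELETE FROM metrics WHERE timestamp < ?", (cutoff,)).rowcount
conn.commit()
print(deleted, conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0])  # 1 1
```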

Troubleshooting

No Metrics Being Collected

  1. Check metrics service is running:

    docker compose ps metrics-service
    

  2. Verify API keys are configured:

    docker compose logs metrics-service | grep "API key"
    

  3. Check middleware is enabled in auth-server logs:

    docker compose logs auth-server | grep "metrics"
    

Database Connection Issues

# Check database volume
docker volume inspect mcp-gateway-registry_metrics-db-data

# Check database file permissions
docker compose exec metrics-db ls -la /var/lib/sqlite/

# Test database connectivity
docker compose exec metrics-db sqlite3 /var/lib/sqlite/metrics.db "SELECT COUNT(*) FROM metrics;"

OpenTelemetry Export Issues

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check metrics-service OTEL configuration
docker compose logs metrics-service | grep -i otel

Schema Reference

auth_metrics Table

Column       Type     Description
-----------  -------  ---------------------------------
id           INTEGER  Primary key
request_id   TEXT     Unique request identifier
timestamp    TEXT     ISO 8601 timestamp
service      TEXT     Service name (e.g., "auth-server")
duration_ms  REAL     Request duration in milliseconds
success      BOOLEAN  Whether auth was successful
method       TEXT     Auth method used
server       TEXT     MCP server name
user_hash    TEXT     Hashed user identifier
error_code   TEXT     Error code if failed
created_at   TEXT     Record creation time

tool_metrics Table

Column             Type     Description
-----------------  -------  ----------------------------------
id                 INTEGER  Primary key
request_id         TEXT     Unique request identifier
timestamp          TEXT     ISO 8601 timestamp
service            TEXT     Service name
duration_ms        REAL     Execution duration in milliseconds
tool_name          TEXT     Tool or method name
server_path        TEXT     Server path
server_name        TEXT     MCP server name
success            BOOLEAN  Whether execution succeeded
error_code         TEXT     Error code if failed
input_size_bytes   INTEGER  Request payload size
output_size_bytes  INTEGER  Response payload size
client_name        TEXT     Client application name
client_version     TEXT     Client version
method             TEXT     MCP protocol method
user_hash          TEXT     Hashed user identifier
created_at         TEXT     Record creation time

discovery_metrics Table

Column                Type     Description
--------------------  -------  -----------------------------
id                    INTEGER  Primary key
request_id            TEXT     Unique request identifier
timestamp             TEXT     ISO 8601 timestamp
service               TEXT     Service name
duration_ms           REAL     Total query duration
query                 TEXT     Search query text
results_count         INTEGER  Number of results returned
top_k_services        INTEGER  Number of services requested
top_n_tools           INTEGER  Number of tools requested
embedding_time_ms     REAL     Time to generate embeddings
faiss_search_time_ms  REAL     Time for FAISS search
created_at            TEXT     Record creation time

Additional Resources