MCP Performance Optimization Guide

Make your MCP servers fast with caching, connection pooling, and async patterns.

Updated February 2025 · 9 min read

Every millisecond of MCP latency adds to AI response time. Users waiting 10+ seconds for a response will abandon your tool. This guide covers the patterns that make MCP servers fast.

Why Performance Matters

Slow MCP servers create a poor user experience. When an AI assistant calls your tool and waits 5 seconds for a response, the entire conversation feels sluggish. Fast servers = better UX = more usage.

Async Everything

MCP is inherently async. Don't block the event loop:

import aiohttp
import requests

# BAD - blocks the event loop
@server.tool()
def slow_tool():
    result = requests.get("https://api.example.com")  # Blocking!
    return result.json()

# GOOD - non-blocking
@server.tool()
async def fast_tool():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://api.example.com") as response:
            return await response.json()
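Sometimes a dependency only ships a synchronous client and rewriting it isn't an option. In that case you can keep the event loop responsive by pushing the blocking call onto a worker thread with asyncio.to_thread (Python 3.9+). A minimal sketch, with a hypothetical legacy_lookup standing in for the sync-only call:

```python
import asyncio
import time

def legacy_lookup(key: str) -> str:
    # Hypothetical sync-only client call (blocking sleep simulates I/O)
    time.sleep(0.01)
    return f"value-for-{key}"

async def lookup_tool(key: str) -> str:
    # Run the blocking call in a worker thread so the
    # event loop stays free to serve other requests
    return await asyncio.to_thread(legacy_lookup, key)

result = asyncio.run(lookup_tool("config"))
```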

Connection Pooling

Create connections once, reuse them:

import aiohttp
import asyncpg

class MCPServer:
    def __init__(self):
        self.session = None
        self.db_pool = None
    
    async def startup(self):
        # HTTP connection pool
        self.session = aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(limit=100)
        )
        # Database connection pool
        self.db_pool = await asyncpg.create_pool(
            DATABASE_URL, 
            min_size=5, 
            max_size=20
        )
    
    async def shutdown(self):
        await self.session.close()
        await self.db_pool.close()
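To make the reuse concrete, here is a toy acquire/release pool built on asyncio.Queue. This is not how aiohttp or asyncpg implement pooling internally, just the core idea: a fixed set of connections handed out and returned, so many requests share a few connections instead of opening one each:

```python
import asyncio

class SimplePool:
    """Toy connection pool: acquire/release over an asyncio.Queue."""

    def __init__(self, make_conn, size=5):
        self._queue = asyncio.Queue()
        self._make_conn = make_conn
        self._size = size

    async def start(self):
        # Open every connection once, up front
        for _ in range(self._size):
            self._queue.put_nowait(await self._make_conn())

    async def acquire(self):
        # Waits if all connections are currently checked out
        return await self._queue.get()

    def release(self, conn):
        self._queue.put_nowait(conn)

created = 0

async def make_conn():
    # Stand-in for an expensive connect; counts how often it runs
    global created
    created += 1
    return object()

async def main():
    pool = SimplePool(make_conn, size=3)
    await pool.start()
    for _ in range(10):  # 10 requests share the same 3 connections
        conn = await pool.acquire()
        pool.release(conn)
    return created

n_created = asyncio.run(main())
```

Despite ten requests, only three connections are ever created.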

Caching Strategies

In-Memory Cache

For frequently accessed, rarely changing data:

from functools import lru_cache
from cachetools import TTLCache

# Simple LRU cache
@lru_cache(maxsize=1000)
def get_config(key):
    return load_from_database(key)

# TTL cache (synchronous, but cheap enough to use from async code)
cache = TTLCache(maxsize=1000, ttl=300)  # 5 minute TTL

async def cached_fetch(url):
    if url in cache:
        return cache[url]
    
    result = await fetch(url)
    cache[url] = result
    return result
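One caveat with the cache-aside pattern above: if many requests miss the same key at once, they all call fetch before any of them has populated the cache (a cache stampede). A sketch that de-duplicates in-flight fetches with a per-key future, using a stand-in slow_fetch for the backend call:

```python
import asyncio

cache: dict[str, str] = {}
pending: dict[str, asyncio.Future] = {}
fetch_count = 0

async def slow_fetch(key: str) -> str:
    # Stand-in for the real backend call; counts invocations
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0.05)
    return key.upper()

async def cached_fetch(key: str) -> str:
    if key in cache:
        return cache[key]
    if key in pending:
        # Another task is already fetching this key: wait for its result
        return await pending[key]
    fut = asyncio.get_running_loop().create_future()
    pending[key] = fut
    try:
        value = await slow_fetch(key)
    except Exception as exc:
        fut.set_exception(exc)
        raise
    else:
        cache[key] = value
        fut.set_result(value)
        return value
    finally:
        del pending[key]

async def main():
    # 10 concurrent misses on the same key -> only 1 backend call
    return await asyncio.gather(*(cached_fetch("user:1") for _ in range(10)))

results = asyncio.run(main())
```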

Redis for Distributed Caching

When running multiple MCP server instances:

import json

import redis.asyncio as redis

class CachedMCPServer:
    def __init__(self):
        self.redis = redis.Redis(host='localhost', port=6379)
    
    async def get_cached(self, key, fetch_func, ttl=300):
        # Try cache first
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)
        
        # Cache miss - fetch and store
        result = await fetch_func()
        await self.redis.setex(key, ttl, json.dumps(result))
        return result

Batch Operations

Combine multiple requests into one:

@server.tool()
async def get_users_batch(user_ids: list[str]):
    # BAD: N database queries
    # users = [await db.get_user(id) for id in user_ids]
    
    # GOOD: 1 database query
    users = await db.get_users_where_id_in(user_ids)
    return users
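When the backend offers no batched endpoint, the next best thing is issuing the per-item calls concurrently with asyncio.gather, so total latency is roughly one call rather than N. A sketch with a hypothetical fetch_user simulating 50 ms of lookup latency:

```python
import asyncio
import time

async def fetch_user(user_id: str) -> dict:
    # Hypothetical per-user lookup with simulated 50 ms latency
    await asyncio.sleep(0.05)
    return {"id": user_id}

async def fetch_all(user_ids: list[str]) -> list[dict]:
    # All lookups run concurrently; results come back in input order
    return await asyncio.gather(*(fetch_user(u) for u in user_ids))

start = time.perf_counter()
users = asyncio.run(fetch_all(["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start
```

Four sequential lookups would take ~200 ms; the concurrent version finishes in roughly the time of one.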

Streaming Responses

For large results, stream instead of buffering:

import aiofiles
from starlette.responses import StreamingResponse  # or your framework's equivalent

@server.tool()
async def stream_large_file(path: str):
    async def generate():
        async with aiofiles.open(path, 'r') as f:
            async for line in f:
                yield line
    
    return StreamingResponse(generate())
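The payoff is bounded memory: only one chunk is in flight at a time, regardless of file size. A stdlib-only sketch of the same idea (plain open standing in for aiofiles, yielding fixed-size chunks and handing the loop a turn between reads):

```python
import asyncio
import os
import tempfile

async def read_chunks(path: str, chunk_size: int = 4):
    # Memory use stays ~chunk_size no matter how large the file is
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
            await asyncio.sleep(0)  # give the event loop a turn

async def main() -> list[bytes]:
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello world")
    try:
        return [c async for c in read_chunks(path)]
    finally:
        os.unlink(path)

chunks = asyncio.run(main())
```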

Timeout Handling

Don't let slow operations hang forever:

import asyncio

@server.tool()
async def fetch_with_timeout(url: str):
    try:
        return await asyncio.wait_for(
            fetch(url),
            timeout=5.0  # 5 second timeout
        )
    except asyncio.TimeoutError:
        return {"error": "Request timed out"}
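A self-contained demo of the pattern firing, with an artificially slow coroutine in place of fetch:

```python
import asyncio

async def slow_op() -> str:
    await asyncio.sleep(1.0)  # deliberately slower than the timeout
    return "done"

async def guarded():
    try:
        return await asyncio.wait_for(slow_op(), timeout=0.05)
    except asyncio.TimeoutError:
        # Return a structured error instead of hanging the conversation
        return {"error": "Request timed out"}

result = asyncio.run(guarded())
```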

Benchmarking

Measure before optimizing:

import time
import statistics

async def benchmark_tool(tool_func, iterations=100):
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        await tool_func()
        times.append(time.perf_counter() - start)
    
    return {
        "mean": statistics.mean(times) * 1000,  # ms
        "median": statistics.median(times) * 1000,
        "p95": sorted(times)[int(iterations * 0.95)] * 1000,
        "min": min(times) * 1000,
        "max": max(times) * 1000,
    }
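To sanity-check the harness end to end, run it against a stub tool with ~1 ms of simulated work (the copy below trims the stats to mean and median to stay self-contained):

```python
import asyncio
import statistics
import time

async def noop_tool():
    # Stub tool: ~1 ms of simulated work
    await asyncio.sleep(0.001)

async def benchmark_tool(tool_func, iterations: int = 50) -> dict:
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        await tool_func()
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times) * 1000,    # ms
        "median": statistics.median(times) * 1000,
    }

stats = asyncio.run(benchmark_tool(noop_tool))
```

Expect a mean slightly above 1 ms: asyncio.sleep never wakes early, and timer granularity adds a little on top.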

Target Metrics

Metric                 Target    Acceptable
Tool latency (p50)     <100ms    <500ms
Tool latency (p95)     <500ms    <2s
Memory per request     <10MB     <50MB
Connections reused     >90%      >70%

Performance Checklist

  • ☐ All I/O operations are async
  • ☐ Connection pools for HTTP and database
  • ☐ Caching for repeated queries (TTL appropriate)
  • ☐ Batch operations where possible
  • ☐ Timeouts on all external calls
  • ☐ Streaming for large responses
  • ☐ Profiled and benchmarked critical paths


Written by Kai Gritun. Building tools for AI developers.