Headroom: Compress LLM Inputs by 60-95%

┌──────────────────────────────────────────────────────┐
│              Headroom Compression Pipeline            │
│                                                      │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────┐  │
│  │  Tool Output│  │  Log Files  │  │  RAG Chunks  │  │
│  │  (JSON)    │  │  (.log)     │  │  (embeddings)│  │
│  └─────┬──────┘  └──────┬──────┘  └──────┬───────┘  │
│        │                │                 │          │
│        ▼                ▼                 ▼          │
│  ┌───────────────────────────────────────────────┐   │
│  │          Headroom Compressor Engine            │   │
│  │  • Deduplication  • Summarization             │   │
│  │  • Pruning        • Format optimization       │   │
│  └──────────────────────────┬────────────────────┘   │
│                             │ 60-95% fewer tokens    │
│  ┌──────────────────────────▼────────────────────┐   │
│  │              LLM API Call                      │   │
│  │  (Claude Code / Codex / Copilot / Gemini CLI)  │   │
│  └───────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────┘

Headroom pipeline: input → compress → LLM with 60-95% fewer tokens

Introduction #

If you’re paying for LLM API calls in 2026, you’re probably burning 40-70% of your token budget on redundant context: duplicate tool outputs, verbose log files, and bloated RAG chunks that the LLM reads but never uses. Headroom (19,745 GitHub stars) is an open-source tool that sits between your AI agent and the LLM, compressing inputs by 60-95% while aiming to preserve answer quality. It ships as a Python library, a CLI proxy, and an MCP server, and it is compatible with Claude Code, Codex CLI, Copilot, Gemini CLI, and any OpenAI-compatible API. It has a single core dependency, takes about 10 lines to integrate, and its benchmarks report equivalent answers at a fraction of the cost.
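
To make the cost claim concrete, the arithmetic can be sketched as follows. The monthly token volume and the price per million input tokens are hypothetical placeholders, not figures from Headroom or any provider; only the 60% and 95% reduction bounds come from the text above.

```python
def monthly_input_cost(tokens_per_month: int, usd_per_million: float, reduction: float) -> float:
    """Input-token cost after removing a fraction `reduction` of the tokens."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    remaining = tokens_per_month * (1.0 - reduction)
    return remaining / 1_000_000 * usd_per_million


# Hypothetical workload: 500M input tokens per month at $3.00 per million tokens.
TOKENS = 500_000_000
PRICE = 3.00

baseline = monthly_input_cost(TOKENS, PRICE, 0.0)
low = monthly_input_cost(TOKENS, PRICE, 0.60)
high = monthly_input_cost(TOKENS, PRICE, 0.95)
print(f"baseline ${baseline:,.2f} | 60% reduction ${low:,.2f} | 95% reduction ${high:,.2f}")
```

Only input tokens are counted here; output tokens are billed separately and are not affected by input compression.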

What Is Headroom? #

Headroom is a token compression layer for LLM pipelines: it reduces input token counts before the input reaches the model. It is not a general-purpose summarizer but a structural optimizer, with summarization as only one of its strategies. It separates “important signal” from “noisy context” in tool outputs, logs, files, and retrieval-augmented chunks.

Key capabilities:

Input compression — Deduplicate, prune, and summarize tool outputs before LLM consumption
Multi-format support — Handles JSON, logs, markdown, code files, and RAG embeddings
3 deployment modes — Python library, CLI proxy, and MCP server
Model-agnostic — Works with Claude, GPT-4o, Gemini, and any OpenAI-compatible endpoint
Quality-preserving — Benchmarked to produce equivalent answers at 60-95% token reduction
Zero-config start — Ships with sensible defaults; optimize later with custom rules

The project is built with Python, uses minimal dependencies (just tiktoken for token counting), and integrates via standard HTTP APIs. It stores compression state in memory or Redis for multi-session scenarios.
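
As an illustration of what structural optimization (as opposed to summarization) can mean, the sketch below collapses runs of log lines that differ only by timestamp. It is not Headroom's implementation: the function names, the timestamp pattern, and the whitespace-based token estimate (a crude stand-in for tiktoken) are all assumptions made for the example.

```python
import re
from itertools import groupby

# Assumed log format: lines start with an ISO-like timestamp, e.g. "2026-01-05 12:00:01 ".
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*\s+")


def collapse_repeats(log_text: str) -> str:
    """Collapse consecutive lines that differ only by timestamp into one line plus a count."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    out = []
    for message, run in groupby(lines, key=lambda ln: _TIMESTAMP.sub("", ln)):
        n = len(list(run))
        out.append(message if n == 1 else f"{message}  [x{n}]")
    return "\n".join(out)


def rough_tokens(text: str) -> int:
    """Whitespace split as a rough proxy for a real tokenizer."""
    return len(text.split())


sample = "\n".join(
    ["2026-01-05 12:00:0%d INFO retrying connection" % i for i in range(5)]
    + ["2026-01-05 12:00:06 ERROR connection refused"]
)
compact = collapse_repeats(sample)
print(compact)
print(rough_tokens(sample), "->", rough_tokens(compact))
```

Note the trade-off: this sketch discards timestamps entirely, which is acceptable only when the model does not need timing information.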

How Headroom Works #

Headroom operates through a three-stage pipeline:

Stage 1: Input Ingestion #

# Install the library
pip install "headroom-ai[all]"

# Basic compression of a tool output
python -c "
import headroom
result = headroom.compress('''
[Very long JSON output from a tool call... 5000 tokens]
''')
print(f'Original: {result.original_tokens} tokens')
print(f'Compressed: {result.compressed_tokens} tokens')
print(f'Savings: {result.savings_pct}%')
"

Stage 2: Compression Engine #

The compression engine applies multiple strategies:

# Custom compression rules
from headroom import Compressor

# Placeholder inputs; substitute your own data
tool_result_json = '{"status": "ok", "items": []}'
log_content = "INFO worker started\n" * 50
embedded_text = "Retrieved passage text..."
source_code = "def handler(event):\n    return event\n"

compressor = Compressor(
    strategy="balanced",  # "aggressive" | "balanced" | "conservative"
    max_reduction_pct=95,
    min_quality_score=0.85,
    dedup_threshold=0.9,
    summary_length_ratio=0.3,
)

# Apply to mixed inputs
compressed = compressor.compress([
    {"type": "tool_output", "data": tool_result_json},
    {"type": "log_file", "data": log_content},
    {"type": "rag_chunk", "data": embedded_text},
    {"type": "code_file", "data": source_code},
])
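
To give a feel for what a similarity cutoff such as `dedup_threshold=0.9` could mean, here is a minimal near-duplicate filter built on the standard library's `difflib`. This is an assumption about the general technique, not Headroom's actual algorithm; `dedup_chunks` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def dedup_chunks(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a chunk only if it is less than `threshold` similar to every chunk already kept."""
    kept: list[str] = []
    for chunk in chunks:
        is_dup = any(
            SequenceMatcher(None, chunk, prev).ratio() >= threshold for prev in kept
        )
        if not is_dup:
            kept.append(chunk)
    return kept


chunks = [
    "GET /api/users returned 200 in 12ms",
    "GET /api/users returned 200 in 13ms",  # near-duplicate of the first entry
    "POST /api/orders failed with 500: database timeout",
]
print(dedup_chunks(chunks))
```

The pairwise comparison is quadratic in the number of chunks, which is fine for a handful of tool outputs but would need hashing or indexing for large batches.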

Stage 3: LLM Integration #

# Start the proxy server
headroom serve --port 8787 --compressor balanced

# Point your AI agent to the proxy instead of the LLM directly
# Agent -> Headroom Proxy (8787) -> Compressed -> LLM API

# .env: configure which LLM to proxy through
HEADROOM_PROXY_PORT=8787
LLM_ENDPOINT=https://api.anthropic.com/v1/messages
LLM_MODEL=claude-sonnet-4-20250514
LLM_API_KEY=${ANTHROPIC_API_KEY}
COMPRESSION_STRATEGY=balanced
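
The `${ANTHROPIC_API_KEY}` reference in the configuration above relies on variable expansion. The sketch below shows one way such a file could be parsed; `load_env_file` is a hypothetical helper for illustration, and how Headroom itself loads its configuration is not specified in this article.

```python
import os
import re
from typing import Optional

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def load_env_file(text: str, environ: Optional[dict] = None) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments, expanding ${NAME} references."""
    env = dict(os.environ if environ is None else environ)
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Substitute ${NAME} from the environment; unknown names are left as written.
        config[key.strip()] = _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value.strip())
    return config


sample = """\
# .env
HEADROOM_PROXY_PORT=8787
LLM_ENDPOINT=https://api.anthropic.com/v1/messages
LLM_API_KEY=${ANTHROPIC_API_KEY}
COMPRESSION_STRATEGY=balanced
"""
config = load_env_file(sample, {"ANTHROPIC_API_KEY": "sk-placeholder"})
print(config["HEADROOM_PROXY_PORT"], config["COMPRESSION_STRATEGY"])
```

Keeping the key as a reference to an environment variable, rather than a literal value, keeps the secret out of the file itself.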

Installation & Setup #

Quick Start (Library Mode) #

# Install
pip install "headroom-ai[all]"

# One-line compression
python -c "
import headroom
your_long_input = 'Very long context...'  # placeholder: substitute your own text
compressed = headroom.compress(your_long_input)
print(compressed.text)
"

Node.js Setup #

# Install
npm install headroom-ai

Proxy Mode (Recommended for AI Agents) #

# Install and start
pip install "headroom-ai[all]"
headroom serve --host 0.0.0.0 --port 8787

# Test compression
curl -X POST http://localhost:8787/compress \
  -H "Content-Type: application/json" \
  -d '{"input": "Very long context..."}' | jq

# Expected response:
# {
#   "original_tokens": 4523,
#   "compressed_tokens": 891,
#   "savings_pct": 80.3,
#   "compressed_text": "..."
# }
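
The same request can be issued from Python with only the standard library. This sketch assumes the `/compress` endpoint and the response fields shown above; it builds the request without sending it, so it runs without a proxy, and it checks that the example response is internally consistent.

```python
import json
import urllib.request


def build_compress_request(text: str, base_url: str = "http://localhost:8787") -> urllib.request.Request:
    """Build (but do not send) the POST request equivalent to the curl call above."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/compress",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def savings_pct(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed, rounded to one decimal place."""
    return round(100 * (original_tokens - compressed_tokens) / original_tokens, 1)


req = build_compress_request("Very long context...")
print(req.get_method(), req.full_url)

# Sending requires a running proxy:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     payload = json.load(resp)

# The example response above: 4523 tokens in, 891 tokens out.
print(savings_pct(4523, 891))
```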

MCP Server Mode #

# Start as MCP server
headroom mcp-serve --port 9090

# Connect from Claude Code (registers the server under the name "headroom")
claude mcp add --transport http headroom http://localhost:9090

# The MCP server exposes:
# - headroom/compress — Compress text input
# - headroom/benchmark — Run compression benchmark
# - headroom/config — Get/update compression settings
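
MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The sketch below builds such a message for the `headroom/compress` tool listed above. The `input` argument key is an assumption, since the tool's schema is not given here, and a real client would first complete the MCP `initialize` handshake.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def mcp_tool_call(tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Tool name taken from the list above; the "input" key is assumed.
print(mcp_tool_call("headroom/compress", {"input": "Very long context..."}))
```

In practice an MCP client such as Claude Code generates these messages for you; the sketch only shows what travels over the wire.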

Advanced Usage / Production Hardening #

Context-Aware Compression #

Headroom adapts its compression strategy based on the input type. Code-heavy inputs retain more structure, while verbose log files get aggressive pruning.

# Context-aware compression example
from headroom import ContextCompressor

# Placeholder input; substitute a real tool output, log, or source file
input_data = "INFO worker started\n" * 50

ctx_compressor = ContextCompressor(
    code_preserve=0.9,    # Keep 90% of code structure
    log_prune=0.95,       # Prune 95% of repetitive log lines
    json_dedup=0.9,       # Deduplicate similar JSON fields
)

# Auto-detect input type and apply best strategy
result = ctx_compressor.compress(input_data)
print(f"Type detected: {result.input_type}")
print(f"Savings: {result.savings_pct}%")
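
Input-type detection of this kind can be approximated with cheap structural checks. The heuristic below is a sketch under stated assumptions, not Headroom's detector: the category names, the timestamp pattern, and the keyword list are all invented for illustration.

```python
import json
import re

# Assumed log format: lines begin with an optional "[" and an ISO-like timestamp.
_LOG_LINE = re.compile(r"^\[?\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
# A few line-leading keywords that commonly indicate source code.
_CODE_HINT = re.compile(
    r"^\s*(def |class |import |from \S+ import |function |#include|public |fn )", re.M
)


def detect_input_type(text: str) -> str:
    """Return "json", "log", "code", or "text" using simple structural checks."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but is not; keep checking
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    if lines and sum(bool(_LOG_LINE.match(ln)) for ln in lines) / len(lines) >= 0.5:
        return "log"
    if _CODE_HINT.search(stripped):
        return "code"
    return "text"


print(detect_input_type('{"a": [1, 2]}'))
print(detect_input_type("import os\n\ndef main():\n    pass"))
```

The detected type could then select a per-type setting such as `code_preserve` or `log_prune`; misclassification is possible, so a conservative fallback for the "text" case is a sensible default.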

Token Accounting Dashboard #

Track your savings in real time by running the proxy server in dashboard mode.

# Start proxy with built-in monitoring dashboard
headroom serve --port 8787 --dashboard --dashboard-port 3000

# View dashboard at http://localhost:3000
# See real-time token savings, compression ratios, and cost tracking
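
The figures such a dashboard reports reduce to a few running totals. The class below is a hypothetical sketch of that accounting, not Headroom's code, and the price per million tokens is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class SavingsTracker:
    """Running totals of tokens before and after compression."""

    usd_per_million: float  # hypothetical input-token price
    original: int = 0
    compressed: int = 0
    requests: int = 0

    def record(self, original_tokens: int, compressed_tokens: int) -> None:
        """Add one request's token counts to the totals."""
        self.original += original_tokens
        self.compressed += compressed_tokens
        self.requests += 1

    @property
    def savings_pct(self) -> float:
        """Overall percentage of tokens removed across all recorded requests."""
        if self.original == 0:
            return 0.0
        return 100 * (self.original - self.compressed) / self.original

    @property
    def usd_saved(self) -> float:
        """Input-token spend avoided at the configured price."""
        return (self.original - self.compressed) / 1_000_000 * self.usd_per_million


tracker = SavingsTracker(usd_per_million=3.0)
tracker.record(4523, 891)
tracker.record(1000, 400)
print(f"{tracker.requests} requests, {tracker.savings_pct:.1f}% saved, ${tracker.usd_saved:.4f}")
```

The overall percentage is computed from summed tokens rather than by averaging per-request percentages, so large requests weigh more than small ones.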

Integration with Claude Code, Codex CLI, Copilot, and Gemini CLI #

Headroom works with any agent that sends HTTP requests to an LLM API. Here’s how to integrate with popular tools:

Claude Code #

# Method 1: Use as MCP server
headroom mcp-serve --port 9090
# Then register it with Claude Code:
claude mcp add --transport http headroom http://localhost:9090

# Method 2: Route API traffic through the proxy
export ANTHROPIC_BASE_URL=http://localhost:8787
# Claude Code then sends its API requests through Headroom

Codex CLI #

# Point Codex through Headroom proxy
export OPENAI_BASE_URL=http://localhost:8787/v1
codex --model gpt-4o "Fix the auth bug"
# All context goes through Headroom compression first

OpenRouter Aggregation #

# Use Headroom with OpenRouter for multi-model cost savings
headroom serve \
  --proxy https://openrouter.ai/api/v1 \
  --model meta-llama/llama-3.1-405b \
  --compressor balanced

# Headroom compresses inputs, then sends them to OpenRouter
# You pay for compressed tokens, not raw tokens

Benchmarks / Real-World Use Cases #

Compression Benchmarks #

Testing on 100 real-world tool outputs (mix of terminal output, git diffs, file contents, and RAG chunks):
