---
lang: zh
slug: litellm
title: 'LiteLLM: 22,500 Stars'
description: 'LiteLLM (litellm) is an open-source AI gateway providing a single API for 100+ LLMs. Compatible with OpenAI, Anthropic, Ollama, Cohere, Gemini, Bedrock. Covers Docker deployment, virtual keys, load balancing, caching, and production hardening.'
tags: ["open-source"]
date: 2026-05-19 00:00:00+08:00
lastmod: 2026-05-19 00:00:00+08:00
tech_stack: []
application_domain: Llm Frameworks
source_version: ''
licensing_model: Open Source
license_type: MIT
file_size: ''
file_md5: ''
download_url: ''
backup_url: ''
github_repo: 'https://github.com/BerriAI/litellm'
last_maintained: '2026-05-19'
draft: false
categories: ['llm-frameworks']
aliases:
- /posts/litellm/
- /resources/llm-frameworks/litellm-unified-api-tutorial/
- /posts/litellm-unified-api-tutorial/
faqs:
- q: 'What is LiteLLM and what problem does it solve?'
  a: 'LiteLLM is an open-source AI gateway and Python SDK that exposes a single OpenAI-compatible API endpoint to proxy requests to 100+ LLM providers such as OpenAI, Anthropic, Azure, Google Vertex AI, AWS Bedrock, Cohere, and Ollama. It eliminates the operational tax of maintaining separate SDKs, retry logic, and billing dashboards by adding automatic fallbacks, load balancing, virtual keys, and cost tracking through a single config.yaml file.'
- q: 'Can I use LiteLLM with my existing OpenAI SDK code?'
  a: 'Yes, you only change two lines: set base_url to your LiteLLM proxy (e.g. http://localhost:4000) and api_key to a virtual key. All other OpenAI SDK code stays the same, which is the primary reason teams adopt LiteLLM.'
- q: 'What database does LiteLLM require for production?'
  a: 'PostgreSQL 14+ is required for production features including virtual keys, spend tracking, and team management. The proxy can run without a database for basic pass-through routing, but you lose budgeting, key management, and the Admin UI.'
- q: 'How does LiteLLM''s automatic fallback work?'
  a: 'You define fallback chains in config.yaml. If a model returns a 429, 500, or timeout, LiteLLM transparently retries the request against the next model in the chain within the same client request, so the client sees a single response and failover happens automatically (under 500ms in practice).'
- q: 'How does LiteLLM compare to OpenRouter?'
  a: 'LiteLLM is a self-hosted open-source gateway with zero markup and full data control, while OpenRouter is a managed multi-model API that charges 5.5% on credit purchases but requires no infrastructure work. For teams with over $5K/month LLM spend and DevOps capacity, LiteLLM is cheaper long-term; for quick prototyping, OpenRouter deploys faster.'
featureImage: /images/articles/litellm-별-22500개.png
---

{{< resource-info >}}
- Python SDK: `import litellm; completion(...)` in your code, provider-agnostic
- Proxy Server: a self-hosted HTTP gateway at `:4000` that any OpenAI SDK client can point to

The proxy mode is what most production teams use. It adds virtual keys, team management, budget controls, rate limiting, caching, and observability, all configured through a single `config.yaml` file.

---

## How LiteLLM Works

**Request flow:**

1. Your application sends an OpenAI-formatted request to `http://litellm-proxy:4000/v1/chat/completions`
2. LiteLLM validates the virtual key and checks the team's budget and rate limits
3. The router selects the best model deployment based on the configured strategy (latency-based, cost-based, or simple load balancing)
4. If the primary provider returns a 429/5xx, automatic fallback triggers within milliseconds
5. The response streams back in OpenAI format, regardless of which provider handled it
6. Spend, latency, and token count are logged to PostgreSQL; Prometheus metrics are emitted

**Core components:**

| Component | Purpose | External Dependency |
|-----------|---------|---------------------|
| Proxy Server | HTTP API, routing, auth | None (Python/FastAPI) |
| PostgreSQL | Virtual keys, spend logs, team data | Required for production |
| Redis | Rate-limit coordination, caching | Recommended |
| Admin UI | Web dashboard for keys/models | Built-in |

---

## Installation & Setup

### Prerequisites

- Docker 24+ and Docker Compose v2
- PostgreSQL 14+ (local container or managed like DigitalOcean Managed Postgres)
- 2 vCPU / 4 GB RAM minimum for the proxy container

### Step 1: Download the Docker Compose Template

```bash
mkdir -p litellm-gateway && cd litellm-gateway

# Download the official docker-compose.yml
curl -O https://raw.githubusercontent.com/BerriAI/litellm/main/docker-compose.yml

# Create the environment file. The heredoc delimiter is left unquoted so
# that the $(openssl ...) substitutions expand into real random values.
cat > .env << EOF
LITELLM_MASTER_KEY="sk-litellm-admin-$(openssl rand -hex 16)"
LITELLM_SALT_KEY="sk-salt-$(openssl rand -hex 32)"
OPENAI_API_KEY="sk-your-openai-key"
ANTHROPIC_API_KEY="sk-your-anthropic-key"
DATABASE_URL="postgresql://llmproxy:dbpassword9090@db:5432/litellm"
EOF
```
### Step 2: Create config.yaml

```yaml
# litellm_config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
      tpm: 150000
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 200
      tpm: 40000
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 1000
  - model_name: ollama-llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://ollama:11434
    model_info:
      mode: chat
  # Embedding model
  - model_name: text-embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  max_budget: 10000.00
  budget_duration: 30d
  alerting:
    - slack
  alerting_threshold: 300
  global_max_parallel_requests: 200

litellm_settings:
  drop_params: true
  num_retries: 3
  request_timeout: 120
  # Automatic fallbacks
  fallbacks:
    - gpt-4o:
        - claude-sonnet
        - gemini-flash
    - claude-sonnet:
        - gpt-4o
        - gemini-flash
  # Redis caching
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600
  # Observability callbacks
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
```
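The `fallbacks` chains in this config can be read as an ordered list of models to try. The sketch below illustrates that idea in plain Python. It is a simplified model of the behavior, not LiteLLM's actual router code: `ProviderError`, `complete_with_fallback`, and `fake_call` are hypothetical names, the fake provider stands in for real API calls, and per-model retries (`num_retries`) are left out.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Hypothetical stand-in for a 429, 5xx, or timeout from a provider."""


# Mirrors the `fallbacks` entries in the config: model -> ordered backups.
FALLBACKS: Dict[str, List[str]] = {
    "gpt-4o": ["claude-sonnet", "gemini-flash"],
    "claude-sonnet": ["gpt-4o", "gemini-flash"],
}


def complete_with_fallback(
    model: str,
    prompt: str,
    call_model: Callable[[str, str], str],
    fallbacks: Dict[str, List[str]] = FALLBACKS,
) -> Tuple[str, str]:
    """Try `model`, then each configured backup; return (model_used, reply)."""
    errors = []
    for candidate in [model] + fallbacks.get(model, []):
        try:
            return candidate, call_model(candidate, prompt)
        except ProviderError as exc:
            errors.append(f"{candidate}: {exc}")
    raise ProviderError("all models in the chain failed: " + "; ".join(errors))


def fake_call(model: str, prompt: str) -> str:
    """Fake provider call (assumption): pretend gpt-4o is rate limited."""
    if model == "gpt-4o":
        raise ProviderError("429 rate limited")
    return f"[{model}] reply to: {prompt}"


used, reply = complete_with_fallback("gpt-4o", "What is LiteLLM?", fake_call)
print(used)   # claude-sonnet
print(reply)
```

The caller only sees one return value, which matches the point that failover stays invisible to the client: the request for `gpt-4o` is answered by the first backup in its chain.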
### Step 3: Start the Stack

```bash
# Pull and start all services
docker compose up -d

# Verify services are healthy
docker compose ps

# Check proxy logs
docker compose logs -f litellm
```

### Step 4: Test the Gateway

```bash
# Test chat completions
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is LiteLLM?"}]
  }'

# Test embeddings
curl http://localhost:4000/v1/embeddings \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding",
    "input": ["LiteLLM is an AI gateway"]
  }'
```

---

## Integration with Popular Tools

### OpenAI SDK (Python)

```python
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-your-litellm-virtual-key"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain load balancing"}]
)
print(response.choices[0].message.content)
```

### LangChain

```python
from langchain_openai import ChatOpenAI

# `model` is the model_name defined in config.yaml
llm = ChatOpenAI(
    model="claude-sonnet",
    openai_api_key="sk-your-virtual-key",
    openai_api_base="http://localhost:4000"
)

result = llm.invoke("What are the types of LLM gateways?")
print(result.content)
```

### Anthropic SDK (Native Compatibility)

```python
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:4000/anthropic",
    api_key="sk-your-virtual-key"
)

response = client.messages.create(
    model="claude-sonnet",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Compare LiteLLM vs OpenRouter"}]
)
print(response.content[0].text)
```

### Ollama (Local Models)

```yaml
# Add to litellm_config.yaml
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
    model_info:
      mode: chat
```

```bash
# Test local model through LiteLLM
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-llama",
    "messages": [{"role": "user", "content": "Hello local model"}]
  }'
```

### Cohere

```yaml
model_list:
  - model_name: cohere-command
    litellm_params:
      model: cohere/command-r-plus
      api_key: os.environ/COHERE_API_KEY
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-virtual-key")
response = client.chat.completions.create(
    model="cohere-command",
    messages=[{"role": "user", "content": "Summarize this"}]
)
```
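Because the proxy speaks the OpenAI wire format, a client does not strictly need an SDK at all. The sketch below builds the same chat completion request with only the Python standard library. It assumes the proxy from the setup steps is listening on `http://localhost:4000`; `sk-your-virtual-key` is a placeholder, and `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"   # assumed proxy address from the setup steps
VIRTUAL_KEY = "sk-your-virtual-key"   # placeholder, not a real key


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-format chat completion request aimed at the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{PROXY_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {VIRTUAL_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request; needs a running proxy, so it is not called here."""
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("gpt-4o", "Explain load balancing")
print(request.get_method(), request.full_url)
# POST http://localhost:4000/v1/chat/completions
```

Swapping `"gpt-4o"` for any other `model_name` from the config changes the provider without touching the rest of the request.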
…and external API customers:

| Metric | Before LiteLLM | After LiteLLM |
|--------|---------------|---------------|
| Provider SDKs maintained | 4 (OpenAI, Anthropic, Gemini, Ollama) | 1 (OpenAI-compatible) |
| API key management | Shared keys in env vars | Virtual keys per team/customer |
| Cost attribution | Manual CSV export | Per-key spend in real-time UI |
| Outage response | Human-paged, 15-min MTTR | Automatic fallback, <500ms |
| Monthly LLM spend | $8,500 (unoptimized) | $6,200 (-27% with routing) |

### Performance Benchmarks (Self-Hosted, 4 vCPU / 8 GB RAM)

[Benchmark table truncated in the source; only the final cells, 3ms and 8ms, survive.]

**Note:** Gateway overhead excludes LLM API response time. LiteLLM adds a small, predictable latency penalty. For flows where every millisecond matters, deploy the proxy in the same VPC as your application.

---

## Advanced Usage / Production Hardening

### Virtual Keys and Team Management

Virtual keys are …

```bash
# Create a virtual key for the "frontend-team"
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "frontend-team-key",
    "team_id": "frontend-team",
    "models": ["gpt-4o", "gemini-flash"],
    "max_budget": 500.00,
    "budget_duration": "30d",
    "rpm_limit": 100,
    "tpm_limit": 50000,
    "metadata": {
      "service": "customer-chat-widget",
      "env": "production"
    }
  }'

# Response:
# {
#   "key": "sk-litellm-abc123...",
#   "expires": null,
#   "max_budget": 500.00,
#   "models": ["gpt-4o", "gemini-flash"]
# }
```

### Per-Provider Budgets

```yaml
general_settings:
  provider_budget_config:
    openai:
      monthly_budget: 5000.00
    anthropic:
      monthly_budget: 3000.00
    gemini:
      monthly_budget: 1000.00
```

### Latency-Based Routing

```yaml
router_settings:
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60
  allowed_fails: 3
  cooldown_time: 60
  num_retries: 2
  timeout: 90
  retry_after: 5
```

### Security Hardening

```yaml
# Security-hardened config.yaml
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

# Force HTTPS in production:
# run behind Nginx or AWS ALB with TLS termination

# A single litellm_settings block (duplicate YAML keys would override each other)
litellm_settings:
  # Disable verbose logging
  set_verbose: false
  # Encrypt keys at rest
  key_generation_algorithm: "rsa"
  allow_user_auth: false
```

### Kubernetes Deployment with Helm

```bash
# Pull the LiteLLM Helm chart
helm pull oci://docker.litellm.ai/berriai/litellm-helm

# Install with custom values
helm install litellm-gateway ./litellm-helm \
  --namespace litellm \
  --create-namespace \
  --set replicaCount=3 \
  --set ingress.enabled=true \
  --set ingress.hosts[0].host=litellm.yourdomain.com \
  --set env.LITELLM_MASTER_KEY="sk-$(openssl rand -hex 16)" \
  --set env.DATABASE_URL="postgresql://user:pass@neon-host/litellm"
```
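To make the `latency-based-routing` settings from earlier in this section more concrete, here is a simplified sketch of the idea: prefer the deployment with the lowest recent average latency, and bench a deployment for `cooldown_time` seconds once its failures exceed `allowed_fails`. It is an illustration only, not LiteLLM's router. The `LatencyRouter` class, the deployment names, and the exact cooldown rule are assumptions.

```python
import time
from collections import deque


class LatencyRouter:
    """Toy router: lowest average latency wins, failing deployments cool down."""

    def __init__(self, deployments, allowed_fails=3, cooldown_time=60.0,
                 window=10, clock=time.monotonic):
        self.deployments = list(deployments)
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock
        # Keep only the most recent `window` latency samples per deployment.
        self.latencies = {d: deque(maxlen=window) for d in self.deployments}
        self.fails = {d: 0 for d in self.deployments}
        self.cooldown_until = {d: 0.0 for d in self.deployments}

    def record_success(self, name, latency_s):
        self.latencies[name].append(latency_s)
        self.fails[name] = 0

    def record_failure(self, name):
        self.fails[name] += 1
        if self.fails[name] > self.allowed_fails:
            # Too many failures in a row: take the deployment out of rotation.
            self.cooldown_until[name] = self.clock() + self.cooldown_time
            self.fails[name] = 0

    def _average(self, name):
        samples = self.latencies[name]
        return sum(samples) / len(samples) if samples else 0.0

    def pick(self):
        now = self.clock()
        healthy = [d for d in self.deployments if self.cooldown_until[d] <= now]
        if not healthy:
            raise RuntimeError("all deployments are cooling down")
        return min(healthy, key=self._average)


# Demo with a fake clock so the cooldown can be shown without waiting.
now = [0.0]
router = LatencyRouter(["gpt-4o-east", "gpt-4o-west"], clock=lambda: now[0])
router.record_success("gpt-4o-east", 0.9)
router.record_success("gpt-4o-west", 0.3)
first = router.pick()             # west is faster

for _ in range(4):                # 4 failures exceed allowed_fails=3
    router.record_failure("gpt-4o-west")
during_cooldown = router.pick()   # west is benched, east takes over

now[0] = 61.0                     # cooldown_time=60 has elapsed
after_cooldown = router.pick()    # west is eligible again
print(first, during_cooldown, after_cooldown)
```

Deployments with no samples yet average to 0.0 in this sketch, so they are tried first; a production router would likely handle cold starts more carefully.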
### Monitoring with Prometheus + Grafana

```yaml
# Add to config.yaml
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
```

Scrape the proxy's `/metrics` endpoint, then query the exported series:

```promql
# Request rate by model
rate(litellm_request_total_requests[5m])

# Error rate
rate(litellm_requests_total_failed[5m])

# Remaining budget per key
litellm_remaining_requests

# Gateway overhead histogram
histogram_quantile(0.95, litellm_overhead_latency_ms_bucket)
```

Import the official Grafana dashboard (`…_dashboard.json`) for pre-built panels showing requests/sec, token usage, cost per team, and latency percentiles.

---

## Comparison with Alternatives

| Feature | LiteLLM | Portkey | OpenRouter | Helicone |
|---------|---------|---------|------------|----------|
| **License** | MIT (Open Source) | Closed core + Open SDK | Closed (Hosted) | Closed (Hosted + Self-host) |
| **Deployment** | Self-hosted / Docker / K8s | Cloud + Hybrid | Hosted only | Cloud + Self-host |
| **Models supported** | 100+ providers | 200+ | 300+ | Provider-dependent |
| **Self-hosting cost** | $200–800/mo infra | N/A (managed) | N/A (hosted) | $0–100/mo (self-host) |
| **Virtual keys / budgets** | Per-key + per-team | Per-key + per-user | Basic per-key | Per-org |
| **Automatic fallback** | Configurable chains | Circuit breakers | Provider routing | Limited |
| **Semantic caching** | Redis + Qdrant | Built-in | No | No |
| **Observability** | Prometheus + external | Built-in deep traces | Basic usage stats | Primary focus |
| **Compliance** | DIY (SOC2 via infra) | SOC 2, ISO 27001, HIPAA | Partial | SOC 2 |
| **Best for** | Full control, zero lock-in | Enterprise governance | Quick model access | Observability-first |

**When to choose what:**

- **LiteLLM**: You have DevOps capacity, want zero vendor lock-in, and need full control over routing, caching, and data residency.
- **Portkey**: You need enterprise governance (SOC 2, audit logs), a prompt management UI, and are willing to pay SaaS pricing.
- **OpenRouter**: You want instant access to 300+ models with zero infrastructure work, and the 5.5% credit fee is acceptable.
- **Helicone**: Observability is your primary concern; you need detailed tracing and cost attribution across LLM calls.

---

## Limitations / Honest Assessment

LiteLLM …

4. … architect your own multi-region failover with DNS or a global load balancer. LiteLLM is a single-region proxy by default.
5. **Enterprise SSO costs money**: SAML/SSO, audit logs, and advanced guardrails are part of LiteLLM Enterprise. The OSS version handles virtual keys and basic budgets only.

---

## Frequently Asked Questions

**Q: How does LiteLLM compare to OpenRouter?**

LiteLLM is a self-hosted open-source gateway; OpenRouter is a managed multi-model API. LiteLLM gives you zero markup and full control over your data. OpenRouter charges 5.5% on credit purchases but requires zero infrastructure work. For teams with >$5K/month LLM spend and DevOps capacity, LiteLLM is cheaper long-term; for quick prototyping, OpenRouter deploys faster.

**Q: Can I use LiteLLM with my existing OpenAI SDK code?**

Yes, you only change two lines: set `base_url` to your LiteLLM proxy and `api_key` to a virtual key. Everything else stays the same. This is the primary reason teams adopt LiteLLM: zero code changes beyond configuration.

**Q: What database does LiteLLM require?**

PostgreSQL 14+ is required for production features including virtual keys, spend tracking, and team management. The proxy can run without a database for basic pass-through routing, but you lose budgeting, key management, and the Admin UI.

**Q: How does the fallback mechanism work?**

You define fallback chains in `config.yaml`. If a model returns a 429, 500, or timeout, LiteLLM retries the request against the next model in the chain within the same client request, so the client sees a single response.

… HPA for auto-scaling under Kubernetes.

**Q: How do I monitor LiteLLM in production?**

Enable the Prometheus callback in `config.yaml`, scrape the `/metrics` endpoint, and import the official Grafana dashboard. Set alerts on `litellm_requests_total_failed` (error rate) and `litellm_remaining_requests` (budget exhaustion). Wire `success_callback` to Langfuse for per-request tracing.

---

## Conclusion

LiteLLM solves the messy reality … Start with the Docker Compose setup above, add Redis caching, then scale to Kubernetes with Helm as traffic grows.

**Action items:**

1. Clone the [LiteLLM GitHub repo](https://github.com/BerriAI/litellm) and run the Docker Compose quick-start
2. Create virtual keys for each team and set per-key budgets
3. Enable Redis caching and Prometheus monitoring
4. Join the [LiteLLM Discord community](https://discord.gg/wupm9ySymB) for support and feature discussions

*Some links in this article are affiliate links. We may earn a commission if you buy hosting through them; this does not affect the price or our recommendations.*
---
## Sources & Further Reading

- [LiteLLM GitHub Repository](https://github.com/BerriAI/litellm): Official source code, 22,500+ stars
- [LiteLLM Documentation](https://docs.litellm.ai/docs/): Complete proxy and SDK reference
- [LiteLLM Docker Quick Start](https://docs.litellm.ai/docs/proxy/docker_quick_start): Official Docker setup guide
- [LiteLLM Config Reference](https://docs.litellm.ai/docs/proxy/configs): All config.yaml options
- [LiteLLM Helm Deployment](https://docs.litellm.ai/docs/proxy/deploy): Kubernetes and Helm charts
- [LiteLLM Admin UI Docs](https://docs.litellm.ai/docs/proxy/ui): Virtual key and team management
- [LiteLLM Caching Guide](https://docs.litellm.ai/docs/caching/all_caches): Redis, semantic, and disk caching
- [Portkey vs LiteLLM Comparison](https://portkey.ai/lp/portkey-vs-litellm): Vendor comparison page
- [OpenRouter Documentation](https://openrouter.ai/docs): Alternative gateway reference
- [Helicone Documentation](https://docs.helicone.ai): Observability-focused alternative