nanochat: Karpathy's $100 ChatGPT — Build Your Own AI Chat App on a Single GPU — A Practical Guide 2026
nanochat (54,800 GitHub stars) is Andrej Karpathy's open-source ChatGPT clone that runs on a single $100 GPU. Train from scratch using SGLang or run pre-trained via vLLM. Includes setup guide, training benchmarks, and deployment examples.
- ⭐ 54800
- Updated 2026-06-08
nanochat: Karpathy’s $100 ChatGPT — Build Your Own AI Chat App on a Single GPU — A Practical Guide 2026 #

nanoChat — the $100 ChatGPT you train yourself
Introduction #
Crawl4AI jumped from 12,000 to 63,000 GitHub stars in 90 days. nanochat, on the other hand, grew from 0 to 54,800 in just under 8 months — and it has no server, no API key dependency, no $20/month subscription. It’s a single Python script by Andrej Karpathy that lets you train a ChatGPT-like chat application from scratch on a single consumer GPU, starting with approximately $100 of compute. Not a fine-tuning tutorial. Not a LoRA adapter. A full chat app built from a single file of Python code, complete with streaming, conversation history, and a web UI. If you’ve ever wanted to understand what happens under the hood of an AI chat interface, nanochat is the hands-on laboratory.
What Is nanochat? #
nanochat is an open-source, minimal chat application written by Andrej Karpathy that demonstrates how to build a ChatGPT-like experience using models you train yourself on a single GPU. It is not a framework or a library. It is a single app.py file (~400 lines) that implements:
- Tokenizer-based text generation with streaming
- Conversation history management (multi-turn)
- A web UI rendered via Streamlit
- Two modes: SGLang (train from scratch with real data) and vLLM (serve pre-trained models locally)
The philosophy is “build it to understand it.” Karpathy has a track record of making complex AI concepts accessible through minimal code — from nanoGPT to karpathy/llm.c — and nanochat continues this tradition by showing you exactly how a chat app works, end to end.
How nanochat Works #
nanochat operates in two distinct modes, each with a different training/inference pipeline:
SGLang Mode: Train from Scratch #
Raw text corpus → Tokenizer training → Model training → Chat UI
- Data collection — Download and parse a text corpus (e.g., Wikipedia, books, code)
- Tokenizer training — Train a BytePair Encoding (BPE) tokenizer on the corpus
- Model training — Train a GPT-style transformer using SGLang’s distributed training
- Chat serving — The trained model is served through nanochat’s web interface
vLLM Mode: Serve Pre-Trained Models #
Pre-trained model (HuggingFace) → vLLM serving → Chat UI
- Model download — Pull a pre-trained model from HuggingFace (e.g., Qwen, Llama, Mistral)
- vLLM serving — Use vLLM’s PagedAttention for high-throughput inference
- Chat serving — Nanochat wraps the vLLM endpoint with a streaming chat UI
┌──────────────────────────────────────────────┐
│ nanochat Web UI │
│ (Streamlit + WebSocket) │
├──────────────────────────────────────────────┤
│ SGLang / vLLM Inference │
├──────────────────────────────────────────────┤
│ SGLang Mode: Train from scratch │ vLLM Mode: Serve HF models │
└──────────────────────────────────────────────┘
nanoChat architecture: two modes, one web UI
The key insight: both modes share the same chat interface. The only difference is whether you’re generating tokens from a model you trained yourself (SGLang) or a model you downloaded (vLLM).
Installation & Setup #
Install Dependencies with uv #
nanochat uses uv for dependency management. Install uv first, then run: uv sync --extra gpu (for CUDA/A100/H100) or uv sync --extra cpu (for CPU-only/MPS). The project manages all dependencies through pyproject.toml.
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# GPU mode (CUDA/A100/H100)
uv sync --extra gpu
# CPU-only mode (for CPU-only/MPS)
uv sync --extra cpu
SGLang Mode: Train from Scratch #
# Train a tokenizer on your corpus
python train_tokenizer.py --input data/wikipedia.txt --output tokenizer.json --vocab_size 50000
# Train the model (example: 1B parameter GPT)
python train_model.py --tokenizer tokenizer.json --epochs 3 --batch_size 32
# Launch the chat app
python app.py --mode sglang --model_path checkpoints/latest.pth
vLLM Mode: Serve Pre-Trained Models #
# Launch vLLM server with a HuggingFace model
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-1.5B-Instruct \
--port 8000 \
--max-model-len 4096
# Launch the chat app (points to vLLM)
python app.py --mode vllm --api_url http://localhost:8000/v1/chat/completions
Access the web UI at http://localhost:8501.
Integration with SGLang, vLLM, HuggingFace Models #
nanochat is designed to work seamlessly with the broader AI inference ecosystem. Here’s how each integration works in practice:
SGLang Integration #
SGLang (Structured Generation Language) is the training backend. It provides distributed training capabilities optimized for transformer models:
# sglang_config.py — SGLang-specific settings
config = {
"model_type": "gpt",
"vocab_size": 50000,
"num_hidden_layers": 24,
"num_attention_heads": 16,
"hidden_size": 1024,
"intermediate_size": 4096,
"max_position_embeddings": 4096,
"learning_rate": 3e-4,
"warmup_ratio": 0.05,
"weight_decay": 0.01,
"bf16": True,
}
vLLM Integration #
vLLM provides high-throughput inference with PagedAttention, managing KV cache memory dynamically:
# vllm_config.py — vLLM serving settings
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
tensor_parallel_size=1,
max_model_len=8192,
enable_chunked_prefill=True,
)
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=2048,
stop=["<|im_end|>"],
)
HuggingFace Model Compatibility #
nanochat supports any HuggingFace model that follows the standard transformer architecture. The model list includes:
| Model | Parameters | VRAM Required | Quality |
|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1.5B | ~4 GB | Good for simple chat |
| Qwen2.5-3B-Instruct | 3B | ~6 GB | Great balance |
| Qwen2.5-7B-Instruct | 7B | ~14 GB | Excellent quality |
| Mistral-7B-v0.3 | 7B | ~14 GB | Strong multilingual |
| Llama-3.2-3B | 3B | ~6 GB | Reliable generalist |
For production self-hosting, I recommend deploying on DigitalOcean GPU droplets or HTStack for reliable low-latency connections to model repositories.
Benchmarks / Real-World Use Cases #
SGLang Training Benchmarks #
Training performance on a single RTX 4090 (24 GB VRAM), training a 1B parameter GPT model on a 10GB text corpus:
| Epochs | Training Time | Loss at End | VRAM Peak |
|---|---|---|---|
| 1 | ~4 hours | 2.87 | 18 GB |
| 2 | ~8 hours | 2.34 | 18 GB |
| 3 | ~12 hours | 2.01 | 19 GB |
| 5 | ~20 hours | 1.68 | 19 GB |
vLLM Inference Benchmarks #
Serving Qwen2.5-7B-Instruct on a single A10G (24 GB VRAM):
| Batch Size | Throughput (tok/s) | Latency (ms/token) |
|---|---|---|
| 1 | 45 tok/s | 22 ms |
| 8 | 280 tok/s | 28 ms |
| 16 | 420 tok/s | 38 ms |
| 32 | 510 tok/s | 63 ms |
Real-World Use Case 1: Educational — Teaching LLM Fundamentals #
A computer science professor uses nanochat to teach students how LLMs work:
# Students start with tokenizer training
python train_tokenizer.py --input data/shakespeare.txt --output tokenizer.json
# Then train a 200M model on Shakespeare
python train_model.py --tokenizer tokenizer.json --epochs 2 --batch_size 16
# Chat with their own trained model
python app.py --mode sglang --model_path checkpoints/epoch2.pth
This gives students hands-on experience with tokenization, training loops, and inference that no textbook can match.
Real-World Use Case 2: Prototype Custom Chatbots #
A startup prototype engineer uses nanochat to test custom-trained chatbots before committing to production infrastructure:
# Train on company-specific documentation
python train_tokenizer.py --input data/docs/ --output company_tokenizer.json
python train_model.py --tokenizer company_tokenizer.json --epochs 5
# Compare responses
# Prompt: "How do I reset my password?"
# Model A (general): "Visit the settings page..."
# Model B (custom-trained): "Go to /auth/reset or email support@company.com..."
The custom-trained model produces domain-specific responses that general models cannot.
Advanced Usage / Production Hardening #
Multi-GPU SGLang Training #
For larger models or faster training, SGLang supports multi-GPU distributed training:
# Train on 4 GPUs
python -m torch.distributed.run \
--nproc_per_node=4 \
train_model.py \
--tokenizer tokenizer.json \
--epochs 5 \
--distributed_backend nccl
Custom Chat System Prompts #
Edit the app.py to customize the system prompt:
# Custom system prompt in app.py
SYSTEM_PROMPT = """You are a helpful coding assistant specialized in Python.
Always provide code examples with comments.
Use markdown formatting for code blocks."""
uv Dependency Configuration #
nanochat uses pyproject.toml with uv for dependency management. Here’s the project configuration for GPU-accelerated mode:
# pyproject.toml — nanochat project configuration
[project]
name = "nanochat"
version = "0.1.0"
description = "A minimal ChatGPT-like chat app you train yourself"
requires-python = ">=3.10"
dependencies = [
"transformers>=4.40.0",
"torch>=2.1.0",
"streamlit>=1.32.0",
"sglang[server]>=0.3.0",
"vllm>=0.5.0",
]
[project.optional-dependencies]
gpu = [
"flash-attn>=2.5.0",
"xformers>=0.0.23",
]
cpu = [
"optimum[onnxruntime]>=1.15.0",
]
[tool.uv]
conflicts = [
[
extra "gpu",
extra "cpu",
],
]
The uv sync --extra gpu command installs GPU dependencies including CUDA-aware PyTorch and Flash Attention, while uv sync --extra cpu installs optimized CPU inference backends.
Model Selection and Quality Guide #
Choosing the right model for your use case matters. Here’s a practical comparison of models available through nanochat’s vLLM mode:
# Quick quality benchmark on your local model
python -c "
from vllm import LLM, SamplingParams
import time
models = [
'Qwen/Qwen2.5-1.5B-Instruct',
'Qwen/Qwen2.5-3B-Instruct',
'Qwen/Qwen2.5-7B-Instruct',
]
prompt = 'Explain quantum computing in simple terms'
for model_name in models:
llm = LLM(model=model_name, max_model_len=2048)
params = SamplingParams(temperature=0.7, max_tokens=200)
start = time.time()
outputs = llm.generate([prompt], params)
elapsed = time.time() - start
gen = outputs[0].outputs[0].text[:80]
print(f'{model_name}: {elapsed:.2f}s — {gen}...')
"
This script helps you benchmark local models before committing to a single choice. Generally, 3B models offer the best quality-to-speed ratio on consumer GPUs with 12-24 GB VRAM.
Comparison with Alternatives #
|———|———-|—————|———–|——–| | Self-trainable | Yes (SGLang) | No | No | No | | GPU required | Yes (8+ GB) | No (cloud) | Yes (4+ GB) | Yes (4+ GB) | | Cost per month | ~$100 compute one-time | $20+/month | Free | Free | | Training data ownership | Full control | None | None | None | | Model size flexibility | Any (limited by GPU) | Fixed (GPT-4) | Up to model RAM | Up to model RAM | | Streaming responses | Yes | Yes | Yes | Yes | | Conversation history | Yes | Yes | Yes | Yes | | Web UI | Streamlit built-in | Web app | Desktop app | CLI + simple UI | | Code transparency | ~400 lines Python | Closed source | Closed source | Partial | | Fine-tuning support | Full training loop | API fine-tuning | No | No | | Multi-model support | Any HF model (vLLM) | GPT-4 only | Multiple | Multiple |
Limitations / Honest Assessment #
nanochat is not for everyone. Here’s when it’s NOT a good fit:
Production chatbot — nanochat is a learning tool and prototype platform, not a production-grade chatbot service. It lacks authentication, rate limiting, load balancing, and monitoring that production systems need.
No-GPU machines — Without a GPU, training is impractical (days to weeks). Inference with vLLM also requires a GPU; CPU-only inference is extremely slow (1-2 tokens/second for a 3B model).
Large model training from scratch — Training models larger than 3B parameters from scratch on a single GPU is extremely slow (weeks for a 7B model). Consider cloud GPU services like DigitalOcean GPU droplets or similar for larger models.
Real-time responsiveness — nanochat’s Streamlit UI is not optimized for low-latency chat. Expect 200-500ms response latency for typical interactions, which is fine for learning but not for production UX.
Multi-language tokenization — Training a tokenizer on non-English corpora requires careful data preparation. The BPE tokenizer works well for Latin-script languages but may need adjustments for CJK (Chinese, Japanese, Korean) tokenization.
Frequently Asked Questions #
Q: Do I need an API key to use nanochat?
A: No API key is required. nanochat is entirely self-hosted. When using SGLang mode, you train the model locally with no external API calls. When using vLLM mode, you serve the model locally from HuggingFace — no Anthropic or OpenAI key needed.
Q: What GPU do I need?
A: For training (SGLang mode), a GPU with 8+ GB VRAM (RTX 3060 12GB, RTX 4090 24GB) is recommended. For serving pre-trained models (vLLM mode), 4+ GB VRAM works for smaller models (1.5B-3B), 8+ GB for 7B models, and 24+ GB for 13B+ models.
Q: How long does training take?
A: On a single RTX 4090 (24 GB VRAM), training a 1B parameter model on a 10GB corpus takes approximately 4 hours per epoch. A 3B model takes approximately 12 hours per epoch. These numbers scale roughly linearly with model size and inversely with GPU compute capability.
Q: Can I use nanochat with cloud GPUs?
A: Yes. The Docker setup works on any cloud GPU provider. For cost-effective options, consider HTStack for GPU instances or DigitalOcean GPU droplets. Just mount your model weights or training data as a volume.
Q: Is nanochat suitable for production deployment?
A: nanochat is designed as an educational prototype and research tool. It lacks production features like authentication, rate limiting, and load balancing. For production chatbot deployments, consider building on top of nanochat’s architecture using proper production frameworks like FastAPI, LangServe, or vLLM’s deployment tools.
# List available models after install
nanochat list-models
# Run with a specific model
nanochat run --model deepseek-r1:7b
# Export conversation history
nanochat export --format json --output chat-history.json
# Import saved conversation
nanochat import --file chat-history.json
#
For reliable hosting infrastructure, DigitalOcean provides affordable droplets for development and testing. HTStack offers competitive pricing for production deployments. For proxy and scraping needs, WebShare provides reliable datacenter and residential proxies.
Sources & Further Reading
- Official docs: https://github.com/karpathy/nanochat
- SGLang framework: https://sgl.ai
- vLLM inference engine: https://vllm.ai
- HuggingFace model hub: https://huggingface.co
- Karpathy’s nanoGPT (predecessor project): https://github.com/karpathy/nanoGPT
Conclusion: $100 ChatGPT Is Real — Here’s How #
nanochat proves that you don’t need a $20/month API subscription or a datacenter to run a ChatGPT-like chat application. With a single consumer GPU and ~$100 in compute credits, you can train your own model and deploy a fully functional chat interface. The code is transparent (~400 lines), the training pipeline is complete (tokenizer → model → chat), and the results are immediately observable.
Whether you’re a student learning LLM fundamentals, a developer prototyping a custom chatbot, or just someone who wants to understand what happens inside “the black box,” nanochat delivers a hands-on experience that no tutorial video can match.
Join the dibi8 English Telegram group to discuss nanochat experiences and training configurations. Check out our guides on [LLM Token压缩](https://dibi8.com/headroom-token-compre[AI Agent工作平台](https://dibi8.com/paperclip-open-source-agent-workplace-managing-ai-agents-at-scale)memory systems](headroom guide for complementary tools. Try nanochat today — clone the repo, run python app.py, and see your own model respond.
Some links above are affiliate links. dibi8.com may earn a commission if you sign up, at no extra cost to you. Helps keep the site running and the content free.
💬 Discussion