What is Crawl4AI used for?

Crawl4AI is an open-source, asynchronous Python web crawling framework powered by Playwright that turns any website into clean, LLM-ready Markdown. It is built for RAG pipelines, AI agents, and training datasets, and it strips navigation bars, ads, cookie banners, and script tags before producing output.

How do I install Crawl4AI?

Install it with `pip install crawl4ai`, then run `playwright install chromium`. For the synchronous, Selenium-based variant, use `pip install "crawl4ai[sync]"`; for production environments, pull the Docker image with `docker pull unclecode/crawl4ai:latest`.

Does Crawl4AI support LLM-based data extraction without CSS selectors?

Yes. You define a Pydantic schema together with a natural-language instruction, and Crawl4AI’s `LLMExtractionStrategy` has the LLM extract those fields automatically. It supports OpenAI (gpt-4o), Anthropic Claude, Groq/DeepSeek, and locally run Ollama models.

Is Crawl4AI free, and how does it compare with Firecrawl?

Crawl4AI is fully open source (Apache-2.0) and free to self-host with no per-request fees; you pay only for infrastructure. Firecrawl is a managed SaaS API starting at $16/month that comes with an official MCP server. Many production teams use Firecrawl for quick API tasks and Crawl4AI for high-volume, self-hosted pipelines.

How does Crawl4AI reduce LLM token costs during extraction?

Setting `input_format="markdown"` feeds cleaned Markdown to the LLM instead of raw HTML, which often cuts token usage by 60-80%. Its BM25 content filter also ranks text passages against a query and drops low-relevance content before embedding or storage in a vector store.

Crawl4AI: The Complete Guide 2026

What Is Crawl4AI? The Data Infrastructure for the LLM Era

Core Design Philosophy

Crawl4AI is an async Python web crawling framework powered by Playwright. Unlike Scrapy (which excels at raw, large-scale extraction) or BeautifulSoup (which gives you the DOM and leaves cleanup to you), Crawl4AI’s default output is Markdown optimized for LLM consumption.

That means navbars, cookie banners, ads, and script tags are stripped out before you ever see the data. The result? Lower token costs, cleaner embeddings, and higher-quality retrieval in RAG pipelines.
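To make the effect of noise stripping concrete, here is a minimal standard-library sketch. It is not Crawl4AI’s actual cleaning pipeline: the `NOISE_TAGS` set, the `NoiseStripper` class, and the sample HTML are assumptions made up for illustration. It only shows why dropping navigation, script, and footer content shrinks what you later send to an LLM.

```python
from html.parser import HTMLParser

# Tags treated as page "noise" in this illustration (an assumption,
# not the list Crawl4AI itself uses).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class NoiseStripper(HTMLParser):
    """Collect visible text, skipping anything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def strip_noise(html: str) -> str:
    """Return the text of an HTML document with noise tags removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up sample page for the demonstration.
raw = (
    "<html><body><nav><a href='/'>Home</a><a href='/blog'>Blog</a></nav>"
    "<script>trackVisitor();</script>"
    "<h1>Install guide</h1><p>Run the installer, then restart.</p>"
    "<footer>Cookie settings</footer></body></html>"
)
clean = strip_noise(raw)
print(clean)
print(f"{len(raw)} chars of HTML -> {len(clean)} chars of text")
```

Only the heading and the paragraph survive; everything a reader would skip is gone before any token is counted.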

Key Features at a Glance

| Feature | What It Does |
| --- | --- |
| LLM-Ready Markdown | Auto-cleans HTML noise; outputs structured Markdown perfect for LLM ingestion |
| Async Concurrency | `AsyncWebCrawler` handles multiple URLs in parallel for high-throughput jobs |
| JavaScript Rendering | Playwright engine handles React, Vue, and infinite-scroll SPAs natively |
| LLM-Based Extraction | Define a Pydantic schema + natural language instruction; the LLM extracts fields automatically |
| Deep Crawling | BFS/DFS strategies for site-wide recursive crawling |
| Adaptive Crawling | New in v0.8; uses information-foraging algorithms to know when enough data has been collected |
| MCP Integration | Can be registered as a Model Context Protocol tool for Claude, Cursor, and other AI agents |
| Anti-Bot Stealth | Stealth mode + proxy support to reduce detection risk |

Who Should Use It?

- RAG Engineers: Feed documentation sites, blogs, and wikis into vector databases with minimal preprocessing
- AI Agent Developers: Give your agent the ability to “read the web” via a local, controllable tool
- Data Teams: Replace brittle XPath/CSS selectors with natural language extraction commands
- Privacy-Conscious Organizations: Keep all data on-premise; no third-party SaaS dependency

Quick Start: Install, Crawl, and Output Markdown in 5 Minutes

Installation

Option A: pip (recommended for development)

```bash
pip install crawl4ai
playwright install chromium
```

For the synchronous variant (Selenium-based):

```bash
# Quote the extra so shells such as zsh do not expand the brackets.
pip install "crawl4ai[sync]"
```

Option B: Docker (recommended for production/isolated environments)

```bash
docker pull unclecode/crawl4ai:latest
```

Your First Async Crawl

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        # Print the first 1,000 characters of the generated Markdown.
        print(result.markdown[:1000])

if __name__ == "__main__":
    asyncio.run(main())
```

That’s it. Ten lines of code, and you have clean Markdown ready to feed into an embedding model.
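The single-URL crawl above extends naturally to many URLs. The sketch below shows the general pattern of bounded concurrency with `asyncio.Semaphore` and `asyncio.gather`. `fetch_markdown` is a hypothetical stand-in that sleeps instead of touching the network, so the example runs anywhere; in real code you would call the crawler there. Crawl4AI also has its own batch method, `arun_many`, so treat this as an illustration of the idea, not as how the library schedules work internally.

```python
import asyncio


async def fetch_markdown(url: str) -> str:
    """Hypothetical stand-in for a real crawl call; sleeps instead of fetching."""
    await asyncio.sleep(0.01)
    return f"# Page at {url}"


async def crawl_all(urls, max_concurrency=3):
    """Fetch every URL with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch_markdown(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
pages = asyncio.run(crawl_all(urls))
print(len(pages), "pages fetched")
```

Capping concurrency keeps a large URL list from opening more browser pages than the host machine, or the target site, can tolerate.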

CLI Quick Mode

```bash
crwl https://example.com -o markdown
```

Supported outputs: markdown, html, json, links, screenshot.

Advanced: LLM Structured Extraction Without Writing a Single CSS Selector

This is where Crawl4AI shifts from “convenient” to “game-changing.” Instead of maintaining brittle selectors that break when a site redesigns its CSS, you describe what you want in plain English and let the LLM handle extraction.

Example: Extract Pricing Data from OpenAI’s API Page

Step 1. Define your data schema with Pydantic:

```python
from pydantic import BaseModel, Field

class ModelPricing(BaseModel):
    model_name: str = Field(..., description="The name of the model")
    input_cost: str = Field(..., description="Cost per 1M input tokens")
    output_cost: str = Field(..., description="Cost per 1M output tokens")
```

Step 2. Configure the LLM extraction strategy:

```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# ModelPricing is the Pydantic model defined in Step 1.

async def main():
    browser_config = BrowserConfig(verbose=True)

    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Recent releases take provider settings through LLMConfig;
            # older releases accepted provider= and api_token= directly.
            llm_config=LLMConfig(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
            ),
            schema=ModelPricing.model_json_schema(),
            extraction_type="schema",
            instruction=(
                "Extract all mentioned model names along with their input and output token prices. "
                "Format each entry as: {'model_name': 'GPT-4o', 'input_cost': 'US$5.00 / 1M tokens', ...}"
            ),
            input_format="markdown",  # send cleaned Markdown, not raw HTML
            verbose=True,
        ),
        cache_mode=CacheMode.BYPASS,  # always fetch a fresh copy of the page
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            config=run_config,
        )
        # extracted_content holds the LLM's structured output.
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
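LLM output is worth checking before it enters a pipeline. The sketch below assumes `result.extracted_content` arrives as a JSON string containing a list of objects; that shape, the `PricingRow` dataclass, and the sample payload are assumptions for illustration, and the sample values are not real pricing data. It uses only the standard library so it runs without an API key.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PricingRow:
    """Plain-Python mirror of the three fields in the ModelPricing schema."""
    model_name: str
    input_cost: str
    output_cost: str


def parse_rows(extracted_content: str):
    """Split a JSON array into valid PricingRow objects and rejected items."""
    required = [f.name for f in fields(PricingRow)]
    valid, rejected = [], []
    for item in json.loads(extracted_content):
        complete = isinstance(item, dict) and all(
            isinstance(item.get(key), str) for key in required
        )
        if complete:
            valid.append(PricingRow(**{key: item[key] for key in required}))
        else:
            rejected.append(item)
    return valid, rejected


# Made-up payload for the demonstration; the second item lacks output_cost.
sample = json.dumps([
    {"model_name": "model-a", "input_cost": "$1.00 / 1M tokens", "output_cost": "$2.00 / 1M tokens"},
    {"model_name": "model-b", "input_cost": "$0.50 / 1M tokens"},
])
rows, bad = parse_rows(sample)
print(rows)
print(f"{len(bad)} item(s) rejected")
```

Keeping the rejected items, instead of silently dropping them, gives you something to inspect when a site redesign changes what the model returns.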

Supported LLM Providers

| Provider | Example `provider` string | Notes |
| --- | --- | --- |
| OpenAI | `openai/gpt-4o` | Best accuracy; moderate cost |
| Anthropic | `anthropic/claude-sonnet-4-20250514` | Excellent for long-context pages |
| Groq / DeepSeek | `groq/deepseek-r1-distill-llama-70b` | Fast, cost-efficient |
| Local (Ollama) | `ollama/llama3` | Zero external API cost; a local GPU is strongly recommended |

Pro tip: Using input_format="markdown" dramatically reduces token usage versus feeding raw HTML into the LLM, often cutting costs by 60–80%.
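A quick back-of-the-envelope calculation shows what a reduction of that size means in money. Every number below is an assumption chosen for illustration: the page sizes are invented, the price is hypothetical, and four characters per token is only a rough rule of thumb for English text. Use a real tokenizer for anything you will bill against.

```python
def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic, not a tokenizer)."""
    return round(char_count / chars_per_token)


def estimate_cost(tokens: int, usd_per_million: float) -> float:
    """Input cost in USD for a given token count and per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


# Hypothetical inputs for one page.
html_chars = 200_000       # raw HTML size
markdown_chars = 50_000    # same page after cleaning to Markdown
price_per_million = 5.00   # hypothetical USD price per 1M input tokens

html_tokens = estimate_tokens(html_chars)
markdown_tokens = estimate_tokens(markdown_chars)
saving = 1 - markdown_tokens / html_tokens

print(f"raw HTML: {html_tokens} tokens, ${estimate_cost(html_tokens, price_per_million):.4f}")
print(f"Markdown: {markdown_tokens} tokens, ${estimate_cost(markdown_tokens, price_per_million):.4f}")
print(f"saving:   {saving:.0%}")
```

With these made-up sizes the saving comes out at 75%; your own ratio depends entirely on how much markup and boilerplate the target pages carry.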

Deep Crawling and Content Filtering: From Single Page to Entire Sites

BFS Deep Crawl (Site-Wide, 2 Levels)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # follow links up to two levels from the start URL
            include_external=False   # stay on the starting domain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # With a deep crawl strategy set, arun returns a list of results.
        results = await crawler.arun("https://docs.crawl4ai.com/", config=config)
        print(f"Total pages crawled: {len(results)}")

        for r in results[:5]:
            print(f"URL: {r.url} | Depth: {r.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
```
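If you want to see what `max_depth` and `include_external` mean without running a browser, the traversal itself is easy to sketch. The version below walks an in-memory link map instead of real pages; the `LINKS` dictionary and the `bfs_crawl` function are hypothetical and are not Crawl4AI’s implementation.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical site map: page URL -> links found on that page.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": [
        "https://docs.example.com/a/deep",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/b": [],
    "https://docs.example.com/a/deep": ["https://docs.example.com/a/deeper"],
}


def bfs_crawl(start, max_depth=2, include_external=False):
    """Return (url, depth) pairs in breadth-first order, visiting each URL once."""
    start_host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if not include_external and urlparse(link).netloc != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


for url, depth in bfs_crawl("https://docs.example.com/"):
    print(depth, url)
```

The external link and the page three levels down are both skipped, which is the behaviour the two parameters are meant to give you.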

BM25 Content Filtering for RAG Pipelines

When building a knowledge base, you often don’t need the entire page, only the passages relevant to a query. Crawl4AI’s BM25 filter solves this:

```python
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Named bm25_filter to avoid shadowing Python's built-in filter().
bm25_filter = BM25ContentFilter(
    user_query="async crawler configuration methods",
    bm25_threshold=0.1,  # chunks scoring below this are dropped
)
```

This filter ranks every text chunk on the page against your query and drops low-relevance content before you ever pay for embeddings or vector storage.
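The ranking idea is easier to reason about with the scoring written out. Below is a small from-scratch BM25 (Okapi) scorer using common default parameters `k1=1.5` and `b=0.75`. It is a sketch of the general algorithm, not Crawl4AI’s code: the tokenizer, the function names, and the sample chunks are assumptions, and the scores are not on the same scale as the library’s threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def filter_chunks(query, chunks, threshold):
    """Keep only the chunks whose BM25 score reaches the threshold."""
    scored = zip(chunks, bm25_scores(query, chunks))
    return [chunk for chunk, score in scored if score >= threshold]


# Made-up page chunks for the demonstration.
chunks = [
    "Configure the async crawler with a run config object.",
    "Subscribe to our newsletter for weekly updates.",
    "Accept cookies to continue browsing.",
]
kept = filter_chunks("async crawler configuration", chunks, threshold=0.1)
print(kept)
```

Chunks that share no terms with the query score exactly zero, so even a low threshold removes newsletter prompts and cookie notices.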

Head-to-Head: Crawl4AI vs Firecrawl vs ScrapeGraphAI vs Scrapy (2026)

| Dimension | Crawl4AI | Firecrawl | ScrapeGraphAI | Scrapy |
| --- | --- | --- | --- | --- |
| GitHub Stars | 63k+ | 78k+ | 23k+ | 50k+ |
| Deployment | Self-hosted / Docker | SaaS API + open-source | Open-source Python | Open-source framework |
| LLM Extraction | Native | Supported | Core feature (graph traversal) | Manual integration |
| Output | Markdown / JSON | Markdown / JSON | JSON | JSON / CSV / XML |
| JS Rendering | Playwright (built-in) | Supported | Limited | Requires plugins |
| Cost | Free (infra only) | $16+/mo (hosted plan) | Free | Free |
| MCP Support | Community integrations | Official MCP server | None | None |
| Learning Curve | Low–Medium | Very Low | Low | High |

Which One Should You Choose?

- Crawl4AI → Best for teams that want full control, zero per-request fees, and deep Python integration. You trade convenience for flexibility.
- Firecrawl → Best for rapid prototyping and teams that prefer managed infrastructure. The official MCP server is a big plus for AI agent stacks.
- ScrapeGraphAI → Best when your primary need is graph-based, natural-language-driven discovery of related data across a site.
- Scrapy → Still the king for industrial-scale crawling (millions of pages, distributed queues, middleware pipelines). Not AI-native, but battle-tested for over a decade.

Hybrid recommendation: Use Firecrawl for quick API-based tasks and Crawl4AI for high-volume, self-hosted pipelines. Many production teams run both.

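Production Deployment and Performance Tuning

Docker with FastAPI and JWT Authentication

Deploy Crawl4AI as an internal microservice. The port, the token variable, and the request fields have changed between image versions, so confirm them against the documentation for the tag you pull. The commands below assume the image serves on port 11235.

```bash
docker run -p 11235:11235 \
  -e CRAWL4AI_API_TOKEN=your_jwt_secret \
  unclecode/crawl4ai:latest
```

Call it from your application:

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Authorization: Bearer your_jwt_secret" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "output_format": "markdown"}'
```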
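From Python, the same call can be assembled with the standard library. The sketch below only builds the request object, so it runs without a server. The payload mirrors the curl example in this article, and the endpoint schema may differ in your image version. The `build_crawl_request` helper and the `crawl4ai.internal` hostname are hypothetical.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, token: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /crawl endpoint."""
    # Payload fields copied from the curl example; verify them for your version.
    payload = {"url": target_url, "output_format": "markdown"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical internal hostname; replace with your service address and port.
request = build_crawl_request("http://crawl4ai.internal", "your_jwt_secret", "https://example.com")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```

Setting `Content-Type: application/json` explicitly matters because a JSON API will typically reject a body sent with a form-encoded content type.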

Proxy and Concurrency Configuration

For production-scale crawling, configure proxy rotation and headless browser pools:

```python
import os
from crawl4ai import BrowserConfig

# Proxy credentials come from the environment, not from source code.
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": os.getenv("PROXY_SERVER"),
        "username": os.getenv("PROXY_USERNAME"),
        "password": os.getenv("PROXY_PASSWORD"),
    },
    verbose=True
)
```
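The configuration above covers a single proxy. One simple way to rotate is round-robin over a pool, handing each new browser configuration the next entry. The sketch below shows only that selection logic; the proxy hosts and credentials are made up, and `proxy_rotation` is a hypothetical helper, not a Crawl4AI function.

```python
from itertools import cycle

# Hypothetical proxy pool; in practice, load this from configuration or secrets.
PROXIES = [
    {"server": "http://proxy-1.example.net:8080", "username": "user", "password": "secret"},
    {"server": "http://proxy-2.example.net:8080", "username": "user", "password": "secret"},
    {"server": "http://proxy-3.example.net:8080", "username": "user", "password": "secret"},
]


def proxy_rotation(proxies):
    """Yield proxy_config dicts forever, in round-robin order."""
    if not proxies:
        raise ValueError("proxy pool is empty")
    for proxy in cycle(proxies):
        yield dict(proxy)  # copy so callers cannot mutate the pool


rotation = proxy_rotation(PROXIES)
assigned = [next(rotation)["server"] for _ in range(4)]
print(assigned)
```

Each dictionary has the same `server`, `username`, and `password` keys as the `proxy_config` shown above, so the next value from `rotation` can be passed straight into a new `BrowserConfig`.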

Troubleshooting Common Issues

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Empty output | SPA hasn’t finished rendering | Use `wait_until="networkidle"` or inject a delay |
| Blocked by anti-bot | Fingerprinting detection | Enable stealth mode; rotate residential proxies |
| LLM extraction times out | Page too large for context window | Pre-filter with CSS selectors before LLM extraction |
| Playwright install fails | Chromium download blocked | Point `PLAYWRIGHT_DOWNLOAD_HOST` at a reachable mirror |
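For the “page too large” case, another generic option is to split the text into overlapping chunks before sending it to the model. The sketch below splits on whitespace by word count. `chunk_words` is a hypothetical helper, and word counts only approximate tokens, so leave headroom below the real context limit.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words, chunked four at a time with one word of overlap.
sample_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample_text, max_words=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the price of sending a few words twice.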


Final Thoughts and Recommended Next Steps

Crawl4AI is not a universal replacement for every scraping need. But in the specific domain of “feeding web data into LLMs,” it is the most focused, fastest-growing, and community-validated tool available today.

If you are building…

- A chatbot knowledge base → Pair Crawl4AI with Milvus, Chroma, or Weaviate for a fully local RAG stack.
- Training datasets → Use deep crawling + BM25 filtering to curate high-quality, domain-specific corpora.
- AI agents → Register Crawl4AI as an MCP tool and give your agent autonomous web-reading capabilities.

Recommended action plan:

1. Run the quick-start example from the “Your First Async Crawl” section on your target domain.
2. Inspect the Markdown quality. If it’s clean enough for your use case, proceed.
3. Set up LLM extraction with a Pydantic schema and compare accuracy against your legacy CSS-selector pipeline.
4. Deploy via Docker and benchmark throughput against your volume requirements.
5. Revisit the comparison table in the “Head-to-Head” section to decide if you need a hybrid setup with Firecrawl or Apify.

References

Published 2026-05-19. Data sourced from GitHub, official docs, and publicly available benchmarks. Crawl4AI iterates rapidly; always cross-check with the latest documentation.
