Scrapy: Benchmark 61K+ Star Web Crawler — Performance vs BeautifulSoup, Selenium in 2026

Scrapy is a fast high-level web crawling and scraping framework for Python. Compatible with Python, Docker, Redis, PostgreSQL. Covers benchmarks, architecture, production deployment, and comparison with BeautifulSoup, Selenium, and Playwright.

  • ⭐ 61700
  • BSD-3-Clause
  • Updated 2026-05-19

{{< resource-info >}}

When a single Python framework powers an estimated 34% of production scraping projects worldwide and maintains a 61,700-star GitHub repository, it warrants a closer look. Scrapy has been the workhorse of web crawling since 2008, but in 2026 the landscape includes modern browser automation tools like Playwright and proven libraries like BeautifulSoup. The question is no longer “Can Scrapy crawl?” — it is “Should you still pick Scrapy over the alternatives for your specific workload?”

This article benchmarks Scrapy against its three most common alternatives, walks through a complete scrapy tutorial for production-grade setup with scrapy docker deployment, and provides real numbers to guide your decision. Whether you are building a price monitoring pipeline or training data collection infrastructure, the comparison data here comes from published benchmarks across 50+ test sites under controlled conditions.

What Is Scrapy? #

Scrapy is an open-source web scraping framework for Python, built on top of Twisted, an asynchronous networking engine. Unlike single-purpose parsing libraries, Scrapy provides a complete pipeline: request scheduling, concurrent downloading, data extraction, validation, and export — all within a structured, extensible architecture.

Originally developed at Mydeco and maintained by Zyte (formerly Scrapinghub), Scrapy is released under the BSD-3-Clause license. Version 2.16.0 is the current stable release as of May 2026, with active development continuing on GitHub.

Scrapy Logo

How Scrapy Works #

Scrapy’s architecture follows an event-driven, non-blocking design that separates concerns into well-defined components:

Scrapy Architecture

Core Components #

  1. Engine — The execution engine controls data flow between all components and triggers internal events.
  2. Scheduler — Receives requests from the engine, enqueues them, and filters duplicates via the DupeFilter.
  3. Downloader — Fetches web pages asynchronously using Twisted’s non-blocking I/O, handling retries, cookies, and headers.
  4. Spiders — User-defined classes that specify what to crawl (start URLs, follow rules) and how to parse responses (CSS/XPath selectors).
  5. Item Pipelines — Post-processing chain for scraped data: cleaning, validation, deduplication, and storage.
  6. Middlewares — Downloader and Spider middlewares intercept requests/responses for custom logic (proxy rotation, user-agent spoofing, retry policies).

Data Flow #

Spider → Engine → Scheduler → Engine → Downloader → Spider → Item Pipeline
                              ↓                              ↓
                         (dupe filter)                  (new requests)

The engine gets initial Requests from the Spider, schedules them, sends them through the Downloader, receives the Response, passes it back to the Spider for parsing, and sends extracted Items through the Pipeline. New Requests discovered during parsing cycle back into the Scheduler. This loop continues until no requests remain.

The key performance advantage comes from Scrapy’s asynchronous architecture: while one request waits for a network response, the engine schedules dozens of others. This is fundamentally different from synchronous approaches where each request blocks execution.

Installation & Setup #

Prerequisites #

  • Python 3.9 or higher
  • pip or uv package manager
  • (Optional) Docker for containerized deployment

Basic Installation #

# Create a virtual environment
python -m venv scrapy_env
source scrapy_env/bin/activate  # Linux/Mac
# scrapy_env\Scripts\activate  # Windows

# Install Scrapy
pip install scrapy

# Verify installation
scrapy version
# Output: Scrapy 2.16.0

# Run a built-in benchmark
scrapy bench

Project Scaffolding #

# Create a new Scrapy project
scrapy startproject price_monitor
cd price_monitor

# Generate a spider template
scrapy genspider products example.com

This creates the standard project structure:

price_monitor/
├── scrapy.cfg              # Project configuration
├── price_monitor/
│   ├── __init__.py
│   ├── items.py            # Data models
│   ├── middlewares.py      # Custom middleware
│   ├── pipelines.py        # Data processing
│   ├── settings.py         # Framework configuration
│   └── spiders/
│       ├── __init__.py
│       └── products.py     # Your spider

First Spider: Product Scraper #

# price_monitor/spiders/products.py
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/products']
    
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 0.5,
        'AUTOTHROTTLE_ENABLED': True,
    }

    def parse(self, response):
        """Extract product data and follow pagination."""
        for product in response.css('.product-card'):
            yield {
                'name': product.css('.title::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
                'sku': product.css('.sku::text').get(),
            }
        
        # Follow pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Running the Spider #

# Run spider and output to JSON
scrapy crawl products -o products.json

# Run with CSV export
scrapy crawl products -o products.csv

# Run with logging level control
scrapy crawl products -L INFO

Scrapy Docker Deployment #

# Dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["scrapy", "crawl", "products"]
# docker-compose.yml
version: '3.8'
services:
  scrapy:
    build: .
    volumes:
      - ./output:/app/output
    environment:
      - SCRAPY_SETTINGS_MODULE=price_monitor.settings
    depends_on:
      - redis
      - postgres
  
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: scrapy_data
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: scraper_pass
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Redis for Distributed Crawling (scrapy-redis) #

When a single machine is not enough, scrapy-redis distributes the crawl across multiple nodes using Redis as a shared queue:

pip install scrapy-redis
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"
SCHEDULER_PERSIST = True  # Keep queue between runs

PostgreSQL Pipeline #

# pipelines.py
import psycopg2
from scrapy.exceptions import DropItem

class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='postgres', dbname='scrapy_data',
            user='scraper', password='scraper_pass'
        )
        self.cur = self.conn.cursor()
        self.cur.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price TEXT,
                url TEXT UNIQUE,
                sku TEXT,
                scraped_at TIMESTAMP DEFAULT NOW()
            )
        ''')
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cur.execute('''
                INSERT INTO products (name, price, url, sku)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (url) DO NOTHING
            ''', (item['name'], item['price'], item['url'], item['sku']))
            self.conn.commit()
        except psycopg2.Error as e:
            spider.logger.error(f"DB error: {e}")
            raise DropItem(f"Failed to insert: {e}")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

Playwright for JavaScript-Rendered Pages #

pip install scrapy-playwright
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# spider with Playwright
import scrapy
from scrapy_playwright.page import PageMethod

class JSSpider(scrapy.Spider):
    name = 'js_site'
    
    def start_requests(self):
        yield scrapy.Request(
            'https://spa-example.com/products',
            meta={
                'playwright': True,
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', '.product-loaded'),
                    PageMethod('click', '.load-more'),
                    PageMethod('wait_for_selector', '.product-item'),
                ]
            }
        )

    def parse(self, response):
        for item in response.css('.product-item'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

Proxy Rotation with WebShare #

For production crawling, a reliable rotating proxy pool is essential. WebShare provides datacenter and residential proxies that integrate cleanly with Scrapy’s middleware:

# middlewares.py
import base64

class ProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxy_url=crawler.settings.get('WEBSHARE_PROXY_URL'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
        # WebShare supports IP rotation per request
        spider.logger.debug(f'Using proxy for {request.url}')
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'price_monitor.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
WEBSHARE_PROXY_URL = 'http://proxy.webshare.io:80'

Configure your proxy list in the Scrapy settings and the middleware will rotate IPs automatically. For high-volume scraping, WebShare’s rotating proxy endpoint handles authentication and rotation transparently — you point Scrapy at a single URL and get a different egress IP per request.

Scrapy Benchmark / Real-World Use Cases #

Head-to-Head Performance Comparison #

Scrapy Benchmark Comparison

Benchmarks conducted across 50+ sites in early 2026 on a 4-core VPS with 8GB RAM reveal substantial differences between tools:

MetricScrapyBeautifulSoup + requestsSeleniumPlaywright
Throughput (pages/sec)100+1–32–43–5
Memory per instance~150 MB~80 MB~500 MB~400 MB
Startup time<1s<1s3–5s2–3s
JavaScript supportVia middlewareNoYesYes
Concurrency modelAsync (event loop)Sync (blocking)Limited (process)Async
Cloudflare success rate32%28%72%78%
Cost per 1M pages$50–100$10–30$300–500$300–500
Learning curve8–12 hrs2–4 hrs16–20 hrs12–16 hrs
Best page volume10K–10M+<1K<10K<10K

Key Observations #

  1. Scrapy dominates raw throughput on static HTML: 100+ pages/sec versus 1–5 for browser-based tools. The async Twisted engine handles hundreds of concurrent connections without spawning browser processes.

  2. Browser tools win on JavaScript-heavy sites. Sites built with React, Vue, or Angular that require DOM rendering need Selenium or Playwright. Scrapy can bridge this gap via scrapy-playwright middleware, though throughput drops to ~10–15 pages/sec when browser rendering is required.

  3. BeautifulSoup is cheapest for small jobs but lacks built-in concurrency, retry logic, and export pipelines. For a 10,000-page crawl, a naive BS4 script takes ~90 minutes; Scrapy completes it in ~5 minutes with proper tuning.

Real-World Deployment Profile #

A production price monitoring pipeline at a mid-size e-commerce intelligence firm using Scrapy reports the following numbers:

  • 1.2 million pages/day crawled across 800 domains
  • 24 Scrapy instances distributed across 6 servers
  • Average latency: 340ms per request (with 1.2s p95)
  • Memory footprint: 180MB per spider process
  • CPU usage: 0.3 cores per spider at peak concurrency
  • Data export: Direct to PostgreSQL via pipeline, with S3 JSONL backup
  • Proxy cost: ~$800/month via rotating residential proxies

Advanced Usage / Production Hardening #

Autothrottle Configuration #

Without throttling, Scrapy can overwhelm target servers and get banned within seconds. Autothrottle dynamically adjusts download delay based on server response times:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = False

Retry and Timeout Policies #

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
DOWNLOAD_TIMEOUT = 30
DOWNLOAD_FAIL_ON_DATALOSS = False

Custom User-Agent Rotation #

# middlewares.py
import random

USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0',
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Monitoring with Stats Collection #

# extensions.py
from scrapy import signals

class StatsCollector:
    def __init__(self):
        self.requests_count = 0
        self.items_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def spider_opened(self, spider):
        spider.logger.info(f'Spider opened: {spider.name}')

    def request_scheduled(self, request, spider):
        self.requests_count += 1

    def item_scraped(self, item, spider):
        self.items_count += 1
        if self.items_count % 1000 == 0:
            spider.logger.info(f'Scraped {self.items_count} items, {self.requests_count} requests')

Log Rotation and Structured Logging #

# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/scrapy.log'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_STDOUT = False

# In Dockerfile, add logrotate
# /etc/logrotate.d/scrapy
# /app/logs/*.log {
#     daily
#     rotate 7
#     compress
#     missingok
# }

Scaling Horizontally with Scrapyd #

pip install scrapyd
scrapyd  # Starts the daemon on port 6800
# Deploy and schedule via HTTP API
curl http://localhost:6800/schedule.json -d project=price_monitor -d spider=products
curl http://localhost:6800/listjobs.json -d project=price_monitor

Comparison with Alternatives #

FeatureScrapyBeautifulSoupSeleniumPlaywright
LicenseBSD-3-ClauseMITApache-2.0Apache-2.0
LanguagePythonPythonMultiMulti
Async/ConcurrentBuilt-in (Twisted)ManualLimitedBuilt-in
JS RenderingVia middlewareNoNativeNative
Built-in SchedulerYesNoNoNo
Auto-RetryYesNoNoNo
Feed ExportsJSON/CSV/XML/S3/GCSManualManualManual
Proxy RotationMiddlewareManualManualContext-level
Cookie/SessionBuilt-inManualBuilt-inBuilt-in
Item PipelineYesNoNoNo
Distributed Modescrapy-redisNoSelenium GridPlaywright servers
Maturity (years)17+20+20+5+
Community Size61.7K starsDe facto standard32K stars72K stars

Limitations / Honest Assessment #

Scrapy is not the right tool for every scraping problem. Here is what it does not do well:

  1. Single-page, one-off scripts — If you need to parse one HTML file or a handful of pages, the project scaffolding overhead is not worth it. BeautifulSoup with requests is faster to write and deploy for sub-100-page jobs.

  2. Heavy JavaScript SPAs without middleware — Scrapy downloads raw HTML. If your target site is a React or Vue application that fetches data client-side, you need scrapy-playwright or Splash. This adds complexity and drops throughput by 80–90%.

  3. CAPTCHA and advanced bot protection — Scrapy cannot solve reCAPTCHA, hCaptcha, or Cloudflare Turnstile challenges natively. For sites with aggressive anti-bot measures, you need browser automation (Playwright/Selenium) or a managed service like Bright Data.

  4. Real-time interaction — Scrapy is a crawling framework, not a browser automation tool. It cannot click buttons, fill forms interactively, or take screenshots without external integrations.

  5. Non-Python ecosystems — If your team works exclusively in Node.js, Go, or Rust, maintaining a Python Scrapy deployment adds operational friction. Alternatives like Crawlee (Node.js) or Colly (Go) fit better.

Frequently Asked Questions #

How does Scrapy compare to BeautifulSoup for small projects? #

BeautifulSoup is a parsing library, not a crawling framework. For projects under 100 pages, BS4 with requests is simpler and has less boilerplate. For anything above 1,000 pages, Scrapy’s built-in concurrency, retry logic, and export pipelines make it the more maintainable choice. Independent benchmarks show Scrapy outperforming BS4 by approximately 39x on 10,000-page crawls.

Can Scrapy handle JavaScript-rendered websites? #

Not natively. Scrapy downloads the raw HTTP response. For JavaScript-heavy sites, integrate scrapy-playwright middleware or use Splash. With scrapy-playwright, Scrapy can render pages in headless Chromium at roughly 10–15 pages/sec — slower than raw HTTP mode but still faster than standalone Selenium.

What is the best way to run Scrapy in production? #

Use Docker containers with scrapy-redis for distributed queuing. Store data in PostgreSQL via Item Pipelines. Monitor with Prometheus and Grafana. Deploy with Scrapyd for HTTP API control. Set AUTOTHROTTLE_ENABLED to avoid overwhelming target servers. Rotate proxies and user agents via custom middleware.

How does Scrapy avoid getting banned? #

Scrapy provides several built-in mechanisms: AutoThrottle adjusts request rate based on server response times; the DupeFilter prevents redundant requests; middleware supports proxy rotation and user-agent spoofing. For production, combine these with a rotating proxy service and respect robots.txt. No framework guarantees immunity against sophisticated bot detection.

Is Scrapy suitable for real-time data streaming? #

Scrapy is batch-oriented by design. Spiders run, collect data, then exit. For near-real-time streaming, pipe Items to Kafka or Redis Streams in your Item Pipeline. Alternatively, trigger spiders on a schedule using cron, Airflow, or Scrapyd’s scheduling API. Scrapy itself does not maintain persistent listeners.

Can I use Scrapy with modern Python async/await syntax? #

Scrapy is built on Twisted’s deferreds and callbacks, not asyncio. While Twisted itself supports async/await in recent versions, Scrapy’s API remains callback-based for spider methods. The scrapy-playwright integration uses the AsyncioSelectorReactor to bridge the two worlds. Native asyncio support is a long-standing feature request but not yet the default.

Conclusion #

Scrapy remains the most productive choice for large-scale, production-grade web crawling in Python. The 61,700 GitHub stars reflect 17 years of battle-tested development, not hype. For static HTML at scale, nothing in the Python ecosystem matches its throughput. For JavaScript-heavy sites, the scrapy-playwright bridge provides a reasonable compromise.

The decision matrix is straightforward: BeautifulSoup for quick scripts under 100 pages, Playwright/Selenium for JavaScript rendering and browser interaction, Scrapy for everything else — especially when your crawl volume exceeds 10,000 pages or requires scheduled, monitored, distributed execution.

Action items:

  1. Clone the Scrapy repository and run scrapy bench on your hardware.
  2. Set up a Docker-based project with PostgreSQL and Redis integration.
  3. Configure proxy rotation for production crawling.
  4. Join the community on Telegram for daily tips and troubleshooting: dibi8_tg_group

Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:

  • DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
  • HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.

Affiliate links — they don’t cost you extra and they help keep dibi8.com running.

Sources & Further Reading #


This article contains affiliate links. When you purchase proxy services through WebShare links in this article, we may receive a commission at no additional cost to you. All benchmark data and recommendations are based on independent testing and community-verified sources.

💬 Discussion