
# Spider


Website | Guides | API Docs | Examples | Discord

A high-performance web crawler and scraper for Rust. 200-1000x faster than popular alternatives, with HTTP, headless Chrome, and WebDriver rendering in a single library.

## Quick Start

### Command Line

```sh
cargo install spider_cli
spider --url https://example.com
```

### Rust

```toml
[dependencies]
spider = "2"
```

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.crawl().await;
    println!("Pages found: {}", website.get_links().len());
}
```

## Streaming

Process each page the moment it's crawled, not after:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("- {}", page.get_url());
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
```

## Headless Chrome

Add one feature flag to render JavaScript-heavy pages:

```toml
[dependencies]
spider = { version = "2", features = ["chrome"] }
```

```rust
use spider::features::chrome_common::RequestInterceptConfiguration;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com")
        .with_chrome_intercept(RequestInterceptConfiguration::new(true))
        .with_stealth(true)
        .build()
        .unwrap();

    website.crawl().await;
}
```

Also supports WebDriver (Selenium Grid, remote browsers) and AI-driven automation. See examples for more.

## Benchmarks

Crawling 185 pages (source, 10 samples averaged):

**Apple M1 Max (10-core, 64 GB RAM):**

| Crawler | Language | Time | vs Spider |
|---|---|---|---|
| spider | Rust | 73 ms | baseline |
| node-crawler | JavaScript | 15 s | 205x slower |
| colly | Go | 32 s | 438x slower |
| wget | C | 70 s | 959x slower |

**Linux (2-core, 7 GB RAM):**

| Crawler | Language | Time | vs Spider |
|---|---|---|---|
| spider | Rust | 50 ms | baseline |
| node-crawler | JavaScript | 3.4 s | 68x slower |
| colly | Go | 30 s | 600x slower |
| wget | C | 60 s | 1200x slower |

The gap grows with site size: Spider handles 100k+ pages in minutes where other crawlers take hours. The speed comes from Rust's async runtime (tokio), lock-free data structures, and optional io_uring on Linux. Full details.

## Why Spider?

Most crawlers force a choice between fast HTTP-only crawling and slow-but-flexible browser automation. Spider supports both, and you can mix them in the same crawl.

**Supports HTTP, Chrome, and WebDriver.** Switch rendering modes with a feature flag. Use HTTP for speed, Chrome CDP for JavaScript-heavy pages, and WebDriver for Selenium Grid or cross-browser testing.
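Switching modes is a Cargo.toml change, not a code change. Both variants below appear elsewhere in this README; the commented line swaps HTTP-only crawling for headless Chrome rendering:

```toml
[dependencies]
# HTTP-only (default, fastest)
spider = "2"
# Headless Chrome rendering instead:
# spider = { version = "2", features = ["chrome"] }
```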

**Built for production.** Caching (memory, disk, hybrid), proxy rotation, anti-bot fingerprinting, ad blocking, depth budgets, cron scheduling, and distributed workers. All of this has been hardened through Spider Cloud.

**AI automation included.** spider_agent adds multimodal LLM-driven automation: navigate pages, fill forms, solve challenges, and extract structured data with OpenAI or any compatible API.

## Features

### Crawling

- Concurrent and streaming crawls with backpressure
- Decentralized crawling for horizontal scaling
- Caching: memory, disk (SQLite), or hybrid Chrome cache
- Proxy support with rotation
- Cron job scheduling
- Depth budgeting, blacklisting, whitelisting
- Smart mode that auto-detects JS-rendered content and upgrades to Chrome
### AI Agent

- spider_agent: concurrent-safe multimodal web automation agent
- Multiple LLM providers (OpenAI, any OpenAI-compatible API, Chrome built-in AI)
- Web research with search providers (Serper, Brave, Bing, Tavily)
- 110 built-in automation skills for web challenges

## Spider Cloud

For managed proxy rotation, anti-bot bypass, and CAPTCHA handling, Spider Cloud plugs in with one line:

```rust
// enable with features = ["spider_cloud"]
let mut website = Website::new("https://protected-site.com")
    .with_spider_cloud("your-api-key")
    .build()
    .unwrap();
```

| Mode | Strategy | Best For |
|---|---|---|
| Proxy (default) | All traffic through the Spider Cloud proxy | General crawling with IP rotation |
| Smart (recommended) | Proxy + auto-fallback on bot detection | Production (speed + reliability) |
| Fallback | Direct first, API on failure | Cost efficiency; most sites work without help |
| Unblocker | All requests through the unblocker | Aggressive bot protection |

Free credits on signup. Get started at spider.cloud

## Spider Browser Cloud

Connect to a remote Rust-based browser via CDP over WebSocket for automation, scraping, and AI extraction:

```rust
use spider::configuration::SpiderBrowserConfig;
use spider::website::Website;

// Simple: just an API key (features = ["spider_cloud", "chrome"])
let mut website = Website::new("https://example.com")
    .with_spider_browser("your-api-key")
    .build()
    .unwrap();

// Full config: stealth, country targeting, custom options
let browser_cfg = SpiderBrowserConfig::new("your-api-key")
    .with_stealth(true)
    .with_country("us");

let mut website = Website::new("https://example.com")
    .with_spider_browser_config(browser_cfg)
    .build()
    .unwrap();
```

WebSocket endpoint: `wss://browser.spider.cloud/v1/browser` (supports the CDP and WebDriver BiDi protocols).

## Parallel Backends (LightPanda / Servo)

Race alternative browser engines alongside the primary crawl. The best HTML response wins, giving higher reliability and coverage for JS-heavy pages.

```rust
use spider::configuration::{BackendEndpoint, BackendEngine, ParallelBackendsConfig};

let mut website = Website::new("https://example.com");

// Race a remote LightPanda instance alongside the primary crawl.
website.configuration.parallel_backends = Some(ParallelBackendsConfig {
    backends: vec![BackendEndpoint {
        engine: BackendEngine::LightPanda,
        endpoint: Some("ws://127.0.0.1:9222".to_string()),
        binary_path: None,
        protocol: None,
        proxy: None, // inherits from the website's proxy config
    }],
    grace_period_ms: 500,       // wait up to 500 ms for a better result
    fast_accept_threshold: 80,  // accept immediately if quality >= 80
    ..Default::default()
});

website.crawl().await;
```

Feature flags: `lightpanda` (LightPanda via CDP), `servo` (Servo via WebDriver), `parallel_backends_full` (both).

The implementation is lock-free, adds zero overhead when disabled, and tracks backend health automatically, disabling a backend after consecutive failures.
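As a Cargo.toml sketch of those feature flags (only the feature names listed above are used):

```toml
[dependencies]
# Pick one alternative engine, or both via the umbrella feature:
spider = { version = "2", features = ["lightpanda"] }
# spider = { version = "2", features = ["servo"] }
# spider = { version = "2", features = ["parallel_backends_full"] }
```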

## Get Spider

| Package | Language | Install |
|---|---|---|
| spider | Rust | `cargo add spider` |
| spider_cli | CLI | `cargo install spider_cli` |
| spider-nodejs | Node.js | `npm i @spider-rs/spider-rs` |
| spider-py | Python | `pip install spider_rs` |
| spider_agent | Rust | `cargo add spider --features agent` |
| spider_mcp | MCP | `cargo install spider_mcp` |

## MCP Server

Use Spider as tools in Claude Code, Claude Desktop, or any MCP client:

```sh
cargo install spider_mcp
```

```json
{ "mcpServers": { "spider": { "command": "spider-mcp" } } }
```

Then ask: "Scrape https://example.com as markdown" or "Crawl https://example.com up to 5 pages".

## Cloud and Remote

| Package | Description |
|---|---|
| Spider Cloud | Managed crawling infrastructure, no setup needed |
| spider-clients | SDKs for Spider Cloud in multiple languages |
| spider-browser | Remote access to Spider's Rust browser |


## Contributing

Contributions welcome. See CONTRIBUTING.md for setup and guidelines.

Spider has been actively developed for the past 4 years. Join the Discord for questions and discussion.

## License

MIT
