kreuzcrawl

mcp
Security Audit
Fail
Health Warn
  • License — NOASSERTION
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Fail
  • rm -rf — Recursive force deletion command in .github/workflows/ci.yaml
Permissions Pass
  • Permissions — No dangerous permissions requested
Purpose
This tool is a high-performance web crawling and scraping engine written in Rust, designed for structured data extraction. It offers native bindings across 10 programming languages and acts as a Model Context Protocol (MCP) server.

Security Assessment
As a web crawler, the tool inherently makes outbound network requests to fetch external website data. The codebase does not request dangerous local permissions, nor did the scan find hardcoded secrets. However, the audit flagged a recursive force deletion command (`rm -rf`) inside a CI workflow file (`.github/workflows/ci.yaml`). While this is common practice for cleaning up build environments, an unsanitized or empty variable expanded into an `rm -rf` target can delete far more than intended, so the command deserves a manual check. Overall risk is rated as Medium.
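
The hazard described above is easy to guard against. A defensive cleanup helper, sketched here in Python (a hypothetical illustration, not part of kreuzcrawl or its CI), resolves the target path and refuses to delete anything outside the workspace:

```python
import os
import shutil
import tempfile

def safe_rmtree(target: str, workspace: str) -> bool:
    """Recursively delete target only if it resolves to a path inside workspace."""
    workspace = os.path.realpath(workspace)
    resolved = os.path.realpath(target)
    # Refuse empty input and the workspace root itself.
    if not target or resolved == workspace:
        return False
    # Refuse anything that escapes the workspace (symlinks, "..", absolute paths).
    if os.path.commonpath([workspace, resolved]) != workspace:
        return False
    shutil.rmtree(resolved, ignore_errors=True)
    return True

# Example: a scratch dir inside the workspace is removed; "/" is refused.
ws = tempfile.mkdtemp()
scratch = os.path.join(ws, "build")
os.makedirs(scratch)
print(safe_rmtree(scratch, ws))  # True
print(safe_rmtree("/", ws))      # False
```

The same containment check translates directly to a shell `case`/`realpath` guard inside the workflow file itself.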

Quality Assessment
The project is highly active, with its most recent code push occurring today. Despite this active development, the tool has very low community visibility, with only 5 GitHub stars. There is also a licensing discrepancy: the automated scan detected "NOASSERTION," but the project's README displays an "Elastic-2.0" badge. Developers should verify the license terms manually before integration, as Elastic-2.0 carries source-available restrictions.

Verdict
Use with caution — it is actively maintained and lacks dangerous local permissions, but its low community adoption and CI script flags warrant a closer manual review before deploying in sensitive environments.
SUMMARY

High-performance web crawling engine with bindings for 10 languages

README.md

Kreuzcrawl

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.

Key Features

  • Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
  • Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
  • Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
  • 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
  • Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
  • Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
  • Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
  • Streaming — Real-time crawl events via async streams for progress tracking
  • Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
  • Rate limiting — Per-domain request throttling with configurable delays
  • Asset download — Download, deduplicate, and filter images, documents, and other linked assets
  • MCP server — Model Context Protocol integration for AI agents
  • REST API — HTTP server with OpenAPI spec
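
Of the features above, the BM25 relevance scoring used for smart filtering is a standard ranking function. A minimal self-contained sketch in plain Python (independent of kreuzcrawl's actual implementation) shows how crawled pages could be scored against a query:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "rust web crawler engine".split(),
    "cooking pasta recipes".split(),
    "web scraping with rust".split(),
]
scores = bm25_scores("rust crawler".split(), docs)
print(max(range(3), key=scores.__getitem__))  # 0 — matches both query terms
```

In a best-first crawl, scores like these would order the frontier so the most relevant pages are fetched first.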

Documentation | API Reference

Installation

Language       | Package                    | Install
Python         | kreuzcrawl                 | pip install kreuzcrawl
Node.js        | @kreuzberg/kreuzcrawl      | npm install @kreuzberg/kreuzcrawl
Rust           | kreuzcrawl                 | cargo add kreuzcrawl
Go             | pkg.go.dev                 | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go
Java           | Maven Central              | See README
C#             | NuGet                      | dotnet add package Kreuzcrawl
Ruby           | kreuzcrawl                 | gem install kreuzcrawl
PHP            | kreuzberg-dev/kreuzcrawl   | composer require kreuzberg-dev/kreuzcrawl
Elixir         | kreuzcrawl                 | {:kreuzcrawl, "~> 0.1"}
WASM           | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm
C FFI          | GitHub Releases            | C header + shared library
CLI            | crates.io                  | cargo install kreuzcrawl-cli
CLI (Homebrew) | kreuzberg-dev/tap          | brew install kreuzberg-dev/tap/kreuzcrawl

Quick Start

Python (Full docs)
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")

print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
Node.js / TypeScript (Full docs)
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");

console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
Rust (Full docs)
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;

println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
Go (Full docs)
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")

fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
Java (Full docs)
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");

System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
C# (Full docs)
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");

Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
Ruby (Full docs)
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")

puts result.metadata.title
puts result.markdown.content
puts result.links.length
PHP (Full docs)
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");

echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
Elixir (Full docs)
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")

IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
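
Crawling beyond a single page, as the Key Features list describes, is a bounded frontier walk over each page's discovered links. The core loop is the same in every binding; here is a stub-backed sketch in plain Python, where `fetch_links` stands in for a real scrape call and is purely hypothetical:

```python
from collections import deque

def crawl_bfs(start, fetch_links, max_depth=2, max_pages=10):
    """Breadth-first crawl: visit each URL once, bounded by depth and page count."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't expand links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Stub site: page "a" links to "b" and "c"; "b" links to "d".
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
order = crawl_bfs("a", lambda u: site.get(u, []))
print(order)  # ['a', 'b', 'c', 'd']
```

Swapping the deque for a stack gives depth-first traversal; ordering the frontier by a relevance score gives best-first.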

Platform Support

Supported targets: Linux x86_64, Linux aarch64, macOS ARM64, and Windows x64.
Bindings covered: Python, Node.js, WASM, Ruby, Elixir, Go, Java, C#, PHP, Rust, C (FFI), and CLI. (The per-language support matrix did not survive extraction; see the repository README for the full table.)

Architecture

Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
    │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
    │
Rust Core Engine (async, concurrent, SIMD-optimized)
    │
    ├── HTTP Client (reqwest + tower middleware stack)
    ├── HTML Parser (html5ever + lol_html)
    ├── Markdown Converter (html-to-markdown-rs)
    ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
    ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
    └── Browser Rendering (optional headless Chrome/Firefox)
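
The per-domain rate limiting listed under Key Features lives in the HTTP client layer of this stack. A minimal sketch of the idea in plain Python (unrelated to the actual reqwest/tower middleware):

```python
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Enforce a minimum delay between successive requests to the same domain."""

    def __init__(self, min_delay: float):
        self.min_delay = min_delay
        self.last_hit = {}  # domain -> monotonic timestamp of last request

    def wait(self, url: str) -> float:
        """Sleep if the domain was hit too recently; return seconds actually waited."""
        domain = urlsplit(url).netloc
        now = time.monotonic()
        waited = 0.0
        last = self.last_hit.get(domain)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
                waited = remaining
        self.last_hit[domain] = time.monotonic()
        return waited

throttle = DomainThrottle(min_delay=0.1)
throttle.wait("https://example.com/a")           # first hit: no wait
paused = throttle.wait("https://example.com/b")  # same domain: ~0.1 s pause
print(round(paused, 1))
```

Keying the delay on the domain rather than the full URL lets concurrent crawls stay polite per host while still saturating bandwidth across hosts.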

Contributing

Contributions are welcome! See our Contributing Guide.

License

Elastic License 2.0
