kreuzcrawl

mcp
Security Audit
Fail
Health Warn
  • License — NOASSERTION
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Fail
  • rm -rf — Recursive force deletion command in .github/workflows/ci.yaml
Permissions Pass
  • Permissions — No dangerous permissions requested
Purpose
This tool is a high-performance web crawling and scraping engine written in Rust, designed for structured data extraction. It offers native bindings across 10 programming languages and acts as a Model Context Protocol (MCP) server.

Security Assessment
As a web crawler, the tool inherently makes outbound network requests to fetch external website data. The codebase does not request dangerous local permissions, nor did the scan find hardcoded secrets. However, the audit flagged a recursive force deletion command (`rm -rf`) inside a CI workflow file (`.github/workflows/ci.yaml`). While this is common practice for cleaning up build environments, an unsanitized or empty variable expanded into an `rm -rf` target can delete far more than intended, so the command deserves a manual check. Overall risk is rated as Medium.
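
The hazard described above is easy to guard against. A defensive cleanup helper, sketched here in Python (a hypothetical illustration, not part of kreuzcrawl or its CI), resolves the target path and refuses to delete anything outside the workspace:

```python
import os
import shutil
import tempfile

def safe_rmtree(target: str, workspace: str) -> bool:
    """Recursively delete target only if it resolves to a path inside workspace."""
    workspace = os.path.realpath(workspace)
    resolved = os.path.realpath(target)
    # Refuse empty input and the workspace root itself.
    if not target or resolved == workspace:
        return False
    # Refuse anything that escapes the workspace (symlinks, "..", absolute paths).
    if os.path.commonpath([workspace, resolved]) != workspace:
        return False
    shutil.rmtree(resolved, ignore_errors=True)
    return True

# Example: a scratch dir inside the workspace is removed; "/" is refused.
ws = tempfile.mkdtemp()
scratch = os.path.join(ws, "build")
os.makedirs(scratch)
print(safe_rmtree(scratch, ws))  # True
print(safe_rmtree("/", ws))      # False
```

The same containment check translates directly to a shell `case`/`realpath` guard inside the workflow file itself.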

Quality Assessment
The project is highly active, with its most recent code push occurring today. Despite this active development, the tool has very low community visibility, with only 5 GitHub stars. There is also a licensing discrepancy: the automated scan detected "NOASSERTION," but the project's README displays an "Elastic-2.0" badge. Developers should verify the license terms manually before integration, as Elastic-2.0 carries source-available restrictions.

Verdict
Use with caution — it is actively maintained and lacks dangerous local permissions, but its low community adoption and CI script flags warrant a closer manual review before deploying in sensitive environments.
SUMMARY

High-performance web crawling engine with bindings for 10 languages

README.md

Kreuzcrawl

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.

Key Features

  • Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
  • Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
  • Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
  • 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
  • Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
  • Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
  • Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
  • Streaming — Real-time crawl events via async streams for progress tracking
  • Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
  • Rate limiting — Per-domain request throttling with configurable delays
  • Asset download — Download, deduplicate, and filter images, documents, and other linked assets
  • MCP server — Model Context Protocol integration for AI agents
  • REST API — HTTP server with OpenAPI spec
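
Of the features above, the BM25 relevance scoring used for smart filtering is a standard ranking function. A minimal self-contained sketch in plain Python (independent of kreuzcrawl's actual implementation) shows how crawled pages could be scored against a query:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "rust web crawler engine".split(),
    "cooking pasta recipes".split(),
    "web scraping with rust".split(),
]
scores = bm25_scores("rust crawler".split(), docs)
print(max(range(3), key=scores.__getitem__))  # 0 — matches both query terms
```

In a best-first crawl, scores like these would order the frontier so the most relevant pages are fetched first.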

Documentation | API Reference

Installation

Language       | Package                    | Install
Python         | kreuzcrawl                 | pip install kreuzcrawl
Node.js        | @kreuzberg/kreuzcrawl      | npm install @kreuzberg/kreuzcrawl
Rust           | kreuzcrawl                 | cargo add kreuzcrawl
Go             | pkg.go.dev                 | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go
Java           | Maven Central              | See README
C#             | NuGet                      | dotnet add package Kreuzcrawl
Ruby           | kreuzcrawl                 | gem install kreuzcrawl
PHP            | kreuzberg-dev/kreuzcrawl   | composer require kreuzberg-dev/kreuzcrawl
Elixir         | kreuzcrawl                 | {:kreuzcrawl, "~> 0.1"}
WASM           | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm
C FFI          | GitHub Releases            | C header + shared library
CLI            | crates.io                  | cargo install kreuzcrawl-cli
CLI (Homebrew) | kreuzberg-dev/tap          | brew install kreuzberg-dev/tap/kreuzcrawl

Quick Start

Python (Full docs)
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")

print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
Node.js / TypeScript (Full docs)
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");

console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
Rust (Full docs)
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;

println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
Go (Full docs)
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")

fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
Java (Full docs)
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");

System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
C# (Full docs)
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");

Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
Ruby (Full docs)
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")

puts result.metadata.title
puts result.markdown.content
puts result.links.length
PHP (Full docs)
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");

echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
Elixir (Full docs)
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")

IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
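
Crawling beyond a single page, as the Key Features list describes, is a bounded frontier walk over each page's discovered links. The core loop is the same in every binding; here is a stub-backed sketch in plain Python, where `fetch_links` stands in for a real scrape call and is purely hypothetical:

```python
from collections import deque

def crawl_bfs(start, fetch_links, max_depth=2, max_pages=10):
    """Breadth-first crawl: visit each URL once, bounded by depth and page count."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't expand links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Stub site: page "a" links to "b" and "c"; "b" links to "d".
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
order = crawl_bfs("a", lambda u: site.get(u, []))
print(order)  # ['a', 'b', 'c', 'd']
```

Swapping the deque for a stack gives depth-first traversal; ordering the frontier by a relevance score gives best-first.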

Platform Support

Supported targets: Linux x86_64, Linux aarch64, macOS ARM64, and Windows x64.
Bindings covered: Python, Node.js, WASM, Ruby, Elixir, Go, Java, C#, PHP, Rust, C (FFI), and CLI. (The per-language support matrix did not survive extraction; see the repository README for the full table.)

Architecture

Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
    │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
    │
Rust Core Engine (async, concurrent, SIMD-optimized)
    │
    ├── HTTP Client (reqwest + tower middleware stack)
    ├── HTML Parser (html5ever + lol_html)
    ├── Markdown Converter (html-to-markdown-rs)
    ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
    ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
    └── Browser Rendering (optional headless Chrome/Firefox)
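
The per-domain rate limiting listed under Key Features lives in the HTTP client layer of this stack. A minimal sketch of the idea in plain Python (unrelated to the actual reqwest/tower middleware):

```python
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Enforce a minimum delay between successive requests to the same domain."""

    def __init__(self, min_delay: float):
        self.min_delay = min_delay
        self.last_hit = {}  # domain -> monotonic timestamp of last request

    def wait(self, url: str) -> float:
        """Sleep if the domain was hit too recently; return seconds actually waited."""
        domain = urlsplit(url).netloc
        now = time.monotonic()
        waited = 0.0
        last = self.last_hit.get(domain)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
                waited = remaining
        self.last_hit[domain] = time.monotonic()
        return waited

throttle = DomainThrottle(min_delay=0.1)
throttle.wait("https://example.com/a")           # first hit: no wait
paused = throttle.wait("https://example.com/b")  # same domain: ~0.1 s pause
print(round(paused, 1))
```

Keying the delay on the domain rather than the full URL lets concurrent crawls stay polite per host while still saturating bandwidth across hosts.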

Contributing

Contributions are welcome! See our Contributing Guide.

License

Elastic License 2.0
