kreuzcrawl
mcp
Failed
Health Warning
- License — NOASSERTION
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 5 GitHub stars
Code Failed
- rm -rf — Recursive force deletion command in .github/workflows/ci.yaml
Permissions Passed
- Permissions — No dangerous permissions requested
Purpose
This tool is a high-performance web crawling and scraping engine written in Rust, designed for structured data extraction. It offers native bindings across 11 programming languages and acts as a Model Context Protocol (MCP) server.
Security Assessment
As a web crawler, the tool inherently makes outbound network requests to fetch external website data. The codebase does not request dangerous local permissions, nor did the scan find hardcoded secrets. However, the audit flagged a recursive force deletion command (`rm -rf`) inside a CI workflow file (`.github/workflows/ci.yaml`). While this is a common practice for cleaning up build environments, poorly sanitized variables in CI scripts can sometimes lead to pipeline vulnerabilities. Overall risk is rated as Medium.
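The class of pipeline vulnerability described above can be illustrated with a short, self-contained sketch. This is not the project's actual workflow; `BUILD_DIR` is a hypothetical variable, and the guards shown are general shell hygiene rather than anything taken from `ci.yaml`:

```shell
#!/usr/bin/env bash
# Hypothetical CI cleanup step. If BUILD_DIR were unset, an unguarded
# `rm -rf "$BUILD_DIR/"` could expand to `rm -rf /`.
set -euo pipefail           # abort on errors and on use of unset variables

BUILD_DIR="$(mktemp -d)/build"
mkdir -p "$BUILD_DIR"

# ${VAR:?} makes the shell exit with an error instead of deleting the wrong
# path when the variable is empty or unset; `--` ends option parsing so a
# value starting with "-" cannot be read as a flag.
rm -rf -- "${BUILD_DIR:?BUILD_DIR must be set}"
echo "cleanup ok"
```

The key habits are `set -u` (fail on unset variables), the `${VAR:?}` guard, and quoting every expansion that feeds a destructive command.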
Quality Assessment
The project is highly active, with its most recent code push occurring today. Despite this active development, the tool has very low community visibility, with only 5 GitHub stars. There is a discrepancy in its licensing: the automated scan detected "NOASSERTION," but the project's README displays an "Elastic-2.0" badge. Developers should manually verify the license terms before integration, as Elastic-2.0 has specific source-available restrictions.
Verdict
Use with caution — it is actively maintained and lacks dangerous local permissions, but its low community adoption and CI script flags warrant a closer manual review before deploying in sensitive environments.
High-performance web crawling engine with bindings for 11 languages
README.md
Kreuzcrawl
High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.
Key Features
- Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
- Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
- Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
- 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
- Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
- Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
- Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
- Streaming — Real-time crawl events via async streams for progress tracking
- Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
- Rate limiting — Per-domain request throttling with configurable delays
- Asset download — Download, deduplicate, and filter images, documents, and other linked assets
- MCP server — Model Context Protocol integration for AI agents
- REST API — HTTP server with OpenAPI spec
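The BM25 relevance scoring mentioned in the feature list can be illustrated with a self-contained sketch. This is the standard textbook formula, not kreuzcrawl's implementation, and the `k1`/`b` defaults are conventional values rather than the library's:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms using BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each query term appears.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        score = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b).
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "rust web crawler engine".split(),
    "cooking pasta recipes".split(),
    "web scraping with rust".split(),
]
print(bm25_scores(["rust", "web"], docs))
```

In a best-first crawl, scores like these would rank frontier URLs so the most query-relevant pages are fetched first.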
Installation
| Language | Package | Install |
|---|---|---|
| Python | kreuzcrawl | pip install kreuzcrawl |
| Node.js | @kreuzberg/kreuzcrawl | npm install @kreuzberg/kreuzcrawl |
| Rust | kreuzcrawl | cargo add kreuzcrawl |
| Go | pkg.go.dev | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go |
| Java | Maven Central | See README |
| C# | NuGet | dotnet add package Kreuzcrawl |
| Ruby | kreuzcrawl | gem install kreuzcrawl |
| PHP | kreuzberg-dev/kreuzcrawl | composer require kreuzberg-dev/kreuzcrawl |
| Elixir | kreuzcrawl | {:kreuzcrawl, "~> 0.1"} |
| WASM | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm |
| C FFI | GitHub Releases | C header + shared library |
| CLI | crates.io | cargo install kreuzcrawl-cli |
| CLI (Homebrew) | kreuzberg-dev/tap | brew install kreuzberg-dev/tap/kreuzcrawl |
Quick Start
Python — Full docs

```python
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")
print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
```
Node.js / TypeScript — Full docs

```typescript
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");
console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
```
Rust — Full docs

```rust
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;
println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
```
Go — Full docs

```go
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")
fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
```
Java — Full docs

```java
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");
System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
```
C# — Full docs

```csharp
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");
Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
```
Ruby — Full docs

```ruby
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")
puts result.metadata.title
puts result.markdown.content
puts result.links.length
```
PHP — Full docs

```php
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");
echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
```
Elixir — Full docs

```elixir
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")
IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
```
Platform Support
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | — |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
Architecture
```
Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
        │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
        │
Rust Core Engine (async, concurrent, SIMD-optimized)
        │
        ├── HTTP Client (reqwest + tower middleware stack)
        ├── HTML Parser (html5ever + lol_html)
        ├── Markdown Converter (html-to-markdown-rs)
        ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
        ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
        └── Browser Rendering (optional headless Chrome/Firefox)
```
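The robots.txt compliance in the Link Discovery stage can be demonstrated with Python's standard-library parser. This only illustrates the rule-matching behavior, not kreuzcrawl's internal client; the user-agent string here is arbitrary:

```python
from urllib import robotparser

# Parse an in-memory robots.txt rather than fetching one over the network.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Paths not matched by a Disallow rule are allowed by default.
print(rp.can_fetch("kreuzcrawl", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("kreuzcrawl", "https://example.com/private/page"))  # blocked
```

A compliant crawler runs a check like this before every fetch and skips (rather than retries) disallowed URLs.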
Contributing
Contributions are welcome! See our Contributing Guide.
License
Links