Qt Web Extractor

A general-purpose web content extraction engine powered by Qt WebEngine (Chromium). Designed to extract fully-rendered content from modern web pages that rely on JavaScript, cookies, dynamic content loading, or client-side rendering — transforming complex, noisy web structures directly into clean Markdown text format, preserving links and readability for LLM processing or downstream pipelines.

Also supports extracting text from PDF documents via Qt PDF.

Key features:

Smart Markdown formatting — converts fully rendered web pages directly into clean Markdown, preserving links and structure without the noise of raw HTML. Ideal for feeding text into LLMs.
Full JavaScript rendering — handles SPAs, React/Vue/Angular apps, and
any page that requires JS to display content.
Cookie & session support — access pages behind login walls or consent
gates.
PDF text extraction — extract text from PDF documents via Qt PDF with
auto-detection.
Multiple interfaces — use as a CLI tool, Python library, or HTTP
service with a simple REST API.
Headless operation — runs in Qt offscreen mode, no display or GPU
required.
Lightweight dependencies — no standalone browser binaries, and no full
browser process overhead, saving resources.
systemd integration — ships with a service unit for easy deployment.
Open WebUI compatible — works as an external web page loader for
Open WebUI, and can also be
used as a custom tool plugin.
Universal HTTP API — the REST server can serve any application that
needs rendered web content: AI agents, crawlers, monitoring tools,
automation scripts, and more.

Why?

Traditional HTTP fetchers (like requests or urllib) do plain HTTP
requests — no JS execution, no cookie handling, no waiting for async content.
That means SPAs, React/Vue apps, and anything behind a login wall comes back
empty or broken.

Qt Web Extractor spins up a headless Chromium (via Qt WebEngine) to render
pages properly, then hands back the text and HTML. Runs in offscreen mode by
default, no display needed.

Why not Playwright / Puppeteer / Selenium?

While tools like Playwright are incredibly powerful for browser automation, they can be overkill for simple content extraction:

Simpler Deployment: No need to download and manage separate, standalone Chromium binaries (which Playwright/Puppeteer do by default).
Package Manager Integration: It uses the system's native Qt WebEngine. On Linux distributions, this means it integrates perfectly with your system's package manager, receiving security updates automatically without bloating your application directory.
Lightweight: It focuses purely on rendering and extracting content, making it more lightweight and straightforward to set up as a simple background service.

Install

Arch Linux (AUR)

You can install the package directly from the Arch User Repository (AUR) using your favorite AUR helper (e.g., yay or paru):

yay -S qt-web-extractor

System deps

You need Qt6 WebEngine (which includes Qt6 PDF). On Arch:

sudo pacman -S qt6-webengine pyside6

Install the package

pip install .

Or in dev mode:

pip install -e .

Usage

CLI

# Clean Markdown text (preserves links and structure)
python -m qt_web_extractor https://example.com

# JSON output
python -m qt_web_extractor --json https://example.com

# rendered HTML
python -m qt_web_extractor --html https://example.com

# custom timeout (ms)
python -m qt_web_extractor --timeout 60000 https://example.com

# custom User-Agent
python -m qt_web_extractor --user-agent "MyBot/1.0" https://example.com

# override proxy for this command
python -m qt_web_extractor --proxy http://127.0.0.1:7890 https://example.com

# multiple URLs
python -m qt_web_extractor https://example.com https://example.org

# extract text from a PDF (auto-detected by .pdf extension)
python -m qt_web_extractor https://example.com/document.pdf

# force PDF extraction mode
python -m qt_web_extractor --pdf https://example.com/file

Python API

from qt_web_extractor import QtWebExtractor

extractor = QtWebExtractor(timeout_ms=30000)
result = extractor.extract("https://example.com")

print(result.title)
print(result.text)   # plain text
print(result.html)   # rendered HTML
print(result.error)  # empty string if all went well

# extract from PDF
result = extractor.extract_pdf("https://example.com/document.pdf")
print(result.text)

# override proxy explicitly (otherwise standard proxy env vars are used)
extractor = QtWebExtractor(proxy="http://127.0.0.1:7890")

Open WebUI integration

Qt Web Extractor integrates with Open WebUI as an external web page loader.
The server exposes an API compatible with Open WebUI's built-in web loader engine.

Install and start the server:

sudo systemctl enable --now qt-web-extractor

Or run manually:

qt-web-extractor serve

In the Open WebUI admin panel, go to
Settings → Web Search → Web Page Loader
Set Web Loader Engine to external
Set External Web Loader URL to http://127.0.0.1:8766
(or wherever the server is running)
Set External Web Loader API Key to the server's API_KEY if
you configured one, or any non-empty string if didn't.

That's it — Open WebUI will now use Qt Web Extractor to load all web pages
with full JavaScript rendering support. PDF URLs are auto-detected and handled
via Qt PDF.

Alternative: custom tool

You can also use qt_web_extractor/tool.py as a custom Open WebUI tool for
more explicit control (see the file for setup instructions). This provides
fetch_page, fetch_page_html, and fetch_pdf as conversation tools.

Server mode

Run as a persistent HTTP service for any application — AI platforms (Open
WebUI, etc.), web crawlers, automation scripts, monitoring tools, or your
own projects:

# start with defaults (127.0.0.1:8766)
qt-web-extractor serve

# custom host/port
qt-web-extractor serve --host 0.0.0.0 --port 9000

# with API key auth
qt-web-extractor serve --api-key mysecretkey

# override proxy for the service process
qt-web-extractor serve --proxy http://127.0.0.1:7890

Proxy handling follows standard environment variables by default:

export HTTPS_PROXY=http://127.0.0.1:7890
export HTTP_PROXY=http://127.0.0.1:7890
export ALL_PROXY=http://127.0.0.1:7890
export NO_PROXY=127.0.0.1,localhost,.internal.example

The explicit --proxy flag overrides HTTPS_PROXY / HTTP_PROXY / ALL_PROXY for outbound requests, while NO_PROXY is still honored. Only HTTP/HTTPS forward proxies are supported end-to-end.

API endpoints:

POST / with {"urls": ["https://...", ...]} → Open WebUI external loader
format, returns [{"page_content": "...", "metadata": {"source": "...", "title": "..."}}]
POST /extract with {"url": "https://..."} → single-URL format, returns
JSON with url, title, text, html, error
POST /mcp with JSON-RPC 2.0 payload → MCP endpoint for AI agents
(supports initialize, tools/list, tools/call)
GET /health → {"status": "ok"}

PDF URLs (ending in .pdf) are auto-detected in both endpoints. For
POST /extract, pass "pdf": true to force PDF mode.

MCP integration (Claude Code / OpenCode)

The built-in MCP endpoint (/mcp) reuses the same running server process.
No extra wrapper process is required.
MCP uses the same Bearer authentication as /extract.

Available MCP tool:

fetch_url with input { "url": "https://..." }
Returns rendered Markdown text (with PDF auto-detection)

Claude Code example:

# no auth
claude mcp add --transport http web-extractor http://127.0.0.1:8766/mcp

# if server uses --api-key
claude mcp add --transport http web-extractor http://127.0.0.1:8766/mcp \
  --header "Authorization: Bearer mysecretkey"

Optional Claude Code config:

These files are user-managed and are not auto-created by package installation.

Project-scoped: create .mcp.json in your project root.

User-scoped (global): configure ~/.claude.json under mcpServers, or run:

claude mcp add --transport http --scope user web-extractor http://127.0.0.1:8766/mcp

Example .mcp.json (project-scoped):

{
  "mcpServers": {
    "web-extractor": {
      "type": "http",
      "url": "${QT_WEB_EXTRACTOR_MCP_URL:-http://127.0.0.1:8766/mcp}",
      "headers": {
        "Authorization": "Bearer ${QT_WEB_EXTRACTOR_API_KEY:-}"
      }
    }
  }
}

OpenCode config (opencode.json in project root, or ~/.config/opencode/opencode.json for global user config):

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "web_extractor": {
      "type": "remote",
      "url": "http://127.0.0.1:8766/mcp",
      "enabled": true,
      "oauth": false,
      "headers": {
        "Authorization": "Bearer {env:QT_WEB_EXTRACTOR_API_KEY}"
      }
    }
  }
}

Hardcoded values are also valid in both configs, for example: "Authorization": "Bearer mysecretkey".

When auth is disabled, the Authorization header can be omitted.

systemd

A service file and config are included:

# edit config
sudo nano /etc/qt-web-extractor.conf

# start
sudo systemctl enable --now qt-web-extractor

Open WebUI settings

Setting	Value	Description
`WEB_LOADER_ENGINE`	`external`	Use external web loader
`EXTERNAL_WEB_LOADER_URL`	`http://127.0.0.1:8766`	Server URL
`EXTERNAL_WEB_LOADER_API_KEY`	`""`	Bearer token (must match server's `API_KEY`)

Server config (`/etc/qt-web-extractor.conf`)

Name	Default	Description
`HOST`	`127.0.0.1`	Listen address
`PORT`	`8766`	Listen port
`TIMEOUT_MS`	`30000`	Page load timeout (ms)
`USER_AGENT`	`""`	Custom User-Agent
`API_KEY`	`""`	Bearer token auth (empty = no auth)
`HTTPS_PROXY`	unset	HTTPS outbound proxy
`HTTP_PROXY`	unset	HTTP outbound proxy
`ALL_PROXY`	unset	Fallback outbound proxy
`NO_PROXY`	unset	Hosts that bypass proxy

How it works

The server runs Qt WebEngine on the main thread (Qt requirement) and an HTTP
server in a background thread. Incoming requests are queued and processed one
at a time by the Qt event loop. Each page gets 2 seconds after loadFinished
for JS to settle, then toPlainText() and toHtml() are extracted from the
rendered DOM. A hard timeout prevents hanging on unresponsive pages.

Sites behind Cloudflare's aggressive bot challenge may still fail — this is a
known limitation of all headless browsers.

Project layout

qt-web-extractor/
├── pyproject.toml
├── PKGBUILD
├── LICENSE
├── README.md
├── qt-web-extractor.service         # systemd unit
├── qt-web-extractor.conf.example    # default config
└── qt_web_extractor/
    ├── __init__.py
    ├── __main__.py      # CLI (extract + serve subcommands)
    ├── extractor.py     # core engine (QWebEnginePage)
    ├── server.py        # HTTP server wrapper
    └── tool.py          # Open WebUI tool interface

License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the COPYING file for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.