knowhere

Security Audit
Warning
Health Warning
  • License — License: Apache-2.0
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 6 GitHub stars
Code Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Passed
  • Permissions — No dangerous permissions requested

There is no AI report for this listing yet.

SUMMARY

Knowhere extracts, parses, and outputs structured chunks ready for AI Agents and RAG.

README.md

Prepare unstructured data for AI Agents


🔗 Website | 📄 Docs | 🏠 Self-Host | 🖥️ Dashboard

Knowhere is the open-source infrastructure for unstructured data processing. It automates the complex pipeline of extracting, parsing, and transforming messy documents into structured, high-quality data optimized for AI Agents, Agentic RAG, and traditional vector-based RAG workflows.

> [!NOTE]
> Get started in seconds with Knowhere Cloud.
> Avoid the complexity of self-deployment. Use our managed API at knowhereto.ai and enjoy $5 in free credits upon registration.

📢 News

  • May 7, 2026: 🚀 Knowhere is now Open Source! We have open-sourced our entire stack for document ingestion, parsing, and agentic RAG. You can now self-host the full platform using knowhere-self-hosted. Check out our Contribution Guide to get involved!
  • Apr 30, 2026: 📦 Version 2026.04.30.1 has been released. This update includes several stability improvements and initial support for the agentic RAG layer. See the full changelog for details.

How it Works

> [!TIP]
> TL;DR: Knowhere parses documents into structured units, maps them in a graph, and lets agents navigate that context to find and cite reliable evidence.

Knowhere turns raw documents into a structured memory store that AI agents can navigate and cite. The process follows a three-stage pipeline:

```mermaid
flowchart LR
    A[📄 Document Parsing] --> B[🕸️ Graph Construction]
    B --> C[🤖 Agentic Retrieval]
    B --> D[🔍 Vector-based RAG]
    C --> E[✅ Cited Results]
    D --> E
```

1. Document Parsing

Knowhere routes files to specialized parsers for PDFs, Office docs, images, and more. We don't just extract text; we preserve the document's hierarchy:

  • Hierarchical Paths: Every chunk knows its exact location (e.g., Section 2.1 > Table 4).
  • Multi-modal Units: Tables and images are treated as distinct assets with their own metadata.
  • Structural Awareness: Heading levels and section boundaries are maintained to keep context intact.
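For illustration, a chunk that carries its hierarchical path might look like the following minimal sketch. The field names (`path`, `kind`, `doc_id`) are assumptions for this example, not Knowhere's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A parsed unit that keeps its place in the document hierarchy."""
    text: str
    doc_id: str
    path: list[str]            # e.g. ["Section 2", "Section 2.1", "Table 4"]
    kind: str = "text"         # "text", "table", or "image"
    metadata: dict = field(default_factory=dict)

    def location(self) -> str:
        """Human-readable hierarchical path, e.g. 'Section 2.1 > Table 4'."""
        return " > ".join(self.path)

chunk = Chunk(
    text="Q3 revenue grew 12% year over year.",
    doc_id="annual-report.pdf",
    path=["Section 2", "Section 2.1", "Table 4"],
    kind="table",
)
print(chunk.location())
```

Because the path travels with the chunk, a downstream agent can cite "Section 2.1 > Table 4" instead of an opaque chunk ID.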

2. Memory Graph

Parsed content is organized into a lightweight graph. It’s designed as a practical map for agents, not a complex ontology.

  • Nodes: Represent documents, sections, and chunks.
  • Edges: Map semantic relationships (keyword overlap, summaries) and structural links.
    This graph helps agents quickly understand what a document is about and which neighboring files might be relevant.
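A minimal sketch of such a graph using plain dictionaries. The node IDs and edge labels here are illustrative only, not Knowhere's internal representation:

```python
# Nodes: documents, sections, and chunks. Edges carry a relation label,
# mixing structural links ("contains") with semantic ones ("keyword_overlap").
nodes = {
    "doc:report": {"type": "document", "title": "Annual Report"},
    "sec:intro":  {"type": "section",  "title": "Introduction"},
    "chunk:0001": {"type": "chunk",    "text": "Revenue grew 12%..."},
}
edges = [
    ("doc:report", "sec:intro",  "contains"),         # structural link
    ("sec:intro",  "chunk:0001", "contains"),
    ("chunk:0001", "doc:report", "keyword_overlap"),  # semantic link
]

def neighbors(node_id, relation=None):
    """All nodes reachable from node_id, optionally filtered by edge label."""
    return [dst for src, dst, rel in edges
            if src == node_id and (relation is None or rel == relation)]

print(neighbors("doc:report"))             # which sections does the doc contain?
print(neighbors("sec:intro", "contains"))  # which chunks sit under the section?
```

An agent answering "what is this document about?" can start at the document node and follow `contains` edges downward, without ever loading the full text.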

3a. Agentic Retrieval

An agent navigates the memory graph to find evidence rather than relying on a single vector lookup:

  • Hybrid Discovery: Fuses keyword and semantic search (RRF) for broad first-pass coverage.
  • Agent Navigation: The agent "walks" the graph, reviewing section previews to drill down into the most relevant paths.
  • Cited Evidence: Results are returned as traceable evidence — source document, section, chunk, and any linked image or table assets.
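As a rough illustration of the navigation step, here is a toy greedy walk in which a scoring function stands in for the agent's judgment of each section preview. The graph shape and scoring are assumptions for this sketch, not Knowhere's actual agent logic:

```python
def agentic_walk(graph, start, score, max_hops=3):
    """Greedy graph walk: at each hop, follow the neighbor whose preview
    scores highest. `score` stands in for an LLM relevance judgment."""
    path, current = [start], start
    for _ in range(max_hops):
        nbrs = graph.get(current, [])
        if not nbrs:
            break  # reached a leaf (e.g. a chunk with no children)
        current = max(nbrs, key=score)
        path.append(current)
    return path

# Toy graph: a root document with two sections; one section holds two chunks.
graph = {"root": ["sec-a", "sec-b"], "sec-b": ["chunk-1", "chunk-2"]}
relevance = {"sec-a": 0.2, "sec-b": 0.9, "chunk-1": 0.4, "chunk-2": 0.8}
print(agentic_walk(graph, "root", relevance.get))
# the walk drills from the root into the highest-scoring section and chunk
```

The path the walk returns doubles as a citation trail: document, section, chunk.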

3b. Vector-based RAG

For teams that prefer a pure retrieval pipeline without agent overhead, Knowhere's parsed chunks plug directly into standard vector stacks:

  • Dense Search: Chunk embeddings stored in Qdrant, pgvector, or Milvus for fast ANN lookup.
  • Sparse Search: BM25 term index for keyword-sensitive queries.
  • Multi-channel Fusion: Dense and sparse results are fused with RRF before being returned, giving you the best of both signals.
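Reciprocal Rank Fusion itself is simple to sketch: each document's fused score is the sum of 1/(k + rank) over the result lists it appears in, with k = 60 as the conventional constant. The chunk IDs below are made up for illustration:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["chunk-3", "chunk-1", "chunk-7"]   # embedding ANN results
sparse = ["chunk-1", "chunk-9", "chunk-3"]   # BM25 results
print(rrf_fuse([dense, sparse]))
```

Note how `chunk-1` wins: appearing near the top of both lists beats topping only one, which is exactly the property that makes RRF a robust default for fusing dense and sparse channels.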

Ecosystem

| Repository | Description |
| --- | --- |
| knowhere | This repo. Backend API and worker — document ingestion, parsing, graph construction, and retrieval. |
| 🖥️ knowhere-dashboard | The web UI. Connects to the API for the full product experience. |
| 🐳 knowhere-self-hosted | Docker Compose stack for self-hosted deployments. Packages the API, worker, and dashboard together. |
| 🐍 knowhere-python-sdk | Official Python SDK for the Knowhere Cloud API. |
| 🦕 knowhere-node-sdk | Official Node.js SDK for the Knowhere Cloud API. |

Features

  • Multi-modal Parsing: High-fidelity extraction from PDF, Office, and images, preserving headings, tables, and hierarchical paths.
  • Lightweight Memory Graph: Context-aware organization that links documents and chunks for better relationship understanding.
  • Agentic RAG: A hybrid retrieval engine combining traditional search (RRF) with autonomous agent navigation.
  • Evidence-based Citations: Every result is backed by traceable source paths, ensuring reliability for AI Agent decision-making.

Supported Formats

✅ Supported

  • .pdf .docx .pptx .xlsx .csv
  • .jpg .png
  • .md .txt .json

⏳ Coming Soon

  • .epub .html .xml
  • .mp4 .mp3
  • .skills.md

Want to see a new format supported? Adding a parser is a great first contribution. Check out CONTRIBUTING.md to get started.
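To get a feel for the shape of such a contribution, here is a hypothetical parser sketch. The `BaseParser` class, `extensions` attribute, and `parse` signature are invented for illustration; check the actual codebase for the real interface:

```python
class BaseParser:
    """Hypothetical base class; the real contract lives in the Knowhere repo."""
    extensions: tuple[str, ...] = ()

    def parse(self, data: bytes) -> list[dict]:
        raise NotImplementedError

class MarkdownParser(BaseParser):
    extensions = (".md",)

    def parse(self, data: bytes) -> list[dict]:
        # Split on top-level headings, keeping each heading as the chunk's path.
        chunks, current = [], {"path": [], "text": ""}
        for line in data.decode("utf-8").splitlines():
            if line.startswith("# "):
                if current["text"].strip():
                    chunks.append(current)
                current = {"path": [line[2:].strip()], "text": ""}
            else:
                current["text"] += line + "\n"
        if current["text"].strip():
            chunks.append(current)
        return chunks

doc = b"# Intro\nHello.\n# Usage\nRun it.\n"
print(MarkdownParser().parse(doc))
```

A real parser would go further (nested headings, tables, images), but the core job is the same: turn bytes into chunks that keep their place in the document.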

Prerequisites

  • Python 3.11+
  • uv
  • Docker with docker compose

Quick Start

  1. Sync the workspace dependencies:

     ```shell
     uv sync --all-packages
     ```

  2. Copy the environment examples:

     ```shell
     cp apps/api/.env.example apps/api/.env
     cp apps/worker/.env.example apps/worker/.env
     ```

  3. Update the copied .env files with the values you need for local work:
     • database and Redis connection settings
     • S3-compatible storage credentials
     • DS_KEY
     • any optional LLM, billing, or webhook providers you want to enable

  4. Start the local infrastructure stack:

     ```shell
     ./deploy/local-dev/start-dev.sh
     ```

  5. Start the API and worker in separate terminals:

     ```shell
     cd apps/api && uv run main.py
     ```

     ```shell
     cd apps/worker && uv run worker.py
     ```

The API runs migrations during startup.

For API-only development without the dashboard, create an API-only user/key
after the API service starts:

```shell
cd apps/api
uv run scripts/init_user.py --email [email protected]
```

If you plan to use the dashboard, register through the dashboard instead of
using scripts/init_user.py.

The API is now running at http://localhost:5005. If you want the full product experience with a UI, run the knowhere-dashboard alongside it — it connects to this API out of the box.

Quality Checks

Run lint checks from the repository root:

```shell
make lint
```

Apply safe Ruff fixes:

```shell
make lint-fix
```

Run type checks across the API, worker, and shared source code:

```shell
make typecheck
```

Run both lint and type checks:

```shell
make check
```

Local Endpoints

  • API: http://localhost:5005
  • OpenAPI docs: http://localhost:5005/docs
  • LocalStack: http://localhost:4566
  • PostgreSQL: localhost:5432
  • Redis: localhost:6379

Additional Guides

Citation

If you use Knowhere in your research, please cite it as:

```bibtex
@software{knowhere2026,
  author       = {Ontos AI},
  title        = {Knowhere: Prepare Unstructured Data for AI Agents},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Ontos-AI/knowhere},
  version      = {2026.04.30.1},
  license      = {Apache-2.0}
}
```

Communication

Contribution

Any contributions to Knowhere are more than welcome!

If you are new to the project, check out the good first issues. They are well-defined, relatively simple, and a great way to get familiar with the codebase and the contribution workflow.

For general guidelines on branching, commit conventions, and the review process, take a look at CONTRIBUTING.md.

Other useful references:

👋 We're Hiring!

We're building the knowledge layer for the Agent era. If that sounds like work you want to do, reach out — decode the address below and drop us a line:

echo 'dGVhbUBrbm93aGVyZXRvLmFp' | base64 --decode
