knowhere
Knowhere extracts, parses, and outputs structured chunks ready for AI Agents and RAG.
Prepare unstructured data for AI Agents
🔗 Website | 📄 Docs | 🏠 Self-Host | 🖥️ Dashboard
Knowhere is the open-source infrastructure for unstructured data processing. It automates the complex pipeline of extracting, parsing, and transforming messy documents into structured, high-quality data optimized for AI Agents, Agentic RAG, and traditional vector-based RAG workflows.
[!NOTE]
Get started in seconds with Knowhere Cloud.
Avoid the complexity of self-deployment. Use our managed API at knowhereto.ai and enjoy $5 in free credits upon registration.
📢 News
- May 7, 2026: 🚀 Knowhere is now Open Source! We have open-sourced our entire stack for document ingestion, parsing, and agentic RAG. You can now self-host the full platform using knowhere-self-hosted. Check out our Contribution Guide to get involved!
- Apr 30, 2026: 📦 Version 2026.04.30.1 has been released. This update includes several stability improvements and initial support for the agentic RAG layer. See the full changelog for details.
How it Works
[!TIP]
TL;DR: Knowhere parses documents into structured units, maps them in a graph, and lets agents navigate that context to find and cite reliable evidence.
Knowhere turns raw documents into a structured memory store that AI agents can navigate and cite. The process follows a three-stage pipeline:
flowchart LR
A[📄 Document Parsing] --> B[🕸️ Graph Construction]
B --> C[🤖 Agentic Retrieval]
B --> D[🔍 Vector-based RAG]
C --> E[✅ Cited Results]
D --> E
1. Document Parsing
Knowhere routes files to specialized parsers for PDFs, Office docs, images, and more. We don't just extract text; we preserve the document's hierarchy:
- Hierarchical Paths: Every chunk knows its exact location (e.g., Section 2.1 > Table 4).
- Multi-modal Units: Tables and images are treated as distinct assets with their own metadata.
- Structural Awareness: Heading levels and section boundaries are maintained to keep context intact.
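A parsed unit like the ones described above can be pictured as a small record that carries its hierarchical path alongside its content. This is an illustrative sketch only — `ParsedUnit` and its fields are hypothetical and not Knowhere's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ParsedUnit:
    """One parsed unit of a document: a text chunk, table, or image.
    Hypothetical shape for illustration; not Knowhere's real data model."""
    doc_id: str
    kind: str                       # "text" | "table" | "image"
    path: list[str]                 # hierarchical location inside the document
    content: str = ""
    metadata: dict = field(default_factory=dict)

    def path_str(self) -> str:
        # Render the hierarchy as a breadcrumb, e.g. "Section 2.1 > Table 4"
        return " > ".join(self.path)

unit = ParsedUnit(
    doc_id="report.pdf",
    kind="table",
    path=["Section 2.1", "Table 4"],
    metadata={"rows": 12, "cols": 5},
)
print(unit.path_str())  # Section 2.1 > Table 4
```

Keeping the path as a list (rather than a pre-joined string) makes it easy for downstream consumers to truncate, filter, or re-render the hierarchy.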
2. Memory Graph
Parsed content is organized into a lightweight graph. It’s designed as a practical map for agents, not a complex ontology.
- Nodes: Represent documents, sections, and chunks.
- Edges: Map semantic relationships (keyword overlap, summaries) and structural links.
This graph helps agents quickly understand what a document is about and which neighboring files might be relevant.
3a. Agentic Retrieval
An agent navigates the memory graph to find evidence rather than relying on a single vector lookup:
- Hybrid Discovery: Fuses keyword and semantic search (RRF) for broad first-pass coverage.
- Agent Navigation: The agent "walks" the graph, reviewing section previews to drill down into the most relevant paths.
- Cited Evidence: Results are returned as traceable evidence — source document, section, chunk, and any linked image or table assets.
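The "walking" step above can be approximated by a greedy traversal that scores each child's preview against the query and records the visited path as citable evidence. Everything here — the overlap heuristic, the `walk` function, the node names — is a simplified sketch, not Knowhere's actual navigation logic:

```python
def keyword_overlap(query: str, preview: str) -> int:
    """Crude relevance signal: count of shared lowercase tokens.
    A stand-in for whatever scoring the real agent uses."""
    return len(set(query.lower().split()) & set(preview.lower().split()))

def walk(graph: dict, previews: dict, start: str, query: str, max_hops: int = 3) -> list[str]:
    """Greedily follow the child whose preview best matches the query,
    returning the full path so every hop is traceable as evidence."""
    path, node = [start], start
    for _ in range(max_hops):
        children = graph.get(node, [])
        if not children:
            break
        node = max(children, key=lambda c: keyword_overlap(query, previews.get(c, "")))
        path.append(node)
    return path

graph = {"doc": ["sec:intro", "sec:results"], "sec:results": ["chunk:table4"]}
previews = {
    "sec:intro": "project overview and goals",
    "sec:results": "benchmark results accuracy table",
    "chunk:table4": "accuracy results per model",
}
print(walk(graph, previews, "doc", "accuracy results"))
# ['doc', 'sec:results', 'chunk:table4']
```

Returning the whole path, rather than just the final chunk, is what makes the result citable: the caller can see exactly which document and section the evidence came from.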
3b. Vector-based RAG
For teams that prefer a pure retrieval pipeline without agent overhead, Knowhere's parsed chunks plug directly into standard vector stacks:
- Dense Search: Chunk embeddings stored in Qdrant, pgvector, or Milvus for fast ANN lookup.
- Sparse Search: BM25 term index for keyword-sensitive queries.
- Multi-channel Fusion: Dense and sparse results are fused with RRF before being returned, giving you the best of both signals.
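Reciprocal Rank Fusion (RRF), used for the multi-channel fusion above, scores each document by summing 1/(k + rank) across every channel that returned it. A minimal sketch (the chunk IDs are made up; k = 60 is the commonly used default, not necessarily Knowhere's setting):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each channel contributes 1/(k + rank)
    for every document it returns; documents appearing in multiple
    channels accumulate score and rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["chunk:a", "chunk:b", "chunk:c"]   # embedding-similarity order
sparse = ["chunk:b", "chunk:d", "chunk:a"]   # BM25 order
print(rrf([dense, sparse]))
# chunk:b and chunk:a lead because both channels returned them
```

Because RRF only consumes rank positions, it fuses dense and sparse results without needing to normalize their incompatible raw scores.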
Ecosystem
| Repository | Description |
|---|---|
| knowhere | This repo. Backend API and worker — document ingestion, parsing, graph construction, and retrieval. |
| 🖥️ knowhere-dashboard | The web UI. Connects to the API for the full product experience. |
| 🐳 knowhere-self-hosted | Docker Compose stack for self-hosted deployments. Packages the API, worker, and dashboard together. |
| 🐍 knowhere-python-sdk | Official Python SDK for the Knowhere Cloud API. |
| 🦕 knowhere-node-sdk | Official Node.js SDK for the Knowhere Cloud API. |
Features
- Multi-modal Parsing: High-fidelity extraction from PDF, Office, and images, preserving headings, tables, and hierarchical paths.
- Lightweight Memory Graph: Context-aware organization that links documents and chunks for better relationship understanding.
- Agentic RAG: A hybrid retrieval engine combining traditional search (RRF) with autonomous agent navigation.
- Evidence-based Citations: Every result is backed by traceable source paths, ensuring reliability for AI Agent decision-making.
Supported Formats
✅ Supported
- .pdf, .docx, .pptx, .xlsx, .csv
- .jpg, .png
- .md, .txt, .json
⏳ Coming Soon
- .epub, .html, .xml
- .mp4, .mp3
- .skills.md
Want to see a new format supported? Adding a parser is a great first contribution. Check out CONTRIBUTING.md to get started.
Prerequisites
- Python 3.11+
- uv
- Docker with docker compose
Quick Start
- Sync the workspace dependencies:
uv sync --all-packages
- Copy the environment examples:
cp apps/api/.env.example apps/api/.env
cp apps/worker/.env.example apps/worker/.env
- Update the copied .env files with the values you need for local work:
  - database and Redis connection settings
  - S3-compatible storage credentials (e.g., DS_KEY)
  - any optional LLM, billing, or webhook providers you want to enable
- Start the local infrastructure stack:
./deploy/local-dev/start-dev.sh
- Start the API and worker in separate terminals:
cd apps/api && uv run main.py
cd apps/worker && uv run worker.py
The API runs migrations during startup.
For API-only development without the dashboard, create an API-only user/key
after the API service starts:
cd apps/api
uv run scripts/init_user.py --email [email protected]
If you plan to use the dashboard, register through the dashboard instead of
using scripts/init_user.py.
The API is now running at http://localhost:5005. If you want the full product experience with a UI, run the knowhere-dashboard alongside it — it connects to this API out of the box.
Quality Checks
Run lint checks from the repository root:
make lint
Apply safe Ruff fixes:
make lint-fix
Run type checks across the API, worker, and shared source code:
make typecheck
Run both lint and type checks:
make check
Local Endpoints
- API: http://localhost:5005
- OpenAPI docs: http://localhost:5005/docs
- LocalStack: http://localhost:4566
- PostgreSQL: localhost:5432
- Redis: localhost:6379
Additional Guides
- External dependency guide: docs/external-services.md
Citation
If you use Knowhere in your research, please cite it as:
@software{knowhere2026,
author = {Ontos AI},
title = {Knowhere: Prepare Unstructured Data for AI Agents},
year = {2026},
publisher = {GitHub},
url = {https://github.com/Ontos-AI/knowhere},
version = {2026.04.30.1},
license = {Apache-2.0}
}
Communication
- GitHub Discussions for questions, ideas, and general conversation.
- GitHub Issues for bug reports and feature requests.
Contribution
Any contributions to Knowhere are more than welcome!
If you are new to the project, check out the good first issues. They are well-defined, relatively simple, and a great way to get familiar with the codebase and the contribution workflow.
For general guidelines on branching, commit conventions, and the review process, take a look at CONTRIBUTING.md.
Other useful references:
- SECURITY.md — how to report vulnerabilities responsibly.
- CODE_OF_CONDUCT.md — community behavior expectations.
- LICENSE and NOTICE — Apache 2.0.
👋 We're Hiring!
We're building the knowledge layer for the Agent era. If that sounds like work you want to do, reach out — decode the address below and drop us a line:
echo 'dGVhbUBrbm93aGVyZXRvLmFp' | base64 --decode