# OpenDocuments

Open source RAG tool for AI document search -- connect GitHub, Notion, Google Drive and ask questions with cited answers. Self-hosted with Ollama/OpenAI/Claude.
## The Problem: Scattered Knowledge, No AI Search

Your team's knowledge is trapped in silos:
- Engineering docs live in GitHub READMEs and Wiki pages
- Product specs are scattered across Notion databases
- Budget reports sit in Excel files on Google Drive
- API docs are auto-generated Swagger specs nobody reads
- Meeting notes rot in Confluence spaces
- Onboarding guides are buried in `.docx` files on S3
When someone asks "How does our auth system work?" or "What was the Q3 budget for the AI team?", they spend 15 minutes hunting through 5 different tools. And they still might not find the answer.
## The Solution: Self-Hosted AI Document Search

OpenDocuments connects to all your document sources, indexes everything into a unified search engine, and answers questions in natural language -- with source citations so you know exactly where the answer came from.

```bash
npm install -g opendocuments
opendocuments init
opendocuments start
```
Open http://localhost:3000, and ask away.
OpenDocuments is a free, open source alternative to proprietary enterprise AI search tools. It's a self-hosted RAG (Retrieval-Augmented Generation) platform that runs on your own infrastructure.
## Recent Improvements

- One-touch Ollama setup: `init` auto-detects Ollama, offers to pull missing models
- `.env` auto-loading: API keys in `.env` are loaded automatically (no manual export needed)
- Multi-turn conversations: chat remembers previous context for follow-up questions
- Degraded mode warnings: clear banners when models aren't configured, with fix instructions
- Enhanced diagnostics: `opendocuments doctor` checks Ollama connectivity, model availability, and config validity
- Security hardening: FTS5 injection prevention, file upload sanitization, OAuth state limits, workspace isolation
## Real-World Use Cases

### For Engineering Teams
"How do I authenticate against our internal API?"
OpenDocuments pulls the answer from your GitHub repo's docs/auth.md, links to the relevant Swagger endpoint, and includes a code example from the codebase -- all in one response.
```bash
# Index your repo and API docs
opendocuments index ./docs
opendocuments connector sync github
opendocuments ask "How does JWT token refresh work in our API?"
```
### For Operations & HR Teams
"What's the remote work policy for the Tokyo office?"
OpenDocuments searches across your Confluence HR space, the employee handbook on Google Drive, and the latest policy update email -- even if some documents are in Korean and others in English.
```bash
opendocuments ask "도쿄 오피스 원격 근무 정책이 뭐야?" --profile precise
# (Korean for "What's the remote work policy for the Tokyo office?")
# Cross-lingual search finds both Korean and English documents
```
### For Product Managers
"Compare the feature specs of v2.0 vs v3.0"
OpenDocuments decomposes the question, searches both versions' specs, and presents a structured comparison table -- citing each source document.
### For AI-Assisted Development (MCP)
Use OpenDocuments as a knowledge base for Claude Code, Cursor, or any MCP-compatible AI tool:
```json
{
  "mcpServers": {
    "opendocuments": {
      "command": "opendocuments",
      "args": ["start", "--mcp-only"]
    }
  }
}
```
Now your AI coding assistant can search your organization's entire document corpus while writing code.
### For Self-Hosted Knowledge Bases
Deploy on your own infrastructure. Your data never leaves your network when using a local LLM via Ollama. No cloud dependency, no vendor lock-in, no subscription fees.
```bash
docker compose --profile with-ollama up -d
# Everything runs locally: LLM, embeddings, vector search, web UI
```
## Quick Start

### 1. Install

```bash
npm install -g opendocuments
```
### 2. Initialize

```bash
opendocuments init
```
The interactive wizard will:
- Detect your hardware (CPU, RAM) and recommend the optimal LLM
- Let you choose between local (Ollama) or cloud (OpenAI, Claude, Gemini, Grok) models
- Auto-detect Ollama and offer to pull missing models automatically
- Validate cloud API keys before saving
- Select a plugin preset: `Developer`, `Enterprise`, `All`, or `Custom`
- Generate `opendocuments.config.ts` and `.env` (API keys loaded automatically)
### 3. Start

```bash
opendocuments start
```
Open http://localhost:3000 -- you'll see a chat UI, document manager, and admin dashboard.
First time? If Ollama isn't running, you'll see a clear DEGRADED MODE banner with step-by-step fix instructions. Run `opendocuments doctor` for full diagnostics.
### 4. Index Your Documents

```bash
# Index a local directory (recursively finds all supported files)
opendocuments index ./docs

# Watch mode: auto-reindex when files change
opendocuments index ./docs --watch

# Or drag-and-drop files in the Web UI
```
### 5. Ask Questions

```bash
opendocuments ask "What's our deployment process?"
```
## How It Works

```text
Your Documents                 OpenDocuments             You
──────────────                 ─────────────             ───
GitHub repos ──┐
Notion pages ──┤               ┌─────────────┐
Google Drive ──┤ ── Ingest ──► │ Parse       │
Confluence   ──┤               │ Chunk       │    "How does
S3 buckets   ──┤               │ Embed       │     auth work?"
Swagger specs──┤               │ Store       │         │
Local files  ──┤               └──────┬──────┘         │
Web pages    ──┘                      │                ▼
                               ┌──────┴──────┐  ┌─────────────┐
                               │ SQLite      │  │ RAG Engine  │
                               │ (metadata)  │◄─┤  Search     │
                               │             │  │  Rerank     │
                               │ LanceDB     │  │  Generate   │
                               │ (vectors)   │  │ Cite sources│
                               └─────────────┘  └──────┬──────┘
                                                       │
                                                       ▼
                                                "Auth uses JWT
                                                 tokens with
                                                 refresh flow.
                                                 [Source: auth.md]"
```
### The RAG Pipeline
- Intent Classification -- Understands whether you're asking about code, concepts, data, or want a comparison
- Query Decomposition -- Breaks complex questions into sub-queries for better retrieval
- Cross-Lingual Search -- Finds documents in both Korean and English regardless of query language
- Hybrid Search -- Combines dense vector search (semantic) with FTS5 sparse search (keyword) via Reciprocal Rank Fusion
- Reranking -- Scores results by keyword overlap and model-based relevance
- Confidence Scoring -- Tells you honestly when it's not sure about an answer
- Hallucination Guard -- Verifies each sentence is grounded in the retrieved sources
- 3-Tier Caching -- L1 query cache (5min), L2 embedding cache (24h), L3 web search cache (1h)
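To make the hybrid-search step concrete, here is a minimal sketch of Reciprocal Rank Fusion merging a dense (vector) ranking with a sparse (FTS5 keyword) ranking. This is an illustrative implementation, not the actual OpenDocuments internals; the `Ranked` type and function name are hypothetical.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// per document; documents that rank well in BOTH lists float to the top.
type Ranked = { id: string; rank: number } // rank is 1-based

function reciprocalRankFusion(lists: Ranked[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>()
  for (const list of lists) {
    for (const { id, rank } of list) {
      // k damps the dominance of any single list's top result
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    }
  }
  return scores
}

// deploy.md is #2 by vectors but #1 by keywords; fusion rewards the agreement
const dense = [{ id: 'auth.md', rank: 1 }, { id: 'deploy.md', rank: 2 }]
const sparse = [{ id: 'deploy.md', rank: 1 }, { id: 'auth.md', rank: 3 }]
const fused = reciprocalRankFusion([dense, sparse])
```

With `k = 60`, `deploy.md` (1/61 + 1/62) edges out `auth.md` (1/61 + 1/63) because it ranks well in both lists.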
## Supported File Formats

| Format | Extensions | How It's Parsed |
|---|---|---|
| Markdown | `.md`, `.mdx` | Heading hierarchy, code block separation |
| Plain Text | `.txt` | Direct text indexing |
| PDF | `.pdf` | Page-level extraction, OCR fallback for scanned docs |
| Word | `.docx` | HTML conversion with heading detection |
| Excel / CSV | `.xlsx`, `.xls`, `.csv` | Sheet-aware table chunking (header + rows) |
| HTML | `.html`, `.htm` | Structure-preserving extraction, script/nav stripping |
| Jupyter Notebook | `.ipynb` | Markdown cells + code cells with language detection |
| Email | `.eml` | Header parsing (from/to/subject/date) + body extraction |
| Source Code | `.js`, `.ts`, `.py`, `.java`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, `.kt` + more | Function/class-level chunking with import extraction |
| PowerPoint | `.pptx` | Slide-level text extraction |
| Structured Data | `.json`, `.yaml`, `.yml`, `.toml` | Config and schema indexing |
| Archive | `.zip` | Placeholder (full extraction planned) |
Fallback chains: if a parser fails, the next one is tried automatically:

```ts
parserFallbacks: {
  '.pdf': ['@opendocuments/parser-pdf', '@opendocuments/parser-ocr'],
}
```
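Conceptually, a fallback chain is just an ordered try/catch walk over the configured parsers. A minimal sketch -- the `Parser` interface and error format here are assumptions for illustration, not the real plugin API:

```typescript
// Try each parser in order; return the first successful result.
interface Parser {
  name: string
  parse(path: string): string
}

function parseWithFallback(path: string, chain: Parser[]): string {
  const errors: string[] = []
  for (const parser of chain) {
    try {
      return parser.parse(path)
    } catch (err) {
      // Record the failure and fall through to the next parser in the chain
      errors.push(`${parser.name}: ${(err as Error).message}`)
    }
  }
  throw new Error(`All parsers failed for ${path}: ${errors.join('; ')}`)
}
```

For a scanned PDF with no text layer, the first parser throws and the chain falls through to the OCR parser.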
## Data Sources
| Source | What It Indexes | Auth | How It Syncs |
|---|---|---|---|
| Local Files | Any supported format on your filesystem | None | File watching (--watch) |
| File Upload | Drag-and-drop in Web UI | None | Instant |
| GitHub | README, Wiki, code files, Issues | Personal Access Token | Polling / webhook |
| Notion | Pages, databases, all block types | Integration Token | Polling |
| Google Drive | Docs, Sheets, Slides, uploaded files | OAuth / Service Account | Polling |
| Amazon S3 / Google Cloud Storage | Any supported format in buckets | AWS / GCP credentials | Polling |
| Confluence | Wiki pages across spaces | API Token + Email | Polling |
| Swagger / OpenAPI | API endpoints with parameters and schemas | None (public specs) | Manual |
| Web Crawler | Any URL you register | Optional (cookies/headers) | Periodic |
| Web Search (Tavily) | Real-time web results merged into answers | Tavily API Key | Query-time |
## Model Providers

### Cloud Providers
| Provider | Models | Embedding | Best For |
|---|---|---|---|
| OpenAI | GPT-5.4, GPT-5.4-mini, GPT-4.1, o3, o4-mini | text-embedding-3-small/large | General purpose, vision, reasoning |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 | -- (use separate provider) | Long context (1M), coding, analysis |
| Google | Gemini 3.1 Pro, Gemini 3.1 Flash Lite, Gemini 3.0 Deep Think | text-embedding-005 | Multimodal, multilingual |
| xAI | Grok 4, Grok 4 Heavy, Grok 4.1 Fast | Grok embedding | Real-time knowledge, code |
### Local Models (via Ollama)
| Model | Active Params | Total Params | Vision | Korean | Best For |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 27B (dense) | 27B | Yes | Excellent | General purpose (32GB+ RAM) |
| Qwen 3.5 9B | 9B (dense) | 9B | Yes | Excellent | Mid-range (16GB RAM) |
| Qwen 3.5-122B-A10B | 10B (MoE) | 122B | Yes | Excellent | High quality, efficient |
| Llama 4 Scout | 17B (MoE) | 109B | Yes | Good | 10M context window |
| Llama 4 Maverick | 17B (MoE) | 400B | Yes | Good | Top open-source quality |
| DeepSeek V3.2 | 37B (MoE) | 671B | No | Good | Coding, reasoning |
| Gemma 3 27B | 27B | 27B | Yes | Good | Lightweight, 140+ languages |
| Gemma 3 4B | 4B | 4B | Yes | Good | Low-spec machines (8GB RAM) |
| K-EXAONE | 23B (MoE) | 236B | No | Best | Korean-specialized |
| EXAONE Deep 32B | 32B | 32B | No | Best | Korean reasoning |
| Phi-4 Reasoning Vision | 15B | 15B | Yes | Fair | Compact multimodal |
### Embedding Models
| Model | Dimensions | Korean | Multimodal | Where |
|---|---|---|---|---|
| BGE-M3 | 1024 | Excellent | No | Ollama (default) |
| text-embedding-3-large | 3072 | Good | No | OpenAI |
| text-embedding-005 | 768 | Good | No | Google |
| nomic-embed-text | 768 | Fair | No | Ollama (lightweight) |
### Auto-Recommendation

`opendocuments init` detects your hardware and recommends the best model:
| Your Hardware | Recommended Model | Recommended Embedding |
|---|---|---|
| 32GB+ RAM, GPU | Qwen 3.5 27B or Llama 4 Scout | BGE-M3 |
| 16GB RAM | Qwen 3.5 9B | BGE-M3 |
| 8GB RAM | Gemma 3 4B | nomic-embed-text |
| Any (cloud) | Claude Sonnet 4.6 or GPT-5.4-mini | text-embedding-3-large |
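Expressed as code, the local-model recommendation above amounts to a RAM/GPU lookup. A sketch only: of these Ollama tags, just `qwen3.5:27b` and `bge-m3` appear verbatim in the Configuration example below; the others are assumed names for the table's models.

```typescript
// Map detected hardware to the recommendation table (illustrative logic).
function recommendModel(
  ramGB: number,
  hasGPU: boolean
): { llm: string; embedding: string } {
  if (ramGB >= 32 && hasGPU) return { llm: 'qwen3.5:27b', embedding: 'bge-m3' }
  if (ramGB >= 16) return { llm: 'qwen3.5:9b', embedding: 'bge-m3' }
  // Low-spec machines get the smallest model and the lightweight embedder
  return { llm: 'gemma3:4b', embedding: 'nomic-embed-text' }
}
```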
## Three Ways to Use

### 1. Web UI

Full-featured dashboard at http://localhost:3000:
| Page | What You Can Do |
|---|---|
| Chat | Ask questions with streaming answers, source citations, confidence scores, feedback buttons. Switch between fast/balanced/precise profiles. |
| Documents | Browse indexed documents, drag-and-drop upload, view document details, soft-delete with trash/restore. |
| Connectors | See connector sync status and last sync times. |
| Plugins | View installed plugins with health indicators. |
| Settings | Toggle dark/light theme, change RAG profile, view server version. |
| Admin | Stats dashboard, search quality metrics, paginated query logs, plugin health, connector status, audit logs. |
Keyboard shortcuts: Cmd+K opens the Command Palette. Cmd+1-5 navigates between pages.
### 2. CLI

17 commands for power users and automation:

```bash
# Ask questions
opendocuments ask "What's the deploy process?"
opendocuments ask                                # Interactive REPL mode
opendocuments search "auth middleware" --top 10  # Vector search, no LLM

# Manage documents
opendocuments index ./docs --watch   # Index + auto-reindex on changes
opendocuments document list          # See all indexed docs
opendocuments document delete <id>   # Soft-delete

# Manage connectors
opendocuments connector sync         # Sync all connectors
opendocuments connector status       # Check sync status

# Pipe support for scripting
cat README.md | opendocuments ask "Summarize this" --stdin
opendocuments ask "List endpoints" --json | jq '.sources[].sourcePath'

# Administration
opendocuments doctor                 # Health check
opendocuments auth create-key --name "ci-bot" --role member
opendocuments export --output ./backup
```
### 3. MCP Server

19 tools for AI-assisted workflows. Works with Claude Code, Cursor, Windsurf, and any MCP client.

```bash
opendocuments start --mcp-only
```
Your AI assistant can then:
- Search your organization's documents while coding
- Index new files as they're created
- Check document status and connector health
- Query configuration
## RAG Profiles

| | `fast` | `balanced` | `precise` |
|---|---|---|---|
| Speed | ~1s | ~3s | ~5s+ |
| Search depth | 10 docs | 20 docs | 50 docs |
| Reranking | Off | On | On |
| Cross-lingual | Off | Korean + English | Korean + English |
| Query decomposition | Off | Off | Splits complex queries |
| Web search | Off | Fallback when local results are weak | Always merged |
| Hallucination guard | Off | Checks source grounding | Strict mode (annotates unverified) |
| Best for | Quick lookups, 8B local models | Daily use, 14B+ models | Critical questions, cloud LLMs |
Switch anytime: CLI flag (`--profile precise`), Web UI toggle, or config file.
## Security

### Personal Mode (default)

Zero configuration. No auth. Localhost only. Just works.

### Team Mode

```ts
// opendocuments.config.ts
export default defineConfig({ mode: 'team' })
```
| Feature | How It Works |
|---|---|
| API Keys | od_live_ prefix, SHA-256 hashed, never stored in plaintext. Scoped to specific operations, with optional expiration. |
| Roles | admin (everything), member (read + write), viewer (read only) |
| Rate Limiting | 60 req/min default, per-key override. In-memory with lazy cleanup. |
| PII Redaction | Automatically masks emails, phone numbers, credit cards, IPs before sending to cloud LLMs. Configurable patterns and methods (replace/hash/remove). |
| Audit Log | Records auth events, document access, config changes. Queryable via admin API. |
| Security Alerts | Detects brute-force attempts, unusual data exports, API key abuse. |
| OAuth SSO | Google and GitHub login with HttpOnly cookie sessions. |
| Workspace Isolation | Every vector search is enforced with workspace_id filter. Documents, conversations, and API keys are scoped to workspaces. |
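The hashed-key scheme in the table can be sketched with Node's `crypto` module. The function names are illustrative (the real logic lives in `@opendocuments/core`); the point is that only a digest is ever persisted.

```typescript
import { createHash, randomBytes } from 'node:crypto'

// Create a key: the raw value is shown to the user exactly once; only its
// SHA-256 digest is stored, so a database leak exposes no usable keys.
function createApiKey(): { raw: string; storedHash: string } {
  const raw = `od_live_${randomBytes(24).toString('hex')}`
  return { raw, storedHash: createHash('sha256').update(raw).digest('hex') }
}

// Verify: re-hash the presented key and compare against the stored digest.
function verifyApiKey(presented: string, storedHash: string): boolean {
  return createHash('sha256').update(presented).digest('hex') === storedHash
}
```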
## Configuration

```ts
// opendocuments.config.ts
import { defineConfig } from 'opendocuments-core'

export default defineConfig({
  workspace: 'my-team',
  mode: 'personal',
  model: {
    provider: 'ollama',
    llm: 'qwen3.5:27b',
    embedding: 'bge-m3',
  },
  rag: { profile: 'balanced' },
  connectors: [
    { type: 'github', repo: 'org/repo', token: process.env.GITHUB_TOKEN },
    { type: 'notion', token: process.env.NOTION_TOKEN },
    { type: 'web-crawler', urls: ['https://docs.example.com'] },
  ],
  plugins: ['@opendocuments/parser-pdf', '@opendocuments/parser-docx'],
  security: {
    dataPolicy: {
      autoRedact: { enabled: true, patterns: ['email', 'phone', 'credit-card'] },
    },
    audit: { enabled: true },
  },
  storage: { db: 'sqlite', vectorDb: 'lancedb', dataDir: '~/.opendocuments' },
})
```
## Docker Deployment

```bash
# Basic (cloud LLM)
docker compose up -d

# With local LLM (Ollama)
docker compose --profile with-ollama up -d

# With .env file for API keys
docker compose --env-file .env up -d
```
The Docker image includes all packages and plugins. Data persists in a named volume. Mount your config:
```bash
docker run -v ./opendocuments.config.ts:/app/opendocuments.config.ts \
  -v opendocuments-data:/data -p 3000:3000 opendocuments
```
## Plugin Development

Create custom parsers, connectors, or model providers:

```bash
opendocuments plugin create my-parser --type parser
cd my-parser
npm install
npm run test
npm run dev                   # Watch mode
opendocuments plugin publish  # Publish to npm
```
Four plugin types: parser, connector, model, middleware. Each has a typed interface with lifecycle hooks (`setup`, `teardown`, `healthCheck`, `metrics`).

Community plugins follow the naming convention `opendocuments-plugin-*`.
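As a rough shape, a parser plugin combining the four lifecycle hooks might look like this. This is an illustrative sketch -- the authoritative interfaces live in `@opendocuments/core`, and only the hook names come from the text above:

```typescript
// Lifecycle hooks shared by all plugin types (names from the text above).
interface PluginLifecycle {
  setup(): Promise<void>
  teardown(): Promise<void>
  healthCheck(): Promise<{ healthy: boolean; message?: string }>
  metrics(): Record<string, number>
}

// A parser plugin adds the extensions it claims plus a parse() entry point.
interface ParserPlugin extends PluginLifecycle {
  name: string // e.g. 'opendocuments-plugin-my-parser'
  extensions: string[]
  parse(path: string): Promise<string>
}

// A do-nothing example implementation of the sketched interface:
const example: ParserPlugin = {
  name: 'opendocuments-plugin-example',
  extensions: ['.example'],
  async setup() {},
  async teardown() {},
  async healthCheck() { return { healthy: true } },
  metrics: () => ({ parsedDocs: 0 }),
  async parse(path) { return `parsed ${path}` },
}
```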
See CONTRIBUTING.md for the full plugin development guide.
## TypeScript SDK

```ts
import { OpenDocumentsClient } from '@opendocuments/client'

const client = new OpenDocumentsClient({
  baseUrl: 'http://localhost:3000',
  apiKey: 'od_live_...',
})

const result = await client.ask('How does auth work?')
console.log(result.answer)     // "Auth uses JWT tokens with..."
console.log(result.sources)    // [{ sourcePath: 'docs/auth.md', score: 0.92 }]
console.log(result.confidence) // { level: 'high', score: 0.87 }
```
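Building on the `result` shape above, a caller might gate on the confidence level before surfacing an answer. A usage sketch: the field shapes are taken from the example output, but `formatAnswer` and the `'medium'` level are assumptions.

```typescript
interface AskResult {
  answer: string
  sources: { sourcePath: string; score: number }[]
  confidence: { level: 'high' | 'medium' | 'low'; score: number }
}

// Only surface an answer when confidence is high enough; otherwise point
// the user at the closest source documents instead of guessing.
function formatAnswer(result: AskResult): string {
  if (result.confidence.level === 'low') {
    const paths = result.sources.map((s) => s.sourcePath).join(', ')
    return `Low confidence -- check the sources directly: ${paths}`
  }
  const cites = result.sources.map((s) => `[${s.sourcePath}]`).join(' ')
  return `${result.answer} ${cites}`
}
```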
## Embeddable Widget

Add a chat widget to your internal tools:

```html
<script src="http://localhost:3000/widget.js"></script>
<script>
  OpenDocuments.widget({
    server: 'http://localhost:3000',
    apiKey: 'od_live_...',
    workspace: 'public-docs',
  })
</script>
```
## Development

```bash
git clone https://github.com/joungminsung/OpenDocuments.git
cd OpenDocuments
npm run setup  # Install + build (one command)
npm run test   # 51 test suites, ~300 tests
npm run dev    # Watch mode
```
## Architecture

| Package | Role | Tests |
|---|---|---|
| `@opendocuments/core` | Plugin system, RAG engine, ingest pipeline, storage, auth, security | 159 |
| `@opendocuments/server` | HTTP API (Hono), MCP server, auth middleware, widget | 27 |
| `@opendocuments/cli` | 17 CLI commands (Commander.js) | 3 |
| `@opendocuments/web` | React SPA with 7 pages (Vite + Tailwind) | -- |
| `@opendocuments/client` | TypeScript SDK | 3 |
| 5 model plugins | Ollama, OpenAI, Anthropic, Google, Grok | 41 |
| 9 parser plugins | PDF, DOCX, XLSX, HTML, Jupyter, Email, Code, PPTX, Structured | 37 |
| 8 connector plugins | GitHub, Notion, GDrive, S3, Confluence, Swagger, WebCrawler, WebSearch | 38 |
See CONTRIBUTING.md for conventions, test patterns, and plugin development guide.
## Documentation
| Guide | Description |
|---|---|
| Quick Start | Install and run in 5 minutes |
| Architecture | Package structure, data flow, design decisions |
| Plugin API: Parsers | Create custom document parsers |
| Plugin API: Connectors | Connect external data sources |
| Plugin API: Models | Add custom AI providers |
| TypeScript SDK | Programmatic API client |
| Security Policy | Vulnerability reporting |
| Contributing | Development setup, conventions, plugin guide |
## License