# nlp2sql

## Health: Warn

- License: MIT
- Description: the repository has a description
- Active repo: last push today
- Low visibility: only 8 GitHub stars

## Code: Pass

- Code scan: 12 files scanned in a light audit; no dangerous patterns found

## Permissions: Pass

- No dangerous permissions requested
This tool converts natural language queries into optimized SQL using AI providers like OpenAI, Anthropic, and Google Gemini. It is designed for production environments and large-scale databases with over 1000 tables.
## Security Assessment
Overall Risk: Medium. The tool connects to external AI provider APIs, meaning your natural language queries and potentially sensitive database schemas are sent over the network to third parties. It also directly connects to your databases (like PostgreSQL or Redshift). While the light code scan found no dangerous shell execution patterns, hardcoded secrets, or malicious code, inherent risks remain in passing database structures to external AI models. Ensure strict environment variable management for your API and database credentials.
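The environment-variable advice above can be sketched as a small fail-fast helper. The function below is illustrative and not part of nlp2sql; it simply shows the pattern of reading credentials from the environment instead of hardcoding them:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, failing fast if it is unset.

    Keeps API keys and database URLs out of source code and version control.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value
```

Usage might look like `api_key = require_env("OPENAI_API_KEY")`, so a missing credential fails loudly at startup rather than surfacing as a confusing provider error later.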
## Quality Assessment
The project is under active development, with its most recent push occurring just today. It benefits from a clear README, solid architectural documentation, and a standard permissive MIT license. The codebase follows modern practices, utilizing asynchronous Python and clean architecture patterns. However, community trust and visibility are currently very low. With only 8 GitHub stars, the tool has not yet been widely peer-reviewed or battle-tested by a large user base.
## Verdict
Use with caution. The code is clean and safe to run, but passing live database schemas to external AI providers requires careful consideration of your organization's data privacy policies.
# nlp2sql

Enterprise-ready Natural Language to SQL converter with multi-provider support.
Convert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.
## Features
- Multiple AI Providers: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in
- Database Support: PostgreSQL, Amazon Redshift
- Large Schema Handling: Vector embeddings and intelligent filtering for 1000+ tables
- Smart Caching: Query and schema embedding caching for improved performance
- Async Support: Full async/await support
- Clean Architecture: Ports & Adapters pattern for maintainability
## Documentation
| Document | Description |
|---|---|
| Architecture | Component diagram and data flow |
| API Reference | Python API and CLI command reference |
| Configuration | Environment variables and schema filters |
| Enterprise Guide | Large-scale deployment and migration |
| Redshift Support | Amazon Redshift setup and examples |
| Contributing | Contribution guidelines |
## Installation

```bash
# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]   # Local embeddings (free)
pip install nlp2sql[embeddings-openai]  # OpenAI embeddings
```
## Quick Start
### 1. Set an API Key

```bash
export OPENAI_API_KEY="your-openai-key"
# or ANTHROPIC_API_KEY, GOOGLE_API_KEY
```
### 2. Connect and Ask

```python
import asyncio

import nlp2sql
from nlp2sql import ProviderConfig

async def main():
    nlp = await nlp2sql.connect(
        "postgresql://user:pass@localhost:5432/mydb",
        provider=ProviderConfig(provider="openai", api_key="sk-..."),
    )
    result = await nlp.ask("Show me all active users")
    print(result.sql)
    print(result.confidence)
    print(result.is_valid)

asyncio.run(main())
```
`connect()` auto-detects the database type from the URL, loads the schema, and builds the FAISS embedding index. Subsequent `ask()` calls reuse everything from the disk cache.
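One way to key such a per-database disk cache is by hashing the connection URL, so a second `connect()` to the same database finds its existing index. The scheme below is a sketch under that assumption, not nlp2sql's actual cache layout:

```python
import hashlib
from pathlib import Path

def cache_path(database_url: str, cache_dir: str = "~/.nlp2sql/cache") -> Path:
    # Illustrative scheme: one cache entry per database, keyed by a short
    # hash of its URL, so repeated connections reuse the same index file.
    key = hashlib.sha256(database_url.encode()).hexdigest()[:16]
    return Path(cache_dir).expanduser() / f"{key}.index"
```

The same URL always maps to the same path, while different databases never collide in practice.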
### 3. Few-Shot Examples

Pass a list of dicts; `connect()` handles embedding and indexing automatically:

```python
nlp = await nlp2sql.connect(
    "redshift://user:pass@host:5439/db",
    provider=ProviderConfig(provider="openai", api_key="sk-..."),
    schema="dwh_data_share_llm",
    examples=[
        {
            "question": "Total revenue last month?",
            "sql": "SELECT SUM(revenue) FROM sales WHERE date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')",
            "database_type": "redshift",
        },
    ],
)

result = await nlp.ask("Show me total sales this quarter")
```
### 4. Schema Filtering (Large Databases)

```python
nlp = await nlp2sql.connect(
    "postgresql://localhost/enterprise",
    provider=ProviderConfig(provider="anthropic", api_key="sk-ant-..."),
    schema_filters={
        "include_schemas": ["sales", "finance"],
        "exclude_system_tables": True,
    },
)
```
### 5. Custom Model and Temperature

```python
nlp = await nlp2sql.connect(
    "postgresql://localhost/mydb",
    provider=ProviderConfig(
        provider="openai",
        api_key="sk-...",
        model="gpt-4o",
        temperature=0.0,
        max_tokens=4000,
    ),
)
```
### 6. CLI

```bash
nlp2sql query \
  --database-url postgresql://user:pass@localhost:5432/mydb \
  --question "Show all active users" \
  --explain

nlp2sql inspect --database-url postgresql://localhost/mydb
```
## Advanced: Direct Service Access

For full control over the lifecycle, the lower-level API is still available:

```python
from nlp2sql import create_and_initialize_service, ProviderConfig, DatabaseType

service = await create_and_initialize_service(
    database_url="postgresql://localhost/mydb",
    provider_config=ProviderConfig(provider="openai", api_key="sk-..."),
    database_type=DatabaseType.POSTGRES,
)

result = await service.generate_sql("Count total users", database_type=DatabaseType.POSTGRES)
```
## How It Works

```
Question ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation
                                    │                      │                        │
                             SchemaRepository      FAISS + TF-IDF hybrid     Reuses precomputed
                             (+ disk cache)        + batch scoring           relevance scores
```
1. **Schema retrieval**: fetches tables from the database via `SchemaRepository` (with a disk cache for Redshift)
2. **Relevance filtering**: FAISS dense search + TF-IDF sparse search (a 50/50 hybrid) finds candidate tables; batch scoring refines the set with precomputed embeddings
3. **Context building**: builds an optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)
4. **SQL generation**: the AI provider (OpenAI, Anthropic, or Gemini) generates SQL from the question plus schema context
5. **Validation**: SQL syntax and safety checks run before results are returned
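The 50/50 FAISS + TF-IDF blend described above amounts to merging two per-table score maps. The sketch below illustrates the blend with made-up scores and table names; the real implementation ranks actual FAISS and TF-IDF search results:

```python
def hybrid_scores(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    # Blend dense (embedding) and sparse (TF-IDF) relevance, 50/50 by default.
    # A table seen by only one retriever still receives a partial score.
    tables = set(dense) | set(sparse)
    return {t: alpha * dense.get(t, 0.0) + (1 - alpha) * sparse.get(t, 0.0)
            for t in tables}

# Hypothetical per-table scores for the question "total sales this quarter"
dense = {"sales": 0.82, "users": 0.40}
sparse = {"sales": 0.60, "orders": 0.30}
ranked = sorted(hybrid_scores(dense, sparse).items(), key=lambda kv: -kv[1])
```

Tables found by both retrievers rise to the top, which is the point of hybrid search: dense retrieval catches semantic matches, sparse retrieval catches exact name matches.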
See Architecture for the detailed flow with method references and design decisions.
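The validation step's safety screen might look like the sketch below: a keyword denylist plus a single-statement check. This is a minimal stand-in, not the rules of the repo's actual `RegexQueryValidator`:

```python
import re

# Statements a read-only NL-to-SQL pipeline would typically refuse to return.
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|TRUNCATE|ALTER|UPDATE|INSERT|GRANT)\b", re.IGNORECASE
)

def is_safe_select(sql: str) -> bool:
    # Reject anything that is not a single SELECT/WITH statement, or that
    # mentions a destructive keyword anywhere in the statement.
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not stripped.upper().startswith(("SELECT", "WITH")):
        return False
    return not FORBIDDEN.search(stripped)
```

Word boundaries matter here: a table named `updates` passes, while an actual `UPDATE` statement is rejected.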
## Provider Comparison
| Provider | Default Model | Context Size | Best For |
|---|---|---|---|
| OpenAI | gpt-4o-mini | 128K | Cost-effective, fast |
| Anthropic | claude-sonnet-4-20250514 | 200K | Large schemas |
| Google Gemini | gemini-2.0-flash | 1M | High volume |
All models are configurable via `ProviderConfig(model="...")`. See Configuration for details.
## Architecture
Clean Architecture (Ports & Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.
```
nlp2sql/
├── client.py     # DSL: connect() + NLP2SQL class (recommended entry point)
├── core/         # Pure Python: entities, ProviderConfig, QueryResult, sql_safety, sql_keywords
├── ports/        # Interfaces: AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort,
│                 #   ExampleRepositoryPort, QuerySafetyPort, QueryValidatorPort, CachePort
├── adapters/     # Implementations: OpenAI, Anthropic, Gemini, PostgreSQL, Redshift,
│                 #   RegexQueryValidator
├── services/     # Orchestration: QueryGenerationService
├── schema/       # Schema management: SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager,
│                 #   ExampleStore
├── config/       # Pydantic Settings (centralized defaults)
└── exceptions/   # Exception hierarchy (NLP2SQLException -> 8 subclasses)
```
See Architecture for the full component diagram, data flow, and design decisions.
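The Ports & Adapters split above can be illustrated with a `typing.Protocol`: the service layer depends only on the port interface, and any adapter that matches the shape plugs in. The signature below is a simplified stand-in, not the repo's actual `AIProviderPort`:

```python
import asyncio
from typing import Protocol

class AIProviderPort(Protocol):
    # Port: what the orchestration layer needs from any AI backend.
    async def generate_sql(self, question: str, schema_context: str) -> str: ...

class EchoProvider:
    # Adapter: a trivial stand-in that satisfies the port without API calls,
    # which is also how you would fake a provider in tests.
    async def generate_sql(self, question: str, schema_context: str) -> str:
        return f"-- would generate SQL for: {question}"

async def demo(provider: AIProviderPort) -> str:
    # The caller never imports a concrete provider, only the port.
    return await provider.generate_sql("Count users", "users(id, active)")

print(asyncio.run(demo(EchoProvider())))
```

Because `Protocol` uses structural subtyping, `EchoProvider` needs no inheritance to satisfy the port; swapping OpenAI for Anthropic is just passing a different object.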
## Development

```bash
# Clone and install
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start test databases
cd docker && docker-compose up -d

# Run tests
uv run pytest

# Code quality
uv run ruff format .
uv run ruff check .
uv run mypy src/
```
## MCP Server

nlp2sql includes a Model Context Protocol server for AI assistant integration.

```json
{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://user:pass@localhost:5432/mydb"
      }
    }
  }
}
```
Tools: `ask_database`, `explore_schema`, `run_sql`, `list_databases`, `explain_sql`

See `mcp_server/README.md` for the complete setup.
## Contributing
We welcome contributions. See CONTRIBUTING.md for guidelines.
## License
MIT License - see LICENSE.
## Author
Luis Carbonel - @luiscarbonel1991