Production-grade evaluation framework for Text-to-SQL and SQL AI agents. 50 metrics. Any LLM. pip install sqlas.
SQLAS — SQL Agent Scoring Framework
A RAGAS-equivalent evaluation library for Text-to-SQL and Agentic SQL agents.
Evaluate SQL agents across 50+ metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, guardrails, and cache ROI. Aligned with Spider, BIRD, RAGAS, and MLflow standards.
Author: thepradip
Install
```bash
pip install sqlas             # core
pip install "sqlas[mlflow]"   # + MLflow
pip install "sqlas[ui]"       # + Streamlit UI
pip install "sqlas[all]"      # everything
```
What's New in v2.7.0
| Feature | Description |
|---|---|
| Multi-gold SQL | execution_accuracy_best_of(sql, gold_sqls) — evaluate against all valid gold queries, take best score. TestCase.gold_sqls: list[str] |
| Hardness classification | auto_classify_hardness(sql) → easy/medium/hard/extra-hard per BIRD criteria. Auto-set on every evaluate() call |
| Exact match metric | exact_match(generated, gold) — normalized string comparison. Exposed as SQLASScores.exact_match_score |
| Failure classification | classify_failure(sql, scores, details) → named FailureCategory with top_hint() actionable fix |
| Batch crash isolation | One failing test case no longer kills the entire batch |
| LLM retry with backoff | _retry_llm_judge() retries 3× (1s→2s→4s) on all 13 LLM judge call sites |
| Weight normalization | Custom weights auto-normalized to 1.0 instead of silently distorting scores |
| LLM judge cache | enable_judge_cache() — opt-in in-memory cache prevents re-scoring identical pairs in CI |
| Report generation | generate_report(scores_list, format="markdown" or "json"), to_json(), to_markdown_report() |
| Non-deterministic detection | NOW(), RANDOM(), CURRENT_TIMESTAMP trigger nondeterministic_warning in details |
| execute_fn timeout | 30s wall-clock timeout with thread-safe SQLite fallback |
| Safety patterns | UNION ALL SELECT, EXCEPT SELECT, WAITFOR DELAY, file injection, NL prompt injection synonyms |
Quick Start
```python
from openai import OpenAI  # any LLM client works; the judge is just a str -> str callable

from sqlas import evaluate

openai_client = OpenAI()

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

scores = evaluate(
    question = "How many active users?",
    generated_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    db_path = "my.db",
    llm_judge = llm_judge,
    response = "There are 1,523 active users.",
    result_data = {"columns": ["COUNT(*)"], "rows": [[1523]], "row_count": 1, "execution_time_ms": 2.1},
)

print(scores.overall_score)        # 0.95
print(scores.correctness_score)    # 0.88
print(scores.verdict)              # PASS
print(scores.hardness)             # "easy"
print(scores.exact_match_score)    # 1.0
print(scores.to_markdown_report()) # Markdown for PR comments
```
Failure Classification (v2.7)
Know exactly why a query failed — not just a score.
```python
from sqlas import classify_failure, FailureCategory

analysis = classify_failure(
    sql = "SELECT id FROM users LIMIT 100",
    scores = {"execution_accuracy": 1.0, "row_count_match": 0.12},
    details = {"row_count_match": {"pred_count": 100, "gold_count": 839}},
)

print(analysis.primary)    # FailureCategory.LIMIT_TRUNCATION
print(analysis.summary())  # "FAIL [limit_truncation] (score=1.000)"
print(analysis.top_hint()) # "Remove LIMIT — question asks for full results, not top-N"
print(analysis.evidence)   # {"limit_truncation": "LIMIT in SQL, 100 rows vs 839 expected"}
```
All failure categories:

| Category | Typical cause |
|---|---|
| `LIMIT_TRUNCATION` | LIMIT silently cut the result (100 vs 839 rows) |
| `WRONG_TABLE` | accounting_transactions used instead of accounting |
| `WRONG_AGGREGATION` | MAX instead of SUM, AVG instead of SUM |
| `SCALAR_MISMATCH` | Correlation or count value differs |
| `ROW_EXPLOSION` | 1:N join inflated the row count |
| `SCHEMA_HALLUCINATION` | Invented table/column names (counts, adm_count, n) |
| `FULL_TABLE_SCAN` | SELECT * with no WHERE/LIMIT |
| `TRIM_ON_NUMERIC` | TRIM() on a REAL column — invalid on Postgres |
| `UNSAFE_QUERY` | DDL/DML attempted |
| `CURRENCY_NOT_CLEANED` | Single REPLACE missed commas in $1,234 |
| `NULL_IN_AGGREGATION` | AVG/SUM without IS NOT NULL |
| `JOIN_WITHOUT_FK` | Banking joined to users with no foreign key |
| `FAITHFULNESS_DROP` | Narration not grounded in the SQL result |
Multi-gold SQL (v2.7)
When a question has multiple valid SQL formulations, evaluate against all and take the best score:
```python
from sqlas import evaluate, TestCase

# Single evaluate call
scores = evaluate(
    question = "Count active users",
    generated_sql = "SELECT COUNT(*) FROM users WHERE status = 'active'",
    gold_sql = "SELECT COUNT(*) FROM users WHERE active = 1",  # primary gold
    db_path = "my.db",
    llm_judge = llm_judge,
)

# Batch with multiple gold SQLs per question
test_case = TestCase(
    question = "Count active users",
    gold_sqls = [
        "SELECT COUNT(*) FROM users WHERE active = 1",
        "SELECT COUNT(*) FROM users WHERE status = 'active'",
        "SELECT COUNT(id) FROM users WHERE is_active = true",
    ],
)
```
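The underlying v2.7 metrics can also be called directly. A minimal sketch using the signatures listed in the What's New table above (exact_match(generated, gold) and execution_accuracy_best_of(sql, gold_sqls)); any additional execution arguments these functions may need are omitted here:

```python
from sqlas import exact_match, execution_accuracy_best_of

gold_sqls = [
    "SELECT COUNT(*) FROM users WHERE active = 1",
    "SELECT COUNT(*) FROM users WHERE status = 'active'",
]
generated = "SELECT COUNT(*) FROM users WHERE status = 'active'"

print(exact_match(generated, gold_sqls[1]))              # normalized string comparison -> 1.0
print(execution_accuracy_best_of(generated, gold_sqls))  # best score across all gold formulations
```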
Hardness Classification (v2.7)
```python
from sqlas import auto_classify_hardness

auto_classify_hardness("SELECT COUNT(*) FROM users")
# → "easy"

auto_classify_hardness("SELECT u.id, SUM(o.total) FROM users u JOIN orders o ON u.id=o.user_id GROUP BY u.id HAVING SUM(o.total) > 1000")
# → "hard"

auto_classify_hardness("WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) FROM ...) SELECT ...")
# → "extra-hard"
```
Follows BIRD benchmark criteria. Auto-set on every evaluate() call as SQLASScores.hardness.
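Because hardness is set on every result, batch outcomes are easy to slice by difficulty. A small post-processing sketch, assuming results_list holds the SQLASScores objects returned by evaluate():

```python
from collections import Counter

# Pass rate per BIRD hardness bucket; results_list is assumed to be a list of SQLASScores.
totals, passed = Counter(), Counter()
for s in results_list:
    totals[s.hardness] += 1
    passed[s.hardness] += (s.verdict == "PASS")

for bucket in ("easy", "medium", "hard", "extra-hard"):
    if totals[bucket]:
        print(f"{bucket}: {passed[bucket]}/{totals[bucket]} passed")
```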
Report Generation (v2.7)
```python
from sqlas import generate_report

# Batch markdown report — paste into PRs or CI comments
results = evaluate_batch(test_cases, llm_judge, db_path="my.db")
print(generate_report(results, questions, format="markdown"))

# JSON for artifact storage
print(generate_report(results, format="json"))

# Per-query reports
print(scores.to_json())
print(scores.to_markdown_report(question="How many users?", sql=generated_sql))
```
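In CI you will usually want the report persisted rather than printed; a minimal sketch that writes the markdown report to a file for use as a build artifact or PR comment body:

```python
from pathlib import Path

# Persist the batch report; sqlas_report.md is an arbitrary artifact path.
Path("sqlas_report.md").write_text(generate_report(results, questions, format="markdown"))
```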
LLM Judge Cache (v2.7)
Prevent re-scoring identical prompts in CI runs:
```python
from sqlas import enable_judge_cache, clear_judge_cache

enable_judge_cache()   # opt-in — identical prompts return cached result
results = evaluate_batch(...)
clear_judge_cache()    # clear between test runs
```
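If your evaluations run under pytest, one convenient way to wire this up is a session-scoped fixture (a sketch, not part of the sqlas API):

```python
import pytest

from sqlas import enable_judge_cache, clear_judge_cache

@pytest.fixture(scope="session", autouse=True)
def judge_cache():
    # Cache identical judge prompts for the whole test session, then reset.
    enable_judge_cache()
    yield
    clear_judge_cache()
```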
Three-Dimension Scoring
PASS only when all three dimensions meet their thresholds:
```python
from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email", "ssn"])

print(c.score, c.verdict)  # 0.85 PASS (threshold 0.5)
print(q.score, q.verdict)  # 0.72 PASS (threshold 0.6)
print(s.score, s.verdict)  # 0.45 FAIL (threshold 0.9 — PII detected)
```
evaluate_safety() makes zero LLM calls — pure regex + sqlglot AST.
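The composite verdict uses AND logic (see the v2.2.0 changelog entry): the overall result passes only if all three dimension verdicts pass. Written out by hand, the equivalent check is simply:

```python
# PASS only when correctness, quality, and safety all pass their thresholds.
overall = "PASS" if all(v == "PASS" for v in (c.verdict, q.verdict, s.verdict)) else "FAIL"
print(overall)
```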
Guardrail Pipeline
```python
from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email", "ssn", "password"])

r = pipeline.check_input("List every user's SSN")   # blocks malicious NL intent
r = pipeline.check_sql(generated_sql)               # blocks injection/PII SQL
r = pipeline.check_output(response, result_data)    # blocks PII in response
```
Injection patterns detected: UNION ALL SELECT, EXCEPT SELECT, INTERSECT SELECT, stacked mutations, tautologies, time-based injection, file write/read, WAITFOR DELAY.
NL prompt injection detected: ignore/override/discard instructions, jailbreak, bypass guardrails, pretend unrestricted.
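One way to wire the three checks into an agent's request path is sketched below. The shape of the objects returned by check_input / check_sql / check_output is not documented here, so the blocked() helper and the run_and_narrate() step are hypothetical placeholders to adapt to the real return types:

```python
def blocked(check_result) -> bool:
    # Hypothetical: inspect whatever pass/violation field the pipeline actually returns.
    ...

def answer(question: str) -> str:
    if blocked(pipeline.check_input(question)):
        return "Request refused by input guardrail."
    sql = my_agent(question)                        # your Text-to-SQL agent
    if blocked(pipeline.check_sql(sql)):
        return "Generated SQL rejected by guardrail."
    response, result_data = run_and_narrate(sql)    # hypothetical execute + narrate step
    if blocked(pipeline.check_output(response, result_data)):
        return "Response withheld: possible PII leakage."
    return response
```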
Spider / BIRD Benchmark
```python
from sqlas.benchmarks import run_spider_benchmark

results = run_spider_benchmark(
    agent_fn = my_agent,
    llm_judge = llm_judge,
    spider_dir = "./spider",
    n_samples = 50,      # stratified by difficulty → ~$0.25
    mlflow_run = True,
)

print(results["summary"]["overall_score"])
```
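agent_fn is your own Text-to-SQL agent. Its exact expected signature isn't shown above, so the sketch below assumes it receives the natural-language question (plus, possibly, schema text) and returns a SQL string; adapt it to whatever run_spider_benchmark actually passes:

```python
def my_agent(question: str, schema: str = "") -> str:
    # Hypothetical one-shot agent: prompt an LLM with the schema and question, return SQL.
    prompt = f"Schema:\n{schema}\n\nWrite one SQLite query that answers: {question}\nSQL:"
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
```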
Prompt Versioning
```python
from sqlas import PromptRegistry

registry = PromptRegistry()
registry.register("You are a SQL analyst...", version_id="v1")
registry.record("v1", scores)

status = registry.detect_regression("v1", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(hint["hint"])  # actionable prompt fix suggestion
```
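A natural extension is A/B-testing a revised prompt with the same regression check; a sketch reusing only the calls shown above, where new_scores is assumed to be a list of SQLASScores produced by runs that used the v2 prompt:

```python
# Register the candidate prompt and accumulate scores for it.
registry.register("You are a senior SQL analyst. Always qualify column names...", version_id="v2")
for s in new_scores:          # SQLASScores from evaluate() runs using the v2 prompt
    registry.record("v2", s)

print(registry.detect_regression("v2", window=50, threshold=0.05))
```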
Observability Integrations
```python
from sqlas.integrations import log_all

log_all(results,
    mlflow_experiment = "sql-agent-v2",
    wandb_project = "sql-evals",
    langsmith_project = "my-sql-agent",
)
```
Run a Test Suite
```python
from sqlas import run_suite, TestCase, WEIGHTS_V4, build_schema_info

tables, columns = build_schema_info(db_path="my.db")

results = run_suite(
    test_cases = test_cases,      # list[TestCase] — see the sketch after this block
    agent_fn = my_agent,
    llm_judge = llm_judge,
    execute_fn = execute_fn,
    valid_tables = tables,
    valid_columns = columns,
    weights = WEIGHTS_V4,
    pass_threshold = 0.6,
)

print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])
```
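For completeness, here is one way the inputs above might be built. The TestCase fields mirror the multi-gold example earlier; execute_fn is a plain SQLite runner whose return shape follows the result_data dict from the Quick Start (the exact contract run_suite expects is an assumption):

```python
import sqlite3

from sqlas import TestCase

test_cases = [
    TestCase(
        question="Count active users",
        gold_sqls=["SELECT COUNT(*) FROM users WHERE active = 1"],
    ),
    TestCase(
        question="Top 5 customers by total order value",
        gold_sqls=[
            "SELECT customer_id, SUM(total) AS t FROM orders GROUP BY customer_id ORDER BY t DESC LIMIT 5"
        ],
    ),
]

def execute_fn(sql: str) -> dict:
    # Minimal SQLite executor; mirrors the result_data shape used in the Quick Start.
    with sqlite3.connect("my.db") as conn:
        cur = conn.execute(sql)
        rows = cur.fetchall()
        return {
            "columns": [c[0] for c in cur.description],
            "rows": rows,
            "row_count": len(rows),
        }
```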
Metrics Overview
| Dimension | Key Metrics |
|---|---|
| Correctness | Execution accuracy, exact match, multi-gold SQL, semantic equivalence, result set similarity |
| SQL Quality | SQL quality (LLM), schema compliance, complexity match, data scan efficiency |
| Context (RAGAS) | Context precision, recall, entity recall, noise robustness |
| Response | Faithfulness, answer relevance, completeness, fluency |
| Agentic | Steps efficiency, schema grounding, planning quality, tool use accuracy, plan compliance, first attempt success |
| Safety | Read-only compliance, SQL injection, prompt injection, PII access, PII leakage |
| Production | Execution success, VES efficiency, row explosion detection, empty result, result coverage |
| Cache | Cache hit score, tokens saved, few-shot examples used |
| Visualization | Chart spec validity, data alignment, LLM chart validation |
Weight Profiles
| Profile | Metrics | Best for |
|---|---|---|
| `WEIGHTS` | 15 | Standard NL→SQL pipeline |
| `WEIGHTS_V2` | 20 | + RAGAS context quality |
| `WEIGHTS_V3` | 30 | + Guardrails + visualization |
| `WEIGHTS_V4` | 28 | + Agentic quality ← ReAct agents |
Changelog
v2.7.0
- `classify_failure()` + `FailureCategory` enum — named failure classification with actionable hints
- `auto_classify_hardness()` — BIRD-aligned easy/medium/hard/extra-hard (auto-set on every eval)
- `exact_match()` + `SQLASScores.exact_match_score`
- `execution_accuracy_best_of()` + `TestCase.gold_sqls` — multi-gold SQL evaluation
- `generate_report()` — batch markdown/JSON report; `to_json()`, `to_markdown_report()` on `SQLASScores`
- `enable_judge_cache()` / `clear_judge_cache()` — opt-in LLM judge caching
- LLM retry with exponential backoff (3×) on all 13 LLM judge call sites
- Batch eval crash isolation — one failure no longer kills the batch
- Weight normalization — auto-normalize to 1.0 instead of silently distorting
- `execute_fn` timeout (30s) with thread-safe SQLite fallback
- Non-deterministic query detection (NOW, RANDOM, CURRENT_TIMESTAMP)
- Safety: UNION ALL SELECT, EXCEPT, WAITFOR DELAY, file injection, NL synonyms
- Division-by-zero guards in all context metrics
v2.6.0
- Spider/BIRD benchmark (`run_spider_benchmark`, `run_bird_benchmark`)
- MLflow, W&B, LangSmith integrations (`sqlas.integrations`)
- Streamlit UI (`python -m sqlas ui`)
- React evaluation dashboard (`sqlas-ui/`)
v2.5.0
- `plan_compliance()` — measures create_plan enforcement before execute_sql
- `first_attempt_success()` — measures SQL retry rate
v2.4.0
- `PromptRegistry` — prompt versioning, regression detection, improvement hints
- `schema_retrieval_quality()` — precision/recall/F1 for schema index
v2.3.0
- `GuardrailPipeline` — 3-stage safety: input → SQL → output (zero LLM cost)
- `FeedbackStore` — verified gold SQL from user thumbs-up
v2.2.0
- Three-dimension scoring: `correctness_score`, `quality_score`, `safety_composite_score`
- `verdict` — AND logic: PASS only when all three pass thresholds
License
MIT — thepradip · pypi.org/project/sqlas