SQLAS

agent
Security Audit
Warning
Health — Warning
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 6 GitHub stars
Code — Warning
  • Network request — Outbound network request in backend/main.py
Permissions — Passed
  • Permissions — No dangerous permissions requested

No AI report for this listing yet.

SUMMARY

Production-grade evaluation framework for Text-to-SQL and SQL AI agents. 50+ metrics. Any LLM. pip install sqlas.

README.md

SQLAS — SQL Agent Scoring Framework

A RAGAS-equivalent evaluation library for Text-to-SQL pipelines and agentic SQL agents.

PyPI · Python · Tests · License: MIT

Evaluate SQL agents across 50+ metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, guardrails, and cache ROI. Aligned with Spider, BIRD, RAGAS, and MLflow standards.

Author: thepradip


Install

pip install sqlas                # core
pip install "sqlas[mlflow]"      # + MLflow
pip install "sqlas[ui]"          # + Streamlit UI
pip install "sqlas[all]"         # everything

What's New in v2.7.0

  • Multi-gold SQL — execution_accuracy_best_of(sql, gold_sqls) evaluates against all valid gold queries and takes the best score; TestCase.gold_sqls: list[str]
  • Hardness classification — auto_classify_hardness(sql) returns easy/medium/hard/extra-hard per BIRD criteria; auto-set on every evaluate() call
  • Exact match metric — exact_match(generated, gold), a normalized string comparison; exposed as SQLASScores.exact_match_score (see the sketch after this list)
  • Failure classification — classify_failure(sql, scores, details) returns a named FailureCategory with a top_hint() actionable fix
  • Batch crash isolation — one failing test case no longer kills the entire batch
  • LLM retry with backoff — _retry_llm_judge() retries 3× (1s → 2s → 4s) on all 13 LLM judge call sites
  • Weight normalization — custom weights are auto-normalized to 1.0 instead of silently distorting scores
  • LLM judge cache — enable_judge_cache(), an opt-in in-memory cache that prevents re-scoring identical pairs in CI
  • Report generation — generate_report(scores_list, format="markdown"|"json"), to_json(), to_markdown_report()
  • Non-deterministic detection — NOW(), RANDOM(), and CURRENT_TIMESTAMP trigger nondeterministic_warning in details
  • execute_fn timeout — 30s wall-clock timeout with a thread-safe SQLite fallback
  • Safety patterns — UNION ALL SELECT, EXCEPT SELECT, WAITFOR DELAY, file injection, NL prompt-injection synonyms
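
The new exact_match metric can also be called directly. A minimal sketch, assuming the normalization covers case and whitespace (the list above only promises a "normalized string comparison"):

from sqlas import exact_match

# Formatting-only differences should normalize away (an assumption here);
# genuinely different queries should score 0.0.
print(exact_match("select count(*) from users", "SELECT COUNT(*) FROM users"))  # expected: 1.0
print(exact_match("SELECT id FROM users", "SELECT COUNT(*) FROM users"))        # expected: 0.0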

Quick Start

from openai import OpenAI
from sqlas import evaluate

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

scores = evaluate(
    question      = "How many active users?",
    generated_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql      = "SELECT COUNT(*) FROM users WHERE active = 1",
    db_path       = "my.db",
    llm_judge     = llm_judge,
    response      = "There are 1,523 active users.",
    result_data   = {"columns":["COUNT(*)"],"rows":[[1523]],"row_count":1,"execution_time_ms":2.1},
)

print(scores.overall_score)           # 0.95
print(scores.correctness_score)       # 0.88
print(scores.verdict)                 # PASS
print(scores.hardness)                # "easy"
print(scores.exact_match_score)       # 1.0
print(scores.to_markdown_report())    # Markdown for PR comments

Failure Classification (v2.7)

Know exactly why a query failed — not just a score.

from sqlas import classify_failure, FailureCategory

analysis = classify_failure(
    sql     = "SELECT id FROM users LIMIT 100",
    scores  = {"execution_accuracy": 1.0, "row_count_match": 0.12},
    details = {"row_count_match": {"pred_count": 100, "gold_count": 839}},
)

print(analysis.primary)        # FailureCategory.LIMIT_TRUNCATION
print(analysis.summary())      # "FAIL [limit_truncation] (score=1.000)"
print(analysis.top_hint())     # "Remove LIMIT — question asks for full results, not top-N"
print(analysis.evidence)       # {"limit_truncation": "LIMIT in SQL, 100 rows vs 839 expected"}

All failure categories:

  • LIMIT_TRUNCATION — LIMIT silently cut the result (100 vs 839 rows)
  • WRONG_TABLE — accounting_transactions used instead of accounting
  • WRONG_AGGREGATION — MAX instead of SUM, AVG instead of SUM
  • SCALAR_MISMATCH — correlation or count value differs
  • ROW_EXPLOSION — 1:N join inflated the row count
  • SCHEMA_HALLUCINATION — invented table/column names (counts, adm_count, n)
  • FULL_TABLE_SCAN — SELECT * with no WHERE/LIMIT
  • TRIM_ON_NUMERIC — TRIM() on a REAL column, invalid on Postgres
  • UNSAFE_QUERY — DDL/DML attempted
  • CURRENCY_NOT_CLEANED — a single REPLACE missed the commas in $1,234
  • NULL_IN_AGGREGATION — AVG/SUM without IS NOT NULL
  • JOIN_WITHOUT_FK — banking joined to users with no foreign key
  • FAITHFULNESS_DROP — narration not grounded in the SQL result
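
Across a batch run, these named categories make triage straightforward. A hedged sketch, where analyses stands in for a hypothetical list of classify_failure results collected per test case:

from collections import Counter

# Count the most common primary failure categories across a batch.
failure_counts = Counter(a.primary for a in analyses)
for category, n in failure_counts.most_common(3):
    print(category.name, n)   # e.g. LIMIT_TRUNCATION 7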

Multi-gold SQL (v2.7)

When a question has multiple valid SQL formulations, evaluate against all and take the best score:

from sqlas import evaluate, TestCase

# Single evaluate call
scores = evaluate(
    question      = "Count active users",
    generated_sql = "SELECT COUNT(*) FROM users WHERE status = 'active'",
    gold_sql      = "SELECT COUNT(*) FROM users WHERE active = 1",   # primary gold
    db_path       = "my.db",
    llm_judge     = llm_judge,
)

# Batch with multiple gold SQLs per question
test_case = TestCase(
    question  = "Count active users",
    gold_sqls = [
        "SELECT COUNT(*) FROM users WHERE active = 1",
        "SELECT COUNT(*) FROM users WHERE status = 'active'",
        "SELECT COUNT(id) FROM users WHERE is_active = true",
    ],
)
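
The metric behind this can also be called directly with the two-argument form listed in the feature table; a real call presumably needs an execution context as well (for example db_path or execute_fn), which this hedged sketch omits:

from sqlas import execution_accuracy_best_of

# Scores the generated SQL against every gold formulation and keeps the best.
best = execution_accuracy_best_of(
    "SELECT COUNT(*) FROM users WHERE status = 'active'",
    test_case.gold_sqls,
)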

Hardness Classification (v2.7)

from sqlas import auto_classify_hardness

auto_classify_hardness("SELECT COUNT(*) FROM users")
# → "easy"

auto_classify_hardness("SELECT u.id, SUM(o.total) FROM users u JOIN orders o ON u.id=o.user_id GROUP BY u.id HAVING SUM(o.total) > 1000")
# → "hard"

auto_classify_hardness("WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) FROM ...) SELECT ...")
# → "extra-hard"

Follows BIRD benchmark criteria. Auto-set on every evaluate() call as SQLASScores.hardness.


Report Generation (v2.7)

from sqlas import generate_report

# Batch markdown report — paste into PRs or CI comments
results = evaluate_batch(test_cases, llm_judge, db_path="my.db")
print(generate_report(results, questions, format="markdown"))

# JSON for artifact storage
print(generate_report(results, format="json"))

# Per-query reports
print(scores.to_json())
print(scores.to_markdown_report(question="How many users?", sql=generated_sql))

LLM Judge Cache (v2.7)

Prevent re-scoring identical prompts in CI runs:

from sqlas import enable_judge_cache, clear_judge_cache

enable_judge_cache()          # opt-in — identical prompts return cached result
results = evaluate_batch(...)
clear_judge_cache()           # clear between test runs

Three-Dimension Scoring

PASS only when all three dimensions meet their thresholds:

from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email","ssn"])

print(c.score, c.verdict)   # 0.85  PASS   (threshold 0.5)
print(q.score, q.verdict)   # 0.72  PASS   (threshold 0.6)
print(s.score, s.verdict)   # 0.45  FAIL   (threshold 0.9 — PII detected)

evaluate_safety() makes zero LLM calls — pure regex + sqlglot AST.
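
The overall verdict is the conjunction of these three per-dimension verdicts (see v2.2.0 in the changelog). A minimal sketch of that AND logic, assuming each result exposes the string verdict shown above:

# PASS only if correctness, quality, and safety all pass their thresholds.
verdict = "PASS" if all(dim.verdict == "PASS" for dim in (c, q, s)) else "FAIL"
print(verdict)   # FAIL: safety scored 0.45 against its 0.9 threshold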


Guardrail Pipeline

from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email","ssn","password"])

r = pipeline.check_input("List every user's SSN")    # blocks malicious NL intent
r = pipeline.check_sql(generated_sql)                # blocks injection/PII SQL
r = pipeline.check_output(response, result_data)     # blocks PII in response

Injection patterns detected: UNION ALL SELECT, EXCEPT SELECT, INTERSECT SELECT, stacked mutations, tautologies, time-based injection, file write/read, WAITFOR DELAY.

NL prompt injection detected: ignore/override/discard instructions, jailbreak, bypass guardrails, pretend unrestricted.


Spider / BIRD Benchmark

from sqlas.benchmarks import run_spider_benchmark

results = run_spider_benchmark(
    agent_fn   = my_agent,
    llm_judge  = llm_judge,
    spider_dir = "./spider",
    n_samples  = 50,          # stratified by difficulty → ~$0.25
    mlflow_run = True,
)
print(results["summary"]["overall_score"])

Prompt Versioning

from sqlas import PromptRegistry

registry = PromptRegistry()
registry.register("You are a SQL analyst...", version_id="v1")
registry.record("v1", scores)

status = registry.detect_regression("v1", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(hint["hint"])   # actionable prompt fix suggestion

Observability Integrations

from sqlas.integrations import log_all

log_all(results,
    mlflow_experiment = "sql-agent-v2",
    wandb_project     = "sql-evals",
    langsmith_project = "my-sql-agent",
)

Run a Test Suite

from sqlas import run_suite, TestCase, WEIGHTS_V4, build_schema_info

tables, columns = build_schema_info(db_path="my.db")

results = run_suite(
    test_cases     = test_cases,
    agent_fn       = my_agent,
    llm_judge      = llm_judge,
    execute_fn     = execute_fn,
    valid_tables   = tables,
    valid_columns  = columns,
    weights        = WEIGHTS_V4,
    pass_threshold = 0.6,
)
print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])
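
run_suite expects callables for agent_fn and execute_fn, whose exact contracts are not spelled out in this README. A hedged sketch, assuming question-in/SQL-out for the agent and plain rows out for the executor (the return shape sqlas actually expects is an assumption):

import sqlite3

def my_agent(question: str) -> str:
    # Hypothetical: call your Text-to-SQL model and return a SQL string.
    return text_to_sql_model(question)

def execute_fn(sql: str):
    # Hypothetical: run against the same database the gold SQL targets.
    with sqlite3.connect("my.db") as conn:
        return conn.execute(sql).fetchall()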

Metrics Overview

  • Correctness — Execution accuracy, exact match, multi-gold SQL, semantic equivalence, result set similarity
  • SQL Quality — SQL quality (LLM), schema compliance, complexity match, data scan efficiency
  • Context (RAGAS) — Context precision, recall, entity recall, noise robustness
  • Response — Faithfulness, answer relevance, completeness, fluency
  • Agentic — Steps efficiency, schema grounding, planning quality, tool use accuracy, plan compliance, first attempt success
  • Safety — Read-only compliance, SQL injection, prompt injection, PII access, PII leakage
  • Production — Execution success, VES efficiency, row explosion detection, empty result, result coverage
  • Cache — Cache hit score, tokens saved, few-shot examples used
  • Visualization — Chart spec validity, data alignment, LLM chart validation

Weight Profiles

  • WEIGHTS — 15 metrics; standard NL→SQL pipeline
  • WEIGHTS_V2 — 20 metrics; adds RAGAS context quality
  • WEIGHTS_V3 — 30 metrics; adds guardrails + visualization
  • WEIGHTS_V4 — 28 metrics; adds agentic quality (best for ReAct agents)
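
Since v2.7, custom weights are renormalized to sum to 1.0 rather than silently distorting scores. A minimal sketch, assuming the profiles are plain dicts keyed by metric name (the execution_accuracy key is borrowed from the failure-classification example above):

from sqlas import WEIGHTS_V4

custom = dict(WEIGHTS_V4)
custom["execution_accuracy"] = 0.5   # overweight correctness
# Pass as run_suite(..., weights=custom); sqlas renormalizes the weights
# to sum to 1.0 instead of letting them distort the overall score.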

Changelog

v2.7.0

  • classify_failure() + FailureCategory enum — named failure classification with actionable hints
  • auto_classify_hardness() — BIRD-aligned easy/medium/hard/extra-hard (auto-set on every eval)
  • exact_match() + SQLASScores.exact_match_score
  • execution_accuracy_best_of() + TestCase.gold_sqls — multi-gold SQL evaluation
  • generate_report() — batch markdown/JSON report; to_json(), to_markdown_report() on SQLASScores
  • enable_judge_cache() / clear_judge_cache() — opt-in LLM judge caching
  • LLM retry with exponential backoff (3×) on all 13 LLM judge call sites
  • Batch eval crash isolation — one failure no longer kills the batch
  • Weight normalization — auto-normalize to 1.0 instead of silently distorting
  • execute_fn timeout (30s) with thread-safe SQLite fallback
  • Non-deterministic query detection (NOW, RANDOM, CURRENT_TIMESTAMP)
  • Safety: UNION ALL SELECT, EXCEPT, WAITFOR DELAY, file injection, NL synonyms
  • Division-by-zero guards in all context metrics

v2.6.0

  • Spider/BIRD benchmark (run_spider_benchmark, run_bird_benchmark)
  • MLflow, W&B, LangSmith integrations (sqlas.integrations)
  • Streamlit UI (python -m sqlas ui)
  • React evaluation dashboard (sqlas-ui/)

v2.5.0

  • plan_compliance() — measures create_plan enforcement before execute_sql
  • first_attempt_success() — measures SQL retry rate

v2.4.0

  • PromptRegistry — prompt versioning, regression detection, improvement hints
  • schema_retrieval_quality() — precision/recall/F1 for schema index

v2.3.0

  • GuardrailPipeline — 3-stage safety: input → SQL → output (zero LLM cost)
  • FeedbackStore — verified gold SQL from user thumbs-up

v2.2.0

  • Three-dimension scoring: correctness_score, quality_score, safety_composite_score
  • verdict — AND logic: PASS only when all three pass thresholds

License

MIT — thepradip · pypi.org/project/sqlas
