AI Gene Review
AI-assisted tool for reviewing and curating gene annotations with community feedback integration. This project provides a structured workflow for validating existing Gene Ontology (GO) annotations using AI-driven analysis combined with literature research, bioinformatics evidence, and crowdsourced expert feedback.
Overview
The AI Gene Review tool helps researchers and curators:
- Review existing GO annotations using strict, defined criteria
- Synthesize high-quality annotations from multiple evidence sources
- Fetch and organize gene data from UniProt and GOA databases
- Validate annotation files against LinkML schemas
- Manage references and supporting literature
- Collect community feedback through integrated voting and evaluation systems
Quick Start
Installation
- Install uv for dependency management
- Clone the repository and install dependencies:
git clone https://github.com/cmungall/ai-gene-review.git
cd ai-gene-review
uv sync --group dev
Basic Usage
Fetch gene data:
uv run ai-gene-review fetch-gene human TP53
Validate a gene review file:
uv run ai-gene-review validate genes/human/TP53/TP53-ai-review.yaml
Fetch publications for a gene:
uv run ai-gene-review fetch-gene-pmids human TP53
Generate statistics report:
just stats # Generate HTML report
just stats-open # Generate and open in browser
Workflow Overview
- Fetch Gene Data: Download UniProt records and GO annotations
- Literature Research: Gather supporting publications and evidence
- Create Review: Structure annotations using the YAML schema
- Validate: Check against LinkML schema and best practices
- Generate HTML: Render interactive web pages with voting capabilities
- Collect Feedback: Community voting and expert evaluation forms
- Iterate: Refine annotations based on validation results and community input
Key Features
- 🧬 Multi-organism support: Human, mouse, worm, and other model organisms
- 📚 Literature integration: Automatic PubMed citation fetching and caching
- ✅ Schema validation: LinkML-based validation for consistency
- 🛡️ Anti-hallucination validation: ID/label tuple checksums prevent AI fabrication of terms
- 🔄 Batch processing: Handle multiple genes efficiently
- 📊 Structured reviews: YAML-based gene annotation reviews
- 🔍 Evidence tracking: Detailed provenance and supporting text
- 🗳️ Community voting: Thumbs up/down feedback on AI decisions
- 📝 Expert evaluation: Detailed feedback forms for comprehensive gene review assessment
- 🎨 Interactive web interface: Rich HTML rendering with modern UI
Resources
Web Applications & Documentation
- Main Project Site: https://ai4curation.io/ai-gene-review (coming soon)
- Interactive Web App: Browse Gene Reviews - Search and explore gene annotation reviews
- Statistics Dashboard: Summary Statistics - Analytics and review metrics
- Evaluation Form: https://go.lbl.gov/gene-eval - Detailed expert feedback form
- Project Slides: Overview Presentation
Documentation Pages
- Voting System Guide: Learn how to provide feedback on AI curation decisions
- Evaluation Form Guide: Comprehensive guide for detailed gene review evaluation
- GitHub Issues: Report bugs and feature requests
Gene Review Structure
Each gene review follows a structured YAML format containing:
- Gene metadata: UniProt ID, gene symbol, taxon information
- Description: Comprehensive summary of gene function
- References: Literature and bioinformatics sources
- Existing annotations: Review of current GO annotations with actions (ACCEPT, MODIFY, REMOVE, etc.)
- Core functions: Curated essential gene functions
Example structure:
id: Q9BRQ4
gene_symbol: CFAP300
taxon:
  id: NCBITaxon:9606
  label: Homo sapiens
description: >-
  CFAP300 is a cilium- and flagellum-specific protein...
existing_annotations:
  - term:
      id: GO:0005515
      label: protein binding
    action: MODIFY
    reason: "While evidence is strong, 'protein binding' is uninformative..."
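A review file with this shape can be given a quick structural check in a few lines of Python. This is a minimal sketch operating on an already-parsed dict; the project's real validation goes through the LinkML schema, which is far stricter, and the action list here is an assumed subset.

```python
# Minimal structural check for a parsed gene review (sketch only;
# the project validates against a LinkML schema, which is stricter).
REQUIRED_KEYS = {"id", "gene_symbol", "taxon", "description", "existing_annotations"}
VALID_ACTIONS = {"ACCEPT", "MODIFY", "REMOVE"}  # assumed subset; see the schema for the full list

def check_review(review: dict) -> list[str]:
    """Return a list of problems; an empty list means the review passed this check."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS - review.keys()]
    for ann in review.get("existing_annotations", []):
        term = ann.get("term", {})
        if not {"id", "label"} <= term.keys():
            problems.append(f"annotation missing id/label: {ann}")
        if ann.get("action") not in VALID_ACTIONS:
            problems.append(f"unknown action: {ann.get('action')}")
    return problems

review = {
    "id": "Q9BRQ4",
    "gene_symbol": "CFAP300",
    "taxon": {"id": "NCBITaxon:9606", "label": "Homo sapiens"},
    "description": "CFAP300 is a cilium- and flagellum-specific protein...",
    "existing_annotations": [
        {"term": {"id": "GO:0005515", "label": "protein binding"},
         "action": "MODIFY",
         "reason": "'protein binding' is uninformative"},
    ],
}
assert check_review(review) == []
```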
Example Data
The repository includes example gene reviews for:
- Human: BRCA1, CFAP300, RBFOX3, TP53
- Mouse: Various examples
- Worm: lrx-1
Browse the genes/ directory to see complete examples.
Community Feedback System
The AI Gene Review project includes a comprehensive feedback system to improve AI curation through community input:
🗳️ Quick Voting System
Every gene review page includes thumbs up/down voting on:
- Individual annotation decisions (ACCEPT, REMOVE, MODIFY actions)
- Gene descriptions and summaries
- Core function definitions
- Suggested questions for experts
- Suggested experiments
- Documentation sections (Deep Research, Notes, Bioinformatics Results)
- Proposed new GO terms
- Pathway visualizations
Features:
- Anonymous voting with session-based tracking
- Instant visual feedback with vote persistence
- Rate limiting to prevent abuse
- Vote data collection via Google Apps Script
📝 Detailed Evaluation Form
For comprehensive feedback, use the evaluation form at https://go.lbl.gov/gene-eval:
- Structured assessment of annotation quality
- Pre-filled gene information when accessed from gene pages
- Multiple choice and free-text questions
- Expert-level feedback for improving AI curation
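Pre-filling gene information when linking from a gene page amounts to carrying context in the query string. The sketch below shows the idea; the query-parameter names (`organism`, `gene`) are illustrative assumptions, not the form's documented API.

```python
from urllib.parse import urlencode

FORM_URL = "https://go.lbl.gov/gene-eval"

def prefilled_form_url(organism: str, gene_symbol: str) -> str:
    """Build a link to the evaluation form carrying gene context.

    The parameter names here are illustrative assumptions, not the
    form's documented interface.
    """
    return f"{FORM_URL}?{urlencode({'organism': organism, 'gene': gene_symbol})}"

print(prefilled_form_url("human", "TP53"))
# → https://go.lbl.gov/gene-eval?organism=human&gene=TP53
```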
📹 Video Tutorial: Watch a step-by-step guide on performing evaluations (YouTube)
🎯 Feedback Integration
The feedback system enables:
- Data-driven improvements to AI curation algorithms
- Identification of problematic patterns in automated annotations
- Community validation of AI decisions
- Prioritization of genes needing expert attention
How to Provide Feedback
- Quick feedback: Use 👍/👎 buttons on any gene review page
- Detailed feedback: Click "📝 Provide Detailed Feedback" button or visit the evaluation form directly
- Technical feedback: Submit issues and suggestions via GitHub Issues
Case Studies
PedH (Pseudomonas putida KT2440) - Lanthanide-Dependent Alcohol Dehydrogenase
The review of pedH revealed several important curation insights:
Key Discoveries
Lanthanide vs Calcium Dependency: PedH was incorrectly annotated with "calcium ion binding" (GO:0005509) when it actually requires lanthanide ions (La³⁺, Ce³⁺, Pr³⁺, Nd³⁺, Sm³⁺) for activity. This highlights the importance of reviewing automated annotations based on sequence similarity.
Cellular Localization Precision: Bioinformatics analysis confirmed PedH is a soluble periplasmic enzyme, not membrane-associated:
- Signal peptide (aa 1-25) directs export, then is cleaved
- No transmembrane regions in mature protein
- Functions throughout periplasmic space, not just at membrane boundaries
- Led to choosing GO:0042597 (periplasmic space) over GO:0030288 (outer membrane-bounded periplasmic space)
Dual Functional Roles: PedH serves both as:
- Metabolic enzyme: Oxidizes alcohols in 2-phenylethanol degradation pathway
- Regulatory sensor: Part of lanthanide-sensing system controlling gene expression via PedS2/PedR2 two-component system
Missing GO Terms Identified: The review revealed gaps in GO:
- No term for "lanthanide ion binding" (distinct from transition metal binding)
- No term for "lanthanide-dependent alcohol dehydrogenase activity"
Lessons for Curation
- Verify metal cofactors carefully - Don't assume calcium when other metals are possible
- Consider protein mobility - Soluble vs membrane-associated matters for localization terms
- Look for regulatory functions - Enzymes may have sensory/regulatory roles beyond catalysis
- Use bioinformatics to validate - Signal peptide and TM predictions can clarify localization
Anti-Hallucination Validation
The AI Gene Review system implements a robust anti-hallucination validation mechanism using ID/label tuple checksums to prevent AI systems from fabricating or misusing ontological terms.
How It Works
Every ontology term in the system requires both an id (semantic identifier) and label (human-readable name):
term:
  id: GO:0005515          # Ontology identifier
  label: protein binding  # Canonical label
Validation Process
The TermValidator performs multi-layer validation:
- Format Validation: Ensures IDs follow proper CURIE patterns (PREFIX:NUMBER)
- Existence Validation: Verifies terms exist in authoritative ontologies via OAK/OLS APIs
- Label Matching: Cross-references provided labels against canonical ontology labels
- Branch Validation: Ensures GO terms are in correct ontological branches (MF/BP/CC)
- Obsolescence Checking: Flags outdated terms
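The first layers of this pipeline (format and label checks) can be sketched in a few lines of Python. The existence and branch checks require live OAK/OLS lookups, which are stubbed here with a tiny in-memory table; this is not the project's `TermValidator`, just an illustration of the dual ID/label verification.

```python
import re

CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]*:\d+$")  # PREFIX:NUMBER

# Stand-in for a live ontology lookup (the real validator queries OAK/OLS).
CANONICAL_LABELS = {
    "GO:0005515": "protein binding",
    "GO:0042597": "periplasmic space",
}

def validate_term(term_id: str, label: str) -> list[str]:
    """Return validation errors for an (id, label) tuple; empty means valid."""
    errors = []
    if not CURIE_PATTERN.match(term_id):
        errors.append(f"malformed CURIE: {term_id}")
    elif term_id not in CANONICAL_LABELS:
        errors.append(f"unknown term (possible fabrication): {term_id}")
    elif CANONICAL_LABELS[term_id] != label:
        errors.append(f"label mismatch: expected {CANONICAL_LABELS[term_id]!r}, got {label!r}")
    return errors

assert validate_term("GO:0005515", "protein binding") == []
assert validate_term("GO:0005515", "DNA binding")       # wrong label → caught
assert validate_term("GO:9999999", "made up function")  # non-existent → caught
```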
Why This Prevents Hallucination
✅ Dual Verification: Both ID and label must be correct and consistent
✅ External Truth Source: Validates against authoritative ontologies (GO, HP, MONDO, etc.)
✅ Real-time Checking: Uses live API calls to catch fabricated terms
✅ Semantic Consistency: Ensures terms make sense in their context
Examples
# ❌ This would be caught as invalid
term:
  id: GO:0005515
  label: "DNA binding"        # Wrong label for GO:0005515

# ✅ This passes validation
term:
  id: GO:0005515
  label: "protein binding"    # Correct canonical label

# ❌ This would be flagged as fabricated
term:
  id: GO:9999999
  label: "made up function"   # Non-existent term
Supported Ontologies
The validator supports 10+ major ontologies:
- GO: Gene Ontology (molecular functions, biological processes, cellular components)
- HP: Human Phenotype Ontology
- MONDO: Mondo Disease Ontology
- CL: Cell Ontology
- UBERON: Uberon Anatomy Ontology
- CHEBI: Chemical Entities of Biological Interest
- PR: Protein Ontology
- SO: Sequence Ontology
- PATO: Phenotype And Trait Ontology
- NCBITaxon: NCBI Taxonomy
This validation system represents a novel approach to preventing ontological hallucination in AI curation workflows and could serve as a model for other AI applications working with structured biological knowledge.
Annotation Rule Review
In addition to reviewing individual gene annotations, the AI Gene Review system supports systematic review of automated annotation rules used by UniProt (ARBA and UniRule systems). These rules generate millions of GO annotations across protein databases, making their quality critical for downstream research.
What are ARBA Rules?
ARBA (Association-Rule-Based Annotator) rules use combinations of:
- InterPro domains: Protein family signatures
- PANTHER families: Evolutionary classifications
- CATH FunFams: Functional families based on structure
- Taxonomic restrictions: Organism-specific constraints
When a protein matches all conditions in a rule, it receives the predicted GO annotation.
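Conceptually, rule application is an all-conditions-must-match check. The sketch below illustrates that logic with toy data structures; ARBA's actual rule representation in UniProt is richer, and every identifier here is a placeholder, not a real signature or GO term.

```python
from dataclasses import dataclass

@dataclass
class ArbaRule:
    """Toy model of an ARBA rule: all conditions must hold for it to fire."""
    rule_id: str
    required_signatures: set  # e.g. InterPro / PANTHER / CATH-FunFam hits
    taxon_scope: set          # lineage names the protein's organism must include
    predicted_go: str

@dataclass
class Protein:
    accession: str
    signatures: set
    lineage: set  # full taxonomic lineage of the source organism

def rule_fires(rule: ArbaRule, protein: Protein) -> bool:
    return (rule.required_signatures <= protein.signatures
            and bool(rule.taxon_scope & protein.lineage))

# Placeholder identifiers throughout; "GO:0000000" is not a real term.
rule = ArbaRule("ARBA-EXAMPLE", {"IPR999999"}, {"Eukaryota"}, "GO:0000000")
fungal = Protein("EXAMPLE_FUNGAL", {"IPR999999"}, {"Eukaryota", "Fungi"})

# An over-broad Eukaryota scope lets the rule fire on a fungal protein:
assert rule_fires(rule, fungal)
# Restricting the scope (e.g. to Mammalia) removes the false positive:
rule.taxon_scope = {"Mammalia"}
assert not rule_fires(rule, fungal)
```

This is exactly the failure mode described in the false-positive case study below: a correct signature match combined with a taxonomic scope that is too permissive.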
Why Review Rules?
Automated rules can produce systematic errors affecting thousands of proteins:
- Taxonomic over-annotation: Mammalian functions applied to fungi/plants
- GO term breadth: Using vague parent terms instead of specific functions
- Domain promiscuity: Structural domains with multiple functional contexts
- Catabolic/biosynthetic confusion: Grouping enzymes with opposite activities
Rule Review Workflow
# 1. Fetch rule data
just rules-fetch ARBA00089174
# 2. Run deep research (multiple providers)
just rules-deep-research-perplexity ARBA00089174
just rules-deep-research-falcon ARBA00089174
# 3. Create/update review YAML
# Reviews are stored in rules/arba/RULE_ID/RULE_ID-review.yaml
# 4. Validate the review
just rules-validate ARBA00089174
# 5. Validate all reviews
just rules-validate --all
Rule Review Structure
Each rule review evaluates:
| Assessment | Valid Values | Description |
|---|---|---|
| parsimony | PARSIMONIOUS, ACCEPTABLE, REDUNDANT, OVERLY_COMPLEX | Rule complexity vs necessity |
| literature_support | STRONG, MODERATE, WEAK, NONE, CONTRADICTED | Experimental evidence quality |
| condition_overlap | NONE, MINOR, SIGNIFICANT, COMPLETE | Redundancy between condition sets |
| go_specificity | TOO_BROAD, APPROPRIATE, TOO_NARROW, MISMATCHED | GO term choice appropriateness |
| taxonomic_scope | TOO_BROAD, APPROPRIATE, TOO_NARROW, MISSING, UNNECESSARY | Taxonomic restriction accuracy |
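The controlled vocabularies in the table above lend themselves to a simple membership check. A sketch (the project's real enforcement comes from the LinkML schema during `rules-validate`, not hand-rolled code like this):

```python
# Allowed values per assessment field, copied from the table above.
ASSESSMENT_VALUES = {
    "parsimony": {"PARSIMONIOUS", "ACCEPTABLE", "REDUNDANT", "OVERLY_COMPLEX"},
    "literature_support": {"STRONG", "MODERATE", "WEAK", "NONE", "CONTRADICTED"},
    "condition_overlap": {"NONE", "MINOR", "SIGNIFICANT", "COMPLETE"},
    "go_specificity": {"TOO_BROAD", "APPROPRIATE", "TOO_NARROW", "MISMATCHED"},
    "taxonomic_scope": {"TOO_BROAD", "APPROPRIATE", "TOO_NARROW", "MISSING", "UNNECESSARY"},
}

def check_assessments(review: dict) -> list[str]:
    """Flag assessment fields whose value is outside the allowed set (sketch)."""
    return [
        f"{name}: invalid value {review[name]!r}"
        for name, allowed in ASSESSMENT_VALUES.items()
        if name in review and review[name] not in allowed
    ]

assert check_assessments({"parsimony": "ACCEPTABLE", "go_specificity": "TOO_BROAD"}) == []
assert check_assessments({"literature_support": "EXCELLENT"}) == [
    "literature_support: invalid value 'EXCELLENT'"
]
```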
Example Finding: False Positive Detection
When reviewing ARBA00089174 (adaptive thermogenesis), the system identified that S. pombe gene cps1 (a fungal carboxypeptidase) received this annotation despite adaptive thermogenesis being a mammalian-specific physiological process. This revealed the rule's Eukaryota scope was too broad and should be restricted to Mammalia.
Rule Review Files
rules/
  arba/
    ARBA00089174/
      ARBA00089174.enriched.json                # Raw rule data from UniProt
      ARBA00089174-review.yaml                  # Comprehensive review
      ARBA00089174-deep-research-perplexity.md  # Literature research
      ARBA00089174-deep-research-falcon.md      # Additional research
Actions for Rules
- ACCEPT: Rule produces accurate annotations, no changes needed
- MODIFY: Rule has correct biological basis but needs refinement (taxonomic scope, GO term specificity, condition consolidation)
- REMOVE: Rule produces more false positives than valid annotations
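The REMOVE criterion above (more false positives than valid annotations) suggests a simple triage function. This framing is purely illustrative; the project does not prescribe this exact logic, and `needs_refinement` is a made-up stand-in for the assessment fields described earlier.

```python
def triage_rule(false_positives: int, true_positives: int, needs_refinement: bool) -> str:
    """Illustrative triage: REMOVE if the rule is net-harmful, MODIFY if fixable."""
    if false_positives > true_positives:
        return "REMOVE"
    return "MODIFY" if needs_refinement else "ACCEPT"

assert triage_rule(false_positives=5, true_positives=2, needs_refinement=False) == "REMOVE"
assert triage_rule(false_positives=1, true_positives=50, needs_refinement=True) == "MODIFY"
assert triage_rule(false_positives=0, true_positives=50, needs_refinement=False) == "ACCEPT"
```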
Repository Structure
- genes/ - Gene review data organized by organism
  - human/, mouse/, worm/ - Species-specific gene directories
  - Each gene folder contains: YAML review, UniProt data, GO annotations, notes
- rules/ - Annotation rule reviews
  - arba/ - ARBA rule reviews organized by rule ID
  - Each rule folder contains: enriched JSON, review YAML, deep research files
- docs/ - MkDocs-managed documentation
- src/ai_gene_review/ - Core Python package
  - cli.py - Command-line interface
  - schema/ - LinkML schema definitions
  - etl/ - Data extraction and loading modules
- tests/ - Python tests and example data
- publications/ - Cached PubMed articles
Developer Tools
This project uses just command runner for development tasks.
Available commands:
just --list # Show all available commands
just test # Run tests, type checking, and linting
just format # Run code formatting checks
just install # Install project dependencies
CLI Commands:
uv run ai-gene-review --help # Show CLI help
uv run ai-gene-review fetch-gene human BRCA1 # Fetch gene data
uv run ai-gene-review validate <yaml-file> # Validate review file
uv run ai-gene-review batch-fetch <input-file> # Process multiple genes
HTML Rendering:
just render human BRCA1 # Render single gene to HTML
just render-all # Render all gene reviews to HTML
python -m ai_gene_review.render --all genes/ # Alternative rendering command
Contributing
See CONTRIBUTING.md for detailed contribution guidelines including:
- Code of conduct and best practices
- Understanding LinkML schemas
- Pull request workflow
- Development setup
Credits
This project was generated from the monarch-project-copier template.