agent-vision
Eyes for AI coding agents 👁️: render → perceive → report → fix → re-render. A machine-graded visual feedback loop (DOM/contrast/OCR-grounded, plus an optional vision LLM) that agents consume to self-correct before claiming done.
AgentVision: Eyes for AI Agents 👁️
Problem: AI coding agents are blind. They write a UI, chart, SVG or PDF and never see the result, shipping breakage they can't perceive.
Result: AgentVision gives them eyes (render → see → report → fix), catching overflow, low contrast, broken images and typos.
So your agent self-corrects before it claims done.
AgentVision is a provider-agnostic framework that closes the visual feedback loop for AI
coding agents:
render → perceive → report → (agent fixes) → re-render → diff
It is not human-reviewed visual regression (Percy/Applitools/Argos) and not browser
automation (browser-use/Playwright). It is a machine-graded visual critique loop an agent
consumes to self-correct before claiming done, with a verdict (pass/warn/fail) and
actionable, coordinate-grounded issues.
The 60-second pitch
pip install "agentvision[render]"
playwright install chromium # see `agentvision doctor` if Chromium won't launch
agentvision demo # no API key required
`agentvision demo` renders a deliberately broken page, prints a FAIL report (overflow +
low-contrast + a 404 image, all DOM/CV-grounded, no LLM key needed), then loops against the
fixed version and prints "what changed: 3 issues resolved → PASS." That command is the
product.
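The "what changed" line can be pictured as a set difference between the issues found in two iterations. The following is a minimal sketch only: the `Issue` tuple, `diff_issues` and `summarize` are hypothetical names for illustration, not AgentVision's actual issue model or diff logic, and the verdict rule is simplified (any leftover issue counts as FAIL, with no warn level).

```python
from typing import NamedTuple


class Issue(NamedTuple):
    """Hypothetical issue key; not AgentVision's real schema."""
    kind: str    # e.g. "overflow", "low-contrast", "broken-image"
    target: str  # e.g. a CSS selector or a resource URL


def diff_issues(before: list[Issue], after: list[Issue]) -> dict[str, list[Issue]]:
    """Classify issues as resolved, remaining, or newly introduced."""
    before_set, after_set = set(before), set(after)
    return {
        "resolved": sorted(before_set - after_set),
        "remaining": sorted(before_set & after_set),
        "introduced": sorted(after_set - before_set),
    }


def summarize(changes: dict[str, list[Issue]]) -> str:
    """One-line summary in the spirit of the demo output (simplified verdict)."""
    verdict = "FAIL" if changes["remaining"] or changes["introduced"] else "PASS"
    return f"what changed: {len(changes['resolved'])} issues resolved -> {verdict}"


broken = [
    Issue("overflow", "h1.hero"),
    Issue("low-contrast", "p.caption"),
    Issue("broken-image", "/img/logo.png"),
]
fixed: list[Issue] = []  # the re-render of the fixed page reports nothing

print(summarize(diff_issues(broken, fixed)))
```

Keying issues on a stable identifier rather than on message text is what lets a loop tell "fixed" apart from "reworded".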
What makes it trustworthy
Findings are grounded in sources we can actually trust:
- DOM geometry (`getBoundingClientRect` + scroll offset): precise element boxes.
- Computed-style contrast (`getComputedStyle`): real WCAG ratios, with a `confidence` flag (it degrades honestly over gradients/images/pseudo-elements rather than lying).
- OCR word boxes (Tesseract): precise text locations.
- Console / network / 4xx capture: the #1 "looks fine in code, broken live" cause.
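For readers unfamiliar with what a "real WCAG ratio" is, the sketch below computes the WCAG 2.x contrast ratio from two sRGB colors. It shows the published formula only; how AgentVision reads the colors from computed styles, and how it derives its confidence flag, is not modeled here, and the function names are illustrative.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)
AA_NORMAL_TEXT = 4.5  # WCAG AA threshold for normal-size text

print(round(contrast_ratio((0, 0, 0), WHITE), 2))  # 21.0
for gray in ((0x76, 0x76, 0x76), (0x77, 0x77, 0x77)):
    ratio = contrast_ratio(gray, WHITE)
    print(gray, round(ratio, 2), "pass" if ratio >= AA_NORMAL_TEXT else "fail")
```

The ratio is only meaningful when both colors are solid, which is why a check like this has to lower its confidence over gradients, images and pseudo-elements.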
A vision LLM (Claude/OpenAI/Gemini) adds semantic critique on top. Its pixel boxes are
treated as advisory (bbox_precise: false), never marketed as pixel-accurate.
Full-coverage vision. On a large artifact the model gets a downscaled overview plus
full-resolution tiles covering it, so fine detail and small text aren't lost to downscaling.
It's pixel-based and source-agnostic: the same coverage applies to HTML, a flat image, or a
PDF page, not just elements the DOM enumerates.
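One way to picture "overview plus full-resolution tiles" is a grid of overlapping boxes that together cover every pixel. This is a sketch under stated assumptions: the tile size, overlap and overview limit are hypothetical values chosen for illustration, and AgentVision's actual tiling strategy may differ.

```python
def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 64):
    """Return (left, top, right, bottom) boxes that fully cover width x height."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap

    def starts(extent: int) -> list[int]:
        if extent <= tile:
            return [0]  # one tile already spans this axis
        positions = list(range(0, extent - tile, step))
        positions.append(extent - tile)  # last tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def overview_scale(width: int, height: int, max_side: int = 1568) -> float:
    """Downscale factor for the overview image; never upscales."""
    return min(1.0, max_side / max(width, height))


boxes = tile_boxes(2500, 1200)
print(len(boxes), "tiles; overview scale", round(overview_scale(2500, 1200), 3))
for box in boxes:
    print(box)
```

The overlap keeps a word or small element that straddles a tile boundary fully visible in at least one tile.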
Match the intent, not just avoid defects
A typo-free, well-laid-out artifact can still be the wrong thing: an infographic that
shows the wrong stages, a page missing the panel you asked for, a generated image that
ignored half the prompt. Give AgentVision the intent and it grades the render against it,
so PASS means "matches what I set out to build," not merely "defect-free":
# Does the render match the thought? (text claims grade deterministically via OCR)
agentvision conform ./infographic.png \
--brief "launch infographic for AgentVision" \
--expect 'must: title reads "AgentVision"' \
--expect 'should: shows 4 stages left to right'
For AI-generated artifacts the fix is a better prompt, not code, so the generative loop
generate → see → grade vs intent → refine prompt → regenerate runs until it matches. The
image generator is a hook you supply; AgentVision never bundles an image-gen dependency:
agentvision generate --generator mypkg.gen:make_image \
--brief "minimalist infographic, dark background, no typos" --max-iter 4 -o final.png
See docs/conformance.md. Express intent three ways: a free-text
brief (eyes extract the checklist), an explicit checklist (`--expect`, deterministic),
or a reference image (`--reference`). Claims are `must:` / `should:` / `nice:`.
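When a harness builds the checklist programmatically, it can help to validate each claim's prefix before shelling out. The sketch below assembles an argument list for `agentvision conform` using only the `--brief` and `--expect` flags shown above; `parse_claim` and `conform_argv` are hypothetical helpers, and the validation rule is this sketch's assumption rather than the CLI's own parser.

```python
import shlex

SEVERITIES = ("must", "should", "nice")


def parse_claim(claim: str) -> tuple[str, str]:
    """Split 'must: title reads "X"' into ('must', 'title reads "X"')."""
    severity, sep, text = claim.partition(":")
    severity = severity.strip().lower()
    if not sep or severity not in SEVERITIES or not text.strip():
        raise ValueError(f"expected '<must|should|nice>: <claim>', got {claim!r}")
    return severity, text.strip()


def conform_argv(artifact: str, brief: str, claims: list[str]) -> list[str]:
    """Build the argument list for `agentvision conform` (no shell quoting needed)."""
    argv = ["agentvision", "conform", artifact, "--brief", brief]
    for claim in claims:
        severity, text = parse_claim(claim)  # fail early on a malformed claim
        argv += ["--expect", f"{severity}: {text}"]
    return argv


argv = conform_argv(
    "./infographic.png",
    "launch infographic for AgentVision",
    ['must: title reads "AgentVision"', "should: shows 4 stages left to right"],
)
print(shlex.join(argv))  # a copy-pasteable shell line; pass argv to subprocess.run to execute
```

Passing the list form to `subprocess.run` avoids quoting problems with claims that contain quotes or colons.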
Eyes → brain: the handoff
In anatomy the eyes are only the afferent half: the retina perceives, the optic nerve
carries the signal to the brain, the brain decides, the hand acts, the eyes look again.
AgentVision is that afferent pathway for an agent: it perceives and hands a clean signal back
to the brain (whatever does your reasoning/planning/memory). It deliberately doesn't
decide for you. Any perception call distills to a Handoff:
agentvision analyze ./page.html --handoff
{ "perceived": "fail", "next_action": "revise", "matches_intent": false,
"todo": ["[overflow] hero text overflows on the right",
"[intent/must] a \"Checkout\" button is visible"],
"open_questions": ["Verify: uses the brand's dark theme"] }
`next_action` (done / revise / review) drives the brain's loop; `todo` is the work-list; `open_questions` is what perception couldn't confirm (never dropped). Available as `report.to_handoff()`, the MCP `perceive_handoff` tool, `POST /handoff`, and a `handoff.json`
per loop iteration; provider- and brain-agnostic. See docs/handoff.md.
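On the consuming side, a brain only needs to branch on `next_action` and carry the two lists forward. The sketch below parses the example payload shown above and turns it into a small plan. The plan's shape and the `plan_from_handoff` name are hypothetical, and treating `review` as "hand off to a human or slower check" is this sketch's interpretation, not documented behavior.

```python
import json

# The example payload from above, copied verbatim (raw string keeps the \" escapes).
HANDOFF_JSON = r"""
{ "perceived": "fail", "next_action": "revise", "matches_intent": false,
  "todo": ["[overflow] hero text overflows on the right",
           "[intent/must] a \"Checkout\" button is visible"],
  "open_questions": ["Verify: uses the brand's dark theme"] }
"""


def plan_from_handoff(handoff: dict) -> dict:
    """Turn a Handoff into a tiny plan for the reasoning side (hypothetical shape)."""
    action = handoff.get("next_action")
    if action not in ("done", "revise", "review"):
        raise ValueError(f"unknown next_action: {action!r}")
    return {
        "stop": action == "done",
        "work": list(handoff.get("todo", [])) if action == "revise" else [],
        # Assumption: open questions are always surfaced, never silently dropped.
        "escalate": list(handoff.get("open_questions", [])),
        "needs_review": action == "review",
    }


plan = plan_from_handoff(json.loads(HANDOFF_JSON))
for item in plan["work"]:
    print("TODO:", item)
for question in plan["escalate"]:
    print("OPEN:", question)
```

Rejecting an unknown `next_action` outright keeps a schema change from being misread as "done".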
Eyes & Brain: AgentVision × Verel
AgentVision is the eyes. It pairs with Verel,
the brain: an agent framework where nothing is "done" until a grader returns a verdict.
The eyes perceive and grade intent; the brain decides with attestation and compounds only
verified work into memory; then the eyes look again.
They ship and version independently (pip install agentvision, pip install verel) yet work
in sync: AgentVision plugs into Verel as its verel.senses perception organ β mapped onto a
unified verdict bus (vision alongside tests, lint and types), with intent conformance
recorded in the brain's memory each iteration. AgentVision stays brain-agnostic; Verel is the
reference brain. See docs/handoff.md.
Many faces, one core
| Surface | Who it's for |
|---|---|
| Library (`import agentvision`) | Python apps, custom harnesses |
| CLI (`agentvision …`) | Any agent that can run a shell command; CI |
| Claude Code Skill | Claude agents; auto-invokes the loop before claiming done |
| MCP server (`agentvision-mcp`) | Cursor, Claude, any MCP-capable host |
| REST service (`agentvision-serve`) | Non-MCP / networked / CI agents |
| Integration recipes | Cursor rules, Aider, generic "agent contract" |
⚠️ "Provider-agnostic" describes the API surface, not behavior. The framework can't
force a non-Claude agent into the loop; it gives every agent the means. The Claude
Code Skill is the one surface that makes an agent use it proactively; MCP is the
first-class cross-host path; the recipes cover the rest.
Vision backends
Pluggable and selectable via `--backend` / `AGENTVISION_VISION_BACKEND`:
- `anthropic` (default model `claude-haiku-4-5`, upgradable to Sonnet/Opus)
- `openai`, `gemini`
- `local`: CV/OCR heuristics only, no API key, no egress (great for CI / air-gapped)
Install
pip install "agentvision[all]" # everything
pip install "agentvision[render]" # just rendering + the no-key local loop
pip install "agentvision[render,anthropic]" # + Claude analysis
System dependencies (Chromium, Tesseract, poppler) and a doctor that checks them:
agentvision doctor # attempts a real Chromium launch; lists every missing lib
agentvision doctor --fix # installs the Chromium browser binary
On a bare RHEL/CentOS box, playwright install-deps does not work (apt-only). See
docs/quickstart.md for the dnf line, or use the bundled
Dockerfile which bakes the deps in.
Usage
# Analyze a file/URL/HTML string and print a structured report
agentvision analyze ./index.html --backend local --json
# Run the self-correcting loop
agentvision loop ./dashboard.html --max-iter 3
# Responsive contact sheet across breakpoints
agentvision sheet ./index.html --breakpoints 375,768,1280,1920
# Visual regression against a named baseline
agentvision baseline ./index.html --name home
agentvision regress ./index.html --name home
Live pages, SPAs & dashboards (polling, websockets, canvas/WebGL):
# localhost dev server, wait for the data to render, freeze animation, machine output
agentvision analyze http://localhost:5173 --allow-local \
--wait-for "#dashboard" --settle-ms 800 --quiet
Streaming / video / over-time behavior (watch, don't just glance):
# Is the video actually playing? Did loading finish? Are captions on?
agentvision watch https://app.example.com/player --frames 6 --interval-ms 500 \
--expect 'must: the video is playing'
watch reads deterministic <video> state (currentTime/readyState/captions) + pixel
liveness/stall/black-frame detection, then adds a time-aware vision pass. See
docs/use-cases/streaming.md.
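The pixel side of this (liveness, stall, black-frame) can be approximated with simple frame statistics. The sketch below works on tiny grayscale frames held as nested lists; the thresholds and the `classify` function are hypothetical illustrations, not AgentVision's detector, which also reads `<video>` state and runs a vision pass.

```python
Frame = list[list[int]]  # rows of 8-bit grayscale pixels


def mean_brightness(frame: Frame) -> float:
    """Average pixel value of one frame."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-size frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / sum(len(row) for row in a)


def classify(frames: list[Frame], black_below: float = 8.0,
             stalled_below: float = 1.0) -> str:
    """Label a frame sequence as 'black', 'stalled' or 'live' (hypothetical thresholds)."""
    if all(mean_brightness(f) < black_below for f in frames):
        return "black"
    diffs = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    if diffs and max(diffs) < stalled_below:
        return "stalled"  # frames are visible but nothing changes between them
    return "live"


def solid(value: int, size: int = 4) -> Frame:
    """Build a size x size frame filled with one gray value."""
    return [[value] * size for _ in range(size)]


print(classify([solid(0), solid(0), solid(0)]))        # black
print(classify([solid(120), solid(120), solid(120)]))  # stalled
print(classify([solid(0), solid(40), solid(80)]))      # live
```

A paused video and a static slide look identical to this check, which is one reason pixel evidence is paired with the element's reported playback state.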
--nav-wait defaults to load (polling pages never go idle); --freeze (default on) pauses
animations + requestAnimationFrame so canvas/WebGL pages capture without hanging; --quiet
prints only JSON (logs to stderr, exit codes 0 pass/warn · 2 fail · 3 error).
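A harness that shells out can rely on the exit codes and parse stdout as JSON. The following is a minimal sketch, assuming `--quiet` can be combined with `--backend local` on `analyze`; the wrapper names are hypothetical, and the report is returned as parsed JSON without assuming any particular fields.

```python
import json
import subprocess

# Exit codes as documented above.
EXIT_MEANING = {0: "pass/warn", 2: "fail", 3: "error"}


def interpret(returncode: int) -> str:
    """Map an agentvision exit code to its documented meaning."""
    return EXIT_MEANING.get(returncode, f"unexpected exit code {returncode}")


def analyze(source: str) -> tuple[str, object]:
    """Run `agentvision analyze` quietly; return (status, parsed JSON or None)."""
    try:
        proc = subprocess.run(
            ["agentvision", "analyze", source, "--backend", "local", "--quiet"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return "agentvision is not installed or not on PATH", None
    try:
        report = json.loads(proc.stdout) if proc.stdout.strip() else None
    except json.JSONDecodeError:
        report = None  # stdout was not JSON; status still comes from the exit code
    return interpret(proc.returncode), report


print(interpret(0), "|", interpret(2), "|", interpret(3))
```

Calling `analyze("./index.html")` then yields a status string plus whatever JSON the CLI printed; `check=False` is deliberate, since a non-zero exit here is a verdict, not a crash.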
Library:
import asyncio

from agentvision import load_settings
from agentvision.core.loop import LoopSession


async def main():
    # "local" backend: CV/OCR heuristics only, no API key needed
    settings = load_settings(vision_backend="local")
    session = LoopSession("examples/broken_layout.html", settings=settings)
    # Await one iterate() call, then print the verdict and the issue messages
    result = await session.iterate()
    print(result.report.verdict, [i.message for i in result.report.issues])


asyncio.run(main())
Drop it into your workflow & your agents
# CI gate (GitHub Action): fails the build on a visual FAIL verdict
- uses: amitpatole/[email protected]
  with: { source: dist/index.html, command: check, args: --full-page }
- CI / pre-commit / Makefile: shell out; exit codes 0 pass/warn · 2 fail · 3 error, `--quiet` for JSON-only output. Reusable GitHub Action + pre-commit hook included.
- Your agents: drop integrations/agent-contract.md into the system prompt, use the Claude Code Skill, or the MCP tools (Cursor/Claude/any host).

Full guide: docs/integrations.md.
Documentation
Full docs site: amitpatole.github.io/agent-vision
- Quickstart · The Loop · Conformance · Handoff (eyes→brain) · Streaming / temporal · Backends · Adapters · Integrations · Vision
What we do not claim (honesty)
- Pixel-accurate vision-model bounding boxes (they're advisory).
- WCAG verdicts on rasterized non-HTML (heuristic only).
- Bit-reproducible screenshots / deterministic LLM reports.
- Uniform provider-agnostic behavior (only the API surface is uniform).
License
MIT © Amit Patole