10x-bench

Security Audit
Overall — Warn
Health — Warn
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code — Pass
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions — Pass
  • Permissions — No dangerous permissions requested
Purpose
This project is an Astro-based benchmarking dashboard designed to compare how different Large Language Models (LLMs) perform on a "vibe coding" task. It evaluates one-shot website generation attempts by processing CSV evaluation files and displaying the comparative results interactively.

Security Assessment
The repository poses a minimal security risk. It is fundamentally a static website generator and data processing script. The automated code scan of 12 files found no dangerous patterns, and there are no hardcoded secrets present. Because it is an evaluation dashboard rather than a runtime server or an extension, it does not request dangerous permissions, execute hidden shell commands, or actively access sensitive user data. Overall risk: Low.

Quality Assessment
The project is brand new, with its most recent push occurring today. While this indicates active development, the tool currently has very low community visibility and adoption, with only 5 GitHub stars. On the positive side, the codebase is released under the MIT license and is safe for commercial use. Developers should treat this as an early-stage, experimental project rather than a battle-tested community standard.

Verdict
Safe to use.
Summary

This repository benchmarks how various LLMs perform at vibe coding the Przeprogramowani.pl website.

README.md

10x Benchmark

A comprehensive benchmark comparing how different large language models tackle "vibe coding" — creating a fully functional website for Przeprogramowani.pl in a single attempt, without iterative refinement.

Overview

This repository evaluates the practical capabilities of various state-of-the-art LLMs by having each model create a website implementation based on the same prompt and content specifications. The results provide insights into each model's ability to understand requirements, generate code, and produce production-ready web solutions.

Key Concept: "Vibe Coding"

Vibe coding represents a one-shot approach to web development where an LLM must:

  • Understand the complete project requirements from a single prompt
  • Extract and properly format content specifications
  • Generate a functional, well-structured website
  • Produce clean, maintainable code without iterative debugging

Project Structure

Core Directories & Files

  • ./website/ — Astro-based results dashboard (displays all benchmark results)
  • ./scripts/process-results.ts — TypeScript script that processes CSV results and generates dashboard data
  • ./eval-attempts/ — Model implementations (one-shot attempts)
  • ./eval-results/ — Processed evaluation result files

Model Attempt Directories

Each model's implementation is stored in a dedicated directory under ./eval-attempts/.

Each eval-results/{model-name}-attempt-{number} directory contains eval-results.csv with criterion-by-criterion evaluation scores. Multiple attempt directories per model indicate iterative benchmark runs.
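The {model-name}-attempt-{number} naming convention above can be split back into its parts with a small helper. This is an illustrative sketch, not code from the repository; the function name and return shape are assumptions.

```typescript
// Illustrative sketch: split a "{model-name}-attempt-{number}" directory
// name into its parts. Function name and return shape are assumptions.
function parseAttemptDir(dir: string): { model: string; attempt: number } | null {
  const match = dir.match(/^(.+)-attempt-(\d+)$/);
  if (match === null) return null;
  return { model: match[1], attempt: Number(match[2]) };
}
```

Because the model name may itself contain hyphens, the pattern anchors on the final "-attempt-{number}" suffix.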

How It Works

  1. Prompt: Each model receives the same input prompt (see 10x-bench-eval)
  2. Content: Reference content and specifications are maintained in 10x-bench-eval
  3. Implementation: Models generate website code in their respective attempt directories under ./eval-attempts/
  4. Evaluation: All implementations are assessed using the criteria and tooling from 10x-bench-eval
  5. Results Processing: The scripts/process-results.ts script parses evaluation CSV files and generates data for the dashboard
  6. Results Dashboard: An Astro-based static website (in ./website/) displays comparative results with interactive tables and summaries
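The processing and dashboard steps (5 and 6) can be sketched as a minimal Node script that walks the result directories and emits JSON for the site to render. This is a hypothetical outline, not the actual scripts/process-results.ts; the helper name is an assumption, and the real script also computes scores and averages.

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical outline of steps 5-6: collect each attempt's eval-results.csv
// and write a single JSON file for the dashboard to render.
function buildResultsJson(resultsDir: string, outFile: string): void {
  const attempts = readdirSync(resultsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({
      attempt: entry.name,
      // Raw CSV text; the real script would also parse and score it.
      csv: readFileSync(join(resultsDir, entry.name, "eval-results.csv"), "utf8"),
    }));
  writeFileSync(outFile, JSON.stringify(attempts, null, 2));
}
```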

Evaluation Criteria

The benchmark evaluates implementations across multiple dimensions:

  • Technical Stack: Framework choices, code organization, and architecture
  • Page Structure: Proper implementation of all required pages and routes
  • Content Accuracy: Correct use of provided copy and content
  • SEO & Metadata: Proper handling of titles, descriptions, and semantic HTML
  • Responsive Design: Mobile-friendliness and responsive layout implementation
  • Code Quality: Readability, maintainability, and best practices
  • Functionality: Working features and user interactions

For detailed criteria, see ./benchmark/criteria.md

Results Dashboard

Benchmark results are displayed in an interactive Astro-based static website:

  • ./website/ — Results dashboard with:
    • Overview page showing all attempts sorted by performance
    • Interactive results table with sticky headers and frozen first column
    • Model family averages
    • Benchmark details page displaying the prompt and evaluation criteria
    • Data automatically processed from CSV evaluation files via scripts/process-results.ts

Getting Started

View Results Dashboard

# Install dependencies
npm install

# Build and start development server (processes results and runs Astro)
npm run dev

# Open http://localhost:3000 in your browser

Explore Benchmark Materials

Benchmark prompt, evaluation criteria and reference content are maintained in the companion repo: 10x-bench-eval.

To explore model implementations: ls -la ./eval-attempts/

Build for Production

npm run build

This processes all evaluation results and generates a static, production-ready site in ./website/dist/.

Data Processing Pipeline

The benchmark uses an automated data pipeline to convert raw evaluation results into the interactive dashboard:

  1. Input: Each attempt directory contains eval-results.csv with criterion scores
  2. Processing: scripts/process-results.ts parses CSV files and calculates:
    • Total score for each attempt (excluding "Task completion time")
    • Percentage score relative to maximum possible score
    • Model family averages across all attempts
  3. Output: Generates website/src/data/results.json
  4. Display: Astro website statically renders the dashboard using the JSON data
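The scoring in step 2 can be sketched as follows. The row shape and the exclusion of "Task completion time" follow the description above; the function and field names are assumptions, as is rounding the percentage to one decimal place.

```typescript
// Sketch of the scoring step: sum criterion scores, excluding
// "Task completion time", and express the total as a percentage
// of the maximum possible score. Names are assumptions.
interface CriterionRow {
  criterion: string;
  score: number;
  max: number;
}

function summarize(rows: CriterionRow[]): { total: number; max: number; percent: number } {
  const scored = rows.filter((row) => row.criterion !== "Task completion time");
  const total = scored.reduce((sum, row) => sum + row.score, 0);
  const max = scored.reduce((sum, row) => sum + row.max, 0);
  // Percentage relative to the maximum possible score, one decimal place.
  const percent = max > 0 ? Math.round((total / max) * 1000) / 10 : 0;
  return { total, max, percent };
}
```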

The script supports two CSV formats:

  • New format: Criterion,Score,Max,Notes
  • Legacy format: Criterion,Score,Notes (assumes Max=1)
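A minimal parser handling both layouts might detect the format from the header's column count. This is a sketch under that assumption, not the repository's actual parser; real evaluation files with quoted, comma-containing fields would need a proper CSV library.

```typescript
// Sketch: detect the CSV layout from the header's column count, then map
// rows accordingly. Assumes commas appear only as separators, except in
// the trailing Notes column.
interface EvalRow {
  criterion: string;
  score: number;
  max: number;
  notes: string;
}

function parseEvalCsv(text: string): EvalRow[] {
  const [header, ...lines] = text.trim().split("\n");
  const hasMax = header.split(",").length === 4; // new format: Criterion,Score,Max,Notes
  return lines.map((line) => {
    const cols = line.split(",");
    return hasMax
      ? { criterion: cols[0], score: Number(cols[1]), max: Number(cols[2]), notes: cols.slice(3).join(",") }
      : { criterion: cols[0], score: Number(cols[1]), max: 1, notes: cols.slice(2).join(",") }; // legacy: Max = 1
  });
}
```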

Purpose

This benchmark serves as a practical evaluation tool for:

  • Understanding LLM capabilities in web development
  • Assessing code generation quality across different models
  • Identifying strengths and weaknesses in one-shot implementation scenarios
  • Informing technology choices for AI-assisted development workflows

Related Repositories

  • 10x-bench (this repo) — Model implementations, results dashboard, data processing, and the /run-eval skill
  • 10x-bench-eval — Evaluation criteria, scoring methodology, benchmark prompt, reference content

Note: Each attempt represents a completely independent, one-shot effort with no iterative refinement or human intervention during implementation.
