10x-bench

Security Audit
Overall — Warn
Health — Warn
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code — Pass
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions — Pass
  • Permissions — No dangerous permissions requested
Purpose
This project is an Astro-based benchmarking dashboard designed to compare how different Large Language Models (LLMs) perform on a "vibe coding" task. It evaluates one-shot website generation attempts by processing CSV evaluation files and displaying the comparative results interactively.

Security Assessment
The repository poses a minimal security risk. It is fundamentally a static website generator and data processing script. The automated code scan of 12 files found no dangerous patterns, and there are no hardcoded secrets present. Because it is an evaluation dashboard rather than a runtime server or an extension, it does not request dangerous permissions, execute hidden shell commands, or actively access sensitive user data. Overall risk: Low.

Quality Assessment
The project is brand new, with its most recent push occurring today. While this indicates active development, the tool currently has very low community visibility and adoption, with only 5 GitHub stars. On the positive side, the codebase is released under the MIT license and is safe for commercial use. Developers should treat this as an early-stage, experimental project rather than a battle-tested community standard.

Verdict
Safe to use.
Summary

This repository benchmarks how various LLMs perform at vibe coding the Przeprogramowani.pl website.

README.md

10x Benchmark

A comprehensive benchmark comparing how different large language models tackle "vibe coding" — creating a fully functional website for Przeprogramowani.pl in a single attempt, without iterative refinement.

Overview

This repository evaluates the practical capabilities of various state-of-the-art LLMs by having each model create a website implementation based on the same prompt and content specifications. The results provide insights into each model's ability to understand requirements, generate code, and produce production-ready web solutions.

Key Concept: "Vibe Coding"

Vibe coding represents a one-shot approach to web development where an LLM must:

  • Understand the complete project requirements from a single prompt
  • Extract and properly format content specifications
  • Generate a functional, well-structured website
  • Produce clean, maintainable code without iterative debugging

Project Structure

Core Directories & Files

  • ./website/ — Astro-based results dashboard (displays all benchmark results)
  • ./scripts/process-results.ts — TypeScript script that processes CSV results and generates dashboard data
  • ./eval-attempts/ — Model implementations (one-shot attempts)
  • ./eval-results/ — Processed evaluation result files

Model Attempt Directories

Each model's implementation is stored in a dedicated directory under ./eval-attempts/.

Each eval-results/{model-name}-attempt-{number} directory contains eval-results.csv with criterion-by-criterion evaluation scores. Multiple attempt directories per model indicate iterative benchmark runs.
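The {model-name}-attempt-{number} naming convention above can be split back into its parts with a small helper. This is an illustrative sketch, not code from the repository; the function name and return shape are assumptions.

```typescript
// Illustrative sketch: split a "{model-name}-attempt-{number}" directory
// name into its parts. Function name and return shape are assumptions.
function parseAttemptDir(dir: string): { model: string; attempt: number } | null {
  const match = dir.match(/^(.+)-attempt-(\d+)$/);
  if (match === null) return null;
  return { model: match[1], attempt: Number(match[2]) };
}
```

Because the model name may itself contain hyphens, the pattern anchors on the final "-attempt-{number}" suffix.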

How It Works

  1. Prompt: Each model receives the same input prompt (see 10x-bench-eval)
  2. Content: Reference content and specifications are maintained in 10x-bench-eval
  3. Implementation: Models generate website code in their respective attempt directories under ./eval-attempts/
  4. Evaluation: All implementations are assessed using the criteria and tooling from 10x-bench-eval
  5. Results Processing: The scripts/process-results.ts script parses evaluation CSV files and generates data for the dashboard
  6. Results Dashboard: An Astro-based static website (in ./website/) displays comparative results with interactive tables and summaries
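The processing and dashboard steps (5 and 6) can be sketched as a minimal Node script that walks the result directories and emits JSON for the site to render. This is a hypothetical outline, not the actual scripts/process-results.ts; the helper name is an assumption, and the real script also computes scores and averages.

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical outline of steps 5-6: collect each attempt's eval-results.csv
// and write a single JSON file for the dashboard to render.
function buildResultsJson(resultsDir: string, outFile: string): void {
  const attempts = readdirSync(resultsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({
      attempt: entry.name,
      // Raw CSV text; the real script would also parse and score it.
      csv: readFileSync(join(resultsDir, entry.name, "eval-results.csv"), "utf8"),
    }));
  writeFileSync(outFile, JSON.stringify(attempts, null, 2));
}
```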

Evaluation Criteria

The benchmark evaluates implementations across multiple dimensions:

  • Technical Stack: Framework choices, code organization, and architecture
  • Page Structure: Proper implementation of all required pages and routes
  • Content Accuracy: Correct use of provided copy and content
  • SEO & Metadata: Proper handling of titles, descriptions, and semantic HTML
  • Responsive Design: Mobile-friendliness and responsive layout implementation
  • Code Quality: Readability, maintainability, and best practices
  • Functionality: Working features and user interactions

For detailed criteria, see ./benchmark/criteria.md

Results Dashboard

Benchmark results are displayed in an interactive Astro-based static website:

  • ./website/ — Results dashboard with:
    • Overview page showing all attempts sorted by performance
    • Interactive results table with sticky headers and frozen first column
    • Model family averages
    • Benchmark details page displaying the prompt and evaluation criteria
    • Data automatically processed from CSV evaluation files via scripts/process-results.ts

Getting Started

View Results Dashboard

# Install dependencies
npm install

# Build and start development server (processes results and runs Astro)
npm run dev

# Open http://localhost:3000 in your browser

Explore Benchmark Materials

Benchmark prompt, evaluation criteria and reference content are maintained in the companion repo: 10x-bench-eval.

To explore model implementations: ls -la ./eval-attempts/

Build for Production

npm run build

This processes all evaluation results and generates a static, production-ready site in ./website/dist/.

Data Processing Pipeline

The benchmark uses an automated data pipeline to convert raw evaluation results into the interactive dashboard:

  1. Input: Each attempt directory contains eval-results.csv with criterion scores
  2. Processing: scripts/process-results.ts parses CSV files and calculates:
    • Total score for each attempt (excluding "Task completion time")
    • Percentage score relative to maximum possible score
    • Model family averages across all attempts
  3. Output: Generates website/src/data/results.json
  4. Display: Astro website statically renders the dashboard using the JSON data
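The scoring in step 2 can be sketched as follows. The row shape and the exclusion of "Task completion time" follow the description above; the function and field names are assumptions, as is rounding the percentage to one decimal place.

```typescript
// Sketch of the scoring step: sum criterion scores, excluding
// "Task completion time", and express the total as a percentage
// of the maximum possible score. Names are assumptions.
interface CriterionRow {
  criterion: string;
  score: number;
  max: number;
}

function summarize(rows: CriterionRow[]): { total: number; max: number; percent: number } {
  const scored = rows.filter((row) => row.criterion !== "Task completion time");
  const total = scored.reduce((sum, row) => sum + row.score, 0);
  const max = scored.reduce((sum, row) => sum + row.max, 0);
  // Percentage relative to the maximum possible score, one decimal place.
  const percent = max > 0 ? Math.round((total / max) * 1000) / 10 : 0;
  return { total, max, percent };
}
```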

The script supports two CSV formats:

  • New format: Criterion,Score,Max,Notes
  • Legacy format: Criterion,Score,Notes (assumes Max=1)
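A minimal parser handling both layouts might detect the format from the header's column count. This is a sketch under that assumption, not the repository's actual parser; real evaluation files with quoted, comma-containing fields would need a proper CSV library.

```typescript
// Sketch: detect the CSV layout from the header's column count, then map
// rows accordingly. Assumes commas appear only as separators, except in
// the trailing Notes column.
interface EvalRow {
  criterion: string;
  score: number;
  max: number;
  notes: string;
}

function parseEvalCsv(text: string): EvalRow[] {
  const [header, ...lines] = text.trim().split("\n");
  const hasMax = header.split(",").length === 4; // new format: Criterion,Score,Max,Notes
  return lines.map((line) => {
    const cols = line.split(",");
    return hasMax
      ? { criterion: cols[0], score: Number(cols[1]), max: Number(cols[2]), notes: cols.slice(3).join(",") }
      : { criterion: cols[0], score: Number(cols[1]), max: 1, notes: cols.slice(2).join(",") }; // legacy: Max = 1
  });
}
```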

Purpose

This benchmark serves as a practical evaluation tool for:

  • Understanding LLM capabilities in web development
  • Assessing code generation quality across different models
  • Identifying strengths and weaknesses in one-shot implementation scenarios
  • Informing technology choices for AI-assisted development workflows

Related Repositories

  • 10x-bench (this repo) — Model implementations, results dashboard, data processing, and the /run-eval skill
  • 10x-bench-eval — Evaluation criteria, scoring methodology, benchmark prompt, reference content

Note: Each attempt represents a completely independent, one-shot effort with no iterative refinement or human intervention during implementation.
