YC-Bench
Test whether your AI agent can sustain strategic thinking across hundreds of decisions. YC-Bench is a deterministic benchmark that simulates running an AI startup over 1-3 years. Your agent plays CEO: managing employees, accepting tasks, climbing prestige across four domains, and balancing cash flow, all through a CLI against a SQLite-backed discrete-event simulation.
Installation
Install YC-Bench with uv and set up your API keys
Quickstart
Run your first benchmark in under 5 minutes
How It Works
Understand the simulation loop and game mechanics
CLI Reference
Explore all available commands and options
Key Features
Long-Horizon Coherence
Tests agent decision-making over hundreds of turns spanning simulated years
Deterministic Simulation
Same seed always produces the same world — perfect for reproducible benchmarking
Multi-Domain Prestige
Four domains (research, inference, data, training) with decay mechanics require balanced specialization
Hidden Employee Skills
Agents must infer productivity from progress observations — no direct skill visibility
Compounding Pressure
Salary bumps after each task compound payroll costs over time
Five Difficulty Levels
From tutorial to nightmare mode — test your agent’s limits
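To make the Compounding Pressure feature concrete, here is a minimal sketch of how per-task raises could inflate payroll. It assumes each raise is multiplicative, and `bump_rate` is a hypothetical value chosen for illustration; the simulation's actual formula and rate are not stated on this page.

```python
def payroll_after_tasks(base_salaries, tasks_completed, bump_rate=0.01):
    """Total payroll after each employee has received one raise per completed task.

    Assumption: raises compound multiplicatively. bump_rate is hypothetical,
    not a value taken from YC-Bench.
    """
    return sum(
        salary * (1 + bump_rate) ** n
        for salary, n in zip(base_salaries, tasks_completed)
    )


if __name__ == "__main__":
    salaries = [8000.0, 9500.0, 12000.0]
    for tasks_each in (0, 25, 50, 100):
        total = payroll_after_tasks(salaries, [tasks_each] * len(salaries))
        print(f"{tasks_each:>3} tasks per employee -> payroll {total:,.0f}")
```

Under these assumptions, the same team costs noticeably more late in a run than at the start, which is why an agent that keeps assigning everyone to every task may see its runway shrink even while revenue holds steady.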
Why YC-Bench?
Most LLM benchmarks test isolated capabilities: coding a function, answering trivia, following instructions for a few turns. But real-world agent tasks require sustained coherence: maintaining context, adapting strategy, and compounding good decisions over long horizons. YC-Bench fills this gap. It tests whether your agent can:
- Sustain strategic thinking across hundreds of turns without drift
- Infer hidden information from indirect observations
- Balance competing objectives (cash flow, prestige, capacity, deadlines)
- Adapt to compounding consequences of earlier decisions
- Manage complexity across multiple domains and resources
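As an illustration of inferring hidden information from indirect observations, the sketch below estimates an employee's work rate from a series of progress readings and projects the time left on a task. The observation format (pairs of elapsed hours and cumulative progress) and both function names are hypothetical; YC-Bench's actual CLI output is not shown on this page.

```python
def estimate_rate(observations):
    """Least-squares slope through the origin for (hours_elapsed, progress) pairs.

    The pair format is an assumption made for this example.
    """
    numerator = sum(t * p for t, p in observations)
    denominator = sum(t * t for t, _ in observations)
    if denominator == 0:
        raise ValueError("need at least one observation with nonzero elapsed time")
    return numerator / denominator


def hours_to_finish(observations, total_work):
    """Projected hours remaining, given the latest progress and the estimated rate."""
    rate = estimate_rate(observations)
    if rate <= 0:
        raise ValueError("no measurable progress yet")
    done = observations[-1][1]
    return max(total_work - done, 0) / rate


if __name__ == "__main__":
    readings = [(1.0, 2.0), (2.0, 4.0)]  # two hypothetical progress checks
    print(estimate_rate(readings))
    print(hours_to_finish(readings, total_work=10.0))
```

An agent that keeps such an estimate per employee and per domain can compare projected finish times against deadlines before accepting a task, rather than discovering a shortfall after the fact.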
Quick Start
Results are saved to results/ as JSON and db/ as SQLite for detailed analysis.
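Since the outputs are JSON and SQLite, they can be inspected with the Python standard library alone. The sketch below assumes only the two directory names mentioned above. File names, JSON keys, and table names are not documented on this page, so the helper (a hypothetical `summarize_outputs`) simply lists whatever it finds rather than assuming a schema.

```python
import json
import sqlite3
from contextlib import closing
from pathlib import Path


def summarize_outputs(results_dir="results", db_dir="db"):
    """List top-level JSON keys and SQLite table names found in the output directories."""
    summary = {"json": {}, "sqlite": {}}

    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # Report keys for objects, otherwise just the top-level type name.
        summary["json"][path.name] = (
            sorted(data) if isinstance(data, dict) else type(data).__name__
        )

    for path in sorted(Path(db_dir).glob("*")):
        if not path.is_file():
            continue
        uri = path.resolve().as_uri() + "?mode=ro"  # open read-only
        try:
            with closing(sqlite3.connect(uri, uri=True)) as conn:
                rows = conn.execute(
                    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
                ).fetchall()
        except sqlite3.DatabaseError:
            continue  # not a SQLite database, skip it
        summary["sqlite"][path.name] = [name for (name,) in rows]

    return summary


if __name__ == "__main__":
    print(json.dumps(summarize_outputs(), indent=2))
```

Opening the database read-only keeps a finished run untouched, which matters if you want to re-inspect the same file later.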
What Gets Tested
- Resource allocation: Assigning employees across competing tasks
- Prestige climbing: Choosing which domains to specialize in
- Deadline management: Estimating task completion times under uncertainty
- Cash flow planning: Balancing high-reward risky tasks vs. safe lower-paying work
- Capacity planning: Managing throughput splitting and task queueing
- Adaptive strategy: Responding to salary inflation and prestige decay
YC-Bench outputs deterministic results given a seed. This makes it ideal for comparing models, prompt strategies, and agent architectures on equal footing.
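The general idea behind seed-based determinism can be sketched as follows. This is not YC-Bench's world generator: `generate_world` and its task fields are hypothetical, and the example only shows the pattern of deriving every random choice from a private, seeded generator.

```python
import random


def generate_world(seed, n_tasks=5):
    """Build a toy task list whose contents depend only on the seed.

    Hypothetical example; field names and value ranges are illustrative.
    """
    rng = random.Random(seed)  # private generator, independent of global random state
    return [
        {
            "id": i,
            "reward": rng.randint(1000, 10000),
            "deadline_days": rng.randint(5, 60),
        }
        for i in range(n_tasks)
    ]


if __name__ == "__main__":
    # Two runs with the same seed see an identical world.
    assert generate_world(42) == generate_world(42)
    print(generate_world(42)[0])
```

Because every agent faces the same world for a given seed, differences in outcome can be attributed to the agent's decisions rather than to luck in task generation.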
Next Steps
Install YC-Bench
Get set up in 2 minutes
Run Your First Test
Complete walkthrough from install to results
Understand the Mechanics
Deep dive into how the simulation works