YC-Bench

Test whether your AI agent can sustain strategic thinking across hundreds of decisions. YC-Bench is a deterministic benchmark that simulates running an AI startup over 1-3 years. Your agent plays CEO, managing employees, accepting tasks, climbing prestige across four domains, and balancing cash flow — all through a CLI against a SQLite-backed discrete-event simulation.
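For readers unfamiliar with the term, a discrete-event simulation advances time by jumping from one scheduled event to the next rather than ticking at a fixed rate. The toy loop below illustrates only that general pattern — YC-Bench's actual engine, event types, and schema are not shown here:

```python
# Toy sketch of a discrete-event loop (illustrative only; YC-Bench's real
# engine and event names are not depicted).
import heapq

def run_events(events, horizon):
    """Process (time, name) events in time order, up to `horizon`."""
    heapq.heapify(events)          # min-heap keyed on event time
    log = []
    while events and events[0][0] <= horizon:
        t, name = heapq.heappop(events)
        log.append((t, name))      # a real engine would mutate world state here
    return log

# run_events([(5, "task_due"), (1, "payroll"), (3, "payroll")], horizon=4)
# handles the two payroll events in time order; the t=5 deadline lies
# beyond the horizon and is left on the queue.
```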

Installation

Install YC-Bench with uv and set up your API keys

Quickstart

Run your first benchmark in under 5 minutes

How It Works

Understand the simulation loop and game mechanics

CLI Reference

Explore all available commands and options

Key Features

Long-Horizon Coherence

Tests agent decision-making over hundreds of turns spanning simulated years

Deterministic Simulation

Same seed always produces the same world — perfect for reproducible benchmarking

Multi-Domain Prestige

Four domains (research, inference, data, training) with decay mechanics require balanced specialization

Hidden Employee Skills

Agents must infer productivity from progress observations — no direct skill visibility

Compounding Pressure

Salary bumps after each task compound payroll costs over time

Five Difficulty Levels

From tutorial to nightmare mode — test your agent’s limits

Why YC-Bench?

Most LLM benchmarks test isolated capabilities: coding a function, answering trivia, following instructions for a few turns. But real-world agent tasks require sustained coherence — maintaining context, adapting strategy, and compounding good decisions over long horizons. YC-Bench fills this gap. It tests whether your agent can:
  • Sustain strategic thinking across hundreds of turns without drift
  • Infer hidden information from indirect observations
  • Balance competing objectives (cash flow, prestige, capacity, deadlines)
  • Adapt to compounding consequences of earlier decisions
  • Manage complexity across multiple domains and resources

Quick Start

# Install with uv
git clone https://github.com/collinear-ai/yc-bench.git
cd yc-bench
uv sync

# Set your API key (LiteLLM-compatible)
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a benchmark
uv run yc-bench run \
  --model anthropic/claude-sonnet-4-6 \
  --seed 1 \
  --config medium

Results are saved to results/ as JSON and db/ as SQLite for detailed analysis.
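Since each run leaves a SQLite database behind, you can poke at it with the standard library alone. The helper below assumes nothing about the schema — it just lists whatever tables a run created (the example path in the comment is hypothetical):

```python
# Sketch: inspect the SQLite database a run leaves behind. Queries
# sqlite_master, so it works regardless of the simulation's schema.
import sqlite3

def list_tables(db_path: str) -> list[str]:
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# e.g. list_tables("db/run_seed1.db")  # path is hypothetical; use your run's file
```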

What Gets Tested

  • Resource allocation: Assigning employees across competing tasks
  • Prestige climbing: Choosing which domains to specialize in
  • Deadline management: Estimating task completion times under uncertainty
  • Cash flow planning: Balancing high-reward risky tasks vs. safe lower-paying work
  • Capacity planning: Managing throughput splitting and task queueing
  • Adaptive strategy: Responding to salary inflation and prestige decay

YC-Bench outputs deterministic results given a seed. This makes it ideal for comparing models, prompt strategies, and agent architectures on equal footing.
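One way to sanity-check that determinism holds in your setup is to hash the JSON output of two runs with the same seed. The filenames in the comment are hypothetical — point this at whatever files your runs wrote under results/:

```python
# Sketch: compare two result files by hashing their parsed JSON, so key
# order and whitespace differences don't cause false mismatches.
import hashlib
import json
from pathlib import Path

def result_digest(path: str) -> str:
    """Canonicalize the JSON, then hash it."""
    data = json.loads(Path(path).read_text())
    canonical = json.dumps(data, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# e.g. result_digest("results/seed1_a.json") == result_digest("results/seed1_b.json")
# (example paths only -- substitute the files your runs produced)
```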

Next Steps

Install YC-Bench

Get set up in 2 minutes

Run Your First Test

Complete walkthrough from install to results

Understand the Mechanics

Deep dive into how the simulation works