YC-Bench

Test whether your AI agent can sustain strategic thinking across hundreds of decisions. YC-Bench is a deterministic benchmark that simulates running an AI startup over 1-3 years. Your agent plays CEO, managing employees, accepting tasks, climbing prestige across four domains, and balancing cash flow — all through a CLI against a SQLite-backed discrete-event simulation.
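For readers unfamiliar with the term, a discrete-event simulation advances time by jumping from one scheduled event to the next rather than ticking at a fixed rate. The toy loop below illustrates only that general pattern — YC-Bench's actual engine, event types, and schema are not shown here:

```python
# Toy sketch of a discrete-event loop (illustrative only; YC-Bench's real
# engine and event names are not depicted).
import heapq

def run_events(events, horizon):
    """Process (time, name) events in time order, up to `horizon`."""
    heapq.heapify(events)          # min-heap keyed on event time
    log = []
    while events and events[0][0] <= horizon:
        t, name = heapq.heappop(events)
        log.append((t, name))      # a real engine would mutate world state here
    return log

# run_events([(5, "task_due"), (1, "payroll"), (3, "payroll")], horizon=4)
# handles the two payroll events in time order; the t=5 deadline lies
# beyond the horizon and is left on the queue.
```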

Installation

Install YC-Bench with uv and set up your API keys

Quickstart

Run your first benchmark in under 5 minutes

How It Works

Understand the simulation loop and game mechanics

CLI Reference

Explore all available commands and options

Key Features

Long-Horizon Coherence

Tests agent decision-making over hundreds of turns spanning simulated years

Deterministic Simulation

Same seed always produces the same world — perfect for reproducible benchmarking

Multi-Domain Prestige

Four domains (research, inference, data, training) with decay mechanics require balanced specialization

Hidden Employee Skills

Agents must infer productivity from progress observations — no direct skill visibility

Compounding Pressure

Salary bumps after each task compound payroll costs over time

Five Difficulty Levels

From tutorial to nightmare mode — test your agent’s limits

Why YC-Bench?

Most LLM benchmarks test isolated capabilities: coding a function, answering trivia, following instructions for a few turns. But real-world agent tasks require sustained coherence — maintaining context, adapting strategy, and compounding good decisions over long horizons. YC-Bench fills this gap. It tests whether your agent can:
  • Sustain strategic thinking across hundreds of turns without drift
  • Infer hidden information from indirect observations
  • Balance competing objectives (cash flow, prestige, capacity, deadlines)
  • Adapt to compounding consequences of earlier decisions
  • Manage complexity across multiple domains and resources

Quick Start

# Install with uv
git clone https://github.com/collinear-ai/yc-bench.git
cd yc-bench
uv sync

# Set your API key (LiteLLM-compatible)
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a benchmark
uv run yc-bench run \
  --model anthropic/claude-sonnet-4-6 \
  --seed 1 \
  --config medium

Results are saved to results/ as JSON and db/ as SQLite for detailed analysis.
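Since each run leaves a SQLite database behind, you can poke at it with the standard library alone. The helper below assumes nothing about the schema — it just lists whatever tables a run created (the example path in the comment is hypothetical):

```python
# Sketch: inspect the SQLite database a run leaves behind. Queries
# sqlite_master, so it works regardless of the simulation's schema.
import sqlite3

def list_tables(db_path: str) -> list[str]:
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# e.g. list_tables("db/run_seed1.db")  # path is hypothetical; use your run's file
```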

What Gets Tested

  • Resource allocation: Assigning employees across competing tasks
  • Prestige climbing: Choosing which domains to specialize in
  • Deadline management: Estimating task completion times under uncertainty
  • Cash flow planning: Balancing high-reward risky tasks vs. safe lower-paying work
  • Capacity planning: Managing throughput splitting and task queueing
  • Adaptive strategy: Responding to salary inflation and prestige decay

YC-Bench outputs deterministic results given a seed. This makes it ideal for comparing models, prompt strategies, and agent architectures on equal footing.
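One way to sanity-check that determinism holds in your setup is to hash the JSON output of two runs with the same seed. The filenames in the comment are hypothetical — point this at whatever files your runs wrote under results/:

```python
# Sketch: compare two result files by hashing their parsed JSON, so key
# order and whitespace differences don't cause false mismatches.
import hashlib
import json
from pathlib import Path

def result_digest(path: str) -> str:
    """Canonicalize the JSON, then hash it."""
    data = json.loads(Path(path).read_text())
    canonical = json.dumps(data, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# e.g. result_digest("results/seed1_a.json") == result_digest("results/seed1_b.json")
# (example paths only -- substitute the files your runs produced)
```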

Next Steps

Install YC-Bench

Get set up in 2 minutes

Run Your First Test

Complete walkthrough from install to results

Understand the Mechanics

Deep dive into how the simulation works