Overview
YC-Bench measures agent performance across multiple dimensions: survival, profitability, task completion rate, prestige achieved, and efficiency. Unlike single-metric benchmarks, YC-Bench requires agents to balance competing objectives.
There is no single “score”; instead, the benchmark reports a rollout with detailed metrics. Researchers can define custom scoring functions based on their evaluation priorities.
Primary Success Criteria
1. Survival (Binary)
Did the company survive to horizon end without going bankrupt?
- ❌ Bankruptcy: funds < 0 after payroll or any transaction
- ❌ Agent crash: Unhandled exception or timeout
- ❌ Max turns exceeded: Agent hit turn limit (if configured)
2. Final Funds (Continuous)
How much cash remains at the end?
- Negative: Bankrupt ❌
- $0–$50K: Barely survived (razor-thin margins)
- $50K–$200K: Comfortable survival
- $200K+: Strong performance (healthy cash reserves)
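The bands above can be expressed as a small lookup. This is a minimal sketch, not part of YC-Bench: `classify_final_funds` is a hypothetical helper, and it assumes the band boundaries sit at $0, $50K, and $200K.

```python
def classify_final_funds(final_funds: float) -> str:
    """Map final funds (in dollars) to the interpretation bands listed above."""
    if final_funds < 0:
        return "bankrupt"
    if final_funds < 50_000:
        return "barely survived"
    if final_funds < 200_000:
        return "comfortable survival"
    return "strong performance"


# Example: a run that ends with $120K falls in the comfortable band.
band = classify_final_funds(120_000)
```

Because typical outcomes differ by preset, a fixed set of bands like this is only a rough guide; the per-preset table is the better reference when comparing runs.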
Typical Final Funds by Preset
| Preset | Starting Funds | Typical Final Funds (Survival) | Strong Performance |
|---|---|---|---|
| Tutorial | $80K | 150K | $200K+ |
| Easy | $120K | 300K | $500K+ |
| Medium | $150K | 200K | $400K+ |
| Hard | $150K | 100K | $300K+ |
| Nightmare | $250K | -$50K (bankruptcy common) | $100K+ |
Final funds reflect cumulative profit over the entire run, accounting for all task rewards, payroll expenses, and compounding salary growth.
Task Completion Metrics
Tasks Completed (Success)
Number of tasks finished on time (before deadline).
- Tutorial: 100–150 tasks
- Easy: 80–120 tasks
- Medium: 60–100 tasks
- Hard: 40–80 tasks
- Nightmare: 20–50 tasks
Tasks Failed (Late Completion)
Number of tasks completed after deadline.
- 0 failures: Perfect execution ✅
- 1–5 failures: Acceptable (occasional missed deadline)
- 5–10 failures: Suboptimal (deadline estimation issues)
- 10+ failures: Poor planning (over-commitment or under-resourcing)
Tasks Cancelled
Number of tasks cancelled before completion.
- 0 cancellations: Committed to all accepted tasks ✅
- 1–3 cancellations: Strategic cancellation (rare)
- 3+ cancellations: Poor task selection or over-commitment
Cancellation incurs a 2.0× prestige penalty (worse than failure). Frequent cancellations suggest the agent is accepting tasks speculatively without validating feasibility.
Task Completion Rate
- 95%+ success: Excellent planning and execution
- 85–95% success: Good (occasional deadline miss)
- 70–85% success: Acceptable (some planning issues)
- Below 70% success: Poor (frequent failures/cancellations)
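The rate itself is not defined in this section. The sketch below assumes it is the share of accepted tasks that finished on time, with late and cancelled tasks both counting against it. The function names are hypothetical and the formula is an assumption, so check it against the metrics your rollout actually reports.

```python
def completion_rate(completed: int, failed: int, cancelled: int) -> float:
    """Fraction of accepted tasks finished on time (assumed definition)."""
    total = completed + failed + cancelled
    if total == 0:
        return 0.0  # no tasks accepted; treat the rate as zero
    return completed / total


def completion_band(rate: float) -> str:
    """Map a completion rate in [0, 1] to the bands listed above."""
    if rate >= 0.95:
        return "excellent"
    if rate >= 0.85:
        return "good"
    if rate >= 0.70:
        return "acceptable"
    return "poor"


# Example: 63 on-time tasks, 5 late, 2 cancelled.
rate = completion_rate(completed=63, failed=5, cancelled=2)
```

Under this definition, 63 on-time tasks with 5 failures and 2 cancellations gives a 90% rate, which lands in the "good" band.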
Prestige Levels Achieved
Final prestige in each domain reflects how far the agent climbed the prestige ladder.
Prestige Interpretation
| Avg Prestige | Market Access | Assessment |
|---|---|---|
| 1.0–2.0 | Entry-level tasks only | Agent never climbed prestige |
| 2.0–3.0 | Low-tier tasks | Minimal progression |
| 3.0–5.0 | Mid-tier tasks (profitable) | Good |
| 5.0–7.0 | High-tier tasks (high margin) | Excellent |
| 7.0–10.0 | Elite tasks (maximum difficulty) | Outstanding |
Prestige Balance
Check variance across domains:
- Low variance (e.g., all domains within 1.0 of each other): Balanced strategy ✅
- High variance (e.g., one domain at 7.0, others at 2.0): Narrow specialization ⚠️
In medium and hard presets, most tasks require 2 domains. Agents with high prestige variance (narrow specialization) will be locked out of multi-domain tasks.
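A quick way to check balance is to compute the average, the spread (max minus min), and the variance of the final prestige levels. This is a sketch under assumptions: the domain names are placeholders, and population variance is used because the section does not say which variance definition the benchmark reports.

```python
from statistics import mean, pvariance


def prestige_balance(prestige_by_domain: dict[str, float]) -> dict[str, float]:
    """Summarize how evenly prestige is distributed across domains."""
    levels = list(prestige_by_domain.values())
    return {
        "avg": mean(levels),
        "spread": max(levels) - min(levels),  # "within 1.0 of each other" test
        "variance": pvariance(levels),        # population variance (assumption)
    }


# Hypothetical domain names; one domain at 7.0 and the rest at 2.0.
specialist = prestige_balance(
    {"domain_a": 7.0, "domain_b": 2.0, "domain_c": 2.0, "domain_d": 2.0}
)
is_balanced = specialist["spread"] <= 1.0
```

The spread is the more direct test for the "within 1.0 of each other" rule, while the variance is useful for comparing runs with different numbers of domains.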
Efficiency Metrics
Runway Utilization
How efficiently did the agent use the available time? Agents should aim for a high profit_per_day while maintaining survival.
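The definition of `profit_per_day` is not given in this section. A minimal sketch, assuming it is net change in funds divided by the number of simulated days elapsed (the function signature is hypothetical):

```python
def profit_per_day(starting_funds: float, final_funds: float, days_elapsed: float) -> float:
    """Average daily change in funds over the run (assumed definition)."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return (final_funds - starting_funds) / days_elapsed


# Example: start at $150K, end at $400K after 250 simulated days.
daily = profit_per_day(150_000, 400_000, 250)
```

A bankrupt run ends early, so it should be compared on survival first; a high daily profit over a short, failed run is not a good result.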
Employee Utilization
- Below 50%: Employees idle (under-utilized)
- 50–70%: Reasonable (some idle time for flexibility)
- 70–85%: High utilization (efficient)
- 85%+: Over-commitment risk (no buffer for delays)
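The section lists utilization bands without a formula. The sketch below assumes utilization is the share of available employee-days spent assigned to a task; both the definition and the helper names are assumptions rather than YC-Bench's own.

```python
def employee_utilization(busy_days: list[float], days_elapsed: float) -> float:
    """Busy employee-days divided by available employee-days (assumed definition)."""
    capacity = len(busy_days) * days_elapsed
    if capacity <= 0:
        return 0.0
    return sum(busy_days) / capacity


def utilization_band(utilization: float) -> str:
    """Map utilization in [0, 1] to the bands listed above."""
    if utilization < 0.50:
        return "under-utilized"
    if utilization < 0.70:
        return "reasonable"
    if utilization < 0.85:
        return "high"
    return "over-commitment risk"


# Example: three employees busy for 180, 150, and 120 of 200 days.
utilization = employee_utilization([180, 150, 120], days_elapsed=200)
```

In this example the team is busy for 450 of 600 available employee-days, or 75%, which falls in the high-utilization band.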
Payroll-to-Revenue Ratio
- Below 30%: Excellent margin (high-prestige tasks)
- 30–50%: Good margin
- 50–70%: Thin margin (risky)
- Above 70%: Unsustainable (bankruptcy risk)
Because salaries compound over time (+1% per task), payroll grows throughout the run and pushes the ratio up unless revenue grows faster. A healthy run shows a declining payroll ratio as prestige climbs, because higher task rewards outpace salary growth.
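To make the compounding concrete, the sketch below computes the ratio and the salary after a number of completed tasks. It assumes the ratio is total payroll divided by total task revenue and that the 1% bump applies multiplicatively once per completed task; the helper names are hypothetical.

```python
def payroll_ratio(total_payroll: float, total_revenue: float) -> float:
    """Payroll as a fraction of revenue (assumed definition)."""
    if total_revenue <= 0:
        return float("inf")  # no revenue: any payroll is unsustainable
    return total_payroll / total_revenue


def salary_after_tasks(base_salary: float, tasks_completed: int, bump: float = 0.01) -> float:
    """Salary after compounding a fixed bump once per completed task."""
    return base_salary * (1 + bump) ** tasks_completed


# Example: a $10K salary after 100 completed tasks at +1% each.
grown = salary_after_tasks(10_000, 100)
ratio = payroll_ratio(45_000, 100_000)
```

At 1% per task, 100 completed tasks multiply a salary by roughly 2.7, so task rewards need to grow by at least that factor over the same period just to hold the ratio steady.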
Interpreting Results: Good vs. Great Runs
Good Run
- ✅ Survived with healthy cash reserves ($120K)
- ✅ 90% task success rate (5 failures, 2 cancellations)
- ✅ Climbed to prestige ~4 (mid-tier tasks)
- ✅ Balanced prestige across domains (variance 0.8)
- ✅ Reasonable payroll ratio (45%)
Great Run
- ✅ Survived with excellent cash reserves ($450K)
- ✅ 98% task success rate (near-perfect execution)
- ✅ Climbed to prestige 6–7 (high-tier tasks)
- ✅ Extremely balanced prestige (variance 0.3)
- ✅ Excellent payroll ratio (32%)
Poor Run (But Survived)
- ⚠️ Barely survived ($15K remaining)
- ❌ 67% success rate (many failures/cancellations)
- ❌ Low prestige (never climbed to mid-tier)
- ❌ High prestige variance (narrow specialization)
- ❌ High payroll ratio (68%, a thin margin close to the unsustainable range)
This run survived but demonstrates poor planning: over-commitment (12 failures), poor task selection (8 cancellations), narrow specialization (high variance), and thin margins (68% payroll ratio).
Failure Analysis
When a run ends in bankruptcy, examine:
1. Cash Flow Timeline
- ❌ Long gaps between task completions (revenue droughts)
- ❌ Payroll > revenue in consecutive months
- ❌ Large failed tasks (no revenue, but payroll still paid)
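One of these warning signs, payroll exceeding revenue in consecutive months, is easy to detect once monthly totals are available. The sketch below assumes you have already extracted per-month revenue and payroll lists from the rollout; the function name and input shape are hypothetical.

```python
def longest_deficit_streak(monthly_revenue: list[float], monthly_payroll: list[float]) -> int:
    """Length of the longest run of consecutive months where payroll > revenue."""
    longest = 0
    current = 0
    for revenue, payroll in zip(monthly_revenue, monthly_payroll):
        if payroll > revenue:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a profitable month breaks the streak
    return longest


# Example: months 2 and 3 are consecutive deficits; month 5 is an isolated one.
streak = longest_deficit_streak(
    monthly_revenue=[50_000, 20_000, 10_000, 60_000, 5_000],
    monthly_payroll=[30_000, 30_000, 30_000, 30_000, 30_000],
)
```

A streak of two or more months is worth inspecting alongside the task log, since it usually coincides with a revenue drought or a large failed task.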
2. Task Failure Rate
If tasks_failed / tasks_accepted > 20%:
- Agent is over-committing (accepting too many tasks)
- Agent is under-estimating task duration (poor throughput inference)
- Agent is over-splitting employees (throughput penalty)
3. Prestige Decay
If prestige levels declined over time:
- Agent is not completing enough tasks per domain to offset decay
- Agent specialized too narrowly (unused domains decayed)
- Agent got locked out of market (prestige too low to accept new tasks)
4. Payroll Growth
If payroll grew faster than revenue:
- Agent completed many tasks (salary bumps) but failed to climb prestige
- Agent accepted low-margin tasks (prestige 1–2) that don’t offset compounding payroll
Benchmark Comparison
To compare agents, aggregate metrics across multiple seeds (e.g., 10 runs per agent):
Example Leaderboard
| Agent | Survival Rate | Avg Final Funds | Avg Completion Rate | Avg Prestige |
|---|---|---|---|---|
| GPT-4 | 90% | $185K | 87% | 4.8 |
| Claude Opus | 85% | $210K | 91% | 5.2 |
| Gemini Pro | 70% | $95K | 78% | 3.9 |
| Baseline | 40% | $30K | 65% | 2.5 |
The benchmark is designed to have no ceiling: even the best agents will struggle to achieve a 100% survival rate on the nightmare preset. The goal is to measure relative performance across agents.
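Aggregating across seeds, as in the leaderboard columns above, can be sketched as follows. `RunResult` and its fields are hypothetical stand-ins for whatever your rollout files contain, and simple means are an assumption; medians or confidence intervals may suit small seed counts better.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """Per-seed summary (hypothetical structure, not the YC-Bench schema)."""
    survived: bool
    final_funds: float
    completion_rate: float
    avg_prestige: float


def aggregate(runs: list[RunResult]) -> dict[str, float]:
    """Average the headline metrics over all runs of one agent."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "survival_rate": mean([1.0 if r.survived else 0.0 for r in runs]),
        "avg_final_funds": mean([r.final_funds for r in runs]),
        "avg_completion_rate": mean([r.completion_rate for r in runs]),
        "avg_prestige": mean([r.avg_prestige for r in runs]),
    }


# Example with two seeds: one survivor and one bankruptcy.
summary = aggregate([
    RunResult(True, 200_000, 0.90, 5.0),
    RunResult(False, -10_000, 0.70, 3.0),
])
```

Note that averaging final funds over all runs mixes survivors and bankruptcies; reporting the survivor-only average alongside the survival rate can make the comparison easier to read.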
Custom Scoring Functions
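A custom scoring function could, for example, collapse the headline metrics into one number. The sketch below is illustrative only: the function name, the weights, the $500K funds cap, and the decision to score bankrupt runs as zero are all assumptions, and it is not an API provided by YC-Bench. It relies only on the 1–10 prestige scale described earlier.

```python
def custom_score(
    survived: bool,
    final_funds: float,
    completion_rate: float,
    avg_prestige: float,
    *,
    funds_cap: float = 500_000.0,
    weights: tuple[float, float, float] = (0.5, 0.3, 0.2),
) -> float:
    """Weighted score in [0, 1]; runs that did not survive score 0 (assumption)."""
    if not survived:
        return 0.0
    # Normalize each metric to [0, 1] before weighting.
    funds_term = min(max(final_funds, 0.0), funds_cap) / funds_cap
    completion_term = min(max(completion_rate, 0.0), 1.0)
    prestige_term = min(max((avg_prestige - 1.0) / 9.0, 0.0), 1.0)
    w_funds, w_completion, w_prestige = weights
    return w_funds * funds_term + w_completion * completion_term + w_prestige * prestige_term


# Example: $250K final funds, 90% completion, average prestige 5.5.
score = custom_score(True, 250_000, 0.90, 5.5)
```

Shifting weight toward prestige rewards long-term positioning, while shifting it toward funds rewards short-term profitability.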
Researchers can define custom scoring functions that weight these metrics according to their evaluation priorities.
Observing Results During Run
Agents can monitor progress in real time during a run.
Next Steps
How It Works
Review the core game loop and terminal conditions.
Configuration
Tune difficulty presets to create custom benchmarks.
Development
Understand the codebase architecture and data model.
Task Management
Understand how task outcomes factor into final scores.