YC-Bench includes five calibrated presets that test progressively harder scenarios. Each preset overrides a targeted set of parameters to create a specific challenge for LLM agents.

Available Presets

Tutorial

1-year horizon • Tests basic loop execution
Forgiving environment for testing basic CLI discovery and the accept → assign → dispatch → resume loop.

Easy

1-year horizon • Tests throughput awareness
Single-domain tasks with moderate deadlines. Tests whether agents understand that parallelism dilutes throughput.

Medium

1-year horizon • Tests domain specialization
Prestige ladder active, 2-domain tasks. Agents must specialize in a few domains to unlock higher-reward tiers.

Hard

1-year horizon • Tests capacity planning
Tight deadlines, heavy penalties, limited runway. Requires precise ETA calculation and conservative task acceptance.

Nightmare

1-year horizon • Tests sustained perfect play
Razor-thin margins, aggressive compounding, steep prestige requirements. One mistake cascades into bankruptcy.

Preset Comparison

The table below highlights the key differences between presets:
| Parameter | Tutorial | Easy | Medium | Hard | Nightmare | Default |
| --- | --- | --- | --- | --- | --- | --- |
| Starting Funds | $250,000 | $200,000 | $150,000 | $100,000 | $80,000 | $150,000 |
| Horizon | 1 year | 1 year | 1 year | 1 year | 1 year | 3 years |
| Prestige Mode | Constant 1 | Tri(1,4,1) | Tri(1,7,3) | Tri(1,8,4) | Tri(1,10,5) | Tri(1,10,4) |
| Domain Count | Constant 1 | Constant 1 | Tri(1,3,2) | Tri(1,3,2) | Tri(1,3,2) | Tri(1,3,2) |
| Required Qty | Tri(300,1200,600) | Tri(500,2000,1000) | Tri(700,3000,1500) | Tri(1000,4000,2000) | Tri(1200,5000,2500) | Tri(800,4000,2000) |
| Deadline (qty/day) | 50 | 100 | 150 | 220 | 220 | 200 |
| Fail Penalty | 0.3× | 0.8× | 1.0× | 1.4× | 2.0× | 1.4× |
| Cancel Penalty | 0.5× | 1.2× | 1.5× | 2.0× | 2.5× | 2.0× |
| Salary Bump % | 0% | 0.5% | 1% | 1% | 2% | 1% |
| Reward Scale | 0.2 | 0.3 | 0.45 | 0.55 | 0.7 | 0.55 |
Notation: Tri(low, high, mode) = triangular distribution with given parameters. See Parameters for details.
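
To get a feel for the Tri(low, high, mode) notation, the following minimal sketch samples from a triangular distribution with Python's standard library. Python's `triangular(low, high, mode)` happens to take its arguments in the same order as the notation. The helper names `tri_mean` and `sample_tri` are illustrative only, and whether YC-Bench rounds or clamps sampled values is not covered here.

```python
import random


def tri_mean(low: float, high: float, mode: float) -> float:
    """Mean of a triangular distribution: the average of its three parameters."""
    return (low + high + mode) / 3


def sample_tri(low: float, high: float, mode: float, n: int, seed: int = 0) -> list[float]:
    """Draw n samples from Tri(low, high, mode) using a seeded generator."""
    rng = random.Random(seed)
    return [rng.triangular(low, high, mode) for _ in range(n)]


# Default preset's required-prestige distribution: Tri(1, 10, 4)
samples = sample_tri(1, 10, 4, n=10_000)
print(tri_mean(1, 10, 4))                # 5.0
print(min(samples) >= 1, max(samples) <= 10)
```

The mode is the most likely value, but the mean sits between the mode and the midpoint of the range, so a preset with a low mode and a high upper bound still produces a meaningful share of high-prestige tasks.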

What Each Preset Tests

Tutorial

Key Question: Can the agent execute the basic loop?
# From tutorial.toml
[world.dist.required_prestige]
type = "constant"
value = 1        # ALL tasks accessible immediately

[world.dist.domain_count]
type = "constant"
value = 1        # Single-domain only

deadline_qty_per_day = 50.0  # Very generous deadlines
Economics:
  • Starting runway: ~16 months with 10 employees
  • Monthly payroll: ~$15K
  • Mode task: 1 domain × 600 units, 7-day deadline
  • A single mid-tier employee can finish in 7.4 days
Tests:
  • Does the agent discover the CLI commands?
  • Does it call sim resume to advance time?
  • Can it read JSON output and act on it?

Easy

Key Question: Does the agent understand throughput dilution?
# From easy.toml
[world.dist.required_prestige]
type = "triangular"
low  = 1
high = 4
mode = 1        # Almost all tasks accessible at prestige-1

deadline_qty_per_day = 100.0  # Moderate deadlines
Economics:
  • Starting runway: ~7.8 months with 10 employees
  • Monthly payroll: ~$32K
  • Mode task: 1 domain × 1000 units, 10-day deadline
  • Team throughput on 1 task: 230 units/day → 3 days
  • On 4 parallel tasks: 57 units/day → 12 days (FAIL)
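
The dilution effect can be sketched with a simple model: assume the team's combined rate is split evenly across all active tasks. This is an illustrative assumption, not the simulator's actual code, and the function names are hypothetical. The simulator's own rate and rounding rules may yield different day counts than plain division; the point of the sketch is the pass/fail flip as parallelism grows.

```python
def days_to_finish(required_qty: float, team_rate_per_day: float, active_tasks: int) -> float:
    """Days to finish one task when the team rate is split evenly across active tasks."""
    per_task_rate = team_rate_per_day / active_tasks
    return required_qty / per_task_rate


def meets_deadline(required_qty: float, team_rate_per_day: float,
                   active_tasks: int, deadline_days: float) -> bool:
    """True if the task finishes on or before its deadline under the even-split model."""
    return days_to_finish(required_qty, team_rate_per_day, active_tasks) <= deadline_days


def max_parallel_tasks(required_qty: float, team_rate_per_day: float, deadline_days: float) -> int:
    """Largest number of identical tasks that can run in parallel and still all finish in time."""
    return int(team_rate_per_day * deadline_days // required_qty)


# Easy preset mode task: 1000 units, 10-day deadline, 230 units/day team rate
print(meets_deadline(1000, 230, active_tasks=1, deadline_days=10))
print(meets_deadline(1000, 230, active_tasks=4, deadline_days=10))
print(max_parallel_tasks(1000, 230, deadline_days=10))
```

Under this model, the cap on parallel tasks falls directly out of team rate × deadline ÷ required quantity, which is why sequencing beats batching.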
Tests:
  • Does the agent understand that parallel tasks split employee rates?
  • Does it keep ≤2 tasks active at a time?
  • Can it sequence tasks rather than batch?

Medium

Key Question: Can the agent climb the prestige ladder strategically?
# From medium.toml
[world.dist.required_prestige]
type = "triangular"
low  = 1
high = 7
mode = 3        # Most tasks need prestige 2–4

[world.dist.domain_count]
type = "triangular"
low  = 1
high = 3
mode = 2        # Most tasks need 2 domains

reward_prestige_scale = 0.45  # Climbing prestige doubles income
Economics:
  • Starting runway: ~7.8 months with 10 employees
  • Monthly payroll: ~$32K
  • Mode task: 2 domains × 1500 units, 10-day deadline
  • Prestige-1 reward: ~$30K
  • Prestige-4 reward: ~$30K × 2.35 = ~$70K
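
The 2.35 factor above is consistent with a linear multiplier of the form 1 + reward_prestige_scale × (prestige − 1). That formula is inferred from the figures on this page (it also reproduces the Nightmare preset's ~$114K at prestige 5 with a scale of 0.7); it is an assumption, not a confirmed description of the simulator. A minimal sketch with hypothetical function names:

```python
def prestige_multiplier(prestige: int, reward_prestige_scale: float) -> float:
    """Assumed linear reward multiplier: 1.0 at prestige 1, growing by `scale` per level."""
    return 1 + reward_prestige_scale * (prestige - 1)


def task_reward(base_reward: float, prestige: int, reward_prestige_scale: float) -> float:
    """Reward for a task at the given prestige tier under the assumed linear model."""
    return base_reward * prestige_multiplier(prestige, reward_prestige_scale)


# Medium preset: ~$30K base reward, reward_prestige_scale = 0.45
for level in (1, 2, 3, 4):
    print(level, round(task_reward(30_000, level, 0.45)))
```

If the model holds, each prestige level on Medium adds about 45% of the base reward, so the first few levels of a climb pay back quickly.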
Tests:
  • Does the agent understand prestige gates market access?
  • Does it specialize in 2–3 domains rather than spreading thin?
  • Can it handle 2-domain task assignments effectively?

Hard

Key Question: Can the agent compute ETAs and never overcommit?
# From hard.toml
initial_funds_cents = 10_000_000  # $100,000 — tight runway

deadline_qty_per_day = 220.0      # Tight deadlines

penalty_fail_multiplier   = 1.4   # Mistakes cost real prestige
penalty_cancel_multiplier = 2.0

salary_bump_pct = 0.01            # Noticeable compounding
Economics:
  • Starting runway: ~5.4 months with 10 employees
  • Monthly payroll: ~$46K
  • Mode task: 2 domains × 2000 units, 9-day deadline
  • Split 4+3 employees: finishes in 8.7 days (just fits!)
  • Dispatching a second task splits all rates → both tasks miss
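
An ETA check for a multi-domain task can be sketched as follows, assuming each domain needs the full required quantity and the task finishes when its slowest domain finishes. The per-domain rates are hypothetical values chosen so that the slower domain lands near the 8.7-day figure above; real rates depend on which employees are assigned. The even split across active tasks is likewise an assumption.

```python
def task_eta_days(required_qty_per_domain: float, domain_rates: list[float],
                  active_tasks: int = 1) -> float:
    """ETA of a multi-domain task: the slowest domain determines completion.

    Each domain's rate is divided evenly across active tasks (assumed model).
    """
    return max(required_qty_per_domain / (rate / active_tasks) for rate in domain_rates)


# Hypothetical per-domain team rates in units/day (not taken from the simulator)
HYPOTHETICAL_RATES = [310.0, 230.0]
DEADLINE_DAYS = 9

eta_single = task_eta_days(2000, HYPOTHETICAL_RATES)
eta_double = task_eta_days(2000, HYPOTHETICAL_RATES, active_tasks=2)

print(f"one task:  {eta_single:.1f} days, fits: {eta_single <= DEADLINE_DAYS}")
print(f"two tasks: {eta_double:.1f} days, fits: {eta_double <= DEADLINE_DAYS}")
```

With so little slack on a single task, any second dispatch pushes both tasks past the deadline under this model, which is the overcommitment trap the Hard preset is built around.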
Tests:
  • Can the agent estimate completion time vs. deadline?
  • Does it understand that new dispatches degrade existing tasks?
  • Can it manage cash flow with 5.4-month runway?
  • Does it resist “tempting” high-reward tasks it can’t finish?

Nightmare

Key Question: Can the agent sustain perfect play for an entire year?
# From nightmare.toml
initial_funds_cents = 8_000_000   # $80,000 — razor-thin runway

[world.dist.required_prestige]
type = "triangular"
low  = 1
high = 10
mode = 5        # Most tasks need prestige 4–6

penalty_fail_multiplier   = 2.0   # Catastrophic penalties
penalty_cancel_multiplier = 2.5

salary_bump_pct = 0.02            # Aggressive compounding
reward_prestige_scale = 0.7       # Steep reward curve
Economics:
  • Starting runway: ~4.8 months with 10 employees
  • Monthly payroll: ~$52K initially, grows 30–50% over the year
  • Mode task: 2 domains × 2500 units, 11-day deadline
  • Revenue at prestige-1: ~$30K (net: -$22K/month)
  • Revenue at prestige-5: ~$114K (now profitable)
  • The race: Climb to prestige 5 before month 5 or die
Tests:
  • Can the agent survive a 4.8-month clock to profitability?
  • Does it plan a prestige climb path across 2–3 domains?
  • Can it handle 3-domain assignments without throughput collapse?
  • Does it account for salary growth in long-term planning?
  • Can it resist every temptation to over-accept?

Specifying a Preset

Use the --config flag to specify a preset:
yc-bench run --config tutorial
yc-bench run --config nightmare
Available preset names:
  • tutorial
  • easy
  • medium
  • hard
  • nightmare
  • default (the 3-year hardened benchmark)
If no --config is specified, YC-Bench uses the default preset, which is the canonical 3-year benchmark configuration.
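
When scripting runs across several presets, a small helper can validate the preset name before building the command. The sketch below uses only the `yc-bench run --config <name>` form shown above; the `PRESETS` tuple and `build_run_command` are hypothetical helpers, and the name check covers only the presets listed on this page.

```python
import shlex

# Preset names as listed on this page
PRESETS = ("tutorial", "easy", "medium", "hard", "nightmare", "default")


def build_run_command(preset: str = "default") -> list[str]:
    """Return the argv list for running YC-Bench with a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return ["yc-bench", "run", "--config", preset]


# Print the commands for the five difficulty presets, easiest first
for name in PRESETS[:5]:
    print(shlex.join(build_run_command(name)))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues.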

Preset Inheritance

All presets use extends = "default" and override only specific parameters. This means:
  1. Every parameter not explicitly overridden inherits from default.toml
  2. Parameters like num_employees, num_market_tasks, and salary tier distributions are consistent across presets
  3. You can inspect the full effective configuration by examining both files
Example: The tutorial preset only overrides 13 parameters but inherits 50+ others from default.
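
The inheritance rule can be illustrated with a small merge over already-parsed configuration dictionaries. This is a sketch of the idea, not YC-Bench's loader: the placement of `deadline_qty_per_day` and `num_employees` at the top level is hypothetical, and replacing distribution tables wholesale (rather than merging them key by key) is an assumption made so that a `constant` override does not keep stale `low`/`high`/`mode` keys.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return base with override applied; nested tables merge recursively.

    Tables that carry a "type" key (distributions) are replaced as a whole.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if key == "extends":
            continue  # directive for the loader, not a parameter
        is_table = isinstance(value, dict) and isinstance(merged.get(key), dict)
        if is_table and "type" not in value:
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Excerpts standing in for parsed default.toml and tutorial.toml (layout is hypothetical)
default_cfg = {
    "num_employees": 10,
    "deadline_qty_per_day": 200.0,
    "world": {"dist": {
        "required_prestige": {"type": "triangular", "low": 1, "high": 10, "mode": 4},
        "domain_count": {"type": "triangular", "low": 1, "high": 3, "mode": 2},
    }},
}
tutorial_cfg = {
    "extends": "default",
    "deadline_qty_per_day": 50.0,
    "world": {"dist": {
        "required_prestige": {"type": "constant", "value": 1},
        "domain_count": {"type": "constant", "value": 1},
    }},
}

effective = merge_config(default_cfg, tutorial_cfg)
print(effective["num_employees"])          # inherited from default
print(effective["deadline_qty_per_day"])   # overridden by tutorial
print(effective["world"]["dist"]["required_prestige"])
```

Reading a preset file alone therefore shows only the differences; the effective configuration is the default with those differences layered on top.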

Next Steps

Parameters

Complete reference of all tunable parameters in default.toml

Tuning

Learn how to create your own custom presets