Long-horizon training environments

Verified Environments for Coding Agents and Post-Training Loops

The Synthetic Data Company builds and sells verifiable training environments for AI labs running RL and post-training. Our 1,000+ environments across 30+ domains pair binary rewards with rich feedback signals: detailed test failures, compiler diagnostics, LLM critique, debugging traces, and annotated-image review for stronger long-horizon agent performance.

1,000+

Verifiable environments

30+

Domain categories

Hours → Weeks+

Task horizon

The Evidence

More Environments Produce Better Models

Model performance scales predictably with the number of verified training environments. More environments, covering more real workflows, produce agents that generalize better to production work.

Y-axis: Average Ranking (lower is better)

Data from Qwen 3.5 Technical Report (Alibaba, 2026). Average ranking computed across BFCL-V4, VITA-Bench, DeepPlanning, Tool-Decathlon, and MCP-Mark.

INSIDE AN ENVIRONMENT

One Environment, Three Agents, Three Days

Each environment is a long-horizon project with structured milestones. Different agents find different paths.

Illustrative. Agents evaluated on the C compiler environment with four milestone tiers. Hover shows percent progress, where 100% means the compiler builds Linux. The time axis is compressed from one minute to three days: early progress is fast, deep milestones take sustained work.

THE ENVOI SDK

Define Environments. Evaluate Agents.

environment.py

illustrative

import envoi

basics = envoi.suite("basics")
torture = envoi.suite("torture")

@envoi.setup
async def build_compiler(submission: envoi.Documents):
    result = await envoi.run("chmod +x build.sh && ./build.sh")
    if result.exit_code != 0:
        raise RuntimeError(f"Build failed:\n{result.stderr}")

@basics.test
async def variables():
    """Compile and run variable declaration tests."""
    result = await envoi.run("./cc tests/variables.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stdout": result.stdout}

@torture.test("gcc_{test_id}")
async def gcc_torture(test_id: str):
    """Run a single GCC torture test case."""
    result = await envoi.run(f"./cc torture/{test_id}.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stderr": result.stderr}

Step 1 — Define

Your task. Your rules.

An environment is a self-contained project with milestones that define what 'correct' means. Write async Python — envoi handles sandboxing, file transfer, and process isolation.

  • Structured milestones: group tests into suites. Each suite is a checkpoint agents must reach, from basics through torture tests to the full build.
  • Real toolchains, real builds: agents compile code, run binaries, and pass test harnesses. No multiple choice, just build the thing.
  • Rich diagnostics: return whatever you want, from pass/fail and stdout to diffs and images. The training loop gets structured results, not just a score.
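The rich-diagnostics idea is easy to sketch with the standard library alone. Here is a minimal helper a test could use to return a readable diff alongside pass/fail (the name `make_diff` is our illustration, not part of the envoi API):

```python
import difflib

def make_diff(expected: str, actual: str) -> str:
    """Unified diff between expected and actual program output.

    Returning the diff itself, not just a boolean, lets the
    training loop see *why* a run failed, not only that it failed.
    """
    return "".join(difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected", tofile="actual",
    ))
```

A test would then return something like `{"passed": actual == expected, "diff": make_diff(expected, actual)}`.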

Step 2 — Evaluate

One call. Full signal.

Connect to a running environment, submit your agent's output, and get back structured results. Run from your training loop, your CI, or a notebook.

  • Session-based evaluation: each evaluation runs in an isolated session. No cross-contamination between runs, no cleanup needed.
  • Granular or holistic: run all suites at once, or target a specific milestone. Get exactly the signal you need for each training step.
  • Drop-in integration: three lines to connect, submit, and score. Fits into any RL loop, any CI pipeline, any eval harness.

evaluate.py

illustrative

import envoi

async with await envoi.connect_session(
    "https://path-to-environment.com",
    submission=envoi.Documents("./agent-output/src"),
) as session:
    # Run all tests -- get structured results
    results = await session.test()

    # Or run a specific milestone
    basics = await session.test("basics")
    torture = await session.test("torture")

    # Each result is whatever the environment returns --
    # pass/fail, diffs, images, diagnostics, scores.
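Structured results like these reduce naturally to a scalar reward for a training step. A minimal sketch, assuming each per-test result is a dict with a boolean `passed` field as in the environment.py example (the aggregation itself is our illustration, not part of envoi):

```python
def suite_reward(results: list[dict]) -> float:
    """Fraction of tests passed -- a dense scalar reward
    derived from a suite's structured per-test results."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("passed")) / len(results)

# e.g. in a training loop: reward = suite_reward(await session.test("basics"))
```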

envoi is open-source

Private data. Verified environments.
Production-ready agents.

Tell us what you need — we will scope availability, anonymization, and pricing.