Long-horizon training environments

Verified Environments for Coding Agents and Post-Training Loops

The Synthetic Data Company builds and sells verifiable training environments for AI labs running RL and post-training. Our 1,000+ environments across 30+ domains pair binary rewards with rich feedback signals: detailed test failures, compiler diagnostics, LLM critique, debugging traces, and annotated-image review for stronger long-horizon agent performance.

1,000+

Verifiable environments

30+

Domain categories

Hours → Weeks+

Task horizon

The Evidence

More Environments Produce Better Models

Model performance scales predictably with the number of verified training environments. More environments, covering more real workflows, produce agents that generalize better to production work.

Y-axis: Average Ranking (lower is better)

Data from Qwen 3.5 Technical Report (Alibaba, 2026). Average ranking computed across BFCL-V4, VITA-Bench, DeepPlanning, Tool-Decathlon, and MCP-Mark.

INSIDE AN ENVIRONMENT

One Environment, Three Agents, Three Days

Each environment is a long-horizon project with structured milestones. Different agents find different paths.

Illustrative. Agents evaluated on the C compiler environment with four milestone tiers. Hover shows percent progress, where 100% means the compiler builds Linux. The time axis is compressed from one minute to three days: early progress is fast, deep milestones take sustained work.

THE ENVOI SDK

Define Environments. Evaluate Agents.

environment.py

illustrative

import envoi

basics = envoi.suite("basics")
torture = envoi.suite("torture")

@envoi.setup
async def build_compiler(submission: envoi.Documents):
    result = await envoi.run("chmod +x build.sh && ./build.sh")
    if result.exit_code != 0:
        raise RuntimeError(f"Build failed:\n{result.stderr}")

@basics.test
async def variables():
    """Compile and run variable declaration tests."""
    result = await envoi.run("./cc tests/variables.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stdout": result.stdout}

@torture.test("gcc_{test_id}")
async def gcc_torture(test_id: str):
    """Run a single GCC torture test case."""
    result = await envoi.run(f"./cc torture/{test_id}.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stderr": result.stderr}

Step 1 — Define

Your task. Your rules.

An environment is a self-contained project with milestones that define what 'correct' means. Write async Python — envoi handles sandboxing, file transfer, and process isolation.

  • Structured milestones: group tests into suites. Each suite is a checkpoint agents must reach, from basics through torture tests to the full build.
  • Real toolchains, real builds: agents compile code, run binaries, and pass test harnesses. No multiple choice, just build the thing.
  • Rich diagnostics: return whatever you want, from pass/fail and stdout to diffs and images. The training loop gets structured results, not just a score.
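The rich-diagnostics idea is easy to sketch with the standard library alone. Here is a minimal helper a test could use to return a readable diff alongside pass/fail (the name `make_diff` is our illustration, not part of the envoi API):

```python
import difflib

def make_diff(expected: str, actual: str) -> str:
    """Unified diff between expected and actual program output.

    Returning the diff itself, not just a boolean, lets the
    training loop see *why* a run failed, not only that it failed.
    """
    return "".join(difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected", tofile="actual",
    ))
```

A test would then return something like `{"passed": actual == expected, "diff": make_diff(expected, actual)}`.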

Step 2 — Evaluate

One call. Full signal.

Connect to a running environment, submit your agent's output, and get back structured results. Run from your training loop, your CI, or a notebook.

  • Session-based evaluation: each evaluation runs in an isolated session. No cross-contamination between runs, no cleanup needed.
  • Granular or holistic: run all suites at once, or target a specific milestone. Get exactly the signal you need for each training step.
  • Drop-in integration: three lines to connect, submit, and score. Fits into any RL loop, any CI pipeline, any eval harness.

evaluate.py

illustrative

import envoi

async with await envoi.connect_session(
    "https://path-to-environment.com",
    submission=envoi.Documents("./agent-output/src"),
) as session:
    # Run all tests -- get structured results
    results = await session.test()

    # Or run a specific milestone
    basics = await session.test("basics")
    torture = await session.test("torture")

    # Each result is whatever the environment returns --
    # pass/fail, diffs, images, diagnostics, scores.
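Structured results like these reduce naturally to a scalar reward for a training step. A minimal sketch, assuming each per-test result is a dict with a boolean `passed` field as in the environment.py example (the aggregation itself is our illustration, not part of envoi):

```python
def suite_reward(results: list[dict]) -> float:
    """Fraction of tests passed -- a dense scalar reward
    derived from a suite's structured per-test results."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("passed")) / len(results)

# e.g. in a training loop: reward = suite_reward(await session.test("basics"))
```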

envoi is open-source

Private data. Verified environments.
Production-ready agents.

Tell us what you need — we will scope availability, anonymization, and pricing.