Open Source
Envoi
An open-source framework for building verified long-horizon training environments. Define tasks, write verifiers, and generate environments with structured reward signals.
INSIDE AN ENVIRONMENT
One Environment, Three Agents, Three Days
Each environment is a long-horizon project with structured milestones. Different agents find different paths.
Illustrative. Agents evaluated on the C compiler environment with four milestone tiers. Hover shows percent progress, where 100% means the compiler builds Linux. The time axis is compressed from one minute to three days: early progress is fast, deep milestones take sustained work.
THE ENVOI SDK
Define Environments. Evaluate Agents.
environment.py
illustrative
import envoi

basics = envoi.suite("basics")
torture = envoi.suite("torture")

@envoi.setup
async def build_compiler(submission: envoi.Documents):
    result = await envoi.run("chmod +x build.sh && ./build.sh")
    if result.exit_code != 0:
        raise RuntimeError(f"Build failed:\n{result.stderr}")

@basics.test
async def variables():
    """Compile and run variable declaration tests."""
    result = await envoi.run("./cc tests/variables.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stdout": result.stdout}

@torture.test("gcc_{test_id}")
async def gcc_torture(test_id: str):
    """Run a single GCC torture test case."""
    result = await envoi.run(f"./cc torture/{test_id}.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stderr": result.stderr}

Step 1 — Define
Your task. Your rules.
An environment is a self-contained project with milestones that define what 'correct' means. Write async Python — envoi handles sandboxing, file transfer, and process isolation.
- Structured milestones. Group tests into suites. Each suite is a checkpoint agents must reach — basics, torture tests, the full build.
- Real toolchains, real builds. Agents compile code, run binaries, and pass test harnesses. No multiple choice — just build the thing.
- Rich diagnostics. Return whatever you want — pass/fail, stdout, diffs, images. The training loop gets structured results, not just a score.
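As a sketch of what a rich diagnostic payload could look like, a verifier might return a unified diff alongside pass/fail. The helper below is hypothetical — it is not part of the envoi API — and uses only the standard library:

```python
import difflib

def diff_diagnostics(expected: str, actual: str) -> dict:
    """Build a structured result: pass/fail plus a unified diff on failure."""
    passed = expected == actual
    diff = "" if passed else "\n".join(
        difflib.unified_diff(
            expected.splitlines(), actual.splitlines(),
            fromfile="expected", tofile="actual", lineterm="",
        )
    )
    return {"passed": passed, "diff": diff}
```

A test could return `diff_diagnostics(golden_output, result.stdout)` instead of a bare boolean, so the training loop sees exactly where the agent's output diverged.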
Step 2 — Evaluate
One call. Full signal.
Connect to a running environment, submit your agent's output, and get back structured results. Run from your training loop, your CI, or a notebook.
- Session-based evaluation. Each evaluation runs in an isolated session. No cross-contamination between runs, no cleanup needed.
- Granular or holistic. Run all suites at once, or target a specific milestone. Get exactly the signal you need for each training step.
- Drop-in integration. Three lines to connect, submit, and score. Fits into any RL loop, any CI pipeline, any eval harness.
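Inside an RL loop, the structured results can be collapsed into a scalar reward however you like. A minimal sketch, assuming each suite's results arrive as a list of dicts with a `passed` key (the exact result shape here is illustrative, not prescribed by envoi):

```python
def reward_from_results(results: dict[str, list[dict]]) -> float:
    """Mean per-suite pass rate -> scalar reward in [0, 1]."""
    rates = []
    for suite, tests in results.items():
        if tests:  # skip empty suites rather than dividing by zero
            rates.append(sum(t["passed"] for t in tests) / len(tests))
    return sum(rates) / len(rates) if rates else 0.0
```

Averaging per suite (rather than over all tests) weights each milestone equally, so a huge torture suite cannot drown out the basics checkpoint.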
evaluate.py
illustrative
import envoi

async with await envoi.connect_session(
    "https://path-to-environment.com",
    submission=envoi.Documents("./agent-output/src"),
) as session:
    # Run all tests -- get structured results
    results = await session.test()

    # Or run a specific milestone
    basics = await session.test("basics")
    torture = await session.test("torture")

    # Each result is whatever the environment returns --
    # pass/fail, diffs, images, diagnostics, scores.

Private data. Verified environments.
Production-ready agents.
Tell us what you need — we will scope availability, anonymization, and pricing.