Open Source
Envoi
An open-source framework for building verified long-horizon training environments. Define tasks, write verifiers, and generate environments with structured reward signals.
INSIDE AN ENVIRONMENT
One Environment, Three Agents, Three Days
Each environment is a long-horizon project with structured milestones. Different agents find different paths.
Illustrative. Agents evaluated on the C compiler environment with four milestone tiers. Hover shows percent progress, where 100% means the compiler builds Linux. The time axis is compressed from one minute to three days: early progress is fast, deep milestones take sustained work.
THE ENVOI SDK
Define Environments. Evaluate Agents.
environment.py
illustrative
import envoi

basics = envoi.suite("basics")
torture = envoi.suite("torture")

@envoi.setup
async def build_compiler(submission: envoi.Documents):
    result = await envoi.run("chmod +x build.sh && ./build.sh")
    if result.exit_code != 0:
        raise RuntimeError(f"Build failed:\n{result.stderr}")

@basics.test
async def variables():
    """Compile and run variable declaration tests."""
    result = await envoi.run("./cc tests/variables.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stdout": result.stdout}

@torture.test("gcc_{test_id}")
async def gcc_torture(test_id: str):
    """Run a single GCC torture test case."""
    result = await envoi.run(f"./cc torture/{test_id}.c -o out && ./out")
    return {"passed": result.exit_code == 0,
            "stderr": result.stderr}

Step 1 — Define
Your task. Your rules.
An environment is a self-contained project with milestones that define what 'correct' means. Write async Python — envoi handles sandboxing, file transfer, and process isolation.
- Structured milestones. Group tests into suites. Each suite is a checkpoint agents must reach — basics, torture tests, the full build.
- Real toolchains, real builds. Agents compile code, run binaries, and pass test harnesses. No multiple choice — just build the thing.
- Rich diagnostics. Return whatever you want — pass/fail, stdout, diffs, images. The training loop gets structured results, not just a score.
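As a sketch of what a rich diagnostic payload could look like, a verifier might return a unified diff alongside pass/fail. The helper below is hypothetical — it is not part of the envoi API — and uses only the standard library:

```python
import difflib

def diff_diagnostics(expected: str, actual: str) -> dict:
    """Build a structured result: pass/fail plus a unified diff on failure."""
    passed = expected == actual
    diff = "" if passed else "\n".join(
        difflib.unified_diff(
            expected.splitlines(), actual.splitlines(),
            fromfile="expected", tofile="actual", lineterm="",
        )
    )
    return {"passed": passed, "diff": diff}
```

A test could return `diff_diagnostics(golden_output, result.stdout)` instead of a bare boolean, so the training loop sees exactly where the agent's output diverged.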
Step 2 — Evaluate
One call. Full signal.
Connect to a running environment, submit your agent's output, and get back structured results. Run from your training loop, your CI, or a notebook.
- Session-based evaluation. Each evaluation runs in an isolated session. No cross-contamination between runs, no cleanup needed.
- Granular or holistic. Run all suites at once, or target a specific milestone. Get exactly the signal you need for each training step.
- Drop-in integration. Three lines to connect, submit, and score. Fits into any RL loop, any CI pipeline, any eval harness.
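Inside an RL loop, the structured results can be collapsed into a scalar reward however you like. A minimal sketch, assuming each suite's results arrive as a list of dicts with a `passed` key (the exact result shape here is illustrative, not prescribed by envoi):

```python
def reward_from_results(results: dict[str, list[dict]]) -> float:
    """Mean per-suite pass rate -> scalar reward in [0, 1]."""
    rates = []
    for suite, tests in results.items():
        if tests:  # skip empty suites rather than dividing by zero
            rates.append(sum(t["passed"] for t in tests) / len(tests))
    return sum(rates) / len(rates) if rates else 0.0
```

Averaging per suite (rather than over all tests) weights each milestone equally, so a huge torture suite cannot drown out the basics checkpoint.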
evaluate.py
illustrative
import envoi

async with await envoi.connect_session(
    "https://path-to-environment.com",
    submission=envoi.Documents("./agent-output/src"),
) as session:
    # Run all tests -- get structured results
    results = await session.test()

    # Or run a specific milestone
    basics = await session.test("basics")
    torture = await session.test("torture")

    # Each result is whatever the environment returns --
    # pass/fail, diffs, images, diagnostics, scores.

Private data. Verified environments.
Production-ready agents.
Tell us what you need — we will scope availability, anonymization, and pricing.