The Synthetic Data Team • July 14, 2025
Simulating AI Agents at Very Large Scale

The past year has been marked by rapid breakthroughs in reasoning models, uncovering a new dimension of AI scalability: test-time compute.
In prior years, most AI scaling focused on pre-training compute, which meant collecting large amounts of internet data and training ever larger models. While this approach worked well for several years, it quickly ran into limits once we had exhausted most of the usable data on the internet.
In order to scale further, we need a new source of data: synthetic data.
The Age of Simulation
With the o-series models, OpenAI has demonstrated that a way forward for scaling performance is to let the model think longer at test time, producing an extended chain of thought before it answers.
In practice, this involves allowing the model to "think" as long as it needs before producing a final answer.
<think>
I need to think long and hard about the question....
</think>
Final answer is 42
Moreover, we want models to do more than act as mere chatbots. We want models that can interact with the world and act on our behalf. We want agents.
The question becomes: how do we train thinking agents, especially when we have already exhausted most of the usable data on the internet?
The Synthetic Data Approach
Our conviction is that the vast majority of this training data will come from building large numbers of realistic digital environments for existing AIs to learn from.
As demonstrated by recent results from OpenAI, xAI, and Kimi, scaling synthetic data can yield vast performance gains. But in order to avoid reward hacking, the synthetic data must be generated in high-fidelity, high-signal environments.
Reinforcement Learning with Verifiable Rewards (RLVR)
Traditional data labeling is dead. While companies like Scale AI revolutionized human-in-the-loop labeling, they're fundamentally limited by human speed and subjective judgment. The future belongs to verifier-driven synthetic data generation.
Here's how RLVR changes everything:
Instead of humans manually labeling "correct" or "incorrect," engineers write intelligent verifiers—programmatic judges that can evaluate agent behavior at superhuman speed and consistency. These verifiers don't just check binary outcomes; they assess complex behaviors across thousands of simulated environments simultaneously.
Consider a simple e-commerce task: "Add items to cart and complete checkout." Traditional labeling would require humans to manually verify each attempt. With RLVR, a verifier programmatically checks (see the sketch after this list):
- Did the item get added to the cart?
- Was the quantity correct?
- Did the checkout flow complete successfully?
- Were edge cases handled properly?
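To make this concrete, here is a minimal Python sketch of such a verifier. The CartState structure, its field names, and verify_checkout are illustrative assumptions rather than a real API; an actual verifier would read this state out of the simulated shop.

from dataclasses import dataclass

@dataclass
class CartState:
    items: dict              # item_id -> quantity actually present in the cart
    checkout_complete: bool  # did the checkout flow reach a confirmed order?

def verify_checkout(state: CartState, expected_items: dict) -> bool:
    # Every expected item must be in the cart with the right quantity.
    for item_id, quantity in expected_items.items():
        if state.items.get(item_id) != quantity:
            return False
    # And the purchase must actually have gone through.
    return state.checkout_complete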
The AI agent attempts the task repeatedly, learning from each failure until it satisfies all verifier conditions. This creates a virtuous cycle: the agent improves through trial and error while generating massive amounts of high-quality training data.
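The loop itself can be sketched in a few lines. The agent, env, and verify interfaces below are hypothetical placeholders; the point is that each attempt is scored automatically and every scored trajectory becomes training data.

def collect_rollouts(agent, env, task, num_attempts: int):
    dataset = []
    for _ in range(num_attempts):
        state = env.reset(task)                # fresh, isolated simulation
        trajectory = agent.attempt(state)      # the agent tries the task end to end
        reward = 1.0 if env.verify(trajectory) else 0.0  # verifier grades the attempt
        agent.update(trajectory, reward)       # learn from success and failure alike
        dataset.append((trajectory, reward))   # every attempt becomes training data
    return dataset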
Scaling the Unscalable
The real challenge isn't writing verifiers—it's orchestrating them at scale. The Synthetic Data Company provides:
Infrastructure at Scale: Thousands of Docker containers and browser environments spinning up on-demand, each running isolated simulations where AI agents can safely fail, learn, and improve.
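As a rough illustration of what "on-demand" means here, the sketch below launches one throwaway simulation container with the docker Python SDK. The ecommerce-sim image name and TASK_ID variable are made up for illustration; the real orchestration layer does this across thousands of machines.

import docker

client = docker.from_env()

def launch_rollout_env(task_id: str):
    # Start one isolated, disposable container in which an agent can attempt a task.
    return client.containers.run(
        "ecommerce-sim:latest",           # hypothetical simulation image
        environment={"TASK_ID": task_id},
        detach=True,                      # return immediately; the sim runs in the background
        auto_remove=True,                 # clean the container up when the rollout ends
        network_disabled=True,            # keep each rollout isolated from the outside world
    )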
Verifier Engineering Teams: Just as Scale AI delivers labelers, we deliver expert engineers who write sophisticated verifiers. But unlike a labeler, who processes one example at a time, each verifier can evaluate thousands of agent attempts.
Adaptive Complexity: Verifiers can themselves call LLMs, enabling meta-learning scenarios where the evaluation criteria evolve based on agent performance. This creates training data of unprecedented sophistication.
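One way to build such a verifier is to have it delegate the fuzzier criteria to an LLM judge. The sketch below uses the OpenAI Python SDK; the rubric, prompt, and model name are illustrative assumptions rather than a prescribed setup.

from openai import OpenAI

client = OpenAI()

def llm_judge(transcript: str, rubric: str) -> bool:
    # Ask an LLM to grade the agent's transcript against a rubric, returning pass/fail.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a strict grader. Answer PASS or FAIL only."},
            {"role": "user", "content": f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")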
The Compound Advantage
Traditional labeling faces diminishing returns—more labelers mean more inconsistency and higher costs. RLVR exhibits increasing returns:
- Speed: Verifiers evaluate thousands of attempts per second vs. human minutes per label
- Consistency: Code doesn't get tired, biased, or inconsistent
- Complexity: Verifiers can assess multi-step reasoning and long-horizon tasks that would overwhelm human labelers
- Cost: Once written, verifiers run indefinitely at negligible marginal cost
Why Frontier Labs Are Making the Switch
The math is simple. To reach AGI, models need to master millions of complex tasks across every domain of human work. Manual labeling would require armies of experts and decades of effort. RLVR compresses this to months.
Leading AI labs are already seeing 10-100x improvements in data generation speed while maintaining higher quality than human labels. As models become more capable, they need increasingly sophisticated training scenarios—exactly what programmatic verifiers excel at creating.
The Future is Simulated
We're entering an era where AI improvement is bottlenecked not by compute or algorithms, but by access to high-quality training scenarios. The companies that can generate the most realistic, diverse, and challenging simulations will build the most capable agents.
The Synthetic Data Company isn't just another data labeling service. We're building the infrastructure for AI to learn from infinite simulated experiences—the same way humans learn from the real world, but at digital speed and scale.
The age of manual labeling is over. The age of simulation has begun.
Want to accelerate your model's capabilities with RLVR? Contact us →