Aug 14, 2025

Introducing Snowglobe

If you’ve built AI agents, you know how challenging it is to test them. How do you even begin formulating a test plan for a technology whose input space is infinite?

Most teams fall back on a small ‘golden’ dataset, maybe 50 to 100 hand-picked examples. It takes ages to put together, and even then, it only covers the happy paths, missing the messy reality of real users. That’s how you end up with an agent that’s perfect in development, but starts hallucinating, going off-topic, or breaking policies as soon as it meets real-world scenarios.

Snowglobe fixes this problem with a high-fidelity simulation engine for conversational AI agents. It creates realistic personas that interact with your agent across diverse simulated scenarios before you go to production.

Our customers are already generating tens of thousands of simulated conversations with Snowglobe, compressing what would be weeks of manual scenario handcrafting and catching potential issues before they ever reach production.

Why we built this

Before Snowglobe, I spent a number of years building self-driving cars. Oddly enough, self-driving cars and AI agents share some unexpected similarities, including that testing them is really hard because their input domain is infinite!

In self-driving, we dealt with this problem by building high-fidelity simulation engines that let you systematically test that your car behaved well even in highly risky edge-case scenarios. Nowhere was this demonstrated more than at Waymo, which logged 20+ million miles on real roads but 20+ BILLION miles in simulation before launching.

Our goal was simple: stand on the shoulders of giants and build a general-purpose simulation engine that gives AI engineers the same reliability that autonomous cars have.

How we’re different

  • We perform rich persona modeling to ensure realism and diversity in the scenarios we generate. This is not the same as asking ChatGPT to generate synthetic data for you, which tends to sound like the same ChatGPT voice created all of it.

  • We manage the orchestration for generating stateful conversations with many back-and-forths.

  • Our scenarios are grounded in your agent’s context, so the conversations you end up with are realistic to your use case rather than a generic set.

  • Unlike conventional red-teaming tools, we don’t assume every scenario is adversarial, so you get to see how your AI agent behaves with normal users who are simply using it in a variety of different ways.

  • Easily export your scenarios as datasets to Hugging Face or your favorite eval & tracing tool.
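To make the ideas above concrete, here is a minimal sketch of what a persona-driven, stateful multi-turn simulation loop with a JSONL export could look like. The persona and agent functions are hypothetical deterministic stubs (in a real system both would be LLM-backed), and none of this reflects Snowglobe's actual API:

```python
import json

# Hypothetical stand-ins: a persona model and the agent under test.
# Deterministic stubs keep the orchestration loop runnable end to end.
def persona_reply(persona, history):
    """Generate the persona's next message given the conversation so far."""
    turn = len(history) // 2
    return f"[{persona['name']}] question {turn + 1} about {persona['topic']}"

def agent_reply(history):
    """The agent under test responds to the latest persona message."""
    return f"answer to: {history[-1]['content']}"

def simulate_conversation(persona, num_turns=3):
    """Run a stateful back-and-forth between a persona and the agent."""
    history = []
    for _ in range(num_turns):
        history.append({"role": "user", "content": persona_reply(persona, history)})
        history.append({"role": "assistant", "content": agent_reply(history)})
    return {"persona": persona["name"], "messages": history}

def export_jsonl(conversations, path):
    """Write conversations as JSONL, a format Hugging Face `datasets`
    and most eval/tracing tools can ingest directly."""
    with open(path, "w") as f:
        for conv in conversations:
            f.write(json.dumps(conv) + "\n")

# Two illustrative (made-up) personas grounded in a support use case.
personas = [
    {"name": "impatient_shopper", "topic": "refunds"},
    {"name": "confused_newcomer", "topic": "account setup"},
]
conversations = [simulate_conversation(p) for p in personas]
export_jsonl(conversations, "simulated_conversations.jsonl")
```

The resulting JSONL file can then be loaded with `datasets.load_dataset("json", data_files=...)` or fed to your preferred eval tool.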

Who is this for?

If you’re building a conversational AI agent and testing on a limited dataset, spending time manually creating test sets, or want to run QA or pentesting of your AI agents, come try Snowglobe! We'd love to chat with you, learn what you're building, and show you Snowglobe in action. Book a time with one of our founders.



@Snowglobe2025. All rights reserved.
