Aug 13, 2025

Scaling AI Safety Testing for Educational Applications

Leveraging Snowglobe to Safeguard Chatbot Interactions Across Diverse Student Personas

400 test cases generated and executed in one day vs. week-long manual process
9,000+ students served across 300 schools with zero safety incidents
100,000+ students potentially reached in the next phase of expansion with the Ministry of Education
From 89% error rate to 0%

"Using Snowglobe, we can run over 400 test cases in just a day, compared to requiring 2-3 people working for at least a week to manually generate and test all scenarios. This has transformed how we test our AI applications to make sure they are safe and guarded against unintended use, helping us achieve near-zero failure rates while ensuring our educational chatbot safely serves thousands of students across Thailand." — Panuwat Poon, Business Lead - RISA AI Product Manager, SCB10

When SCB10X set out to help Thai students prepare for the PISA reading test, they envisioned more than a traditional study tool. They built an interactive chatbot experience in which students take authentic PISA mock exams, get instant, personalized AI-powered feedback, and ask follow-up questions to understand their mistakes and improve. Rather than making students wait days for human review, the chatbot adapts its explanations to each learner immediately, encouraging active understanding over rote memorization, at scale.

This high-touch, student-centered approach comes with unique safety and reliability challenges. Deployed nationwide in partnership with Thailand's Ministry of Education, the chatbot must not only deliver accurate instant feedback, but also navigate a vast range of possible student questions, avoid sensitive or inappropriate topics, and withstand attempts to confuse or “trick” the system.

Traditional QA simply couldn’t keep pace. Manual testing tied up multiple testers for a week and still fell short of full risk coverage. By adopting Guardrails AI’s Snowglobe simulation platform, SCB10X transformed their safeguarding workflow: automating the generation of hundreds of nuanced test scenarios, surfacing hidden vulnerabilities, and enabling quick remediation. This shift made it possible to launch a secure, trustworthy chatbot that now supports 9,000 students across 300 schools, with the potential to extend the same approach to other exams, where it could reach over 100,000 users annually in the next phase of national expansion.

The Challenge of Testing Non-Deterministic AI Applications in Educational Settings

The introduction of generative AI fundamentally changed SCB10X's approach to product development and testing.

"Previously, all our processes were fairly predictable and deterministic—we always knew what to expect as an outcome. But after introducing generative AI, everything became non-deterministic. We started to see not only the successful scenarios we intended, but also unexpected or undesired ones." — Panuwat Poon, Business Lead - RISA AI product manager, SCB10X.

The educational chatbot project exemplified these new complexities. It was developed to help Thai students prepare for the PISA exam, a global triennial assessment administered by the OECD that measures the reading, math, and science skills of 15-year-olds in countries worldwide to benchmark national education systems. The application needed to provide instant feedback on responses to a mock test, offer personalized explanations, and maintain safe interactions across diverse student personas and conversation scenarios. Manual testing simply couldn’t match the breadth, creativity, or speed required to safeguard modern educational AI.

The stakes were particularly high given the educational context and cultural sensitivities. "It’s crucial for us that our chatbot avoids conversations on topics such as gender equality, politics, or other sensitive subjects that require special care in our society. We must also ensure the bot never provides harmful or hallucinated responses," Poon emphasized. The team needed comprehensive testing that could identify potential safety violations across far more interaction patterns than manual testing could cover efficiently.

Implementing Comprehensive AI Testing with Snowglobe

Snowglobe's simulation capabilities directly addressed SCB10X's testing challenges by automating the generation of diverse user personas and interaction scenarios. The platform enabled comprehensive safeguarding across multiple risk profiles, including attempts to steer conversations toward inappropriate topics or "jailbreak" the application's safety guardrails, or to use the service in ways that violate its intended purpose or usage policies.

“What truly sets Snowglobe apart is its remarkable ability to simulate a broad array of student personas and create scenarios we might never have anticipated ourselves,” Poon explained. The platform generated over 400 test cases spanning 50 unique personas, rigorously probing risk profiles such as monarchy-related topics, off-topic conversations, student activism, content safety, and other out-of-scope usage. These test cases were exceptionally thorough—including highly imaginative, uncommon, and adversarial prompts that manual testing would likely have missed or never even conceived of.

The technical integration proved straightforward, leveraging the Thai Large Language Model (LLM) developed by the SCB10X team. "We fine-tuned our Thai LLM to this specific education use case application. Then, integrating with Snowglobe was straightforward: we just needed to add a simple code snippet provided by Guardrails AI," Poon noted.

Snowglobe's user interface particularly impressed the SCB10X team, including QA engineers, who found the platform intuitive and thorough. It allowed detailed review of simulated conversations, manual annotation of results, and export of the full results for further analysis.

Transforming Testing Efficiency and Application Safety

The impact of implementing Snowglobe was immediate and dramatic. Initial testing revealed significant vulnerabilities, with early simulations showing high failure rates across different risk profiles. "We had issues with content safety, failures in terms of response about sensitive topics such as politics, student activism, and problems with irrelevant topic conversations," Poon recalled.

However, the detailed feedback from Snowglobe enabled rapid iteration and improvement. The team exported results to Excel files, analyzed failure cases in detail, and used the insights to refine their system prompts and safety mechanisms. The efficiency gains were remarkable—what previously required 2-3 people working for at least a week could now be accomplished in a single day. This acceleration enabled rapid iteration cycles and more comprehensive testing coverage than manual methods could achieve.
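The kind of failure analysis described above can be sketched in a few lines of Python. This is only an illustration, not SCB10X's actual workflow: it assumes the results are available as CSV text rather than an Excel file, and the column names (test_id, persona, risk_profile, passed) and sample rows are hypothetical, invented for the example. It computes the failure rate for each risk profile and lists the worst-performing profiles first, which is one way to decide where to refine system prompts next.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: column names and rows are invented for illustration,
# not taken from a real Snowglobe export.
SAMPLE_EXPORT = """test_id,persona,risk_profile,passed
1,curious student,off_topic,false
2,curious student,content_safety,true
3,activist student,student_activism,false
4,prankster,off_topic,true
5,prankster,content_safety,true
"""


def failure_rates(csv_text):
    """Return {risk_profile: fraction of test cases that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        profile = row["risk_profile"]
        totals[profile] += 1
        # Anything other than the literal "true" counts as a failure.
        if row["passed"].strip().lower() != "true":
            failures[profile] += 1
    return {profile: failures[profile] / totals[profile] for profile in totals}


def worst_first(rates):
    """Sort risk profiles so the highest failure rate comes first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for profile, rate in worst_first(failure_rates(SAMPLE_EXPORT)):
        print(f"{profile}: {rate:.0%} failed")
```

Re-running the same summary after each prompt revision gives a simple way to check that a fix for one risk profile has not raised the failure rate of another.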

As a result, the chatbot launched safely to over 9,000 students in 300 schools, with enthusiastic feedback from both teachers and learners. Building on this foundation of efficiency, trust, and safety, the SCB10X team is now exploring how to scale the application across Thailand to other standard exams, where the number of potential users could exceed 100,000 students annually.

Ready to Raise the Bar for AI Reliability?

Explore how Snowglobe can enhance the safety and reliability of your AI solution.

Book time with one of our founders to request a demo and see Snowglobe in action
Try out Snowglobe now!
