
What are Behavior Tests?

Behavior tests in Galtea evaluate the conversational capabilities of your product through multi-turn dialogue. These tests use the Conversation Simulator to create realistic conversations with synthetic users that have specific goals, personalities, and behaviors. Unlike Accuracy Tests, which focus on single-turn question-answer pairs, or Security & Safety Tests, which probe for vulnerabilities, Behavior tests evaluate how well your AI product can:
  • Maintain context across multiple conversation turns
  • Guide users toward successful task completion
  • Handle unexpected but realistic user inputs
  • Stay in character and follow conversation guidelines
  • Manage complex dialogue flows

Creating Behavior Tests

You can create behavior tests in Galtea through two methods: the SDK or the dashboard. In either case, the process follows these steps:
1. Define Your Product

Start by defining your product in Galtea. These detailed product properties will be used as the foundation for generating realistic behavior scenarios.
2. Configure the Test

Select Behavior as the test type and Generated as the test origin. You can do this via the SDK or the Galtea dashboard.
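As an illustrative sketch, the configuration boils down to two choices. The dictionary below is hypothetical (the real SDK and dashboard field names may differ); only the two values are specified on this page.

```python
# Hypothetical configuration payload; field names are illustrative, not the
# verified SDK parameters. Only the two choices below come from this page.
behavior_test_config = {
    "type": "behavior",      # Behavior as the test type
    "origin": "generated",   # Generated as the test origin
}
```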
3. Generate Behavior Scenarios

Galtea will automatically analyze your product definition to create tailored conversation scenarios, complete with:
  • Realistic user personas based on your target audience
  • Goals aligned with your product’s use cases
  • Natural conversation flows that test key features

Conversation Flow Categories

Behavior tests can cover various types of conversational interactions:
  • Task completion: scenarios where the synthetic user has a specific goal to accomplish, such as booking a service, making a purchase, or getting support for an issue.
  • Information seeking: scenarios focused on retrieving specific information or answers, testing the AI’s knowledge and ability to provide helpful responses.
  • Customer support: scenarios that simulate customer service interactions, including complaints, troubleshooting, and problem resolution.
  • Decision support: scenarios where the user is exploring options, comparing products, or seeking advice before making decisions.
  • Multi-step processes: scenarios that involve multiple stages or require gathering various pieces of information over several turns.
  • Challenging interactions: scenarios that test how the AI handles difficult, unclear, or unexpected user behavior while maintaining conversation quality.

Example Behavior Tests and File Format

Here are examples of behavior test content and structure:
| Goal | User Persona | Input | Stopping Criterias | Max Iterations | Scenario |
|---|---|---|---|---|---|
| Book a one-way flight from SFO to JFK for next Tuesday | A busy professional who is direct and values efficiency | I need to book a flight | The user has confirmed the flight booking\|The chatbot indicates it cannot fulfill the request | 10 | Flight booking scenario |
| Find the cheapest round-trip flight to Europe in summer | A budget-conscious student who asks many questions | Hi, I’m looking for cheap flights to Europe | Flight is booked\|User decides not to book\|Maximum budget is exceeded | 15 | Budget travel scenario |
| Change an existing flight reservation | A frustrated customer whose plans have changed | I need to change my flight immediately | Reservation is successfully modified\|Customer is transferred to agent | 8 | Flight modification scenario |

| Goal | User Persona | Input | Stopping Criterias | Max Iterations | Scenario |
|---|---|---|---|---|---|
| Get help with a defective product | An upset customer who received a broken item | My product arrived broken and I’m very disappointed | Issue is resolved\|Customer requests supervisor\|Refund is processed | 12 | Product defect resolution |
| Understand how to use a new feature | A curious but non-technical user | I heard about a new feature but don’t know how to use it | User successfully uses the feature\|User gives up\|Technical support is escalated | 10 | Feature education scenario |
| Cancel a subscription | A polite customer who wants to downgrade service | I’d like to cancel my subscription please | Subscription is cancelled\|Alternative plan is accepted\|Retention offer is declined | 8 | Subscription cancellation |

| Goal | User Persona | Input | Stopping Criterias | Max Iterations | Scenario |
|---|---|---|---|---|---|
| Find the right product for specific needs | A thorough researcher who compares many options | I’m looking for a solution but need help choosing | Product recommendation is accepted\|User requests human consultation\|User decides to research more | 15 | Product recommendation scenario |
| Get pricing information for enterprise solution | A business decision maker focused on ROI | What are your enterprise pricing options? | Quote is requested\|Meeting is scheduled\|User indicates budget constraints | 10 | Enterprise sales scenario |
| Compare different service tiers | An analytical customer who wants detailed comparisons | Can you help me understand the differences between your plans? | Plan is selected\|User requests trial\|User needs more time to decide | 12 | Service comparison scenario |
Galtea requires this structure to automatically generate test cases for your behavior tests within the platform. If you don’t provide this format, you can still create the test cases manually.
For automatic processing, the file format must be CSV.
The examples provided above are simplified demonstrations. In actual CSV files, behavior tests can be much more detailed and the number of test cases (rows) can be significantly higher to provide comprehensive conversation testing coverage.
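As a sketch, a behavior-test CSV with the columns shown above can be produced programmatically. The file name and row values here are illustrative, taken from the first example table.

```python
import csv

# Columns match the behavior-test fields described on this page.
FIELDNAMES = [
    "goal", "user_persona", "input",
    "stopping_criterias", "max_iterations", "scenario",
]

rows = [
    {
        "goal": "Book a one-way flight from SFO to JFK for next Tuesday",
        "user_persona": "A busy professional who is direct and values efficiency",
        "input": "I need to book a flight",
        # Multiple criteria are joined with "|" (";" also works as a delimiter).
        "stopping_criterias": "The user has confirmed the flight booking"
                              "|The chatbot indicates it cannot fulfill the request",
        "max_iterations": 10,
        "scenario": "Flight booking scenario",
    },
]

with open("behavior_tests.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file can then be uploaded to Galtea for automatic processing.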

Structure of Behavior Tests

Behavior tests have a specific structure designed to enable realistic conversation simulation:
goal
Text
required
The overall objective the synthetic user is trying to achieve during the conversation. Example: “Book a one-way flight from San Francisco (SFO) to New York (JFK) for next Tuesday”
user_persona
Text
required
The personality and communication style of the synthetic user that will be simulated throughout the conversation. Example: “A busy professional who is direct and values efficiency. They prefer to get things done quickly without much small talk.”
input
Text
The first message the synthetic user sends to start the conversation. If not provided, the system will generate an appropriate opening based on the goal and persona. Example: “Hello, I need to book a flight.”
stopping_criterias
Text
A delimited string of conditions that, if met, will cause the conversation simulation to end. Use ; or | as delimiters to separate multiple criteria. Example: “The user has confirmed the flight booking|The chatbot indicates it cannot fulfill the request;User expresses satisfaction with the service”
max_iterations
Number
The maximum number of conversation turns before the simulation automatically ends. This prevents infinite loops and controls test duration. Example: 10
scenario
Text
A brief description of the scenario for documentation and organizational purposes. This helps categorize and manage different types of conversation tests. Example: “Flight booking scenario for business travelers”
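The fields above map naturally onto a small data structure. The class below is a sketch that mirrors this schema (it is not part of the Galtea SDK), including splitting `stopping_criterias` on either supported delimiter.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class BehaviorTestCase:
    """Mirrors the behavior-test fields on this page; not an official Galtea class."""
    goal: str                        # required: what the synthetic user wants
    user_persona: str                # required: personality and communication style
    input: Optional[str] = None      # opening message; generated if omitted
    stopping_criterias: str = ""     # conditions delimited by "|" or ";"
    max_iterations: int = 10         # turn budget before the simulation ends
    scenario: str = ""               # short description for organization

    def criteria(self) -> list[str]:
        """Split the delimited criteria string on ';' or '|'."""
        parts = re.split(r"[;|]", self.stopping_criterias)
        return [c.strip() for c in parts if c.strip()]

case = BehaviorTestCase(
    goal="Book a one-way flight from SFO to JFK for next Tuesday",
    user_persona="A busy professional who is direct and values efficiency",
    stopping_criterias="The user has confirmed the flight booking"
                       "|The chatbot indicates it cannot fulfill the request"
                       ";User expresses satisfaction with the service",
)
```

Note that both delimiters can appear in the same string, as in the `stopping_criterias` example above.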

Using Behavior Tests with the Conversation Simulator

Behavior tests are specifically designed to work with Galtea’s Conversation Simulator. When you run evaluations using behavior tests, the system will:
  1. Initialize the conversation using the input and user_persona
  2. Generate realistic user responses based on the goal and conversation history
  3. Continue the dialogue until one of the stopping_criterias is met or max_iterations is reached
  4. Evaluate the conversation using your selected metrics to assess performance
For detailed implementation examples, see the Conversation Simulator Tutorial.
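The four steps above can be sketched as a simple loop. The agent, synthetic-user, and stopping-check functions below are stand-in stubs for illustration, not the actual Conversation Simulator internals.

```python
def simulate_conversation(agent, user, opening, stop_check, max_iterations):
    """Run a simulated dialogue until a stopping criterion fires or the
    turn budget is exhausted. Returns the transcript for later evaluation."""
    history = [("user", opening)]
    for _ in range(max_iterations):
        reply = agent(history)                   # product under test responds
        history.append(("assistant", reply))
        if stop_check(history):                  # any stopping criterion met?
            break
        history.append(("user", user(history)))  # synthetic user's next turn
    return history

# Stand-in stubs so the sketch runs end to end.
agent = lambda h: "Your flight is booked." if len(h) > 3 else "Which date?"
user = lambda h: "Next Tuesday, please."
stop_check = lambda h: "booked" in h[-1][1]

transcript = simulate_conversation(
    agent, user, "I need to book a flight", stop_check, 10
)
```

In Galtea itself, the stopping check is driven by your `stopping_criterias` and the final transcript is scored with your selected metrics.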

Data Catalog

When generating behavior tests, you can optionally upload a Data Catalog — a file containing reference data from your product’s domain. This helps the generator produce scenarios grounded in realistic values rather than fully fictional ones.
The data catalog does not need to contain real production data. Synthetic samples, anonymized records, or representative mock data all work equally well — what matters is that the values reflect the shape and variety of your domain.

How it works

Values from the data catalog are incorporated into the scenario field of each generated test case. User personas remain fully synthetic and are never populated from the catalog. For example, if your product is a banking assistant, a data catalog might include sample account types, transaction descriptions, and card IDs. The generator would then reference those values in scenarios like “User asks about a pending charge of €47.80 at REST LA BARRACA on card C-001” instead of inventing generic placeholders.

Supported formats

CSV, JSON, JSONL, XML, YAML, and TXT files up to 50 MB.

What to include

  • Domain entities your users would reference (product names, plan tiers, category labels)
  • Sample identifiers (order IDs, ticket numbers, account codes)
  • Representative amounts, dates, or statuses
  • Any reference data that makes scenarios feel grounded in your specific domain

What to avoid

  • Actual customer PII or production credentials
  • Internal system IDs or backend states that end users would never see
  • Extremely large datasets — the generator uses up to ~50,000 characters of content
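Since the generator uses up to roughly 50,000 characters, a very large catalog can be trimmed before upload. This helper is a sketch; the limit constant reflects the approximate figure mentioned above.

```python
MAX_CATALOG_CHARS = 50_000  # approximate generator limit noted on this page

def trim_catalog(text: str, limit: int = MAX_CATALOG_CHARS) -> str:
    """Keep whole lines up to the character budget so records stay intact."""
    if len(text) <= limit:
        return text
    cut = text.rfind("\n", 0, limit)  # avoid cutting a record mid-line
    return text[:cut if cut != -1 else limit]
```

Trimming at line boundaries matters for line-oriented formats like CSV or JSONL, where a half-cut record would be malformed.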

Best Practices for Manually Creating Behavior Tests

Goals

  • Make goals specific and measurable
  • Include realistic constraints (time, budget, preferences)
  • Vary complexity across different scenarios
  • Consider both successful and unsuccessful outcomes

User Personas

  • Include personality traits that affect communication style
  • Specify technical knowledge level and domain expertise
  • Define emotional states and motivations
  • Consider cultural and contextual factors

Stopping Criteria

  • Include both positive outcomes (goal achieved) and negative outcomes (failure, frustration)
  • Account for edge cases where the conversation might stall
  • Use multiple criteria connected with "|" or ";" to cover various scenarios
  • Make criteria specific enough to be clearly identifiable

Scenario Coverage

  • Test common user journeys and edge cases
  • Include scenarios with different complexity levels
  • Cover various user types and use cases
  • Test both cooperative and challenging user behaviors