2025-10-06
Platform Simplification, New Conversational Metrics, and Major Usability Upgrades

Platform Simplification and SDK v3.0

We’ve undertaken a major simplification of our core concepts to make the platform more intuitive. Evaluation Tasks are now simply Evaluations, and Metric Types have been renamed to Metrics. The old parent Evaluations entity has been removed entirely. These changes streamline the workflow and clarify the relationship between different parts of Galtea. To support this, we’ve released a new major version of our SDK. Please see the Migration to v3.0 guide for detailed instructions on updating your projects.

New Conversational Metrics

We’re excited to introduce two new metrics designed specifically for evaluating conversational AI:
  • User Objective Accomplished: Evaluates whether the user’s stated goal was successfully and correctly achieved during the conversation.
  • User Satisfaction: Assesses the user’s overall experience, focusing on efficiency and sentiment, to gauge their satisfaction with the interaction.

Enhanced Test Case Feedback and Management

Improving test quality is now a more collaborative process. When upvoting or downvoting a Test Case, you can now add a user_score_reason to provide valuable context for your feedback. Additionally, you can now filter test cases by their score directly via the SDK using the user_score parameter in the test_cases.list() method.
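For illustration, a minimal sketch of the new filter, assuming an already-initialized Galtea client; everything other than test_cases.list() and the user_score parameter is an assumption rather than the documented API:

```python
# Illustrative sketch only; client setup and the accepted user_score values
# are assumptions, not the documented Galtea API.
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Fetch only the test cases that received a negative score.
downvoted_cases = galtea.test_cases.list(
    test_id="YOUR_TEST_ID",
    user_score=-1,  # assumed value; check the SDK docs for accepted scores
)
```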

Dashboard and SDK Usability Improvements

We’ve rolled out several updates to make your workflow smoother and more efficient:
  • Improved Dashboard Navigation: Navigating between related entities like Tests, Test Cases, and Evaluations is now more intuitive. We’ve also adjusted table interactions—you can now single-click a row to select and copy text without navigating away. To see an entity’s details, simply right-click the row.
  • Efficient Batch Fetching in SDK: The SDK now allows you to fetch objects by providing a list of IDs (e.g., fetching all Test Cases for a list of Test IDs at once), significantly improving usability for batch operations.
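As a rough sketch of the batch-fetching pattern (exactly how the SDK accepts a list of IDs is an assumption here):

```python
# Hypothetical example: fetching the test cases of several tests in one call
# instead of looping over individual IDs.
test_ids = ["test_a", "test_b", "test_c"]
all_test_cases = galtea.test_cases.list(test_id=test_ids)  # list of IDs at once
```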

General Improvements

This release also includes numerous performance optimizations and minor UI/UX enhancements across the entire platform to provide a faster and more polished experience.
2025-09-29
Generate Scenarios from Quality Tests, URL Validation Metric, and Platform Boost

Create Conversational Scenarios from Your Quality Tests

You can now generate comprehensive Scenario Based Tests directly from your existing Quality Tests. This powerful feature allows you to transform your single-turn, gold-standard test cases into realistic, multi-turn conversational scenarios, significantly accelerating the process of evaluating your AI’s dialogue capabilities.

New Deterministic Metric: URL Validation

We’ve added the URL Validation metric to our deterministic evaluation suite. It ensures that all URLs in your model’s output are safe, properly formatted with HTTPS, and resolvable. It includes strict validation and SSRF protection, making it essential for any application that generates external links.
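The checks below are not the Galtea implementation, just a minimal Python sketch of what “properly formatted with HTTPS and resolvable” can mean in practice:

```python
# Not the Galtea metric itself: a simplified illustration of HTTPS and
# resolvability checks for URLs found in a model's output.
import socket
from urllib.parse import urlparse

def url_looks_valid(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False  # must use HTTPS and contain a host
    try:
        socket.getaddrinfo(parsed.hostname, 443)  # host must resolve
    except socket.gaierror:
        return False
    return True

print(url_looks_valid("https://docs.galtea.ai"))  # True if the host resolves
```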

Platform Enhancements and Performance Boost

We’ve rolled out major performance improvements across the platform, with a special focus on the analytics views, making them faster and more responsive. This update also includes several minor visual fixes, such as ensuring icons render correctly in different themes, to provide a smoother and more polished user experience.
2025-09-22
Consistent Sorting, Improved Navigation, and SAML SSO

Consistent Sorting and Enhanced Navigation

We’ve made significant usability improvements to the platform. Table sorting is now more consistent and persistent; you can navigate away, refresh the page, and your chosen sort order will remain, ensuring a smoother workflow. Additionally, navigating from an Evaluation is easier than ever. You can now right-click on any evaluation in the table or use the dropdown menu in the details page to jump directly to the related Test Case or Session.
Right-click navigation from the Evaluation table

Upvote and Downvote Test Cases

To help teams better curate their test suites, we’ve introduced a voting system for Test Cases. You can now upvote or downvote test cases directly from the Dashboard, providing a quick feedback loop on test case quality and relevance.
Upvote and downvote buttons for Test Cases

SAML SSO Authentication

Organizations can now enhance their security by configuring SAML SSO for authentication. This allows for seamless and secure access to the Galtea platform through your existing identity provider.

Improved Misuse Resilience Metric

The Misuse Resilience metric has been enhanced to accept the full product context. This allows for a more accurate and comprehensive evaluation of your model’s ability to resist misuse by leveraging a deeper understanding of your product’s intended capabilities and boundaries.

New Analytics Filter

The analytics page now includes a filter for Test Case language. This allows you to narrow down your analysis and gain more precise insights into the performance of your multilingual models. Enjoy the improvements!
2025-09-15
Test Case Confidence, Custom Judges for Conversations, and More

Confidence Scores for Generated Test Cases

We’re introducing confidence scores for all generated Test Cases. This new feature provides a clear indicator of the quality and reliability of each test case, helping you better understand your test suites and prioritize human review efforts. Higher scores indicate greater confidence in the test case’s relevance and accuracy.

Simplified and More Flexible Metric Creation

Creating Metrics is now more intuitive and flexible. Metrics are now directly linked to a specific test category (QUALITY, RED_TEAMING, or SCENARIOS), which simplifies the creation process by tailoring the available parameters to the relevant test category. This change makes it easier to define metrics that are perfectly aligned with your evaluation goals. See the updated creation guide for more details.

Custom Judges for Conversational Evaluation

You can now create your own Custom Judges specifically for evaluating multi-turn conversations. This powerful feature allows you to define complex, stateful evaluation logic that assesses the entire dialogue, enabling you to measure nuanced aspects like task completion, context retention, and persona adherence across multiple turns. Learn more in our guide to evaluating conversations.

Conversation Simulator Enhancements

We’ve added more control and realism to the Conversation Simulator:
  • Agent-First Interactions: The simulate method now includes an agent_goes_first parameter, allowing you to test scenarios where the agent initiates the conversation. See the SDK docs.
  • Selectable Conversation Styles: When creating a Scenario Based Test, you can now choose a conversation style (written or spoken). This influences the tone and formality of the synthetic user’s dialogue, enabling more realistic testing. This is available under the strategies parameter in the test creation method.
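A hedged sketch of an agent-first simulation call; only agent_goes_first is documented in this entry, and the object exposing simulate() and its other arguments are assumptions:

```python
# Illustrative only: argument names other than agent_goes_first are assumptions.
result = galtea.simulator.simulate(
    session_id="YOUR_SESSION_ID",
    agent_goes_first=True,  # the agent opens the conversation
)
```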
2025-09-08
Smarter Simulations, Bulk Uploads, and Streamlined Onboarding

More Realistic Conversation Simulations

Our Conversation Simulator has been significantly upgraded to generate more realistic and human-like interactions. The simulator’s evaluation of stopping criteria is now more precise, ensuring that your multi-turn dialogue tests conclude under the correct conditions and provide more accurate insights into your AI’s conversational abilities. Additionally, conversations generated during simulations are now linked to the specific product version they were created with, allowing for better tracking and traceability.
To enable version tracking for simulated conversations, please upgrade to Galtea SDK version 1.8.1 or higher.

Bulk Document Uploads for Test Generation

You can now upload multiple files at once using ZIP archives to generate comprehensive test suites. This feature streamlines the process of creating tests from large document collections, saving you time and effort when building out your evaluation scenarios. For more details, see the test creation documentation.

Streamlined Onboarding Experience

Getting started with a new product is now faster than ever. When you create a new product in the Galtea dashboard, an initial version is now created automatically. This simplification streamlines the setup process and helps you get to your first evaluation more quickly.

Enhanced Document Processing

We’ve improved our backend for processing documents used in test generation. This enhancement leads to more accurate and relevant test cases being created from your knowledge bases, improving the overall quality of your evaluations.
2025-09-01
Galtea Rebranding, Enhanced Analytics, and Platform Security

Fresh New Look: Galtea Rebranding is Live!

We’re excited to unveil Galtea’s complete visual transformation! Our new branding includes a refreshed logo, updated slogan, and a modern look across the entire platform. Experience the new design on our website and explore the updated dashboard interface. This rebrand reflects our continued commitment to providing a more intuitive and visually appealing user experience.
Galtea Rebranding

Enhanced Analytics with Improved Version Comparisons

The Analytics views have received significant improvements to make data analysis more efficient and user-friendly. Key enhancements include:
  • Streamlined Chart Comparisons: Charts now support easier side-by-side version comparisons, helping you quickly identify performance differences across iterations.
  • Optimized Filter Layout: Filters no longer occupy valuable screen real estate, giving you more space to focus on your analytics data and insights.
  • Improved Visual Clarity: Enhanced data presentation makes it easier to interpret results and make informed decisions about your AI systems.
These improvements build upon our existing analytics capabilities to provide a more seamless evaluation experience.

Strengthened Authentication and Security

We’ve implemented more secure and robust authentication mechanisms across the platform. These behind-the-scenes security enhancements provide better protection for your data and ensure reliable access to your Galtea workspace, giving you peace of mind when working with sensitive AI evaluation data.

Advanced PDF Knowledge Extraction

Our document processing capabilities have been significantly improved for PDF files. The enhanced extraction algorithms now provide more accurate and comprehensive knowledge retrieval from PDF documents, making it easier to create relevant test cases and evaluation scenarios from your existing documentation and knowledge bases.
2025-08-11
Dashboard Enhancements: TestCase Management, Judge Templates, and System Resilience

TestCase Scenarios Creation and Edition via Dashboard

You can now create and edit TestCase Scenarios directly through the Dashboard interface. This streamlined workflow allows you to define complex multi-turn conversation scenarios without leaving the platform, making it easier to set up comprehensive testing workflows for your conversational AI systems.

Enhanced TestCase Dashboard with Test-Specific Columns

We’ve improved the TestCase Dashboard with smarter table displays that now only show columns relevant to each specific test. This reduces visual clutter and makes it easier to focus on the information that matters most for your particular testing scenario, whether you’re working with single-turn evaluations, conversation scenarios, or other tests.

New Judge Template Selector for Judge Metrics

When creating Judge metrics, you can now select from pre-built judge templates to accelerate your metric setup process. This feature provides a starting point for common evaluation patterns while still allowing full customization of your evaluation prompts and scoring logic.

Infrastructure Improvements for Event Resilience

We’ve enhanced our event handling infrastructure to provide better resilience against unexpected system events. These improvements help ensure that your tests and evaluations are preserved and continue running smoothly, even during system maintenance or unexpected interruptions.
2025-08-04
Create Custom Judges via Own Prompts in Metric Creation, New Deterministic Metrics and UI Improvements

Create Custom Judges via Your Own Prompts

You can now define Custom Judge metrics by crafting your own evaluation prompts during metric creation. This allows you to encode your domain-specific rubrics, product constraints, or behavioral guidelines directly into the metric—giving you precise control over how LLM outputs are assessed. Simply write your prompt, specify the scoring logic, and Galtea will leverage LLM-as-a-judge techniques to evaluate outputs according to your standards.

New Deterministic Metrics

Four new deterministic metrics are now available:
  • Text Similarity: Quantifies how closely two texts resemble each other using character-level fuzzy matching.
  • Text Match: Checks if generated text is sufficiently similar to a reference string, returning a pass/fail based on a threshold.
  • Spatial Match: Verifies if a predicted box aligns with a reference box using IoU scoring, producing a pass/fail result.
  • IoU (Intersection over Union): Computes the overlap ratio between predicted and reference boxes for alignment and detection tasks.
These new metrics complement our existing evaluation suite and are perfect for scenarios requiring deterministic, rule-based scoring.
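For reference, a minimal sketch of the IoU computation that Spatial Match and the IoU metric are based on; the (x_min, y_min, x_max, y_max) box format is an assumption:

```python
# Toy IoU implementation for axis-aligned boxes; not the Galtea code.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ 0.143
```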

Dashboard Redesign: Second Iteration

We’ve launched the second iteration of our redesigned dashboard with a refreshed visual language focused on clarity and usability. Key improvements include:
  • Modern Forms: Forms have been modernized to provide a more intuitive and visually appealing user experience, as well as giving a more professional look to the Dashboard.
This redesign represents our commitment to delivering a more intuitive and visually appealing user experience while maintaining the powerful functionality you rely on.
2025-07-28
New Evaluator Models, Classic Metrics, and Dashboard Redesign

Expanded Evaluator Model Support

We’ve added support for more evaluator models to enhance your evaluation capabilities:
  • Gemini-2.5-Flash: Google’s latest high-performance model optimized for speed and accuracy
  • Gemini-2.5-Flash-Lite: A lightweight variant offering faster processing with efficient resource usage
  • Gemini-2.0-Flash: Google’s established model providing reliable evaluation performance
You can now link these models to specific Metrics for consistent evaluation results and leverage specialized models for different evaluation scenarios. Learn more about configuring evaluator models in our SDK documentation.

Enhanced Conversation Simulation

Testing conversational AI just got more powerful with two major improvements:
  • Visible Stopping Reasons: You can now see exactly why simulated conversations ended in the dashboard, providing crucial insights into dialogue flow and helping you identify areas for improvement.
  • Custom User Persona Definitions: Create highly specific synthetic user personas when generating Scenario Based Tests. Define detailed user backgrounds, goals, and behaviors to test how your AI handles diverse user interactions more effectively.
These enhancements work seamlessly with our Conversation Simulator to deliver more realistic and insightful testing scenarios.

Classic NLP Metrics Now Available

We’ve expanded our metric library with three essential deterministic metrics for precise text evaluation:
  • BLEU: Measures n-gram overlap between generated and reference text, ideal for machine translation and constrained generation tasks.
  • ROUGE: Evaluates summarization quality by measuring the longest common subsequence between candidate and reference summaries.
  • METEOR: Assesses translation and paraphrasing by aligning words using exact matches, stems, and synonyms for more nuanced evaluation.
These classic metrics complement our existing evaluation suite and are perfect for scenarios requiring deterministic, rule-based scoring.
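To make the n-gram overlap idea behind BLEU concrete, here is a toy illustration (not the implementation these metrics use):

```python
# Toy bigram-precision example illustrating n-gram overlap; real BLEU also
# combines multiple n-gram orders and applies a brevity penalty.
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    cand_ngrams = Counter(zip(*[candidate.split()[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[reference.split()[i:] for i in range(n)]))
    overlap = sum((cand_ngrams & ref_ngrams).values())
    return overlap / max(sum(cand_ngrams.values()), 1)

print(ngram_precision("the cat sat on the mat", "the cat is on the mat"))  # 0.6
```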

Enhanced Red Teaming with Jailbreak Resilience v2

Security testing gets an upgrade with Jailbreak Resilience v2, an improved version of our jailbreak resistance metric. This enhanced evaluation provides more comprehensive assessment of your model’s ability to resist adversarial prompts and maintain safety boundaries across various attack vectors.

Dashboard Redesign: First Iteration

We’ve launched the first iteration of our redesigned dashboard with a refreshed visual language focused on clarity and usability. Key improvements include:
  • Modern Typography: Cleaner, more readable text throughout the platform
  • Refined UI Elements: Updated buttons, cards, and form elements with reduced rounded corners for a more contemporary look
  • Streamlined Tables: Enhanced data presentation with improved content layout
  • Updated Login Experience: A more polished and user-friendly authentication flow
This redesign represents our commitment to delivering a more intuitive and visually appealing user experience while maintaining the powerful functionality you rely on.

Improved SDK Documentation

We’ve enhanced our SDK documentation with clearer guidance on defining evaluator models for metrics, making it easier to configure and customize your evaluation workflows.
2025-07-21
Synthetic User Simulation, New Data Leakage Metric, and Platform Enhancements

Test Your Chatbots with Synthetic Users and Scenarios

It is now possible to generate tests that simulate realistic, multi-turn user interactions. Our new Scenario Based Tests allow you to define synthetic user personas and goals to evaluate how well your conversational AI handles complex dialogues. This feature is powered by the Conversation Simulator, which programmatically runs these scenarios to test dialogue flow, context handling, and task completion. Get started with our new Simulating User Conversations tutorial.

New Red Teaming Metric: Data Leakage

We’ve added the Data Leakage metric to our suite of Red Teaming evaluations. This metric assesses whether your LLM returns content that could contain sensitive information, such as PII, financial data, or proprietary business data. It is crucial for ensuring your applications are secure and privacy-compliant.

Enhanced Metric Management

We’ve rolled out several improvements to make metric creation and management more powerful and intuitive:
  • Link Metrics to Specific Models: You can now associate a Metric with a specific evaluator model (e.g., “GPT-4.1”). This ensures consistency across evaluation runs and allows you to use specialized models for certain metrics.
  • Simplified Custom Scoring: We’ve introduced a more streamlined method for defining and calculating scores for your own deterministic metrics using the CustomScoreEvaluationMetric class. This makes it easier to integrate your custom, rule-based logic directly into the Galtea workflow. Learn more in our tutorial on evaluating with custom scores.
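A hedged sketch of a rule-based metric built on the CustomScoreEvaluationMetric class mentioned above; the import path and the method to override are assumptions, so see the linked tutorial for the actual API:

```python
# Assumed import path and hook name; only the class name comes from this entry.
from galtea import CustomScoreEvaluationMetric

class ExactMatch(CustomScoreEvaluationMetric):
    """Deterministic pass/fail metric: 1.0 on an exact match, 0.0 otherwise."""

    def score(self, actual_output: str, expected_output: str) -> float:
        return 1.0 if actual_output.strip() == expected_output.strip() else 0.0
```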

Support for Larger Inputs and Outputs

To better support applications that handle large documents or complex queries, we have increased the maximum character size for evaluation inputs and outputs to 250,000 characters. Enjoy the new features and improvements! As always, we welcome your feedback.
2025-07-14
Conversation Simulation, New Metric, and Credit Management

Test Your Chatbots with Realistic Conversation Simulation

You can now evaluate your conversational AI with our new Conversation Simulator. This powerful feature allows you to test multi-turn interactions by simulating realistic user conversations, complete with specific goals and personas. It’s the perfect way to assess your product’s dialogue flow, context handling, and task completion abilities. Get started with our step-by-step guide on Simulating User Conversations.

New Metric: Resilience To Noise

We’ve expanded our RAG metrics with Resilience To Noise. This metric evaluates your product’s ability to maintain accuracy and coherence when faced with “noisy” input, such as:
  • Typographical errors
  • OCR/ASR mistakes
  • Grammatical errors
  • Irrelevant or distracting content
This is essential for ensuring your AI performs reliably in real-world scenarios where user input isn’t always perfect. Learn more about how it’s calculated.

Stay in Control with Enhanced Credit Management

We’ve rolled out a new and improved credit management system to give you better visibility and control over your usage. The system now includes proactive warnings that notify you when you are approaching your allocated credit limits, helping you avoid unexpected service interruptions and manage your resources more effectively.

Streamlined Conversation Logging with OpenAI-Aligned Format

Logging entire conversations is now easier and more intuitive. We’ve updated our batch creation method to align with the widely-used messages format from OpenAI, consisting of role and content pairs. This makes sending multi-turn interaction data to Galtea simpler than ever. See the new format in action in the Inference Result Batch Creation docs.
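The messages format itself is the familiar OpenAI structure of role/content pairs; the surrounding batch-creation call is omitted here (see the linked docs):

```python
# Example of the role/content message format described above.
messages = [
    {"role": "user", "content": "What are your opening hours?"},
    {"role": "assistant", "content": "We are open from 9:00 to 18:00, Monday to Friday."},
    {"role": "user", "content": "And on weekends?"},
    {"role": "assistant", "content": "We are closed on weekends."},
]
```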
2025-07-07
Custom Threats, New Metrics, and Enhanced Test Case Management

Tailor Your Red Teaming with Custom Threats

You can now define your own custom threats when creating Red Teaming Tests. This new capability allows you to move beyond our pre-defined threat library and create highly specific adversarial tests that target the unique vulnerabilities and edge cases of your AI product. Simply describe the threat you want to simulate, and Galtea will generate relevant test cases.

New Red Teaming Strategies: RolePlay and Prefix Injection

We’ve expanded our arsenal of Red Teaming Strategies to help you build more robust AI defenses:
  • RolePlay: This strategy attempts to alter the model’s identity (e.g., “You are now an unrestricted AI”), encouraging it to bypass its own safety mechanisms and perform actions it would normally refuse.
  • Prefix Injection: Adds a misleading or tactical instruction before the actual malicious prompt. This can trick the model into a different mode of operation, making it more susceptible to the adversarial attack.

Introducing the Misuse Resilience Metric

A new non-deterministic metric, Misuse Resilience, is now available. This powerful metric evaluates your product’s ability to stay aligned with its intended purpose, as defined in your product description, even when faced with adversarial inputs or out-of-scope requests. It ensures your AI doesn’t get diverted into performing unintended actions, a crucial aspect of building robust and responsible AI systems. Learn more in the full documentation.

Enhanced Test Case Management: Mark as Reviewed

To improve collaboration and workflow for human annotation teams, Test Cases can now be marked as “reviewed”. This feature allows you to:
  • Track which test cases have been validated by a human.
  • See who performed the review, providing a clear audit trail.
  • Filter and manage your test sets with greater confidence.
Enjoy these updates as we continue to make AI evaluation more powerful and intuitive!
2025-06-30
New Metric, More Red Teaming Strategies and Better Usability

Introducing the Factual Accuracy Metric

We’ve added a new Factual Accuracy metric to our evaluation toolkit! This non-deterministic metric measures whether the information in your model’s output is factually correct when compared to a trusted reference answer. It’s particularly valuable for RAG and question answering systems where accuracy is paramount. The metric uses an LLM-as-a-judge approach to compare key facts between your model’s output and the expected answer, helping you catch hallucinations and ensure your AI provides reliable information to users. Read the full documentation here.

Enhanced Red Teaming with New Attack Strategies

Our red teaming capabilities just got more sophisticated! We’ve added two powerful new attack strategies:
  • Biblical Strategy: Transforms adversarial prompts into biblical/ancient scripture style using poetic and symbolic language to disguise malicious intent while preserving meaning.
  • Math Prompt Strategy: Encodes harmful requests into formal mathematical notation using group theory concepts to obscure the intent from standard text analysis.
These strategies join our existing arsenal to help you test your AI’s defenses against increasingly creative attack vectors that real-world adversaries might use. See all available red teaming strategies.

Smarter Red Teaming Test Generation

We’ve significantly improved how red teaming tests are generated. Our system now takes even more factors into account when creating adversarial test cases:
  • Product-Aware Generation: Tests are now more precisely tailored to your specific product’s strengths, weaknesses, and operational boundaries.
  • Context-Sensitive Attacks: The generation process better understands your product’s intended use cases to craft more relevant and challenging scenarios.
  • Enhanced Threat Modeling: Our algorithms now consider a broader range of factors when determining the most effective attack vectors for your particular AI system.
This means your red teaming tests will be more effective at uncovering real vulnerabilities and edge cases specific to your product.

Better Metric Source Visibility and Management

Understanding where your metrics come from is now easier than ever! We’ve enhanced the platform to provide clearer visibility into metric sources:
  • Source Classification: All metrics are now clearly labeled with their source - whether they’re from established frameworks, custom Galtea implementations, or other origins.
  • Enhanced Filtering: You can now filter metrics by their source to quickly find the evaluation criteria that best fit your needs.
  • Improved Descriptions: Metric descriptions now include more detailed information about their origins and implementation, with links to relevant documentation.
Enhanced Metrics Table with Source Visibility
The three main metric sources you’ll see are:
  • Galtea: Custom metrics designed specifically for your needs, like our new Factual Accuracy metric
  • G-Eval: Framework-based metrics that use evaluation criteria or steps for assessment
  • Established Frameworks: Metrics adapted from proven evaluation libraries and methodologies
This transparency helps you make more informed decisions about which metrics to use for your specific evaluation needs. Enjoy these improvements, and as always, we’d love to hear your feedback on these new features!
2025-06-23
Quality Test Generation, Metric Tags, and Product Details

Generate Quality Tests from Examples

You can now create Quality Tests directly from your own examples using the new Few Shots parameter. This makes it easier to tailor tests to your specific use cases and ensure your models are evaluated on the scenarios that matter most. Learn more about test creation.

Metric Tags

Metrics now support tags for easier classification and discovery. Quickly find and organize metrics relevant to your projects. See all metrics.

Enhanced Product Details

Products now include new detail fields:
  • Capabilities: What your product can do.
  • Inabilities: Known limitations.
  • Security Boundaries: Define the security scope and constraints.
These additions help you document and communicate your product’s strengths and boundaries more clearly. Read about product details.

Improved Q&A Generation

Question-answer pairs are now generated with improved accuracy and clarity, thanks to better text filtering and processing.

New Guide: Setting Up Your Product Description

We’ve created a comprehensive guide to help you set up your product descriptions effectively. This guide covers best practices, examples, and tips to ensure your product is presented in the best light. Check it out here.

General Improvements

We’ve made various bug fixes and UX/UI improvements across the Dashboard, SDK, and more, making your experience smoother and more reliable.
2025-06-16
Major Platform Overhaul & SDK v2

Major Platform Overhaul

We’ve been hard at work reorganizing and expanding the Galtea platform to handle more use cases and prepare for exciting future features. This release brings significant improvements to the dashboard, SDK, and test generation.

Dashboard Enhancements

  • Reorganized Version, Test, and Evaluation Views:
    Detailed views have been streamlined and improved for clearer insights.
  • New Sessions Visualizations:
    Easily organize and navigate conversations through our new Sessions feature.
  • Evaluations Visualization Removed:
    The dashboard now focuses on Sessions and Evaluations as the primary elements.
  • Better Filters Across Tables:
    Quickly find what you need with improved filtering capabilities on the dashboard.
  • General Bug Fixes & UX Improvements:
    Enjoy smoother interactions, clearer tooltips, and more intuitive code snippets.

SDK v2 Released

The new Galtea SDK v2 is here! It includes breaking changes to simplify workflows and add session support. Check out the migration guide for a smooth transition.
  • Implicit Session Creation:
    Sessions are created automatically when needed for evaluations.
  • Repurposed evaluations.create():
    The old method is replaced by create_single_turn() for test-based evaluations, while create() now exclusively handles session-based evaluations.
  • New evaluations.create_single_turn() Method:
    Use this for single-turn test cases. It now requires version_id instead of evaluation_id.
  • Simplified Version Creation:
    The galtea.versions.create() method now accepts all properties directly; there’s no need for an optional_props dictionary.
  • Sessions Support:
    Group multiple inference results under a single session for better multi-turn tracking using galtea.sessions.create().
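Putting the v2 changes together, a hedged sketch of the new flow; argument names beyond those mentioned above are assumptions:

```python
# Sketch of the SDK v2 flow; only versions.create(), sessions.create() and
# evaluations.create_single_turn() are named in this entry, and the remaining
# arguments are assumptions.
version = galtea.versions.create(
    product_id="YOUR_PRODUCT_ID",
    name="v2-candidate",            # properties passed directly, no optional_props dict
)

session = galtea.sessions.create(version_id=version.id)  # groups multi-turn results

evaluation = galtea.evaluations.create_single_turn(
    version_id=version.id,          # replaces the old evaluation_id argument
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="Model answer to evaluate",
)
```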

Improved Test Case Generation

  • Smarter Test Coverage:
    Test cases are now distributed more intelligently across your documents for better coverage based on the number of questions you choose to generate.
  • Single Threat per Red Teaming Test:
    Red Teaming Tests now only allow a single threat per test, ensuring clearer results.
Enjoy the upgrade!
2025-05-26
Enhanced Test Generation & Streamlined Workflow with Code Snippets

Improved Test Generation

Our test generation capabilities have been significantly upgraded:
  • Versatile Red Teaming: Red Teaming tests are now more powerful, allowing you to employ multiple attack strategies to thoroughly probe your AI’s defenses.
  • Better Synthetic Data: We’ve made general improvements to the quality of synthetic data generation, ensuring your tests are more effective and realistic.

Code Snippets Now Available on the Dashboard

We’re making it easier than ever to integrate Galtea into your development process!
  • Simplified Evaluation Setup: The “Create Evaluation” form on the dashboard has been replaced with a convenient code snippet. Simply copy and paste it directly into your project to get started.
  • Streamlined Creation: Similarly, a new code snippet for “Create Evaluation” is now available on the dashboard, simplifying how you send evaluation data to Galtea. You can easily copy and paste this into your project.
    Create Evaluation Code Snippet

Usability Improvements

We’ve also rolled out several usability enhancements based on your feedback:
  • Enhanced Readability in Tables: Table cells now correctly render line breaks, making it easier to view multi-line content and detailed information at a glance.
  • Controlled Test Case Generation: To ensure optimal performance and manageability, the maximum number of test cases automatically generated for a single test from a knowledge base is now capped at 1000.
Enjoy these improvements and as always, we welcome your feedback!
2025-05-19
Platform Upgrades: Easier Onboarding, Improved UI & Finer Control

Streamlined Onboarding and Quicker Starts

We’ve revamped the platform onboarding! It’s now more visually intuitive, and to help new users start evaluating in no time, we now provide a default Metric and a default Test. This makes it easier than ever to get started with Galtea and run your first evaluation quickly.

Deeper Insights with Visible Conversation Turns

Understanding the full context of interactions is key. You can now view the complete conversation turns associated with your test cases directly within the dashboard. This offers richer context, aiding in more thorough analysis and debugging of your conversational AI products.
Conversation Turns Display in Dashboard

Dashboard Usability Boost

We’re continually refining the Galtea experience. This update brings several UI enhancements across the dashboard, designed to improve overall usability and make your workflow smoother and more intuitive.

Tailor Your Test Generation: Selectable Test Case Counts

Gain more control over your testing process! When generating tests, you can now specify the exact number of test cases you want Galtea to create. This allows you to fine-tune the scope and depth of your tests according to your needs.

Track Your Team’s Work: Creator Attribution Displayed

Clarity in collaboration is important. Now, the user who created a Product, Test, Version, or other key assets will be clearly displayed on their respective details pages. This helps in tracking ownership and contributions within your team.

Enhanced Table Functionality for Easier Data Navigation

Working with data tables in the dashboard is now more efficient:
  • Clear Filter Indicators: Easily see which filters are currently applied to any table.
  • Quick Filter Reset: A new “Clear All Filters” button allows you to reset your view with a single click.
Enjoy these improvements and as always, we welcome your feedback!
2025-05-12
New Conversation Evaluation and Extended Data Generation Capabilities

New Conversation Evaluation Metrics

You can now evaluate conversations using these new metrics:
  • Role Adherence - Assess how well an AI stays within its defined role
  • Knowledge Retention - Measure how effectively information is remembered throughout a conversation
  • Conversation Completeness - Evaluate whether all user queries were fully addressed
  • Conversation Relevancy - Determine if responses remain on-topic and purposeful

Enhanced Security Framework

We’ve significantly improved user access management by implementing an Attribute-Based Access Control (ABAC) strategy, providing more granular control over who can access what within your organization.

Extended Data Generation Capabilities

Our data generation tools have been expanded with:
  • Catalan Language Support - Create synthetic data in Catalan to enhance your multilingual applications
  • Added support for text-based files - Upload your knowledge base in virtually any text-based format including JSON, HTML, Markdown, and more

Improved Test Creation Experience

We’ve enhanced the clarity of threat selection in the Test Creation form. The selection now displays both the threat and which security frameworks that threat covers, making it easier to align your testing with specific security standards.
Improved Threat Selection

Analytics & Navigation Enhancements

  • Reduced Clutter in Analytics Filters - Tests and Versions filtering now only display elements that have been used in an evaluation
  • Streamlined Navigation - Clicking the “input” cell in the evaluations table now navigates directly to the associated Test Case

Bug Fixes & Improvements

We’ve resolved several issues to ensure a smoother experience:
  • Fixed a bug that could trigger an infinite loop in the Test Cases List of the dashboard
  • Addressed multiple small UI glitches and errors throughout the platform
Enjoy these improvements and as always, we welcome your feedback!
2025-05-05
Analytics Upgrades and Red Teaming Test Improvements

Improvements in Red Teaming Tests

  • New “misuse” threat implemented
    Red teaming now incorporates a new threat, misuse: queries that are not necessarily malicious but are out of scope for your specific product. You can now test whether your product successfully blocks these queries by marking “MITRE ATLAS: Ambiguous prompts” in the threat list.
  • Better “data leakage” and “toxicity” tests
    Red teaming tests now make better use of your product metadata to generate the most suitable test cases for “data leakage” and “toxicity”.

Analytics Page Upgrades

We’re continuing to expand the power of the Analytics page! This update introduces:
  • Radar View for Version Comparison
    You can now visualize performance across multiple metrics for a single version using the brand-new radar view. It provides a quick way to understand strengths and weaknesses at a glance.
  • Smarter Metric Filters
    Filters now only show metrics that have actually been used in evaluations—removing unnecessary clutter and making it easier to find relevant data.
  • Graph Tooltips
    Hovering over truncated names now reveals full labels with tooltips, helping you understand graph contents more clearly.
Radar View

SDK Safeguards

We’ve added protections to ensure your SDK integration is as smooth and reliable as possible:
  • Version Compatibility Checks
    If the SDK version you’re using is not compatible with the current API, it will now throw a clear error to prevent unexpected behavior.
  • Update Notifications
    When a new SDK version is available, you’ll get a console message with update information—keeping you in the loop without being intrusive.

Bug Fixes

  • Metric Range Calculation
    Some default metrics were previously displaying inverted scoring scales (e.g., treating 0% as best and 100% as worst). This is now resolved for accurate interpretation.
  • Test Creation Not Possible Through .txt Knowledge Base Files
    Due to a recent refactor, the creation of tests using knowledge base files with .txt extensions was not possible. This has been fixed and you can now create tests using .txt files as the knowledge base again.
2025-04-28
Monitoring and UI Improvements

Monitoring Is Live!

Real-world user interactions with your products can now be fully monitored and analyzed. Using the Galtea SDK, you can trigger evaluations in a production environment and view how different versions perform with real users. Read more here.

Improved Galtea Red Teaming Tests

Our simulation-generated tests have been upgraded to deliver higher-quality outcomes. Red teaming tests can now be directed to validate even more specific aspects of various security standards, such as OWASP, MITRE ATLAS, and NIST. Specifically, we have improved jailbreak attacks and added new financial attacks and toxicity prompts.

New Analytics Page

A completely redesigned analytics page is now available! It features:
  • Enhanced Filtering Capabilities.
  • Improved Data Clarity and Layout.
    New Analytics Image
The new design not only raises the clarity and density of data presentation but also improves your overall user experience.
And with monitoring active, you can see production evaluation results in real time on this page!

User Experience Enhancements

We’re continuously refining the platform based on your feedback. This week’s improvements include:
  • Customizable Evaluations List:
    You can now select which metrics you are interested in, so the evaluations list only shows the ones you need.
  • Enhanced Evaluation List Filtering:
    Easily filter evaluations by versions, evaluations, tests and test groups.
  • Enhanced Test List Filtering:
    Easily filter tests by their group.
  • Smart Table Sorting:
    When you apply a custom sort, the default sort (usually creation date) is automatically disabled.
    Additional Filters
Enjoy the improvements!