Zero-Code Trace Ingestion via OpenTelemetry
If your service is already instrumented with OpenTelemetry, you can now send traces to Galtea without writing any Galtea-specific code. The platform exposes an OTLP-compatible trace ingestion endpoint that accepts standard OTLP/JSON payloads from your existing OTel Collector.
When you run a direct inference through Galtea, the platform automatically injects a W3C `traceparent` header into the request to your endpoint. If your service propagates that header internally (as most OTel-instrumented services do), all resulting spans are automatically correlated with the corresponding Galtea inference result. No manual ID passing, no SDK calls for trace creation, just your existing instrumentation working end to end.
For services that are not OTel-instrumented, the Galtea SDK’s `@trace` decorator and `start_trace` context manager continue to work as before. A step-by-step guide covering all tracing methods, including W3C trace context propagation, is available in the documentation.
Structured JSON Inputs
Test case inputs and session contexts now support structured JSON objects in addition to plain text strings. This means you can store complex, multi-field payloads directly in the platform database. Existing string-based inputs continue to work unchanged.
In upcoming releases, structured JSON inputs will be more broadly supported across the platform, including endpoint connections and SDK methods.
SDK: Synchronous Evaluation Polling
The Python SDK now includes `evaluations.wait_for()`, which polls one or more evaluations until they leave `PENDING` status and raises a `TimeoutError` if the deadline is exceeded. This simplifies scripts and CI pipelines that need to block until results are ready.
Normalized JSON Field Match Metric
The JSON Field Match metric now supports a normalized comparison mode that performs case-insensitive and accent-insensitive matching within field values. For example, "Sí" and "SI" are treated as equal. This is useful when evaluating outputs in multilingual contexts or when casing differences are not meaningful.
Editable Test Metadata
Tests now include an editable `metadata` field where you can store free-form notes or custom tracking information. The field is available in the dashboard, the API, and the SDK, giving you a lightweight way to annotate tests without modifying their structure.
Platform Improvements
- Time-of-day filters: Date filters throughout the dashboard now support hour, minute, and second precision, so you can narrow results to a specific time window rather than a full day.
- Augmented column: The test cases table now includes an Augmented column indicating which test cases were generated through AI augmentation.
- Double-click prevention: The Save Accepted button during metric generation no longer triggers duplicate submissions if clicked rapidly.
- Pagination stability: Table pagination no longer resets unexpectedly when navigating between pages via URL.
- Show Details navigation: The Show Details action in entity tables now opens directly to the details tab.
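The accent- and case-insensitive comparison described in the Normalized JSON Field Match section above can be approximated in a few lines of Python with the standard `unicodedata` module. This is a simplified sketch; Galtea’s exact normalization rules may differ:

```python
import unicodedata

def normalize(value: str) -> str:
    """Lowercase and strip accents so "Sí" and "SI" compare equal."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def values_match(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)
```

With this, `values_match("Sí", "SI")` returns `True`, matching the example above.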
Specifications in Analytics
The analytics dashboard now includes a dedicated Specifications section that aggregates evaluation scores per specification, giving you a clear picture of how each policy area is performing across versions. Specification-level radar charts let you compare coverage at a glance, and clicking any specification label filters the entire dashboard to show only metrics linked to that specification.
Version comparison charts have also been improved with click-to-highlight and double-click-to-navigate directly to the evaluations list.
Test Case CSV Export
You can now export your test cases as CSV files directly from the dashboard. Select specific test cases or export all visible ones — the platform generates a CSV with 22 columns covering input, expected output, strategy, language, scores, and more, then provides a direct download link. Use exported data for offline review, spreadsheet analysis, or integration with external workflows.
Documentation and Video Guides
The documentation has been significantly expanded with new video walkthroughs and reference pages:
- Video guides now cover specifications, versions, traces, human evaluation, endpoint connection setup, and AI-powered specification generation — each embedded directly in the relevant docs page
- A new Writing Specifications guide covers how to define specifications for your product — behavioral expectations, policies, and rules that drive metric generation and test creation
- New reference pages for evaluation types, evaluation parameters, and endpoint connection configuration provide deeper detail on evaluation setup
- A runnable evaluation demo covers endpoint-based, specification-filtered, and local-agent evaluation modes
Platform Improvements
- JSON field matching metric: A new deterministic metric that compares top-level JSON fields between actual and expected output, returning the ratio of matching fields as the score. Supports lenient parsing of actual output wrapped in markdown code fences or surrounding text.
- Endpoint connection validation: Before running an evaluation, the platform now validates that all required endpoint connections on the version still exist — preventing runtime failures from deleted connections.
- Credit warning badge: A visual indicator in the sidebar shows your organization’s credit status — amber when credits are running low, red when exhausted — so you always know where you stand before launching evaluations.
- Retry defaults: New endpoint connections default to retries enabled with 3 attempts and exponential backoff, improving reliability out of the box.
- Sidebar overlay: The navigation sidebar now overlays content on hover instead of pushing the page layout, reducing visual disruption while browsing.
- Form and table polish: Clearer validation errors when switching to AI generation mode, a clear button for optional form fields, improved column width handling in tables, better form field reset behavior, and more consistent UI across the dashboard.
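The JSON field matching metric described above can be sketched roughly as follows. This is an illustrative reimplementation, not Galtea’s actual code; the lenient extraction here simply grabs the outermost braces from the actual output:

```python
import json
import re

def json_field_match(actual: str, expected: str) -> float:
    """Compare top-level JSON fields; return the ratio of matching fields."""
    # Leniently extract JSON that may be wrapped in markdown code fences
    # or surrounded by prose.
    found = re.search(r"\{.*\}", actual, re.DOTALL)
    if not found:
        return 0.0
    try:
        actual_obj = json.loads(found.group(0))
        expected_obj = json.loads(expected)
    except json.JSONDecodeError:
        return 0.0
    if not expected_obj:
        return 0.0
    matching = sum(
        1 for key, value in expected_obj.items()
        if key in actual_obj and actual_obj[key] == value
    )
    return matching / len(expected_obj)
```

For an actual output wrapped in a code fence with one of two expected fields correct, this returns 0.5; the real metric additionally supports the normalized comparison mode described above.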
Improved Golden Dataset Generation
The golden dataset generation pipeline has been redesigned. You can expect more diverse questions, more accurate source attribution, and fewer duplicated or low-quality entries in your generated test cases.
Platform Improvements
- Automatic text truncation: Long text in table cells across the dashboard is now automatically clamped to two rows. Click any truncated cell to expand and view the full content rendered as Markdown. This replaces inconsistent truncation behavior across tables and keeps the interface clean without hiding information.
- AI metric saving fix: An issue that prevented non-admin users from saving AI-generated metric candidates has been resolved. All team members can now review and accept metric suggestions without encountering permission errors.
- Evaluation link fix: The evaluation link generated in the SDK code snippets now navigates to the correct tab, ensuring you land on the evaluations view as expected.
AI-Powered Metric Creation
You can now generate evaluation metrics directly from your product’s specifications. Select one or more policy specifications, and the AI analyzes your product description and specification rules to produce ready-to-use metric candidates — complete with judge prompt, evaluation parameters, tags, and evaluator model. You review each candidate, edit anything that needs adjusting, and save the ones you want. Saved metrics are automatically linked back to their source specification. Learn more about the full workflow in the AI Metric Generation documentation.
Data Catalogs for Behavior Tests
Behavior tests can now be grounded in real data. When creating a generated behavior test, you can upload a Data Catalog — a file containing real values like names, IDs, amounts, and dates. The scenario generator injects catalog values into the generated test cases, producing scenarios that reflect your actual data distribution instead of relying on fictional placeholders.
Simplified Test Creation
The test creation form has been reorganized to reduce visual complexity. Optional fields — Custom User Focus, Data Catalog, and Language — are now grouped behind an Advanced Options toggle, keeping the main form focused on required fields and essential choices.
Test Your Endpoint Connection
You can now test your endpoint connection directly from the creation and edit forms. A dedicated test panel sends a sample request to your endpoint and displays the raw response, so you can verify connectivity and response format before saving.
Run Evaluations from Any Level
The evaluation run dialog now supports two new entry points: you can trigger a run from a specific test or from an individual test case, in addition to the existing version, session, and inference result levels. The dialog title, description, and available options adapt to the context you start from. As part of this change, the action has been renamed from “Run Inference” to “Run Evaluation” across the dashboard to better reflect what it does.
SDK: Full Conversation Access in Custom Metrics
The `CustomScoreEvaluationMetric.measure()` method now receives an optional `inference_results` parameter containing all conversation turns for the current session. This lets custom metrics reason over the full conversation history — not just the latest turn — when computing scores. This also fixes an issue where evaluating sessions with custom metrics only fetched the most recent turn, making multi-turn custom metric evaluation unreliable. See the custom metrics tutorial for usage details.
Platform Improvements
- Failed turn tracking: The evaluation engine now detects which conversation turns failed during a simulated conversation and records them in a new `failedTurns` field on each evaluation. Failed turns are visible in the evaluation detail view and included in CSV exports, making it easier to pinpoint where multi-turn conversations break down.
- Clearer metric validation errors: When a metric is missing required evaluation parameters, the error message now lists exactly which parameters are absent — instead of the generic “at least one is missing” message.
- Inference error visibility: When a direct inference call results in an error, the error text is now displayed directly in the inference result view.
- Unified form components: AI-assisted creation forms and table selection dialogs across the dashboard now share a consistent design and interaction model. This also fixes an issue where discarding all AI-generated metric candidates left the review step in a dead end with no navigation options.
Human Evaluators
You can now bring humans into the evaluation loop. Human evaluation lets your team members review and annotate inference results directly from the Galtea platform — alongside the automated LLM-as-a-judge scores you already rely on.
To get started, create a metric with the evaluation type set to Human, assign it to the relevant user group, and invite your evaluators. Each evaluator sees only the inference results assigned to their group and can submit scores through a dedicated annotation interface.
User Groups
User Groups are the organizational layer behind human evaluation. A user group links a set of team members to specific metrics, so you can control who evaluates what. You can manage user groups from the dashboard or programmatically through the User Group Service in the SDK.
AI-Assisted Specification Configuration
Specification creation is now smarter. AI-assisted configuration has been extended to all specification types — not just policies. When you create a new specification, the system analyzes your description and suggests the test type, test variant, and specification type automatically. You can accept the suggestions with a single click or adjust them before saving.
For Policy specifications specifically, the AI also suggests which test type best validates the rule you are defining — helping you choose between `ACCURACY`, `SECURITY`, and `BEHAVIOR` based on the policy’s intent. You can manage specifications from the dashboard or via the Specification Service in the SDK.
Inference Monitoring and Tracing
Direct inference sessions now give you more visibility into what is happening during evaluation runs:
- Session status: Each inference session displays a status indicator — `COMPLETED`, `PENDING`, or `FAILED` — so you can quickly identify which sessions need attention.
- Failed inference visibility: When a direct inference call fails, the error is now surfaced directly in the session view, rather than being silently logged.
- Distributed trace collection: Remote agents can now send traces back to Galtea automatically. Pass the `AgentInput` object to your agent handler, and all `@trace`-decorated calls are linked to the correct session. See Tracing Agent Operations for the full setup guide.
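The linkage can be pictured with a toy stand-in. This is purely illustrative: the real `@trace` decorator and `AgentInput` come from the Galtea SDK and behave differently internally:

```python
import functools

collected_spans: list = []

# Toy stand-in for the SDK's session context. The real SDK derives the
# active session from the AgentInput object passed to your handler.
_current_session = {"session_id": None}

def trace(func):
    """Record each decorated call against the active session (illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        collected_spans.append(
            {"span": func.__name__, "session_id": _current_session["session_id"]}
        )
        return func(*args, **kwargs)
    return wrapper

@trace
def search_knowledge_base(query: str) -> str:
    return f"results for {query!r}"

def agent_handler(agent_input: dict) -> str:
    # A plain dict stands in for the SDK's AgentInput here.
    _current_session["session_id"] = agent_input["session_id"]
    return search_knowledge_base(agent_input["message"])
```

Calling `agent_handler({"session_id": "s-1", "message": "refund policy"})` records a span tagged with session `s-1` — the same kind of correlation the real integration performs for you automatically.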
Metrics and Evaluation
- Flexible text metric validation: Text-based metrics no longer require a strict `answer` key in the response payload, making it easier to integrate custom output formats.
- Version endpoint validation: Starting an evaluation now checks that the selected version has a valid endpoint connection configured, catching misconfigurations before the run begins.
- SDK fix — session custom metrics: An issue where `actual_output` was always `None` when evaluating sessions with custom metrics has been resolved.
- Filter tests by version: You can now filter the test list by the selected version, making it faster to find the tests relevant to your current evaluation run.
- Specification-aware validation: Products are now enriched with their linked specifications before evaluation validation runs, ensuring the evaluation engine has the full context it needs.
Platform Improvements
- Report terminology: Automated reports now use clearer dimension names — Accuracy instead of “Quality” and Security & Safety instead of “Red Teaming” — matching the terminology you see in the dashboard.
- Report visualization: Metric comparison charts now support multi-page groups, so products with more than six metrics per group render correctly without truncation.
- Evaluation run form: The evaluation run form layout has been reorganized for a cleaner workflow when configuring new runs.
- Improved augmentation: The data augmentation pipeline has received quality improvements to produce more diverse and representative test case variations.
- SDK improvements: The SDK now correctly supports JSON request bodies in HTTP DELETE operations and properly escapes JSON string values in inference input templates.
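The template-escaping fix matters because raw user text containing quotes or newlines would otherwise produce invalid JSON. The safe pattern can be illustrated like this (a sketch; the `{{ input }}` placeholder name is an assumption for the example):

```python
import json

def render_input_template(template: str, user_input: str) -> str:
    """Insert user text into a JSON template, escaping quotes and newlines.

    json.dumps produces a quoted, escaped JSON string; [1:-1] drops the
    surrounding quotes so the value slots into the template's own quotes.
    """
    escaped = json.dumps(user_input)[1:-1]
    return template.replace("{{ input }}", escaped)

template = '{"messages": [{"role": "user", "content": "{{ input }}"}]}'
rendered = render_input_template(template, 'She said "hi"\nthen left')
payload = json.loads(rendered)  # parses cleanly despite quotes and newline
```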
Documentation
A new Human Evaluation tutorial walks you through the full workflow — from creating a human metric and setting up user groups to annotating results in the platform. The tutorial includes step-by-step screenshots showing the metric creation form, the evaluator sidebar, and the annotation interface.
Specifications
Specifications are now live — a structured way to define and test the behavioral expectations of your product.
A Specification represents a single, testable claim about what your product should or should not do. Each specification has a type that classifies the expectation:
- Capability — a core function the product can perform
- Inability — something the product fundamentally cannot do, regardless of user input
- Policy — a rule the product must follow, such as refusing certain requests or always adding a disclaimer
Each specification is also assigned a test type (`ACCURACY`, `SECURITY`, or `BEHAVIOR`) that determines how the specification is evaluated.
Specifications replace the legacy free-text Capabilities, Inabilities, and Policies fields that used to live on the Product form. Your behavioral expectations are now discrete, individually traceable, and can be linked directly to the metrics that validate them. You can manage specifications from the dashboard or via the Specification Service in the SDK.
AI-Assisted Configuration
Specification creation is AI-assisted. With a single click, the system suggests the specification type, test type, and test variant based on your description — so you can focus on defining what your product should do, not on configuring how it gets tested.
Improved Report Generation
The automated report generation pipeline has received a set of quality upgrades:
- Realism analysis: Before generating narrative content, the engine now runs a realism check on the underlying data to ensure summaries reflect actual conditions rather than statistical artifacts.
- Pattern analysis: A new pattern analysis step examines the diversity and distribution of your test results, informing the structure of each report section. You can toggle this per generation run.
Better Data Augmentation
Data Augmentation now uses a dedicated diversity model alongside the main generation pass. Instead of producing variations that converge on the same patterns, augmented test cases are explicitly steered toward different scenarios, phrasings, and edge cases — giving you broader coverage from the same seed data.
A demo video has also been added to the Data Augmentation documentation to walk you through the full workflow.
Simplified Metrics
The Metric entity has been streamlined. Legacy fields — including `criteria`, `evaluation_steps`, `test_type`, `user_persona`, and `stopping_criterias` — have been removed from the metric creation form. These were carry-overs from an older architecture that are no longer needed. If you were passing any of these fields through the SDK, you can safely remove them.
Platform Improvements
- 403 Forbidden page: Accessing a route without the required permissions now shows a clear, dedicated error page with an explanatory message, rather than a silent redirect or blank screen.
- Sorting fixes: Table columns that do not support server-side sorting no longer display a sort indicator, removing a common source of confusion in large result sets.
- Evaluation prompts: User prompts in evaluations are now rendered through Jinja2, enabling richer template-based customization.
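For example, a Jinja2-rendered judge prompt can now use conditionals and loops over evaluation parameters. The template below is hypothetical; which variables are available depends on the evaluation parameters configured for your metric:

```jinja
Evaluate the assistant's answer for the following exchange.

Input: {{ input }}
Actual output: {{ actual_output }}
{% if expected_output %}Expected output: {{ expected_output }}{% endif %}
{% if retrieval_context %}Retrieved context:
{% for doc in retrieval_context %}- {{ doc }}
{% endfor %}{% endif %}
```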
Trace Collection During Platform Simulations
When running evaluations via Direct Inference, you can now collect traces from your endpoint and link them back to Galtea automatically.
Add the new `{{ inference_result_id }}` placeholder to your Endpoint Connection input template. Your handler receives the ID, passes it to the SDK’s `set_context`, and all `@trace`-decorated calls made during that request are automatically linked to the correct inference result in Galtea. See Collecting Traces During Direct Inference for the full walkthrough.
Documentation and Examples
Demo videos are now embedded throughout the documentation — covering test generation, Accuracy tests, Security & Safety tests, Behavior tests, endpoint connection setup, and data augmentation — so you can see each workflow in action without leaving the docs.
Automated Report Generation
You can now export your analytics data as a comprehensive PDF report directly from the Galtea dashboard. The report is generated automatically and includes AI-written summaries for each section — covering scope, methodology, product evaluation, and conclusions — alongside dashboard-style visualizations.
One-Click Test Case Augmentation
Scaling your Test Cases just got much simpler. If you have a small base of known-good test cases, you can now augment them with a single click directly from the dashboard. Galtea will automatically generate additional test cases based on your existing ones, letting you quickly expand coverage without manually writing each entry.
New Quality Test Tasks
We have expanded the Quality Test configuration with two new task types: Correction and Other. When selecting Other, you can provide a custom task description to better classify your tests. This makes it easier to define quality evaluations that go beyond the predefined categories.
New Tutorial: Direct Inferences and Evaluations from the Platform
A new step-by-step guide walks you through the full workflow of running inferences and evaluations directly from the Galtea dashboard — no SDK code required. It covers creating an Endpoint Connection, attaching it to a Version, running tests, and reviewing results.
Performance and Reliability
- Faster Evaluations: The evaluation pipeline has been optimized to reduce redundant database lookups. Pre-fetched entities are reused across batch operations, validation order has been improved, and retry logic now groups evaluations by session for significantly faster batch processing.
- Improved Error Messages: The IOU and Spatial Match metrics now return clearer error messages specifying the expected JSON formats for bounding box inputs.
- Jinja2 Template Validation Fix: The conversation simulator template validator no longer raises false positives for valid Jinja2 for-loop patterns in JSON templates, such as the common OpenAI-compatible message format.
- Dashboard Polish: Several UI improvements including better dark mode contrast, refined Pill component styling, and a smarter toast notification system that reduces interruptions for frequent users.
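For reference, the IoU (intersection-over-union) score behind that metric is a standard computation. A minimal sketch, assuming boxes are given as `[x1, y1, x2, y2]` lists (the exact JSON format the metric expects is described by its error messages):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    # Overlap rectangle; width/height clamp to zero when boxes are disjoint.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```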
SDK & Simulation Upgrades
We have significantly improved the flexibility of the Galtea SDK to better fit your existing workflows.
- Unified Simulation Method: The `simulator.simulate` method has been upgraded to handle both single-turn and multi-turn simulations. You no longer need to switch methods based on the test case complexity.
- Simplified Agent Integration: We’ve made the `agent` property much more flexible. You can now pass methods with various signatures directly (including async generators), removing the strict requirement to wrap everything in a specific Agent class.
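An agent written as an async generator might look like the following sketch. The names here are hypothetical, and `collect` stands in for what the simulator does internally when it drains the stream:

```python
import asyncio

# Hypothetical agent written as an async generator: it streams response
# chunks instead of returning a single string.
async def my_agent(user_message: str):
    for chunk in ("Thinking about ", repr(user_message), "... done."):
        yield chunk

# Minimal driver that drains the generator, standing in for the simulator's
# internal handling of such a callable.
async def collect(agent, message: str) -> str:
    return "".join([chunk async for chunk in agent(message)])

reply = asyncio.run(collect(my_agent, "hello"))
# reply == "Thinking about 'hello'... done."
```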
New Platform Features
AI-Powered Product Definition
Bootstrapping a new product context is now faster than ever. You can upload a set of files (documentation, knowledge base, etc.), and Galtea will use AI to automatically generate a comprehensive Product Definition for you.
Advanced Endpoint Connections
For more complex conversational agents, we have refined how connections are established. You can now configure distinct behaviors for:
- The Initialization of the conversation.
- The Conversation messages.
- The Finalization of the conversation.
Single Inference Evaluation
You can now evaluate a single inference result directly from the platform UI. This is perfect for spot-checking model behavior without running a full simulation suite.
Reliability and UX Improvements
- Graceful Metric Handling: Previously, using a metric incompatible with a specific Test Type would throw an error. Now, the system intelligently marks these metrics as “skipped” in the results, allowing the rest of your evaluation to proceed uninterrupted.
- Enhanced Custom Metrics: We have deprecated all DeepEval metrics in favor of our own Custom Metrics architecture. This transition significantly enhances the reliability, speed, and flexibility of your evaluations.
- Test Creation Experience: The Test creation form has been polished for better usability, making it easier to define and organize your test cases.
- Platform Stability: General improvements to platform stability and reliability to ensure a smoother experience during heavy load.
Evaluate Single-Turn Interactions
You can now run evaluations on specific single turns (even when directly creating an Inference Result), rather than being restricted to evaluating full threads or sessions. This granularity allows for more precise analysis, enabling you to pinpoint and score individual exchanges within a larger conversation context without the noise of the surrounding dialogue.
Bulk Actions in Tables
We’ve enhanced table functionality across the dashboard to improve your productivity. You can now select multiple rows to perform actions in bulk, such as deleting multiple Test Cases or cleaning up old Sessions simultaneously.
Stability & Performance
- Metric Stability: We’ve deployed updates to improve the consistency and reliability of several core metrics.
- Platform Resilience: Significant backend optimizations have been implemented to ensure platform stability and maintain low latency, even under periods of high load.
Run Inferences and Evaluations from the Platform
You can now execute inferences and run evaluations directly from the Galtea dashboard. This new capability allows you to quickly test your Product Versions and validate performance without writing a single line of code or switching to your IDE.
This streamlined workflow is perfect for:
- Quick sanity checks on new model versions.
- Running specific test cases ad-hoc.
- Validating changes instantly before full-scale testing.

General Improvements
- Bug Fixes & UX Polish: We’ve addressed various minor bugs and refined the user interface to provide a smoother, more stable experience across the platform.
Standardized Tracing with OpenTelemetry
Galtea Traces have been upgraded to use the OpenTelemetry (OTel) standard. This major infrastructure update aligns our tracing capabilities with industry standards, paving the way for future seamless integrations with your existing observability tools and monitoring stacks. Read more about traces.
New Models Available
We have expanded the selection of models available for evaluating your products. You can now leverage the latest capabilities of:
- Claude-Sonnet-4.5
- GPT-5.2, GPT-5.1, GPT-5, and GPT-5-mini
Revamped Onboarding & Development Modes
We’ve improved the onboarding experience to better match your workflow:
- Use-Case Based Onboarding: The setup process is now split between two clear paths: evaluating a product during Development or monitoring a product in Production with real users.
- Platform View Modes: You can now select your preferred view mode in the platform: Development or Production. This automatically filters the UI to remove unnecessary options, keeping your workspace clean and focused on the task at hand.

General Improvements
- Enhanced Forms: We’ve standardized and improved forms across the platform for better usability and a more consistent experience.
AI-Assisted Product Onboarding
Getting started with Galtea got even simpler. We’ve improved the Product creation flow by integrating AI assistance directly into the product creation form. The system helps you draft comprehensive product descriptions, capabilities, and security boundaries, ensuring your product context is perfectly optimized for our test generation engine from day one.
Enhanced Inference Result Visualization
We have revamped how Inference Result metrics are visualized. The new layout provides a clearer breakdown of performance data, making it easier to correlate specific input/output pairs with their respective scores. This improvement allows for quicker diagnosis of issues within specific conversation turns.
Documentation Refresh
We have reorganized our documentation to improve discoverability and ease of use. The structure is now streamlined into clearer categories, making it easier to find SDK references, conceptual guides, and tutorials. Check out the new docs structure here.
General Improvements & Bug Fixes
- General Bug Fixes: Addressed various minor issues to improve platform stability and performance.
- UI Polish: Minor visual updates across the dashboard for a more consistent user experience.
Enhanced Product Creation Experience
We have completely revamped the onboarding flow for new products. A new, intuitive form is now available to help you create AI-based products faster and more efficiently. This update streamlines the initial setup configuration, improving the overall user experience when onboarding new agents or LLM apps.
Evaluating Tool Usage
As agents become more autonomous, validating how they use tools is just as important as the final answer. We have introduced new capabilities to strictly evaluate tool calls:
New Property in Test Cases
Test cases now accept a new property: `expected_tools`. This allows you to define exactly which tools an agent should invoke during a specific test scenario. Read more about Test Case structure.
New Evaluation Parameters
To support this validation, the evaluation process now accepts two specific parameters regarding tool usage:
- `tools_used`: the actual list of tools invoked by the model.
- `expected_tools`: the ground-truth list of tools that should have been used.
New Metric: Tool Correctness
We have added a specialized metric to our library: Tool Correctness. This metric automatically compares the `tools_used` against the `expected_tools` to determine if the agent selected the right functions to solve the user’s problem. This is critical for ensuring reliability in agentic workflows. See metric details.
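Conceptually, the comparison behaves like this sketch (simplified; the real metric’s handling of ordering, duplicates, and partial credit is not specified here):

```python
def tool_correctness(tools_used, expected_tools):
    """Fraction of expected tools that the agent actually invoked."""
    if not expected_tools:
        return 1.0  # nothing was required (an assumption for this sketch)
    used = set(tools_used)
    hit = sum(1 for tool in expected_tools if tool in used)
    return hit / len(expected_tools)
```

So an agent that called only `search` when both `search` and `calculator` were expected would score 0.5 under this sketch.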
Full Visibility with Agent Tracing
You can now add Traces to your Sessions and Inference Results. This unlocks deep observability into your AI agents, allowing you to understand exactly what they are doing at each step of execution—not just the final output.
Traces capture internal operations such as:
- Tool Calls: API requests, calculations, or data fetching.
- Retrieval Steps: Vector database searches and RAG context retrieval.
- Chain Orchestration: Internal routing and decision-making logic.
- LLM Invocations: Prompts sent to underlying models.

Granular Credit Consumption and Transparency
We’ve updated our credit system to provide more granularity and transparency. You can now see exactly where your credits are being consumed, giving you detailed visibility into your usage patterns and helping you manage resources more effectively.
Platform Robustness and Security
We’ve improved the overall robustness of the platform to ensure higher uptime and reliability. Additionally, we have increased safety by implementing rate limiting on our authentication endpoints, providing better protection against abuse and unauthorized access attempts.
Improved SDK File Validation
We’ve enhanced file validation within the SDK. When uploading files—for example, when creating a Test using `test_file_path` or `ground_truth_file_path`—you will now receive stricter validation and clearer, more descriptive error messages to help you debug issues faster.
Simplified and More Powerful Metric Creation with Partial Prompts
We’ve upgraded how custom judge Metrics are created by introducing a new Partial Prompt method. This approach simplifies the process by letting you focus on the core evaluation logic while Galtea handles the final prompt construction. This not only makes creating custom metrics faster but also significantly increases the quality and consistency of the evaluation results. Learn more about the new method.
Clearer Pass/Fail with Binarized Metric Scores
To provide more decisive validation, we’ve updated three of our key metrics to return a clear binary score (0 for fail, 1 for pass). This change eliminates ambiguous 0.5 scores, making it easier to determine success or failure for critical test cases.
More Robust Quality Test Generation
When creating Quality Tests from the dashboard, we now validate your uploaded files to ensure they are in the expected format. This proactive check helps prevent errors during test generation, leading to a smoother and more reliable workflow.
Platform Enhancements
We’re always working to improve the core platform experience. This week’s updates include:
- Improved Performance: We’ve optimized our underlying LLM usage for better concurrency, scalability, and robustness, resulting in a faster and more reliable platform.
- Enhanced Email Validation: The process for validating user emails has been made more robust to ensure better security and deliverability.
Flexible Sign-In with Google, GitHub, and GitLab
You can now sign in to Galtea using your Google, GitHub, or GitLab accounts! This makes accessing your workspace faster and more secure, providing a seamless single sign-on (SSO) experience alongside our traditional email and password login.
Simplified Metric Versioning with “Legacy” Tag
We’ve introduced a new way to handle updated Metrics. Galtea metrics might be marked as “Legacy,” allowing us to improve them over time with the same name. This approach simplifies versioning by removing the need to embed version numbers in metric names, ensuring that your historical evaluations remain linked to the correct metric version.
Test Generation Transparency and Enhancements
We’re bringing more transparency and efficiency to our test generation process:
- See What’s Under the Hood: The models used to generate a Test are now listed in the test’s details section, providing clearer insight into how your test cases were created.
- Faster Test Generation: We’ve significantly sped up the processing of multiple files within ZIP archives when generating tests from a knowledge base.
- Smarter Test Case Handling: The engine for generating Quality Tests now handles the max_test_cases parameter more effectively.
Fixes & Improvements
- Fixed a bug where file names from ZIP archives were not displayed correctly in the UI.
- Resolved an issue that prevented the processing of ZIP files containing two files with the same name (e.g., folder1/file.txt and folder2/file.txt).
A Fresh New Look for the Dashboard
We’ve polished the dashboard with a subtle but significant UI refresh. Expect updated colors, clearer icons, and more consistent dialogs designed to create a smoother, more intuitive, and visually consistent workflow.
Enhanced Security and Performance
We’ve bolstered the platform’s security with improved user authentication mechanisms to better protect your data. Alongside this, we’ve enhanced how we handle concurrent requests, leading to better performance and reliability, especially during high-traffic periods.
Smarter & Faster Red Teaming Generation
Creating Red Teaming tests is now better than ever. We’ve improved language control during generation, producing more relevant tests even when no specific language is provided. We’ve also optimized the generation process for significantly faster test creation, allowing you to build robust security evaluations more efficiently.
Introducing Partial Prompts: Create Custom Metrics Faster
We’re making it easier than ever to create custom judge metrics with our new Partial Prompt evaluation type. Now, you can focus solely on defining your evaluation criteria and simply select the data parameters you need, like input or retrieval_context. Galtea automatically constructs the full, correctly formatted judge prompt for you, saving you time and reducing complexity. This new approach streamlines the creation of powerful, tailored Metrics, allowing you to concentrate on what matters most: the evaluation logic itself.
Quality Generator with Improved Language Handling
The quality generator no longer suffers from LLM language confusion. The language is now detected for the whole document, and test cases are generated in that language. This ensures more accurate and contextually appropriate test case generation across different language inputs.
Improved Long Context Processing for Quality Test Cases
Our quality generator now intelligently processes long documents, ensuring optimal context is provided for the best quality generation of test cases. This enhancement streamlines the evaluation process and improves the reliability of generated test cases for comprehensive quality assessments.
Platform Experience Enhancements
We’ve also rolled out several quality-of-life improvements and bug fixes for a smoother experience. A notable update is that tables across the platform now refresh instantly after you delete an item with a right-click, making data management more fluid and intuitive.
Define Product Behavior with Policies
Products now accept a set of policies describing how they should behave in specific situations. This allows you to set explicit rules, such as requiring a disclaimer for financial advice or defining standard responses for out-of-scope questions, to ensure consistent and safe model behavior.
Advanced Red Teaming Strategies
We’ve added five new, sophisticated red teaming strategies to help you uncover more complex vulnerabilities. These strategies use your Product Description to create highly contextual and evasive prompts:
- Persuasive Content: Disguises malicious requests as legitimate business tasks.
- Creative Writing: Reframes harmful prompts as creative exercises.
- Data Analysis: Hides malicious intent within analytical or data-generation tasks.
- Bait and Switch: Lowers model defenses with benign queries before introducing the adversarial prompt.
- Empathetic Framing: Uses emotional manipulation to pressure the model into unsafe compliance.
Smarter Test Generation Engine
Our test generation capabilities have been significantly upgraded for both Quality and Red Teaming tests:
- New Quality Test Engine: The engine for generating Quality Tests has been rebuilt to produce higher-quality test cases. It can now also incorporate and validate the new product policies you define.
- Faster Red Teaming Generation: The generation process for Red Teaming Tests is now more reliable and significantly faster, allowing you to build robust security evaluations more efficiently.
Platform Usability and Performance
We’ve rolled out several enhancements to improve your workflow and the platform’s reliability:
- Deletion Confirmations: To prevent accidental data loss, the platform will now ask for confirmation before deleting key entities like Test Cases, Evaluations, and Versions.
- Improved Table Experience: The performance and user experience of data tables and their filters have been optimized for a smoother, more responsive interface.
Platform Simplification and SDK v3.0
We’ve undertaken a major simplification of our core concepts to make the platform more intuitive. Evaluation Tasks are now simply Evaluations, and Metric Types have been renamed to Metrics. The old parent Evaluations entity has been removed entirely. These changes streamline the workflow and clarify the relationship between different parts of Galtea. To support this, we’ve released a new major version of our SDK.
New Conversational Metrics
We’re excited to introduce two new metrics designed specifically for evaluating conversational AI:
- User Objective Accomplished: Evaluates whether the user’s stated goal was successfully and correctly achieved during the conversation.
- User Satisfaction: Assesses the user’s overall experience, focusing on efficiency and sentiment, to gauge their satisfaction with the interaction.
Enhanced Test Case Feedback and Management
Improving test quality is now a more collaborative process. When upvoting or downvoting a Test Case, you can now add a user_score_reason to provide valuable context for your feedback. Additionally, you can now filter test cases by their score directly via the SDK using the user_score parameter in the test_cases.list() method.
Dashboard and SDK Usability Improvements
We’ve rolled out several updates to make your workflow smoother and more efficient:
- Improved Dashboard Navigation: Navigating between related entities like Tests, Test Cases, and Evaluations is now more intuitive. We’ve also adjusted table interactions: you can now single-click a row to select and copy text without navigating away. To see an entity’s details, simply right-click the row.
- Efficient Batch Fetching in SDK: The SDK now allows you to fetch objects by providing a list of IDs (e.g., fetching all Test Cases for a list of Test IDs at once), significantly improving usability for batch operations.
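Taken together, the scoring filter and batch fetching might look like the sketch below. The client here is a stub so the snippet is self-contained; `test_cases.list` and the `user_score` parameter are named in these notes, while the `test_ids` parameter name and the returned record shape are illustrative assumptions, not the SDK's exact API.

```python
# Hypothetical sketch of the described SDK calls, using a stub client so the
# example runs standalone. Field names and signatures are assumptions.

class StubTestCases:
    """Stands in for a `test_cases` SDK resource; returns canned data."""
    _data = [
        {"id": "tc-1", "test_id": "t-1", "user_score": 1},
        {"id": "tc-2", "test_id": "t-1", "user_score": -1},
        {"id": "tc-3", "test_id": "t-2", "user_score": 1},
    ]

    def list(self, user_score=None, test_ids=None):
        rows = self._data
        if user_score is not None:
            rows = [r for r in rows if r["user_score"] == user_score]
        if test_ids is not None:
            rows = [r for r in rows if r["test_id"] in test_ids]
        return rows

test_cases = StubTestCases()
upvoted = test_cases.list(user_score=1)           # filter by score
batch = test_cases.list(test_ids=["t-1", "t-2"])  # batch fetch for several tests
```

The same filtering and batching shape should translate directly to the real client once the stub is swapped out.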
General Improvements
This release also includes numerous performance optimizations and minor UI/UX enhancements across the entire platform to provide a faster and more polished experience.
Create Conversational Scenarios from Your Quality Tests
You can now generate comprehensive Scenario Based Tests directly from your existing Quality Tests. This powerful feature allows you to transform your single-turn, gold-standard test cases into realistic, multi-turn conversational scenarios, significantly accelerating the process of evaluating your AI’s dialogue capabilities.
New Deterministic Metric: URL Validation
We’ve added the URL Validation metric to our deterministic evaluation suite. It ensures that all URLs in your model’s output are safe, properly formatted with HTTPS, and resolvable. It includes strict validation and SSRF protection, making it essential for any application that generates external links.
Platform Enhancements and Performance Boost
We’ve rolled out major performance improvements across the platform, with a special focus on the analytics views, making them faster and more responsive. This update also includes several minor visual fixes, such as ensuring icons render correctly in different themes, to provide a smoother and more polished user experience.
Consistent Sorting and Enhanced Navigation
We’ve made significant usability improvements to the platform. Table sorting is now more consistent and persistent; you can navigate away, refresh the page, and your chosen sort order will remain, ensuring a smoother workflow. Additionally, navigating from an Evaluation is easier than ever. You can now right-click on any evaluation in the table or use the dropdown menu in the details page to jump directly to the related Test Case or Session.
Upvote and Downvote Test Cases
To help teams better curate their test suites, we’ve introduced a voting system for Test Cases. You can now upvote or downvote test cases directly from the Dashboard, providing a quick feedback loop on test case quality and relevance.
SAML SSO Authentication
Organizations can now enhance their security by configuring SAML SSO for authentication. This allows for seamless and secure access to the Galtea platform through your existing identity provider.
Improved Misuse Resilience Metric
The Misuse Resilience metric has been enhanced to accept the full product context. This allows for a more accurate and comprehensive evaluation of your model’s ability to resist misuse by leveraging a deeper understanding of your product’s intended capabilities and boundaries.
New Analytics Filter
The analytics page now includes a filter for Test Case language. This allows you to narrow down your analysis and gain more precise insights into the performance of your multilingual models.
Confidence Scores for Generated Test Cases
We’re introducing confidence scores for all generated Test Cases. This new feature provides a clear indicator of the quality and reliability of each test case, helping you better understand your test suites and prioritize human review efforts. Higher scores indicate greater confidence in the test case’s relevance and accuracy.
Simplified and More Flexible Metric Creation
Creating Metrics is now more intuitive and flexible. Metrics are now directly linked to a specific test category (QUALITY, RED_TEAMING, or SCENARIOS), which simplifies the creation process by tailoring the available parameters to the relevant test category. This change makes it easier to define metrics that are perfectly aligned with your evaluation goals. See the updated creation guide for more details.
Custom Judges for Conversational Evaluation
You can now create your own Custom Judges specifically for evaluating multi-turn conversations. This powerful feature allows you to define complex, stateful evaluation logic that assesses the entire dialogue, enabling you to measure nuanced aspects like task completion, context retention, and persona adherence across multiple turns. Learn more in our guide to evaluating conversations.
Conversation Simulator Enhancements
We’ve added more control and realism to the Conversation Simulator:
- Agent-First Interactions: The simulate method now includes an agent_goes_first parameter, allowing you to test scenarios where the agent initiates the conversation. See the SDK docs.
- Selectable Conversation Styles: When creating a Scenario Based Test, you can now choose a conversation style (written or spoken). This influences the tone and formality of the simulated user’s dialogue, enabling more realistic testing. This is available under the strategies parameter in the test creation method.
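As a rough illustration of the agent-first option, the stub below mimics the call shape. The agent_goes_first parameter name comes from these notes; the method signature and return value are hypothetical, so treat this as a sketch rather than the SDK's real interface.

```python
# Stub simulator illustrating agent_goes_first semantics. Everything except
# the parameter name is an illustrative assumption.

class StubSimulator:
    def simulate(self, session_id, agent_goes_first=False):
        """Return who opens the simulated conversation for this session."""
        opener = "agent" if agent_goes_first else "user"
        return {"session_id": session_id, "first_speaker": opener}

sim = StubSimulator()
result = sim.simulate("sess-1", agent_goes_first=True)
```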
More Realistic Conversation Simulations
Our Conversation Simulator has been significantly upgraded to generate more realistic and human-like interactions. The simulator’s evaluation of stopping criteria is now more precise, ensuring that your multi-turn dialogue tests conclude under the correct conditions and provide more accurate insights into your AI’s conversational abilities. Additionally, conversations generated during simulations are now linked to the specific product version they were created with, allowing for better tracking and traceability.
Bulk Document Uploads for Test Generation
You can now upload multiple files at once using ZIP archives to generate comprehensive test suites. This feature streamlines the process of creating tests from large document collections, saving you time and effort when building out your evaluation scenarios. For more details, see the test creation documentation.
Streamlined Onboarding Experience
Getting started with a new product is now faster than ever. When you create a new product in the Galtea dashboard, an initial version is now created automatically. This simplification streamlines the setup process and helps you get to your first evaluation more quickly.
Enhanced Document Processing
We’ve improved our backend for processing documents used in test generation. This enhancement leads to more accurate and relevant test cases being created from your knowledge bases, improving the overall quality of your evaluations.
Fresh New Look: Galtea Rebranding is Live!
We’re excited to unveil Galtea’s complete visual transformation! Our new branding includes a refreshed logo, updated slogan, and a modern look across the entire platform. Experience the new design on our website and explore the updated dashboard interface. This rebrand reflects our continued commitment to providing a more intuitive and visually appealing user experience.
Enhanced Analytics with Improved Version Comparisons
The Analytics views have received significant improvements to make data analysis more efficient and user-friendly. Key enhancements include:
- Streamlined Chart Comparisons: Charts now support easier side-by-side version comparisons, helping you quickly identify performance differences across iterations.
- Optimized Filter Layout: Filters no longer occupy valuable screen real estate, giving you more space to focus on your analytics data and insights.
- Improved Visual Clarity: Enhanced data presentation makes it easier to interpret results and make informed decisions about your AI systems.
Strengthened Authentication and Security
We’ve implemented more secure and robust authentication mechanisms across the platform. These behind-the-scenes security enhancements provide better protection for your data and ensure reliable access to your Galtea workspace, giving you peace of mind when working with sensitive AI evaluation data.
Advanced PDF Knowledge Extraction
Our document processing capabilities have been significantly improved for PDF files. The enhanced extraction algorithms now provide more accurate and comprehensive knowledge retrieval from PDF documents, making it easier to create relevant test cases and evaluation scenarios from your existing documentation and knowledge bases.
TestCase Scenario Creation and Editing via the Dashboard
You can now create and edit TestCase Scenarios directly through the Dashboard interface. This streamlined workflow allows you to define complex multi-turn conversation scenarios without leaving the platform, making it easier to set up comprehensive testing workflows for your conversational AI systems.
Enhanced TestCase Dashboard with Test-Specific Columns
We’ve improved the TestCase Dashboard with smarter table displays that now only show columns relevant to each specific test. This reduces visual clutter and makes it easier to focus on the information that matters most for your particular testing scenario, whether you’re working with single-turn evaluations, conversation scenarios, or other tests.
New Judge Template Selector for Judge Metrics
When creating Judge metrics, you can now select from pre-built judge templates to accelerate your metric setup process. This feature provides a starting point for common evaluation patterns while still allowing full customization of your evaluation prompts and scoring logic.
Infrastructure Improvements for Event Resilience
We’ve enhanced our event handling infrastructure to provide better resilience against unexpected system events. These improvements help ensure that your tests and evaluations are preserved and continue running smoothly, even during system maintenance or unexpected interruptions.
2025-08-04
Create Custom Judges via Own Prompts in Metric Creation, New Deterministic Metrics and UI Improvements
Create Custom Judges via Your Own Prompts
You can now define Custom Judge metrics by crafting your own evaluation prompts during metric creation. This allows you to encode your domain-specific rubrics, product constraints, or behavioral guidelines directly into the metric, giving you precise control over how LLM outputs are assessed. Simply write your prompt, specify the scoring logic, and Galtea will leverage LLM-as-a-judge techniques to evaluate outputs according to your standards.
New Deterministic Metrics
Four new deterministic metrics are now available:
- Text Similarity: Quantifies how closely two texts resemble each other using character-level fuzzy matching.
- Text Match: Checks if generated text is sufficiently similar to a reference string, returning a pass/fail based on a threshold.
- Spatial Match: Verifies if a predicted box aligns with a reference box using IoU scoring, producing a pass/fail result.
- IoU (Intersection over Union): Computes the overlap ratio between predicted and reference boxes for alignment and detection tasks.
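For reference, IoU for axis-aligned boxes reduces to a few lines. This is a generic sketch of the standard formula, not the platform's implementation, and it assumes boxes are given as (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2.

```python
# Minimal Intersection over Union for axis-aligned boxes (x1, y1, x2, y2).
# A generic illustration of the formula; box format is an assumption.

def iou(a, b):
    """Overlap area divided by union area; 0.0 when boxes are disjoint."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A pass/fail metric like Spatial Match then amounts to comparing this score against a threshold.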
Dashboard Redesign: Second Iteration
We’ve launched the second iteration of our redesigned dashboard with a refreshed visual language focused on clarity and usability. Key improvements include:
- Modern Forms: Forms have been modernized to provide a more intuitive and visually appealing experience and give the Dashboard a more professional look.
Expanded Evaluator Model Support
We’ve added support for more evaluator models to enhance your evaluation capabilities:
- Gemini-2.5-Flash: Google’s latest high-performance model optimized for speed and accuracy
- Gemini-2.5-Flash-Lite: A lightweight variant offering faster processing with efficient resource usage
- Gemini-2.0-Flash: Google’s established model providing reliable evaluation performance
Enhanced Conversation Simulation
Testing conversational AI just got more powerful with two major improvements:
- Visible Stopping Reasons: You can now see exactly why simulated conversations ended in the dashboard, providing crucial insights into dialogue flow and helping you identify areas for improvement.
- Custom User Persona Definitions: Create highly specific user personas when generating Scenario Based Tests. Define detailed user backgrounds, goals, and behaviors to test how your AI handles diverse user interactions more effectively.
Classic NLP Metrics Now Available
We’ve expanded our metric library with three essential deterministic metrics for precise text evaluation:
- BLEU: Measures n-gram overlap between generated and reference text, ideal for machine translation and constrained generation tasks.
- ROUGE: Evaluates summarization quality, for example by measuring the longest common subsequence (ROUGE-L) between candidate and reference summaries.
- METEOR: Assesses translation and paraphrasing by aligning words using exact matches, stems, and synonyms for more nuanced evaluation.
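To make the BLEU intuition concrete, the sketch below computes clipped n-gram precision, the core ingredient of BLEU. Full BLEU combines several n-gram orders with a brevity penalty, and ROUGE and METEOR use different alignment rules entirely; this is only the simplest building block.

```python
# Clipped n-gram precision, the core ingredient of BLEU. Each candidate
# n-gram is credited at most as many times as it appears in the reference.
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    if not cand_ngrams:
        return 0.0
    cand_counts, ref_counts = Counter(cand_ngrams), Counter(ref_ngrams)
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / len(cand_ngrams)
```

The clipping step is what stops degenerate outputs like repeated words from scoring perfectly.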
Enhanced Red Teaming with Jailbreak Resilience v2
Security testing gets an upgrade with Jailbreak Resilience v2, an improved version of our jailbreak resistance metric. This enhanced evaluation provides a more comprehensive assessment of your model’s ability to resist adversarial prompts and maintain safety boundaries across various attack vectors.
Dashboard Redesign: First Iteration
We’ve launched the first iteration of our redesigned dashboard with a refreshed visual language focused on clarity and usability. Key improvements include:- Modern Typography: Cleaner, more readable text throughout the platform
- Refined UI Elements: Updated buttons, cards, and form elements with reduced rounded corners for a more contemporary look
- Streamlined Tables: Enhanced data presentation with improved content layout
- Updated Login Experience: A more polished and user-friendly authentication flow
Improved SDK Documentation
We’ve enhanced our SDK documentation with clearer guidance on defining evaluator models for metrics, making it easier to configure and customize your evaluation workflows.
Test Your Chatbots with Simulated Conversations
It is now possible to generate tests that simulate realistic, multi-turn user interactions. Our new Scenario Based Tests allow you to define user personas and goals to evaluate how well your conversational AI handles complex dialogues. This feature is powered by the Conversation Simulator, which programmatically runs these scenarios to test dialogue flow, context handling, and task completion. Get started with our new Simulating User Conversations tutorial.
New Red Teaming Metric: Data Leakage
We’ve added the Data Leakage metric to our suite of Red Teaming evaluations. This metric assesses whether your LLM returns content that could contain sensitive information, such as PII, financial data, or proprietary business data. It is crucial for ensuring your applications are secure and privacy-compliant.
Enhanced Metric Management
We’ve rolled out several improvements to make metric creation and management more powerful and intuitive:
- Link Metrics to Specific Models: You can now associate a Metric with a specific evaluator model (e.g., “GPT-4.1”). This ensures consistency across evaluation runs and allows you to use specialized models for certain metrics.
- Simplified Custom Scoring: We’ve introduced a more streamlined method for defining and calculating scores for your own deterministic metrics using the CustomScoreEvaluationMetric class. This makes it easier to integrate your custom, rule-based logic directly into the Galtea workflow. Learn more in our tutorial on evaluating with custom scores.
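A custom deterministic scorer is ultimately just a rule-based function. The sketch below shows one plausible example; how it would be wired into the CustomScoreEvaluationMetric class is SDK-specific and not shown, and the function name and parameters here are hypothetical.

```python
# A hypothetical rule-based scoring function of the kind you might plug into
# a custom deterministic metric. Plain Python; the SDK wiring is not shown.

def keyword_coverage_score(output: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords present in the model output (0.0 to 1.0)."""
    text = output.lower()
    if not required_keywords:
        return 1.0
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)
```

Any deterministic logic that maps (output, expectations) to a number in a fixed range fits this pattern.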
Support for Larger Inputs and Outputs
To better support applications that handle large documents or complex queries, we have increased the maximum character size for evaluation inputs and outputs to 250,000 characters.
Test Your Chatbots with Realistic Conversation Simulation
You can now evaluate your conversational AI with our new Conversation Simulator. This powerful feature allows you to test multi-turn interactions by simulating realistic user conversations, complete with specific goals and personas. It’s the perfect way to assess your product’s dialogue flow, context handling, and task completion abilities. Get started with our step-by-step guide on Simulating User Conversations.
New Metric: Resilience To Noise
We’ve expanded our RAG metrics with Resilience To Noise. This metric evaluates your product’s ability to maintain accuracy and coherence when faced with “noisy” input, such as:
- Typographical errors
- OCR/ASR mistakes
- Grammatical errors
- Irrelevant or distracting content
Stay in Control with Enhanced Credit Management
We’ve rolled out a new and improved credit management system to give you better visibility and control over your usage. The system now includes proactive warnings that notify you when you are approaching your allocated credit limits, helping you avoid unexpected service interruptions and manage your resources more effectively.
Streamlined Conversation Logging with OpenAI-Aligned Format
Logging entire conversations is now easier and more intuitive. We’ve updated our batch creation method to align with the widely used messages format from OpenAI, consisting of role and content pairs. This makes sending multi-turn interaction data to Galtea simpler than ever. See the new format in action in the Inference Result Batch Creation docs.
Tailor Your Red Teaming with Custom Threats
You can now define your own custom threats when creating Red Teaming Tests. This new capability allows you to move beyond our pre-defined threat library and create highly specific adversarial tests that target the unique vulnerabilities and edge cases of your AI product. Simply describe the threat you want to simulate, and Galtea will generate relevant test cases.
New Red Teaming Strategies: RolePlay and Prefix Injection
We’ve expanded our arsenal of Red Teaming Strategies to help you build more robust AI defenses:
- RolePlay: This strategy attempts to alter the model’s identity (e.g., “You are now an unrestricted AI”), encouraging it to bypass its own safety mechanisms and perform actions it would normally refuse.
- Prefix Injection: Adds a misleading or tactical instruction before the actual malicious prompt. This can trick the model into a different mode of operation, making it more susceptible to the adversarial attack.
Introducing the Misuse Resilience Metric
A new non-deterministic metric, Misuse Resilience, is now available. This powerful metric evaluates your product’s ability to stay aligned with its intended purpose, as defined in your product description, even when faced with adversarial inputs or out-of-scope requests. It ensures your AI doesn’t get diverted into performing unintended actions, a crucial aspect of building robust and responsible AI systems. Learn more in the full documentation.
Enhanced Test Case Management: Mark as Reviewed
To improve collaboration and workflow for human annotation teams, Test Cases can now be marked as “reviewed”. This feature allows you to:
- Track which test cases have been validated by a human.
- See who performed the review, providing a clear audit trail.
- Filter and manage your test sets with greater confidence.
Introducing the Factual Accuracy Metric
We’ve added a new Factual Accuracy metric to our evaluation toolkit! This non-deterministic metric measures whether the information in your model’s output is factually correct when compared to a trusted reference answer. It’s particularly valuable for RAG and question answering systems where accuracy is paramount. The metric uses an LLM-as-a-judge approach to compare key facts between your model’s output and the expected answer, helping you catch hallucinations and ensure your AI provides reliable information to users. Read the full documentation here.
Enhanced Red Teaming with New Attack Strategies
Our red teaming capabilities just got more sophisticated! We’ve added two powerful new attack strategies:
- Biblical Strategy: Transforms adversarial prompts into biblical/ancient scripture style, using poetic and symbolic language to disguise malicious intent while preserving meaning.
- Math Prompt Strategy: Encodes harmful requests into formal mathematical notation using group theory concepts to obscure the intent from standard text analysis.
Smarter Red Teaming Test Generation
We’ve significantly improved how red teaming tests are generated. Our system now takes even more factors into account when creating adversarial test cases:
- Product-Aware Generation: Tests are now more precisely tailored to your specific product’s strengths, weaknesses, and operational boundaries.
- Context-Sensitive Attacks: The generation process better understands your product’s intended use cases to craft more relevant and challenging scenarios.
- Enhanced Threat Modeling: Our algorithms now consider a broader range of factors when determining the most effective attack vectors for your particular AI system.
Better Metric Source Visibility and Management
Understanding where your metrics come from is now easier than ever! We’ve enhanced the platform to provide clearer visibility into metric sources:
- Source Classification: All metrics are now clearly labeled with their source, whether they’re from established frameworks, custom Galtea implementations, or other origins.
- Enhanced Filtering: You can now filter metrics by their source to quickly find the evaluation criteria that best fit your needs.
- Improved Descriptions: Metric descriptions now include more detailed information about their origins and implementation, with links to relevant documentation.

- Galtea: Custom metrics designed specifically for your needs, like our new Factual Accuracy metric
- G-Eval: Framework-based metrics that use evaluation criteria or steps for assessment
- Established Frameworks: Metrics adapted from proven evaluation libraries and methodologies
Generate Quality Tests from Examples
You can now create Quality Tests directly from your own examples using the new Few Shots parameter. This makes it easier to tailor tests to your specific use cases and ensure your models are evaluated on the scenarios that matter most. Learn more about test creation.
Metric Tags
Metrics now support tags for easier classification and discovery. Quickly find and organize metrics relevant to your projects. See all metrics.
Enhanced Product Details
Products now include new detail fields:
- Capabilities: What your product can do.
- Inabilities: Known limitations.
- Security Boundaries: Define the security scope and constraints.
Improved Q&A Generation
Question-answer pairs are now generated with improved accuracy and clarity, thanks to better text filtering and processing.
New Guide: Setting Up Your Product Description
We’ve created a comprehensive guide to help you set up your product descriptions effectively. This guide covers best practices, examples, and tips to ensure your product is presented in the best light. Check it out here.
General Improvements
We’ve made various bug fixes and UX/UI improvements across the Dashboard, SDK, and more, making your experience smoother and more reliable.
Major Platform Overhaul
We’ve been hard at work reorganizing and expanding the Galtea platform to handle more use cases and prepare for exciting future features. This release brings significant improvements to the dashboard, SDK, and test generation.
Dashboard Enhancements
- Reorganized Version, Test, and Evaluation Views: Detailed views have been streamlined and improved for clearer insights.
- New Sessions Visualizations: Easily organize and navigate conversations through our new Sessions feature.
- Evaluations Visualization Removed: The dashboard now focuses on Sessions and Evaluations as the primary elements.
- Better Filters Across Tables: Quickly find what you need with improved filtering capabilities on the dashboard.
- General Bug Fixes & UX Improvements: Enjoy smoother interactions, clearer tooltips, and more intuitive code snippets.
SDK v2 Released
The new Galtea SDK v2 is here! It includes breaking changes to simplify workflows and add session support.
- Implicit Session Creation: Sessions are created automatically when needed for evaluations.
- Repurposed evaluations.create(): The old method is replaced by create_single_turn() for test-based evaluations, while create() now exclusively handles session-based evaluations.
- New evaluations.create_single_turn() Method: Use this for single-turn test cases. It now requires version_id instead of evaluation_id.
- Simplified Version Creation: The galtea.versions.create() method now accepts all properties directly; no need for an optional_props dictionary.
- Sessions Support: Group multiple inference results under a single session for better multi-turn tracking using galtea.sessions.create().
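Put together, the v2 flow described above might look like the sketch below. The method names (versions.create, sessions.create, evaluations.create_single_turn) come from these notes, while the stub client and the specific parameter names shown are illustrative assumptions, not the SDK's exact signatures.

```python
# Hypothetical shape of the SDK v2 workflow, shown with a stub client so the
# snippet runs standalone. Parameter names are illustrative assumptions.

class StubResource:
    def __init__(self, kind):
        self.kind = kind
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return {"id": f"{self.kind}-1", **kwargs}

    def create_single_turn(self, **kwargs):
        self.calls.append(kwargs)
        return {"id": f"{self.kind}-st-1", **kwargs}

class StubGaltea:
    def __init__(self):
        self.versions = StubResource("version")
        self.sessions = StubResource("session")
        self.evaluations = StubResource("evaluation")

galtea = StubGaltea()
# Version properties are passed directly (no optional_props dict).
version = galtea.versions.create(product_id="p-1", name="v2")
# Sessions group multi-turn inference results.
session = galtea.sessions.create(version_id=version["id"])
# Single-turn, test-based evaluations take version_id instead of evaluation_id.
result = galtea.evaluations.create_single_turn(
    version_id=version["id"], test_case_id="tc-1"
)
```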
Improved Test Case Generation
- Smarter Test Coverage: Test cases are now distributed more intelligently across your documents for better coverage based on the number of questions you choose to generate.
- Single Threat per Red Teaming Test: Red Teaming Tests now allow only a single threat per test, ensuring clearer results.
Improved Test Generation
Our test generation capabilities have been significantly upgraded:
- Versatile Red Teaming: Red Teaming tests are now more powerful, allowing you to employ multiple attack strategies to thoroughly probe your AI’s defenses.
- Better Synthetic Data: We’ve made general improvements to the quality of synthetic data generation, ensuring your tests are more effective and realistic.
Code Snippets Now Available on the Dashboard
We’re making it easier than ever to integrate Galtea into your development process!
- Simplified Evaluation Setup: The “Create Evaluation” form on the dashboard has been replaced with a convenient code snippet. Simply copy and paste it directly into your project to get started.
- Streamlined Creation: Similarly, a new code snippet for “Create Evaluation” is now available on the dashboard, simplifying how you send evaluation data to Galtea. You can easily copy and paste this into your project.

Usability Improvements
We’ve also rolled out several usability enhancements based on your feedback:
- Enhanced Readability in Tables: Table cells now correctly render line breaks, making it easier to view multi-line content and detailed information at a glance.
- Controlled Test Case Generation: To ensure optimal performance and manageability, the maximum number of test cases automatically generated for a single test from a knowledge base is now capped at 1000.
Streamlined Onboarding and Quicker Starts
We’ve revamped the platform onboarding! It’s now more visually intuitive, and to help new users start evaluating in no time, we now provide a default Metric and a default Test. This makes it easier than ever to get started with Galtea and run your first evaluation quickly.
Deeper Insights with Visible Conversation Turns
Understanding the full context of interactions is key. You can now view the complete conversation turns associated with your test cases directly within the dashboard. This offers richer context, aiding in more thorough analysis and debugging of your conversational AI products.
Dashboard Usability Boost
We’re continually refining the Galtea experience. This update brings several UI enhancements across the dashboard, designed to improve overall usability and make your workflow smoother and more intuitive.
Tailor Your Test Generation: Selectable Test Case Counts
Gain more control over your testing process! When generating tests, you can now specify the exact number of test cases you want Galtea to create. This allows you to fine-tune the scope and depth of your tests according to your needs.
Track Your Team’s Work: Creator Attribution Displayed
Clarity in collaboration is important. Now, the user who created a Product, Test, Version, or other key asset is clearly displayed on its respective details page. This helps in tracking ownership and contributions within your team.
Enhanced Table Functionality for Easier Data Navigation
Working with data tables in the dashboard is now more efficient:
- Clear Filter Indicators: Easily see which filters are currently applied to any table.
- Quick Filter Reset: A new “Clear All Filters” button allows you to reset your view with a single click.
New Conversation Evaluation Metrics
You can now evaluate conversations using these new metrics:
- Role Adherence - Assess how well an AI stays within its defined role
- Knowledge Retention - Measure how effectively information is remembered throughout a conversation
- Conversation Completeness - Evaluate whether all user queries were fully addressed
- Conversation Relevancy - Determine if responses remain on-topic and purposeful
Enhanced Security Framework
We’ve significantly improved user access management by implementing an Attribute-Based Access Control (ABAC) strategy, providing more granular control over who can access what within your organization.
Extended Data Generation Capabilities
Our data generation tools have been expanded with:
- Catalan Language Support - Create synthetic data in Catalan to enhance your multilingual applications
- Added support for text-based files - Upload your knowledge base in virtually any text-based format including JSON, HTML, Markdown, and more
Improved Test Creation Experience
We’ve enhanced the clarity of threat selection in the Test Creation form. The selection now displays both the threat and which security frameworks that threat covers, making it easier to align your testing with specific security standards.
Analytics & Navigation Enhancements
- Reduced Clutter in Analytics Filters - Tests and Versions filtering now only display elements that have been used in an evaluation
- Streamlined Navigation - Clicking the “input” cell in the evaluations table now navigates directly to the associated Test Case
Bug Fixes & Improvements
We’ve resolved several issues to ensure a smoother experience:
- Fixed a bug that could trigger an infinite loop in the Test Cases List of the dashboard
- Addressed multiple small UI glitches and errors throughout the platform
Improvements in Red Teaming Tests
- New “misuse” threat implemented: Red teaming now incorporates a new threat, misuse, covering queries that are not necessarily malicious but are out of scope for your specific product. You can now test whether your product successfully blocks these queries by selecting “Mitre Atlas: Ambiguous prompts” in the threat list.
- Better “data leakage” and “toxicity” tests: Red teaming tests now make better use of your product metadata to generate the most suitable test cases for “data leakage” and “toxicity”.
Analytics Page Upgrades
We’re continuing to expand the power of the Analytics page! This update introduces:
- Radar View for Version Comparison: You can now visualize performance across multiple metrics for a single version using the brand-new radar view. It provides a quick way to understand strengths and weaknesses at a glance.
- Smarter Metric Filters: Filters now only show metrics that have actually been used in evaluations, removing unnecessary clutter and making it easier to find relevant data.
- Graph Tooltips: Hovering over truncated names now reveals full labels with tooltips, helping you understand graph contents more clearly.

SDK Safeguards
We’ve added protections to ensure your SDK integration is as smooth and reliable as possible:
- Version Compatibility Checks: If the SDK version you’re using is not compatible with the current API, it will now throw a clear error to prevent unexpected behavior.
- Update Notifications: When a new SDK version is available, you’ll get a console message with update information, keeping you in the loop without being intrusive.
Bug Fixes
- Metric Range Calculation: Some default metrics were previously displaying inverted scoring scales (e.g., treating 0% as best and 100% as worst). This is now resolved for accurate interpretation.
- Test Creation Not Possible with .txt Knowledge Base Files: Due to a recent refactor, tests could not be created from knowledge base files with .txt extensions. This has been fixed, and you can now create tests using .txt files as the knowledge base again.
Monitoring Is Live!
Real-world user interactions with your products can now be fully monitored and analyzed. Using the Galtea SDK, you can trigger evaluations in a production environment and view how different versions perform with real users. Read more here.
Improved Galtea Red Teaming Tests
Our simulation-generated tests have been upgraded, delivering higher-quality outcomes. Red teaming tests can now be directed to validate even more specific aspects of various security standards, such as OWASP, MITRE ATLAS, and NIST. Specifically, we have improved jailbreak attacks and added new financial attacks and toxicity prompts.
New Analytics Page
A completely redesigned analytics page is now available! It features:
- Enhanced Filtering Capabilities.
- Improved Data Clarity and Layout.

And with monitoring active, you can see production evaluation results in real time on this page!
User Experience Enhancements
We’re continuously refining the platform based on your feedback. This week’s improvements include:
- Customizable Evaluations List: You can now select which metrics you are interested in, so the evaluations list only shows the ones you need.
- Enhanced Evaluation List Filtering: Easily filter evaluations by versions, evaluations, tests, and test groups.
- Enhanced Test List Filtering: Easily filter tests by their group.
- Smart Table Sorting: When you apply a custom sort, the default sort (usually creation date) is automatically disabled.