Specifications are the foundation of Galtea’s AI-powered evaluation workflow. Each specification is a single, testable behavioral expectation — and the quality of your specification descriptions directly impacts the quality of the tests and metrics generated from them. This guide shows you how to write clear, effective specification descriptions for each type: Capability, Inability, and Policy.
Galtea can generate specifications for you. During product creation in the dashboard, the AI analyzes your product description and optionally any documentation you upload (PDFs, docs, markdown — up to 50 files) to automatically generate a full set of specifications. You can review, edit, and refine them before saving. This works even without uploading docs, but documentation produces more targeted specs.
This guide helps you write better descriptions — whether you’re writing from scratch or refining AI-generated ones.

Why Descriptions Matter

Your specification descriptions are used by Galtea to:
  • Generate test cases that target the specific behavior described
  • Generate metrics with tailored judge prompts via AI Metric Generation
  • Auto-derive test types — the system chooses Accuracy, Security, or Behavior based on your description
  • Provide context to LLM-as-a-judge evaluators when scoring
A vague description produces generic tests. A precise description produces targeted, meaningful evaluations.

General Writing Principles

Be Specific and Purpose-Driven

Focus on: what the behavior is, who it affects, and when it applies. Good:
  • “Must refuse to provide personalized investment recommendations, even when users frame the request as hypothetical”
  • “Can retrieve and display the user’s current account balance when asked”
Avoid:
  • “Handles financial questions” (too broad)
  • “Uses GPT-4 to process financial data through LangChain” (too technical)

Focus on Observable Behavior

Describe what the user sees, not how the system works internally.
Good: “Always includes a disclaimer when discussing financial topics that could be interpreted as advice”
Bad: “The RAG pipeline retrieves context from the compliance vector database before generating responses”

One Expectation Per Specification

Each specification should describe exactly one testable claim. If your description uses “and” to join two distinct behaviors, consider splitting it — an “and” inside a list of examples (“stocks, bonds, and mutual funds”) is fine. Good: Two separate specifications:
  • “Can explain basic investment concepts like stocks, bonds, and mutual funds”
  • “Can calculate retirement planning scenarios based on user goals”
Bad: One overloaded specification:
  • “Can explain investments, calculate retirement plans, compare products, and generate strategies”
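The “one expectation” rule can even be checked mechanically before you save a spec. Below is a minimal sketch of such a check — the function name, verb list, and heuristic are illustrative assumptions, not part of the Galtea SDK:

```python
import re

# Hypothetical lint helper (not part of the Galtea SDK): flags descriptions
# that likely bundle several expectations into one specification.
VERBS = ("explain", "calculate", "compare", "generate", "analyze",
         "provide", "retrieve", "suggest", "display")

def flag_overloaded(description: str) -> bool:
    """Heuristic: True if the description joins multiple verb phrases
    with commas or "and" -- a sign it should be split."""
    segments = re.split(r",| and ", description.lower())
    # Count segments containing a capability verb; a plain noun list
    # ("stocks, bonds, and mutual funds") contributes only one.
    verb_segments = sum(
        1 for s in segments
        if any(re.search(rf"\b{v}\b", s) for v in VERBS)
    )
    return verb_segments > 1
```

The overloaded example above trips the check (four verb phrases), while the single-claim examples do not.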

Writing by Specification Type

We’ll use this example product throughout:
“A personal finance assistant that provides investment guidance, portfolio analysis, and financial planning for individual investors.”

Capability Specifications

Describe what the product can do — its core functions and the information it can access.
Start each description with “Can…” or a verb that explains what the product is capable of doing.
Examples:
  • “Can analyze investment portfolios and suggest rebalancing strategies”
  • “Can provide market insights and explain financial concepts in simple terms”
  • “Can calculate retirement planning scenarios and savings goals based on user-provided data”
  • “Can compare investment products (stocks, bonds, ETFs, mutual funds) side by side”

Inability Specifications

Describe what the product cannot do due to hard technical limits — tools it lacks, data it can’t access, actions it can’t perform.
Start with “Cannot…” or equivalent. Focus on plausible expectations the product explicitly does not support.
Examples:
  • “Cannot execute trades or access brokerage accounts”
  • “Cannot provide tax preparation services or file tax returns”
  • “Cannot access users’ bank accounts or financial institutions”
  • “Cannot store or transmit banking credentials or account numbers”

Policy Specifications

Describe rules the product must follow — refusal rules, mandatory disclaimers, interaction patterns, and behavioral guardrails. Policies are the most powerful specification type because they drive both Security and Behavior test generation.
Use “Must…”, “Always…”, “If asked about X, respond with Y” patterns. Be specific about the trigger and the expected behavior.
Security policy examples (drives Security test generation):
  • “Must refuse requests for market manipulation strategies or insider information”
  • “Must not provide advice on illegal tax avoidance schemes”
  • “Must decline to share proprietary trading algorithms”
Behavior policy examples (drives Behavior test generation):
  • “If asked for guaranteed returns, must respond that all investments carry risk”
  • “Always ends investment advice with a ‘consult a licensed professional’ disclaimer”
  • “Always includes a risk warning when discussing aggressive or speculative investments”
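To make the Security/Behavior split concrete, here is a toy classifier in the spirit of the examples above. The keyword lists are my own assumptions for illustration — Galtea’s actual test-type derivation analyzes the full description, not just keywords:

```python
def guess_test_type(description: str) -> str:
    """Toy heuristic mirroring the Security/Behavior policy split.
    Illustration only -- not Galtea's actual derivation logic."""
    lowered = description.lower()
    # Refusal and misuse-prevention language suggests a Security policy.
    security_cues = ("refuse", "must not", "decline", "illegal",
                     "manipulation", "insider", "proprietary")
    if any(cue in lowered for cue in security_cues):
        return "Security"
    # Everything else (disclaimers, required phrasings) reads as Behavior.
    return "Behavior"
```

Refusal-style policies land in Security; disclaimer- and interaction-style policies land in Behavior.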

Testing Your Descriptions

A good specification description should allow someone unfamiliar with your product to:
  1. Understand what specific behavior is being tested
  2. Predict what a passing response looks like
  3. Predict what a failing response looks like
  4. Know when this rule applies (for policies)
If your description passes this test, it will produce high-quality AI-generated tests and metrics.
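Parts of this checklist can be automated as a pre-flight check. A minimal sketch, assuming nothing about Galtea’s API — the openers, jargon list, and thresholds are hypothetical choices for this example:

```python
import re

def lint_spec(description: str, spec_type: str) -> list[str]:
    """Return warnings for a spec description. Heuristic sketch only;
    spec_type is one of "capability", "inability", "policy"."""
    warnings = []
    expected_openers = {
        "capability": ("can ",),
        "inability": ("cannot ", "can't ", "does not "),
        "policy": ("must ", "always ", "never ", "if "),
    }
    lowered = description.lower()
    openers = expected_openers[spec_type]
    if not lowered.startswith(openers):
        warnings.append(f"{spec_type} specs usually start with one of {openers}")
    # Implementation details make behavior unpredictable for a judge.
    jargon = ("rag", "gpt", "langchain", "vector database", "pipeline")
    if any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in jargon):
        warnings.append("describe observable behavior, not internal implementation")
    if len(description.split()) < 5:
        warnings.append("too short to predict passing/failing responses")
    return warnings
```

A description that passes with no warnings is not guaranteed to be good, but one that fails is almost certainly worth rewriting.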

Next Steps

Specification-Driven Evaluations

End-to-end tutorial: create specs, generate metrics, run evaluations.

AI Metric Generation

Let AI generate judge prompts from your specification descriptions.