Why Descriptions Matter
Your specification descriptions are used by Galtea to:
- Generate test cases that target the specific behavior described
- Generate metrics with tailored judge prompts via AI Metric Generation
- Auto-derive test types — the system chooses Accuracy, Security, or Behavior based on your description
- Provide context to LLM-as-a-judge evaluators when scoring
General Writing Principles
Be Specific and Purpose-Driven
Focus on: what the behavior is, who it affects, and when it applies.
Good:
- “Must refuse to provide personalized investment recommendations, even when users frame the request as hypothetical”
- “Can retrieve and display the user’s current account balance when asked”
Bad:
- “Handles financial questions” (too broad)
- “Uses GPT-4 to process financial data through LangChain” (too technical)
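One way to catch the "too technical" case before submitting a description is a small vocabulary check. The sketch below is a hypothetical helper, not part of Galtea. The term list is an assumption seeded from the examples in this guide and would need extending for your own stack.

```python
import re

# Hypothetical list of implementation terms, seeded from this guide's examples.
IMPLEMENTATION_TERMS = ["GPT-4", "LangChain", "RAG", "vector database", "pipeline"]


def find_implementation_terms(description: str) -> list[str]:
    """Return the implementation terms that appear in a description."""
    found = []
    for term in IMPLEMENTATION_TERMS:
        # Match the term only when it is not embedded in a longer word,
        # ignoring case.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, description, flags=re.IGNORECASE):
            found.append(term)
    return found
```

A non-empty result suggests the description talks about how the system is built and may read better when rephrased around what the user observes.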
Focus on Observable Behavior
Describe what the user sees, not how the system works internally.
Good: “Always includes a disclaimer when discussing financial topics that could be interpreted as advice”
Bad: “The RAG pipeline retrieves context from the compliance vector database before generating responses”
One Expectation Per Specification
Each specification should describe exactly one testable claim. If your description joins several behaviors with “and”, consider splitting it.
Good: Two separate specifications:
- “Can explain basic investment concepts like stocks, bonds, and mutual funds”
- “Can calculate retirement planning scenarios based on user goals”
Bad: One specification that bundles several claims:
- “Can explain investments, calculate retirement plans, compare products, and generate strategies”
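The “and” heuristic above can be automated as a first-pass check. This is a minimal sketch with a hypothetical function name, not a Galtea feature. It only detects the standalone word, so it cannot tell a bundled claim from a harmless list of nouns.

```python
import re


def needs_split_review(description: str) -> bool:
    """Return True if the description contains the standalone word 'and'."""
    # \b keeps words such as "Handles" from matching.
    return re.search(r"\band\b", description, flags=re.IGNORECASE) is not None
```

Treat a positive result as a prompt to re-read the description, not as a verdict. A description such as “Can explain basic investment concepts like stocks, bonds, and mutual funds” also triggers the check, although it still states a single claim.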
Writing by Specification Type
We’ll use this example product throughout:
“A personal finance assistant that provides investment guidance, portfolio analysis, and financial planning for individual investors.”
Capability Specifications
Describe what the product can do: its core functions and the information it can access.
Examples:
- “Can analyze investment portfolios and suggest rebalancing strategies”
- “Can provide market insights and explain financial concepts in simple terms”
- “Can calculate retirement planning scenarios and savings goals based on user-provided data”
- “Can compare investment products (stocks, bonds, ETFs, mutual funds) side by side”
Inability Specifications
Describe what the product cannot do due to hard technical limits: tools it lacks, data it can’t access, actions it can’t perform.
Examples:
- “Cannot execute trades or access brokerage accounts”
- “Cannot provide tax preparation services or file tax returns”
- “Cannot access users’ bank accounts or financial institutions”
- “Cannot store or transmit banking credentials or account numbers”
Policy Specifications
Describe rules the product must follow: refusal rules, mandatory disclaimers, interaction patterns, and behavioral guardrails. Policies are the most powerful specification type because they drive both Security and Behavior test generation.
Security policy examples (drive Security test generation):
- “Must refuse requests for market manipulation strategies or insider information”
- “Must not provide advice on illegal tax avoidance schemes”
- “Must decline to share proprietary trading algorithms”
Behavior policy examples (drive Behavior test generation):
- “If asked for guaranteed returns, must respond that all investments carry risk”
- “Always ends investment advice with a ‘consult a licensed professional’ disclaimer”
- “Always includes a risk warning when discussing aggressive or speculative investments”
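The three specification types can be kept apart in your own tooling with a small data structure. The sketch below is illustrative only: `SpecType`, `Specification`, and `opening_matches_type` are hypothetical names, not the Galtea SDK, and the expected opening words are an assumption drawn from the examples in this guide, not a rule Galtea enforces.

```python
from dataclasses import dataclass
from enum import Enum


class SpecType(Enum):
    CAPABILITY = "capability"
    INABILITY = "inability"
    POLICY = "policy"


@dataclass(frozen=True)
class Specification:
    spec_type: SpecType
    description: str


# Opening words observed in this guide's examples (an assumption, not a rule).
EXPECTED_OPENINGS = {
    SpecType.CAPABILITY: ("Can ",),
    SpecType.INABILITY: ("Cannot ",),
    SpecType.POLICY: ("Must ", "Always ", "If "),
}


def opening_matches_type(spec: Specification) -> bool:
    """Check whether a description opens the way this guide's examples do."""
    return spec.description.startswith(EXPECTED_OPENINGS[spec.spec_type])


specs = [
    Specification(SpecType.CAPABILITY, "Can compare investment products side by side"),
    Specification(SpecType.INABILITY, "Cannot execute trades or access brokerage accounts"),
    Specification(SpecType.POLICY, "Must decline to share proprietary trading algorithms"),
]
```

A mismatch, such as a “Cannot …” description filed as a capability, is a hint that the specification may be filed under the wrong type.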
Testing Your Descriptions
A good specification description should allow someone unfamiliar with your product to:
- Understand what specific behavior is being tested
- Predict what a passing response looks like
- Predict what a failing response looks like
- Know when this rule applies (for policies)
Next Steps
Specification-Driven Evaluations
End-to-end tutorial: create specs, generate metrics, run evaluations.
AI Metric Generation
Let AI generate judge prompts from your specification descriptions.