Prerequisites
- A product set up in Galtea. You need a product with at least one version, and some way to produce sessions and inference results against it, either by running your app and logging to Galtea or by connecting your endpoint directly. If you have not set this up yet, start with the Quickstart.
- Some metrics to evaluate against. The loop is only as good as what you measure. The recommended path is specification-driven evaluation: you describe what your product should do, cannot do, and must follow, and Galtea generates LLM-as-a-judge metrics from those specifications. You can also link metrics manually. See specification-driven evaluations.
- An AI agent with the Galtea Agent Skill installed. You can use any Agent-Skills-compatible assistant; we will use Claude. Install the skill by pointing your agent at the repository; it drives the galtea CLI underneath. See the Agent Skill docs for all installation options.
- The Galtea CLI authenticated (optional but useful). The skill installs and drives the CLI for you, but it helps to know it is there if you ever want to run a command by hand. Grab a gsk_* key from the settings page and make it available to the CLI (for example, export GALTEA_API_KEY=gsk_...).
The Concept
The workflow is a loop with four moves:
- Evaluate your current version against your metrics.
- Diagnose — let the agent pull the failing evaluations and find the patterns.
- Improve — change the prompt, the retrieval, the tools, whatever the diagnosis points to, and cut a new version.
- Compare — re-run the evaluations on the new version and check it improved without regressing.
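The control flow of the four moves can be sketched in Python. This is only an illustration of the loop's shape: in practice you and the agent carry out each move, and the evaluate, diagnose, and improve callables, the score dictionaries, and the max_passes limit are hypothetical placeholders, not part of Galtea.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # assumed shape: metric name -> average score


def run_loop(
    version: str,
    evaluate: Callable[[str], Scores],
    diagnose: Callable[[Scores], List[str]],
    improve: Callable[[str, List[str]], str],
    max_passes: int = 5,
) -> str:
    """Iterate evaluate, diagnose, improve, compare; return the last accepted version."""
    scores = evaluate(version)  # Evaluate the current version
    for _ in range(max_passes):
        findings = diagnose(scores)  # Diagnose: collect failure patterns
        if not findings:
            break  # nothing obvious left to fix
        candidate = improve(version, findings)  # Improve: cut a new version
        candidate_scores = evaluate(candidate)
        # Compare: accept the candidate only if no metric got worse.
        if any(candidate_scores[m] < scores[m] for m in scores):
            break  # regression: stop and investigate by hand
        version, scores = candidate, candidate_scores
    return version


# Toy stand-ins so the sketch runs end to end; all values are made up.
fake_results = {
    "v1": {"faithfulness": 0.5},
    "v2": {"faithfulness": 0.8},
    "v3": {"faithfulness": 0.95},
}
best = run_loop(
    "v1",
    evaluate=lambda v: fake_results[v],
    diagnose=lambda s: [m for m, x in s.items() if x < 0.9],
    improve=lambda v, findings: f"v{int(v[1:]) + 1}",
)
print(best)
```

The sketch stops as soon as a candidate regresses on any metric, which mirrors the intent of the compare move: a new version is only worth keeping if it improves without making something else worse.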
The Example Application
Our example is a support chatbot that answers user questions over a product knowledge base. It retrieves relevant docs, feeds them to an LLM with a system prompt, and returns an answer. Each interaction is logged to Galtea as a session, with the user message and the assistant answer as an inference result, and the retrieved chunks attached as retrieval context. We already have a handful of sessions on version v1: some clean answers, some that hallucinated, some that refused things they should have handled. That is enough to start the loop.
Step 1 — Run a Baseline Evaluation
First, get a score for where the current version stands: ask the agent to evaluate it against your metrics.
Step 2 — Diagnose: What Is Failing, and Why
This is where the agent earns its keep. Instead of clicking through evaluations one by one, ask it to summarize the failures and look for patterns. In our example, the failures fall into three buckets:
- Unsupported answers — the bot answered confidently when the retrieved context did not actually contain the answer (a grounding/faithfulness failure).
- Off-scope responses — the bot tried to help with requests it should have deflected, or made claims it could not verify.
- Weak retrieval — the right document existed but was not retrieved, so the answer was built on the wrong chunks.
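The grouping the agent does here can be pictured with a few lines of Python. This is a minimal sketch, not Galtea's API: the record fields (session, metric, score), the metric names, the metric-to-bucket mapping, and the 0.5 pass threshold are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical evaluation records; field and metric names are assumptions,
# not the shape Galtea returns.
evaluations = [
    {"session": "s1", "metric": "faithfulness", "score": 0.2},
    {"session": "s2", "metric": "faithfulness", "score": 0.9},
    {"session": "s3", "metric": "scope_adherence", "score": 0.1},
    {"session": "s4", "metric": "context_relevance", "score": 0.3},
    {"session": "s5", "metric": "faithfulness", "score": 0.4},
]

# Assumed mapping from metric name to the failure buckets named above.
BUCKETS = {
    "faithfulness": "unsupported answers",
    "scope_adherence": "off-scope responses",
    "context_relevance": "weak retrieval",
}

PASS_THRESHOLD = 0.5  # assumed cut-off between passing and failing


def bucket_failures(records, threshold=PASS_THRESHOLD):
    """Group sessions that scored below the threshold by failure bucket."""
    buckets = defaultdict(list)
    for record in records:
        if record["score"] < threshold:
            bucket = BUCKETS.get(record["metric"], "other")
            buckets[bucket].append(record["session"])
    return dict(buckets)


failure_buckets = bucket_failures(evaluations)
# Print the largest bucket first so the most common failure stands out.
for name, sessions in sorted(failure_buckets.items(), key=lambda kv: -len(kv[1])):
    print(f"{name}: {len(sessions)} failing ({', '.join(sessions)})")
```

Sorting by bucket size is a deliberate choice: the biggest bucket is usually the first candidate for a fix in the next step.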
Step 3 — Map Failures Back to Your Product
Now turn the diagnosis into a concrete change. Ask the agent to connect each failure bucket to a likely cause in your system:
- Unsupported answers → no instruction in the system prompt to ground answers strictly in the retrieved context and to say “I don’t know” when it is missing.
- Off-scope responses → no scope guardrail describing what the bot should and should not handle.
- Weak retrieval → a retrieval problem, not a prompt problem: too few chunks retrieved, or a chunking/embedding issue worth a separate look.
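For the two prompt-level causes, the fix might look something like the sketch below. The rule wording, the base prompt, and the build_system_prompt helper are hypothetical and only illustrate where a grounding rule and a scope guardrail would sit; the retrieval issue is left out because it is not a prompt problem.

```python
from typing import List

# Hypothetical wording; adapt it to your product's voice and scope.
GROUNDING_RULE = (
    "Answer only from the context provided below. "
    "If the context does not contain the answer, say \"I don't know\"."
)
SCOPE_GUARDRAIL = (
    "You handle questions about the product and its documentation. "
    "Politely decline anything else and do not make claims you cannot verify."
)


def build_system_prompt(base_prompt: str, retrieved_chunks: List[str]) -> str:
    """Combine the base prompt, both rules, and the retrieved context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    if not context:
        # Make the missing context explicit so the grounding rule can apply.
        context = "(no context retrieved)"
    return "\n\n".join(
        [base_prompt, GROUNDING_RULE, SCOPE_GUARDRAIL, f"Context:\n{context}"]
    )


prompt = build_system_prompt(
    "You are a support assistant for our product.",
    ["Password resets are available under Settings > Security."],
)
print(prompt)
```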
Step 4 — Make the Change and Cut a New Version
Apply the changes you agreed on (here: tighten the system prompt with a grounding rule and a scope guardrail), then ask the agent to register a new version in Galtea so the next evaluation is tracked against it.
Step 5 — Re-evaluate and Compare
Run the same evaluations against v2 and compare them to your v1 baseline.
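The comparison boils down to per-metric deltas between the two versions. The sketch below assumes you have per-evaluation scores for each metric on both versions; the metric names, the scores, and the 0.02 tolerance are made up for illustration and do not come from Galtea.

```python
from statistics import mean

# Hypothetical per-evaluation scores keyed by metric; all values are made up.
v1_scores = {
    "faithfulness": [0.2, 0.9, 0.4, 0.8],
    "scope_adherence": [0.1, 0.7, 0.9],
    "context_relevance": [0.3, 0.8],
}
v2_scores = {
    "faithfulness": [0.8, 0.9, 0.7, 0.9],
    "scope_adherence": [0.8, 0.9, 0.9],
    "context_relevance": [0.3, 0.7],
}

TOLERANCE = 0.02  # assumed margin below which a change counts as noise


def compare_versions(baseline, candidate, tolerance=TOLERANCE):
    """Return {metric: (delta of mean scores, verdict)} for shared metrics."""
    report = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = mean(candidate[metric]) - mean(baseline[metric])
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[metric] = (round(delta, 3), verdict)
    return report


report = compare_versions(v1_scores, v2_scores)
regressions = [m for m, (_, verdict) in report.items() if verdict == "regressed"]
for metric, (delta, verdict) in report.items():
    print(f"{metric}: {delta:+.3f} ({verdict})")
```

With this made-up data the two prompt-related metrics improve while context_relevance is flagged as regressed, which is the kind of signal to look for before accepting a new version.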
Step 6 — Keep Iterating
Repeat the loop until you stop seeing obvious failures. Each pass is the same four moves: evaluate, diagnose, improve, compare. In practice a few iterations get you to consistently good results on your current set of cases, at which point it is worth investing in something more structured and automated.
Variations
We drove this loop from a small set of existing sessions, but the same approach works with other sources of signal:
- Production sessions. If you log real traffic to Galtea, point the loop at production sessions instead of a hand-made set. Be selective: not every production failure is one you want to optimize for.
- Specification-driven tests. Drive the whole loop from specifications: describe the behavior you expect, let Galtea generate the metrics and tests, and evaluate against them. This keeps “what good looks like” explicit and reusable across versions.
- Human annotations feeding the metrics optimizer. Annotate a subset of evaluations yourself to define what a correct judgment looks like, then use the agent to apply that standard across a larger corpus — the input the metrics optimizer needs to make your LLM-as-a-judge metrics more reliable.
What’s Next
Once the obvious issues are gone, the next steps make the loop durable so changes do not silently regress later:
- Harden your metrics. If the agent’s diagnosis ever felt off, the metric, not the product, may be the problem. Tightening your judge prompts and feeding human annotations into the metrics optimizer is its own improvement loop, and a natural follow-on playbook.
- Wire evaluations into CI/CD. The same CLI commands the skill runs on your behalf can run on every version in your pipeline, so a regression fails the build instead of reaching users. Because the CLI is non-interactive (export GALTEA_API_KEY=gsk_...), it drops straight into CI.
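Failing the build on a regression comes down to turning the evaluation outcome into a process exit code. The sketch below shows only that last step and assumes the list of regressed metrics has already been computed elsewhere; the ci_gate function and the placeholder key are hypothetical, and only the GALTEA_API_KEY variable name comes from the text above.

```python
import os
import sys


def ci_gate(regressed_metrics, env=os.environ):
    """Return a process exit code: 0 passes the build, 1 fails it."""
    if not env.get("GALTEA_API_KEY"):
        # Without a key the evaluations could not have run, so fail fast.
        print("GALTEA_API_KEY is not set", file=sys.stderr)
        return 1
    if regressed_metrics:
        print("Regressed: " + ", ".join(sorted(regressed_metrics)), file=sys.stderr)
        return 1
    return 0


# Placeholder environment for the example; a real pipeline would use the
# default os.environ and end with sys.exit(exit_code).
exit_code = ci_gate([], env={"GALTEA_API_KEY": "gsk_placeholder"})
print(exit_code)
```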
Related
Agent Skill
Install the skill that drives the whole loop on your behalf.
CLI Usage
The galtea binary the skill runs under the hood.
Specification-Driven Evaluations
Generate metrics and tests from what your product should do.
Skill Source
The open-source Galtea Agent Skill repository.