Evaluation & Experimentation
Experiments and evaluations work together to test and measure AI output quality. Experiments run prompts against datasets to generate outputs; evaluations apply scores to measure the quality of those outputs (or of dataset rows directly).
What it is
Experiments let you:
- Run a prompt version against a dataset to generate outputs for each row.
- Compare different prompt versions or models on the same dataset.
- See what your prompts produce before deploying them.
Evaluations let you:
- Apply scores (metrics) to datasets or experiment results.
- Measure quality using score types (Numeric, Ordinal, Nominal, or RAGAS, a scoring framework), assigned manually or by an LLM evaluator prompt.
- View statistics and per-row results to see which rows scored well.
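The three manual score types can be illustrated in plain Python. This is a sketch of the concepts only; none of these function or constant names come from the platform's API.

```python
# Illustrative only: the three manual score types modeled in plain Python.
# All names here are hypothetical, not part of any platform API.

def numeric_score(value: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """A Numeric score is a continuous value within a range."""
    if not lo <= value <= hi:
        raise ValueError(f"score must be in [{lo}, {hi}]")
    return value

# An Ordinal score is a label from an ordered scale; ranks are comparable.
ORDINAL_LEVELS = ["poor", "fair", "good", "excellent"]

def ordinal_score(label: str) -> int:
    return ORDINAL_LEVELS.index(label)

# A Nominal score is a category with no inherent ordering.
NOMINAL_CATEGORIES = {"on_topic", "off_topic", "refusal"}

def nominal_score(label: str) -> str:
    if label not in NOMINAL_CATEGORIES:
        raise ValueError(f"unknown category: {label}")
    return label
```

A Numeric score supports averaging across rows; an Ordinal score supports ordering comparisons ("good" ranks above "fair"); a Nominal score only supports counting by category.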
Together they form a workflow: create a dataset → run experiments → run evaluations → compare results. See Experiments & Evaluations for the underlying concepts.
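The workflow above can be sketched in plain Python. Everything here is a stand-in: `run_prompt` stubs out a real model call, and `brevity_score` is a toy Numeric metric, not a platform-provided score.

```python
from statistics import mean

# Hypothetical sketch of dataset -> experiment -> evaluation.
# `run_prompt` stands in for a real model call; the score is a toy metric.

dataset = [
    {"input": "Summarize: the cat sat on the mat.", "expected": "A cat sat on a mat."},
    {"input": "Summarize: rain fell all day.", "expected": "It rained all day."},
]

def run_prompt(prompt_template: str, row: dict) -> str:
    """Experiment step: render the prompt for a row, then 'call' the model (stubbed)."""
    rendered = prompt_template.format(**row)
    return rendered.upper()  # stand-in for a real model response

def brevity_score(output: str, max_len: int = 200) -> float:
    """Evaluation step: a toy Numeric score in [0, 1] rewarding shorter outputs."""
    return max(0.0, 1.0 - len(output) / max_len)

# Experiment: generate one output per dataset row.
results = [{"row": row, "output": run_prompt("{input}", row)} for row in dataset]

# Evaluation: score each output, then aggregate for the statistics view.
scores = [brevity_score(r["output"]) for r in results]
print(f"mean score: {mean(scores):.2f}")
```

Comparing prompt versions amounts to repeating the experiment step with a different template over the same dataset and comparing the resulting score statistics.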
What you can do
| Area | What you do |
|---|---|
| Experiments | Create an experiment with a prompt version and dataset, map inputs, run it to generate outputs, view per-row results. Compare different prompt versions or models. |
| Evaluations | Create an evaluation with scores and scope (Dataset or Experiment), run it to compute scores, view statistics and per-row results. Measure quality of datasets or experiment outputs. |
Evaluation scopes
Evaluations support two scopes:
| Scope | What it evaluates | Use case |
|---|---|---|
| Dataset | Dataset rows directly | Score a dataset without running experiments (manual or LLM-as-judge scores). |
| Experiment | Experiment results | Score outputs from one or more experiments to measure quality. |
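The difference between the two scopes can be sketched as follows. This is an illustration of the concept only; the data shapes and the `non_empty` score are made up for the example.

```python
# Hypothetical sketch of the two evaluation scopes; no platform API is used.

dataset = [
    {"question": "2+2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def non_empty(text: str) -> float:
    """A trivial Numeric score: 1.0 if the text is non-empty, else 0.0."""
    return 1.0 if text.strip() else 0.0

# Dataset scope: score the dataset rows directly; no experiment is needed.
dataset_scores = [non_empty(row["answer"]) for row in dataset]

# Experiment scope: score outputs produced by an experiment run (stubbed here).
experiment_results = [{"row": row, "output": row["answer"] + "."} for row in dataset]
experiment_scores = [non_empty(r["output"]) for r in experiment_results]

print(dataset_scores, experiment_scores)
```

The same score definition can serve both scopes; what changes is whether it reads a dataset field or an experiment output.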
Getting started
- Create a dataset — Either import a CSV or use Dataset Mode in Traces to map span attributes.
- Create a prompt — Build a versioned template for your model calls.
- Run an experiment — Test your prompt against the dataset to generate outputs.
- Define scores — Set up metrics to measure quality.
- Run an evaluation — Apply scores to your experiment results or dataset to measure quality.
Pages in this section
- Experiments — Run prompts against datasets to test performance.
- Evaluations — Apply scores to datasets or experiment results to measure quality.
Related
- Experiments & Evaluations — How experiments and evaluations work together.
- Datasets — Collections of items used in experiments and evaluations.
- Prompts & Scores — Prompts used in experiments; scores used in evaluations.