Experiments & Evaluations
Experiments and evaluations work together to test and measure the quality of your AI systems. Experiments run prompts against datasets to generate outputs. Evaluations measure those outputs (or dataset rows directly) using scores to assess quality.
Experiments
An experiment runs a prompt version against a dataset to generate outputs. You create an experiment, select a prompt version and dataset, map inputs, and run it to see what your prompt produces for each dataset row.
What experiments do:
- Run a prompt version (with a model configuration) against each row in a dataset
- Generate outputs for each input
- Let you compare different prompt versions or models on the same dataset
Why it matters: Experiments let you test how different prompts or models perform before deploying. You can compare variants side-by-side to see which produces better results.
How it works:
- Create an experiment with a prompt version and dataset
- Map dataset columns to prompt input variables
- Run the experiment to generate outputs for each row
- View results per row to see what each input produced
Example: You have a dataset of customer questions. You create an experiment with your "Support Bot" prompt version. The experiment runs the prompt against each question and generates a response. You can then compare this to another experiment with a different prompt version.
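The flow above can be sketched in plain Python. Everything here is illustrative, not the platform's real SDK: `run_experiment`, `render_prompt`, the column mapping, and the stubbed model call are all hypothetical names standing in for what the platform does internally.

```python
def render_prompt(template: str, row: dict, input_map: dict) -> str:
    """Fill prompt variables from dataset columns using the column mapping."""
    variables = {var: row[col] for var, col in input_map.items()}
    return template.format(**variables)

def run_experiment(template: str, dataset: list[dict],
                   input_map: dict, call_model) -> list[dict]:
    """Run a prompt version against every dataset row and collect outputs."""
    results = []
    for row in dataset:
        prompt = render_prompt(template, row, input_map)
        results.append({"input": row, "output": call_model(prompt)})
    return results

# Hypothetical "Support Bot" prompt and a stubbed model call:
dataset = [{"question": "How do I reset my password?"},
           {"question": "Where is my invoice?"}]
template = "You are a support bot. Answer: {question}"
outputs = run_experiment(template, dataset,
                         input_map={"question": "question"},
                         call_model=lambda p: f"(model answer to: {p})")
# One output row per dataset row, ready for a second experiment to compare against.
```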
Evaluations
An evaluation runs scores on datasets or experiment results to measure quality. You create an evaluation, select scores to run, choose what to evaluate (a dataset or experiment results), and run it to see quality metrics.
What evaluations do:
- Apply scores (metrics) to measure quality
- Can evaluate dataset rows directly or experiment outputs
- Provide statistics and per-row results
Why it matters: Evaluations tell you whether your outputs are good. They measure quality using scores: correctness, relevance, safety, or any other metric you define.
How it works:
- Create an evaluation with scores to run
- Choose scope: Dataset (score dataset rows directly) or Experiment (score experiment outputs)
- Run the evaluation to compute scores
- View statistics and per-row results to see quality metrics
Example: After running an experiment, you create an evaluation that runs "Correctness" and "Relevance" scores on the experiment outputs. The evaluation shows you which responses scored well and which need improvement.
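As a minimal sketch of that example, an evaluation applies each score to every result row and aggregates the statistics. The score functions and the `evaluate` helper are hypothetical; a real deployment might use an LLM-as-judge instead of these simple checks.

```python
from statistics import mean

def evaluate(results: list[dict], scores: dict) -> dict:
    """Apply each score function to every result row, then aggregate."""
    per_row = [{name: fn(r) for name, fn in scores.items()} for r in results]
    stats = {name: mean(row[name] for row in per_row) for name in scores}
    return {"per_row": per_row, "stats": stats}

# Illustrative experiment outputs and naive score functions:
results = [{"output": "Reset it via the account page.", "expected": "account page"},
           {"output": "Contact support.", "expected": "account page"}]
scores = {
    "correctness": lambda r: 1.0 if r["expected"] in r["output"] else 0.0,
    "relevance": lambda r: 1.0 if r["output"] else 0.0,
}
report = evaluate(results, scores)
# report["per_row"] shows which responses scored well; report["stats"] summarizes.
```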
How they work together
Experiments and evaluations form a workflow:
- Create a dataset — Curate inputs you want to test
- Run experiments — Generate outputs using different prompts or models
- Run evaluations — Measure quality of those outputs using scores
- Compare results — See which prompt or model performs best
Two evaluation scopes:
| Scope | What it evaluates | Use case |
|---|---|---|
| Dataset | Dataset rows directly | Score a dataset without running experiments (e.g., manual labels, LLM-as-judge on existing data) |
| Experiment | Experiment results | Score outputs from one or more experiments to measure quality |
Example workflow:
- You have a dataset of 100 customer questions
- You run Experiment A with prompt version "v1" → generates 100 responses
- You run Experiment B with prompt version "v2" → generates 100 responses
- You create an Evaluation with scope "Experiment" → scores both experiments' outputs
- You compare results to see which prompt version performs better
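The comparison step reduces to contrasting aggregate scores across the two experiments. The numbers below are made up for illustration (a truncated stand-in for the 100-row runs), and `compare` is a hypothetical helper, not a platform function:

```python
from statistics import mean

def compare(scores_v1: list[float], scores_v2: list[float]) -> str:
    """Pick the prompt version with the higher mean score."""
    return "v1" if mean(scores_v1) >= mean(scores_v2) else "v2"

# Hypothetical per-row correctness scores from Experiments A and B:
scores_v1 = [1.0, 0.0, 1.0, 1.0]
scores_v2 = [1.0, 1.0, 1.0, 1.0]
winner = compare(scores_v1, scores_v2)
```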
What you need
For experiments:
- Prompt version — A versioned prompt template
- Model configuration — Registered AI model (OpenAI, Anthropic, etc.)
- Dataset — Collection of items with columns that match your prompt inputs
For evaluations:
- Scores — Metrics to measure quality. Scoring types are Numeric, Ordinal, Nominal, or RAGAS (a scoring framework); scores can be assigned manually or by an LLM evaluator prompt
- Dataset or Experiments — What to evaluate (dataset rows or experiment outputs)
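To make the scoring types concrete, here is one way to model them in Python. This `Score` dataclass is an assumption for illustration only; the platform's actual score schema may differ (and RAGAS, being a framework of its own, is omitted).

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Score:
    """Hypothetical score definition: a named metric with a scoring type."""
    name: str
    # Numeric: a continuous value; Ordinal: ordered labels; Nominal: unordered labels.
    kind: Literal["numeric", "ordinal", "nominal"]
    labels: tuple[str, ...] = ()  # only used by ordinal/nominal scores

correctness = Score("correctness", "numeric")
severity = Score("severity", "ordinal", ("low", "medium", "high"))
tone = Score("tone", "nominal", ("friendly", "neutral", "formal"))
```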
Related concepts
- Datasets — Collections of items used by experiments and evaluations
- Prompts & Models — Prompt versions and model configurations used in experiments
- Scores — Metrics used in evaluations
- Experiments — Detailed guide on creating and running experiments
- Evaluations — Detailed guide on creating and running evaluations