Experiments & Evaluations
Experiments and evaluations work together to test and measure the quality of your AI systems. Experiments run prompts against datasets to generate outputs. Evaluations measure those outputs (or dataset rows directly) using scores to assess quality.
Experiments
An experiment runs a prompt version against a dataset to generate outputs. You create an experiment, select a prompt version and dataset, map inputs, and run it to see what your prompt produces for each dataset row.
What experiments do:
- Run a prompt version (with a model configuration) against each row in a dataset
- Generate outputs for each input
- Let you compare different prompt versions or models on the same dataset
Why it matters: Experiments let you test how different prompts or models perform before deploying. You can compare variants side-by-side to see which produces better results.
How it works:
- Create an experiment with a prompt version and dataset
- Map dataset columns to prompt input variables
- Run the experiment to generate outputs for each row
- View results per row to see what each input produced
Example: You have a dataset of customer questions. You create an experiment with your "Support Bot" prompt version. The experiment runs the prompt against each question and generates a response. You can then compare this to another experiment with a different prompt version.
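The flow above can be sketched in plain Python. Everything here is illustrative, not the platform's real SDK: `run_experiment`, `render_prompt`, the column mapping, and the stubbed model call are all hypothetical names standing in for what the platform does internally.

```python
def render_prompt(template: str, row: dict, input_map: dict) -> str:
    """Fill prompt variables from dataset columns using the column mapping."""
    variables = {var: row[col] for var, col in input_map.items()}
    return template.format(**variables)

def run_experiment(template: str, dataset: list[dict],
                   input_map: dict, call_model) -> list[dict]:
    """Run a prompt version against every dataset row and collect outputs."""
    results = []
    for row in dataset:
        prompt = render_prompt(template, row, input_map)
        results.append({"input": row, "output": call_model(prompt)})
    return results

# Hypothetical "Support Bot" prompt and a stubbed model call:
dataset = [{"question": "How do I reset my password?"},
           {"question": "Where is my invoice?"}]
template = "You are a support bot. Answer: {question}"
outputs = run_experiment(template, dataset,
                         input_map={"question": "question"},
                         call_model=lambda p: f"(model answer to: {p})")
# One output row per dataset row, ready for a second experiment to compare against.
```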
Evaluations
An evaluation runs scores on datasets or experiment results to measure quality. You create an evaluation, select scores to run, choose what to evaluate (a dataset or experiment results), and run it to see quality metrics.
What evaluations do:
- Apply scores (metrics) to measure quality
- Can evaluate dataset rows directly or experiment outputs
- Provide statistics and per-row results
Why it matters: Evaluations tell you whether your outputs are good. They measure quality using scores: correctness, relevance, safety, or any other metric you define.
How it works:
- Create an evaluation with scores to run
- Choose scope: Dataset (score dataset rows directly) or Experiment (score experiment outputs)
- Run the evaluation to compute scores
- View statistics and per-row results to see quality metrics
Example: After running an experiment, you create an evaluation that runs "Correctness" and "Relevance" scores on the experiment outputs. The evaluation shows you which responses scored well and which need improvement.
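As a minimal sketch of that example, an evaluation applies each score to every result row and aggregates the statistics. The score functions and the `evaluate` helper are hypothetical; a real deployment might use an LLM-as-judge instead of these simple checks.

```python
from statistics import mean

def evaluate(results: list[dict], scores: dict) -> dict:
    """Apply each score function to every result row, then aggregate."""
    per_row = [{name: fn(r) for name, fn in scores.items()} for r in results]
    stats = {name: mean(row[name] for row in per_row) for name in scores}
    return {"per_row": per_row, "stats": stats}

# Illustrative experiment outputs and naive score functions:
results = [{"output": "Reset it via the account page.", "expected": "account page"},
           {"output": "Contact support.", "expected": "account page"}]
scores = {
    "correctness": lambda r: 1.0 if r["expected"] in r["output"] else 0.0,
    "relevance": lambda r: 1.0 if r["output"] else 0.0,
}
report = evaluate(results, scores)
# report["per_row"] shows which responses scored well; report["stats"] summarizes.
```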
How they work together
Experiments and evaluations form a workflow:
- Create a dataset — Curate inputs you want to test
- Run experiments — Generate outputs using different prompts or models
- Run evaluations — Measure quality of those outputs using scores
- Compare results — See which prompt or model performs best
Two evaluation scopes:
| Scope | What it evaluates | Use case |
|---|---|---|
| Dataset | Dataset rows directly | Score a dataset without running experiments (e.g., manual labels, LLM-as-judge on existing data) |
| Experiment | Experiment results | Score outputs from one or more experiments to measure quality |
Example workflow:
- You have a dataset of 100 customer questions
- You run Experiment A with prompt version "v1" → generates 100 responses
- You run Experiment B with prompt version "v2" → generates 100 responses
- You create an Evaluation with scope "Experiment" → scores both experiments' outputs
- You compare results to see which prompt version performs better
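The comparison step reduces to contrasting aggregate scores across the two experiments. The numbers below are made up for illustration (a truncated stand-in for the 100-row runs), and `compare` is a hypothetical helper, not a platform function:

```python
from statistics import mean

def compare(scores_v1: list[float], scores_v2: list[float]) -> str:
    """Pick the prompt version with the higher mean score."""
    return "v1" if mean(scores_v1) >= mean(scores_v2) else "v2"

# Hypothetical per-row correctness scores from Experiments A and B:
scores_v1 = [1.0, 0.0, 1.0, 1.0]
scores_v2 = [1.0, 1.0, 1.0, 1.0]
winner = compare(scores_v1, scores_v2)
```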
What you need
For experiments:
- Prompt version — A versioned prompt template
- Model configuration — Registered AI model (OpenAI, Anthropic, etc.)
- Dataset — Collection of items with columns that match your prompt inputs
For evaluations:
- Scores — Metrics to measure quality. Scoring types are Numeric, Ordinal, Nominal, or RAGAS (a scoring framework); scores can be assigned manually or by an LLM evaluator prompt
- Dataset or Experiments — What to evaluate (dataset rows or experiment outputs)
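To make the scoring types concrete, here is one way to model them in Python. This `Score` dataclass is an assumption for illustration only; the platform's actual score schema may differ (and RAGAS, being a framework of its own, is omitted).

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Score:
    """Hypothetical score definition: a named metric with a scoring type."""
    name: str
    # Numeric: a continuous value; Ordinal: ordered labels; Nominal: unordered labels.
    kind: Literal["numeric", "ordinal", "nominal"]
    labels: tuple[str, ...] = ()  # only used by ordinal/nominal scores

correctness = Score("correctness", "numeric")
severity = Score("severity", "ordinal", ("low", "medium", "high"))
tone = Score("tone", "nominal", ("friendly", "neutral", "formal"))
```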
Related concepts
- Datasets — Collections of items used by experiments and evaluations
- Prompts & Models — Prompt versions and model configurations used in experiments
- Scores — Metrics used in evaluations
- Experiments — Detailed guide on creating and running experiments
- Evaluations — Detailed guide on creating and running evaluations