
Evaluation & Experimentation

Experiments and evaluations work together to test and measure AI quality. Experiments run prompts against datasets to generate outputs. Evaluations apply scores to measure the quality of those outputs, or of datasets directly.


What it is

Experiments let you:

  • Run a prompt version against a dataset to generate outputs for each row.
  • Compare different prompt versions or models on the same dataset.
  • See what your prompts produce before deploying.
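To make the idea concrete, here is a minimal sketch in plain Python of what an experiment run does conceptually. This is not this product's API; `render`, `run_experiment`, and `echo_model` are illustrative names, and the fake model stands in for a real provider call:

```python
# Illustrative only: run one prompt version against each dataset row
# to generate one output per row (the essence of an experiment run).

def render(template: str, row: dict) -> str:
    """Fill {placeholders} in the prompt template from a dataset row."""
    return template.format(**row)

def run_experiment(template: str, dataset: list[dict], model) -> list[dict]:
    """One result per row: the source row, the rendered prompt, the output."""
    results = []
    for row in dataset:
        prompt = render(template, row)
        results.append({"row": row, "prompt": prompt, "output": model(prompt)})
    return results

# A fake "model" so the sketch runs without any provider SDK.
echo_model = lambda prompt: "[model output for: " + prompt + "]"

dataset = [{"question": "What is 2+2?"}, {"question": "What is the capital of France?"}]
results = run_experiment("Answer concisely: {question}", dataset, echo_model)
```

Comparing prompt versions or models then amounts to calling the same run with a different template or model and looking at the per-row outputs side by side.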

Evaluations let you:

  • Apply scores (metrics) to datasets or experiment results.
  • Measure quality with score types (Numeric, Ordinal, Nominal, or RAGAS, a scoring framework), applied either manually or with an LLM evaluator prompt.
  • View statistics and per-row results to see what scored well.
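Conceptually, an evaluation applies a score function to each row and aggregates the results. A minimal sketch in plain Python (again, not this product's API; `exact_match` and `run_evaluation` are illustrative stand-ins for a numeric score and an evaluation run):

```python
# Illustrative only: apply a numeric score to each experiment result,
# then summarize per-row scores with an aggregate statistic.

def exact_match(output: str, expected: str) -> float:
    """Numeric score: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_evaluation(results: list[dict], score_fn) -> dict:
    """Per-row scores plus an aggregate statistic (here, the mean)."""
    scores = [score_fn(r["output"], r["expected"]) for r in results]
    return {"per_row": scores, "mean": sum(scores) / len(scores)}

results = [
    {"output": "4", "expected": "4"},
    {"output": "Lyon", "expected": "Paris"},
]
report = run_evaluation(results, exact_match)
```

The per-row scores are what lets you drill into which rows scored well or poorly, while the aggregate gives the headline number.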

Together they form a workflow: create a dataset → run experiments → run evaluations → compare results. See Experiments & Evaluations for the underlying concepts.


What you can do

  • Experiments — Create an experiment with a prompt version and a dataset, map inputs, run it to generate outputs, and view per-row results. Compare different prompt versions or models.
  • Evaluations — Create an evaluation with scores and a scope (Dataset or Experiment), run it to compute scores, and view statistics and per-row results. Measure the quality of datasets or experiment outputs.

Evaluation scopes

Evaluations support two scopes:

  • Dataset — Evaluates dataset rows directly. Use it to score a dataset without running experiments (manual or LLM-as-judge scores).
  • Experiment — Evaluates experiment results. Use it to score outputs from one or more experiments to measure quality.
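The difference between the two scopes can be sketched in plain Python (illustrative only; `score_dataset` and `score_experiment` are hypothetical names, not this product's API):

```python
# Illustrative only: the two evaluation scopes as plain functions.
# Dataset scope scores dataset rows directly; Experiment scope scores
# the outputs an experiment produced for those rows.

def score_dataset(rows: list[dict], score_fn) -> list[float]:
    """Dataset scope: score each row's own fields; no experiment needed."""
    return [score_fn(row) for row in rows]

def score_experiment(results: list[dict], score_fn) -> list[float]:
    """Experiment scope: score each generated output against its row."""
    return [score_fn(r["output"], r["row"]) for r in results]

rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": ""},
]

# Dataset scope, e.g. check that every row has a non-empty reference answer.
completeness = score_dataset(rows, lambda row: 1.0 if row["answer"] else 0.0)

# Experiment scope: score an output against the row's reference answer.
results = [{"row": rows[0], "output": "4"}]
correctness = score_experiment(
    results, lambda out, row: 1.0 if out == row["answer"] else 0.0
)
```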

Getting started

  1. Create a dataset — Either import a CSV or use Dataset Mode in Traces to map span attributes.
  2. Create a prompt — Build a versioned template for your model calls.
  3. Run an experiment — Test your prompt against the dataset to generate outputs.
  4. Define scores — Set up metrics to measure quality.
  5. Run an evaluation — Apply scores to your experiment results or dataset to measure quality.
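The five steps above can be chained end to end. As a self-contained sketch in plain Python (not this product's API; the template, fake model, and score are all illustrative):

```python
# Illustrative only: dataset -> prompt -> experiment -> score -> evaluation.

# 1. A dataset of rows with inputs and reference answers.
dataset = [{"question": "What is 2+2?", "answer": "4"}]

# 2-3. A prompt template and an experiment run generating one output per row.
template = "Answer with just the number: {question}"
fake_model = lambda prompt: "4"  # stands in for a real model call
results = [
    {"row": row, "output": fake_model(template.format(**row))}
    for row in dataset
]

# 4-5. A score applied to the experiment results, plus an aggregate.
scores = [1.0 if r["output"] == r["row"]["answer"] else 0.0 for r in results]
mean_score = sum(scores) / len(scores)
```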

Pages in this section

  • Experiments — Run prompts against datasets to test performance.
  • Evaluations — Apply scores to datasets or experiment results to measure quality.