
Prompts & Scores

Prompts and scores are the building blocks for testing and measuring AI quality. Prompts are versioned templates for model calls used in experiments. Scores are reusable metrics used in evaluations to measure quality.


What it is

Prompts let you:

  • Create versioned templates for model calls (system/user messages, parameters).
  • Track changes over time with versioning and diffs.
  • Use prompts in experiments to test different versions against datasets.
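The bullets above describe a versioned template with system/user messages and parameters. As a minimal sketch only (the `PromptVersion` class and `render` method are hypothetical illustrations, not this product's actual API), a prompt version might look like:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a versioned prompt template.
# PromptVersion and render() are illustrative names, not the product's API.
@dataclass
class PromptVersion:
    version: int
    system: str
    user: str                                   # may contain {placeholders}
    params: dict = field(default_factory=dict)  # e.g. temperature, max_tokens

    def render(self, **variables) -> list[dict]:
        """Fill placeholders and return chat-style messages for a model call."""
        return [
            {"role": "system", "content": self.system},
            {"role": "user", "content": self.user.format(**variables)},
        ]

v2 = PromptVersion(
    version=2,
    system="You are a concise support assistant.",
    user="Summarise this ticket: {ticket}",
    params={"temperature": 0.2},
)
messages = v2.render(ticket="App crashes on login.")
```

Keeping each edit as a new immutable version is what makes diffs and promotion between versions straightforward.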

Scores let you:

  • Define reusable metrics with a scoring type (Numeric, Ordinal, Nominal, or RAGAS, a scoring framework); optionally attach an evaluator prompt for LLM-as-judge scoring.
  • Use scores in evaluations to measure the quality of datasets or experiment results.
  • Track quality consistently across different runs.

Together, prompts generate outputs (via experiments) and scores measure those outputs (via evaluations). See Prompts, Models & Scores for the underlying concepts.


What you can do

  • Prompts — Create prompts with versions, edit to create new versions, compare versions, and promote versions. Use prompts in experiments.
  • Scores — Create scores with a name, scoring type (Numeric, Ordinal, Nominal, or RAGAS, a scoring framework), optional scale options, and an optional evaluator prompt for LLM-as-judge. Use scores in evaluations.

Getting started

  1. Register model configurations (Organisation Configuration → AI Models) so prompts can run.
  2. Create a prompt — Build a versioned template with system/user messages and parameters.
  3. Create scores — Define metrics you want to measure (e.g. correctness, relevance, safety).
  4. Use prompts in Experiments and scores in Evaluations.
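The four steps above can be sketched end to end. Everything here is a hypothetical stand-in (the dictionaries and the `run_experiment` / `score_relevance` functions are placeholders, not the product's API): a model configuration lets the prompt run, an experiment produces an output, and a score measures it.

```python
# Step 1: a registered model configuration (illustrative shape only).
model = {"provider": "example", "model": "example-model", "temperature": 0.2}

# Step 2: a versioned prompt with system/user messages.
prompt = {
    "version": 1,
    "system": "Answer factually.",
    "user": "What is the capital of France?",
}

def run_experiment(model: dict, prompt: dict) -> str:
    # Stand-in for a real model call made by an experiment (step 4).
    return "Paris"

# Step 3: a simple numeric score applied during evaluation.
def score_relevance(output: str) -> float:
    return 1.0 if "Paris" in output else 0.0

output = run_experiment(model, prompt)
score = score_relevance(output)  # 1.0
```

In practice the model call and scoring happen inside Experiments and Evaluations respectively; the sketch only shows how the pieces connect.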

Pages in this section

  • Prompts — Create, version, and manage prompts for experiments.
  • Scores — Define metrics for evaluations.