Prompts, Models & Scores

Prompts and model configurations control how generations are produced; scores (often LLM-as-judge) measure their quality.


Prompts

  • What they are: Versioned templates for model calls.
  • How you manage: create/edit, version, diff, and roll back.
  • Where used: tied to projects; selectable in experiments to compare variants.
  • Good practices: keep system and user prompts concise; bump the version whenever intent changes; annotate each version with the reason for the change (see the sketch below).
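A minimal sketch of how a versioned prompt template could be modeled, shown in plain Python; the class and field names are illustrative assumptions, not an actual SDK API.

  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class PromptVersion:
      version: int
      system: str
      user_template: str
      change_reason: str = ""

  @dataclass
  class Prompt:
      name: str
      versions: list = field(default_factory=list)

      def bump(self, system, user_template, change_reason):
          # Append a new immutable version instead of mutating an old one.
          v = PromptVersion(len(self.versions) + 1, system, user_template, change_reason)
          self.versions.append(v)
          return v

      def latest(self):
          return self.versions[-1]

  summarize = Prompt("summarize-ticket")
  summarize.bump(
      system="You are a concise support summarizer.",
      user_template="Summarize this ticket in two sentences:\n{ticket}",
      change_reason="initial version",
  )
  summarize.bump(
      system="You are a concise support summarizer.",
      user_template="Summarize this ticket in one sentence:\n{ticket}",
      change_reason="shorter summaries requested",
  )
  print(summarize.latest().user_template.format(ticket="Login fails on mobile."))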

Models

  • What they are: Registered model configurations per project (provider, model name, params).
  • How you manage: add/update configs; set defaults; pair with prompts in experiments.
  • Good practices: record sampling parameters such as temperature and top_p; keep staging and prod configs distinct; note provider-specific limits (see the sketch below).
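An illustrative sketch of per-project model configurations; the config shape follows the bullets above (provider, model name, params), but the class, providers, and parameter values are assumptions used for illustration.

  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class ModelConfig:
      provider: str
      model: str
      params: dict = field(default_factory=dict)

  # Record sampling params explicitly and keep staging and prod configs distinct.
  MODEL_CONFIGS = {
      "staging": ModelConfig("openai", "gpt-4o-mini", {"temperature": 0.7, "top_p": 0.95}),
      "prod": ModelConfig("openai", "gpt-4o", {"temperature": 0.2, "top_p": 1.0}),
  }

  default_config = MODEL_CONFIGS["prod"]  # project default, paired with prompts in experiments
  print(default_config)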

Scores

  • What they are: Metrics for judging generations. Scoring types are Numeric, Ordinal, and Nominal (RAGAS, a scoring framework, is also supported); scores can be assigned manually or by an LLM evaluator prompt.
  • How you manage: define a name and type; scores are then used directly in evaluations and experiments.
  • Good practices: prefer clear, deterministic criteria; document the prompts used for LLM judges; version scores when their logic changes (see the sketch below).
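A hedged sketch of score definitions covering the three scoring types plus an LLM-judge evaluator prompt; the enum and field names are hypothetical, not a documented schema.

  from dataclasses import dataclass
  from enum import Enum
  from typing import Optional

  class ScoreType(Enum):
      NUMERIC = "numeric"  # e.g. 0.0-1.0 relevance
      ORDINAL = "ordinal"  # e.g. 1-5 quality rating
      NOMINAL = "nominal"  # e.g. "pass" / "fail"

  @dataclass(frozen=True)
  class Score:
      name: str
      type: ScoreType
      # For LLM-as-judge scores, keep the evaluator prompt with the score
      # definition so the judging criteria are documented and versionable.
      judge_prompt: Optional[str] = None

  scores = [
      Score("answer_relevance", ScoreType.NUMERIC,
            judge_prompt="Rate from 0 to 1 how well the answer addresses the question."),
      Score("tone_rating", ScoreType.ORDINAL),   # manual 1-5 rating
      Score("contains_pii", ScoreType.NOMINAL),  # manual pass/fail label
  ]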

Why they matter together

  • Prompts + Models create the generations; Scores tell you if they’re good.
  • Experiments compare prompt/model variants; Evaluations apply Scores to datasets to guard against regressions (see the combined sketch at the end of this section).

  • Evaluations: run scores on a dataset to check quality.
  • Experiments: run prompt versions against a dataset; compare variants with shared scores. See Experiments.
  • Entities: model entities label spans, so you can see which config ran in a given trace or conversation.
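To show how the pieces fit together, here is a hedged end-to-end sketch: two prompt versions and one model config produce generations over a small dataset, and a shared numeric score compares the variants. The call_model stub and the length_score metric are stand-ins for a real provider client and a real score definition.

  from statistics import mean

  def call_model(config, prompt):
      # Placeholder generation so the sketch runs without network access;
      # replace with a real provider call using the registered config.
      return prompt.split(":")[-1].strip()[:40]

  def length_score(output):
      # Toy Numeric score: reward outputs of at most 40 characters.
      return 1.0 if len(output) <= 40 else 0.0

  dataset = ["Login fails on mobile.", "Billing page times out after checkout."]
  prompt_variants = {
      "v1": "Summarize this ticket in two sentences: {ticket}",
      "v2": "Summarize this ticket in one sentence: {ticket}",
  }
  config = {"provider": "openai", "model": "gpt-4o-mini", "temperature": 0.2}

  # Experiment: run each prompt version over the dataset with a shared score.
  results = {}
  for version, template in prompt_variants.items():
      outputs = [call_model(config, template.format(ticket=t)) for t in dataset]
      results[version] = mean(length_score(o) for o in outputs)

  print(results)  # compare variants before promoting one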