Prompts, Models & Scores

Prompts and model configurations control how generations are produced; scores (often LLM-as-judge) measure their quality.


Prompts

  • What they are: Versioned templates for model calls.
  • How you manage: create/edit, version, diff, and roll back.
  • Where used: tied to projects; selectable in experiments to compare variants.
  • Good practices: keep system and user prompts concise; bump the version whenever intent changes; annotate each version with the reason for the change (see the sketch below).
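A minimal sketch of how a versioned prompt template could be modeled, shown in plain Python; the class and field names are illustrative assumptions, not an actual SDK API.

  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class PromptVersion:
      version: int
      system: str
      user_template: str
      change_reason: str = ""

  @dataclass
  class Prompt:
      name: str
      versions: list = field(default_factory=list)

      def bump(self, system, user_template, change_reason):
          # Append a new immutable version instead of mutating an old one.
          v = PromptVersion(len(self.versions) + 1, system, user_template, change_reason)
          self.versions.append(v)
          return v

      def latest(self):
          return self.versions[-1]

  summarize = Prompt("summarize-ticket")
  summarize.bump(
      system="You are a concise support summarizer.",
      user_template="Summarize this ticket in two sentences:\n{ticket}",
      change_reason="initial version",
  )
  summarize.bump(
      system="You are a concise support summarizer.",
      user_template="Summarize this ticket in one sentence:\n{ticket}",
      change_reason="shorter summaries requested",
  )
  print(summarize.latest().user_template.format(ticket="Login fails on mobile."))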

Models

  • What they are: Registered model configurations per project (provider, model name, params).
  • How you manage: add/update configs; set defaults; pair with prompts in experiments.
  • Good practices: record sampling parameters such as temperature and top_p; keep staging and prod configs distinct; note provider-specific limits (see the sketch below).
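An illustrative sketch of per-project model configurations; the config shape follows the bullets above (provider, model name, params), but the class, providers, and parameter values are assumptions used for illustration.

  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class ModelConfig:
      provider: str
      model: str
      params: dict = field(default_factory=dict)

  # Record sampling params explicitly and keep staging and prod configs distinct.
  MODEL_CONFIGS = {
      "staging": ModelConfig("openai", "gpt-4o-mini", {"temperature": 0.7, "top_p": 0.95}),
      "prod": ModelConfig("openai", "gpt-4o", {"temperature": 0.2, "top_p": 1.0}),
  }

  default_config = MODEL_CONFIGS["prod"]  # project default, paired with prompts in experiments
  print(default_config)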

Scores

  • What they are: Metrics for judging generations. Scoring types are Numeric, Ordinal, and Nominal (RAGAS, a scoring framework, is also supported); scores can be assigned manually or by an LLM evaluator prompt.
  • How you manage: define a name and type; scores are then used directly in evaluations and experiments.
  • Good practices: prefer clear, deterministic criteria; document the prompts used for LLM judges; version scores when their logic changes (see the sketch below).
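A hedged sketch of score definitions covering the three scoring types plus an LLM-judge evaluator prompt; the enum and field names are hypothetical, not a documented schema.

  from dataclasses import dataclass
  from enum import Enum
  from typing import Optional

  class ScoreType(Enum):
      NUMERIC = "numeric"  # e.g. 0.0-1.0 relevance
      ORDINAL = "ordinal"  # e.g. 1-5 quality rating
      NOMINAL = "nominal"  # e.g. "pass" / "fail"

  @dataclass(frozen=True)
  class Score:
      name: str
      type: ScoreType
      # For LLM-as-judge scores, keep the evaluator prompt with the score
      # definition so the judging criteria are documented and versionable.
      judge_prompt: Optional[str] = None

  scores = [
      Score("answer_relevance", ScoreType.NUMERIC,
            judge_prompt="Rate from 0 to 1 how well the answer addresses the question."),
      Score("tone_rating", ScoreType.ORDINAL),   # manual 1-5 rating
      Score("contains_pii", ScoreType.NOMINAL),  # manual pass/fail label
  ]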

Why they matter together

  • Prompts + Models create the generations; Scores tell you if they’re good.
  • Experiments compare prompt/model variants; Evaluations apply Scores to datasets to guard against regressions (see the combined sketch at the end of this section).

  • Evaluations: run scores on a dataset to check quality.
  • Experiments: run prompt versions against a dataset; compare variants with shared scores. See Experiments.
  • Entities: model entities label spans, so you can see which config ran in a given trace or conversation.
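To show how the pieces fit together, here is a hedged end-to-end sketch: two prompt versions and one model config produce generations over a small dataset, and a shared numeric score compares the variants. The call_model stub and the length_score metric are stand-ins for a real provider client and a real score definition.

  from statistics import mean

  def call_model(config, prompt):
      # Placeholder generation so the sketch runs without network access;
      # replace with a real provider call using the registered config.
      return prompt.split(":")[-1].strip()[:40]

  def length_score(output):
      # Toy Numeric score: reward outputs of at most 40 characters.
      return 1.0 if len(output) <= 40 else 0.0

  dataset = ["Login fails on mobile.", "Billing page times out after checkout."]
  prompt_variants = {
      "v1": "Summarize this ticket in two sentences: {ticket}",
      "v2": "Summarize this ticket in one sentence: {ticket}",
  }
  config = {"provider": "openai", "model": "gpt-4o-mini", "temperature": 0.2}

  # Experiment: run each prompt version over the dataset with a shared score.
  results = {}
  for version, template in prompt_variants.items():
      outputs = [call_model(config, template.format(ticket=t)) for t in dataset]
      results[version] = mean(length_score(o) for o in outputs)

  print(results)  # compare variants before promoting one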