Prompts, Models & Scores
Prompts and model configurations control how generations are produced; scores (often LLM-as-judge) measure their quality.
Prompts
- What they are: Versioned templates for model calls.
- How you manage: create/edit, version, diff, and roll back.
- Where used: tied to projects; selectable in experiments to compare variants.
- Good practices: keep concise system/user prompts; bump versions when intent changes; annotate with change reasons.
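The practices above (versioning, bumping on intent change, annotating change reasons) can be sketched as a small data model. This is a hypothetical illustration, not this platform's actual API; all names here are invented.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptVersion:
    version: int
    system: str
    user_template: str   # e.g. "Summarize: {text}"
    change_reason: str   # annotate why this version exists

@dataclass
class Prompt:
    name: str
    versions: list = field(default_factory=list)

    def bump(self, system: str, user_template: str, change_reason: str) -> PromptVersion:
        # Each edit that changes intent creates a new immutable version.
        v = PromptVersion(len(self.versions) + 1, system, user_template, change_reason)
        self.versions.append(v)
        return v

    def latest(self) -> PromptVersion:
        return self.versions[-1]

p = Prompt("summarizer")
p.bump("You are a concise summarizer.", "Summarize: {text}", "initial version")
p.bump("You are a concise summarizer. Answer in one sentence.", "Summarize: {text}",
       "tighten output length")
```

Rolling back is then just selecting an earlier entry in `versions`; diffing compares two `PromptVersion` records.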
Models
- What they are: Registered model configurations per project (provider, model name, params).
- How you manage: add/update configs; set defaults; pair with prompts in experiments.
- Good practices: record temperature/top_p/etc.; keep staging vs. prod configs distinct; note provider-specific limits.
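One way to keep staging and prod configs distinct while recording sampling parameters is a small per-project registry. A minimal sketch; the provider and model names below are illustrative assumptions, not values from this platform.

```python
# Hypothetical registered model configs for one project.
MODEL_CONFIGS = {
    "staging": {
        "provider": "openai",            # assumption: provider identifier
        "model": "gpt-4o-mini",          # assumption: cheaper model for staging
        "params": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 512},
    },
    "prod": {
        "provider": "openai",
        "model": "gpt-4o",
        "params": {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512},
    },
}

def get_config(env: str = "prod") -> dict:
    """Return the registered config for an environment; 'prod' is the default."""
    return MODEL_CONFIGS[env]
```

Keeping the full `params` dict in the registry means an experiment record can capture exactly which temperature/top_p produced a generation.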
Scores
- What they are: Metrics that judge generations. The scoring types are Numeric, Ordinal, and Nominal (RAGAS, a scoring framework, is also supported); a score can be assigned manually or produced by an LLM evaluator prompt.
- How you manage: define a name and scoring type; the score is then applied directly in evaluations and experiments.
- Good practices: prefer clear, deterministic criteria; document prompts used for LLM judges; version when logic changes.
Why they matter together
- Prompts + Models produce the generations; Scores tell you whether those generations are good.
- Experiments compare prompt/model variants; Evaluations apply Scores to datasets to guard against regressions.
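Comparing prompt/model variants with a shared score reduces to aggregating score values per variant. A minimal sketch with invented variant names and score values:

```python
from statistics import mean

# Hypothetical experiment results: score values keyed by (prompt version, model) variant.
results = {
    ("summarizer@v1", "gpt-4o-mini"): [0.60, 0.70, 0.65],
    ("summarizer@v2", "gpt-4o-mini"): [0.80, 0.75, 0.85],
}

def best_variant(results: dict) -> tuple:
    """Pick the variant with the highest mean score across the dataset."""
    return max(results, key=lambda variant: mean(results[variant]))
```

The same aggregation run repeatedly against a fixed dataset is what lets an evaluation flag a regression when a new variant's mean score drops.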
Related concepts
- Evaluations: run scores on a dataset to check quality.
- Experiments: run prompt versions against a dataset; compare variants with shared scores. See Experiments.
- Entities: model entities label spans so you can see which config ran in traces/conversations.