Quality

Quality in Arcane is where you test, measure, and improve your AI systems. You work with prompts and scores, run experiments and evaluations, manage datasets, and use annotation queues for human review.


What it is

Arcane's quality features help you:

  • Create and version prompts — Build prompt templates, track versions, and compare changes.
  • Define scores — Set up reusable metrics with a scoring type of Numeric, Ordinal, or Nominal, or use RAGAS, a scoring framework. Scores can be manual (you enter results yourself) or use an optional LLM evaluator prompt to judge outputs automatically.
  • Run experiments — Test prompt versions against datasets to see what they produce.
  • Run evaluations — Measure quality by applying scores to datasets or experiment results.
  • Manage datasets — Curate collections of items for repeatable testing.
  • Annotate data — Use annotation queues to get human labels for quality tracking.

All of this works together to help you catch regressions, compare variants, and maintain quality over time. If you're new to these concepts, see Core Concepts.
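
The three scoring types differ mainly in how per-row results aggregate. A rough sketch of the distinction in plain Python (the data, labels, and variable names below are illustrative, not Arcane's data model): numeric scores average, ordinal scores have an order you can take a median over, and nominal scores only support a label distribution.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical per-row results from an evaluation; everything here is
# illustrative, not Arcane's internal representation.
numeric_results = [0.8, 0.9, 0.7, 1.0]                       # e.g. relevance in [0, 1]
ordinal_levels = ["low", "medium", "high"]                   # ordered categories
ordinal_results = ["high", "medium", "high", "low"]
nominal_results = ["refund", "billing", "refund", "other"]   # unordered labels

# Numeric scores aggregate as a mean.
numeric_summary = mean(numeric_results)  # 0.85

# Ordinal scores have an order, so a median over the level index is meaningful.
ordinal_summary = ordinal_levels[
    int(median(ordinal_levels.index(r) for r in ordinal_results))
]

# Nominal scores have no order; report the label distribution instead.
nominal_summary = Counter(nominal_results)

print(numeric_summary, ordinal_summary, nominal_summary.most_common(1))
```

The aggregation you see in evaluation statistics depends on the type, which is why choosing Numeric vs. Ordinal vs. Nominal matters when you define a score.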


What you can do

  • Datasets — Create datasets (manually or by importing a CSV), map span attributes using Dataset Mode, and view and edit rows. Use datasets in experiments and evaluations.
  • Prompts & Scores — Create and version prompts (templates for model calls) and define scores (metrics for measuring quality). Use prompts in experiments; use scores in evaluations.
  • Experiments — Run a prompt version against a dataset to generate outputs. Compare different prompt versions or models on the same dataset.
  • Evaluations — Apply scores to datasets or experiment results to measure quality. View statistics and per-row results.
  • Annotations — Create annotation queues, add traces and conversations, answer questions, and use the labeled data for training or evaluation.
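
For the CSV import path, a dataset file is just rows of inputs and, optionally, expected outputs. A minimal sketch using Python's standard csv module — the column names `input` and `expected_output` are hypothetical, not a required Arcane schema:

```python
import csv
import io

# Two example dataset items; column names are illustrative only.
rows = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "2 + 2", "expected_output": "4"},
]

# Write the CSV (an in-memory buffer here; a file works the same way).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["input", "expected_output"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Reading it back yields one dataset item per row.
items = list(csv.DictReader(io.StringIO(csv_text)))
print(len(items))  # 2
```

Each row becomes one dataset item, so experiments and evaluations can iterate over the rows and compare outputs against `expected_output`-style columns.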

Getting started

  1. Register model configurations (Organisation Configuration → AI Models) so prompts can run.
  2. Create a dataset — Either import a CSV or use Dataset Mode in Traces to map span attributes.
  3. Create a prompt — Build a versioned template for your model calls.
  4. Run an experiment — Test your prompt against the dataset to see outputs.
  5. Define scores — Set up metrics to measure quality.
  6. Run an evaluation — Apply scores to your experiment results or dataset to measure quality.
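
Steps 3 and 4 hinge on the idea that a prompt is a versioned template an experiment fills from each dataset item. A rough sketch of that idea in plain Python — the version names and the `question` variable are invented for illustration; Arcane stores and versions prompts for you:

```python
import string

# Two hypothetical versions of the same prompt template.
prompt_versions = {
    "v1": string.Template("Answer briefly: $question"),
    "v2": string.Template(
        "You are a helpful assistant.\nQuestion: $question\nAnswer:"
    ),
}

# One dataset item supplies the template variables.
dataset_item = {"question": "What is the capital of France?"}

# An experiment renders every version against the same item,
# so the resulting model outputs are directly comparable.
rendered = {v: t.substitute(dataset_item) for v, t in prompt_versions.items()}
print(rendered["v1"])
```

Because every version runs against the same dataset, differences in the outputs can be attributed to the prompt change rather than the input.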

Pages in this section

  • Datasets — Create, import, and manage datasets. Use Dataset Mode to map span attributes.
  • Prompts & Scores — Create prompts and define scores for quality measurement.
  • Evaluation & Experimentation — Run experiments and evaluations to test and measure quality.
  • Annotations — Create queues, add items, and answer questions for human review.

Related pages

  • Core Concepts — Datasets, prompts, models, scores, experiments, and evaluations.
  • Traces — Use Dataset Mode to build datasets from traces.
  • Conversations — Add conversations to annotation queues.