Quality

Quality in Arcane is where you test, measure, and improve your AI systems. You work with prompts and scores, run experiments and evaluations, manage datasets, and use annotation queues for human review.


What it is

Arcane's quality features help you:

  • Create and version prompts — Build prompt templates, track versions, and compare changes.
  • Define scores — Set up reusable metrics with a scoring type of Numeric, Ordinal, or Nominal, or use RAGAS, a scoring framework. Scores can be manual (you enter results yourself) or use an optional LLM evaluator prompt to judge outputs automatically.
  • Run experiments — Test prompt versions against datasets to see what they produce.
  • Run evaluations — Measure quality by applying scores to datasets or experiment results.
  • Manage datasets — Curate collections of items for repeatable testing.
  • Annotate data — Use annotation queues to get human labels for quality tracking.

All of this works together to help you catch regressions, compare variants, and maintain quality over time. If you're new to these concepts, see Core Concepts.
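
The three scoring types differ mainly in how per-row results aggregate. A rough sketch of the distinction in plain Python (the data, labels, and variable names below are illustrative, not Arcane's data model): numeric scores average, ordinal scores have an order you can take a median over, and nominal scores only support a label distribution.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical per-row results from an evaluation; everything here is
# illustrative, not Arcane's internal representation.
numeric_results = [0.8, 0.9, 0.7, 1.0]                       # e.g. relevance in [0, 1]
ordinal_levels = ["low", "medium", "high"]                   # ordered categories
ordinal_results = ["high", "medium", "high", "low"]
nominal_results = ["refund", "billing", "refund", "other"]   # unordered labels

# Numeric scores aggregate as a mean.
numeric_summary = mean(numeric_results)  # 0.85

# Ordinal scores have an order, so a median over the level index is meaningful.
ordinal_summary = ordinal_levels[
    int(median(ordinal_levels.index(r) for r in ordinal_results))
]

# Nominal scores have no order; report the label distribution instead.
nominal_summary = Counter(nominal_results)

print(numeric_summary, ordinal_summary, nominal_summary.most_common(1))
```

The aggregation you see in evaluation statistics depends on the type, which is why choosing Numeric vs. Ordinal vs. Nominal matters when you define a score.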


What you can do

  • Datasets — Create datasets (manually or by importing a CSV), map span attributes using Dataset Mode, and view and edit rows. Use datasets in experiments and evaluations.
  • Prompts & Scores — Create and version prompts (templates for model calls) and define scores (metrics for measuring quality). Use prompts in experiments; use scores in evaluations.
  • Experiments — Run a prompt version against a dataset to generate outputs. Compare different prompt versions or models on the same dataset.
  • Evaluations — Apply scores to datasets or experiment results to measure quality. View statistics and per-row results.
  • Annotations — Create annotation queues, add traces and conversations, answer questions, and use the labeled data for training or evaluation.
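
For the CSV import path, a dataset file is just rows of inputs and, optionally, expected outputs. A minimal sketch using Python's standard csv module — the column names `input` and `expected_output` are hypothetical, not a required Arcane schema:

```python
import csv
import io

# Two example dataset items; column names are illustrative only.
rows = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "2 + 2", "expected_output": "4"},
]

# Write the CSV (an in-memory buffer here; a file works the same way).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["input", "expected_output"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Reading it back yields one dataset item per row.
items = list(csv.DictReader(io.StringIO(csv_text)))
print(len(items))  # 2
```

Each row becomes one dataset item, so experiments and evaluations can iterate over the rows and compare outputs against `expected_output`-style columns.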

Getting started

  1. Register model configurations (Organisation Configuration → AI Models) so prompts can run.
  2. Create a dataset — Either import a CSV or use Dataset Mode in Traces to map span attributes.
  3. Create a prompt — Build a versioned template for your model calls.
  4. Run an experiment — Test your prompt against the dataset to see outputs.
  5. Define scores — Set up metrics to measure quality.
  6. Run an evaluation — Apply scores to your experiment results or dataset to measure quality.
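
Steps 3 and 4 hinge on the idea that a prompt is a versioned template an experiment fills from each dataset item. A rough sketch of that idea in plain Python — the version names and the `question` variable are invented for illustration; Arcane stores and versions prompts for you:

```python
import string

# Two hypothetical versions of the same prompt template.
prompt_versions = {
    "v1": string.Template("Answer briefly: $question"),
    "v2": string.Template(
        "You are a helpful assistant.\nQuestion: $question\nAnswer:"
    ),
}

# One dataset item supplies the template variables.
dataset_item = {"question": "What is the capital of France?"}

# An experiment renders every version against the same item,
# so the resulting model outputs are directly comparable.
rendered = {v: t.substitute(dataset_item) for v, t in prompt_versions.items()}
print(rendered["v1"])
```

Because every version runs against the same dataset, differences in the outputs can be attributed to the prompt change rather than the input.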

Pages in this section

  • Datasets — Create, import, and manage datasets. Use Dataset Mode to map span attributes.
  • Prompts & Scores — Create prompts and define scores for quality measurement.
  • Evaluation & Experimentation — Run experiments and evaluations to test and measure quality.
  • Annotations — Create queues, add items, and answer questions for human review.

Related pages

  • Core Concepts — Datasets, prompts, models, scores, experiments, and evaluations.
  • Traces — Use Dataset Mode to build datasets from traces.
  • Conversations — Add conversations to annotation queues.