🎯 Evaluations

Evaluations measure how well an AI meets goals like accuracy, safety, and user satisfaction.

Eval Type

Focus

Usage

Examples

Standard KPIs

Model-Centric Evals

Technical model performance

Model tuning, regression testing, performance tuning

Accuracy, F1 score, BLEU, perplexity, hallucination rate

Accuracy %, Latency p50/p95, Cost per token

Product-Centric Evals

End-to-end behavior in product

User-facing quality, tradeoff analysis, experience tuning

Task success rate, latency, coverage, cost per completion

Task success %, Latency p50/p95, Override rate %

Human-in-the-Loop Evals

Human judgment of output quality

Subjective quality, instruction-following, tone, coherence

Ranking A vs B, rubric scoring, open-ended reviews

Satisfaction score, Override rate %

Behavioral/Scenario Evals

Simulated real-world or edge cases

Robustness, safety, contextual reliability

Prompt injections, adversarial inputs, multi-turn tasks

Unsafe-output rate %, Failure rate %, Recovery success %

Updated on