🎯 Evaluations

Evaluations measure how well an AI meets goals like accuracy, safety, and user satisfaction.

Eval Type | Focus | Typical Use Cases | Example Methods & Metrics | Standard KPIs
---|---|---|---|---
Model-Centric Evals | Technical model performance | Model tuning, regression testing, performance tuning | Accuracy, F1 score, BLEU, perplexity, hallucination rate | Accuracy %, latency p50/p95, cost per token
Product-Centric Evals | End-to-end behavior in the product | User-facing quality, tradeoff analysis, experience tuning | Task success rate, latency, coverage, cost per completion | Task success %, latency p50/p95, override rate %
Human-in-the-Loop Evals | Human judgment of output quality | Subjective quality, instruction following, tone, coherence | Pairwise A/B ranking, rubric scoring, open-ended reviews | Satisfaction score, override rate %
Behavioral/Scenario Evals | Simulated real-world and edge-case scenarios | Robustness, safety, contextual reliability | Prompt injections, adversarial inputs, multi-turn tasks | Unsafe-output rate %, failure rate %, recovery success %
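
To make the KPI columns concrete, here is a minimal Python sketch of how product-centric metrics like task success %, override rate %, and latency p50/p95 might be computed from eval run records. The `runs` data and its field names are hypothetical stand-ins for whatever your own harness produces, not part of any specific library.

```python
# Hypothetical eval records: one dict per test case run end-to-end through
# the product. "success" marks task completion, "latency_ms" is response
# time, and "overridden" marks cases where a human corrected the output.
runs = [
    {"success": True,  "latency_ms": 420,  "overridden": False},
    {"success": True,  "latency_ms": 980,  "overridden": True},
    {"success": False, "latency_ms": 1510, "overridden": True},
    {"success": True,  "latency_ms": 630,  "overridden": False},
]

def percentile(sorted_values, pct):
    """Nearest-rank percentile over an ascending-sorted list."""
    idx = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[max(0, min(len(sorted_values) - 1, idx))]

latencies = sorted(r["latency_ms"] for r in runs)
task_success = 100 * sum(r["success"] for r in runs) / len(runs)
override_rate = 100 * sum(r["overridden"] for r in runs) / len(runs)

print(f"Task success %:  {task_success:.1f}")
print(f"Override rate %: {override_rate:.1f}")
print(f"Latency p50 ms:  {percentile(latencies, 50)}")
print(f"Latency p95 ms:  {percentile(latencies, 95)}")
```

In practice the records would come from replaying a fixed test suite against each model or product version, so the same KPIs can be tracked release over release.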