Beyond Accuracy
In the early days of machine learning, evaluation was simple: accuracy, precision, recall, F1.
If your model scored higher, it was considered better.
Then large language models arrived.
They generate open-ended text, reason across context, and occasionally hallucinate—forcing a shift in evaluation from correctness alone to usefulness, coherence, and trust.
From Predictive Metrics to Generative Metrics
Traditional metrics such as accuracy and ROC-AUC work well for classification, where each input has a single correct label. Regression metrics like MAE, MSE, and R-squared quantify how far numeric predictions fall from the true values.
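To make these predictive metrics concrete, here is a minimal sketch in plain Python. The function names and the toy labels are illustrative assumptions, not taken from any particular library; in practice a package such as scikit-learn would usually do this work.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """MAE, MSE, and R-squared for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    r2 = 1 - (mse * n) / ss_tot if ss_tot else 0.0
    return {"mae": mae, "mse": mse, "r2": r2}


# Toy data, made up for illustration.
clf = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(clf)
print(reg)
```

Every one of these numbers assumes there is exactly one right answer per input, which is the assumption generative models break.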
Generative models break these assumptions. The same prompt can yield multiple valid outputs—each correct in different ways. This gave rise to Generative Metrics, which assess quality, diversity, and linguistic fidelity.
Key metrics include:
- BLEU (IBM, 2002): Measures n-gram overlap with reference text. Fast and widely used, but penalizes semantically correct paraphrases.
- ROUGE (USC/ISI): Recall-oriented n-gram comparison, common in summarization. Efficient, but it misses valid rewordings that share few words with the reference.
- METEOR (CMU): Incorporates synonym matching and stemming. More forgiving than BLEU/ROUGE but computationally heavier.
- BERTScore (Cornell): Uses contextual embeddings to measure semantic similarity. Effective for open-ended generation, though less interpretable.
- Perplexity and diversity metrics: Perplexity measures how well a model predicts a text, so it evaluates fluency, not truthfulness. Diversity metrics track variability across outputs, which must be balanced against coherence.
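The metrics in this list can be sketched in a few lines of plain Python. This is a simplified illustration, not the official BLEU or ROUGE implementations: it computes clipped n-gram precision (the core of BLEU, without the brevity penalty or the geometric mean over n-gram orders) and n-gram recall (the core of ROUGE-N), plus perplexity from per-token probabilities and a distinct-n diversity ratio. All function names and example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_scores(candidate, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    matched = sum((cand & ref).values())  # counts clipped to the reference
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def distinct_n(texts, n=2):
    """Share of n-grams across all texts that are unique."""
    grams = [g for t in texts for g in ngrams(t.lower().split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


reference = "the cat sat on the mat"
close_copy = "the cat sat on a mat"
paraphrase = "a feline rested upon the rug"

close_scores = overlap_scores(close_copy, reference)
paraphrase_scores = overlap_scores(paraphrase, reference)
print(close_scores, paraphrase_scores)
```

The paraphrase means the same thing as the reference yet scores far lower than the near copy, which is exactly the weakness of overlap metrics described above.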
Emerging Advanced Metrics
- MAUVE — Measures distributional similarity between human and model text
- BLEURT & MoverScore — Embedding-based quality estimators
- Factuality metrics — Crucial for retrieval-augmented and knowledge-grounded systems
No single metric is sufficient. Each exposes only part of model behavior.
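Embedding-based metrics can be illustrated with a toy version of the greedy matching idea behind BERTScore: each candidate token is matched to its most similar reference token by cosine similarity, and the same is done in reverse. The three-dimensional vectors below are hand-written assumptions for illustration only; the real metric uses contextual embeddings from a pretrained model and optional IDF weighting.

```python
import math

# Hypothetical toy "embeddings", invented for this example.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "feline": [0.85, 0.15, 0.05],
    "sat": [0.1, 0.9, 0.0],
    "rested": [0.15, 0.85, 0.1],
    "car": [0.0, 0.1, 0.9],
    "crashed": [0.1, 0.0, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_match_f1(candidate_tokens, reference_tokens, embeddings):
    """F1 over best-match cosine similarities in both directions."""
    cand = [embeddings[t] for t in candidate_tokens]
    ref = [embeddings[t] for t in reference_tokens]
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = ["cat", "sat"]
paraphrase_f1 = greedy_match_f1(["feline", "rested"], reference, TOY_EMBEDDINGS)
unrelated_f1 = greedy_match_f1(["car", "crashed"], reference, TOY_EMBEDDINGS)
print(round(paraphrase_f1, 3), round(unrelated_f1, 3))
```

With these toy vectors, the paraphrase scores close to 1 even though it shares no words with the reference, while the unrelated sentence scores low. The score itself is harder to explain than an n-gram count, which is the interpretability trade-off.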
Semantic and Task-Aware Evaluation
As LLMs evolved into multimodal, reasoning-driven systems, semantic similarity and task relevance became central questions:
Does this output capture the intended meaning?
Frameworks such as LangChain’s LangSmith help bridge this gap by supporting both string-based and semantic evaluations.
LangSmith differentiates between:
- Unit evaluations — prompt → output
- Dataset evaluations — batch testing at scale
- Chain evaluations — evaluating reasoning steps, tools, and agent behavior
This marks a shift from evaluating outputs to evaluating systems.
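A dataset-level evaluation can be sketched without any framework. The code below does not use LangSmith's API; it is a framework-agnostic illustration in which the evaluator functions, the `toy_system` lookup table standing in for an LLM call, and the two-example dataset are all hypothetical. A real semantic evaluator would plug in where the lenient token-overlap evaluator sits.

```python
def exact_match(output, reference):
    """Strict string evaluator: 1.0 only if the texts match after trimming."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def token_jaccard(output, reference):
    """Lenient lexical evaluator: shared tokens divided by all tokens."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def evaluate_dataset(system, dataset, evaluators):
    """Run `system` on every example and average each evaluator's score."""
    totals = {name: 0.0 for name in evaluators}
    for example in dataset:
        output = system(example["input"])
        for name, evaluator in evaluators.items():
            totals[name] += evaluator(output, example["reference"])
    return {name: total / len(dataset) for name, total in totals.items()}


def toy_system(prompt):
    """Hypothetical system under test; a lookup table instead of a model."""
    answers = {"capital of France?": "Paris", "2 + 2?": "The answer is 4"}
    return answers.get(prompt, "")


dataset = [
    {"input": "capital of France?", "reference": "paris"},
    {"input": "2 + 2?", "reference": "4"},
]
report = evaluate_dataset(
    toy_system,
    dataset,
    {"exact_match": exact_match, "token_jaccard": token_jaccard},
)
print(report)
```

The second answer is correct but wordy, so the strict evaluator rejects it while the lenient one gives partial credit. Running several evaluators side by side makes that kind of disagreement visible.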
Model Evaluation vs. Task Evaluation
A classic distinction in data science remains relevant:
- Model evaluation asks how well a model performs on benchmarks
- Task evaluation asks how well a system performs in its real context
A strong summarization model may still fail in customer support if it mishandles tone or factual accuracy. Real-world evaluation blends metrics with judgment.
ML-based vs. LLM-based Evaluation
Two approaches have emerged:
ML-based Evaluations
Supervised models classify responses as “good” or “bad.”
They are reproducible and efficient—but limited to what they were trained to detect.
LLM-based Evaluations
Here, LLMs act as judges:
“Rate how helpful this response is from 1–5.”
Popularized by LangSmith, OpenAI Evals, and PromptBench, this approach enables scalable qualitative assessment.
However, LLM judges can be biased (for example, favoring longer answers or whichever option appears first) and inconsistent across runs. The strongest setups combine:
- reference data
- human feedback
- model-based scoring
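An LLM-as-judge loop might look like the sketch below. It is deliberately model-agnostic: `judge_fn` is a hypothetical callable that takes a prompt string and returns the judge's reply, and `make_stub_judge` is a stand-in so the example runs without any API. Sampling the judge several times and averaging is one simple way, assumed here, to dampen run-to-run inconsistency.

```python
import re
from statistics import mean

JUDGE_TEMPLATE = (
    "Rate how helpful this response is from 1-5.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply with a single integer."
)


def parse_score(raw, low=1, high=5):
    """Return the first in-range integer in the judge's reply, else None."""
    for match in re.findall(r"\d+", raw):
        value = int(match)
        if low <= value <= high:
            return value
    return None


def judge_response(judge_fn, question, response, samples=3):
    """Query the judge several times and average the parseable scores."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    scores = [parse_score(judge_fn(prompt)) for _ in range(samples)]
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None


def make_stub_judge(replies):
    """Hypothetical judge that returns canned replies in order."""
    iterator = iter(replies)
    return lambda prompt: next(iterator)


stub = make_stub_judge(["4", "Score: 5", "I cannot rate this."])
score = judge_response(stub, "How do I reset my password?", "Click 'Forgot password'.")
print(score)
```

Unparseable replies are dropped, not guessed at. Logging how often that happens is itself a useful signal about judge reliability.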
Practical Evaluation with Evidently AI
If LangSmith structures evaluation, Evidently AI operationalizes it.
Evidently focuses on observability through metrics such as:
- text and embedding drift
- semantic similarity
- response diversity
- bias and toxicity detection
- data quality over time
This transforms evaluation from a one-time exercise into a continuous feedback loop.
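The idea behind embedding drift can be shown in plain Python. This sketch does not use Evidently's API; it compares the centroid of a reference window of embeddings with the centroid of a current window and flags drift when their cosine distance exceeds a threshold. The two-dimensional vectors and the 0.1 threshold are arbitrary assumptions for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def embedding_drift(reference, current, threshold=0.1):
    """Flag drift when the window centroids point in different directions."""
    distance = cosine_distance(mean_vector(reference), mean_vector(current))
    return {"distance": distance, "drifted": distance > threshold}


# Toy embeddings, invented for this example.
reference_window = [[1.0, 0.0], [0.9, 0.1]]
same_topic_window = [[0.95, 0.05], [1.0, 0.1]]
new_topic_window = [[0.1, 1.0], [0.0, 0.9]]

stable = embedding_drift(reference_window, same_topic_window)
shifted = embedding_drift(reference_window, new_topic_window)
print(stable["drifted"], shifted["drifted"])
```

Run on a schedule against fresh production traffic, a check like this is what turns evaluation into monitoring.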
Human-in-the-Loop Still Matters
No metric fully captures nuance, trust, or empathy—especially in healthcare, finance, or education.
Modern evaluation systems combine:
- automated scoring for speed
- semantic analysis for nuance
- human review for grounding
This layered approach reflects how intelligent systems are responsibly deployed.
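One way to wire these layers together is a simple routing rule: outputs that clear both automated checks are accepted, and everything else goes to a human review queue. The record fields, thresholds, and the choice to always send high-stakes domains to a person are assumptions made for this sketch, not a prescribed policy.

```python
HIGH_STAKES_DOMAINS = {"healthcare", "finance", "education"}


def route_for_review(records, auto_threshold=0.8, semantic_threshold=0.7):
    """Split records into auto-accepted ids and ids queued for human review."""
    accepted, human_queue = [], []
    for record in records:
        passes_automated = (
            record["auto_score"] >= auto_threshold
            and record["semantic_score"] >= semantic_threshold
        )
        high_stakes = record.get("domain") in HIGH_STAKES_DOMAINS
        # Assumption: high-stakes domains always get a human look.
        if passes_automated and not high_stakes:
            accepted.append(record["id"])
        else:
            human_queue.append(record["id"])
    return accepted, human_queue


# Hypothetical scored outputs.
records = [
    {"id": "a", "auto_score": 0.9, "semantic_score": 0.9, "domain": "retail"},
    {"id": "b", "auto_score": 0.9, "semantic_score": 0.5, "domain": "retail"},
    {"id": "c", "auto_score": 0.95, "semantic_score": 0.95, "domain": "healthcare"},
]
accepted, human_queue = route_for_review(records)
print(accepted, human_queue)
```

The thresholds control how much work lands on reviewers, so they are worth tuning against human agreement data.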
Looking Ahead
LLM evaluation is not static. It evolves as models gain reasoning, memory, and personalization.
The trajectory looks like this:
Predictive → Generative → Semantic → Task-aware → Human-aligned
The objective isn’t perfection—it’s trust, consistency, and learning through feedback.
Evaluating LLMs is not about finding a single “right” metric.
It is about designing the right conversation between models, humans, and data.
That dialogue is how intelligent systems are built.
— Sanuwar Rashid
References
- BLEU: Papineni et al., IBM, 2002. https://en.wikipedia.org/wiki/BLEU
- ROUGE: Lin, USC/ISI, 2004. https://www.aclweb.org/anthology/W04-1013/
- METEOR: Banerjee & Lavie, CMU, 2005. https://www.cs.cmu.edu/~alavie/METEOR/
- BERTScore: Zhang et al., 2019. https://arxiv.org/abs/1904.09675
- MAUVE: Pillutla et al., 2021. https://arxiv.org/abs/2109.03687
- Evidently AI LLM Evaluation Guide. https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics