Evaluation of LLMs, RAGs, Agents

AUTHOR: @Marko Budiselic DATE: December 17, 2024

<aside> 🤔

Eric Schmidt: “Analytical programming is gradually being replaced by learning the answer.” https://youtu.be/2Zg--ouGl7c?t=1680

Satya Nadella: “Every app needs a database, k8s cluster and a model that runs on an AI accelerator.” https://youtu.be/9NtsnzRFJ_o?t=3516

</aside>

In the “agentic” world, one of the central questions is how to improve agents over time. In general, to improve anything, there has to be a clear feedback loop when something changes so that one can immediately conclude if the change was helpful. In the text below, an agent is the most generic word; it refers to a simple RAG pipeline and advanced software that autonomously executes actions in the real world.

Why Is Evaluation Custom to Each Agent?

To continuously improve an agent, it's essential to have a clear way of evaluating its performance. The most effective evaluation method will depend on the agent's input, underlying data and output. Different approaches and metrics may be more suitable for different types of results. In the same way as an agent is an app that varies for various use cases, the evaluation itself is also custom for each use case.

For example, factoid questions usually have short answers, and the quality of the results is typically binary (true or false). On the other side, broad questions that may involve summarization or trend analysis don't have quality expressed in a binary way, and it's more suitable to evaluate that type of result by using a set of scores that will contribute to the final score for a given answer.

Let’s learn about some of the ways how text answers could be evaluated 👀

Automated Methods of Evaluation

G-Eval uses LLMs with chain-of-thoughts (CoT) to evaluate LLM outputs based on ANY custom criteria.
- https://arxiv.org/abs/2303.16634
- https://docs.confident-ai.com/docs/metrics-llm-evals
RAGAS (Retrieval-Augmented Generation Assessment System) is a reference-free framework that evaluates RAG systems across several dimensions without relying on annotated datasets.
- https://arxiv.org/abs/2309.15217
- https://docs.ragas.io/en/stable/
ARES (Automated Retrieval-Augmented Generation Evaluation System) takes a more structured approach by incorporating synthetic data generation and the Prediction-Powered Inference (PPI) technique.
- https://arxiv.org/abs/2311.09476
- https://ares-ai.vercel.app/
N-gram Overlap Metrics
- BLEU (Bilingual Evaluation Understudy)
  - https://en.wikipedia.org/wiki/BLEU
- ROGUE (Recall-Oriented Understudy for Gisting Evaluation)
  - https://en.wikipedia.org/wiki/ROUGE_(metric)
  - https://www.freecodecamp.org/news/what-is-rouge-and-how-it-works-for-evaluation-of-summaries-e059fb8ac840/

Some of these methods are just based on LLM providing information about the agent output. Can we trust that? 🤔 My current reasoning is that, e.g., G-Eval is providing really good results because it’s not just LLM evaluating something, it’s actually a human-guided evaluation (again, custom for a given use case).

But wait, there are also a lot of words talking about different properties of the results, let’s list/define them as well 👀

From Another Perspective

<aside> 🤔

Dylan Patel: “Subjective vs Objective Grading” https://youtu.be/QVcSBHhcFbg?t=2154

</aside>

Objective scoring exists in: code, math, engineering, …

On the other hand, subjective: what’s the most beautiful image, what’s the best style for a given email, …

Why Is Evaluation Custom to Each Agent?

Automated Methods of Evaluation

From Another Perspective

The Evaluation Glossary