LLMOps 101: Metrics & Evaluation — Part 1

ElectricWeegie, October 27, 2023. Cross post from Medium.

It seems you can't look online without there being a new state-of-the-art (SOTA) LLM being released. Over the past few weeks, just in the LLM space, we've had Falcon-180B, DeciLM-6B, Llama-v2 and many more. We've also had multi-modal models and capabilities coming out left, right and centre. It's hard to keep up with what's available, and it's even harder to know what the best model to use is.

That's an interesting concept: what does it mean for a model to be the "best"? This is the question of LLM evaluation, and it is the topic of this post.

In the ML project development lifecycle, understanding the performance of the model being used is critical for success, and effectively calculating this performance is an important part of the process. After training and then performing testing and validation, how do we actually know that one model is better than another? Well, in more 'classic' ML we have a plethora of metrics we can use. For example, for classification models we can use accuracy, precision, recall, F1-score and so on, and for regression we can use R², RMSE and so on.

For an LLM, though, this question can be a bit trickier to answer. If your model is generating text, for example, how do you put a number on how "well" that text was generated? How do you define performance for an LLM? This is not only important at the initial model-building phase of a project, but is absolutely critical to understand if we are to successfully monitor our models post-deployment. The ability to automatically track and calculate the performance of our models in production is a core tenet of good MLOps practice, and it is just as important for generative AI applications.

Luckily, many clever people have been thinking about this and we are not starting from scratch. We can try to group the sorts of metrics that are important for LLMs in particular based on whether they are trying to measure:

- Appropriate use of grammar and syntax
- Truthfulness
- Summary capabilities
- Problem-solving capabilities

A nice strand of work has also developed around creating testing and observability frameworks that allow us to do these things more consistently. Let's discuss some of these points below.

Good grammar — Stop me if you think that you've heard this one before

How you to measure try start sentence grammar and good syntax or?

Yes, exactly. That was probably hard to read, so what I meant to say was "how do you start to try and measure good grammar or syntax?". How can we come up with a metric that we can calculate which tells us that the first sentence is not great, but that the second is much better? Also, if you asked me to summarize a newspaper article about hiking and I said "yoghurts are great", how can we put a number on the fact that this really isn't a great summary? Well, these two points are what the first category of evaluation metrics we will look at is all about.

Perplexity

First, enter one of the most popular metrics applied to LLM outputs: perplexity. Perplexity is what is known as an intrinsic evaluation metric for an NLP model, which means that its calculation only involves the model's training data and outputs and does not use any other reference or ground-truth data (a metric that does use such reference data is called an 'extrinsic' method). As hinted at in the previous paragraph, the aim of perplexity is to provide some quantification of the likelihood of a sentence being valid language.

It turns out that there are a few lower-level definitions, but this is the general aim. The main way to define perplexity is as the inverse probability of predicting the test set. For all ML models we have training and test sets; in this case, we can check the probability (or log probability) of predicting the sequence of tokens in the entire test set. If this probability is high, it suggests the model has seen very similar sequences of tokens in its training data; if it is very low, it suggests the test data is quite different. In information theory, this corresponds to asking how well a probability distribution predicts a sample, which is known as perplexity. If the distribution predicts the sample well, meaning high probability, then we have a low perplexity. If it predicts poorly, meaning the probability is low, we have a high perplexity [2].

This is quite easy to remember: if the sequences look like valid language, the model should not be that perplexed by them; if the sequences are a bit garbled, the perplexity should be higher. Easy as that! Now, an important point to note is that since test sets can be of different sizes and models use different n-grams for the tokens in the data, there are some nuances around calculating and normalizing perplexity that we won't go into here, but these are dealt with in most implementations. You can then use perplexity to compare different autoregressive LLMs on the same test set. The reason we call out autoregressive models here, like the GPT models, is that perplexity is not a well-defined measure for masked models like BERT [4].
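To make this concrete, here is a minimal sketch of computing perplexity for a single sentence using the Hugging Face transformers library and GPT-2. The library, model choice and example strings are my own for illustration; a proper evaluation would score a whole test set, with the normalisation and sliding-window approach described in the Hugging Face perplexity guide [4].

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative choice of model; any autoregressive (causal) LM would do.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model: exp(mean negative log-likelihood)."""
    encodings = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy (negative log-likelihood) over the shifted tokens.
        outputs = model(encodings.input_ids, labels=encodings.input_ids)
    return torch.exp(outputs.loss).item()

print(perplexity("How do you start to try and measure good grammar or syntax?"))
print(perplexity("How you to measure try start sentence grammar and good syntax or?"))
```

We would expect the garbled second sentence to come back with a noticeably higher perplexity than the first, which is exactly the behaviour we want from the metric.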
Bilingual Evaluation Understudy (BLEU) Score

NLP and linguistics have long had a need to computationally evaluate the quality of machine translations from one language to another. The BLEU score (more accurately described as a family of BLEU scores) was developed to do exactly this, by taking a candidate translation sentence and comparing it to a reference sentence. The comparison is done by counting the number of matched n-grams across the two sentences, and it can be modified to work at the level of multiple sentences [5, 6].

Now, for many LLM tasks we do not actually want to translate, but perhaps we do want to do something like summarize some text. In this case we can replace the candidate translated sentence with the candidate summary and compare it with the reference text. BLEU scores are defined to range between 0 and 1, with the specific thresholds for what counts as "good" being relatively subjective. In general though, a good rule of thumb seems to be that 0.4–0.5 is considered a high-quality "answer" and >0.6 is often considered better than a human [7].
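As a quick illustration, here is a minimal sketch of a sentence-level BLEU calculation using NLTK; the reference and candidate sentences are invented for the example, and a real evaluation would use a proper reference set (and usually corpus-level BLEU).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented example: a reference summary and a candidate produced by a model.
reference = "the hike through the glen was long but very rewarding".split()
candidate = "the walk through the glen was long but very rewarding".split()

# sentence_bleu takes a list of reference token lists and one candidate list.
# Smoothing prevents a hard zero when a higher-order n-gram has no matches.
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)

print(f"BLEU: {score:.3f}")
```

The default weights average precision over 1-grams up to 4-grams; if you only care about shorter n-grams you can pass custom weights.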
The Importance of Being … Honest or "You can't handle the truth!"

One of the biggest criticisms levelled at LLMs is their tendency to hallucinate and generate language that is not factually correct. Although many people, like Yann LeCun and Gary Marcus, believe this is a fundamental feature of LLMs that requires new innovation to remove, many (the author included) still believe it is worth trying to evaluate and optimize the current generation of LLMs to have better "truthfulness" and factual accuracy. Again, the story for LLMs is not a simple one, so let's dive into some ways of getting to the truth, part of the truth and mostly the truth but likely some other bits*. First, how can we quantify how "untruthful" these models are?

To start to tackle this question, Lin et al. released a benchmark dataset called TruthfulQA [8] and showed that LLMs were in general poor at answering questions truthfully across several different domains. In fact, they found that the models tested only answered "truthfully" for 58% of questions, whereas human performance was at 94%. The benchmark was specifically designed to use questions that a human may answer incorrectly due to a false belief or assumption, so the poor performance of the models suggested that they were learning false answers from the training data.

A more recent paper from Azaria and Mitchell** [9] suggested a method for determining the truthfulness of LLM-generated statements by training a true/false classifier on the hidden layer activations of the LLM. It should be noted that the data used to generate the LLM activations and statements comes from out-of-distribution, hold-out datasets. This is quite an interesting approach, but it remains to be seen whether it scales under further experimentation. There are many other approaches being developed to try and help ascertain the truthfulness of LLM outputs, but this gives you a flavour of what's going on. I also think it's important to note that metrics such as accuracy, precision, recall and F1-score, which have been used for classification models for a long time, can still be applied here as long as the problem and dataset are appropriate.

*As opposed to "the truth, the whole truth and nothing but the truth", part of the oath traditionally used when you are sworn in to testify in a US court. At least from what I've seen on television.

**I have a slight issue with the title of Azaria and Mitchell's paper, "The Internal State of an LLM Knows When It's Lying", as it overtly anthropomorphises LLMs. But I do like the approach they have taken.

Techniques to improve truthfulness

A lot of work has gone into developing techniques that can help improve the task performance of LLMs, including truthfulness. One approach is to fine-tune existing models on more relevant or higher-quality data. The team behind the phi-1 code generation LLM went so far as to say that "Textbooks Are All You Need" in their paper [10], where they seemed to show that a curated, high-quality dataset led to SOTA performance even with less data and less compute. This makes intuitive sense and is quite a powerful result if it generalises across different domains and LLM applications.

Another promising direction is to use prompt engineering techniques to improve an existing LLM's output quality. For example, the chain-of-verification (CoVe) prompt pattern [11] works by having a model answer a question, then create questions to help catch errors in its answer, answer those questions independently, and finally use all of this to create a final, verified response. The study shows a variety of interesting results, including improved precision in list-based tasks, improved F1-score for QA based on a given text and improved precision in generated long-form text, among others. This shows that sometimes clever prompt engineering can go a long way. For a nice, detailed summary of the CoVe approach, check out this post by Raphael Mansuy [12]. A rough sketch of the pattern is shown below.
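To make the four steps concrete, here is a hedged sketch of the CoVe loop in Python. It assumes a hypothetical llm(prompt) helper that wraps whatever model or API you are using, and the prompts are simplified illustrations rather than the exact templates from the paper [11].

```python
from typing import Callable

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    """Illustrative CoVe-style loop; `llm` is a hypothetical text-in, text-out helper."""
    # 1. Draft a baseline answer to the question.
    baseline = llm(f"Answer the following question:\n{question}")

    # 2. Plan verification questions that could expose errors in the draft.
    plan = llm(
        "List short factual questions (one per line) that would help verify "
        f"this answer:\nQuestion: {question}\nAnswer: {baseline}"
    )
    verification_questions = [
        line.strip("- ").strip() for line in plan.splitlines() if line.strip()
    ]

    # 3. Answer each verification question independently, so the model is not
    #    anchored on its own (possibly wrong) draft.
    verifications = [f"Q: {q}\nA: {llm(q)}" for q in verification_questions]

    # 4. Produce the final, verified answer using the verification results.
    return llm(
        "Revise the draft answer so that it is consistent with the verification "
        f"results below.\nQuestion: {question}\nDraft answer: {baseline}\n"
        "Verification results:\n" + "\n".join(verifications)
    )
```

In practice you would add guards for empty verification lists and cap the number of verification questions, but the shape of the pattern is the same.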
Testing & Observability Frameworks

One thing that is definitely going to be the case for LLMOps is that we will need to rethink how we monitor our deployed models and applications. As we have been discussing in this article, many of the techniques and benchmarks we need to apply to LLMs are just different. This in turn means we'll need to build a lot of new tools to help apply these different techniques. The good news is that this is happening in earnest across the globe. LLMs.HowTo has a great list of existing and in-development testing and observability frameworks that I suggest you check out [13]. The application of these frameworks is going to become a critical part of LLMOps in my view. It's only through standardised approaches that we can create reproducible workflows and not just effectively compare system performance, but also monitor it through time. In a future post I'm going to get into some of these frameworks in a more hands-on way.

Conclusion

In this article we've recapped some of the challenges around measuring the performance of LLMs, introduced you to some useful metrics and briefly mentioned a list of great LLM testing and observability frameworks. In part 2 we will dive into more detail and get more hands-on, but until then, happy coding!

Further Reading

1. https://research.aimultiple.com/large-language-model-evaluation/
2. https://towardsdatascience.com/perplexity-in-language-models-87a196019a94
3. [Intrinsic and extrinsic] https://www.scirp.org/journal/paperinformation.aspx?paperid=98203#ref17
4. https://huggingface.co/docs/transformers/perplexity
5. http://www.aclweb.org/anthology/P02-1040.pdf
6. https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
7. https://cloud.google.com/translate/automl/docs/evaluate
8. https://arxiv.org/abs/2109.07958
9. https://arxiv.org/abs/2304.13734
10. https://arxiv.org/abs/2306.11644
11. https://arxiv.org/abs/2309.11495
12. https://medium.com/@raphael.mansuy/making-ai-hallucinate-less-with-chain-of-verification-2e27682e842c
13. https://llmshowto.com/blog/llm-test-frameworks
14. https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG?blaid=5058200