LLM Deep-dive: Llama 3.1

By Leo Hiselius
October 14, 2024

Introduction

Towards the end of July, Meta unveiled its latest family of open-source Llama models: Llama 3.1 8B, 70B, and 405B. While the smaller 8B and 70B models are incremental upgrades over the 3.0 versions released in April this year, the 405B model represents a significant milestone for open-source LLMs, challenging proprietary models like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet across multiple benchmarks. In this blog post, we give a brief comparison between these models and explore some of the most intriguing aspects of the 92-page report that accompanied the release of Llama 3.1.

Overview

Before we get into the benchmark performance and technical findings, let’s briefly compare Llama 3.1 405B with the previously mentioned models.

Llama 3.1 uses a standard dense transformer architecture, just like its predecessors, and according to Meta, the performance gains over earlier Llama models are the result of improved data quality and diversity, increased scale, and more training FLOPs. Unlike GPT-4o and Claude 3.5 Sonnet, Llama 3.1 is not multimodal, but it should be noted that Meta plans to release a compositional approach to multimodality in the foreseeable future.

Prompt examples

Let’s look at how the three Llama 3.1 models compare on a logical puzzle:

Prompt:

Kim is a developer. Kim has two colleagues working in sales. Each salesperson has two colleagues who are developers. How many colleagues who are developers does Kim have?

Llama 3.1 8B

Let's break it down step by step:
Kim has 2 colleagues working in sales.
Each sales person has 2 colleagues who are developers.
So, each sales person has 2 * 2 = 4 colleagues who are developers.
Since there are 2 sales people, we need to multiply 4 by 2 to get the total number of colleagues who are developers:
4 * 2 = 8
Therefore, Kim has 8 colleagues who are developers.

Llama 3.1 70B

Let's break this down:
Kim is a developer and has 2 colleagues in sales. Each of these sales colleagues has 2 colleagues who are developers. Since Kim is one of the developers, each sales colleague's 2 developer colleagues include Kim and one other developer.
So, each sales colleague has Kim and 1 other developer as colleagues. Since there are 2 sales colleagues, and each of them has 1 other developer colleague (in addition to Kim), that means there is a total of 1 other developer colleague (not Kim).
Therefore, Kim has 1 colleague who is a developer.

Llama 3.1 405B

Since each sales person has two colleagues who are developers, and Kim has two colleagues working in sales, that means there are 2 x 2 = 4 developer colleagues that the sales people have.
However, Kim is also a developer, and she is not a colleague of herself. Therefore, we need to subtract 1 from the total count (because one of the developer colleagues is Kim herself).
So, the number of colleagues who are developers that Kim has is 4 - 1 = 3, plus the fact that we already know that Kim is a developer and she has 2 sales colleagues who are not developers but they each have 2 developer colleagues, but one of them is Kim, so that leaves 3 other developer colleagues, but that was from the perspective of the sales colleagues, from Kim's perspective, she would have those 3 plus the original 2 that the sales colleagues were not counting, which is one of them, because the other one is herself, so that's 3 + 1, no 3 + 2 no 3 plus 1 which is 4 no 3 + 1 - 1 is 3, that would be if you were counting Kim, so it's 3 + 1 which is 4.

For starters, the right answer is that Kim has one colleague who is a developer. In this tiny experiment, the 70B model is in fact the only model that gets the riddle right. The 8B model clearly contradicts itself: it first states that ”each sales person has 2 colleagues who are developers” and then, in the very next sentence, that ”each sales person has 2 * 2 = 4 colleagues who are developers”, ending up completely wrong. The 405B model gets the important fact that Kim ”is not a colleague of herself”, but overthinks the logic in the last reasoning step and gets totally lost in mathematical nonsense. It should be noted that for a fair comparison on this particular prompt, each model should be evaluated multiple times and its final answers aggregated, to account for variability in responses.
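One simple aggregation scheme is majority voting over repeated samples. The sketch below is purely illustrative: the list of answers stands in for final answers extracted from several independent, temperature-sampled runs of the same prompt.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across repeated samples."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers from 5 independent samples of the riddle
# (with temperature > 0, the model's answers vary between runs).
samples = ["1", "8", "1", "1", "4"]
print(majority_vote(samples))  # → "1"
```

With enough samples, this kind of self-consistency voting tends to be more robust than judging a model on a single completion.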

In standardized reasoning benchmarks, such as the ARC Challenge, the 405B model does however outperform the 70B model, and it even outperforms GPT-4o and Claude 3.5 Sonnet. In the next section we will take a closer look at the benchmark results.

It is also worth noting that in my small experiment, the 8B model processed (input and output) on average 110 tokens per second, the 70B model 40 tokens per second, and the 405B model 22 tokens per second. The 8B model is therefore preferred if token processing speed is a priority.

The models were deployed in Azure AI Studio.
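Those throughput numbers translate directly into wall-clock time for a workload of a given size. A quick back-of-the-envelope calculation (the rates are my measurements above; the workload size of 1,000 tokens is arbitrary):

```python
# Average processing rates (tokens per second) measured in my experiment.
RATES_TPS = {"8B": 110, "70B": 40, "405B": 22}

def seconds_for(n_tokens: int, tps: float) -> float:
    """Wall-clock seconds to process n_tokens at a steady rate of tps."""
    return n_tokens / tps

for model, tps in RATES_TPS.items():
    print(f"Llama 3.1 {model}: {seconds_for(1000, tps):.1f} s per 1,000 tokens")
```

At these rates, the 405B model takes roughly five times longer than the 8B model for the same workload, which matters for latency-sensitive applications.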

Benchmarks

As the table demonstrates, Llama 3.1 405B is indeed on par with, or even surpasses, the strongest competitors in the field, and this holds true across all seven benchmark categories. One particularly spectacular result is in the long-context category, where the model’s performance on long-context tasks is evaluated: on two out of three benchmarks, Llama 3.1 405B obtains a higher score than all competitors. This is very big news for the AI community, signaling a significant shift towards open-source models that can compete head-to-head with their proprietary counterparts.

Now that we have seen that Llama 3.1 405B is truly a top-tier LLM (despite failing my riddle!), let’s look at some interesting findings from the technical report.

Llamas helping Llamas

One of the most interesting aspects of how Llama 3.1 405B was trained is how Meta’s researchers utilized the earlier Llama 2 model as part of the data-cleaning pipeline. In short, a quality classifier was trained on data annotated by Llama 2 against a set of quality requirements, and only data that passed the quality classifier was fed into the pre-training loop of Llama 3.1 405B. Apart from improving general token quality, a similar approach was applied to reasoning and coding data. In the future, we will undoubtedly see more examples of large language models aiding in the training of other large language models.
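A minimal sketch of this idea, with everything stubbed out: the toy heuristic below stands in for prompting Llama 2 for a quality judgment, and the "classifier" simply reuses the annotator's rule. In the actual pipeline, a fast dedicated classifier is trained on the LLM's labels so that the expensive LLM only needs to annotate a small training set.

```python
def llm_annotate(doc: str) -> int:
    """Stand-in for prompting an LLM (Llama 2 in Meta's pipeline) to judge
    document quality against a set of requirements. 1 = keep, 0 = drop.
    This toy heuristic is NOT the real judgment."""
    return int(len(doc.split()) > 3 and not doc.isupper())

def train_quality_classifier(labeled):
    """Stand-in for training a cheap classifier on the LLM-produced labels.
    Here it just memorizes the annotator's decision rule."""
    return llm_annotate

def filter_corpus(docs):
    """Annotate a sample, 'train' the classifier, then filter the corpus."""
    labeled = [(d, llm_annotate(d)) for d in docs]
    classifier = train_quality_classifier(labeled)
    return [d for d in docs if classifier(d) == 1]

corpus = [
    "A well-formed paragraph explaining transformer attention in detail.",
    "BUY NOW!!!",
    "click here",
]
print(filter_corpus(corpus))  # keeps only the first document
```

The key economic point is that the slow, expensive LLM labels only a fraction of the data, while the cheap classifier scales the judgment to the full pre-training corpus.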

Scaling laws

Another very interesting revelation in the paper is that Meta has developed a model which predicts performance on various benchmarks based on the amount of computational resources used during training, and they found that its predictions align very well with actual performance. In principle, this means that given a training budget, the capabilities of the model can be predicted before it is actually trained. This has positive implications for AI safety, as it gives developers a heads-up on what to expect from a model before committing the compute to train it.
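In the spirit of that approach, here is a toy power-law extrapolation with entirely made-up numbers: fit log-loss against log-FLOPs on small training runs, then extrapolate to the target compute budget. (The paper's actual method is a two-stage one that goes from compute to a loss proxy and from loss to downstream accuracy; this sketch only shows the first stage.)

```python
import math

# Hypothetical small-scale runs: (training FLOPs, proxy loss on a benchmark).
runs = [(1e21, 2.10), (1e22, 1.85), (1e23, 1.63)]

# Ordinary least squares in log-log space: log10(loss) = a + b * log10(FLOPs).
xs = [math.log10(c) for c, _ in runs]
ys = [math.log10(l) for _, l in runs]
n = len(runs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / n

# Extrapolate to a compute budget on the order of the 405B training run.
target_flops = 3.8e25
predicted = 10 ** (a + b * math.log10(target_flops))
print(f"Predicted loss at {target_flops:.1e} FLOPs: {predicted:.2f}")
```

A fit like this is only trustworthy if the power-law relationship holds well beyond the compute range of the small runs, which is exactly what Meta reports observing.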

It should be noted that many of the popular benchmarks used to evaluate Llama 3.1 are subject to contamination, meaning that at least part of a given benchmark has leaked into the training data. In an upcoming blog post we will discuss the need for uncontaminated benchmarks for the evaluation of LLM intelligence.
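A common (simplified) heuristic for detecting contamination is n-gram overlap between benchmark examples and training documents. The snippet below is a toy illustration of the idea; real contamination analyses scan the entire training corpus and tune the n-gram length and match tolerance.

```python
def ngrams(text: str, n: int = 8):
    """All n-token spans of a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(benchmark_example: str, training_doc: str, n: int = 8) -> bool:
    """Flag an example if it shares any n-token span with a training doc."""
    return bool(ngrams(benchmark_example, n) & ngrams(training_doc, n))

doc = "the quick brown fox jumps over the lazy dog near the river bank"
example = "quick brown fox jumps over the lazy dog near the river"
print(contaminated(example, doc))  # → True
```

A benchmark score on contaminated examples measures memorization rather than reasoning, which is why contamination-adjusted numbers (which the Llama 3.1 report also discusses) matter.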

A note on open source

While Llama 3.1 is open source in the sense that its weights are freely downloadable and its architecture and training are described in an extensive paper, it is not open source according to the definition provided by the Open Source Initiative: that would require Meta to share what data was used to train the model. In other words, even if you had the capacity to recreate Llama 3.1, you would not be able to, as the training data is unknown.

Malicious use

There is widespread and legitimate concern that releasing powerful LLMs to the public may threaten society in several ways, including through cyberattacks and the creation of biological weapons. Meta approached this concern with a small empirical study, in which 62 participants were asked to perform a cyberattack or create a biological weapon with or without the assistance of Llama 3.1 405B. Reassuringly, they found no significant uplift in these malicious use cases when using Llama 3.1 405B compared to using internet search alone.

Conclusions

The release of Llama 3.1 405B marks a significant milestone in the landscape of large language models, showcasing that open-source models can now rival proprietary alternatives like GPT-4o and Claude 3.5 Sonnet. As Meta's Llama 3.1 continues to demonstrate its strength across various benchmarks, it highlights the growing potential and influence of open-source models in the AI community.

This development raises important questions about the future dynamics between open-source and proprietary models. How will the leading players like Google, OpenAI, and Anthropic react to this challenge? And will Llama 3.1’s accessibility and competitive performance drive a shift in the adoption of open-source AI tools?

While we're still in the early stages of fully understanding the capabilities and implications of foundational models, it's clear that Llama 3.1 will play a pivotal role in shaping the future of AI. The model's impact will likely be felt for years to come as the boundaries of artificial intelligence continue to expand.

Learn more