Apr 2024 | Llama 3 Opens the Second Chapter of the Game of Scale
Yao Fu | Website | Blog | Twitter / X
University of Edinburgh | yao.fu@ed.ac.uk
Released on Apr 22 2024
The journey of this blog series tells the story of the game of scale. We started from a thorough discussion of the evolution of GPT in 2022, then covered why complex reasoning is the core capability (May 2023), how to do instruction tuning (Jun 2023), and how to deploy models efficiently (Dec 2023). Now that Llama 3 is released, the community can conclude the first chapter of the game of scale, whose goal is to build GPT-4 level models, and start a new chapter on multimodal models.
 
💡 Key takeaways
  • The scaling of text data is likely reaching a ceiling, as most of the easy web text (Common Crawl, GitHub, arXiv, etc.) has been used up.
  • There will surely be new text data from digging harder into the internet, scanning library books, and synthesizing data. Yet gaining another order of magnitude is quite challenging; more likely, these sources are incremental within the current order.
  • The next chapter of the game starts with multimodality, particularly unified video-language generative models, because only video data offers an order-of-magnitude increase.
  • However, the bad news is that video data seems unable to improve the model's reasoning capability; recall that reasoning is the number-one capability that marks strong models.
  • The good news is that video data improves everything else, particularly grounding in the real world, and shows strong potential to become a neural world model (instead of a hard-coded physics engine, as in Zelda), which opens the possibility of learning from simulated physical feedback.
  • Scaling up reinforcement learning from X feedback seems the most promising direction for continuing to improve the model's reasoning capability, where X stands for human, AI, and environment feedback.
  • Just as AlphaGo Zero achieved super-human performance on Go, self-play and interaction with the environment could be a direction for super-human generative models. Making the model learn online and iteratively from feedback (instead of a single round of offline optimization) could lead to continuously improving reasoning capability.
  • The first chapter of the game of scale focused on scaling text data, which peaks at GPT-4 and is concluded by Llama 3. The second chapter will be unified video-language generative modeling and iterative reinforcement learning from X feedback.
 
 
Disclaimer: This article is essentially a quick personal research note on future work, written after reading the Llama 3 release notes. The opinions presented here may differ from existing beliefs. I welcome criticism and contradictory opinions. You can comment directly on this document, message me on X, or send me an email for detailed discussion.
 

1 - How good is Llama 3?

Pretty good.
For the base model, we check MMLU, MATH, GPQA, and BBH as key metrics because they measure advanced knowledge and reasoning. The leaderboard looks like this:
| Model | #params | MMLU | MATH | GPQA | BBH |
| --- | --- | --- | --- | --- | --- |
| Claude 3 Opus | ? | 86.8 | 61.0 | 50.4 | 86.8 |
| GPT-4 Turbo 0409 | ? | 86.5 | 72.2 | 49.1 | ? |
| GPT-4 initial release | ? | 86.4 | 52.9 | 35.7 | 83.1 |
| Llama 3 400B (still training) | 400B | 84.8 | ? | ? | 85.3 |
| Gemini 1.0 Ultra | ? | 83.7 | 53.2 | 35.7 | 83.6 |
| Reka Core v0.5 | ? | 83.2 | ? | 38.2 | ? |
| Llama 3 70B | 70B | 82.0 | 50.4 | 39.5 | ? |
| Claude 3 Sonnet | ? | 79.0 | 40.5 | 40.4 | 82.9 |
| Mistral 8x22B | 8x22B | 77.7 | ? | ? | ? |
| Qwen 1.5 72B Chat | 72B | 77.1 | 45.2 | 29.3 | 75.7 |
| Reka Flash v1.5 | 21B | 75.9 | ? | 34.0 | ? |
| Cohere Command R+ | 104B | 75.7 | ? | ? | ? |
| Gemini 1.0 Pro | ? | 71.8 | 32.6 | 27.9 | 75.0 |
| DeepSeek 67B | 67B | 71.3 | 18.7 | ? | 68.7 |
| Mistral 8x7B | 8x7B | 70.6 | ? | ? | ? |
One exceptional thing about Llama 3 70B is that its performance is much better than its peer 70B-level models (whose MMLU is typically around 70+) and enters the frontier-model regime (80+ MMLU).
There could be two reasons for Llama 3 70B achieving such a strong MMLU:
 
The chat version also looks decent on the Chatbot Arena:
[figure: Chatbot Arena leaderboard]
But note that there was a clear score-boosting pattern after the initial release of Llama 3 (it is not hard to tell which answer comes from Llama 3 based on its textual patterns), which resulted in roughly rank 3 initially; the Elo has gradually declined since. Note also that its confidence interval (+9/-11) is much larger than that of other models (+5/-5), so its rank may continue to drop.
[figure: Llama 3's initial leaderboard climb with few votes and high variance]
Honestly, such performance decoration and score boosting is completely unnecessary: Llama 3 is already a pretty good model. Doing so may (or may not) improve its reputation among the general public, but it definitely harms its reputation among professionals. Again, it is already the best publicly accessible model.
My bet is that its Elo eventually converges to about GPT-4 0314's 1180, roughly Claude 3 Haiku performance (again, pretty good).
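For context on what these Elo numbers mean, the standard Elo model (a property of the rating system itself, not of the Arena's exact fitting procedure) maps a rating difference to an expected win rate:

$$P(\text{A beats B}) = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

Under this formula, a 70-point gap corresponds to roughly a 40/60 head-to-head split, so the Elo differences among frontier models are real but not dramatic.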

2 - The limit of scaling text data

is probably already here: GPT-4 Turbo, Gemini Ultra, Claude 3 Opus, and Llama 3 400B all score in about the same range (MMLU around 85). To continue scaling up text we need more data, and the question is whether it is possible to substantially increase the amount of text data beyond Llama 3's 15T tokens.
There are the following directions, ranked by the potential scale of new data:
  • CC is only part of the internet
  • We are still digging from CC
  • Relaxing the filtering and deduplication thresholds
  • Using synthetic data
  • Scanning more books from libraries
We discuss them one by one.
 
CC is only part of the internet
  • This is the largest undetermined factor in text scaling: we do not know how large the actual internet is.
  • Companies like Microsoft, Google, and Meta, which can readily dump a much larger portion of the internet beyond CC, can still dig up more data.
  • But the problem is how many tokens would remain after deduplication and quality filtering.
 
We are still digging from CC
  • The problem with this approach is that the final number of tokens we can extract from the existing CC dumps is upper-bounded by the data processing pipeline, and is unlikely to change in order of magnitude.
  • New CC dumps grow linearly with time, again with no change in order of magnitude.
  • Yet the scaling law says exponentially more data yields linearly better performance. So eventually we may get another 5T new tokens on top of Llama 3's 15T, while what we truly want is another 50T tokens. The rough power-law sketch below shows why the difference matters.
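To see how unforgiving this is, here is a minimal sketch of the data term of a Chinchilla-style loss curve. The constants are the published Chinchilla fits (Hoffmann et al., 2022), used purely as an assumption about the shape of the curve, not as a claim about Llama 3's actual loss:

```python
# Data term of a Chinchilla-style scaling law: L(D) = E + B / D**beta.
# E, B, beta are the published Chinchilla fits, used here as illustrative
# assumptions about the curve's shape.
E, B, beta = 1.69, 410.7, 0.28

def loss(tokens_trillion: float) -> float:
    D = tokens_trillion * 1e12   # convert trillions of tokens to raw tokens
    return E + B / D ** beta

for d in [15, 20, 50, 150]:
    print(f"{d:>4}T tokens -> loss {loss(d):.4f}")
```

Under these assumptions, going from 15T to 20T tokens barely moves the loss; only a roughly 10x jump in data changes the picture, which is why another 5T tokens is not the real prize.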
 
Relaxing the filtering and deduplication thresholds
  • Raw data is large; we do not use all of it because of data quality and duplication. The Baichuan tech report has a nice figure on how filtering influences the final token count:
[figure: Baichuan tech report, influence of filtering on the final token count]
 
Using synthetic data
  • Recently, Liu et al. gave a nice summary of synthetic data, highlighting data sources for reasoning, tool use, multimodality, multilinguality, and alignment.
  • The key challenge remains: most existing synthetic-data work does not change the order of magnitude, so it is mostly used for continued pretraining and finetuning, not for pretraining from scratch.
  • The only exception is the Phi model series, which uses GPT-4-produced data to train much smaller models. The open question is whether this approach can scale to larger models and break the GPT-4 upper bound.
 
Scanning more books from libraries
  • These are clearly promising directions, because library books are of extremely high quality (much higher than the web) and can clearly improve professional knowledge benchmarks like MMLU. Below is a list of the largest libraries in the world:
[figure: list of the largest libraries in the world]
  • The problem is not on the tech side: buying the copyrights from these libraries may simply exhaust all the AI investment, and a large portion of the books are not for sale. And again, if a book averages 70K tokens, then 200 million books only amount to about 14T tokens. That doubles the existing number, but not by much more (a quick sanity check of this arithmetic follows).
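A one-line sanity check of that estimate, where both inputs are rough assumptions:

```python
tokens_per_book = 70_000        # assumed average tokens per book
num_books = 200_000_000         # rough combined holdings of the largest libraries
print(tokens_per_book * num_books / 1e12, "T tokens")  # -> 14.0 T tokens
```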

3 - Scaling always wins, but what to scale next?

So far we have discussed how GPT-4 level frontier models are likely approaching the ceiling of text scaling, and how further scaling of text data faces much harder (though not insurmountable) challenges. We surely want the carnival to continue, because scaling is a law. Scaling always wins; the question is what to scale next.
 
Video data may not improve reasoning, but it improves everything else
  • A clear direction is multimodal data, particularly video data. YouTube and TikTok are presumably orders of magnitude larger than text; yes, that is where the new order of magnitude comes from. But there is an immediate challenge: does multimodal data improve text-based reasoning?
  • The answer is hard, and probably no. Then there is a realistic follow-up question: if OpenAI releases GPT-5 next month with its MMMU score increased from 56 to 70 but MMLU staying at 86, what message would that send, and how would the public react?
[figure: MMMU leaderboard screenshot]
  • Yet the good news is that even if video data does not improve reasoning, it improves everything else, particularly grounding, thus enabling the model to receive feedback from the real world.
 
To improve reasoning, one needs to scale up exploration and exploitation in reinforcement learning
  • Specifically, one may need to scale:
    • The horizon of the model's exploration, e.g., deploying the model online for a year and updating it weekly, instead of single-step optimization.
    • The search space, e.g., letting the model generate one million responses and picking the best of them, instead of InstructGPT's original best-of-7 approach.
    • The source of feedback, especially gradually moving from human feedback (because human feedback does not scale, and the model is becoming stronger than its human annotators) to AI and environment feedback (hence the requirement for a world model).
  • A very unfortunate fact is that much existing research looks into tiny details of small-scale, single-round optimization, e.g., adding one loss term to DPO. Yet the key here is online, iterative, large-scale exploration and exploitation.
 

4 - Scaling unified video-language generative models

So we just scale up video-language, right? Doesn't sound too hard?
The current situation is: unlike text scaling, where we have a fairly standard architecture (the MoE transformer), a standard objective (next-word prediction), and a standard pipeline (pretraining then alignment), the design space of vision / multimodal generative models is much larger than that of language models, and we have not yet converged on even some basic questions, for example:
  • Should we train on separate modalities first and then use adapters to bridge them, as current practice like LLaVA does, or should we directly train on a mixture of all modalities?
  • Should we use a unified transformer backbone, or CV-specific components like UNets and CNNs for the image / video parts? What modifications should we make to the transformer architecture (e.g., 3D positional encoding)? How do we make the best use of mixture-of-experts layers?
  • Adding new modalities should at least not harm existing modalities, yet it is commonly observed that adding vision can negatively influence language. How do we reconcile the contradiction between different modalities?
  • For video understanding, how should we do tokenization / representation learning? Should one consider VQ-VAE-styled discrete tokens or Sora-styled continuous space-time patches (a toy patchification sketch follows this list)? Should one use a contrastive objective like CLIP or a reconstruction objective like the original VAE?
  • For video generation, should it be autoregressive like VideoPoet or diffusion-based like Sora? How do we train a transformer that can simultaneously perform diffusion-styled and autoregressive generation?
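To make the space-time patch option concrete, here is a toy patchification sketch in NumPy. The patch sizes and input shape are illustrative assumptions; this shows one design option, not Sora's actual (undisclosed) implementation:

```python
import numpy as np

def spacetime_patches(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16):
    """Cut a (T, H, W, C) clip into pt x ph x pw space-time patches and
    flatten each patch into one token vector. Assumes divisible shapes."""
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch dims together
    return v.reshape(-1, pt * ph * pw * C)    # (num_tokens, token_dim)

tokens = spacetime_patches(np.zeros((16, 224, 224, 3)))
print(tokens.shape)  # (784, 3072): 4 * 14 * 14 tokens, each of dimension 3072
```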
The final solution may turn out quite simple, modifying only small parts of existing approaches, but to identify these small yet crucial modifications, the community needs a saturation attack on these problems.

5 - AlphaZero-styled Agents through iterative reinforcement learning from X feedback

Having discussed that there may be limited new data for pretraining and that multimodal data may not improve reasoning, we turn our focus to scaling reinforcement learning in order to further improve reasoning capability (which is, after all, the core capability of language models).
The problem is, again, what to scale; the good news is that basically any dimension of RL can and should be scaled. We first discuss a particular metric: pass@K, i.e., whether the model can succeed at least once within K trials. DPO basically optimizes pass@2 (choose a good response, reject a bad one), and the InstructGPT baseline is pass@7 (choose the best of 7 candidates).
What if we scale K and, instead of considering pass@7, we consider pass@1 million? The standard way to estimate this metric is sketched below.
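For concreteness, pass@K is usually estimated with the unbiased estimator from the Codex paper (Chen et al., 2021), which avoids enumerating all K-sized subsets of the samples. The numbers in the usage example are assumed, for illustration only:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn without replacement from n generated
    samples is correct, given that c of the n samples are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Assumed toy numbers: 10,000 samples per problem, 30 of them correct.
for k in [1, 7, 100, 10_000]:
    print(f"pass@{k} = {pass_at_k(10_000, 30, k):.4f}")
```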
From the AlphaCode paper, we can see how the model's pass rate continuously improves as K scales:
[figure: AlphaCode solve rate as the number of samples K scales]
Yuxuan Tong verified on MATH how DeepSeek and Mistral continuously improve when the search space K is scaled:
[figure: pass@K on MATH for DeepSeek and Mistral as K scales]
Clearly the curves have not yet saturated.
 
One immediate question is: how do we pick the best one out of one million candidates? Let's look at GPT-4's approach by tracking its MATH performance improvements from Mar 2023 to Apr 2024:
| Base model | Research work | Score | Source of improvement |
| --- | --- | --- | --- |
| GPT-4 initial release | | 42.5 | Scaling from GPT-3.5 |
| | | 53.9 | Improving data complexity |
| | | 73.5 | Code-based verification |
| | | 78.2 | Process-based feedback |
| | | 84.3 | Code-based verification + search and voting |
| GPT-4 Turbo 0409 | | 72.2 | ? |
These improvements come from:
  • Using code-based feedback to verify the answer
  • Using a process-based reward model to verify the answer
  • Using expert-level annotation to produce the feedback
A minimal sketch combining these feedback sources follows.
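To show how these feedback sources compose, here is a minimal best-of-N sketch. The generate, run_tests, and score arguments are hypothetical interfaces standing in for a sampler, a code-based verifier, and a process reward model; candidates are assumed to be (final_answer, solution) pairs:

```python
from collections import Counter

def best_of_n(problem, generate, run_tests, score, n=1_000_000):
    """Sample n candidates, keep those passing a code-based verifier, then
    vote over final answers weighted by a process reward model's score."""
    candidates = [generate(problem) for _ in range(n)]
    verified = [c for c in candidates if run_tests(problem, c)]  # code-based feedback
    pool = verified or candidates       # fall back if nothing passes the tests
    votes = Counter()
    for final_answer, solution in pool:
        votes[final_answer] += score(problem, solution)  # process-based feedback
    return votes.most_common(1)[0][0]
```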
 
It is also important to note that these improvements come not from one-shot optimization but iteratively, through multiple rounds of optimization, which Anthropic refers to as online iterative RLHF:
[figure: Claude-1's online iterative RLHF]
The effectiveness of iterative improvement is also verified by the practice of Llama 2:
[figure: Llama 2's iterative improvements over multiple model versions]
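Schematically, the loop both reports describe looks like the sketch below. This is an outline only; the function arguments are hypothetical placeholders for deployment-time data collection, reward-model fitting, and the policy-optimization step:

```python
def iterative_rlhf(policy, collect_prompts, collect_preferences,
                   fit_reward_model, optimize, rounds=5):
    """Online iterative RLHF outline: each round deploys the current policy,
    gathers fresh preference data on it, refits the reward model, and
    re-optimizes the policy (e.g., via PPO or rejection-sampling finetuning)."""
    for _ in range(rounds):
        prompts = collect_prompts(policy)                 # prompts from live traffic
        prefs = collect_preferences(policy, prompts)      # human / AI comparisons
        reward_model = fit_reward_model(prefs)            # RM tracks the newest policy
        policy = optimize(policy, reward_model, prompts)  # policy improvement step
    return policy
```

The key design point, versus single-round methods like vanilla DPO, is that preference data is always collected on the latest policy, so the reward model never drifts too far off-distribution.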
 

6 - Conclusion: the second chapter of the game of scale

Actually, the fact that humanity is approaching the limit of text data should have been realized by OpenAI as early as mid-2022, when they finished training the initial version of GPT-4. Now, in Apr 2024, with the release of Llama 3, it is time to conclude the first chapter of the game of scale, in which most frontier models have reached GPT-4 parity.
In 2023, the race for multimodal generative models already began with image capabilities. Currently, only Gemini and Reka can understand video (but cannot generate it), and Sora seems to be the only model capable of generating minute-long video (though video only). So far, only GPT-4 Turbo, AlphaCode, and DeepSeek Math have explored how to scale up the search space and the feedback signal, and only GPT-4 / Claude have reported extensive results on online iterative RLHF.
The second chapter of the game of scale begins now.
 
 
