June 2023, A Stage Review of Instruction Tuning
Yao Fu | Website | Blog
University of Edinburgh | yao.fu@ed.ac.uk
June 29th 2023
Following the great success of ChatGPT, the release of LLaMA on February 24, 2023 heated up the direction of instruction tuning. On March 18, Alpaca demonstrated the potential of distilling smaller models from mature ones to become decent chatbots, triggering a Cambrian explosion of LLaMA-based models. However, just three months later, people began to recognize the various problems of training LLaMA on ChatGPT's data. This article reviews the development of LLaMA-based models over the past three months and discusses the next challenges of instruction tuning.
Disclaimer: This article is essentially a quick research memo, edited from the outline of my recent presentation, with some cuts and additions. There are currently many things about building LLMs that the open-source community is still unclear on. I have tried my best to ensure that the content I refer to or discuss has solid evidence, rather than being based on rumors; much of it comes from direct discussions with the original authors of the corresponding papers. Even so, my take may still be very wrong and there may be many unresolved issues, so please feel free to comment directly beside the article and participate actively in the discussion — I will keep all the comments that point out my errors. The truth becomes clearer with debate.
 

1 - The origin

The first three papers:
  • Natural Instructions v1: Cross-task generalization via natural language crowdsourcing instructions
    • The very beginning. Initially released in April 2021. Two years ahead of LLaMA. Extremely visionary!!!
  • FLANv1: Finetuned Language Models Are Zero-Shot Learners
  • T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
  • InstructGPT: Training language models to follow instructions with human feedback
Comparisons:
  • The goal of InstructGPT is alignment, with zero-shot and cross-lingual being side effects.
    • This paper used a 7B Reward model paired with a 175B Policy model, a setup later followed by DeepSpeed Chat and a series of subsequent open-source RL works. This approach is probably incorrect.
    • The more sensible approach is to scale up the Reward model relative to the Policy model, as suggested by Scaling Laws for Reward Model Overoptimization — that is, reverse the sizes of the two models and use a 175B Reward model to PPO a 7B policy.
    • For a model to be deployed online, 10 - 50B is a more affordable scale; anything larger would be too costly to serve.
  • The goals of FLANv1 and T0 are zero-shot, so they are not for alignment.
Then there's Self-instruct:
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions
Important points about self-instruct:
  • The base model can be arbitrary (any pretrained checkpoint); it doesn't need to be a model (ChatGPT) that has undergone alignment.
  • It reproduces the process from the first generation Davinci to text-Davinci-001 — incredibly insightful!!
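To make the idea concrete, here is a minimal sketch of the Self-Instruct bootstrap loop in Python — not the authors' code; `complete()` stands for a hypothetical call to a raw pretrained checkpoint, and the real pipeline additionally filters candidates by ROUGE-L similarity and handles classification tasks separately:

```python
# Minimal sketch of the Self-Instruct bootstrap loop (not the authors' code).
# `complete(prompt)` is a hypothetical call to a raw pretrained checkpoint.
import random

def self_instruct(seed_tasks, complete, n_rounds=100, per_round=8):
    pool = list(seed_tasks)  # ~175 human-written seed instructions in the paper
    for _ in range(n_rounds):
        # 1. Sample a few in-context demonstrations from the current pool.
        demos = random.sample(pool, k=min(6, len(pool)))
        prompt = ("Come up with a new task instruction.\n"
                  + "\n".join(f"Task: {d}" for d in demos) + "\nTask:")
        # 2. Let the base model propose new instructions.
        new_instructions = [complete(prompt) for _ in range(per_round)]
        # 3. Keep only sufficiently novel ones (the paper uses ROUGE-L < 0.7).
        for inst in new_instructions:
            if is_novel(inst, pool):
                pool.append(inst)
    # 4. For each instruction, the same model then produces input/output pairs,
    #    and the checkpoint is fine-tuned on the resulting (instruction, output) data.
    return pool

def is_novel(inst, pool):
    # Placeholder novelty check; the paper uses ROUGE-L similarity against the pool.
    return all(inst.strip() != p.strip() for p in pool)
```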
Then comes FLANv2 — very important, I may have read it more than ten times and suggest just memorizing the entire content.
  • Two papers:
    • Scaling Instruction-Finetuned Language Models. This focuses more on generalization, which, as of Jun 2023, should be well-known by the community
    • The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. This focuses on data engineering and mixture methods, which, as of Jun 2023, is still under active investigation.
  • Quick takes:
    • The Flan Collection dataset improves almost all aspects of model capability except for human preference (because humans prefer long responses and FLAN's are short), which we will discuss later.
    • Human preference indeed leans towards talkative models, but models that do the work are not necessarily talkative. FLAN is productive but does not talk much, much like a programmer.

2 - After LLaMA

  • Alpaca: the work that started it all, but the model itself may not be very strong.
  • Vicuna
    • Among open-source models, it does well in dialogue: the format conforms to human preferences, it generates a lot of content, and its responses contain many unique tokens.
    • In automatic evaluation, in-context learning / reasoning / knowledge might be suboptimal (reflected in MMLU, BBH scores). This doesn't mean it's bad, but it could be better.
    • It's unclear whether GPT-4 evaluation works well or not. The LMSys team says it does, provided the prompt engineering is done well enough; see Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
    • Additionally, the LMSys team is very strong in terms of efficiency. See the model serving in the vLLM project, which may be the fastest in open-source.
  • Then there is a series of models that use GPT-4 as a judge and claim to have reached x% of GPT-3.5's level; these are not recommended because this kind of eval is unreliable.
  • However, there are several works that didn't rely on ChatGPT during alignment. These works are recommended, including:
    • LIMA: Less Is More for Alignment — Pay attention to their method of selecting data. It's recommended to spend an hour reading their data to understand what good SFT data may look like.
    • Dromedary: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision — Pay attention to their method of prompt engineering. This is basically a LLaMA version of the SFT stage of Constitutional AI.
  • Then, some papers (finally) start analyzing the data mixture of instruction tuning.
    • Tulu: How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
      • The results are very mixed, so it's impossible to conclude which mixture is good.
      • Yet data from classical NLP benchmarks may not be good (for human preference).

3 - How to do eval

Firstly, do not compute average scores over a bunch of benchmarks, and especially do not average the scores on the original GPT-3 test tasks, because every model comes out looking about the same on average. It is recommended to focus only on core benchmarks with differentiability.
Examples of benchmarks with less differentiability:
  • Original GPT-3 eval: there are too many benchmarks where good and bad models perform similarly; averaging over them masks the truly differentiating benchmarks.
    • In fact, if MMLU and MATH are included, they are very likely to be averaged out with other datasets.
  • Summarization + ROUGE / Translation + BLEU:
    • The differences in ROUGE and BLEU scores between stronger and weaker models are only four to five points — too small to differentiate — whereas differences in accuracy between stronger and weaker models can span from 10 to 90 points, which is large enough.
    • ROUGE is not completely aligned with human preference — and note that BLEU is not either.
So, which benchmarks are recommended for evaluating pretrained checkpoints?
  • Again, pay attention to differentiability: the relative strength of models should be apparent at a glance.
  • Evaluate capability directions separately; for now, they can be divided into:
    • English knowledge — MMLU
    • Chinese knowledge — C-Eval
    • Reasoning — GSM8k / BBH
    • Coding — HumanEval / MBPP
  • After balancing the above four items, the following can be pursued:
    • MATH: high-difficulty reasoning
    • Dialog: this might only be achievable through human eval, automatic eval cannot handle it.
 
Next, let's talk about Automatic Evaluation.
Automatic Evaluation — suitable for pretrained checkpoints — the practice at chain-of-thought-hub can generally be followed (shameless self-promotion).
  • Knowledge: MMLU
    • This dataset is quite stable, especially for models larger than LLaMA 7B; there is basically no prompt-sensitivity issue — just use the original official prompt. The smaller the model, the more sensitive the score.
  • Reasoning:
    • GSM8k: also relatively stable, but pay attention to the answer extraction rate; if it's below 90%, add more regular expressions (a minimal extraction sketch follows this list).
    • BBH - Algorithmic:
      • Not very stable, pay attention to the rate of answer extraction.
    • BBH - Language:
      • Not very stable; pay attention to the answer extraction rate — Chain-of-thought Hub will soon release an analysis of how sensitive the results are to the extraction rate; the conclusion is that BBH is quite sensitive.
    • Currently, aside from increasing the model size and using FLAN, it's not clear what operations can increase the scores on the BBH dataset.
  • Coding:
    • HumanEval / MBPP: seems stable, but pay attention to using the unbiased pass@k estimator (see the sketch after this list).
  • Look at the above datasets first; once your scores match LLaMA's, then look at MATH.
  • MATH:
    • Super hard, GPT-4's scores:
      • naive prompting: 42
      • → complexity based prompting: 50 https://openreview.net/forum?id=yf1icZHC-l9
      • → progressive hint prompting: 53 https://arxiv.org/abs/2304.09797
      • → majority voting over 1.8k: 69.6
      • → best of n with outcome based reward modeling: 72.4
      • → best of n with process-based reward modeling: 78.2 (note that process-based and outcome-based reward use different set of tuning data)
      • → PPO + process-based reward modeling = ? Guess it would go up to 90
      • Generalization? — Should be quite strong. Generally speaking, generalization is positively related to the base model size, negatively related to the total amount of single-direction SFT data, and positively related to the diversity of SFT data.
    • If it's not GPT-4
      • Minerva / PaLM-2: 34.3
      • Galactica: 33.6 — this paper did a good job, yet because of its hallucinations it was heavily criticized and taken offline, leading to its importance being seriously underestimated.
        • 88B papers + 7B code + 7B encyclopedias, textbooks and educational material + 2B KB + 1B CC + 0.4B prompt / instruction, × 4 epochs — note that code and papers can go through multiple epochs (unlike webpages).
      • LLaMA 65B: 10.6
      • Others: Less than 10 points.
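The extraction sketch referenced in the GSM8k item above — a minimal illustration with made-up regex patterns, not the exact ones used in Chain-of-thought Hub:

```python
# Minimal sketch of GSM8k-style answer extraction (illustrative regexes,
# not the exact ones used in Chain-of-thought Hub).
import re

PATTERNS = [
    r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)",  # "The answer is 42"
    r"=\s*\$?(-?[\d,]+(?:\.\d+)?)\s*\.?\s*$",        # trailing "= 42."
]

def extract_answer(completion: str):
    for pat in PATTERNS:
        m = re.search(pat, completion)
        if m:
            return m.group(1).replace(",", "")
    return None  # extraction failed

def extraction_rate(completions):
    hits = sum(extract_answer(c) is not None for c in completions)
    return hits / max(len(completions), 1)

# If extraction_rate(...) is below ~0.9, the accuracy number is not trustworthy:
# add more patterns before comparing models.
```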
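And the unbiased pass@k estimator mentioned in the HumanEval / MBPP item, a direct transcription of the standard formula from the Codex paper: with n samples per problem and c of them correct, pass@k = 1 − C(n−c, k)/C(n, k).

```python
# Unbiased pass@k estimator (Chen et al., 2021): with n samples per problem
# and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 13 of them correct
print(pass_at_k(200, 13, 1), pass_at_k(200, 13, 10), pass_at_k(200, 13, 100))
```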
 
For a model that has already been fine-tuned into a chatbot:
  • First, go through the above benchmarks in a few-shot setting to ensure that you don't drop points.
    • If it's just fine-tuning using dialog data, it might harm existing capabilities (MMLU / BBH).
    • If points are dropped, consider LM mixing / FLANv2 mixing.
    • Note that few-shot prompting for chatbots should use the dialog format: if too many in-context examples are stuffed into a single turn, the model's instruction-following may not be strong enough to handle them. See CoT Hub's standard prompt library (a minimal sketch follows this list).
  • Then it's time to evaluate user preferences, which can only be done by humans at this point.
    • If there is a large, already trained reward model, it can be used to evaluate online small / medium models, which is actually not much different from human evaluation.
    • For a very large Policy Model:
      • Online iterative RLHF requires expert evaluation in the early stages, no matter what.
      • Later stages require expert evaluation with AI assistance.
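The dialog-format few-shot prompting mentioned above, as a minimal sketch assuming an OpenAI-style chat message list (the field names and examples are illustrative, not tied to any particular serving stack): each in-context example becomes its own user/assistant turn instead of being packed into one long prompt.

```python
# Sketch of dialog-format few-shot prompting for a chatbot-style model:
# each exemplar becomes its own user/assistant turn rather than one long prompt.
few_shot_examples = [
    ("Q: Natalia sold clips to 48 of her friends ... How many clips in total?",
     "Let's think step by step. ... The answer is 72."),
    ("Q: Weng earns $12 an hour for babysitting ... How much did she earn?",
     "Let's think step by step. ... The answer is 10."),
]

def build_chat_messages(question: str):
    messages = [{"role": "system", "content": "You are a careful math assistant."}]
    for q, a in few_shot_examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages
```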
 
Yes, slightly weaker models can be used as evaluators, but one needs to pay attention to the difficulty and distribution of the queries, as well as to the prompt engineering (a sketch of a pairwise judge setup follows below).
  • Without prompt engineering, it definitely won't work due to various biases.
  • If the difficulty of the queries is not enough, and there's not enough diversity, it may not work either.
  • If the difficulty of the queries is sufficient and you have undergone extensive prompt engineering, it may work for evaluations related to information seeking. You can refer to this paper for more details.
  • But for reasoning-related or other non-information-seeking evaluations (like TL;DR summarization), it may still not work.
  • For queries related to information seeking, the evaluation may be biased towards long responses.
The more unique tokens a response contains, the more GPT-4 prefers it.
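For concreteness, here is a rough sketch of a pairwise judge setup in the spirit of MT-Bench — not their exact template; `judge()` is a hypothetical call to a strong model, and the position swap is one way to counter the position bias mentioned above (length bias still has to be watched separately):

```python
# Sketch of a pairwise LLM-as-judge setup (in the spirit of MT-Bench,
# not the exact template). `judge(prompt)` is a hypothetical call to a strong model.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
user question below. Judge helpfulness, correctness, and depth; do not let
response length or the order of presentation influence your decision.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Output exactly one of: "A", "B", or "tie"."""

def pairwise_judge(judge, question, answer_a, answer_b):
    # Query twice with positions swapped to mitigate position bias.
    v1 = judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b))
    v2 = judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_b, answer_b=answer_a))
    # Only count a win if both orderings agree; otherwise call it a tie.
    if v1.strip() == "A" and v2.strip() == "B":
        return "A"
    if v1.strip() == "B" and v2.strip() == "A":
        return "B"
    return "tie"
```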

4 - FLANv2 and long context

FLANv2 is indeed an intriguing dataset. It incorporates a variety of capabilities except for user preference.
  • Pay attention to CoT (Chain of Thought) prompting
    • It only becomes better than Direct prompting after 62B.
    • This observation covers only reasoning (BBH), not knowledge (MMLU).
  • The improvements brought by FLANv2 include:
    • Knowledge (MMLU)
    • Reasoning (BBH)
    • Multilingual capabilities (TyDiQA / MGSM)
    • Please note that the authors of FLAN have verified that there's no data leakage.
  • The content above applies to both in-context learning and zero-shot learning.
  • However, FLAN's responses are short, hence it does not include user preference — the personality of FLAN is like a nerdy geek, capable and effective, but not very talkative / attractive.
 
Please distinguish between data leakage and in-distribution generalization:
  • If the test set of a dataset is used for model training, it's called data leakage, and the model's score will be significantly high and untrustworthy.
  • If the training set of a dataset is used for model training, it's called in-distribution generalization, and the model's score is trustworthy.
  • Some datasets have low in-distribution generalization difficulty, such as MMLU / C-Eval, and merely scaling up the data can improve scores.
  • For some datasets, even if the model has seen the training set, it may still not perform well on the test set unless the model is powerful enough — GSM8K is one such case. These are high-quality evaluation datasets.
  • The difficulty of coding tasks may lie between MMLU and GSM8k; in-distribution generalization is not as difficult as GSM8K, but it's not simple either.
 
Furthermore, according to the ZeroSCROLLS leaderboard, FLAN can further improve long-context reasoning. No clue why.
Note that the only difference between FlanT5 and T0pp is their instruction tuning data. Yet FlanT5, relying solely on T5's relative positional encoding to naively scale to an 8k context length, significantly outperforms T0pp.
Perhaps for long context, data engineering is as important as neural architecture engineering.

5 - Coding

The data engineering in these two articles is outstanding:
  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct
    • They construct an instruction tuning dataset by repeatedly prompting AlpacaCoder, without relying on ChatGPT (if I understand the paper correctly; in any case, the method should work for any strong enough base model). A rough sketch of the evolution loop follows this list.
    • On HumanEval and DS-1000, they rank only behind GPT-4, surpassing Claude / Bard.
    • The base model used is StarCoder, which means the quality of The Stack V3 is once again validated. Also, note that pretraining code data can go through multiple epochs, while web pages typically go through only one.
  • Phi-1: Textbooks Are All You Need
    • The pretraining dataset comes from filtered code + prompting ChatGPT.
    • The instruction tuning dataset comes from prompting ChatGPT.
    • The base model has only 1B parameters.
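The rough sketch of the Evol-Instruct-style evolution loop promised above — hypothetical prompts and a generic `generate()` call; the actual WizardCoder prompts and filtering heuristics are more involved:

```python
# Rough sketch of an Evol-Instruct-style data evolution loop (hypothetical
# prompts; the actual WizardCoder prompts and filtering are more involved).
import random

EVOLUTION_PROMPTS = [
    "Rewrite the following programming problem to add one more constraint:\n{instruction}",
    "Rewrite the following programming problem to require handling an edge case:\n{instruction}",
    "Increase the reasoning depth of the following programming problem:\n{instruction}",
]

def evolve_dataset(seed_instructions, generate, n_generations=3):
    dataset = []
    pool = list(seed_instructions)          # e.g. an Alpaca-style code seed set
    for _ in range(n_generations):
        evolved = []
        for inst in pool:
            prompt = random.choice(EVOLUTION_PROMPTS).format(instruction=inst)
            new_inst = generate(prompt)      # harder instruction
            solution = generate(new_inst)    # model-written solution
            if new_inst and solution:        # placeholder for the paper's filtering
                evolved.append(new_inst)
                dataset.append({"instruction": new_inst, "output": solution})
        pool = evolved
    return dataset
```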
How to evaluate:
  • It's critical to thoroughly study how they prompt the base model. As long as the base model scores high on MMLU / BBH / HumanEval, its potential surpasses your imagination.
  • The dataset prompted can be considered as providing a super-large training set for shorter algorithmic problems like HumanEval / MBPP.
    • However, it cannot be assumed that it optimizes for the test set, because its generalization space should be larger than HumanEval / MBPP. This generalization space is significantly positively correlated with the model scale.
  • Based on this, the more challenging aspects are:
    • Repo-level code understanding / completion. HumanEval / MBPP is still a bit short.
    • Ability balance. If you follow Phi-1's approach, abilities other than code will be overwhelmed.
Also, the data mixture of code and text is discussed in a corner of a recent Hugging Face paper:
In continued training, code data can make up 50% of the mixture — significantly larger than one might imagine.
 
The 50% code continued-training mixture is further verified by Salesforce's XGen model.
Honestly, this approach is super effective — scaling this recipe to a 100~200B model is very likely to match the performance of code-davinci-002, the base model of GPT-3.5. Then you are good to go.

6 - Putting them together: capability balancing

Goal:
  • Construct an instruction tuning data mixture to increase dialog/coding.
  • Ensure MMLU (English knowledge) / BBH and GSM8K (reasoning) do not drop.
  • Ensure in-context learning does not drop.
Approach:
  • You can let the model go through FLAN first — it is very large and almost equivalent to continued training.
  • For the code part, refer to the methods of WizardCoder and Phi-1.
  • After preparing the above data, search over hyperparameters for the instruction tuning data mixture and the data curriculum (a toy sketch follows this list).
  • Use the methods mentioned above for Eval.
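The toy sketch referenced above — hypothetical data sources and weights, just to show the shape of the search space; the ratios and curriculum order are exactly the hyperparameters to tune:

```python
# Toy sketch of an instruction-tuning data mixture (hypothetical sources and
# weights; the ratios and curriculum order are the hyperparameters to search).
import random

MIXTURE = {                 # relative sampling weights (hypothetical)
    "flan_v2":    0.40,     # keep MMLU / BBH from dropping
    "dialog_sft": 0.25,     # human-preference / chat data (LIMA-style)
    "code_evol":  0.25,     # WizardCoder / Phi-1 style code instructions
    "cot_gsm8k":  0.10,     # chain-of-thought reasoning data
}

def sample_batch(datasets, batch_size=64):
    """Sample one mixed batch according to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[source]))
    return batch

# Curriculum idea from the text: run a FLAN-heavy mixture first (close to
# continued training), then shift weight towards dialog and code data.
```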

7 - Summary

In plain words:
  • At this stage, the core issue of instruction tuning is the balance of abilities.
  • The evaluation of basic abilities can refer to the Chain-of-thought Hub (shameless self-promote again), but dialog evaluation still requires human participation, and even humans may not evaluate it well enough.
  • FLAN is very magical, and it might be worth considering creating a Chinese version.
  • Let's wrap up the instruction tuning stage quickly and move on to reward modeling.
    • Be sure to first refine the reward modeling itself, ensuring the reward model has discriminative power before doing PPO.
    • Don't rush to implement PPO without fully understanding the reward model; taking too big steps could lead to problems.
 
