A Closer Look at Large Language Models' Emergent Abilities
Yao Fu, yao.fu@ed.ac.uk
University of Edinburgh
with Tushar Khot and Hao Peng; work done during an internship at
Allen Institute for AI
Thanks to Aristo teammates, Jingfeng Yang, and Yi Tay for the insightful discussions and suggestions.
Please also check the related blog posts from the CoT team.
Sun Nov 20, 2022
Other versions: [pdf] [google docs] [arxiv] [中文]
 
Recently, there has been great interest and progress in demonstrating striking abilities of large language models (chain-of-thought prompting, scratchpads). Collectively referred to as emergent abilities of large language models, these are abilities that likely exist only in large models but not in smaller ones, hence the "emergence" framing. Many of these abilities are quite impressive, such as complex reasoning, reasoning with knowledge, and out-of-distribution robustness, as we will examine closely below. They are potentially close to what the NLP community has sought for decades, and thus represent a potential research paradigm shift from fine-tuning small models to in-context learning with large models. For pioneers, the paradigm shift may be self-evident and need no justification. Yet, for scientific rigor, we do need very explicit reasons why one should shift to large language models, which are expensive, hard to access, and potentially not even better. In this post, we will scrutinize what these abilities are, what large language models may deliver, and what their potential advantages are in the broader NLP/ML context.
 
Prerequisites: we assume the readers have the following knowledge:
  • Pretraining, fine-tuning, prompting (average Natural Language Processing / Deep Learning practitioners should know).
  • Chain-of-thought prompting / scratchpad (maybe less familiar to average practitioners, but it should be OK to read on without knowing them in advance)

The emergent abilities that exist in large models but not in small models.

Figure from Wei et al. 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The x-axis is model scale; GSM8K is a grade-school-level math word problem dataset.
In the above performance figure, we observe that the model performance:
  • does not increase much when the model size is relatively small
  • increases significantly once the model becomes large enough
This indicates that, fundamentally, certain abilities may not exist in small models but can be obtained by large models.
 
There are many emergent abilities, as summarized by Wei et al. 2022. Some of them are fun but not of interest here: for example, last-letter concatenation, which we believe is a task for Python rather than for language models, or 3-digit addition, which we believe is a task for a calculator rather than for language models.
 
The abilities we are interested in are those that (1) the NLP community has sought for years but has yet to achieve, (2) all previous NLP models struggle with, (3) are rooted in the deepest nature of human language, and (4) potentially reach toward the highest levels of human intelligence.

Three Typical Examples of Emergent Abilities

Many interesting abilities fall into the above-mentioned categories. Among them, we consider the following three typical abilities:
  • Complex Reasoning
  • Reasoning with Knowledge
  • Out-of-distribution robustness
Now we take a close look at them one by one.

Complex Reasoning

This is an example where prompting significantly outperforms fine-tuning. We start with a quick first example from GSM8K, a dataset of grade-school math word problems.
Although such a problem is simple for a 10-year-old, it is nontrivial for a language model to solve, largely because it mixes math with language.
GSM8K was originally proposed by OpenAI in Oct 2021. At that time, they fine-tuned the first version of GPT-3 with a verifier on the full training set, and the accuracy was about 35%. The authors were rather pessimistic because their results followed the scaling law of language models: performance increases linearly as the model size increases exponentially (I will come back to this later). Consequently, they pondered in their Section 4.1:
“it appears likely that the 175B model would require at least two additional orders of magnitude of training data to reach an 80% solve rate.”
Note that two additional orders of magnitude would mean 17,500B. How many years do you think that would take?
Three months later, in Jan 2022, Wei et al. pushed the accuracy to 56.6% with a 540B PaLM model, using only 8 chain-of-thought prompt exemplars (no fine-tuning on the full training set). Then in Mar 2022, with the same 540B PaLM model, Wang et al. improved the number to 74.4% by majority voting over sampled reasoning paths. The current SOTA is from my own work at AI2 (Fu et al., Nov 2022), where we achieve 82.9% accuracy with the 175B Codex by using complex chains of thought. Progress is indeed exponential.
Chain-of-thought prompting is a typical example of emergence with scale:
  • Emergence: although 17,500B is not required, the model does need to be larger than roughly 100B for chain-of-thought to outperform standard answer-only prompting, so the ability only exists in large models.
  • Performance: chain-of-thought prompting performs significantly better than the previous fine-tuning methods.
  • Annotation efficiency: chain-of-thought prompting only requires annotating 8 exemplars, whereas fine-tuning takes the full training set (see the sketch below).
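To make the mechanics concrete, here is a minimal sketch of how few-shot chain-of-thought prompting and majority voting fit together. This is not the exact prompt or decoding setup of the papers above: the prompt is abridged to a single exemplar (real prompts use about 8), and sample_completions is a hypothetical stand-in for whatever large model API one has access to, mocked here so the sketch runs end to end.

```python
from collections import Counter
import re

# Abridged few-shot chain-of-thought prompt: each exemplar spells out the
# intermediate reasoning before the final answer.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""


def sample_completions(prompt, n):
    """Hypothetical stand-in: sample n completions from a large language model
    at non-zero temperature. Mocked with a fixed string for illustration."""
    return ["Some step-by-step reasoning ... The answer is 3."] * n


def extract_answer(completion):
    """Pull the final numeric answer out of a chain-of-thought completion."""
    match = re.search(r"The answer is (-?\d+)", completion)
    return match.group(1) if match else None


def solve_with_self_consistency(question, n_samples=40):
    """Chain-of-thought prompting plus majority voting over sampled reasoning
    paths (the self-consistency idea of Wang et al. 2022)."""
    completions = sample_completions(COT_PROMPT.format(question=question), n_samples)
    answers = [a for a in map(extract_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


print(solve_with_self_consistency(
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?"))
```

With a real model, one would sample dozens of reasoning paths at a temperature above zero and take the majority answer over their final answers, which is what drove the jump from 56.6% to 74.4% described above.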
One may argue that primary school math is still not that meaningful (in a sense, it is indeed not so cool). But GSM8K is just the beginning: recent works have pushed the frontier to high school, college, and even International Mathematical Olympiad problems. Cooler now?

Reasoning with Knowledge

The next example we look at is reasoning that requires knowledge (e.g., question answering and commonsense reasoning). In this setting, prompting large models does not necessarily outperform fine-tuning small models (which one gives the better score remains to be seen). But the annotation efficiency advantage is amplified, because:
  • in many datasets, to obtain the required background/commonsense knowledge, the (previously small) model needs an external corpus/knowledge graph to retrieve from, or needs to be trained on augmented data with multi-task learning;
  • with a large language model, it is possible to drop the retriever entirely and rely only on the model's internal knowledge, again without tuning.
Figure from Yu et al. 2022. Previous SOTA models require retrieving from outside knowledge sources; GPT-3 performs on par with or outperforms previous models without needing to retrieve.
As shown in the table, and unlike the math case, GPT-3 does not significantly outperform the previous fine-tuned models. But it does not require retrieval from external documents, because the knowledge is contained within the model itself.
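To illustrate the contrast, here is a sketch with hypothetical stand-in components (not the exact setup of Yu et al. 2022): the classic open-book pipeline needs an external retriever plus a reader model, whereas the closed-book approach simply prompts the large model and relies on its internal knowledge.

```python
class DummyRetriever:
    """Stand-in for a real retriever over an external corpus or knowledge graph."""
    def search(self, question, top_k=5):
        return ["(retrieved passage 1)", "(retrieved passage 2)"][:top_k]


class DummyReader:
    """Stand-in for a (small) fine-tuned reader model."""
    def extract_answer(self, question, passages):
        return "(answer extracted from the retrieved passages)"


class DummyLargeLM:
    """Stand-in for a large language model behind some API."""
    def complete(self, prompt):
        return "(answer generated from the model's internal knowledge)"


def answer_open_book(question, retriever, reader):
    """Classic pipeline: retrieve evidence from an external source, then read
    the answer off the retrieved passages."""
    passages = retriever.search(question, top_k=5)
    return reader.extract_answer(question, passages)


def answer_closed_book(question, large_lm):
    """Closed-book prompting: drop the retriever and prompt the model directly."""
    prompt = f"Answer the following question.\nQ: {question}\nA:"
    return large_lm.complete(prompt)


question = "Who wrote the novel One Hundred Years of Solitude?"
print(answer_open_book(question, DummyRetriever(), DummyReader()))
print(answer_closed_book(question, DummyLargeLM()))
```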
 
To understand the significance of these results, we need a bit of historical context: the NLP community has faced the challenge of efficiently encoding knowledge since the very beginning. Knowledge has to be stored somewhere, either inside or outside the model. Since the 90s, people have tried to store knowledge outside the model by hand-writing all the rules of language and the world into a gigantic library. This later turned out to be extremely hard, because one simply cannot enumerate every single rule of the world. So instead of writing down all possible rules, researchers retreated to domain-specific knowledge bases that are useful. Such domain-specific knowledge can be unstructured text, semi-structured (like Wikipedia), or fully structured (like knowledge graphs). Usually, structured knowledge is hard to construct (because one needs to design the schema) but easy to reason with (because the structure is there), while unstructured knowledge is easy to construct (one just stores the text) but hard to reason with (there is no ready-to-use structure). Given this context, language models provide a way to both extract knowledge from unstructured text easily and reason over that knowledge effectively, without the need for a predefined schema. The table below compares the pros and cons:
|                        | Construction                                                    | Reasoning                                                   |
| ---------------------- | --------------------------------------------------------------- | ----------------------------------------------------------- |
| Structured knowledge   | Hard to construct: need to design the schema and parse into it  | Easy to reason with: useful structures are already defined  |
| Unstructured knowledge | Easy to construct: just store the text                           | Hard to reason with: need to extract useful structures      |
| Large language model   | Easy to construct: trained on unstructured text                  | Easy to reason with: just prompt                            |

Out-of-distribution Robustness

The third ability we discuss is out-of-distribution robustness. During 2018-2022, a lot of research across NLP, CV, and general machine learning, under the names of distribution shift, adversarial robustness, and compositional generalization, observed that when the test distribution differs from the training distribution, model performance may drop significantly. This seems not to be the case for large language models' in-context learning. A quick experimental result comes from Si et al. 2022:
Table from Si et al. 2022. Although GPT-3 cannot outperform RoBERTa in the in-distribution setting, it outperforms RoBERTa in the out-of-distribution settings with a significantly smaller performance drop.
Again, prompting GPT-3 does not outperform the fine-tuned RoBERTa in the in-distribution setting. But it outperforms RoBERTa in the three out-of-distribution settings (domain shift, noisy inputs, and adversarial perturbations), meaning that it is more robust.
 
Furthermore, certain good generalization behaviors of prompting remain consistent even under distribution shift. For example:
Figure from Fu et al. 2022. Complex prompts consistently induce better performance than simple prompts, even when the test distribution differs from the training distribution.
In Fu et al. 2022, we observe that the more complex the input prompt is, the better the model performs. This trend is consistent under distribution shift: whether the test distribution is the same as the prompt distribution, is a noisy distribution, or the prompt is transferred from another distribution, complex prompts consistently outperform simple prompts.
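A minimal sketch of the underlying idea (my simplification of Fu et al. 2022, with a hypothetical toy exemplar pool): treat the number of reasoning steps as a measure of prompt complexity and build the few-shot prompt from the most complex annotated exemplars.

```python
def count_reasoning_steps(chain_of_thought):
    """Approximate complexity by the number of non-empty reasoning lines."""
    return sum(1 for line in chain_of_thought.splitlines() if line.strip())


def build_complex_prompt(exemplar_pool, num_exemplars=8):
    """Select the exemplars with the most reasoning steps and format them into
    a few-shot prompt. Each exemplar is a (question, chain_of_thought, answer) triple."""
    ranked = sorted(exemplar_pool,
                    key=lambda ex: count_reasoning_steps(ex[1]),
                    reverse=True)
    return "\n\n".join(f"Q: {q}\nA: {cot}\nThe answer is {a}."
                       for q, cot, a in ranked[:num_exemplars])


# Toy pool: the first exemplar has more reasoning steps than the second.
pool = [
    ("What is 2 + 2 + 2?", "2 + 2 = 4.\n4 + 2 = 6.", "6"),
    ("What is 3 + 3?", "3 + 3 = 6.", "6"),
]
print(build_complex_prompt(pool, num_exemplars=1))
```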

Summary so far

We have discussed three emergent abilities that potentially only large models have. They are:
  • Complex reasoning, where large models significantly outperform previous smaller models without the need for full-dataset training.
  • Reasoning with knowledge, where large models may not outperform previous smaller models, but do not require the additional source of knowledge (which can be expensive to obtain or hard to extract from unstructured data).
  • Out-of-distribution Robustness, where most previous fine-tuned models struggle. Here large models may still not outperform previous methods in the in-distribution setting, but they seem to be much better in the out-of-distribution setting.

Emergent abilities transcend the scaling law

Given the above advantages, one may start to feel that large language models are indeed very good. Before going further, let's take a step back and look at the literature a little more. One curious question: GPT-3 was released in 2020, so why is it only now that we are interested and starting to think the paradigm may be shifting?
The answer traces back to two different types of scaling curves: the log-linear curve and the phase-change curve, as shown below:
Left: the scaling law, where model performance increases linearly as the model size increases exponentially. Right: emergent abilities, which show a phase change at a certain scale where performance suddenly increases.
Initially, people (at OpenAI) believed that the performance of language models with respect to model size could be predicted by a log-linear curve, where performance increases linearly as the model size increases exponentially. This behavior is called the scaling law of language models, initially discussed in Kaplan et al. 2020 and then observed in the original GPT-3 paper. Importantly, at this stage, prompting even the largest GPT-3 could not outperform fine-tuning smaller models. This raised the question of why one should bother with such large, expensive models (even though prompting is annotation-efficient). Later, Cobbe et al. 2021 showed that the scaling law also seems to apply to fine-tuning. This is rather pessimistic because it implies that we may be locked into our model scale: tweaks to the model architecture may improve performance to a certain extent, but we still seem confined to the corresponding scale range, with no significant performance breakthrough.
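To make the contrast concrete, here is one illustrative way to write it down (my own formulation, not a formula from the papers): under a log-linear scaling law, performance behaves roughly like

$$\mathrm{performance}(N) \approx a + b \log_{10} N,$$

so multiplying the model size N by 100 (two orders of magnitude) buys only about 2b additional points. An emergent ability instead behaves like a phase change: performance stays near the baseline below some critical scale and jumps sharply above it, which is exactly why it cannot be extrapolated from the small-model part of the curve.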
During the age of the scaling law (2020-2021), since prompting GPT-3 could not outperform fine-tuning T5-11B, and T5-11B was already troublesome to fine-tune, the community's focus was more on prompting smaller models or on parameter-efficient adaptation. Prefix tuning is one notable example at the intersection of prompting and adaptation, later partially unified with other methods by He et al. 2021. The logic of the bet is simple: if fine-tuning is better, we should put more effort into parameter-efficient tuning; if prompting is better, we should put more effort into training larger language models.
Then in Jan 2022, chain-of-thought prompting came out, and the story began. As the authors show, chain-of-thought prompting exhibits a clear phase change in the performance-scale curve: when the model is large enough, performance increases significantly and clearly transcends the scaling curve.
When prompted with chains of thought, we get significantly better performance than fine-tuning on complex reasoning, competitive performance on reasoning with knowledge, and further potential gains in distributional robustness. All of these advantages require only about 8 in-context examples. This is why the paradigm may shift.

What does paradigm shift mean?

What does paradigm shift mean exactly? Below we give a comparison between fine-tuning and prompting:
|                | Fine-tuning       | In-context learning / prompting      |
| -------------- | ----------------- | ------------------------------------ |
| Scale          | Small model       | Large model                          |
| Supervision    | Supervised        | Supervised??                         |
| Data           | Full training set | Few in-context demonstrations        |
| Generalization | In-distribution   | In-distribution + distribution shift |
The advantage is clear: we no longer need cumbersome data annotation and full-set fine-tuning. We only need to write prompts and get results that are at least good enough, which is substantially faster than fine-tuning.
Two more things to note:
Is in-context learning still supervised learning?
  • I don’t know.
  • They are similar in the sense that in-context learning also needs demonstrations that are conceptually similar to training data.
  • They are different in the sense that the generalization behavior of in-context learning is systematically different from that of supervised learning, so none of the previous generalization theories (e.g., Rademacher complexity or the neural tangent kernel) directly apply.
Is in-context learning really better than fine-tuning?
  • The answer is not known.
  • Most existing comparisons between prompting and fine-tuning are really prompting + large model vs. fine-tuning + small model. A fair comparison would be prompting + large model vs. fine-tuning + large model, where the fine-tuned large model is the same model being prompted. So in the original CoT paper, if Wei et al. really wanted to claim that prompting is better than fine-tuning, they should have reported the fine-tuned performance of PaLM, rather than the number from GPT-3.
  • My hypothesis: fine-tuning will improve in-distribution performance, but hurt out-of-distribution robustness. Prompting will be better in the distribution shift setting, but worse in the in-distribution setting.
    • If this is true, then a straightforward research problem is how to fine-tune the model without sacrificing its in-context learning ability.
    • Note that the OOD generalization of fine-tuning also changes with model scale. As a quick example, Table 4 of Yang et al. 2022 shows that fine-tuning BART-base decreases OOD generalization, while fine-tuning BART-large improves it. It is entirely possible that with large models, fine-tuning also improves performance under distribution shift, when the test distribution is related to / not far from the training distribution.
Again, recall the simple logic: if fine-tuning is better, we should put effort into parameter-efficient tuning; if prompting is better, we should put effort into getting better large language models.
So, although we believe that large language models exhibit great potential, there is still no hard evidence that one paradigm is absolutely better than the other, so we do not know whether the paradigm should really shift, or to what extent. It would be very meaningful to compare the two paradigms carefully, so that we have a clear picture of the future. We leave further discussion to our next post.

How large should the model be?

The quick answer is two numbers: 62B and 175B.
  • For chain-of-thought to be better than standard answer-only prompting, one needs the model to be at least 62B
  • For chain-of-thought to be better than fine-tuning small models (say T5-11B), one needs a model larger than 175B, where 175B is the size of GPT-3.
The number 62B comes from Table 5 of Chung et al. 2022.
For all models smaller than 62B, direct prompting outperforms CoT. The first model where CoT outperforms direct prompting is Flan-cont-PaLM 62B on BBH. At 540B, there are more settings where CoT outperforms direct prompting, but still not all of them. Also, the required size can be smaller than 540B: in Suzgun et al. 2022, the authors show that the 175B InstructGPT and the 175B Codex also achieve better CoT performance than direct prompting. Combining all these results, we get the two numbers 62B and 175B. So yes, to enter the game of scale, one does need a ticket to models larger than average.
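For readers unfamiliar with the terminology, the schematic below contrasts the two prompt formats (illustrative exemplars in the style of the CoT paper, not the exact prompts used in the evaluations above): direct prompting maps the question straight to the answer, while chain-of-thought prompting includes the intermediate reasoning.

```python
# Direct (answer-only) prompting: the exemplar goes straight from question to answer.
direct_prompt = """\
Q: There are 3 cars in the parking lot and 2 more cars arrive. How many cars are in the parking lot?
A: The answer is 5.

Q: {question}
A:"""

# Chain-of-thought prompting: the exemplar spells out the reasoning before the answer.
cot_prompt = """\
Q: There are 3 cars in the parking lot and 2 more cars arrive. How many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: {question}
A:"""
```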
However, there are other large models, such as OPT, BLOOM, and the first version of GPT-3, that are also 175B, yet their CoT performance is significantly worse or even absent. This naturally leads to our next question.

Is scale the only factor?

No.
Scale is a necessary but not sufficient condition. Some models are large enough, like OPT and BLOOM (both 175B), but still cannot do CoT.
There are only two model families that fully do CoT (TODO: add discussions about UL2):
  • GPT-3 models, including text-davinci-002 and code-davinci-002 (Codex). These are the only two publicly accessible models with strong emergence.
    • Other than these two, none of the other GPT-3 models (including the original GPT-3, text-davinci-001, and the smaller GPT-3 variants) can do CoT.
    • By "can do CoT", we mean CoT performance better than (a) direct prompting and (b) fine-tuning T5-11B.
    • Also note that code-davinci-002 consistently outperforms text-davinci-002 on language tasks by a large margin. This observation is quite intriguing: it means a language model tuned on code can outperform a language model tuned on language. We do not know why.
  • PaLM models, including PaLM, U-PaLM, Flan-PaLM, and Minerva. These models are not accessible to the general public. (@Google, please release more checkpoints!)
The source of emergence is still unclear. Yet we can identify the following factors/indicators that may contribute to emergent abilities:
  • Instruction tuning: GPT-3's text-davinci-002 is instruction-tuned with reinforcement learning; before that, text-davinci-001 could not do CoT well. The original PaLM seems not to be instruction-tuned, but Google later instruction-tuned it, and performance increased.
  • Tuning on code: code-davinci-002 (Codex), which is tuned on code, is consistently better than text-davinci-002. PaLM is also trained on code. Superficially, code has little to do with natural language, and we do not know why it helps, but it seems to be very helpful.
  • Tuning on chain-of-thought data: by the time text-davinci-002 was released, Google had already released PaLM for three months, so OpenAI should have known about chain-of-thought. There are also works showing that directly tuning on CoT data can enable a model's CoT ability.
Yet all of these factors are conjectures at the current stage. It would be very meaningful to unveil the recipe for training models with emergent abilities. We leave further discussion to our next post.

Conclusion

In this article, we scrutinize the emergent abilities of large language models. We emphasize the importance of, and the opportunities in, complex reasoning, reasoning with knowledge, and distributional robustness. Emergent abilities are promising and exciting because they transcend the scaling law and exhibit a phase change in the scaling curve. We carefully discuss whether the research paradigm is actually shifting from fine-tuning to in-context learning, and our answer is: not yet, because we are still not sure how fine-tuning and in-context learning compare in the in-distribution and out-of-distribution settings. Finally, we discuss three potential factors behind emergent abilities: instruction tuning, tuning on code, and tuning on chain-of-thought data. We very much welcome comments, suggestions, and discussions.
 
There are two more interesting questions we mentioned but have not yet thoroughly discussed:
  • Can we set up a head-to-head comparison between fine-tuning and in-context learning?
  • Can we come up with a training recipe such that, as long as one follows it, one is guaranteed to obtain emergent abilities, or at least the CoT ability?
We leave these two questions to our next posts.
 
 
