一文尽览大模型经典论文

模型

OPENAI

*GPT3

Language Models are Few-Shot Learners, 2020, OpenAI
 
  1. 核心发现:增大(scaling up)语言模型的size,可以极大提升其in-context learning能力。注意,in-context learning完全不需要对模型进行梯度更新,而传统意义上的few-shot learning依旧需要梯度更新。
    1. notion image
  1. GPT3做不好比较(comparison)任务,如WiC benchmark、ANLI。作者在第5节 Limitations 中提出,这可能是由于GPT3的结构缺陷带来的(单向LM,天然不适合需要"looking back and comparing"的任务)。然而,22年google的工作证明,只要模型足够大,decoder-only的LM一样可以做好这类任务。
    1. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer.
  1. 不同语料,采样频率不同:
    1. 低质量:Common Crawl、Books2,训练过程未全部过完
    2. 高质量:Books1、WebText2、Wikipedia,训练过程重复2-3次
    3. notion image
  1. 训练参数:
notion image
| 模型 | 发布时间 | 层数 | 头数 | 词向量长度 | 参数量 | 预训练数据量 |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-1 | 2018 年 6 月 | 12 | 12 | 768 | 1.17 亿 | 约 5GB |
| GPT-2 | 2019 年 2 月 | 48 | - | 1600 | 15 亿 | 40GB |
| GPT-3 | 2020 年 5 月 | 96 | 96 | 12288 | 1,750 亿 | 570GB(~400B 个 token) |
  1. 先有scaling law,后有GPT3,前者的预测让后者的训练有了底气。而再后来的emergent ability,则属于意外之喜。
    1. notion image
  1. GPT-3的few-shot能力,在通用任务上提升并不显著,从13B到175B,最好的设置下Acc也只涨了10%。
    1. notion image
  1. GPT-3的数学能力有了很大提升,从13B到175B,有明显的jump:
    1. notion image
  1. GPT-3数据预处理的时候有个小bug,导致评估数据出现在了训练数据中,对于评估而言影响不大,【clean评估数据】vs 【all评估数据】,效果没有系统性差异。
notion image

GPT4

  1. GPT-4能以top 10%的成绩通过模拟律师考试,而GPT-3.5(ChatGPT)的成绩则排在倒数10%
    1. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%.
  1. GPT-4的训练前所未有稳定:
    1. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time.
  1. 模型能力主要是大力出奇迹,通过算力和数据得到的,RLHF并无助于提升各类考试成绩。RLHF主要是为了理解人类意图、更好地跟人类交互。
    1. Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it).
  1. 在GPT-4训练刚开始时,就已经能够准确预测GPT-4最终的loss
    1. To verify this scalability, we accurately predicted in advance GPT-4’s final loss on our internal codebase (not part of the training set) by extrapolating from models trained using the same methodology but using 10,000x less compute
notion image

Codex

Evaluating Large Language Models Trained on Code, OpenAI, 2021
  1. OpenAI花了很多时间讲数据集和评估(注意论文的名字就以Evaluating开头):
    1. 评估不能用BLEU,而应该用在一系列unit test上的通过率(即pass@k,示意实现见本条末尾)。BLEU常常会给字面相近但functionally inequivalent的代码打出高分
    2. 评估不能用现有代码数据集,因为很有可能已经在现有预训练语料当中了。OpenAI自己创建了一个手写数据集(总共有164个问题),每个题目有平均7.7个单元测试。
    3. 训练语料上,只用了Github上的python代码,清洗前后分别是179G、159G。
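论文给出了 pass@k 的无偏估计,并强调要用数值稳定的写法。下面是按论文公式整理的示意实现:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k 的无偏估计:共采样 n 个解,其中 c 个通过全部 unit test,预算为 k。"""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k) 的数值稳定写法
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=30, k=1))   # 约 0.15
print(pass_at_k(n=200, c=30, k=10))  # 明显更高
```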
  1. 整体方案是finetune GPT(用的是无标注语料,严格来说是领域预训练)。OpenAI发现直接用159G的数据预训练,就可以打平从GPT热启的效果,但从GPT热启能够更快收敛:
    1. Since Codex is evaluated on natural language prompts, we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family, which already contains strong natural language representations. Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments
  1. python代码会有不同层级的缩进,这些不同层级的whitespace如果全用一个字符很浪费空间。
    1. The largest source of inefficiency arises from encoding whitespace, so we add an additional set of tokens for representing whitespace runs of different lengths.
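论文没有给出具体的token化写法,下面是按这个思路的一个最小示意(<ws_n> 这类 token 名是假设的,论文只说为不同长度的 whitespace run 增加额外 token):

```python
import re

# 示意:把连续空格串替换成专门的占位 token,再交给普通 BPE 处理
def encode_whitespace_runs(code: str, max_run: int = 25) -> str:
    return re.sub(r" {2,}", lambda m: f"<ws_{min(len(m.group(0)), max_run)}>", code)

print(encode_whitespace_runs("def f(x):\n        return x + 1"))
# 8 个空格的缩进被编码成单个 <ws_8>,而不是 8 个空格 token
```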
  1. Codex:在不同参数量的 GPT-3 上做”finetune”得到的模型,模型参数量有12M、25M、42M、85M、300M、679M,2.5B 和 13B。
    1. Codex-S:Github上的python代码往往包含了”class implementations, configuration files, scripts,” 等与『做题』不相关的的内容。论文将 Codex 在 Supervised Fine-Tuning 数据集上进行了finetune,得到的模型称为 Codex-S,在HumanEval这个衡量做题能力的数据集上效果更好。
      1. notion image
    2. Codex-D:为了训练 Codex 生成 docstrings 的能力,论文将 Codex 在 Docstrings 生成数据集上进行了finetune,得到的模型称为 Codex-D。在训练时,代码部分(函数名+函数体)的token被mask掉、不计入loss,只在docstring上计算loss。
 

*instructGPT

Training language models to follow instructions with human feedback,2022.3,openai
  1. 让GPT的输出对齐(align)到用户intent:得到的1.3B小模型,产生的输出比175B的GPT更被人类偏好(preferred)。
  1. 整体流程:
notion image
训练分三步:
  1. 监督学习:对GPT进行finetune(SFT),这里的GPT对应强化学习概念里的policy network
  1. 监督学习:训练reward model(RM),把模型输出进行人工标注、排序,训练RM
  1. 强化学习:GPT产生输出,reward model产生reward,利用PPO算法训练GPT
其中,2,3两步可以循环进行。使用当前最佳policy收集更多的数据,用于训练新的RM,进而训练新的policy。OPENAI提到:
In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies
这一点似乎说明,最早的监督学习的GPT贡献了绝大多数data,后面被PPO优化的GPT只贡献了少数数据。此处尚不理解,可能是2、3两步没循环多少次?
  1. 对齐税(alignment tax):PPO会让阅读理解、翻译等任务效果下降,OpenAI称之为对齐税,同时提出可以在PPO目标中混入LM预训练目标共同训练,以减少对齐税(PPO-ptx)。
We qualitatively probe InstructGPT’s capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of “following instructions.” They retain some alignment even on tasks for which they get very little direct supervision signal.
  1. 附录E给出了NLP公开集合上的表现,可以看到:
    1. InstructGPT:SFT相比GPT,很多任务上zero-shot效果是下降的(与model size无关)。
    2. FLAN:model size足够大时,instruction tuning之后的效果好于不tuning。
    3. 结论的不同,可能是训练语料不同带来的:FLAN训练、评估用的都是NLP语料,所以instruction tuning能够带来收益也是可以预期的。
    4. zero-shot performance of InstructGPT
  1. RLHF不仅仅是扩充数据,更重要的是能够让模型真正做到"following instructions"。OpenAI发现InstructGPT能够回答finetune时很少见到的指令,如为代码生成摘要、对代码进行问答。可以说,指令学习(instruction tuning)具有泛化性,才真正让通用人工智能成为可能。
  1. InstructGPT训练的prompt从何而来?通过用户与InstructGPT交互而来,这也是OPENAI领先优势越来越大、能够形成飞轮(flywheel)的体现。当然,最初的InstructGPT,需要人工标注的数据来启动(这里说的应该就是上图中的step1,SFT)。
    1. As previously mentioned, for the majority of the project, we obtained prompts directly from external users of the instruct beta models in the OpenAI API. However, this strategy only works once you have a model that accepts instruction-like prompts. In order to train the very first such model, we asked contractors to write prompts themselves.
  1. InstructGPT用到的数据可分为三个集合:

    | 数据集名 | prompt数据量 | prompt来源 | 监督信号 | 用法 |
    | --- | --- | --- | --- | --- |
    | SFT dataset | 13K | API & 标注员手写 | 有,"labeler demonstrations" | 用few-shot的方式训练SFT模型 |
    | RM dataset | 33K | API & 标注员手写 | 有,"labeler rankings" | 标注员提供ranking,从中两两配对(pairwise)训练reward model |
    | PPO dataset | 31K | API | 无,纯prompt | PPO训练 |

    关于few-shot标注数据如何用来训练监督模型(构造方式的示意代码见本条末尾):
    For example, the instruction could be "Give the sentiment for a tweet," and the queries would be tweets and the responses either "Positive" or "Negative." We can then format these as few-shot prompts like those in Brown et al. (2020). With K query-response pairs, we create K training examples using the other K-1 in the context.
    • instruct-query1-res1-query2-res2-query3-
    • instruct-query2-res2-query3-res3-query1-
    • instruct-query1-res1-query3-res3-query2-
    SFT的训练就是用few-shot的形式训练的,在附录A.3里有提到:
    For SFT, note that we have many more labeler-written prompts than customer prompts… …We synthetically constructed multiple SFT datapoints from the same instruction by sampling different sets of few-shot examples.
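下面给出这种构造方式的一个最小示意(prompt 的具体拼接格式为假设,论文只说明用其余 K-1 个样本作为 few-shot context):

```python
# 由 K 个 query-response 对构造 K 条 few-shot SFT 训练样本
def build_sft_examples(instruction, pairs):
    examples = []
    for i, (q, r) in enumerate(pairs):
        # 其余 K-1 个样本拼成 context,第 i 个样本作为待补全目标
        context = "".join(f"{q_j} -> {r_j}\n" for j, (q_j, r_j) in enumerate(pairs) if j != i)
        examples.append({"prompt": f"{instruction}\n{context}{q} ->", "completion": f" {r}"})
    return examples

pairs = [("I love this!", "Positive"), ("Terrible service.", "Negative"), ("Not bad at all.", "Positive")]
for ex in build_sft_examples("Give the sentiment for a tweet.", pairs):
    print(ex["prompt"], "|", ex["completion"])
```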
  1. 训练过程
    1. SFT:训了16个epoch,用在validation set的上的RM score选最终的模型。
    2. RM:
      1. 初始化模型:用SFT模型初始化,去掉最后的unembedding layer,换成输出一个scalar reward的投影层
      2. 模型大小:6B的RM又快又稳定,175B的RM训练不太稳定。
      3. 输入:prompt-response对
      4. 输出:reward(scalar),并通过bias的方式让reward均值为0
      5. 损失函数:
        1. notion image
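上图中的损失函数即 pairwise ranking loss,按论文的定义大致为:

loss(θ) = − 1/C(K,2) · E_{(x, y_w, y_l)∼D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

其中 y_w 是同一 prompt 下标注员更偏好的回复,K 为该 prompt 下被排序的回复数量,C(K,2) 为两两配对的组合数。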
  1. 从结果来看,单独使用SFT监督学习,就能够让整个学习过程更加可控,PPO锦上添花。相比GPT-3的预训练,model alignment这种对齐方式是非常efficient的。
notion image

*Scaling Laws

Scaling Laws for Neural Language Models, openai, 2020
  1. 这篇文章研究的是影响大模型 LM loss的因素,文章提供了很多有用的empirical findings,为后续大模型迭代指引方向。
  1. 模型效果主要取决于三个因素:1) 模型参数量 2) 数据集大小 3) 训练时的计算量。其他架构上的设计(如depth vs width、number of self-attention heads),对大模型的效果影响有限,说明对Transformer的结构细节进行调参意义不大。这一点与小模型时代的结论(模型深度的重要性大于宽度)也是不同的。
    1. notion image
  1. 在其他两个因素不成为瓶颈的前提下,模型效果与每个因素都遵循power-law。这其实是很让人悲观的,因为想让模型效果提升一小点,就需要模型容量和数据量十倍、百倍的提升。
    1. notion image
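按论文的拟合结果,三条 power-law 的形式大致为(指数数值取自论文):

L(N) = (N_c / N)^{α_N},  α_N ≈ 0.076
L(D) = (D_c / D)^{α_D},  α_D ≈ 0.095
L(C_min) = (C_c / C_min)^{α_C},  α_C ≈ 0.050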
注意这里的 N_c、α_N 等常数取决于词表大小、tokenization等细节,不具有太多fundamental meaning。等式两侧取对数后,loss与规模之间呈线性关系。
notion image
  1. 模型参数量与训练数据量应该同步提升,但并非等比例。具体地,每次增大模型大小8倍,应该相应增大训练数据量约5倍。
  1. 计算资源有限的前提下,应该着重提升model size,而不是训练数据量(注:与后文 Chinchilla 的结论有差异)。从公式(1.7)可以得出,如果计算量提升10倍,应该提升model size 5.37倍、数据量1.86倍(推导示意见下方代码)。
    1. notion image
      notion image
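用论文拟合的最优分配指数可以直接验证这组数字(N_opt ∝ C^0.73、D ∝ C^0.27 为论文给出的拟合结果):

```python
# 计算量提升 10 倍时,model size 与数据量分别应提升多少倍
compute_multiplier = 10
print(compute_multiplier ** 0.73, compute_multiplier ** 0.27)  # ≈ 5.37, ≈ 1.86
```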
  1. 大模型更加地sample-efficient,需要更少数据量和计算资源便可以达到与小模型相同效果。
notion image
  1. 本文发现的Power-low不仅是mere observation,也提供了一个预测框架,帮助OpenAI在训练大模型的时候,去估算所需要的计算资源、过拟合程度、早停步数等:
    1. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework.

Google

T5

  1. 主要贡献:把所有NLP任务统一转换成text-to-text的生成任务。仍然需要逐任务finetune,也还不具备zero-shot能力,但已初具雏形。
notion image
  1. 模型结构上:T5采用的是Encoder-Decoder结构,T5实验发现其在NLP任务上效果要略好于decoder-only的LM。
  1. 预训练目标:T5采用的是BERT-style的MLM,T5实验发现在NLP任务上,这种训练目标比LM效果略好(站在2023年回头看,这个实验很有可能是misleading的,低估了LM的潜力,又或者GPT-3用MLM也可以更好?)。
    1. notion image
      notion image
  1. 如何用Encoder-Decoder做MLM呢?T5提了几种变种,发现效果类似,但replace span(上图I.i.d. noise, replace spans)的target更短、训练效率更高。
    1. notion image
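replace span 的输入/目标格式,按论文中的经典示例大致如下(sentinel token 以 <X>/<Y>/<Z> 表示):

```python
original = "Thank you for inviting me to your party last week ."
inputs   = "Thank you <X> me to your party <Y> week ."   # 被替换的 span 用 sentinel 占位
targets  = "<X> for inviting <Y> last <Z>"                # 只需生成被删去的 span,target 更短
```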

    *FLAN

    Finetuned Language Models are Zero-Shot Learners, 2021, Google
    1. 整体感觉,google的论文还是在研究NLP,论文逻辑还是刷分打榜(虽然是zero-shot);而OpenAI的论文似乎不care数据集表现,写作都是围绕着AGI展开。本文研究的核心问题:
      1. Our paper has explored a simple question in zero-shot prompting: does finetuning a model on a collection of tasks phrased as instructions improve its performance on unseen tasks? We operationalize this question via instruction tuning, a simple method that combines appealing aspects of both the pretrain–finetune and prompting paradigms. Our instruction-tuned model, FLAN, improves performance against an untuned model and surpasses zero-shot GPT-3 on the majority of tasks that we evaluate on. Ablation
    1. 方法:instruction tuning
      1. 与prompt范式的不同:instruction tuning是有梯度更新(反向传播)的
      2. 与finetune的不同:
        1. instruction tuning在A上finetune,可以在B上预测
        2. instruction tuning数据组织形式上是统一的
    notion image
    1. 训练细节:
      1. FLAN是instruction-tuned之后的LaMDA
      2. 模型结构:137B参数的Decoder
      3. 不同training example通过packing的方式结合起来(论文2.4,后面详细看引用论文)
    1. 实验:
      1. NLP任务数量,对于held-out任务效果的影响。从实验看基本符合预期,任务越多模型zero-shot的效果越好,而且并没有饱和的趋势
        1. notion image
      2. instruction-tuning对于小模型而言,在NLP任务上可能是有害的(存疑):
        1. notion image
          The behavior on held-out tasks for the 8B and smaller models, however, is thought- provoking—instruction tuning actually hurts performance on held-out tasks. One potential explanation for this result could be that for small-scale models, learning the ∼40 tasks used during instruction tuning fills the entire model capacity, causing these models to perform worse on new tasks. Under this potential explanation, for the larger scale models, instruction tuning fills up some model capacity but also teaches these models how to follow instructions, allowing them to generalize to new tasks with the remaining capacity.
          这个结果可能更说明了,FLAN的训练数据及评估方式(NLP任务转换得到)不适合8B及以下的模型。正如作者分析的,40个NLP task填满了模型capacity,模型失去了泛化性,不能真正的follow instruction。当然,后面也有反例,此处不需要过度纠结:
          • 1.3B的InstructGPT只是能够生成『人类更喜欢的回复』,但在NLP任务上效果依旧不好(见上文 InstructGPT on NLP tasks 部分)
          • 这篇论文的5.2节,同样提到了instruction-tuning对小模型也是有效的。
           
    1. 比较:FLAN的instruction tuning和InstructGPT里的SFT的区别,主要是语料不同
      1. FLAN面向NLP:通过template把NLP任务转化为instruction tuning的语料
      2. InstructGPT面向真实场景:OpenAI自己API接口收集的prompt + labeler手写的example。
      3. 从效果上看,InstructGPT的论文里提出,人类显著更喜欢InstructGPT的回复(78% of the time),说明InstructGPT的语料构造方式更适合真实场景;而利用NLP语料来转换的问题在于,NLP任务只是真实用户需求的一小部分,不够diverse。
        We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming consist of about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type instruction-following model would combine both types of datasets.
    1. Few-shot:输入形式instruct(x1)-res1-instruct(x2)-res2-instruct(x3)-?,跟instructGPT基本是一样的。结果上,输出空间越大,few-shot相比zero-shot的提升越大。如果是个简单的二分类,则few-shot相比zero-shot提升不明显。
    1. Interesting notes:
      1. GPT做不好NLI任务的一个原因,可能是NLI的语料形式,很难出现在无监督语料中。FLAN的形式是:
        1. For FLAN, we phrase NLI as the more natural question “Does <premise> mean that <hypothesis>?”, achieving much higher performance.
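用 template 把一条 NLI 样本转成 instruction 形式的最小示意(FLAN 实际为每个数据集写了多个 template,这里的写法仅参考上面引用的表述):

```python
def nli_to_instruction(premise: str, hypothesis: str, label: str):
    prompt = f'Does "{premise}" mean that "{hypothesis}"? OPTIONS: yes / no'
    target = "yes" if label == "entailment" else "no"
    return prompt, target

print(nli_to_instruction("The dog is sleeping on the couch.", "An animal is resting.", "entailment"))
```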

    CoT

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022, Google
     
    1. Chain-of-Thought(CoT)指的是,在构造few-shot prompt时,示例中不仅给出最终结果,还给出推理(reasoning)过程。下面这个例子来自于OpenAI提出的GSM8K数据集,用于衡量模型简单的数学推理能力。
    notion image
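一个 CoT few-shot prompt 的示意(exemplar 为 CoT 论文中的经典例子):

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""
# 标准 prompt 的 exemplar 只给 "The answer is 11.";CoT prompt 则把中间推理一并给出
```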
    1. 一些实验结论:
      1. CoT能够显著提升大模型的推理能力
      2. CoT对小模型而言并不生效,即CoT是大模型的涌现能力(emergent ability)。
      3. 对于CoT,任务越复杂(如GSM8K),其效果相比于普通prompt的效果越显著。
        1. notion image
      4. CoT并不需要复杂的语言技巧,推理过程可以用较为任意的语言表达出来,即可激发模型的CoT能力
    1. CoT不仅对于数学推理有效果
      1. Common sense reasoning:如CSQA(common sense QA),需要先验知识去解决的问答测试集。
      2. symbolic reasoning:如Last letter concatenation("Amy Brown" → "yn"),CoT的效果可以说是一骑绝尘。
      3. notion image

    Emergent Ablity

     
    1. 什么叫emergent ability?就是量变引起质变:An ability is emergent if it is not present in smaller models but is present in larger models。类似于三体提到的技术爆炸,无法预测。
      1. Hence, their emergence cannot be predicted by simply extrapolating performance on smaller-scale models
    1. 论文里的这个表很有价值:大多数emergent ability需要60B以上参数(最好100B以上)
    notion image
    1. 作者评估模型采用的是BIG-Bench。BIG-Bench由204项任务组成,任务主题涉及语言学、儿童发展、数学、常识推理、生物学、物理学、社会偏见、软件开发等领域。作者指出,当前还有几十个任务是最大的GPT-3和PaLM都解决不了的(no model better than random),后续增大模型容量或许可以解决这些问题。
      1. notion image
    1. 大模型在WiC上的表现很有启发意义。OpenAI的研究者最早(2020年)认为GPT-3天然不适合做好比较类任务(如WiC、NLI)。然而,本文发现结构缺陷并不是主因,关键还是模型规模(scale)。
      1. examples from WiC benchmark
    1. emergent ability并不能通过cross-entropy来判断,在很多任务上,模型表现接近瞎猜,但其cross-entropy loss却一直在降低。
      1. Namely, cross-entropy loss improves even for small model scales where the downstream metrics (exact match, BLEU, and accuracy) are close to random and do not improve, which shows that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. However,
         
    1. scaling不是唯一的因素,也有14个任务大模型(LaMDA 137B、GPT-3 175B)表现不好,更小的模型(PaLM 62B)表现好。
    1. 这篇论文同样提到,FLAN『instruction tuning只对足够大的模型有效』的结论并不足够solid,后面有很多反例。
      1. Although Wei et al.(2022a) initially found that instruction-based finetuning only worked for 68B parameter or larger decoder-only models, Sanh et al. (2022) induced similar behavior in a 11B model with an encoder-decoder architecture, which typically has higher performance after finetuning than decoder-only architectures (Wang et al., 2022a). As another example, Ouyang et al. (2022) proposed a finetuning and reinforcement learning from human feedback approach for the InstructGPT models, which enabled a 1.3B model to outperform much larger models in human-rater evaluations on a broad set of use cases.

    LaMDA

    1. LaMDA是专注于对话的137B预训练模型,由于其训练语料中不仅包含对话数据,也包含通用的互联网文本,因此其也可以当做通用的语言模型。通过进一步地finetune,能够让模型生成更加safe和factual grounding的回复。
    1. LaMDA关键细节如下:
      1. 结构:Transformer Decoder
      2. 大小:137B (non-embedding)
      3. 语料:2.97B documents,1.12B dialogs,13.39B dialog utterance,共1.56T words
    1. LaMDA的decode策略采用sample-and-rank,具体参考Meena论文的3.4节。(疑问:从表述上看,似乎要decode N次,这样岂不是会极大增加infer的时延?我猜直接采用top-k sampling可能也够用。)
      1. (from Meena paper 3.4)Sample-and-rank, works as follows: First, we sample N independent candidate responses using plain random sampling with temperature T. Second, we select the candidate response with the highest probability to use as the final output.
        (from LaMDA paper 3) uses the same sample-and-rank strategy as Meena [17] for decoding. We first sample 16 independent candidate responses using top-k (k = 40) sampling (no temperature). The final output is the highest-scoring candidate, where the score is based on the candidate’s log-likelihood and its length.
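sample-and-rank 的过程用伪代码示意如下(generate / score 为假设的接口,打分依据是候选的对数似然与长度):

```python
def sample_and_rank(model, prompt, num_candidates=16, k=40):
    # 先用 top-k 采样独立生成若干候选,再按打分取最高者作为最终输出
    candidates = [model.generate(prompt, top_k=k) for _ in range(num_candidates)]
    return max(candidates, key=lambda resp: model.score(prompt, resp))
```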

    *PaLM

    PaLM: Scaling Language Modeling with Pathways, google, 2022
    1. PaLM发现,BIG-Bench里面25%的任务出现了emergent ability,不再遵循大模型训练的power-law
    1. PaLM的模型结构是Transformer Decoder(ps:看起来google自己也抛弃了T5的enc-dec结构?),并进行了以下改进(Google采用,说明这些优化对transformer是真的有用,值得关注):
      1. SwiGLU activation:目标是优化模型效果。SwiGLU的公式为Swish(xW) · xV,其中Swish(x) = x · sigmoid(x)(实现示意见本节列表后的代码)。
      2. Parallel Layers:把transformer block改为并行计算,能够实现了15%的训练加速。对于540B的大模型,效果能够持平。
        1. notion image
      3. multi-query attention:key、value不再使用多头,query还是映射成多头。能够显著节省decoding时间。
      4. RoPE embedding:采用RoFormer(RoPE)的位置编码方式,相比绝对、相对位置编码,效果都更好一些。
      5. Shared Input-Output Embeddings:共享输入embedding矩阵与输出softmax层的权重(weight tying),如今是常见的做法
      6. No Biases:去掉所有的bias,作者发现可以提升训练稳定性
      7. vocabulary:从训练数据中用SentencePiece算法得到,同样保留了whitespace(对代码而言很重要)
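SwiGLU 前馈层的最小示意实现(假设用 PyTorch,隐层维度为随意取值,bias 按 PaLM 的做法去掉):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN(x) = (Swish(xW) ⊙ xV) W_out,其中 silu(x) = x * sigmoid(x) 即 Swish。"""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)
        self.v = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w(x)) * self.v(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFFN(512, 2048)(x).shape)  # torch.Size([2, 16, 512])
```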
    1. 训练数据上,总共780B token,训练1个epoch。此外,PaLM在训练数据上加上了代码(包含Java,html,python等),总共196G(注意,PaLM是有CoT能力的,这很有可能来自其代码训练)。
      1. notion image
    1. Google定义了新的衡量训练效率的metric——MFU,且PaLM的MFU显著高于其他模型:
        • hardware FLOPs utilization(HFU):实际FLOPs 除以硬件理论峰值FLOPs
        • model FLOPs utilization(MFU):实际token吞吐量,除以硬件理论峰值FLOPs下的token吞吐量
        notion image
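MFU 的计算可以用下面的草稿示意(训练 FLOPs 按每 token 约 6·参数量 近似并忽略 attention 项;token 吞吐与硬件峰值均为假设数值,仅演示计算方式):

```python
params = 540e9                   # PaLM-540B 参数量
tokens_per_sec = 2.4e5           # 假设:整个集群的实际 token 吞吐
peak_flops = 6144 * 275e12       # 假设:6144 块加速器,每块峰值约 275 TFLOPs
mfu = (tokens_per_sec * 6 * params) / peak_flops
print(f"MFU ≈ {mfu:.1%}")        # 本组假设数值下约 46%
```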
    1. 训练设置(所有改造,都可能是训练大模型所需要的关键know-how,重点关注):
      1. weight initialization:除了embedding和layer norm的scale,其余权重全部用fan-in variance scaling,即 W ~ N(0, 1/n_in),输入维度越大,方差越小。
      2. optimizer:实际用的是Adafactor(不做分解),等价于Adam + parameter scaling(把学习率按参数矩阵的root-mean-square进行缩放),起到的实际作用跟GPT3一样,都是让实际学习率变小,只不过不同参数矩阵的lr能够进一步差异化。
        1. The model was trained with the Adafactor optimizer (Shazeer & Stern, 2018), without factorization. This is effectively equivalent to Adam (Kingma & Ba, 2014) with “parameter scaling,” which scales the learning rate by the root-mean-square of the parameter matrix. Because the weight initialization is proportional to 1/ √ n, the effect of this is similar to the manual scaling down of Adam learning rate as in Brown et al. (2020). However, parameter scaling has the benefit that parameter matrices which operate at different scales (the embeddings and layer norm scales) do not have their learning rate scaled down at the same rate.
      3. Optimization hyperparameters
      4. Loss function:标准LM loss,加上一个辅助loss(z-loss),能够增强训练稳定性。这个辅助loss是让softmax的归一化项log(Z)接近0(为什么这样做有效还没有想清楚;实现示意见本节列表后的代码)。
        1. We additionally use an auxiliary loss of z_loss = 10^{-4} · log²(Z) to encourage the softmax normalizer log(Z) to be close to 0, which we found increases the stability of training.
      5. Sequence length:PaLM的处理方式比较独特,把所有文本concate起来,用【eod】分割不同document,然后依次截断为2048。
      6. batch size:PaLM对batch size的设计也很独特,batch size随训练过程逐步变大(512 → 1024 → 2048),原因有二:
        1. 小batch更加的sample-efficient,即达到相同loss水平,小batch可以用更少token,这对于训练前期是很有必要的。大batch对梯度估计地更准,对于后期来说更重要。
        2. 大batch能够增大TPU利用率。
      7. bitwise determinism:对效果应该没有影响,只是能够严格保证可复现性,尤其是每个batch的数据只跟step有关,这一点后面有需要可以详细了解
      8. dropout:PaLM在预训练的时候没有加dropout
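z-loss 的一个最小实现示意(假设用 PyTorch,系数取论文中的 10^-4):

```python
import torch
import torch.nn.functional as F

def lm_loss_with_z_loss(logits, targets, z_coef=1e-4):
    """标准 LM loss + z-loss:鼓励 log(Z) = logsumexp(logits) 接近 0。"""
    log_z = torch.logsumexp(logits, dim=-1)                       # 每个位置的 log(Z)
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return ce + z_coef * (log_z ** 2).mean()

logits = torch.randn(2, 8, 32000)
targets = torch.randint(0, 32000, (2, 8))
print(lm_loss_with_z_loss(logits, targets))
```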
    1. 训练不稳定(training spikes):在训练PaLM的过程中,大约发生了20次loss的剧增(尽管使用了gradient clipping)。作者排除了脏数据的原因,认为只能是特定【数据batch与模型参数状态的组合】导致的。解决方案是回退到更早的checkpoint,然后跳过200-500个data batch。
      1. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state. In
     

    Prompt Tuning

    The power of scale for parameter-efficient prompt tuning, google, 2021
     

    Meta

    OPT

    OPT: Open Pre-trained Transformer Language Models, Meta, 2022
    1. 可以重点看Meta研究员Susan Zhang讲OPT训练过程的视频,里面讲了很多走过的弯路(比如因为bug改过一堆超参数,最后run out of time,只能在第十一轮拍一组参数硬跑)。看这个视频才能体会到,训一个大模型对工程能力的要求有多高。另外也说明这种小作坊式的research走不远(一共只有5个engineers,没有infra、system团队支持)。
        • 项目开始
        notion image
        • 第十一轮(迫不得已拍了一组参数):
          • notion image
        • 第十二轮:遇到一堆系统问题
          • notion image
    1. meta训练的一系列模型:
      1. notion image
    1. 公布出来的训练细节,看起来都属于标准操作(相比之下,google的PaLM把细节调得更加精细):
      1. 初始化:正态分布 N(0, 0.006)
      2. 激活函数:ReLU
      3. 最大长度: 2048
      4. 优化器:AdamW
      5. 训练资源:992张 80G的A100 GPU
      6. 精度:模型参数采用fp16,Adam state采用fp32
    1. 训练遇到的问题:
      1. 硬件故障:总共训练两个月,人工重启了35次,每次都要从最近的checkpoint热启。仔细想想,训练用了近1000张GPU,每一张卡都不出问题的概率很低。下图中每种颜色代表一段连续的训练区间,工程复杂度远超想象。
      2. loss发散:每次loss发散的时候,人工从最近的checkpoint热启,并且调低学习率
      3. notion image
    1. 效果上,OPT的论文为PaLM背书了:GPT-3和Chinchilla、Gopher训练的效果都差不多,但PaLM的效果则明显更好。

    LIMA

    1. 基本思想:模型的能力和知识是在预训练中得到的,SFT和RLHF只是浅层对齐,不需要特别多effort。作者只用了1000条精心挑选的样本finetune LLaMA,效果就很不错。
      1. We define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
    1. 1000条数据的来源如下(约800条从Stack Exchange、wikiHow、Reddit等已有资源中挑选,约200条由作者自己手写):
      1. notion image
    1. LIMA比用了52x更多数据的Alpaca效果更好,比用了RLHF的DaVinci003效果更好:
      1. notion image
    1. prompt的diversity很重要:如果diversity不增加,一味增加数据量对效果的提升并不明显。Diversity is more important than quantity:
      1. notion image

    LLaMa

    LLaMA: Open and Efficient Foundation Language Models
    1. 概述:用公开数据集训练的一系列模型(开源),其中最大的65B模型效果打平Chinchilla-70B和PaLM-540B
    1. Scaling Law研究的是training cost,实际上还应该考虑inference cost。把小模型训练到远超Chinchilla-optimal的token量,从training角度看可能不划算,但从inference角度看,哪怕效果只提升一点也是划算的。
    1. LLaMA 13B可以在单个GPU上跑,Meta的原文用到了democratize,很有意思:)
      1. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU.
    1. LLaMA指出,当前LLM的一大问题是:效果好的LLM利用了非公开数据,公开数据的LLM效果不好。LLaMA训练时只使用公开数据的,效果打平PaLM和Chinchilla:
      1. notion image
    1. 训练配置上跟PaLM很像:
      1. Pre-normalization [GPT3]
      2. SwiGLU activation function [PaLM]
      3. Rotary Embeddings[GPTNeo]
    1. 65B 模型训练资源:2048块80G A100 GPU

    其他

    T0 - Hugging Face

    Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, hugging face
    1. 跟同时期的FLAN是同一个思路:在任务ABCD上finetune,在E上预测。区别在于:
      1. 模型结构上:FLAN是个纯粹的Decoder,T0是基于T5的Encoder-Decoder结构。Encoder-Decoder的优势(可能也是劣势)是其不需要预测输入
        1. Unlike decoder-only language models such as GPT-3, it is never trained to generate the input.
          Since T5’s pretraining objective is generating tokens and only tokens that have been removed from the input text, it is different from the natural text generation format of prompted datasets. Therefore, we use Lester et al. (2021)’s LM-adapted T5 model (referred to as T5+LM), produced by training T5 on 100B additional tokens from C4 on a standard language modeling objective.
      2. 任务形式上:
        1. FLAN:训多个模型,每个评估一个held-out task
        2. T0:训一个模型,在多个held-out task上评估
      3. 输入形式上:FLAN 是instruction,T0是prompt。
        1. notion image
      4. 结果上,T0发现这种multitask prompted training在3B的小模型上效果也很好,这一点与FLAN的结论相违背。作者认为可能有两点因素:
        1. T0(T5)的预训练目标是MLM,FLAN是LM,作者认为已经有大量工作证明MLM是更有效的训练方式。对此我比较存疑,毕竟T5是google自己提出来的,google后来自己都切换到LM了;MLM优于LM,有可能仅在语料和模型都不够大的情况下成立。
        2. T0认为自己的prompt更加diverse(length和creativity)
     
     

    Gopher - Deepmind

    1. 280B 参数,除此之外没有特别值得关注的
      1. notion image
     

    *Chinchilla - DeepMind

    1. Chinchilla研究的核心问题是,在计算资源固定(i.e.,知道GPU数量和训练时间)的前提下,应该如何分配model size和训练语料数量,能够使得LM loss最小。即:
      1. notion image
    1. Chinchilla的发现与OpenAI的Scaling Laws论文不同:OpenAI的论文更侧重提升model size,Chinchilla发现model size和数据量应该等量分配。下图可以看到,Chinchilla的model size显著更小,而GPT-3等模型显著undertrained。
      1. Specifically, given a 10× increase computational budget, they suggests that the size of the model should increase 5.5× while the number of training tokens should only increase 1.8×. Instead, we find that model size and the number of training tokens should be scaled in equal proportions.
        notion image
    1. 最终Chinchilla只有70B大小,但training token则显著多于GPT-3等大模型
    notion image
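可以用『每个参数约 20 个 token』这一常被引用的近似,加上训练 FLOPs ≈ 6ND,粗估 compute-optimal 的配置(计算预算数值为假设,仅作示意):

```python
C = 5.76e23                   # 假设的计算预算(FLOPs),量级接近 Chinchilla 的训练计算量
N = (C / (6 * 20)) ** 0.5     # 由 C = 6 * N * D 且 D ≈ 20N 解出 N
D = 20 * N
print(f"N ≈ {N/1e9:.0f}B params, D ≈ {D/1e12:.1f}T tokens")  # ≈ 69B / 1.4T
```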

    Anthropic LLM

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    1. 跟ChatGPT非常接近,且比ChatGPT早5个月(InstructGPT是单轮QA,而Anthropic LLM是多轮对话,更像ChatGPT),但没有发布模型(当时还在提升模型安全性),导致ChatGPT出来后轰动世界、OpenAI拿到微软100亿美元投资。联想自己的经历,迅速上线、小步快跑迭代才是互联网的生存逻辑。
    1. 在线学习看起来效果很显著:
      1. notion image
    1. 这篇文章指出,在模型足够大时,RLHF并不需要交对齐税(仅对大模型成立):
    Alignment with Human Values Has Many Benefits and Essentially No Cost to Performance
    Smaller models experience severe 'alignment taxes' – their performance on a wide variety of evaluations declines after RLHF training. However, we find a variety of alignment bonuses, with our 13B and 52B RLHF-trained models performing better at zero-shot NLP evaluations, and the same at few-shot evaluations.
    notion image
    1. 与InstructGPT相比:
        • Anthropic只用RL,没有用SFT
        • Anthropic想要兼顾harmlessness和helpfulness,OpenAI没有关注harmlessness
        • Anthropic的RM直接用了最大的模型
        • Anthropic包含了online traning(一周更新一次)
    1. Reward model(又叫Preference Model,PM)的训练方式如下:
      1. 通用预训练,即正常训练一个GPT
      2. 领域预训练,在reddit 等网站上,用用户的upvote等数据构造大量的数据pair,用于训练打分模型 ,learning rate是第一阶段十分之一。
      3. 领域精调:在人工标注的数据上训练模型,learning rate是第一阶段百分之一。
      4. 在第二、三阶段,会在最后补一个end-of-context的特殊token,在上面对打分进行预测。
    1. reward model的准确率如下,可以看到其表现也是符合scaling law的(训练数据指数增长,performance线性增长)
      1. notion image
    1. 随着模型生成质量越来越高,RM模型也应该同步更新训练,这一点跟InstructGPT(循环进行第2、3步)是一样的。
      1. These observations also have an implication for RLHF training, namely that we should expect diminishing returns from further RLHF training once our policies achieve a sufficiently high PM score. This also motivates online training, so that we can update our PMs to stay on-distribution as RLHF policies improve.
    1. 如何检验RL是否真的在拟合人类偏好分布,而不只是过拟合RM?作者在训练RM时耍了个小花招,即用不同数据训练两个模型:training RM和test RM,前者用来训练policy,后者用来判断policy是否在test RM上也能取得比较高的分数。可以发现,在150K样本以内,两者比较一致,之后出现了过拟合,但test RM的分数也还在上涨。InstructGPT也有类似结论,即过拟合的问题不大。
      1. notion image
     
     

    Berkeley False Promise

    The False Promise of Imitating Proprietary LLMs
    1. 研究的核心问题:基于小模型,finetune ChatGPT的输出,可否逼近ChatGPT的效果?
    1. 结论:不能,finetune数据越多,后期与ChatGPT的gap反倒越大:
    notion image
    1. Small model的模仿能力很强,在同领域数据上效果能够表现得很好,似乎说明训练垂直领域的小模型是完全可行的。我的疑惑在于,小模型本身的参数规模大概率存不下所有的world-knowledge,因此即使在垂直领域,base小模型的domain knowledge也大概率是残缺的。通过imitate ChatGPT能够对齐,但能够弥补知识本身的缺失吗?
    For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy. Consequently,

    Self-Instruct

    SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions
    1. 基本思想:用LLM(论文中就是GPT-3自身)来产出instruction和instance,再用这些数据finetune GPT-3。整体流程如下。注意这个流程仅仅是产出数据的过程,不涉及finetune GPT-3的步骤。所谓self-instruct,是指在instruction generation这一步,用到的few-shot instruction有一部分来自模型自己此前产出的instruction(示意见下方代码):
      1. notion image
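instruction generation 这一步的示意(按论文描述,每轮从池子里采样 8 条指令做 few-shot,其中 6 条来自人工种子、2 条来自模型此前的产出;call_llm 为假设的接口,ROUGE-L 相似度过滤等细节从略):

```python
import random

def generate_new_instructions(seed_pool, model_pool, call_llm):
    demos = random.sample(seed_pool, 6) + random.sample(model_pool, min(2, len(model_pool)))
    prompt = "Come up with a series of tasks:\n" + "\n".join(
        f"Task {i + 1}: {ins}" for i, ins in enumerate(demos)) + f"\nTask {len(demos) + 1}:"
    return call_llm(prompt)  # 让模型续写出新的 instruction
```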
    1. 不仅可以用GPT4生成instance,甚至也可以用来生成instruction。具体是用few-shot的方式:
      1. notion image
        实际效果如下:
        notion image
    1. 用GPT4生成instance,仅需要提供instruction。注意,如果额外需要input,那么也可以用GPT4来生成,不需要自己写。另外作者做得更细一些,针对classification任务会额外指定label(不然模型可能永远只输出正例或负例)。我尝试了一个instruction,效果确实还不错:
      1. notion image
     

    Stanford Alpaca

    Alpaca: An instruction-following LLaMA model
    1. 基本思想:是否可以用OpenAI模型(text-davinci-003)生成的语料,finetune一个基于LLaMA的小模型?
    1. 与算法无关,需要注意imitation相关的商用是存在困难的:
      1. We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.
    1. 基本方法沿着self-instruct来做,有一些小改动:
      1. notion image
     

    Program of Thoughts - PoT

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
     
    1. 基本思想:在一些数学问题和金融问题上,不再让LM直接出答案,而是让其生成解决问题的代码,并调用python interpreter来得到最终答案。注意,这与CoT的不同是需要额外调用一次python:
      1. notion image
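PoT 的执行方式示意(题目与生成的代码为自拟例子):

```python
# 假设模型针对"1 万元本金、年利率 3%、存 5 年的利息是多少"生成了如下 Python 程序
generated_program = """
principal = 10000
rate = 0.03
years = 5
answer = principal * (1 + rate) ** years - principal
"""
scope = {}
exec(generated_program, scope)    # 额外调用一次 Python 解释器,把数值计算交给它
print(round(scope["answer"], 2))  # 1592.74
```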
     

    Emergent ability Mirage

    Are Emergent Abilities of Large Language Models a Mirage?
    1. 基本思想:认为emergent ability主要来源于评估指标的不平滑、不连续。
    1. Big-Bench里面出现emergent ability的任务,绝大多数都是用0/1式(不连续)的指标评估的:
      1. notion image
    1. 证伪emergent ability的实验:
      1. notion image
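论文的核心论点可以用一个简单的数值例子说明:per-token 准确率随规模平滑提升,但 exact match 这类非线性指标会呈现『涌现』式跃升(下列数值均为假设):

```python
import numpy as np

sizes = np.logspace(7, 11, 5)                       # 1e7 ~ 1e11 参数(假设)
per_token_acc = 1 - 0.5 * (sizes / 1e7) ** -0.35    # 假设的平滑 power-law 提升:0.50 → 0.98
exact_match = per_token_acc ** 20                    # 20 个 token 全对的概率:呈现突变式跃升
for n, a, em in zip(sizes, per_token_acc, exact_match):
    print(f"{n:.0e} params: token_acc={a:.2f}, exact_match={em:.4f}")
```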
    1. 进一步地,作者在CV任务里通过设计一个non-linear的评估指标,也可以人为创造出emergent ability:
      1. notion image
    1. Emergent ability的作者Jason Wei 的回应:
      1. exact match等指标是我们最关心的,更能反映任务本身做得好坏
        1. notion image
      2. 一些任务,如IPA(单词音标),即使用平滑的指标也出现kink:
        1. notion image
         

    资料

     
    • 兼顾深度和广度
    • 本文完成于 ChatGPT 上线之前的一个月
    • 张俊林写在ChatGPT出现后
    • hugging face出品
    • 张俊林对RLHF的思考,有参考价值
    • "没有技术是一蹴而就的,相比看到其不足,一个科学工作者更应该对其潜力有敏感性。"
    • 待阅读
    • chatgpt
    • 主要是生成图片
    • 作者应该是参与到ColossalAI复现工作了
    • 分为基础模型和instruction-finetuned模型,强推
    • top答案都看下,基本原因就是1)zero-shot更好 2)更好scale
    • DeepSpeed Chat: 一键式RLHF训练,让你的类ChatGPT千亿大模型提速省钱15倍
    • 作者读了几个框架的源码
    • 关于目前的prompt的一些方法
     
