
So-Large-LM-Task06: Adaptation of Large Language Models

Feishu user 6850

Created on January 25

Adaptation

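Prompting a language model alone (for example, in-context learning) is already enough to complete some tasks. However, prompting does not work well for all downstream tasks; examples include NLI, QA, table-to-text generation, and parsing EHRs (electronic health records).

Downstream tasks may differ from the LM's training data (e.g. The Pile) in format or topic, or may require new knowledge to be incorporated over time. The LM therefore needs to be adapted to downstream tasks using task-specific data or domain knowledge.

(The main point is that to apply an LM to a vertical domain, the LM has to learn that domain's knowledge. RAG (arXiv:2005.11401) may not yet have been popular at the time, so adaptation was the default way to do this.)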

Why Adapt the Language Model?

• The LM is trained in a task-agnostic way.

• Downstream tasks can be very different from language modeling on The Pile.

As an example, consider the downstream task of NLI (natural language inference):

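Is the hypothesis entailed by the premise? (That is, does the premise imply the hypothesis? Note that the reverse direction clearly holds: "never seen an apple" implies "never seen an apple that is not red".)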
​
Premise: I have never seen an apple that is not red.​
Hypothesis: I have never seen an apple.​
Correct output: Not entailment (the reverse direction would be entailment)​

An NLI task like this, which only asks for a true/false answer, is not a natural fit for a general-purpose LM.
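To make the format mismatch concrete, here is a minimal sketch of how a sentence pair could be squeezed into a single string for a prompted LM, and how free-form output would have to be mapped back to a binary label. The names `NLIExample`, `to_prompt`, and `parse_label`, as well as the prompt wording, are assumptions made for illustration and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass
class NLIExample:
    premise: str
    hypothesis: str


def to_prompt(ex: NLIExample) -> str:
    """Serialize the sentence pair into one string an autoregressive LM can continue."""
    return (
        f"Premise: {ex.premise}\n"
        f"Hypothesis: {ex.hypothesis}\n"
        "Is the hypothesis entailed by the premise? Answer:"
    )


def parse_label(continuation: str) -> bool:
    """Map free-form generated text back onto the binary label (True = entailment)."""
    words = continuation.strip().lower().split()
    return bool(words) and words[0].strip(".,!") == "yes"


example = NLIExample(
    premise="I have never seen an apple that is not red.",
    hypothesis="I have never seen an apple.",
)
prompt = to_prompt(example)
```

The LM itself only ever sees `prompt` and emits tokens; the two-sentence input and the true/false output exist only in the wrapper code, which is one way to see why the task is unnatural for a plain LM.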

Ways downstream tasks can be different

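Concretely, downstream tasks can differ from general language modeling in the following ways:

• Format
  ◦ NLI: the language model takes two sentences, compares them, and produces a binary true/false output. This format is entirely different from standard LM tasks such as autoregressively generating the next token or filling in a [MASK] token.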
  ◦ Special tokens: BERT, for example, is trained with [MASK] tokens, while many downstream tasks do not use [MASK] tokens at all.

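• Topic shift
  ◦ The downstream task focuses on a new or highly specific topic, such as medicine or law.

• Temporal shift
  ◦ New knowledge: the downstream task needs knowledge that emerged after the LM finished training. For example, GPT-3 cannot provide information about events after Biden became president.
  ◦ Non-public knowledge: the knowledge the downstream task needs is not public. The LM's developers had no access to it at training time, so only those who hold the rights to the data can adapt the model.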

General Adaptation Setup

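We formalize adaptation mathematically. Suppose we have:

A pretrained LM with parameters $\theta_{\text{LM}}$.

A downstream dataset $(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$, sampled from the downstream task distribution $P_{\text{task}}$.

Parameters $\gamma$, drawn from a parameter family $\Gamma$ and optimized on a task loss $\ell_{\text{task}}$.

The parameter family $\Gamma$, which is either a subset of the existing parameters or a set of newly introduced parameters.

Adapted parameters $\gamma_{\text{adapt}}$, which parameterize the adapted model $p_{\text{adapt}}$ (a language model is, at bottom, a probability distribution).

Adaptation can therefore be formalized as

$$\gamma_{\text{adapt}} = \operatorname*{argmin}_{\gamma \in \Gamma} \frac{1}{n} \sum_{i=1}^{n} \ell_{\text{task}}(\gamma, \theta_{\text{LM}}, x^{(i)}, y^{(i)})$$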
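The optimization in this setup can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a real training loop: the "pretrained LM" is a hypothetical fixed scoring function with parameters `theta_lm`, the family `Gamma` is a small finite set of scalar candidates, and the task loss is binary cross-entropy. All names and data are chosen for illustration.

```python
import math


def frozen_lm_score(theta_lm, x):
    """Stand-in for the frozen pretrained LM: a fixed scalar feature of x."""
    return sum(w * xi for w, xi in zip(theta_lm, x))


def task_loss(gamma, theta_lm, x, y):
    """Cross-entropy of label y in {0, 1} under p(y=1|x) = sigmoid(gamma * score)."""
    z = gamma * frozen_lm_score(theta_lm, x)
    p = 1.0 / (1.0 + math.exp(-z))
    return -math.log(p if y == 1 else 1.0 - p)


def adapt(Gamma, theta_lm, dataset):
    """Return the gamma in Gamma that minimizes the average task loss."""
    def avg_loss(gamma):
        return sum(task_loss(gamma, theta_lm, x, y) for x, y in dataset) / len(dataset)
    return min(Gamma, key=avg_loss)


theta_lm = [1.0, -1.0]  # frozen LM parameters, never updated during adaptation
dataset = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([1.0, 0.5], 1), ([0.5, 1.0], 0)]
Gamma = [-2.0, -1.0, 0.0, 1.0, 2.0]  # a tiny finite parameter family
gamma_adapt = adapt(Gamma, theta_lm, dataset)
```

In practice `Gamma` is a continuous, high-dimensional space and the argmin is approximated with gradient-based optimization, but the structure is the same: only `gamma` is searched over, and the loss is averaged over the downstream dataset.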

Probing

Probing introduces a new set of parameters $\Gamma$ that define the family of probes, which are usually linear or shallow feedforward networks.

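Probing is typically used to inspect and understand a model's representations. For example, if a simple probe on top of the representations can predict part-of-speech (POS) tags, then the representations "store" POS information.

For adaptation, we train a probe (or prediction head) that maps the LM's last-layer representations to the output (e.g. a class label).

Probing mainly applies to encoder-only models, but it can also be used with decoder-only models (Liu et al. 2021).

(In other words, you append something like a linear layer after the last layer to fit the downstream task, much like changing the number of output classes of a classifier in ordinary machine learning.)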
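As a rough illustration of probing for adaptation, the sketch below trains a linear probe on top of frozen representations using only the Python standard library. The "LM" here is a hypothetical fixed embedding table with mean pooling (`EMBED`, `last_layer_repr`), standing in for a real model's last layer. The data, labels, and hyperparameters are all invented for the example.

```python
import math

# Hypothetical frozen "LM": a fixed embedding table; nothing here is ever updated.
EMBED = {0: [1.0, 0.0, 0.5], 1: [0.9, 0.1, -0.5],
         2: [0.0, 1.0, 0.5], 3: [0.1, 0.9, -0.5]}


def last_layer_repr(token_ids):
    """Mean-pool the fixed token vectors, standing in for a last-layer representation."""
    dim = len(EMBED[0])
    return [sum(EMBED[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]


def probe_prob(w, b, h):
    """Linear probe followed by a sigmoid: p(y = 1 | h)."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * hi for wi, hi in zip(w, h)) + b)))


def avg_loss(w, b, reps, labels):
    """Average binary cross-entropy of the probe over the dataset."""
    total = 0.0
    for h, y in zip(reps, labels):
        p = probe_prob(w, b, h)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(reps)


def train_probe(reps, labels, lr=1.0, steps=200):
    """Full-batch gradient descent on the average cross-entropy; only w and b change."""
    w, b = [0.0] * len(reps[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for h, y in zip(reps, labels):
            err = probe_prob(w, b, h) - y  # derivative of the loss w.r.t. the logit
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        w = [wi - lr * g / len(reps) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(reps)
    return w, b


inputs = [[0, 1, 0], [2, 3], [1, 1, 0], [3, 2, 2], [0], [3]]
labels = [1, 0, 1, 0, 1, 0]
reps = [last_layer_repr(x) for x in inputs]  # computed once; the "LM" stays frozen
w, b = train_probe(reps, labels)
predictions = [int(probe_prob(w, b, h) > 0.5) for h in reps]
```

Because the representations are computed once and never change, the only trainable parameters are the probe's `w` and `b`, which is what keeps probing cheap compared with updating the LM itself.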