OpenAI API Note

Introducing GPT-3.5

Shortcut: ChatGPT


Language Models are Few-Shot Learners

Paper address

Abstract Takeaway

Task-specific datasets and fine-tuning on large text corpora are still the norm today. Taking human ability as the benchmark, task-agnostic behavior and strong few-shot performance are the two main goals. Scaling the language model up to 175 billion parameters (trained on a very large amount of data) is the key ingredient here.

Problems to note: methodological issues related to training on very large web corpora, and failures on some specific datasets.

Strong across the board: translation, question answering, cloze tasks, on-the-fly reasoning, and domain adaptation.

Text generation based on input prompts/conditions: news articles, product descriptions, creative writing, etc. (text completion), and language translation as well.
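Since all of these capabilities are exposed through plain prompt-driven completion, here is a minimal sketch using the OpenAI Python client; the model name and prompt are placeholder choices of mine, not anything prescribed by the paper.

```python
# Minimal prompt-driven text completion with the OpenAI Python client (>= 1.0).
# The model name and prompt below are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Write a short product description for a ceramic travel mug.",
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
```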

Introduction Takeaway

Historical progression:

Single-layer representations (word vectors) → task-specific architectures

RNNs with multi-layer representations + contextual state → task-specific architectures

Pre-trained recurrent or transformer language models → task-agnostic architectures, but still task-specific datasets & fine-tuning

The remaining limitation, the need for task-specific datasets & fine-tuning, is what we would like to remove.

Meta-learning is one possible solution: repeated sub-tasks (in-context learning) can be embedded within a single sequence and shared across sequences (a sequence here corresponds to one forward pass in the paper). Still, fine-tuning results remain the winner so far.
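To make "repeated sub-tasks embedded within a single sequence" concrete, here is a rough illustration of my own (loosely modeled on Figure 1.1 of the paper): the unsupervised outer loop only predicts the next token, but each sequence happens to contain many instances of one inner sub-task.

```python
# Each training sequence happens to contain repeated instances of one sub-task
# (arithmetic, spelling correction, translation, ...), so ordinary next-token
# prediction over the sequence implicitly practices that sub-task in-context.
sequences = [
    "5 + 8 = 13\n7 + 2 = 9\n1 + 0 = 1\n3 + 4 = 7",              # arithmetic
    "gaot => goat\nsakne => snake\nbrid => bird\nfsih => fish",  # spelling correction
    "thanks => merci\nhello => bonjour\nmint => menthe",         # translation
]
```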

Note the intended scope of this meta-learning / in-context learning framing: it deliberately does not distinguish whether the model learns a new task from scratch at inference time or simply recognizes patterns correlated with its training samples.

Another way to ease the limitation is scaling up training with ever larger transformer language models (100 million params → 300 million → 1.5 billion → 8 billion → 11 billion → 17 billion). This scaling practice has shown that log loss improves smoothly with scale and correlates well with performance on many downstream tasks.
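The smooth improvement of log loss with scale refers to the power-law behavior reported in the scaling-law work the paper builds on (Kaplan et al., 2020); roughly, as a function of parameter count N:

```latex
% Approximate power-law form of validation cross-entropy loss vs. model size,
% as reported by Kaplan et al. (2020); N_c and \alpha_N are fitted constants
% and are not quoted here from the GPT-3 paper itself.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```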

So can in-context learning also benefit from this scaling strategy? The paper tests this by training GPT-3 and measuring its in-context learning ability.

Three settings are evaluated for this test (a minimal prompt-construction sketch follows the list):

a) few-shot learning / in-context learning: a few demonstrations (i.e. input-output pairs showing what the prompt looks like and what kind of output is preferred) are given to the model as part of its input at inference time, with no weight updates; the number of demonstrations ranges from roughly 10 to 100 (i.e. as many examples as fit in the model's context window).

b) one-shot learning: one demonstration only.

c) zero-shot learning: no demonstrations, only a natural-language instruction (e.g. telling the model directly, 'Generate a summary of the attached article').
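As a concrete sketch (my own, not from the paper), the three settings differ only in how many demonstrations are packed into the prompt; the translation task and example pairs below follow the paper's illustrative figure, while `build_prompt` is a hypothetical helper.

```python
# Sketch of zero-/one-/few-shot prompt construction. No weight updates occur
# in any setting; the demonstrations are just part of the input sequence.

def build_prompt(instruction, demonstrations, query):
    """Assemble instruction + k demonstration pairs + the actual query."""
    lines = [instruction]
    for source, target in demonstrations:   # k = 0, 1, or ~10-100 pairs
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")             # the model completes this line
    return "\n".join(lines)

instruction = "Translate English to French."
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(instruction, [], "peppermint")            # c) instruction only
one_shot  = build_prompt(instruction, examples[:1], "peppermint")  # b) one demonstration
few_shot  = build_prompt(instruction, examples, "peppermint")      # a) several demonstrations
print(few_shot)
```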

Quick outline of developing a model: architecture → datasets → training → post-training (fine-tuning / adapting to the desired task). Across this process there is a trend of moving from task-specific to task-agnostic, i.e. generalizing the model so that it can serve as the basis for solutions across many settings.

Methodologies Takeaway