
More FLOPs = lower training loss

Training loss is a function of:

  1. Number of parameters
  2. Amount of data (tokens) you train on

Given a fixed FLOPs budget, how should one trade off model size against the number of training tokens?
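A minimal sketch of one way to split the budget, under two assumptions not stated in these notes: the common approximation that training compute is $C \approx 6ND$ FLOPs for $N$ parameters and $D$ tokens, and the Chinchilla-style heuristic of roughly 20 tokens per parameter. The constants are heuristics, not exact published values.

```python
def compute_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget C into model size N and token count D.

    Assumes C = 6*N*D with D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)) and D = C / (6 * N).
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = flops_budget / (6.0 * n_params)
    return n_params, n_tokens

# Example: a 1e21 FLOPs budget.
n, d = compute_optimal(1e21)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")  # ~2.89e9 params, ~5.77e10 tokens
```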


Tokenisation

Byte Pair Encoding: repeatedly finds the most frequent consecutive pair of existing tokens and merges it into a single new token.

Successive merge rules are applied until the desired vocabulary size is reached, as in the sketch below.
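A minimal character-level BPE training sketch (the toy corpus and vocabulary size are illustrative choices, not from these notes):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count every consecutive token pair and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_train(text, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size`."""
    tokens = list(text)               # start from individual characters
    vocab = set(tokens)               # base vocabulary: the characters
    merges = []
    while len(vocab) < vocab_size and len(tokens) > 1:
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        vocab.add(pair[0] + pair[1])  # each merge adds one new token
        tokens = merge_pair(tokens, pair)
    return merges

print(bpe_train("low lower lowest", vocab_size=12))
# e.g. [('l', 'o'), ('lo', 'w'), (' ', 'low'), ...]
```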

SentencePiece: treats the input as a raw character stream (whitespace becomes an ordinary symbol, ▁), so tokenisation is reversible and language-independent; it implements both BPE and unigram subword algorithms.
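A short usage sketch with the sentencepiece Python package; the file names and vocabulary size are placeholders:

```python
import sentencepiece as spm

# Train a unigram model on a raw text file ("corpus.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("Hello world", out_type=str)
print(pieces)             # e.g. ['▁Hello', '▁world']
print(sp.decode(pieces))  # round-trips to the original text
```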

Attention Mechanism

Compute attention scores from queries and keys ($d_k$ is the head size): $\text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)$. The scores then weight the values: $\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$.

Dividing by $\sqrt{d_k}$ makes the variance of the dot products independent of the head size: if query and key components have unit variance, $q \cdot k$ has variance $d_k$, so the scaling renormalises it to 1 and keeps the softmax from saturating.
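A minimal NumPy sketch of single-head scaled dot-product attention; the shapes and the max-subtraction softmax trick are my own choices, not from these notes:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K are (seq, d_k); V is (seq, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of values

# Tiny sanity check: 4 positions, head size 8, value size 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```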

Transformers Explained Visually (Part 1): Overview of Functionality

Transformers Explained Visually (Part 2): How it works, step-by-step

Transformers Explained Visually (Part 3): Multi-head Attention, deep dive