More FLOPs = lower training loss
Loss is a function of: model size (number of parameters) and number of training tokens
Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
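A rough back-of-the-envelope sketch of this trade-off, assuming the commonly quoted approximations $C \approx 6ND$ and a compute-optimal ratio of about 20 training tokens per parameter (both constants are assumptions from the scaling-laws literature, not derived in these notes):

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C between parameters N and training tokens D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N
    (Chinchilla-style heuristics; treat the constants as rough).
    """
    # C = 6 * N * D and D = r * N  =>  C = 6 * r * N^2
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):
        n, d = compute_optimal_allocation(c)
        print(f"C={c:.0e}: ~{n:.2e} params, ~{d:.2e} tokens")
```

The key point the sketch illustrates: for a fixed budget, parameters and tokens should grow together rather than putting all the compute into a larger model.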
Byte Pair Encoding (BPE): repeatedly finds the most frequent consecutive pair of existing tokens and merges it into a new token
Successive merge rules are applied until the desired vocabulary size is reached
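A minimal toy sketch of BPE training on a word-frequency dictionary (byte-level fallback, word boundaries, and other real-world details are omitted; the function name is made up for illustration):

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.

    Each word starts as a tuple of characters; returns the learned merges
    in the order they were applied (most frequent pair first).
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge rule: fuse every occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```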
SentencePiece: tokenizes raw text directly (whitespace is kept as the ▁ symbol), so no language-specific pre-tokenization is needed; supports BPE and unigram models
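A minimal usage sketch with the sentencepiece Python package; the corpus path, model prefix, and vocabulary size below are placeholder assumptions:

```python
import sentencepiece as spm

# Train a BPE model directly on raw text; "corpus.txt" is a placeholder path.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_demo",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and encode/decode without any pre-tokenization.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello world", out_type=str))   # e.g. ['▁Hello', '▁world']
print(sp.decode(sp.encode("Hello world")))      # round-trips back to the text
```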
Compute attention weights from queries and keys ($d_k$ is the head dimension): $\text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)$; the weights are then applied to the values $V$
Dividing by $\sqrt{d_k}$ keeps the variance of the dot products independent of the head dimension, which prevents the softmax from saturating
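A small NumPy sketch of single-head scaled dot-product attention (no masking; the shapes in the example are arbitrary assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```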
Transformers Explained Visually (Part 1): Overview of Functionality
Transformers Explained Visually (Part 2): How it works, step-by-step
Transformers Explained Visually (Part 3): Multi-head Attention, deep dive