More FLOPs = lower training loss
Loss is a function of: model size (number of parameters) and number of training tokens
Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
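A rough back-of-the-envelope sketch of this trade-off, assuming the commonly quoted approximations $C \approx 6ND$ and a compute-optimal ratio of about 20 training tokens per parameter (both constants are assumptions from the scaling-laws literature, not derived in these notes):

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C between parameters N and training tokens D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N
    (Chinchilla-style heuristics; treat the constants as rough).
    """
    # C = 6 * N * D and D = r * N  =>  C = 6 * r * N^2
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e23, 1e25):
        n, d = compute_optimal_allocation(c)
        print(f"C={c:.0e}: ~{n:.2e} params, ~{d:.2e} tokens")
```

The key point the sketch illustrates: for a fixed budget, parameters and tokens should grow together rather than putting all the compute into a larger model.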
Byte Pair Encoding (BPE): repeatedly finds the most frequent consecutive pair of existing tokens and merges it into a new token
Successive merge rules are applied until the desired vocabulary size is reached
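A minimal toy sketch of BPE training on a word-frequency dictionary (byte-level fallback, word boundaries, and other real-world details are omitted; the function name is made up for illustration):

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.

    Each word starts as a tuple of characters; returns the learned merges
    in the order they were applied (most frequent pair first).
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge rule: fuse every occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```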
SentencePiece: tokenizes raw text directly (whitespace is kept as the ▁ symbol), so no language-specific pre-tokenization is needed; supports BPE and unigram models
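A minimal usage sketch with the sentencepiece Python package; the corpus path, model prefix, and vocabulary size below are placeholder assumptions:

```python
import sentencepiece as spm

# Train a BPE model directly on raw text; "corpus.txt" is a placeholder path.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_demo",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and encode/decode without any pre-tokenization.
sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello world", out_type=str))   # e.g. ['▁Hello', '▁world']
print(sp.decode(sp.encode("Hello world")))      # round-trips back to the text
```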
Compute attention weights from queries and keys ($d_k$ is the head dimension): $\text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)$; the weights are then applied to the values $V$
Dividing by $\sqrt{d_k}$ keeps the variance of the dot products independent of the head dimension, which prevents the softmax from saturating
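A small NumPy sketch of single-head scaled dot-product attention (no masking; the shapes in the example are arbitrary assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```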
Transformers Explained Visually (Part 1): Overview of Functionality
Transformers Explained Visually (Part 2): How it works, step-by-step
Transformers Explained Visually (Part 3): Multi-head Attention, deep dive