overview
Problem: Current language models are autoregressive and don't handle continual learning well. The two biggest problems with simply training the network on new data are 1) forgetting and 2) compute cost (less important in some cases), especially with horizontal continual learning
current solutions
https://arxiv.org/abs/2404.16789
- Replay-Based Methods: Keep a buffer of past data that is replayed during training so the model doesn't forget. It's hard to determine which data is relevant to keep (or the model would need to learn that itself)—doesn't seem optimal
- Regularization-Based Methods: Functionally similar to momentum? The penalty regularizes the network so it can't make large updates, which also keeps it from remembering small details of the new data

The loss becomes the task loss + “some” regularization term—the downside being that the regularization term is probably either too weak or too strong, and it can't offer any guarantees (see the sketch after this list)
- Architecture changes: Seems the most promising; most approaches focus on handing tasks to specialized models ⇒ which can then be trained
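A rough sketch of what the regularization-based loss usually looks like, assuming an EWC-style quadratic penalty (the names `lam`, `fisher`, and `old_params` are illustrative, not from any specific paper):

```python
import torch

def continual_loss(model, task_loss, old_params, fisher, lam=1.0):
    """Task loss plus a quadratic penalty that discourages moving parameters
    that mattered for earlier tasks (EWC-style sketch, not a specific paper's code)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            # fisher[name] weights how important each parameter was before;
            # too small a lam -> forgetting, too large -> no plasticity.
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```

The "too weak or too strong" problem above shows up directly in `lam`: there's no principled way to pick it per data stream, so it offers no guarantee against forgetting or against blocking new learning.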
inspiration for solution
architecture idea
idea: what if there was a large mixture-of-experts model (100k+ experts) where you could incrementally store data as it comes in + functionally have a much wider network that could “construct its own layers” as it goes
main network

*correction: with m routing steps over n choices each, there are n^m possible experts to route to and k^m expert combinations active at once (the actual route sampling could be done differently, but the idea of combinations is the same; see the sketch after this list)
- would work well with continually streaming data sources
- possibly use tree routing steps like https://openreview.net/pdf?id=ySS7hH1smL to handle a larger number of experts
- the more efficient routing techniques demonstrated at the 1M-expert scale suggest scaling this far is possible
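A minimal sketch of the combinatorial routing idea (m routing heads, each picking top-k of n sub-keys, so n^m experts are addressable but only k^m combinations fire per token). This is an illustration under assumed shapes, not the routing from the linked papers:

```python
import itertools
import torch

class CombinatorialRouter(torch.nn.Module):
    """m routing heads, each scoring n sub-keys. An expert is addressed by one
    sub-key per head, so n**m experts are addressable while only k**m
    combinations are active per token (top-k per head)."""

    def __init__(self, d_model, n=32, m=3, k=2):
        super().__init__()
        self.n, self.m, self.k = n, m, k
        self.scorer = torch.nn.Linear(d_model, m * n)

    def forward(self, x):                                  # x: [batch, d_model]
        scores = self.scorer(x).view(-1, self.m, self.n)
        top = scores.topk(self.k, dim=-1)                  # [batch, m, k]
        batch = x.shape[0]
        expert_ids, weights = [], []
        for b in range(batch):
            # Cartesian product of per-head choices -> k**m active experts.
            for combo in itertools.product(range(self.k), repeat=self.m):
                eid, w = 0, 0.0
                for head, choice in enumerate(combo):
                    eid = eid * self.n + int(top.indices[b, head, choice])
                    w += float(top.values[b, head, choice])
                expert_ids.append(eid)
                weights.append(w)
        ids = torch.tensor(expert_ids).view(batch, -1)     # [batch, k**m]
        wts = torch.tensor(weights).view(batch, -1)
        return ids, wts
```

With the assumed defaults (n=32, m=3, k=2) this addresses 32,768 experts while activating only 8 combinations per token, which is the kind of ratio that would make the 100k+ expert idea tractable.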
continual learning

freeze the FFN experts and only train the router + specific experts, or even add a constant KL divergence between the router's distribution and the intended experts (see the sketch after this list)
- could functionally freeze layers and grow the number of experts dynamically
- would change the continual learning problem from a “forgetting” problem to a routing problem
- is it possible to find a deterministic way to merge routers pre/post training?
- could retrain routers using general data from before
- saves cost/compute and has a lower memory footprint
- increase in efficiency compared to scaling up
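A hedged sketch of the freezing idea above: existing experts stay frozen, only the router and any newly added experts train, and a KL penalty pulls the router's distribution toward an intended expert assignment. The MoE forward signature, the `experts` list attribute, and names like `intended_dist` / `new_expert_ids` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def continual_step(moe_layer, router, x, y, intended_dist, new_expert_ids, beta=0.1):
    """One training step: all experts are frozen except the newly added ones,
    the router stays trainable, and a KL term keeps the routing distribution
    close to the intended expert assignment for this data stream."""
    for p in moe_layer.parameters():
        p.requires_grad_(False)                      # freeze every expert...
    for eid in new_expert_ids:
        for p in moe_layer.experts[eid].parameters():
            p.requires_grad_(True)                   # ...except the new ones
    for p in router.parameters():
        p.requires_grad_(True)

    logits = router(x)                               # [batch, num_experts]
    route_logprobs = F.log_softmax(logits, dim=-1)
    out = moe_layer(x, route_logprobs.exp())         # assumed MoE forward signature
    task_loss = F.cross_entropy(out, y)
    # KL(intended || routed): penalize drifting away from the target routing.
    kl = F.kl_div(route_logprobs, intended_dist, reduction="batchmean")
    return task_loss + beta * kl
```

This is where the "forgetting problem becomes a routing problem" framing shows up concretely: old experts are never updated, so the remaining question is whether the router still sends old-distribution tokens to them after training on the new stream.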
benchmarks/baselines