overview
Problem: Current language models are autoregressive and don't handle continual learning well. The two biggest problems with simply training the network on new data are 1) forgetting and 2) compute cost (less important in some cases), especially with horizontal continual learning
current solutions
https://arxiv.org/abs/2404.16789
- Replay-Based Methods: Keep a buffer of past data that is replayed during training so the model doesn't forget. It's hard to determine which data is relevant to keep (or the model would need to learn that itself)—doesn't seem optimal
- Regularization-Based Methods: Functionally similar to momentum? The penalty regularizes the network so it can't make large updates, which also keeps it from remembering small details of the new data

The loss becomes the task loss + “some” regularization term—the downside being that the regularization term is probably either too weak or too strong, and it can't offer any guarantees (see the sketch after this list)
- Architecture changes: Seems the most promising; most approaches focus on handing tasks to specialized models ⇒ which can then be trained
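A rough sketch of what the regularization-based loss usually looks like, assuming an EWC-style quadratic penalty (the names `lam`, `fisher`, and `old_params` are illustrative, not from any specific paper):

```python
import torch

def continual_loss(model, task_loss, old_params, fisher, lam=1.0):
    """Task loss plus a quadratic penalty that discourages moving parameters
    that mattered for earlier tasks (EWC-style sketch, not a specific paper's code)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            # fisher[name] weights how important each parameter was before;
            # too small a lam -> forgetting, too large -> no plasticity.
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```

The "too weak or too strong" problem above shows up directly in `lam`: there's no principled way to pick it per data stream, so it offers no guarantee against forgetting or against blocking new learning.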
inspiration for solution
architecture idea
idea: what if there was a large mixture-of-experts model (100k+ experts) where you could incrementally store data as it comes in + functionally have a much wider network that could “construct its own layers” as it goes
main network

*correction: with m routing steps over n choices each, there are n^m possible experts to route to and k^m expert combinations active at once (the actual route sampling could be done differently, but the idea of combinations is the same; see the sketch after this list)
- would work well with continually streaming data sources
- possibly use tree routing steps like https://openreview.net/pdf?id=ySS7hH1smL to handle a larger number of experts
- the more efficient routing techniques demonstrated at the 1M-expert scale suggest scaling this far is possible
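A minimal sketch of the combinatorial routing idea (m routing heads, each picking top-k of n sub-keys, so n^m experts are addressable but only k^m combinations fire per token). This is an illustration under assumed shapes, not the routing from the linked papers:

```python
import itertools
import torch

class CombinatorialRouter(torch.nn.Module):
    """m routing heads, each scoring n sub-keys. An expert is addressed by one
    sub-key per head, so n**m experts are addressable while only k**m
    combinations are active per token (top-k per head)."""

    def __init__(self, d_model, n=32, m=3, k=2):
        super().__init__()
        self.n, self.m, self.k = n, m, k
        self.scorer = torch.nn.Linear(d_model, m * n)

    def forward(self, x):                                  # x: [batch, d_model]
        scores = self.scorer(x).view(-1, self.m, self.n)
        top = scores.topk(self.k, dim=-1)                  # [batch, m, k]
        batch = x.shape[0]
        expert_ids, weights = [], []
        for b in range(batch):
            # Cartesian product of per-head choices -> k**m active experts.
            for combo in itertools.product(range(self.k), repeat=self.m):
                eid, w = 0, 0.0
                for head, choice in enumerate(combo):
                    eid = eid * self.n + int(top.indices[b, head, choice])
                    w += float(top.values[b, head, choice])
                expert_ids.append(eid)
                weights.append(w)
        ids = torch.tensor(expert_ids).view(batch, -1)     # [batch, k**m]
        wts = torch.tensor(weights).view(batch, -1)
        return ids, wts
```

With the assumed defaults (n=32, m=3, k=2) this addresses 32,768 experts while activating only 8 combinations per token, which is the kind of ratio that would make the 100k+ expert idea tractable.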
continual learning

freeze the FFN experts and only train the router + specific experts, or even add a constant KL divergence between the router's distribution and the intended experts (see the sketch after this list)
- could functionally freeze layers and grow the number of experts dynamically
- would change the continual learning problem from a “forgetting” problem to a routing problem
- is it possible to find a deterministic way to merge routers pre/post training?
- could retrain routers using general data from before
- saves cost/compute and has a lower memory footprint
- increase in efficiency compared to scaling up
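A hedged sketch of the freezing idea above: existing experts stay frozen, only the router and any newly added experts train, and a KL penalty pulls the router's distribution toward an intended expert assignment. The MoE forward signature, the `experts` list attribute, and names like `intended_dist` / `new_expert_ids` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def continual_step(moe_layer, router, x, y, intended_dist, new_expert_ids, beta=0.1):
    """One training step: all experts are frozen except the newly added ones,
    the router stays trainable, and a KL term keeps the routing distribution
    close to the intended expert assignment for this data stream."""
    for p in moe_layer.parameters():
        p.requires_grad_(False)                      # freeze every expert...
    for eid in new_expert_ids:
        for p in moe_layer.experts[eid].parameters():
            p.requires_grad_(True)                   # ...except the new ones
    for p in router.parameters():
        p.requires_grad_(True)

    logits = router(x)                               # [batch, num_experts]
    route_logprobs = F.log_softmax(logits, dim=-1)
    out = moe_layer(x, route_logprobs.exp())         # assumed MoE forward signature
    task_loss = F.cross_entropy(out, y)
    # KL(intended || routed): penalize drifting away from the target routing.
    kl = F.kl_div(route_logprobs, intended_dist, reduction="batchmean")
    return task_loss + beta * kl
```

This is where the "forgetting problem becomes a routing problem" framing shows up concretely: old experts are never updated, so the remaining question is whether the router still sends old-distribution tokens to them after training on the new stream.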
benchmarks/baselines