overview

Problem: Current language models are autoregressive and don’t handle continual learning well. The biggest problems with naively continuing to train the network on new data are 1) catastrophic forgetting and 2) compute cost (though less important in some cases), especially with horizontal continual learning

current solutions

https://arxiv.org/abs/2404.16789

“some” regularization term added to the loss. The downside is that such a term is usually either too weak or too strong, and it can’t offer any guarantees against forgetting

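As a concrete reference point, here is a minimal sketch of this family of baselines: a simple L2 penalty pulling the weights toward a snapshot taken before training on new data. The names `l2_anchor_penalty`, `anchor_params`, and `reg_strength` are illustrative and not taken from the paper above.

```python
import torch

def l2_anchor_penalty(model, anchor_params, reg_strength=1e-3):
    """Quadratic penalty pulling current weights toward a frozen snapshot.

    anchor_params: dict mapping parameter name -> tensor captured before
    training on new data. reg_strength is the knob that is hard to tune
    (too weak -> forgetting, too strong -> nothing new is learned).
    """
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, p in model.named_parameters():
        if name in anchor_params:
            penalty = penalty + ((p - anchor_params[name].to(device)) ** 2).sum()
    return reg_strength * penalty

# hypothetical usage inside a training step:
#   anchor_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   ...later, on new data...
#   loss = task_loss + l2_anchor_penalty(model, anchor_params)
```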

inspiration for solution

architecture idea

idea: what if there was a large mixture-of-experts model (100k+ experts) where you could incrementally store data as it comes in, and functionally have a much wider network that could “construct its own layers” as it goes

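A minimal sketch of that direction, assuming a standard top-k token router over a growable pool of FFN experts. The class and method names (`GrowableMoE`, `add_experts`) are made up for illustration, and routing at the 100k+ scale would need something smarter than one flat softmax, e.g. the compositional routing sketched in the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GrowableMoE(nn.Module):
    """MoE feed-forward layer whose expert pool can grow over time.

    New experts are appended as new data arrives; the router is widened to
    match while its existing rows (and the old experts) are left intact.
    """

    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.d_model, self.d_ff, self.top_k = d_model, d_ff, top_k
        self.experts = nn.ModuleList(
            [self._make_expert(d_model, d_ff) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts, bias=False)

    @staticmethod
    def _make_expert(d_model, d_ff):
        return nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def add_experts(self, n_new):
        """Append n_new fresh experts and widen the router accordingly."""
        device = self.router.weight.device
        for _ in range(n_new):
            self.experts.append(self._make_expert(self.d_model, self.d_ff).to(device))
        new_router = nn.Linear(self.d_model, len(self.experts), bias=False).to(device)
        with torch.no_grad():
            new_router.weight[: self.router.out_features] = self.router.weight
        self.router = new_router

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # dispatch token subsets to their experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out
```

Calling `add_experts` when a new domain of data shows up is the “incrementally store data as it comes” behavior; pairing it with the freezing scheme under continual learning below keeps the old experts untouched.
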
main network

*correction: n^m possible expert combinations to route to, and k^m combinations active at once (the actual route sampling could be done differently, but the idea of combining experts is the same)

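One way to read the n^m / k^m combinatorics: m routing groups of n experts each, with each token picking the top k experts per group, so a composed path is one of n^m possibilities while only k per group (k^m combinations) are active at once. A minimal sketch under that assumption; the names and the residual composition across groups are my own choices, not a settled design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalMoE(nn.Module):
    """m routing groups of n experts each: a token's path picks top-k experts
    per group, so there are n**m possible composed paths while only m small
    routers and m*n expert blocks exist in memory."""

    def __init__(self, d_model, d_ff, n_experts_per_group=16, n_groups=4, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.groups = nn.ModuleList()
        self.routers = nn.ModuleList()
        for _ in range(n_groups):
            self.groups.append(nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts_per_group)
            ]))
            self.routers.append(nn.Linear(d_model, n_experts_per_group, bias=False))

    def forward(self, x):                                   # x: (tokens, d_model)
        for router, experts in zip(self.routers, self.groups):
            weights, idx = router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            update = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e in idx[:, slot].unique():
                    mask = idx[:, slot] == e
                    update[mask] += weights[mask, slot:slot + 1] * experts[int(e)](x[mask])
            x = x + update                                  # residual compose across groups
        return x
```

With n_experts_per_group=16 and n_groups=4 this already gives 16^4 = 65,536 possible paths from only 64 expert blocks, which is one reading of how the “much wider network” can exist without materializing 100k+ separate experts.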

continual learning

freeze the FFN experts and only train the router + a designated set of experts, or even add a constant KL-divergence penalty between the router’s distribution and the intended experts

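A minimal sketch of both options, assuming a layer shaped like the MoE sketches above (with `.experts` and `.router` attributes). `freeze_for_new_task` and `router_kl_penalty` are illustrative names, and the KL term is written in its cross-entropy form, which equals KL up to a constant because the target distribution is fixed.

```python
import torch
import torch.nn.functional as F

def freeze_for_new_task(moe_layer, trainable_expert_ids):
    """Freeze every expert except the ones reserved for the new data; the
    router stays trainable so it can learn to send new-domain tokens to
    the reserved experts."""
    for i, expert in enumerate(moe_layer.experts):
        for p in expert.parameters():
            p.requires_grad_(i in trainable_expert_ids)
    for p in moe_layer.router.parameters():
        p.requires_grad_(True)

def router_kl_penalty(router_logits, intended_expert_ids, strength=0.1):
    """Penalty pushing the router's distribution toward a uniform target
    over the experts intended for this task (strength is a made-up knob)."""
    target = torch.zeros_like(router_logits)
    target[:, intended_expert_ids] = 1.0 / len(intended_expert_ids)
    log_probs = F.log_softmax(router_logits, dim=-1)
    # cross-entropy between target and router distribution; equals
    # KL(target || router) plus a constant since the target is fixed
    return strength * -(target * log_probs).sum(dim=-1).mean()

# hypothetical usage: freeze_for_new_task(layer, trainable_expert_ids={120, 121})
# before fine-tuning, then add router_kl_penalty(logits, [120, 121]) to the loss.
```

Freezing gives a hard guarantee that the old experts’ weights don’t change (though routing to them can still shift); the KL version is softer but keeps the whole layer trainable.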

benchmarks/baselines