I've been working on a Python implementation that uses Gradient Boosted Decision Trees (LightGBM/Treelite) instead of a neural network for the value/policy models:
It's mostly to understand how AlphaZero and friends work. I'm also curious how well a GBDT can do, and whether there are self-play techniques that can accelerate training.
The nice thing about a GBDT is that, unlike a NN, it can serve thousands of value/policy lookups per second on a single core. That should make it cheaper to scale self-play and run a lot of self-play experiments (assuming that what you learn about self-play with the GBDT model transfers to the more powerful NN in these environments).
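To give a rough idea of the setup (this is just an illustrative sketch, not my actual code; the board encoding and labels below are random placeholders), the value-model side with LightGBM looks something like this:

```python
import numpy as np
import lightgbm as lgb

# Placeholder training data: each row is a flattened board encoding,
# each label is the game outcome from that position (-1, 0, or +1).
X = np.random.randint(0, 3, size=(10000, 42)).astype(np.float32)
y = np.random.choice([-1.0, 0.0, 1.0], size=10000)

# Train a small GBDT as the value model.
value_model = lgb.train(
    {"objective": "regression", "num_leaves": 64, "learning_rate": 0.1},
    lgb.Dataset(X, label=y),
    num_boost_round=200,
)

# Inference is just a tree walk -- no GPU, no batching required -- so a
# single core can serve a lot of lookups per second. The booster can also
# be compiled with Treelite for even lower-overhead predictions.
value_estimate = value_model.predict(X[:1])
```

The policy model is the same idea, just trained against MCTS visit counts instead of game outcomes.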
If you're curious about accelerating self-play training, check out David Wu's work (https://arxiv.org/pdf/1902.10565.pdf). He's the creator of KataGo. I implemented his "Playout Cap Randomization" technique in my implementation above and, sure enough, it's much more efficient: https://imgur.com/a/epaKtDY. It seems like it's still early days in terms of how efficient self-play training can be made.
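For anyone who hasn't read the paper, the idea behind Playout Cap Randomization is simple: most self-play moves get a cheap, low-playout search just to move the game along, and only a random fraction get the full search, with only those full-search positions kept as policy training targets. Here's a rough sketch of the self-play loop; the `run_mcts` / `select_move` hooks and the `state` interface are hypothetical stand-ins for whatever MCTS and game representation you already have, and the constants are illustrative rather than the paper's exact settings:

```python
import random

FULL_PLAYOUTS = 600    # expensive search, produces good policy targets
FAST_PLAYOUTS = 100    # cheap search, only used to pick the next move
FULL_SEARCH_PROB = 0.25

def self_play_game(state, run_mcts, select_move):
    """Play one self-play game with Playout Cap Randomization.

    `run_mcts(state, playouts, add_noise)` and `select_move(visit_counts)`
    are hypothetical hooks into your own MCTS implementation.
    """
    training_examples = []
    while not state.is_terminal():
        if random.random() < FULL_SEARCH_PROB:
            # Full search: exploration noise on, and the position is
            # recorded as a policy training target.
            visit_counts = run_mcts(state, FULL_PLAYOUTS, add_noise=True)
            training_examples.append((state.encode(), visit_counts))
        else:
            # Fast search: noise off, and the position is NOT recorded --
            # it exists only to advance the game cheaply.
            visit_counts = run_mcts(state, FAST_PLAYOUTS, add_noise=False)
        state = state.play(select_move(visit_counts))
    # Value targets (the final game result) get attached to the recorded
    # positions afterwards, same as in vanilla AlphaZero.
    return training_examples, state.result()
```

The payoff is that you spend the expensive, high-quality searches only on the positions that actually become training data, while the rest of each game is played at a fraction of the cost.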