"Duration Prediction" existed in early TTS system as HMM-based models and even in Deep Voice 2 (2017).


Seq-to-seq models with an attention mechanism removed the need for an explicit duration predictor: the learned attention alignment implicitly decides how long each input token lasts.

Tacotron 2 (2018)


- seq-to-seq model
- auto-regressive decoder
- attention mechanism (location-sensitive)
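The location-sensitive attention above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not Tacotron 2's actual implementation: all weight names and shapes here are hypothetical, and the real model uses learned multi-channel convolutions and larger projections.

```python
import numpy as np

def location_sensitive_attention(query, memory, prev_align, W, V, U, loc_filter, v):
    """One step of (simplified) location-sensitive attention.

    query:      decoder state, shape (D,)
    memory:     encoder outputs, shape (T, H)
    prev_align: previous alignment weights, shape (T,)
    """
    # Location features: convolve the previous alignment so the model
    # "sees" where it attended last step and tends to move forward.
    loc = np.convolve(prev_align, loc_filter, mode="same")          # (T,)
    # Additive (Bahdanau-style) energies with the extra location term.
    e = np.tanh(query @ W + memory @ V + np.outer(loc, U)) @ v      # (T,)
    align = np.exp(e - e.max())
    align /= align.sum()                                            # softmax
    context = align @ memory                                        # (H,)
    return context, align

# Toy demo with random weights (a real model learns these).
rng = np.random.default_rng(0)
T, H, D, A = 6, 4, 3, 5
memory = rng.normal(size=(T, H))
query = rng.normal(size=(D,))
prev_align = np.full(T, 1.0 / T)
ctx, align = location_sensitive_attention(
    query, memory, prev_align,
    rng.normal(size=(D, A)), rng.normal(size=(H, A)),
    rng.normal(size=(A,)), np.array([0.25, 0.5, 0.25]),
    rng.normal(size=(A,)),
)
```

The key difference from plain additive attention is the `loc` term: the energies depend on where the decoder attended at the previous step, which biases the alignment toward smooth, roughly monotonic movement.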

Problems with auto-regressive, attention-based models:

- early cutoff
- repetition
- skipping

Efforts to improve the robustness of auto-regressive, attention-based models:

- adversarial training (Guo et al., 2019)
- regularization that encourages the forward and backward attention to be consistent (Zheng et al., 2019)
- Gaussian mixture model attention (Graves, 2013; Skerry-Ryan et al., 2018)
- forward attention (Zhang et al., 2018)
- stepwise monotonic attention (He et al., 2019)
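GMM attention attacks skipping and repetition by construction rather than by regularization: the mixture means that place the attention window can only move forward. A minimal NumPy sketch of one decoder step, after Graves (2013); the parameterization and names here are illustrative, not the exact formulation used in any of the cited papers.

```python
import numpy as np

def gmm_attention_step(raw_params, prev_mu, positions):
    """One decoder step of GMM attention (after Graves, 2013).

    raw_params: unconstrained network outputs, shape (K, 3)
    prev_mu:    mixture means from the previous step, shape (K,)
    positions:  encoder timestep indices, shape (T,)
    """
    w = np.exp(raw_params[:, 0])        # mixture weights
    beta = np.exp(raw_params[:, 1])     # inverse widths
    kappa = np.exp(raw_params[:, 2])    # step size; exp() keeps it positive,
    mu = prev_mu + kappa                # so the means only ever move forward
    # Unnormalized alignment: a sum of Gaussian bumps over encoder positions.
    align = (w[:, None]
             * np.exp(-beta[:, None] * (mu[:, None] - positions[None, :]) ** 2)
             ).sum(axis=0)              # (T,)
    return align, mu

# Toy demo: the means advance monotonically across decoder steps.
rng = np.random.default_rng(0)
K, T = 2, 10
positions = np.arange(T, dtype=float)
mu = np.zeros(K)
mus = []
for _ in range(3):
    align, mu = gmm_attention_step(rng.normal(size=(K, 3)), mu, positions)
    mus.append(mu.copy())
```

Because `mu` is updated by adding a strictly positive `kappa`, the attention window cannot jump backward (repetition) and, with reasonable step sizes, is unlikely to leap past inputs (skipping), which is the robustness argument behind this family of mechanisms.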