"Duration Prediction" existed in early TTS system as HMM-based models and even in Deep Voice 2 (2017).


Seq-to-seq models with an attention mechanism removed the need for an explicit duration predictor: the learned attention alignment implicitly decides how long each input token lasts.

Tacotron 2 (2018)


- seq-to-seq model
- auto-regressive decoder
- attention mechanism (location-sensitive)
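The location-sensitive attention above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not Tacotron 2's actual implementation: all weight names and shapes here are hypothetical, and the real model uses learned multi-channel convolutions and larger projections.

```python
import numpy as np

def location_sensitive_attention(query, memory, prev_align, W, V, U, loc_filter, v):
    """One step of (simplified) location-sensitive attention.

    query:      decoder state, shape (D,)
    memory:     encoder outputs, shape (T, H)
    prev_align: previous alignment weights, shape (T,)
    """
    # Location features: convolve the previous alignment so the model
    # "sees" where it attended last step and tends to move forward.
    loc = np.convolve(prev_align, loc_filter, mode="same")          # (T,)
    # Additive (Bahdanau-style) energies with the extra location term.
    e = np.tanh(query @ W + memory @ V + np.outer(loc, U)) @ v      # (T,)
    align = np.exp(e - e.max())
    align /= align.sum()                                            # softmax
    context = align @ memory                                        # (H,)
    return context, align

# Toy demo with random weights (a real model learns these).
rng = np.random.default_rng(0)
T, H, D, A = 6, 4, 3, 5
memory = rng.normal(size=(T, H))
query = rng.normal(size=(D,))
prev_align = np.full(T, 1.0 / T)
ctx, align = location_sensitive_attention(
    query, memory, prev_align,
    rng.normal(size=(D, A)), rng.normal(size=(H, A)),
    rng.normal(size=(A,)), np.array([0.25, 0.5, 0.25]),
    rng.normal(size=(A,)),
)
```

The key difference from plain additive attention is the `loc` term: the energies depend on where the decoder attended at the previous step, which biases the alignment toward smooth, roughly monotonic movement.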

Problems with auto-regressive, attention-based models:

- early cutoff
- repetition
- skipping

Efforts to improve the robustness of auto-regressive, attention-based models:

- adversarial training (Guo et al., 2019)
- regularization that encourages the forward and backward attention to be consistent (Zheng et al., 2019)
- Gaussian mixture model attention (Graves, 2013; Skerry-Ryan et al., 2018)
- forward attention (Zhang et al., 2018)
- stepwise monotonic attention (He et al., 2019)
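GMM attention attacks skipping and repetition by construction rather than by regularization: the mixture means that place the attention window can only move forward. A minimal NumPy sketch of one decoder step, after Graves (2013); the parameterization and names here are illustrative, not the exact formulation used in any of the cited papers.

```python
import numpy as np

def gmm_attention_step(raw_params, prev_mu, positions):
    """One decoder step of GMM attention (after Graves, 2013).

    raw_params: unconstrained network outputs, shape (K, 3)
    prev_mu:    mixture means from the previous step, shape (K,)
    positions:  encoder timestep indices, shape (T,)
    """
    w = np.exp(raw_params[:, 0])        # mixture weights
    beta = np.exp(raw_params[:, 1])     # inverse widths
    kappa = np.exp(raw_params[:, 2])    # step size; exp() keeps it positive,
    mu = prev_mu + kappa                # so the means only ever move forward
    # Unnormalized alignment: a sum of Gaussian bumps over encoder positions.
    align = (w[:, None]
             * np.exp(-beta[:, None] * (mu[:, None] - positions[None, :]) ** 2)
             ).sum(axis=0)              # (T,)
    return align, mu

# Toy demo: the means advance monotonically across decoder steps.
rng = np.random.default_rng(0)
K, T = 2, 10
positions = np.arange(T, dtype=float)
mu = np.zeros(K)
mus = []
for _ in range(3):
    align, mu = gmm_attention_step(rng.normal(size=(K, 3)), mu, positions)
    mus.append(mu.copy())
```

Because `mu` is updated by adding a strictly positive `kappa`, the attention window cannot jump backward (repetition) and, with reasonable step sizes, is unlikely to leap past inputs (skipping), which is the robustness argument behind this family of mechanisms.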