Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU
Under review as a conference paper at Interspeech 2022, pdf (code will be made publicly available shortly)
Abstract
Recently, score-based diffusion probabilistic modeling has shown encouraging results in various tasks, outperforming other popular generative modeling frameworks in terms of quality. However, to unlock its potential and make diffusion models practical, special effort is needed to enable a more efficient iterative sampling procedure on CPU devices. In this paper, we focus on accelerating Grad-TTS, a diffusion-based text-to-speech system, by applying the most promising techniques from the recent literature on diffusion modeling. We compare various reverse diffusion sampling schemes, progressive distillation, GAN-based diffusion modeling, and score-based generative modeling in latent space. Experimental results demonstrate that Grad-TTS can be sped up by up to 4.5 times compared to vanilla Grad-TTS, achieving a real-time factor of 0.15 on CPU while keeping synthesis quality competitive with that of conventional text-to-speech baselines.
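For readers unfamiliar with the metric, the real-time factor (RTF) quoted above is conventionally the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. The sketch below only illustrates that definition. The function names and the 10-second example utterance are assumptions made for illustration and are not taken from the paper or its code.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Return the wall-clock time needed to synthesize audio at a given RTF."""
    return rtf * audio_seconds


# Hypothetical example: a 10-second utterance synthesized in 1.5 seconds
# corresponds to an RTF of 0.15, the figure reported in the abstract.
example_rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=10.0)
example_time = synthesis_time(rtf=0.15, audio_seconds=10.0)
```

Under this definition, an RTF of 0.15 means that each second of speech takes about 0.15 seconds of CPU time to generate.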
Examples from MOS evaluation
Note: HiFi-GAN is used as the vocoder. For the best experience, listen to the audio samples with headphones.
Text: | In Fort Worth, there occurred a breach of discipline by some members of the Secret Service who were officially traveling with the President. | The Secret Service had received from the FBI some nine thousand reports on members of the Communist Party. | Might have been more alert in the Dallas motorcade if they had retired promptly in Fort Worth. |
Ground truth | |||
Grad-TTS-Vanilla-10 | |||
Grad-TTS-ML-4 | |||
Grad-TTS-Distilled-2 | |||
GAN-DPM-4 | |||
LSGM-NVAE-ML-4 | |||
Glow-TTS | |||
FastSpeech2 |
Text: | It has also used other Federal law enforcement agents during Presidential visits to cities in which such agents are stationed. | Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car. | A formal and thorough description of the responsibilities of the advance agent is now in preparation by the Service. |
Ground truth | |||
Grad-TTS-Vanilla-10 | |||
Grad-TTS-ML-4 | |||
Grad-TTS-Distilled-2 | |||
GAN-DPM-4 | |||
LSGM-NVAE-ML-4 | |||
Glow-TTS | |||
FastSpeech2 |
March 2022