Leveraging Speaker Embeddings from Speaker Verification for Controllable Multispeaker Text-to-Speech

Authors

  • Nilam Thakkar
  • Shruti Yagnik
  • Tripti Sharma

Abstract

We present a hybrid multispeaker text-to-speech (TTS) system that combines the Librosa library, Gaussian mixture models (GMMs), and neural TTS. The neural TTS component synthesizes speech in the voices of many different speakers, including speakers not seen during training. The system consists of three independently trained components: (1) a speaker encoder, pre-trained on a speaker-verification task using a standalone dataset of untranscribed speech from thousands of speakers, which can verify a target speaker from only a short reference clip and produces a fixed-length embedding vector; (2) a Tacotron 2-based sequence-to-sequence synthesizer that generates a mel-spectrogram from text, conditioned on the speaker embedding; and (3) an autoregressive WaveNet vocoder that converts the mel-spectrogram into a time-domain waveform. We demonstrate that discriminative pre-training of the speaker encoder on large-scale, speaker-diverse data transfers knowledge of speaker variability to the multispeaker TTS task, enabling high-quality synthesis even for unseen speakers. We also measure the benefit of large, heterogeneous speaker datasets for improved generalization. Moreover, conditioning the synthesizer on randomly sampled speaker embeddings yields novel voices distinct from those in the training set, suggesting that the model has learned robust speaker representations. Finally, our custom Librosa-based layers extract the required acoustic features, improving the efficiency of feature computation, and our custom GMM layers suppress noise in the raw audio by removing noisy features.
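The abstract describes a speaker encoder that produces fixed-length embedding vectors, which can be compared against a stored reference embedding to verify a target speaker. A minimal sketch of such a comparison, assuming cosine similarity and an illustrative acceptance threshold (neither is specified in the abstract):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two fixed-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_enrolled, emb_test, threshold=0.75):
    """Accept the test clip as the enrolled speaker if the similarity
    of the two embeddings clears the threshold.

    The 0.75 threshold is a placeholder for illustration, not a value
    taken from the paper; in practice it would be tuned on held-out
    verification trials.
    """
    return cosine_similarity(emb_enrolled, emb_test) >= threshold
```

In a full pipeline the same embedding that passes verification would then condition the Tacotron 2 synthesizer, so the verification and synthesis stages share a single speaker representation.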

Published

2025-09-25

How to Cite

Nilam Thakkar, Shruti Yagnik, & Tripti Sharma. (2025). Leveraging Speaker Embeddings from Speaker Verification for Controllable Multispeaker Text-to-Speech. Utilitas Mathematica, 122(2), 1269–1300. Retrieved from https://utilitasmathematica.com/index.php/Index/article/view/2858
