HMM-Based Speech Synthesis

This page provides audio samples of the HMM-based speech synthesis system described in our paper [Miyuki 2017].

The following speech signals were generated by our system, which did not use any transcribed data, i.e., manually annotated data.

The entire learning process is unsupervised.

Experiment 1: Japanese Vowel Sequences

The following results were obtained using a Japanese vowel sequence corpus consisting of five artificial words {aioi, aue, ao, ie, uo}, built from the five Japanese vowels {a, i, u, e, o}.

We asked a participant, a male Japanese speaker, to read thirty sentences aloud, twice each. The speech signals were recorded with a microphone. The thirty sentences included all possible two-word sentences. [Speech Data]
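As a side note on the sentence count, here is a minimal sketch of the enumeration, assuming ordered two-word pairs with repetition; the guess that the remaining five of the thirty sentences are the single-word utterances is ours, not stated in the paper.

    from itertools import product

    words = ["aioi", "aue", "ao", "ie", "uo"]

    # All ordered two-word sentences from the five words: 5 * 5 = 25.
    two_word = [" ".join(pair) for pair in product(words, repeat=2)]
    print(len(two_word))                  # 25

    # Our assumption: the other five of the thirty recorded sentences
    # are presumably the five single-word utterances (25 + 5 = 30).
    print(len(two_word) + len(words))     # 30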

    1. Speech signals generated from latent words acquired by the system [URL]

      • This corresponds to the results shown in Table 1.

Experiment 2: TIDIGITS Corpus

The following results were obtained using the TIDIGITS corpus.

    1. Speech signals generated from latent letter sequences estimated from the original speech signals in the TIDIGITS corpus [URL]

      • Each file name indicates the digits contained in the file. The suffixes "a" and "b" distinguish different audio files.

    2. A sequence of the numbers between 0 and 9. Only 0 has two pronunciations, i.e., "zero" and "oh". [URL]

      • The latent words corresponding to the digits in TIDIGITS were synthesized and played sequentially.

    3. Bigram-based random walk over latent words. [URL]

      • The acquired latent words were played back by performing a random walk with the acquired bigram language model (a minimal sketch of such a walk follows this list).

      • However, because of the presence of silence segments, the bigram language model, i.e., the word bigram model, was not estimated correctly. As a result, the random walk did not produce satisfactory results. Learning longer contexts and generating more natural sequences remain future challenges.
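To make the sampling procedure concrete, here is a minimal sketch of a bigram-based random walk. The latent-word inventory, the probabilities, and the "<sil>" silence token below are hypothetical placeholders, not values learned by our system.

    import random

    # Hypothetical bigram language model over latent words.
    # "<s>"/"</s>" mark sequence boundaries; "<sil>" is a stand-in
    # for the silence token whose dominance degrades the walk.
    bigram = {
        "<s>":   {"one": 0.4, "two": 0.3, "<sil>": 0.3},
        "one":   {"two": 0.5, "<sil>": 0.5},
        "two":   {"one": 0.3, "</s>": 0.2, "<sil>": 0.5},
        "<sil>": {"one": 0.3, "two": 0.3, "</s>": 0.4},
    }

    def random_walk(model, max_len=10):
        """Sample a latent-word sequence from a bigram model."""
        sequence, word = [], "<s>"
        for _ in range(max_len):
            successors = model.get(word)
            if not successors:
                break
            candidates, weights = zip(*successors.items())
            word = random.choices(candidates, weights=weights)[0]
            if word == "</s>":
                break
            sequence.append(word)
        return sequence

    print(random_walk(bigram))  # e.g. ['<sil>', 'two', 'one']

Each sampled latent word is then synthesized and the audio concatenated, yielding the played sequences described above.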

Citation information

[Miyuki 2017] Yuusuke Miyuki, Yoshinobu Hagiwara, and Tadahiro Taniguchi, "Unsupervised Learning for Spoken Word Production Based on Simultaneous Word and Phoneme Discovery without Transcribed Data," IEEE ICDL-EpiRob 2017 (submitted).