A Measure of Smoothness in Synthesized Speech

Phung Trung Nghia, Nguyen Van Tao, Pham Thi Mai Huong, Nguyen Thi Bich Diep, Phung Thi Thu Hien


The articulators typically move smoothly during speech production, so the acoustic features of natural speech are generally smooth. Over-smoothing, however, produces a "muffled" quality and reduces the ability to identify emotions, expressions, and styles in synthesized speech, which in turn degrades perceived naturalness. In the literature, the statistical variance of static spectral features has been used to measure smoothness in synthesized speech, but this measure alone is not sufficient. This paper proposes an additional measure of smoothness that can be applied efficiently to evaluate synthesized speech. Experiments show that the proposed measure is reliable and efficient for assessing the smoothness of different kinds of synthesized speech.
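The variance-based baseline mentioned in the abstract, commonly called the global variance (GV) of static spectral features, can be sketched as follows. This NumPy sketch and its synthetic data are illustrative assumptions, not the paper's implementation; it only shows why heavy smoothing shrinks the per-dimension variance:

```python
import numpy as np

def global_variance(features):
    """Global variance: per-dimension variance of static features
    over one utterance. features: (T, D) array, e.g. T frames of
    D mel-cepstral coefficients."""
    return np.var(features, axis=0)

# Synthetic illustration (not real speech): over-smoothed trajectories
# have noticeably lower global variance than the original ones.
rng = np.random.default_rng(0)
natural = rng.normal(0.0, 1.0, size=(200, 24))        # stand-in for natural features
smoothed = (natural[:-2] + natural[1:-1] + natural[2:]) / 3.0  # 3-frame moving average

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```

Here `gv_smoothed` falls well below `gv_natural` in every dimension, which is exactly the degradation that GV-based evaluation detects; the paper's point is that this single statistic does not capture all aspects of (over-)smoothness.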



DOI: http://dx.doi.org/10.21553/rev-jec.106

Copyright (c) 2016 REV Journal on Electronics and Communications
