February 21, 2020

Jade’s Status Update for Feb 22

This week I investigated a number of text-to-speech (TTS) packages. I evaluated them on ease of installation and use, whether they require an internet connection, and the quality of the voice. Many of the packages had robotic or choppy voices, so they were ruled out. After reading about the different options, I decided to go with the gTTS Python package, which interfaces with Google Translate's text-to-speech API. It was easy to install and produced a clean voice. However, it does require an internet connection. The voice gTTS outputs is an adult female voice, so to make it more appealing to children I experimented with shifting its pitch upwards. Pitch shifting produced a friendlier voice, so I will now also be implementing a pitch-shifting algorithm. Below is a summary of the TTS packages I investigated.

Speech Synthesis Package	Pros	Cons	Source
Festival	Easy to install. Shell-level command interpreter, plus Java and C++ libraries.	Unappealing voice: very choppy synthesis, not natural.
Flite	Built specifically for embedded systems. Runs faster than Festival. No dependencies.	Unappealing voice: very choppy synthesis, not natural.
eSpeak	Command line, easy to install.	Unappealing voice: clear, but robotic.
say	Command line, easy to install.	Unappealing voice: extremely robotic, unclear and muddled.
spd-say	Command line, easy to install.	Unappealing voice: speech is clear, but too fast and very choppy.
google_speech	Python package, easy to install. Voice is adult female and does not sound bad.	Less advanced than gTTS and no longer maintained.	https://pypi.org/project/google-speech/
gTTS	Python package, easy to install. Voice is adult female and sounds smooth.	Requires an internet connection on the Pi.	https://github.com/pndurette/gTTS
AWS Polly	Good voice, comparable to google_speech and gTTS.	Requires AWS credits.
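
For reference, a call into gTTS might look roughly like the sketch below. The `synthesize` wrapper and its `tts_factory` parameter are hypothetical names of my own, added so a different backend (or a stub for offline testing) can be swapped in. The only gTTS calls used are the `gTTS(text=..., lang=...)` constructor and its `save()` method, and the real call needs an internet connection.

```python
def synthesize(text, out_path, tts_factory=None):
    """Save `text` as speech audio at `out_path` and return the path.

    `tts_factory` is a hypothetical hook: any callable that takes the text
    and returns an object with a `save(path)` method. When it is omitted,
    gTTS is used, which writes MP3 data and requires network access.
    """
    if tts_factory is None:
        # Imported lazily so this module still loads where gTTS is not installed.
        from gtts import gTTS

        def tts_factory(t):
            return gTTS(text=t, lang="en")

    tts = tts_factory(text)
    tts.save(out_path)
    return out_path


# Example (needs gTTS installed and an internet connection):
# synthesize("Hello there!", "hello.mp3")
```

Keeping the backend behind one small function should make it easier to try a different package later if the internet requirement turns out to be a problem on the Pi.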

I also investigated pitch-shifting algorithms this week. There seem to be two main ones: PSOLA (pitch-synchronous overlap-add) and phase vocoding. PSOLA does a frame-by-frame analysis of the input data and overlaps the frames closer together to raise the pitch, or spaces them farther apart to lower it. Duration is changed by adding or removing frames. Phase vocoding takes STFTs of the input, computes the magnitude and phase of each frequency bin, and then shifts the instantaneous frequency to change the pitch without time scaling. I'm not sure yet which method I want to use, because both require a lot of computation. I plan to write mock-up programs for each and time them to understand the latency.
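
As a starting point for those timing mock-ups, here is a minimal sketch of a simplified pitch shifter with a timer around it. It is neither true PSOLA (the frames are not aligned to pitch periods) nor a phase vocoder: it time-stretches with a plain windowed overlap-add and then resamples back to roughly the original length, so on real speech it will have audible artifacts. All function names, the frame size, and the hop size are assumptions chosen for illustration.

```python
import math
import time


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def ola_stretch(x, stretch, frame=512, synth_hop=128):
    """Time-stretch x by `stretch` using plain (non-synchronous) overlap-add."""
    win = hann(frame)
    ana_hop = synth_hop / stretch  # step between frames read from the input
    n_frames = max(1, int((len(x) - frame) / ana_hop) + 1)
    out_len = (n_frames - 1) * synth_hop + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len  # running sum of window weights, for normalization
    for k in range(n_frames):
        a = int(round(k * ana_hop))  # read position in the input
        s = k * synth_hop            # write position in the output
        for i in range(frame):
            if a + i < len(x):
                out[s + i] += x[a + i] * win[i]
                norm[s + i] += win[i]
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]


def resample(x, factor):
    """Read x `factor` times faster using linear interpolation."""
    result = []
    for j in range(int(len(x) / factor)):
        pos = j * factor
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        result.append((1 - frac) * x[i] + frac * nxt)
    return result


def pitch_shift(x, ratio):
    """Multiply the pitch by `ratio` while keeping roughly the same duration."""
    return resample(ola_stretch(x, ratio), ratio)


def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    sr = 8000  # assumed sample rate for the demo
    tone = [math.sin(2 * math.pi * 125 * n / sr) for n in range(sr // 2)]
    shifted, elapsed = time_call(pitch_shift, tone, 2.0)
    print(f"{len(tone)} samples in, {len(shifted)} samples out, {elapsed * 1000:.1f} ms")
```

The same `time_call` helper could wrap a PSOLA or phase vocoder implementation later, so the latency numbers stay comparable. A pure-Python loop like this is likely far slower than an array-based version, so the timings are only a rough upper bound.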

Other things I worked on this week were getting the Raspberry Pi 4s set up and getting TTS and speech recognition working on my laptop. I also worked on parts of the design presentation and design proposal.

My progress is on schedule because I have identified good packages for TTS and speech recognition and have got them working on my laptop.

For this upcoming week I hope to get the Raspberry Pis set up completely and verify that the packages I chose will run on them.

Author: jtraiger