Jade’s Status Update for Feb 22

This week I investigated many different text-to-speech (TTS) packages. I evaluated them on ease of installation and use, whether they required an internet connection, and the quality of the voice. Many of the packages had robotic or choppy voices, so they were ruled out. After reading about the different options, I decided to go with the gTTS Python package, which uses Google Translate’s text-to-speech service. It was easy to install and produced a clean voice. However, it does require an internet connection. The voice gTTS outputs is an adult female voice, so to make it more appealing to children I experimented with pitch shifting it upwards. Pitch shifting produced a friendlier voice, so I will now also be implementing a pitch-shifting algorithm. Below is a summary of the TTS packages I investigated.

| Speech Synthesis Package | Pros | Cons | Source |
|---|---|---|---|
| Festival | Easy to install. Shell-level command interpreter, Java and C++ libraries | Unappealing voice: very choppy synthesis, not natural | |
| Flite | Built specifically for embedded systems. Runs faster than Festival. No dependencies | Unappealing voice: very choppy synthesis, not natural | |
| eSpeak | Command line, easy to install | Unappealing voice: clear, but robotic | |
| say | Command line, easy to install | Unappealing voice: extremely robotic, unclear, and muddled | |
| spd-say | Command line, easy to install | Unappealing voice: speech is clear, but too fast and very choppy | |
| google_speech | Python package, easy to install. Voice is adult female and does not sound bad | Less advanced than gTTS and not maintained | https://pypi.org/project/google-speech/ |
| gTTS | Python package, easy to install. Voice is adult female and sounds smooth | Requires internet connection on the Pi | https://github.com/pndurette/gTTS |
| AWS Polly | Good voice, comparable to google_speech and gTTS | Requires AWS credit | |
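
For reference, basic gTTS usage looks roughly like the sketch below. The `synthesize` helper and its `tts_factory` parameter are hypothetical names I made up for illustration (the factory just makes it possible to swap in a stand-in object when there is no internet connection). Only `gTTS(text=..., lang=...)` and `.save(...)` come from the package itself.

```python
from pathlib import Path


def synthesize(text, out_path, tts_factory=None):
    """Write spoken `text` to `out_path` as an MP3 and return the path.

    `tts_factory` is a hypothetical hook: any callable that takes the text
    and returns an object with a `save(path)` method. It defaults to gTTS.
    """
    if tts_factory is None:
        # Imported lazily so this module still loads where gTTS is missing.
        from gtts import gTTS

        def tts_factory(t):
            return gTTS(text=t, lang="en")

    tts = tts_factory(text)
    # With gTTS, this is the step that contacts the online service,
    # so it fails without an internet connection.
    tts.save(str(out_path))
    return Path(out_path)
```

Calling `synthesize("Hello there", "hello.mp3")` should leave an MP3 file that can then be played back or passed on to a pitch-shifting step.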

I also investigated pitch-shifting algorithms this week. There seem to be two main ones: PSOLA and phase vocoding. PSOLA relies on a frame-by-frame analysis of the input data: overlapping the frames closer together raises the pitch, and spacing them farther apart lowers it. You change the duration by adding or removing frames. Phase vocoding relies on taking STFTs of the input, computing the magnitude and phase of each frequency bin, and then adding a frequency offset to the instantaneous frequency to shift the pitch without time scaling. I’m not sure yet which method I want to use, because both require a lot of computation. I plan to write mock-up programs for each and time them to understand the latency.
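
As a starting point for the mock-ups, here is a rough NumPy sketch of one common phase-vocoder approach. Note that it is a different variant from the one described above: it time-stretches the signal with a phase vocoder and then resamples it back to roughly the original length, rather than adding an offset to the instantaneous frequency directly. The function names, the frame size of 1024, and the hop of 256 are all my own assumptions, and the input is assumed to be a mono float array longer than one frame.

```python
import numpy as np


def stft(x, n_fft, hop):
    """Hann-windowed short-time Fourier transform, one row per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    return np.array([np.fft.rfft(window * x[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])


def istft(frames, n_fft, hop):
    """Overlap-add inverse STFT, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, spec in enumerate(frames):
        out[i * hop:i * hop + n_fft] += window * np.fft.irfft(spec, n_fft)
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def time_stretch(x, rate, n_fft=1024, hop=256):
    """Change duration without changing pitch (rate < 1 makes it longer)."""
    S = stft(x, n_fft, hop)
    # Phase advance each bin would have over one hop at its center frequency.
    omega = 2 * np.pi * hop * np.arange(S.shape[1]) / n_fft
    phase = np.angle(S[0])
    out = []
    for t in np.arange(0, len(S) - 1, rate):
        i = int(t)
        frac = t - i
        # Interpolate magnitudes between the two neighboring analysis frames.
        mag = (1 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        out.append(mag * np.exp(1j * phase))
        # Deviation from the expected advance, wrapped to [-pi, pi].
        dphi = np.angle(S[i + 1]) - np.angle(S[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + omega + dphi
    return istft(np.array(out), n_fft, hop)


def pitch_shift(x, semitones, n_fft=1024, hop=256):
    """Shift pitch by `semitones`, keeping roughly the original duration."""
    factor = 2.0 ** (semitones / 12.0)
    # Stretch to `factor` times the length, then read it back `factor`
    # times faster using linear interpolation.
    stretched = time_stretch(x, 1.0 / factor, n_fft, hop)
    idx = np.arange(0, len(stretched) - 1, factor)
    return np.interp(idx, np.arange(len(stretched)), stretched)
```

Wrapping a call such as `pitch_shift(samples, 4)` between two `time.perf_counter()` readings would give a first latency estimate on the laptop, which could later be compared against the same measurement on the Pi and against a PSOLA mock-up.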

Other things I worked on this week were getting the Raspberry Pi 4s set up and getting TTS and speech recognition working on my laptop. I also worked on parts of the design presentation and design proposal.

My progress is on schedule because I have identified good packages for TTS and speech recognition and have gotten them working on my laptop.

For this upcoming week, I hope to get the Raspberry Pis set up completely and verify that the packages I chose will run on them.


