Jade’s Status Update for Feb 22
This week I investigated many different text-to-speech (TTS) packages. I evaluated each one on ease of installation and use, whether it requires an internet connection, and the quality of the voice. Many of the packages had robotic or choppy voices, so they were ruled out. After reading about the different options, I decided to go with gTTS, a Python package that uses Google's text-to-speech service. It was easy to install and produced a clean voice. However, it does require an internet connection. The voice gTTS outputs is an adult female voice, so to make it more appealing to children I experimented with pitch shifting it upwards. Pitch shifting produced a friendlier voice, so I will now also be implementing a pitch-shifting algorithm. Below is a summary of the TTS packages I investigated.
Speech Synthesis Package | Pros | Cons | Source |
Festival | Easy to install. Shell-level command interpreter, Java and C++ libraries | Unappealing voice – very choppy synthesis and not natural | |
Flite | Built specifically for embedded systems. Runs faster than Festival. No dependencies | Unappealing voice – very choppy synthesis and not natural | |
eSpeak | Command line, easy to install | Unappealing voice – clear, but robotic | |
say | Command line, easy to install | Unappealing voice – extremely robotic, unclear, and muddled | |
spd-say | Command line, easy to install | Unappealing voice – speech is clear, but too fast and very choppy | |
google_speech | Python package, easy to install. Voice is adult female and does not sound bad | Less advanced than gTTS and not maintained | https://pypi.org/project/google-speech/ |
gTTS | Python package, easy to install. Voice is adult female and sounds smooth | Requires internet connection on the Pi | https://github.com/pndurette/gTTS |
AWS Polly | Good voice, comparable to google_speech and gTTS | Requires AWS credit | |
I also investigated pitch-shifting algorithms this week. There seem to be two main ones: PSOLA and phase vocoding. PSOLA does a frame-by-frame analysis of the input and overlaps the frames closer together to raise the pitch, or spaces them farther apart to lower it. Duration is changed separately by adding or removing frames. Phase vocoding takes STFTs of the input, computes the magnitude and phase of each frame, and then modifies the instantaneous frequency to shift the pitch without time scaling. I'm not sure yet which method I want to use because both require a lot of computation, so I want to write mock-up programs for each and time them to understand the latency.
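To make the PSOLA idea concrete, here is a minimal sketch in Python. It is only a mock-up under strong assumptions, not the implementation that will run on the Pi: it assumes the pitch period is already known and constant, so the pitch marks sit at exact multiples of `period`. Real speech would need a pitch-mark detection step that is not shown here. The function names (`make_tone`, `psola_shift`, `estimate_pitch_hz`) and the 200 Hz test tone are made up for illustration, and the pitch estimate is a crude zero-crossing count that is only reasonable for clean tones.

```python
import math
import time


def make_tone(freq_hz, duration_s, sample_rate):
    """Synthetic test signal: a mono sine wave as a list of floats."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def psola_shift(samples, period, ratio):
    """Raise (ratio > 1) or lower (ratio < 1) pitch while keeping duration.

    Assumes the pitch period (an integer number of samples) is known and
    constant, so the pitch marks are simply at multiples of `period`.
    """
    n = len(samples)
    # Hann window two periods long, peaking at the pitch mark.
    window = [0.5 - 0.5 * math.cos(math.pi * i / period) for i in range(2 * period)]
    out = [0.0] * n
    norm = [0.0] * n
    new_period = period / ratio  # spacing of the output pitch marks
    j = 0
    while round(j * new_period) < n:
        center_out = round(j * new_period)
        # Reuse the nearest input frame. Frames get repeated or dropped
        # as needed, which is what keeps the duration unchanged.
        center_in = round(center_out / period) * period
        j += 1
        if center_in - period < 0 or center_in + period > n:
            continue  # this frame would run off the input
        for i in range(2 * period):
            t = center_out - period + i
            if 0 <= t < n:
                out[t] += samples[center_in - period + i] * window[i]
                norm[t] += window[i]
    # Divide by the summed window so overlapping frames do not change loudness.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]


def estimate_pitch_hz(samples, sample_rate):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


if __name__ == "__main__":
    sr = 16000
    tone = make_tone(200, 0.5, sr)  # 200 Hz, so the period is 80 samples
    start = time.perf_counter()
    shifted = psola_shift(tone, period=sr // 200, ratio=1.25)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"pitch before: {estimate_pitch_hz(tone, sr):.0f} Hz")
    print(f"pitch after:  {estimate_pitch_hz(shifted, sr):.0f} Hz")
    print(f"processing time: {elapsed_ms:.1f} ms for {len(tone) / sr:.1f} s of audio")
```

With a ratio of 1.25 the estimated pitch should move from about 200 Hz to roughly 250 Hz while the output stays the same length as the input. The `time.perf_counter()` wrapper is the same kind of measurement I would use to compare this against a phase vocoder mock-up, though pure-Python timings on a laptop will only be a rough guide to latency on the Pi.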
Other things I worked on this week were getting the Raspberry Pi 4s set up and getting TTS and speech recognition working on my laptop. I also worked on parts of the design presentation and design proposal.
My progress is on schedule because I have identified good packages for TTS and speech recognition and have gotten them working on my laptop.
For this upcoming week I hope to finish setting up the Raspberry Pis and verify that the packages I chose will run on them.