Mohini’s Status Report for 10/9/2020

This week, I decided to take a break from designing the web pages and focused on starting the research and implementation phases of the speech-to-text model. I used the sounddevice library in Python to experiment with recording my voice saying different letters of the alphabet. I saved the recordings and tried to identify patterns across recordings of the same letter. I wasn’t able to identify any obvious patterns just by looking at the amplitude of the signal, so I took the Fourier transform of the signal. Viewing the signal in the frequency domain, I was able to identify similarities between different recordings of the same letter. Next steps here include using the Fourier transform to extract the characteristic frequencies of each letter.
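To make the workflow concrete, here is a minimal sketch of the record-then-transform step. The sample rate, recording duration, and the idea of printing the ten strongest frequency bins are illustrative assumptions, not the exact settings I used.

```python
# Sketch: record a spoken letter with sounddevice and inspect its spectrum.
# Sample rate and duration are assumed values for illustration.
import numpy as np
import sounddevice as sd

fs = 44100          # sample rate in Hz (assumed)
duration = 1.0      # seconds of audio per spoken letter (assumed)

# Record a mono clip of someone saying a letter.
recording = sd.rec(int(duration * fs), samplerate=fs, channels=1)
sd.wait()                      # block until the recording finishes
signal = recording.flatten()

# View the signal in the frequency domain via the Fourier transform.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

# The strongest frequency components are candidate letter-specific features.
top_bins = np.argsort(spectrum)[-10:]
print(freqs[top_bins])
```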

Additionally, I reviewed the foundations of building a neural network from scratch. After completing this research component, I programmed a basic neural network that learns the optimal parameter matrix by performing gradient descent on a training dataset. I’ll explain this a little more. The goal of the neural net is to minimize the mean squared error of categorizing the letters. The input to the neural net is a sample of audio, represented by a vector of some dimension n. A number of hidden layers connect the input to the output, which is a probability distribution over the 26 letters. To get from the input to the output and form the hidden layers along the way, I take linear combinations of the input feature vector with the parameter weight matrix. The hidden layers are then represented by these linear combinations passed through a sigmoid function. In order to minimize the mean squared error, I need to find the optimal parameter weight matrix. This is done through stochastic gradient descent, which involves choosing one sample from the training dataset, calculating the partial derivative of the current mean squared error with respect to each of the weights, and updating the weights by these derivatives. This is repeated for each element in the dataset.
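The sketch below illustrates this structure with a single hidden layer: sigmoid activations, a mean-squared-error objective, and one stochastic gradient descent update per training sample. The layer sizes, learning rate, and the random placeholder data are my assumptions for the example, not the actual configuration of my network.

```python
# Sketch: one-hidden-layer network trained by single-sample gradient descent.
# Sizes, learning rate, and the placeholder data are assumed for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, hidden, classes = 64, 32, 26                     # input dim, hidden units, 26 letters (assumed)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(hidden, n))        # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(classes, hidden))  # hidden -> output weights
lr = 0.1                                            # learning rate (assumed)

def train_step(x, y):
    """One stochastic gradient descent update on a single (x, y) sample."""
    global W1, W2
    # Forward pass: linear combinations passed through a sigmoid.
    h = sigmoid(W1 @ x)            # hidden layer
    out = sigmoid(W2 @ h)          # scores over the 26 letters

    # Mean squared error against the one-hot label y.
    err = out - y

    # Backward pass: partial derivatives of the error with respect to each weight.
    d_out = err * out * (1 - out)              # derivative through the output sigmoid
    d_hidden = (W2.T @ d_out) * h * (1 - h)    # derivative through the hidden sigmoid

    # Update every weight by its gradient.
    W2 -= lr * np.outer(d_out, h)
    W1 -= lr * np.outer(d_hidden, x)
    return np.mean(err ** 2)

# One pass over a placeholder training set, one sample at a time.
X = rng.normal(size=(100, n))                 # placeholder feature vectors
labels = rng.integers(0, classes, size=100)   # placeholder letter labels
for x, label in zip(X, labels):
    y = np.zeros(classes)
    y[label] = 1.0
    train_step(x, y)
```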

I have finished most of the basic implementation of the neural net. However, the accuracy of my algorithm is currently approximately 35% and needs significant improvement. I need to research ways to improve the accuracy, most likely by increasing the number of epochs and adjusting the number of hidden layers in the model, as sketched below. Additionally, I need to test the neural net with the input from our signal processing of the audio. Since this component hasn’t been completed yet, I am currently using a dataset from the Internet that consists of images of letters, rather than audio signals of letters.
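As a rough illustration of the tuning I have in mind, the snippet below simply repeats the single-sample updates for several epochs, shuffling the data each pass. It reuses train_step, X, and labels from the sketch above, and the epoch count is an assumed value rather than a setting from my actual model.

```python
# Sketch: repeat the single-sample updates for several epochs (value assumed).
import numpy as np

epochs = 50
for epoch in range(epochs):
    order = np.random.permutation(len(X))   # shuffle samples each epoch
    for i in order:
        y = np.zeros(26)
        y[labels[i]] = 1.0
        train_step(X[i], y)
```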

I believe I am on schedule, as this week I worked on both the signal processing and machine learning components of our project. I will continue to fine-tune the neural net algorithm as well as brainstorm ways to best represent the audio recording as a finite signal.

 
