Mohini’s Status Report for 12/04/2020

Since I wrapped up the implementation last week, I spent most of my time this week testing and putting the final touches on the web app. I updated the user model that keeps track of the user’s name and all of their questions and answers. I tested this functionality thoroughly to ensure that if I logged out and then logged back in as a different user, a new set of questions and answers is displayed on the completed technical page, corresponding to that specific user. The completed technical page is a running record of all the questions and answers that belong to the logged-in user.
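
As a rough illustration, the per-user record keeping can be sketched with a Django model like the one below (the model and field names here are my own placeholders, not necessarily what iRecruit uses):

    # models.py -- a minimal sketch; model and field names are assumptions
    from django.contrib.auth.models import User
    from django.db import models

    class PracticeRecord(models.Model):
        user = models.ForeignKey(User, on_delete=models.CASCADE)
        question = models.TextField()
        answer = models.TextField()

    # The completed technical page then only shows the logged-in user's records:
    #   PracticeRecord.objects.filter(user=request.user)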

Second, I formally tested the speech recognition algorithm in two different ways. First, I did automated testing with fixed testing data: I created a testing data file with approximately 50 samples representative of all 8 categories and ran the speech recognition algorithm to predict the category of each sample. The accuracy fell around 30-40% and varied significantly between test cases. Next, I did manual testing of the integration between the signal processing and machine learning components. Here, I created each test sample by recording my voice and letting the signal processing algorithm decompose it. I created a spreadsheet to keep track of the manual testing; it records the true word and the predicted word, and it shows a similar accuracy of 30-40% to the automated testing.
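
The automated check amounts to a loop like the sketch below (the predict function and the test-file layout, a label followed by feature values, are assumptions on my part):

    # Hypothetical sketch of the automated accuracy test; `predict` and the
    # file layout (label followed by feature values) are assumptions.
    def test_accuracy(test_file="test_data.txt"):
        correct = total = 0
        with open(test_file) as f:
            for line in f:
                if not line.strip():
                    continue
                label, *features = line.split()
                total += 1
                if predict([float(v) for v in features]) == label:
                    correct += 1
        return correct / total   # fell around 0.30-0.40 in practice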

Lastly, I started working on the final report. I used our design report as a template and incorporated some of the many design changes that we made throughout the project. Some of the updates included word classification (rather than letter classification), the layout of the technical interviewing page, and the information that we stored in the profile page for the completed behavioral and technical pages. Next week, I plan on continuing to test the speech recognition algorithm through both automated and manual testing. I will also start recording my part for the final video demo. I believe we are making good progress as iRecruit is almost complete! 

 

Mohini’s Status Report for 11/20/2020

This week, I worked on several different things. I started by reviewing the neural net and backpropagation algorithm with Shilika. We needed to take my working implementation for one hidden layer and extend the stochastic gradient descent algorithm to handle the addition of another hidden layer. After drawing out the feed-forward algorithm with the different parameters, we derived the backpropagation derivatives, simplified them, and coded the equations. Shilika was in charge of the coding aspect; I simply helped her think through the algorithm.
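
For reference, here is a sketch of the derivation we worked through, assuming sigmoid activations throughout and the squared-error loss J = \frac{1}{2}\lVert \hat{y} - y \rVert^2 from my earlier reports (\odot is element-wise multiplication; the notation is mine, not taken from our code):

    z_1 = W_1 x,            a_1 = \sigma(z_1)
    z_2 = W_2 a_1,          a_2 = \sigma(z_2)
    \hat{y} = \sigma(W_3 a_2)

    \delta_3 = (\hat{y} - y) \odot \hat{y} \odot (1 - \hat{y}),      \partial J / \partial W_3 = \delta_3 a_2^\top
    \delta_2 = (W_3^\top \delta_3) \odot a_2 \odot (1 - a_2),        \partial J / \partial W_2 = \delta_2 a_1^\top
    \delta_1 = (W_2^\top \delta_2) \odot a_1 \odot (1 - a_1),        \partial J / \partial W_1 = \delta_1 x^\top

Each stochastic gradient descent step then updates W_i \leftarrow W_i - \eta \, \partial J / \partial W_i for a single training sample.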

Next, I continued working on the technical interview page. I focused on two things: improving the CSS of the page to make it more visually appealing, and implementing the backend Django code for retrieving the user’s answers and saving them to a database. Rather than telling the user whether their answer is correct, I display the user’s submitted answer and the correct answer on the screen and let them decide. I also created a database to store all of the user information, including their name, email, and list of questions as well as their answers to those questions. For the CSS aspect, I played around with the layout of the page so that the chosen category and question banners appear under each other rather than next to each other, and I changed the colors and positions of a few other elements to make the page as a whole more visually appealing. Additionally, I continued researching and adding tips to our tips page, as well as improving the CSS of that page.
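
A minimal sketch of the retrieve-and-save view might look like this (the model, template, and helper names are hypothetical placeholders, not our actual identifiers):

    # views.py -- hypothetical sketch; model, template, and helper names
    # are placeholders rather than our actual identifiers
    from django.shortcuts import render
    from .models import PracticeRecord   # assumed model: user, question, answer

    def submit_answer(request):
        if request.method == "POST":
            question = request.POST["question"]
            answer = request.POST["answer"]
            PracticeRecord.objects.create(user=request.user,
                                          question=question, answer=answer)
            # Show the submitted answer beside the sample answer instead of grading it
            context = {"submitted": answer,
                       "sample": lookup_sample_answer(question)}  # hypothetical helper
            return render(request, "technical.html", context)
        return render(request, "technical.html")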

Lastly, I started testing the speech recognition algorithm. I created more training data, but adding it actually decreased the accuracy of the algorithm. I haven’t yet found the optimal number of training samples or the optimal parameters for running the neural network; I will experiment to find them next week.

 

Mohini’s Status Report for 11/13/2020

This week, I worked on a multitude of things. I started by looking into the speech recognition algorithm and thinking of possible ways to increase its accuracy. I created more training data, which helped increase the accuracy by about 5%. I also tested the workflow and integration of the algorithm extensively, making sure that the signal processing and machine learning components work well together.

Second, I worked on the web application this week. I spent a little bit of time cleaning up the backend logic for users to create a new account and log into iRecruit. This included fixing a minor bug so that the user’s username now appears on the dashboard and the navigation bar. I also created a user database to store the information of any user who makes an account with iRecruit. This database will be used by the profile page to keep track of each individual user’s completed behavioral and technical interview practices. Additionally, I worked on the tips page, researching and adding tips on technical interviewing for the user’s convenience.

The majority of my time was spent continuing to work on the technical interview page. I finished creating the questions database so that each of our eight categories has a couple of questions. I display the user’s chosen category (the output of our speech recognition algorithm) on the webpage, along with a random question from that category, and I created an input text box for the user to submit their answer. Next steps include writing backend code in the Django framework to retrieve the user’s answer and check its accuracy. I also plan on displaying a possible correct answer on the screen, so the user can compare theirs to this sample answer if desired. I will be storing the user’s questions and answers in the database, so that a summary of their practices can be displayed on the profile page.
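
Picking the random question could look something like the sketch below (the Question model and its fields are assumptions):

    # views.py -- hypothetical sketch; the Question model and its
    # `category` / `text` fields are assumptions
    import random
    from .models import Question

    def random_question(category):
        """Return a random question from the predicted category."""
        candidates = list(Question.objects.filter(category=category))
        return random.choice(candidates) if candidates else None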

I believe I am on schedule, as the skeleton of the technical interview page has been completed. I will spend the rest of the semester trying to improve the speech recognition algorithm and formatting the technical interview page to incorporate the best UI practices; however, I feel that the core of the project has been completed.

 

Mohini’s Status Update for 11/06/2020

This week, I worked on integrating the signal processing and machine learning algorithms to create a complete speech recognition implementation. First, I finished creating the training data. This involved recording myself speaking a word and running the signal processing algorithm, which stores a binary representation of the data in a temporary text file. I then manually appended the contents of this temporary file, along with the English representation of the word, to my training data text file. I decided to record each of the 8 categories 8 different times, for a total of 64 samples in the training data set. This process was quite tedious and took a couple of hours to complete, as I had to wait for the signal processing algorithm to run for each sample. I used a similar approach to create the testing data set. Currently, there are only 7 samples in it, but I will add more in the near future.
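
The per-sample bookkeeping amounts to something like this sketch (the file names are placeholders for our actual pipeline’s inputs and outputs):

    # Hypothetical sketch of appending one labeled sample to the training set;
    # the file names are placeholders for our pipeline's actual files.
    def add_training_sample(word, tmp_file="output.txt",
                            train_file="train_data.txt"):
        with open(tmp_file) as f:            # binary features written by the
            features = f.read().split()      # signal processing run
        with open(train_file, "a") as f:
            f.write(word + " " + " ".join(features) + "\n")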

Next, I used the training and test data sets as the input to the neural network implementation. I coded the baseline of this implementation from scratch last year for my 10-301: Introduction to Machine Learning course, and I had to tweak a few things to adapt the code for this project. One challenge was formatting the datasets so that reading and processing the files is as simple as possible. Another challenge was finding the optimal ordering of the training data, since changing the order had a significant impact on the accuracy of the model. For example, I noticed that the accuracy decreased if the training dataset had multiple samples of the same word in a row. After overcoming these obstacles, I modified the stochastic gradient descent algorithm to work with these datasets and fine-tuned the parameter matrices. Then I wrote a predict function that uses the optimal parameter matrices determined from training the neural network to predict the corresponding English word for each sample in the test data set. Currently, the accuracy is at 42%, but I will work on improving it.
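
One simple way to avoid runs of the same word is to shuffle the training file once before training; a minimal sketch, assuming one sample per line:

    # Minimal sketch: shuffle the training samples (one per line) so that
    # repeated words are not adjacent, which hurt SGD accuracy in my tests.
    import random

    with open("train_data.txt") as f:
        samples = f.readlines()
    random.shuffle(samples)
    with open("train_data.txt", "w") as f:
        f.writelines(samples)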

Finally, I integrated the speech recognition implementation with the Django web app so that the user can record themselves from the technical interview page and the algorithm returns the predicted word. This word is then displayed on the technical interview page. Next steps will include improving the accuracy of this algorithm and picking a question from the corresponding category to display on the screen. 

[Image: a snapshot of the training data set; each sample is of length 15000.]

Mohini’s Status Report for 10/30/2020

This week, I finalized the output of the signal processing algorithm. I applied the Mel scale filterbank implementation to reduce the dimension of the output to 40 x 257. Once I verified that the output seemed reasonable, my next step was to determine the best way to feed it into the neural network. I experimented with passing the Mel scale filterbank representation directly, but these matrices seemed too similar between different words. Since it was the spectrogram’s visual representation that differed between words, I decided to save it as an image and pass the grayscale version of the image as the input to the neural network.
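
The image round trip can be sketched roughly as follows (a minimal illustration, assuming the filterbank output array is already in hand; the temporary file name is a placeholder):

    # Minimal sketch: turn a spectrogram array into a grayscale feature vector.
    # The temporary file name is a placeholder.
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")            # render without a display
    import matplotlib.pyplot as plt
    from PIL import Image

    def spectrogram_to_vector(spec, tmp_path="spec.png"):
        plt.imsave(tmp_path, spec)                # save the array as an image
        img = Image.open(tmp_path).convert("L")   # convert to grayscale
        return np.asarray(img, dtype=np.float64).flatten() / 255.0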

Once I decided this was the best way to represent the input vector, I began creating the training data for the neural network. We currently have 8 different categories that users can pick their technical questions from, and I plan to generate about 10 samples for each category as initial training data. To make a good model, I’d need to generate close to 1000 samples of training data. However, generating each sample requires me to record the word and run the signal processing algorithm, which takes a few minutes. Since this process is somewhat slow, I don’t think it’d be practical to generate more than 100 samples of training data. So far, my training data set has approximately 30 samples.

This week, I also integrated my neural network code with our Django webapp. I wrote my neural network code in Java, so figuring out a way for our Django webapp to access it was a challenge. I ultimately used an “os.system()” command to call my neural net code from the terminal. Next steps include finishing the training data set as well as passing it through the neural network to view the accuracy of the model.
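
The shell-out looks roughly like this (a sketch only; the Java class name, argument order, and output file are assumptions, not our actual interface):

    # Hypothetical sketch of shelling out to the Java neural net from Django;
    # the class name, arguments, and output file are assumptions.
    import os

    def run_neural_net(train_path, test_path):
        status = os.system(f"java NeuralNet {train_path} {test_path} predictions.txt")
        if status != 0:
            raise RuntimeError("neural net process failed")
        with open("predictions.txt") as f:
            return f.read().split()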

 

Team Status Update for 10/23/2020

This week, we continued to work on implementing our respective portions of the project.

Jessica continued to work on the facial detection portion, specifically looking into saving videos, the alerts, and the facial landmark detection. She was able to get the video saving portion to work, using the VideoWriter class in OpenCV to write video frames to an output file. There are currently two options for the alerts, one with audio and one with visuals; both are able to capture the user’s attention when their eyes are off-center. She also began looking into facial landmark detection and has a working baseline. Next week, she is hoping to get the center coordinates of the nose and mouth from the facial landmark detection to use as frames of reference for screen alignment. She is also hoping to do more testing on the eye detection portion.
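
The video saving pattern with OpenCV looks roughly like this (a minimal sketch; the codec, frame rate, frame size, and file name are assumptions):

    # Minimal sketch of saving webcam frames with OpenCV's VideoWriter;
    # the codec, fps, size, and file name are assumptions.
    import cv2

    cap = cv2.VideoCapture(0)                       # default webcam
    fourcc = cv2.VideoWriter_fourcc(*"XVID")
    size = (640, 480)
    out = cv2.VideoWriter("practice.avi", fourcc, 20.0, size)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(cv2.resize(frame, size))          # frames must match `size`
        cv2.imshow("recording", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):       # stop on 'q'
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()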

Shilika worked on the signal processing algorithm and has an input ready for the neural network. She followed the standard process of applying pre-emphasis, framing, and windowing, then taking a Fourier transform and power spectrum to move the signal into the frequency domain.
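
That pipeline can be sketched in a few lines of NumPy (a rough illustration; the sampling rate, frame length, and pre-emphasis coefficient are typical values, not necessarily ours):

    # Rough sketch of the pre-emphasis -> framing -> windowing -> FFT ->
    # power spectrum pipeline; parameter values are typical, not ours.
    import numpy as np

    def power_spectrum(x, fs=16000, frame_ms=20, alpha=0.97, nfft=512):
        emphasized = np.append(x[0], x[1:] - alpha * x[:-1])   # pre-emphasis
        frame_len = int(fs * frame_ms / 1000)
        step = frame_len // 2                                  # 50% overlap
        n_frames = 1 + (len(emphasized) - frame_len) // step
        frames = np.stack([emphasized[i * step : i * step + frame_len]
                           for i in range(n_frames)])
        frames = frames * np.hamming(frame_len)                # windowing
        mag = np.abs(np.fft.rfft(frames, nfft))                # FFT
        return (mag ** 2) / nfft                               # power spectrum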

Mohini also continued to work on the signal processing algorithm. The team’s decision to categorize entire words, rather than individual letters, reduces many of our anticipated risks: the signals of many individual letters looked quite alike, whereas the signals of different words have distinct differences. This change will simplify our design greatly and is expected to yield higher accuracy as well. Next steps for Mohini include feeding the signal processing output into the neural network and fine-tuning that algorithm.

 

Mohini’s Status Report for 10/23/2020

This week, I continued working on the signal processing algorithm that will generate an input to the neural network. As a team, we decided to make one significant change to our signal processing algorithm: instead of trying to recognize individual letters, we will try to recognize entire words. Essentially, this reduces the scope of our project, because we will give the user a list of 10-15 categories to choose a technical question from. This means that our neural network will have 10-15 outputs instead of the original 26. Additionally, we will only need to run the neural network once for each word, rather than once for each letter, which will greatly reduce the time needed to generate a technical question.

After making this decision, I tested the rough signal processing algorithm I created last week on entire words (“array”, “linked list”, etc.). I saw that there were significant differences between different words and enough similarity between repetitions of the same word. Afterwards, I improved the algorithm by using a Hamming window rather than a rectangular window, as this windowing technique reduces the impact of discontinuities in the original signal. I also started researching the Mel scale and the Mel filterbank implementation. This will reduce the dimension of the signal processing output, so that it will be easier for the neural network to process without losing any crucial information in the original signal. Next week, I will focus on transforming the output using the Mel scale, as well as creating a first attempt at a training dataset for the neural network. This will most likely include 10-15 signals representing each word that our neural network will be categorizing. It is important that our training dataset consist of a variety of signals for each word in order to prevent the model from overfitting.

 

Mohini’s Status Report for 10/16/2020

This week, I primarily focused on the signal processing aspect of our project. Last week, I saved the audio that the user records as an integer vector and recognized that the time-domain signal was not a sufficient basis for categorization, since different recordings of the same letter produced signals with similar shapes but different amplitudes. That observation led to the idea of analyzing the signal in the frequency domain this week. After taking the Fourier transform of the time-domain signal, we realized that this was also not sufficient, as the Fourier transform of every letter had one peak at low frequencies and another at higher frequencies. After doing a little more research, we decided to analyze the Short Time Fourier Transform (STFT) over 20 ms chunks of the audio clip. We plotted this as a spectrogram, and it was easier to see similarities between recordings of the same letter and differences between different letters.

The team and I spent a good amount of time trying to understand why this was the case and how to proceed. We met with a PhD student who specializes in speech processing to get some guidance. He told us to use a Hamming window with 50% overlap when computing the STFT, instead of the rectangular window with no overlap that we had previously been using. Additionally, he told us to look into log mel filterbanks, which rescale frequencies to a perceptual scale closer to how human ears perceive pitch. We plan to implement these two features in the upcoming week. I believe my work is roughly on schedule, as determining the signal processing output is a crucial part of our project that we allocated several weeks to implement.
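
With a library routine, the recommended STFT settings look something like this (a sketch using scipy; the sampling rate is an assumed value):

    # Sketch of the STFT with a 20 ms Hamming window and 50% overlap, per the
    # PhD student's suggestion; the sampling rate is an assumed value.
    import numpy as np
    from scipy import signal

    def log_spectrogram(x, fs=16000):
        nperseg = int(0.02 * fs)                    # 20 ms window
        f, t, Zxx = signal.stft(x, fs=fs, window="hamming",
                                nperseg=nperseg, noverlap=nperseg // 2)
        return f, t, np.log(np.abs(Zxx) ** 2 + 1e-10)   # log power spectrogram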

 

Mohini’s Status Report for 10/9/2020

This week, I decided to take a break from designing the web pages and focused on starting the research and implementation phases of the speech-to-text model. I used the sounddevice library in Python to experiment with recording my voice saying different letters of the alphabet. I saved the recordings and tried to identify patterns across recordings of the same letter. I wasn’t able to identify any obvious patterns just by looking at the amplitude of the signal, so I took the Fourier transform of the signal. This let me view the signal in the frequency domain, where I was able to identify similarities between different recordings of the same letter. Next steps include using the Fourier transform to extract the desired frequencies of each letter.
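
The recording experiment boils down to a few lines with sounddevice (a minimal sketch; the sampling rate and duration are assumed values):

    # Minimal sketch of recording audio and inspecting its spectrum;
    # sampling rate and duration are assumed values.
    import numpy as np
    import sounddevice as sd

    fs, seconds = 16000, 2
    recording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
    sd.wait()                                    # block until recording ends
    x = recording.flatten()

    spectrum = np.abs(np.fft.rfft(x))            # magnitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    print("dominant frequency: %.1f Hz" % freqs[np.argmax(spectrum)])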

Additionally, I reviewed the foundations of building a neural network from scratch. After completing this research component, I programmed a basic neural network that finds the optimal parameter matrix by performing gradient descent on a training data set. I’ll explain this a little more. The goal of the neural net is to minimize the mean squared error of categorizing the letters. The input to the neural net is a sample of audio, represented by a vector of some dimension n. A number of hidden layers connect the input to the output, which is a probability distribution over the 26 letters. To get from the input to the output, I form linear combinations of the input feature vector with the parameter weight matrix; the hidden layers are these linear combinations passed through a sigmoid function. To minimize the mean squared error, I need to find the optimal parameter weight matrix. This is done through stochastic gradient descent, which involves choosing one sample from the training dataset, calculating the partial derivative of the current mean squared error with respect to each of the weights, and updating the weights by this derivative. This is repeated for each element in the dataset.
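
A condensed sketch of this idea is shown below (a rough illustration with one hidden layer, sigmoid units, squared-error loss, and SGD; the layer sizes and learning rate are placeholders, not my actual settings):

    # Rough sketch of a one-hidden-layer network trained with SGD on MSE;
    # layer sizes and learning rate are placeholders.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class TinyNet:
        def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
            self.W2 = rng.normal(0, 0.1, (n_out, n_hidden))
            self.lr = lr

        def forward(self, x):
            self.a1 = sigmoid(self.W1 @ x)          # hidden layer
            self.y_hat = sigmoid(self.W2 @ self.a1) # output layer
            return self.y_hat

        def sgd_step(self, x, y):
            """One SGD update on J = 0.5 * ||y_hat - y||^2."""
            y_hat = self.forward(x)
            delta2 = (y_hat - y) * y_hat * (1 - y_hat)
            delta1 = (self.W2.T @ delta2) * self.a1 * (1 - self.a1)
            self.W2 -= self.lr * np.outer(delta2, self.a1)
            self.W1 -= self.lr * np.outer(delta1, x)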

I have finished most of the basic implementation of the neural net. However, currently the accuracy of my algorithm is approximately 35% and needs to be improved greatly. I need to research ways to improve the accuracy, most likely through increasing the number of epochs and adjusting the number of hidden layers in the model. Additionally, I need to test the neural net with the input from our signal processing of the audio. Since this component hasn’t been completed yet, I am currently using a dataset from the Internet that consists of images of letters, rather than signals of letters. 

I believe I am on schedule, as this week I worked on both the signal processing and machine learning components of our project. I will continue to fine-tune the neural net algorithm as well as brainstorm ways to best represent the audio recording as a finite signal.

 

Mohini’s Status Report for 10/02/2020

For our capstone project this week, I set up the basic framework of our web app through Django and installed the necessary libraries and dependencies. Afterwards, I designed the basic wireframes of our web app. This included planning out the different web pages necessary for the complete design and refreshing my HTML/CSS and Django knowledge. After reviewing the plan with my group, we decided to have at least 7 pages: home, login, register, dashboard, behavioral, technical, and profile.

A quick breakdown of these pages (a sketch of the matching URL configuration follows the list):

    • Home: the user is first brought to this page and introduced to iRecruit
      • There are links to the login and register pages as well as a centralized iRecruit logo and background picture. 
    • Login: the user is able to login to their existing account
      • I created a login form that checks for the correct username and password combination. 
    • Register: the user is able to register for our site 
      • I created a registration form where the user can create an account, and the account information is stored in a built-in Django database. 
    • Dashboard: after logging in, the user is brought to this page where they can choose which page they want to view 
      • Currently, this page consists of 3 buttons which lead to either the behavioral, technical or profile pages. 
    • Behavioral Interview: the user can practice video recording themselves here and iRecruit will give real time feedback 
      • There is a “press to start practicing” button which calls Jessica’s eye detection code. 
    • Technical Interview: the user asks for a question, which iRecruit provides, and can input their answer once they have finished solving it
      • This page is still relatively empty. 
    • Profile: this page will store the user’s past behavioral interviews, recorded audio skills, and past technical questions answered 
      • This page is still relatively empty. I have started experimenting with different ways to retrieve the audio.
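
As referenced above, the page breakdown maps onto a URL configuration roughly like this (the view function names are guesses, not our actual identifiers):

    # urls.py -- hypothetical sketch; the view function names are guesses
    from django.urls import path
    from . import views

    urlpatterns = [
        path("", views.home, name="home"),
        path("login/", views.login_view, name="login"),
        path("register/", views.register, name="register"),
        path("dashboard/", views.dashboard, name="dashboard"),
        path("behavioral/", views.behavioral, name="behavioral"),
        path("technical/", views.technical, name="technical"),
        path("profile/", views.profile, name="profile"),
    ]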

My progress is relatively on schedule, as I designated about two weeks to complete the basic web app components and was able to complete more than half of them during this first week. Next steps for me include starting the research component: how to use signal processing to analyze the audio data received from the user while they record their skills.