Team’s Status Update for 10/30/20

This week, we continued working on implementing our respective portions of the project, making progress in the three main parts. Jessica worked on implementing a center-of-frame reference, facial landmark detection, and testing the eye detection portion. She thought it would be helpful to give users a centered guideline to position themselves against during the initial setup phase, so that they have a reference for the center of the video screen. She continued working on the facial landmark detection and was able to get the coordinates of the center of the nose and mouth. The eye detection portion was also tested more, and the results seem to align with the accuracy goal. Next week, she will work on completing the initial setup phase for the facial landmark detection, and would like to complete the off-center screen alignment portion for the nose as well. She will also continue testing both the eye detection and screen alignment parts. 

Mohini finalized the signal processing algorithm and started making the training data set for the neural network algorithm. This week concluded the signal processing portion of our project, so she will be focusing on the machine learning portion, as well as integrating the different components of our project together, for the rest of the semester. Next week, she will be working on testing the neural network after finishing generating the rest of the training data set. 

Shilika began reviewing neural network concepts, as this is the next aspect of the technical interview portion that she will help tackle. She also continued to work on the web application, improving the CSS and front-end features to make the app more user friendly. Next week, she will continue working on the web application and the neural network. 

Shilika’s Status Report for 10/30/20

This week, after finalizing the output of the signal processing, I began to review the concepts of a neural network, which will be the next technical portion of our project. I will be working with Mohini to improve the neural network that we created in a Machine Learning course we previously took. In this algorithm, we use a neural network with a single hidden layer that uses a sigmoid activation function, a softmax function on the output layer, and the cross-entropy loss function to gauge the accuracy of our model. I reviewed the concepts behind these activation functions and how the output layer is formed from the input and hidden layers. 
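For reference, a minimal NumPy sketch of that forward pass is below (our actual course implementation is in Java; the layer sizes here are made-up placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))  # subtract the max for numerical stability
        return e / e.sum()

    def cross_entropy(y_hat, y_true):
        # y_true is a one-hot vector, y_hat is the softmax output
        return -np.sum(y_true * np.log(y_hat + 1e-12))

    def forward(x, W1, b1, W2, b2):
        # Single hidden layer: sigmoid activation, then softmax over the output classes
        hidden = sigmoid(W1 @ x + b1)
        return softmax(W2 @ hidden + b2)

    # Placeholder sizes: 100-dim input, 50 hidden units, 8 output categories
    rng = np.random.default_rng(0)
    x = rng.random(100)
    W1, b1 = 0.01 * rng.standard_normal((50, 100)), np.zeros(50)
    W2, b2 = 0.01 * rng.standard_normal((8, 50)), np.zeros(8)
    y_hat = forward(x, W1, b1, W2, b2)
    loss = cross_entropy(y_hat, np.eye(8)[2])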

I additionally started working on the web application components of our project again. I worked on how to run the Java code in Django and used the “copy path” command to be able to run the code from a separate directory. I also began working on the profile page again, which is where the user will be able to save their skill set and view previously recorded behavioral interviews. I improved the CSS for the profile page to make it more user friendly and began to look at saving the videos locally in Django.
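Looking ahead to the video saving, a Django model along these lines is one way it might look (the model and field names here are hypothetical placeholders, not our actual code):

    # models.py - hypothetical sketch of locally saved recordings and profiles
    from django.db import models

    class InterviewRecording(models.Model):
        # FileField stores the uploaded video under MEDIA_ROOT/interview_videos/
        video = models.FileField(upload_to="interview_videos/")
        recorded_at = models.DateTimeField(auto_now_add=True)

    class Profile(models.Model):
        skill_set = models.TextField(blank=True)
        # ImageField requires the Pillow package to be installed
        photo = models.ImageField(upload_to="profile_photos/", blank=True)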

Next week, my goal is to be able to save the videos in Django and allow the user to upload a profile photo to the profile page. Additionally, as soon as our training data is ready, I will start implementing ways in which our neural network can be improved to classify our 8 output categories.

Mohini’s Status Report for 10/30/2020

This week, I finalized the output for the signal processing algorithm. I applied the Mel scale filterbank implementation to simplify the dimension of the output to 40 x 257. Once I verified that the output seemed reasonable, my next step was to determine the best way to feed it into the neural network. I experimented with passing the Mel scale filterbank representation directly, but these matrices seemed too similar between different words. Since it was the spectrogram visual representation that differed between words, I decided to save it as an image and pass the grayscale version of the image as the input into the neural network. 
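A rough sketch of that save-and-reload step (assuming matplotlib and OpenCV; the file name and colormap are placeholders):

    import cv2
    import matplotlib.pyplot as plt

    def spectrogram_to_input(spec, path="spectrogram.png"):
        # Save the spectrogram matrix as an image, then reload it in grayscale
        plt.imsave(path, spec, cmap="viridis")
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Flatten and scale to [0, 1] so it can be fed into the neural network
        return gray.flatten() / 255.0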

Once I decided this was the best way to represent the input vector, I began creating the training data for the neural network. We currently have about 8 different categories that users can pick their technical questions from. I plan to generate about 10 samples for each category as initial training data. To make a good model, I’d need to generate close to 1000 samples of training data. However, generating each sample requires me to record the word and run the signal processing algorithm, which takes a few minutes. Since this process is somewhat slow, I don’t think it’d be practical to generate more than 100 samples of training data. So far, my training data set has approximately 30 samples. 

This week, I also integrated my neural network code with our Django webapp. I wrote my neural network code in Java, so figuring out a way for our Django webapp to access it was a challenge. I ultimately used an “os.system()” command to call my neural net code from the terminal. Next steps include finishing the training data set as well as passing it through the neural network to view the accuracy of the model.
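The call from Django looks roughly like the following (the class name, classpath, and file names are illustrative placeholders, not our exact commands):

    import os

    def classify_with_java_net(input_csv):
        # os.system runs the command in a shell, exactly as if typed at the terminal
        exit_code = os.system(f"java -cp neural_net/ NeuralNet {input_csv} predictions.txt")
        if exit_code != 0:
            raise RuntimeError("Java neural network call failed")
        with open("predictions.txt") as f:
            return f.read().strip()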


Jessica’s Status Update for 10/30/2020

This week, I worked on implementing a center of frame reference, continuing the facial landmark detection, and testing the eye detection portion. I thought it would be useful to give the user some guidelines at the beginning as to where to position their face. The current implementation draws two thin lines, one spanning the middle of the video frame horizontally and one spanning the middle of the video frame vertically. At the center of the frame (width/2, height/2), there is a circle, which is ideally where the user would center their nose. These guidelines serve as a baseline for what “centered” means, although they do not have to be followed strictly for the facial detection to work.
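In OpenCV, the guideline drawing is essentially the following (a minimal sketch; the colors and line thicknesses are placeholders):

    import cv2

    def draw_center_guidelines(frame):
        # One horizontal and one vertical line through the middle of the frame,
        # plus a small circle at the exact center where the nose should sit
        h, w = frame.shape[:2]
        cv2.line(frame, (0, h // 2), (w, h // 2), (255, 255, 255), 1)
        cv2.line(frame, (w // 2, 0), (w // 2, h), (255, 255, 255), 1)
        cv2.circle(frame, (w // 2, h // 2), 5, (0, 255, 0), 2)
        return frame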

I continued implementing the facial landmark detection portion, building off of the model of all of the facial coordinates that I had from last week. I determined that it would be more helpful to get the locations of the center of the nose and mouth, since these are concrete coordinates that we can base the frame of reference on, rather than a raw array of facial landmark coordinates. I was able to locate the coordinates of the center of the nose and mouth (by looping through the array and pinpointing which entries correspond to the nose and mouth), and I will be using a similar tactic of storing the coordinates in an array during the initial setup period and then taking the average of those coordinates to use as the frame of reference.
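A small sketch of the looping-and-averaging idea, assuming the common 68-point landmark convention (indices 27-35 for the nose and 48-67 for the mouth); the exact indexing in our detector may differ:

    import numpy as np

    NOSE_IDX = range(27, 36)    # nose points in the standard 68-point model
    MOUTH_IDX = range(48, 68)   # mouth points in the standard 68-point model

    def feature_center(landmarks, idx):
        # landmarks is an array of (x, y) coordinates; average the selected points
        pts = np.array([landmarks[i] for i in idx])
        return pts.mean(axis=0)

    def nose_and_mouth_centers(landmarks):
        return feature_center(landmarks, NOSE_IDX), feature_center(landmarks, MOUTH_IDX)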

I tested the eye detection portion some more, and the accuracy seems to be in the range of what we were hoping for. So far, only a few false positives have been detected. I believe we are on track for the facial detection portion, with the eye detection portion working as expected and the facial landmark detection well on its way. Next week, I plan on completing the initial setup phase for the facial landmark part, as well as, hopefully, the off-center screen alignment portion for the nose. I will also do more testing of the eye detection portion and begin testing the initial setup phase.

Team Status Update for 10/23/2020

This week, we continued to work on implementing our respective portions of the project.

Jessica continued to work on the facial detection portion, specifically looking into the saving of videos, the alerts, and the facial landmark part. She was able to get the video saving portion to work, where the VideoWriter class in OpenCV is used to write video frames to an output file. Two options currently exist for the alerts, one with audio and one with visuals; both are able to capture the user’s attention when their eyes are off-centered. She began looking into the facial landmark detection and has a working baseline. Next week, she is hoping to get the center of the nose and mouth coordinates from the facial landmark detection to use as frames of reference for screen alignment. She is also hoping to do more testing for the eye detection portion. 

Shilika worked on the signal processing algorithm and has an input ready for the neural network. She followed the process of applying a pre-emphasis, framing, windowing, and a Fourier transform with a power spectrum to transform the signal into the frequency domain.

Mohini also continued to work on the signal processing algorithm. The team’s decision to categorize entire words, rather than individual letters, reduces many of our anticipated risks. The signals of many of the individual letters looked quite alike, whereas the signals of different words show distinct differences. This change will greatly simplify the classification task and is expected to yield higher accuracy as well. Next steps for Mohini include feeding the signal processing output into the neural network and fine-tuning that algorithm.


Mohini’s Status Report for 10/23/2020

This week, I continued working on the signal processing algorithm that will generate an input to the neural network. As a team, we have decided to make one significant change to our signal processing algorithm. Instead of trying to recognize individual letters, we will be trying to recognize entire words. Essentially, this reduces the scope of our project, because we will be giving the user a list of 10-15 categories to choose a technical question from. This means that our neural network will have 10-15 outputs instead of the original 26 outputs. Additionally, we will only need to run the neural network algorithm once for each word, rather than once for each letter, which will greatly reduce the time it takes to generate a technical question. 

Continuing my work from last week, after making this decision, I tested the rough signal processing algorithm I created last week on these entire words (“array”, “linked list”, etc.). I saw that there were significant differences between different words and enough similarity between repetitions of the same word. Afterwards, I improved the algorithm by using a Hamming window rather than a rectangular window, as this windowing technique reduces the impact of discontinuities present in the original signal. I also started researching the Mel scale and the Mel filterbank implementation. This will simplify the dimension of the signal processing output, so that it will be easier for the neural network to process without losing any crucial information present in the original signal. Next week, I will be focusing on transforming the output using the Mel scale as well as creating a first attempt at a training dataset for the neural network. This will most likely include 10-15 signals representing each word that our neural network will be categorizing. It is important that our training dataset consists of a variety of signals for each word in order to prevent the model from overfitting. 
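The Mel scale itself is a standard mapping from frequency onto a perceptual scale; the conversion that a Mel filterbank is built on looks like this (a NumPy sketch of the standard formulas):

    import numpy as np

    def hz_to_mel(f):
        # Compresses high frequencies, roughly matching how the human ear hears pitch
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)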


Shilika’s Status Update for 10/23/20

This week, I built off of the signal processing work we did in the previous weeks to create the output of the signal processing algorithm. The process after reading the original input file is as follows (a code sketch of these steps appears after the list):

  1. We first apply a pre-emphasis filter to the audio input:
    1. To do this, we use the equation y(t) = x(t) - alpha*x(t-1). The alpha value is a predetermined filter coefficient, usually 0.95 or 0.97.
    2. By doing so, we amplify the high-frequency components of the signal, which improves the signal-to-noise ratio.
  2. We then frame the filtered signal:
    1. Framing is useful because a signal is constantly changing over time; a single Fourier transform over the whole signal would lose these variations through time.
    2. Thus, by taking the Fourier transform of adjacent frames with overlap, we preserve as much of the original signal as possible.
    3. We are using 20 millisecond frames with a 10 millisecond overlap between adjacent frames.
  3. We then apply a Hamming window to each frame:
    1. A Hamming window reduces the effects of leakage that occur when performing a Fourier transform on the data.
    2. To apply it, we use a simple line of code in Python.
  4. Fourier transform and power spectrum:
    1. We can now take the Fourier transform of each frame and compute the power spectrum to be able to distinguish different audio data from each other.
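A rough NumPy-only sketch of these four steps (the FFT size of 512 is a placeholder, and the audio is assumed to be a one-dimensional float array):

    import numpy as np

    def preprocess(audio, sample_rate, alpha=0.97, frame_ms=20, stride_ms=10):
        # 1. Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
        emphasized = np.append(audio[0], audio[1:] - alpha * audio[:-1])

        # 2. Framing: 20 ms frames taken every 10 ms (50% overlap)
        frame_len = int(sample_rate * frame_ms / 1000)
        stride = int(sample_rate * stride_ms / 1000)
        if len(emphasized) < frame_len:
            emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
        num_frames = 1 + (len(emphasized) - frame_len) // stride
        frames = np.stack([emphasized[i * stride : i * stride + frame_len]
                           for i in range(num_frames)])

        # 3. Windowing: multiply each frame by a Hamming window to reduce leakage
        frames = frames * np.hamming(frame_len)

        # 4. Fourier transform and power spectrum of each frame
        nfft = 512
        magnitude = np.abs(np.fft.rfft(frames, nfft))
        return (magnitude ** 2) / nfft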

The output will continue to be modified and enhanced to make our algorithm better, but we now have something to input into our neural network. I began looking into filter banks and MFCCs (Mel-frequency cepstral coefficients), which are two techniques used to transform the data so that it better reflects how the human ear perceives sound. I will continue this next week and, if time allows, help the team with the neural network algorithm. 

Jessica’s Status Update for 10/23/2020

This week, I worked on implementing the saving of practice interview videos, the alerts given to the user, and the facial landmark part for screen alignment. Each time the script is run, a video recording begins, and when the user exits out of the recording, it gets saved (currently to a local directory, but hopefully to a database eventually). This is done through the OpenCV library in Python. Similar to how the VideoCapture class is used to capture video frames, the VideoWriter class is used to write video frames to a video file. Each video frame is written to the video output created at the beginning of main().
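The capture-and-write loop follows the standard OpenCV pattern, roughly like this (the codec, resolution, and frame rate are placeholders, not necessarily the values we use):

    import cv2

    cap = cv2.VideoCapture(0)
    fourcc = cv2.VideoWriter_fourcc(*"XVID")   # placeholder codec
    out = cv2.VideoWriter("interview.avi", fourcc, 20.0, (640, 480))

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        out.write(frame)                       # write each frame to the output file
        cv2.imshow("recording", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # user exits the recording
            break

    cap.release()
    out.release()
    cv2.destroyAllWindows()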

I also worked on implementing the alerts given to the user for subpar eye contact. Originally, I thought of doing an audio alert, specifically playing a bell sound when the user’s eyes are off-center. However, this proved pretty distracting, although effective in getting the user’s attention. Then, I experimented with a message box alert, which pops up when the user’s eyes are off-center. This proved to be another effective way of getting the user’s attention. I plan on experimenting with both of these options some more, but as of now they both work well to alert the user to re-center.
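One possible way to produce the pop-up (this sketch assumes tkinter's message box; the toolkit we end up using may differ):

    import tkinter as tk
    from tkinter import messagebox

    def visual_alert():
        # Show a simple warning box when the user's eyes drift off-center
        root = tk.Tk()
        root.withdraw()   # hide the empty root window
        messagebox.showwarning("iRecruit", "Your eyes are off-center. Please re-center.")
        root.destroy()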

I began researching the facial landmark portion, and have a basic working model of all of the facial coordinates mapped out. Instead of utilizing each facial feature coordinate, I thought it would be more helpful to get the location of the center of the nose and perhaps the mouth. This way, there are definitive coordinates to use for the frame of reference. If the nose and mouth are off-center, then the rest of the face is also off-center. Next week, I plan on attempting to get the coordinates of the center of the nose and mouth utilizing facial landmark detection. This requires going through the landmarks array and figuring out which coordinates correspond to which facial feature. I also plan on doing more testing on the eye detection portion, and getting a better sense of the current accuracy.

Team Status Update for 10/16/20

This week, the team continued researching and implementing their respective parts, with a focus on implementation. A change we made to the facial detection part was in the initial set-up phase. In our proposal presentation, we stated that we wanted to have the initial set-up computed within 5 seconds. However, after testing the program, it turned out that 5 seconds was too short a time, especially if the user is not used to using the system. We increased this time to 10 seconds. 

Jessica worked on implementing the off-center detection and initial setup phase for the eye detection portion of the facial detection part. When a user’s eyes wander, which constitutes subpar eye contact, for up to 5 seconds, iRecruit will alert the user that their eyes are not centered. The frame of reference for “centered” is measured through moments in OpenCV, which calculate the centroid of each iris/pupil image. The center coordinates are calculated for each eye detection, and then the average of all the center coordinates is taken to calculate the reference center coordinates (X and Y). If the user’s eyes stray from this reference center (beyond an allowed range), they are alerted. She also started testing the eye detection portion, and will continue doing this next week. She will also start looking into the screen alignment portion with facial landmark detection. 
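A small sketch of the centroid-and-averaging step described above (OpenCV moments on a thresholded iris/pupil image; the tolerance value is a made-up placeholder):

    import cv2
    import numpy as np

    def iris_centroid(eye_thresh):
        # eye_thresh is a binary (thresholded) image of the iris/pupil region
        m = cv2.moments(eye_thresh)
        if m["m00"] == 0:
            return None
        return (m["m10"] / m["m00"], m["m01"] / m["m00"])

    def reference_center(centroids):
        # Average all centroids collected during the 10-second initial setup
        return np.array(centroids).mean(axis=0)

    def is_off_center(centroid, reference, tolerance=15):
        # Alert if the current centroid falls outside the allowed range around the reference
        return (abs(centroid[0] - reference[0]) > tolerance
                or abs(centroid[1] - reference[1]) > tolerance)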

Mohini worked on implementing the signal processing aspect of the project. From her work last week, the team determined that the time-domain representation of the audio recording was not sufficient, so this week the audio signal was analyzed in the frequency domain. After meeting with the PhD student, we have a couple of ideas to implement for next week (the Hamming window and the log Mel filterbank coefficients). 

Shilika worked on the signal processing portion of the project. She worked with the team to make modifications to the output of the signal processing algorithm. Modifications included splitting the total audio file into 20 millisecond chunks and trimming the file so there is no excess silence. The output still needs further modifications which she will continue working on this coming week. 

Shilika’s Status Report for 10/16/20

This week, I worked with Mohini on the signal processing part. We needed to research and experiment with different ways to trim our audio and scale our x-axis to make all the final outputs the same length. We decided to take a different approach and analyze the Short-Time Fourier Transform (STFT) over 20 millisecond chunks of the whole audio file. After splitting the audio file and applying the Fourier transform to each chunk, we plotted the results on a spectrogram. Unlike before, we were able to see slight similarities when we said the same letter multiple times and differences between the different letters. We additionally met with a PhD student who specializes in speech recognition. He gave us tips on how to further hone our input. For example, he recommended we use a Hamming window with a 50% overlap and scale the frequency values so the numbers aren’t too small. 
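A rough sketch of that analysis using SciPy (the window length follows the 20 ms chunks above, the overlap and log scaling follow the PhD student's tips, and the plotting details are placeholders):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import signal

    def plot_stft(audio, sample_rate):
        # STFT over 20 ms Hamming windows with 50% overlap
        nperseg = int(0.020 * sample_rate)
        f, t, Zxx = signal.stft(audio, fs=sample_rate, window="hamming",
                                nperseg=nperseg, noverlap=nperseg // 2)
        # Log-scale the magnitudes so small values remain visible
        plt.pcolormesh(t, f, np.log(np.abs(Zxx) + 1e-10), shading="auto")
        plt.xlabel("Time (s)")
        plt.ylabel("Frequency (Hz)")
        plt.show()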

I believe I am still on schedule. The goal last week was to have an output ready so we could use it as the input for the neural network. Though the output needs more modifications, we were able to come up with a solution. This coming week, I hope to continue my work on the signal processing portion, add the modifications that were recommended by the PhD student, and solidify the output of the signal processing algorithm.