Mohini’s Status Report for 10/30/2020

This week, I finalized the output of the signal processing algorithm. I applied the Mel scale filterbank implementation to reduce the dimensions of the output to 40 x 257. Once I verified that the output seemed reasonable, my next step was to determine the best way to feed it into the neural network. I experimented with passing the Mel scale filterbank matrices directly, but they seemed too similar between different words. Since it was the spectrogram's visual representation that differed between words, I decided to save the spectrogram as an image and pass the grayscale version of that image as the input to the neural network.
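
Below is a minimal sketch of how the filterbank output could be normalized and saved as a grayscale image for the network; the use of Pillow and the file name are my own choices for illustration, not a fixed part of our pipeline.

```python
import numpy as np
from PIL import Image

def save_grayscale_spectrogram(filterbank_output, out_path="sample.png"):
    """Normalize a Mel filterbank matrix (e.g. 40 x 257) to the 0-255 range
    and save it as an 8-bit grayscale image for the neural network."""
    spec = np.asarray(filterbank_output, dtype=float)
    spec -= spec.min()
    if spec.max() > 0:
        spec = spec / spec.max() * 255.0
    Image.fromarray(spec.astype(np.uint8), mode="L").save(out_path)
```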

Once I decided this was the best way to represent the input vector, I began creating the training data for the neural network. We currently have about 8 different categories that users can pick their technical questions from, and I plan to generate about 10 samples per category as initial training data. To build a good model, I would likely need close to 1,000 training samples. However, generating each sample requires me to record the word and run the signal processing algorithm, which takes a few minutes. Since this process is somewhat slow, I don't think it would be practical to generate more than 100 samples of training data. So far, my training data set has approximately 30 samples.

This week, I also integrated my neural network code with our Django webapp. Since I wrote my neural network code in Java, figuring out a way for our Django webapp to access it was a challenge. I ultimately used an “os.system()” call to invoke my neural network code from the terminal. Next steps include finishing the training data set and passing it through the neural network to evaluate the accuracy of the model.
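
For reference, the call from Django looks roughly like the sketch below; the class name NeuralNet, the classpath, and the output handling are placeholders rather than our actual names.

```python
import os

def classify_recording(image_path):
    """Shell out to the Java neural network from the Django webapp.
    'NeuralNet' and the classpath below are placeholder names."""
    exit_code = os.system(f"java -cp /path/to/classes NeuralNet {image_path}")
    if exit_code != 0:
        raise RuntimeError("neural network process failed")
    # The predicted category would then be read back (e.g. from a file the
    # Java program writes), which is omitted here.
```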

 

Jessica’s Status Update for 10/30/2020

This week, I worked on implementing a center-of-frame reference, continuing the facial landmark detection, and testing the eye detection portion. I thought it would be useful to give the user some guidelines at the beginning as to where to position their face. The current implementation draws two thin lines, one spanning the middle of the video frame horizontally and one spanning it vertically. At the center of the frame (width/2, height/2), there is a circle, which is ideally where the user would center their nose. These guidelines serve as a baseline for what “centered” means, although they do not have to be followed strictly for the facial detection to work.
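
A rough sketch of the overlay is below, assuming frame is a frame read from OpenCV's VideoCapture; the line color and circle radius are arbitrary choices here.

```python
import cv2

def draw_guidelines(frame):
    """Draw a horizontal line, a vertical line, and a center circle so the
    user knows roughly where to position their face."""
    height, width = frame.shape[:2]
    color = (255, 255, 255)  # thin white guide lines
    cv2.line(frame, (0, height // 2), (width, height // 2), color, 1)
    cv2.line(frame, (width // 2, 0), (width // 2, height), color, 1)
    cv2.circle(frame, (width // 2, height // 2), 5, color, 1)
    return frame
```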

I continued implementing the facial landmark detection portion, building off of the model of all of the facial coordinates from last week. I determined that it would be more helpful to get the locations of the centers of the nose and mouth, since these are concrete coordinates we can base the frame of reference on, rather than the full array of facial landmark coordinates. I was able to locate the coordinates of the center of the nose and mouth (by looping through the array and pinpointing which entries correspond to the nose and mouth), and I will use a similar tactic of storing these coordinates in an array during the initial setup period and then taking their average to use as the frame of reference.
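
A sketch of how the nose and mouth centers can be pulled out of the landmark array follows. The index ranges below assume the standard 68-point landmark model (nose 27-35, mouth 48-67); if our model orders its points differently, the ranges would change.

```python
import numpy as np

# Index ranges for the standard 68-point facial landmark model (an assumption).
NOSE_POINTS = list(range(27, 36))
MOUTH_POINTS = list(range(48, 68))

def nose_and_mouth_centers(landmarks):
    """Given a (68 x 2) array of landmark coordinates, return the (x, y)
    centers of the nose and the mouth."""
    pts = np.asarray(landmarks, dtype=float)
    return pts[NOSE_POINTS].mean(axis=0), pts[MOUTH_POINTS].mean(axis=0)
```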

I tested the eye detection portion some more, and the accuracy seems to be in the range we were hoping for. So far, only a few false positives have been detected. I believe we are on a good track for the facial detection portion, with the eye detection working as expected and the facial landmark detection on its way. Next week, I plan on completing the initial setup phase for the facial landmark part and, hopefully, the off-center screen alignment portion for the nose. I will also do more testing of the eye detection portion and begin testing the initial setup phase.

Team Status Update for 10/23/2020

This week, we continued to work on implementing our respective portions of the project.

Jessica continued to work on the facial detection portion, specifically looking into saving videos, the alerts, and the facial landmark part. She was able to get the video saving to work, using the VideoWriter class in OpenCV to write video frames to an output file. There are currently two options for the alerts, one with audio and one with visuals; both are able to capture the user's attention when their eyes are off-center. She began looking into facial landmark detection and has a working baseline. Next week, she is hoping to get the coordinates of the centers of the nose and mouth from the facial landmark detection to use as frames of reference for screen alignment. She is also hoping to do more testing of the eye detection portion.

Shilika worked on the signal processing algorithm and has an input ready for the neural network. She followed the process of applying pre-emphasis, framing, and windowing, and then taking a Fourier transform and computing the power spectrum to transform the signal into the frequency domain.

Mohini also continued to work on the signal processing algorithm. The team's decision to categorize entire words, rather than individual letters, reduces many of our anticipated risks. The signals for many of the individual letters looked quite alike, whereas the signals for different words show distinct differences. This change greatly simplifies the classification task and is expected to yield higher accuracy as well. Next steps for Mohini include feeding the signal processing output into the neural network and fine-tuning that algorithm.

 

Mohini’s Status Report for 10/23/2020

This week I continued working on the signal processing algorithm that will generate an input to the neural network. As a team, we decided to make one significant change to our approach: instead of trying to recognize individual letters, we will recognize entire words. Essentially, this reduces the scope of our project, because we will give the user a list of 10-15 categories to choose a technical question from. This means that our neural network will have 10-15 outputs instead of the original 26. Additionally, we will only need to run the neural network once per word, rather than once per letter, which will greatly reduce the time needed to generate a technical question.

Continuing my work from last week, after making this decision I tested the rough signal processing algorithm I created last week on these entire words (“array”, “linked list”, etc.). I saw that there were significant differences between different words and enough similarity between repetitions of the same word. Afterwards, I improved the algorithm by using a Hamming window rather than a rectangular window, as this windowing technique reduces the impact of discontinuities present in the original signal. I also started researching the Mel scale and the Mel filterbank implementation. This will reduce the dimensionality of the signal processing output so that it is easier for the neural network to process, without losing any crucial information present in the original signal. Next week, I will focus on transforming the output using the Mel scale as well as creating a first attempt at a training dataset for the neural network. This will most likely include 10-15 signals representing each word that our neural network will categorize. It is important that our training dataset consist of a variety of signals for each word in order to prevent the model from overfitting.
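
For reference, below is a sketch of the standard triangular Mel filterbank computation I have been reading about, using the common Hz-to-Mel mapping mel = 2595 * log10(1 + f/700). The 40 filters and 512-point FFT match the dimensions mentioned elsewhere in these reports, but this is illustrative rather than our final code.

```python
import numpy as np

def mel_filterbank(power_frames, sample_rate, nfft=512, n_filters=40):
    """Apply a 40-filter triangular Mel filterbank to power-spectrum frames
    of shape (num_frames, nfft // 2 + 1) and return log filterbank energies."""
    # Evenly spaced points on the Mel scale, converted back to Hz.
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz_points / sample_rate).astype(int)

    # Build the triangular filters.
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)

    # Each row of the result holds one frame's 40 log Mel energies.
    return np.log10(np.dot(power_frames, fbank.T) + 1e-10)
```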

 

Shilika’s Status Update for 10/23/20

This week, I built off of the signal processing work we did in the previous weeks to create the output of the signal processing algorithm. The process after reading the original input file is as follows:

  1. We first apply a pre-emphasis on the audio input:
    1. To do this, we use the equation y(t) = x(t) - alpha*x(t-1). The alpha value is a predetermined filter coefficient, usually 0.95 or 0.97.
    2. This improves the signal-to-noise ratio by amplifying the high-frequency components of the signal.
  2. We then frame the updated signal:
    1. Framing is useful because a signal is constantly changing over time. A single Fourier transform over the whole signal would lose these variations through time.
    2. Thus, by taking the Fourier transform of adjacent frames with overlap, we preserve as much of the original signal as possible.
    3. We are using 20 millisecond frames with 10 milliseconds of overlap between adjacent frames.
  3. We then apply a Hamming window to each frame:
    1. A Hamming window reduces the spectral leakage that occurs when performing a Fourier transform on the data.
    2. Applying it only takes a single line of Python, since each frame is simply multiplied by the window.
  4. Fourier Transform and Power Spectrum:
    1. We can now take the Fourier transform of each windowed frame and compute the power spectrum, which lets us distinguish different audio data from each other (a condensed code sketch of these steps appears after this list).
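
As a reference, here is a minimal NumPy sketch of steps 1-4 above. It assumes the recording has already been loaded as a mono signal array with a known sample rate; the function name, the 512-point FFT size, and the exact constants are illustrative rather than our final implementation.

```python
import numpy as np

def preprocess(signal, sample_rate, alpha=0.97, frame_ms=20, overlap_ms=10, nfft=512):
    """Pre-emphasis, framing, Hamming windowing, FFT, and power spectrum."""
    signal = np.asarray(signal, dtype=float)

    # 1. Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2. Framing: 20 ms frames that overlap by 10 ms.
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * (frame_ms - overlap_ms) / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // step
    frames = np.stack([emphasized[i * step:i * step + frame_len]
                       for i in range(num_frames)])

    # 3. Hamming window applied to every frame (a single multiplication).
    frames = frames * np.hamming(frame_len)

    # 4. FFT (zero-padded or truncated to nfft points) and power spectrum.
    magnitude = np.abs(np.fft.rfft(frames, nfft))
    return (magnitude ** 2) / nfft
```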

The output will continue to be modified and enhanced to improve our algorithm, but we now have something to input into our neural network. I began looking into filter banks and MFCCs, which are two techniques that warp the frequency data to better match how the human ear perceives sound. I will continue this next week and, if time allows, help the team with the neural network algorithm.

Jessica’s Status Update for 10/23/2020

This week, I worked on implementing the saving of practice interview videos, the alerts given to the user, and the facial landmark part for screen alignment. Each time the script is run, a video recording begins, and when the user exits out of the recording, it gets saved (currently to a local directory, but eventually, we hope, to a database). This is done through the OpenCV library in Python. Similar to how the VideoCapture class is used to capture video frames, the VideoWriter class is used to write video frames to a video file. Each video frame is written to the video output created at the beginning of main().
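
A sketch of the capture-and-save loop with VideoCapture and VideoWriter is below; the codec, frame rate, and output path are illustrative (in our code the writer is created at the start of main()).

```python
import cv2

def record_interview(out_path="practice_interview.avi"):
    """Capture webcam frames and write them to a video file until the
    user presses 'q' to exit."""
    cap = cv2.VideoCapture(0)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"XVID")
    writer = cv2.VideoWriter(out_path, fourcc, 20.0, (width, height))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)  # save this frame to the output file
        cv2.imshow("Practice Interview", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    writer.release()
    cv2.destroyAllWindows()
```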

I also worked on implementing the alerts given to the user for subpar eye contact. Originally, I thought of using an audio alert, particularly playing a bell sound when the user's eyes are off-center. However, this proved pretty distracting, although effective at getting the user's attention. Then, I experimented with a message box alert, which pops up when the user's eyes are off-center. This proved to be another effective way of getting the user's attention. I plan on experimenting with both of these options some more, but as of now they both work well to alert the user to re-center.
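
As an illustration of the message box option, something like the sketch below works; the use of tkinter here is an assumption for illustration, not necessarily the library we will settle on, and the audio option would simply play a bell sound instead.

```python
import tkinter as tk
from tkinter import messagebox

def visual_alert():
    """Pop up a warning box asking the user to re-center their eyes."""
    root = tk.Tk()
    root.withdraw()  # hide the empty root window
    messagebox.showwarning("iRecruit", "Your eyes are off-center. Please re-center.")
    root.destroy()
```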

I began researching the facial landmark portion and have a basic working model with all of the facial coordinates mapped out. Instead of utilizing every facial feature coordinate, I thought it would be more helpful to get the location of the center of the nose and perhaps the mouth. This way, there are definitive coordinates to use for the frame of reference: if the nose and mouth are off-center, then the rest of the face is also off-center. Next week, I plan on getting the coordinates of the center of the nose and mouth using facial landmark detection. This requires going through the landmarks array and figuring out which coordinates correspond to which facial feature. I also plan on doing more testing of the eye detection portion to get a better sense of the current accuracy.

Team Status Update for 10/16/20

This week, the team continued researching and implementing their respective parts, focusing particularly on implementation. One change we made to the facial detection part was in the initial set-up phase. In our proposal presentation, we stated that we wanted the initial set-up computed within 5 seconds. However, after testing the program, it turned out that 5 seconds was too short a time, especially if the user is not used to the system, so we increased this time to 10 seconds.

Jessica worked on implementing the off-center detection and initial setup phase for the eye detection portion of the facial detection part. When a user's eyes wander, which constitutes subpar eye contact, for up to 5 seconds, iRecruit will alert the user that their eyes are not centered. The frame of reference for “centered” is measured through moments in OpenCV, which calculate the centroid of each iris/pupil image. The center coordinates are calculated for each eye detection, and then the average of all the center coordinates is taken to calculate the reference center coordinates (X and Y). If the user's eyes drift from this reference center beyond an allowed range, they are alerted. She also started testing the eye detection portion and will continue doing so next week. She will also start looking into the screen alignment portion with facial landmark detection.

Mohini worked on implementing the signal processing aspect of the project. From her work last week, the team determined that the time-domain representation of the audio recording was not sufficient, so this week the audio signal was analyzed in the frequency domain. After meeting with the PhD student, we have a couple of ideas to implement next week (the Hamming window and the log Mel filterbank coefficients).

Shilika worked on the signal processing portion of the project. She worked with the team to make modifications to the output of the signal processing algorithm. Modifications included splitting the total audio file into 20 millisecond chunks and trimming the file so there is no excess silence. The output still needs further modifications which she will continue working on this coming week. 

Shilika’s Status Report for 10/16/20

This week, I worked with Mohini on the signal processing part. We needed to research and experiment with different ways to trim our audio and scale our x-axis so that all the final outputs are the same length. We decided to take a different approach and analyze the Short-Time Fourier Transform (STFT) over 20 millisecond chunks of the whole audio file. After splitting the audio file and applying the Fourier transform to each chunk, we plotted the results on a spectrogram. Unlike before, we were able to see slight similarities when we said the same letter multiple times and differences between different letters. We also met with a PhD student who specializes in speech recognition. He gave us tips on how to further hone our input; for example, he recommended we use a Hamming window with 50% overlap and scale the frequency values so the numbers aren't too small.
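
Below is a sketch of the spectrogram comparison using SciPy's STFT with 20 ms Hamming windows and 50% overlap, as the PhD student recommended; the file path is a placeholder and the recording is assumed to be mono.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import stft

def plot_spectrogram(wav_path, frame_ms=20):
    """Plot the log-magnitude STFT of a (mono) recording so that repetitions
    of the same letter or word can be compared visually."""
    sample_rate, audio = wavfile.read(wav_path)
    nperseg = int(sample_rate * frame_ms / 1000)
    freqs, times, Z = stft(audio, fs=sample_rate, window="hamming",
                           nperseg=nperseg, noverlap=nperseg // 2)
    plt.pcolormesh(times, freqs, np.log10(np.abs(Z) + 1e-10), shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
```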

I believe I am still on schedule. The goal last week was to have an output ready to use as the input for the neural network. Though the output needs more modifications, we were able to come up with a solution. This week, I hope to continue my work on the signal processing portion, add all of the modifications recommended by the PhD student, and solidify the output of the signal processing algorithm.

Mohini’s Status Report for 10/16/2020

This week, I primarily focused on the signal processing aspect of our project. Last week involved saving the audio file that the user records as an integer vector and recognizing that the time-domain signal was not a sufficient approach, since different recordings of the same letter produced signals with similar shapes but different amplitudes. This led to the idea of analyzing the signal in the frequency domain. After taking the Fourier transform of the time-domain signal, we realized that this was also not sufficient, as the Fourier transform of every letter had a peak at the low frequencies and another peak at the higher frequencies. After doing a little more research, we decided to analyze the Short-Time Fourier Transform (STFT) over 20 ms chunks of the audio clip. We plotted this on a spectrogram, where it was easier to see similarities between recordings of the same letter and differences between different letters.

The team and I spent a good amount of time trying to understand why this was the case and how to proceed. We met with a PhD student who specializes in speech processing to get some guidance. He told us to use a Hamming window with 50% overlap instead of a rectangular window with no overlap (which we had previously been using) when determining the STFT. Additionally, he told us to look into log Mel filterbanks, which scale the frequency values to the perceptual scale that human ears are used to. We plan to implement these two features in the upcoming week. I believe my work is roughly on schedule, as determining the signal processing output is a crucial part of our project that we allocated several weeks to implement.

 

Jess’ Status Update for 10/16/2020

This week, I worked on implementing off-center detection and the initial set-up phase for the eye detection portion of the facial detection part. When a user's eyes are wandering around (subpar eye contact) for up to 5 seconds, the system will alert the user that their eyes are not centered. Centered in this case means within a certain range of the coordinates detected during the initial setup phase. The center coordinates of the irises/pupils are found using moments in OpenCV, which compute the centroid of the iris/pupil image (https://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html). The center is with respect to a specific origin, which is the left edge of each iris (in other words, 0 is the left edge of each iris, not the left edge of the screen). Each eye actually has the same center because of this origin reference.
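
A sketch of the centroid computation is below, assuming iris is the thresholded, single-channel image of one iris/pupil region.

```python
import cv2

def iris_center(iris):
    """Return the (x, y) centroid of a binary iris/pupil image using image
    moments, or None if nothing was detected."""
    m = cv2.moments(iris)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])
```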

The center coordinates are calculated for each eye detection and stored in an array. After the 10 seconds of initial set-up (increased from the 5 seconds in the proposal presentation, because 5 seconds was too short), the average of all the center coordinates is taken to calculate the reference center coordinates (X and Y). This reference center is what the program refers to when deciding whether or not the user's eyes are “off-center.” I also started doing some formal testing, where I keep track of whether or not a user is alerted within 5 seconds of their eyes wandering around. If they are, this counts as a passing test. If they are not, this counts as a failing test (false negative). If the user is alerted but their eyes were not wandering, this is also a failing test (false positive).
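
A sketch of the setup averaging and the off-center check follows; the pixel tolerance is a placeholder value, not a tuned threshold.

```python
import numpy as np

def reference_center(setup_centers):
    """Average the (x, y) centers collected during the 10-second setup phase."""
    return np.mean(np.asarray(setup_centers), axis=0)

def is_off_center(center, reference, tolerance=5.0):
    """Return True if the current eye center drifts outside the allowed
    range (in pixels) around the reference center."""
    return (abs(center[0] - reference[0]) > tolerance or
            abs(center[1] - reference[1]) > tolerance)
```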

I believe that I am on schedule, as getting the off-center detection and initial set-up phase for the eye detection working is a big part of the facial detection portion. Next week, I plan on continuing to test the eye detection part, particularly on other users' eyes (I will ask my friends if they want to volunteer). I also want to start the screen alignment portion and research more about facial landmark detection in OpenCV.