Mohini’s Status Report for 10/23/2020

This week, I continued working on the signal processing algorithm that will generate an input to the neural network. As a team, we have decided to make one significant change to our signal processing algorithm. Instead of trying to recognize individual letters, we will be trying to recognize entire words. Essentially, this reduces the scope of our project, because we will be giving the user a list of 10-15 categories to choose a technical question from. This means that our neural network will have 10-15 outputs instead of the original 26 outputs. Additionally, we will only need to run the neural network algorithm once for each word, rather than once for each letter, which will greatly reduce the time it takes to generate a technical question. 

Continuing my work from last week, after making this decision, I tested the rough signal processing algorithm I created last week on these entire words (“array”, “linked list”, etc.). I saw that there were significant differences between different words and enough similarity between repetitions of the same word. Afterwards, I improved the algorithm by using a Hamming window rather than a rectangular window, since this windowing technique reduces the impact of the discontinuities introduced at the frame boundaries. I also started researching the Mel scale and the Mel filterbank implementation. This will reduce the dimensionality of the signal processing output, so that it will be easier for the neural network to process without losing any crucial information present in the original signal. Next week, I will be focusing on transforming the output using the Mel scale as well as creating a first attempt at a training dataset for the neural network. This will most likely include 10-15 signals representing each word that our neural network will be categorizing. It is important that our training dataset consist of a variety of signals for each word in order to prevent the model from overfitting. 
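As a rough illustration of the Mel filterbank step I am researching, here is a sketch that builds a bank of triangular filters spaced on the Mel scale and applies it to a frame-by-frame power spectrum. The sample rate, FFT size, and filter count below are placeholder assumptions, not our final parameters.

```python
import numpy as np

def hz_to_mel(f):
    # Standard conversion from frequency in Hz to the Mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sample_rate=16000, n_fft=512, n_filters=26):
    """Bank of triangular filters spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

# power_spectrum: shape (num_frames, n_fft // 2 + 1), e.g. from the STFT step.
# filter_energies = power_spectrum @ mel_filterbank().T
# log_energies = np.log(filter_energies + 1e-10)  # log compresses the dynamic range
```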

 

Shilika’s Status Update for 10/23/20

This week, I built on the signal processing work we did in the previous weeks to create the output of the signal processing algorithm. The process after reading the original input file is as follows (a rough code sketch of these steps appears after the list):

  1. We first apply a pre-emphasis on the audio input:
    1. To do this, we use the equation y(t) = x(t) - alpha * x(t-1), where the alpha value is a predetermined filter coefficient, usually 0.95 or 0.97.
    2. By doing so, we improve the signal-to-noise ratio by amplifying the high-frequency components of the signal, which tend to have smaller magnitudes.
  2. We then frame the updated signal:
    1. Framing is useful because a signal is constantly changing over time. If we did a single Fourier transform over the whole signal, we would lose these variations through time.
    2. Thus, by taking the Fourier transform of adjacent frames with overlap, we will preserve as much of the original signal as possible.
  3. We then apply a Hamming window to each frame:
    1. We are using 20 millisecond frames taken every 10 milliseconds, so adjacent frames overlap by 50%.
    2. A Hamming window reduces the effects of spectral leakage that occur when performing a Fourier transform on the data.
    3. Applying it takes a single line of code in Python.
  4. Fourier Transform and Power Spectrum:
    1. We can now do the Fourier Transform on the data and compute the power spectrum to be able to distinguish different audio data from each other.
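A minimal sketch of these four steps in Python with NumPy follows. The frame sizes match the list above; the input file name, the assumption of a mono recording, and the FFT size are placeholders for illustration.

```python
import numpy as np
from scipy.io import wavfile

sample_rate, x = wavfile.read("recording.wav")   # hypothetical mono input file
x = x.astype(np.float64)

# 1. Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
alpha = 0.97
y = np.append(x[0], x[1:] - alpha * x[:-1])

# 2. Framing: 20 ms frames taken every 10 ms (adjacent frames overlap)
frame_len = int(0.020 * sample_rate)
frame_step = int(0.010 * sample_rate)
num_frames = 1 + max(0, (len(y) - frame_len) // frame_step)
idx = (np.tile(np.arange(frame_len), (num_frames, 1))
       + np.tile(np.arange(num_frames) * frame_step, (frame_len, 1)).T)
frames = y[idx]

# 3. Hamming window applied to every frame
frames *= np.hamming(frame_len)

# 4. Fourier transform and power spectrum
n_fft = 512
magnitude = np.abs(np.fft.rfft(frames, n_fft))   # one spectrum per frame
power_spectrum = (magnitude ** 2) / n_fft
```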

The output will continue to be modified and enhanced to make our algorithm better, but we now have something to input into our neural network. I began looking into filter banks and MFCCs, which are two techniques that transform the data to better match how the human ear perceives sound. I will continue this next week and, if time allows, help the team with the neural network algorithm. 

Jessica’s Status Update for 10/23/2020

This week, I worked on saving practice interview videos, the alerts given to the user, and the facial landmark part for screen alignment. Each time the script is run, a video recording begins, and when the user exits out of the recording, it gets saved (currently to a local directory, but eventually, we hope, to a database). This is done through the OpenCV library in Python. Similar to how the VideoCapture class is used to capture video frames, the VideoWriter class is used to write video frames to a video file. Each video frame is written to the video output created at the beginning of main().
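Here is a hedged sketch of the capture-and-save loop described above; the webcam index, codec, frame rate, and output path are assumptions rather than our final settings.

```python
import cv2

cap = cv2.VideoCapture(0)                            # default webcam
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fourcc = cv2.VideoWriter_fourcc(*"XVID")             # codec choice is an assumption
out = cv2.VideoWriter("practice_interview.avi", fourcc, 20.0, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(frame)                                 # write each frame to the file
    cv2.imshow("Recording", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):            # user exits the recording
        break

cap.release()
out.release()
cv2.destroyAllWindows()
```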

I also worked on implementing the alerts given to the user for subpar eye contact. Originally, I thought of doing an audio alert, specifically playing a bell sound when the user’s eyes are off-center. However, this proved pretty distracting, although effective in getting the user’s attention. Then, I experimented with a message box alert, which pops up when the user’s eyes are off-center. This proved to be another effective way of getting the user’s attention. I plan on experimenting with both of these options some more, but as of now they both work well at alerting the user to re-center.
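As one possible way to wire up the message box alert, here is a sketch using Python's built-in tkinter; the actual GUI library and message text may differ.

```python
import tkinter as tk
from tkinter import messagebox

def alert_off_center():
    """Pop up a message box asking the user to re-center their eyes."""
    root = tk.Tk()
    root.withdraw()                  # hide the empty root window
    messagebox.showwarning("iRecruit", "Your eyes are off-center. Please re-center.")
    root.destroy()
```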

I began researching the facial landmark portion, and have a basic working model with all of the facial coordinates mapped out. Instead of utilizing every facial feature coordinate, I thought it would be more helpful to get the location of the center of the nose and perhaps the mouth. This way, there are definitive coordinates to use for the frame of reference. If the nose and mouth are off-center, then the rest of the face is also off-center. Next week, I plan on attempting to get the coordinates of the center of the nose and mouth utilizing facial landmark detection. This requires going through the landmarks array and figuring out which coordinates correspond to which facial feature. I also plan on doing more testing on the eye detection portion, and getting a better sense of the current accuracy.
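As a sketch of how the nose and mouth coordinates could be pulled out next week, the snippet below assumes dlib's standard 68-point landmark predictor (in that numbering, index 30 is the nose tip and indices 48-67 outline the mouth); our final landmark model and indices may differ.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def nose_and_mouth_centers(frame):
    """Return (nose_center, mouth_center) pixel coordinates, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    nose = (landmarks.part(30).x, landmarks.part(30).y)
    mouth_pts = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)]
    mouth = (sum(p[0] for p in mouth_pts) // len(mouth_pts),
             sum(p[1] for p in mouth_pts) // len(mouth_pts))
    return nose, mouth
```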

Team Status Update for 10/16/20

This week, the team continued researching and implementing their respective parts, focusing particularly on implementation. A change we made to the facial detection part was in the initial set-up phase. In our proposal presentation, we stated that we wanted to have the initial set-up computed within 5 seconds. However, after testing the program, it turned out that 5 seconds was too short, especially if the user is not used to the system. We increased this time to 10 seconds. 

Jessica worked on implementing the off-center detection and initial set-up phase for the eye detection portion of the facial detection part. When a user’s eyes are wandering around (which constitutes subpar eye contact) for up to 5 seconds, iRecruit will alert the user that their eyes are not centered. The frame of reference for “centered” is measured through moments in OpenCV, which calculate the centroid of each iris/pupil image. The center coordinates are calculated for each eye detection, and then the average of all the center coordinates is taken to calculate the reference center coordinates (X and Y). If the user’s eyes differ from this reference center beyond a set range, they are alerted. She also started testing the eye detection portion, and will continue doing this next week. She will also start looking into the screen alignment portion with facial landmark detection. 

Mohini worked on implementing the signal processing aspect of the project. From her work last week, the team determined that the time signal representation of the audio recording was not sufficient, so this week the audio signal was analyzed in the frequency domain. After meeting with the PhD student, we have a couple of ideas to implement for next week (the Hamming window and the log mel filterbank coefficients). 

Shilika worked on the signal processing portion of the project. She worked with the team to make modifications to the output of the signal processing algorithm. Modifications included splitting the total audio file into 20 millisecond chunks and trimming the file so there is no excess silence. The output still needs further modifications, which she will continue working on this coming week. 

Mohini’s Status Report for 10/16/2020

This week, I primarily focused on the signal processing aspect of our project. Last week, I saved the audio file that the user records as an integer vector and recognized that the time domain signal was not a sufficient approach to categorizing signals, since different recordings of the same letter resulted in signals with similar shapes but different amplitudes. Therefore, this week I analyzed the signal in the frequency domain. After taking the Fourier Transform of the time domain signal, we realized that this was also not a sufficient approach, as the Fourier Transform of every letter had a peak at the low frequencies and another peak at the higher frequencies. After doing a little more research, we decided to analyze the Short Time Fourier Transform (STFT) over 20 ms chunks of the audio clip. This was plotted on a spectrogram, and it was easier to determine similarities between recordings of the same letter and differences between different letters. 
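A rough sketch of this analysis with SciPy is below, matching the 20 ms chunks and the rectangular (boxcar), no-overlap windowing we were using at this point; the input file name is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

sample_rate, audio = wavfile.read("letter_a.wav")        # hypothetical recording
nperseg = int(0.020 * sample_rate)                        # 20 ms chunks

# STFT with a rectangular window and no overlap between chunks
f, t, Zxx = signal.stft(audio.astype(float), fs=sample_rate,
                        window="boxcar", nperseg=nperseg, noverlap=0)

plt.pcolormesh(t, f, np.abs(Zxx))
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram of the recorded letter")
plt.show()
```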

The team and I spent a good amount of time trying to understand why this was the case and how to proceed. We met with a PhD student, who specializes in speech processing, to get some guidance. He told us to use a Hamming window with 50% overlap, instead of the rectangular window with no overlap we had previously been using, when determining the STFT. Additionally, he told us to look into log mel filterbanks, which scale the frequency values to a perceptual scale that better matches human hearing. We plan to implement these two features in the upcoming week. I believe my work is roughly on schedule, as determining the signal processing output is a crucial part of our project that we allocated several weeks to implement.

 

Jess’ Status Update for 10/16/2020

This week, I worked on implementing off-center detection and the initial set-up phase for the eye detection portion of the facial detection part. When a user’s eyes are wandering around (subpar eye contact) for up to 5 seconds, the system will alert the user that their eyes are not centered. Centered in this case means within a certain range of the coordinates detected during the initial set-up phase. The center coordinates of the irises/pupils are found using moments in OpenCV, which find the centroid of the iris/pupil image (https://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html). The center is with respect to a specific origin, which is the left edge of each iris (in other words, 0 is the left edge of each iris, not the left edge of the screen). Each eye actually ends up with the same center because of this origin reference.

The center coordinates are calculated for each eye detection and stored in an array. After the 10 seconds of initial set-up (changed from 5 seconds in the proposal presentation, because 5 seconds was too short), the average of all the center coordinates is taken to calculate the reference center coordinates (X and Y). This reference center is what the program uses to determine whether or not the user’s eyes are “off-center.” I also started doing some formal testing, where I keep track of whether or not a user is alerted within 5 seconds if their eyes are wandering around. If they are, then this constitutes a passing test. If they are not, then this constitutes a failing test (false negative). If the user is alerted, but their eyes were not wandering around, this is also a failing test (false positive).
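A minimal sketch of the centroid and reference-center logic described above, assuming the iris/pupil region has already been thresholded into a binary image; the off-center tolerance is a placeholder value.

```python
import cv2
import numpy as np

def iris_center(binary_iris_img):
    """Centroid of a thresholded iris/pupil image via OpenCV moments."""
    m = cv2.moments(binary_iris_img)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

def reference_center(setup_centers):
    """Average the centers collected during the 10-second set-up phase."""
    return tuple(np.mean(np.array(setup_centers), axis=0))   # (ref_x, ref_y)

def is_off_center(center, reference, tolerance=10):
    """True if a detected center strays outside the allowed range of the reference."""
    return (abs(center[0] - reference[0]) > tolerance or
            abs(center[1] - reference[1]) > tolerance)
```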

I believe that I am on schedule, as getting the off-center detection and initial set-up phase for the eye detection working is a big part of the facial detection portion. Next week, I plan on continuing to test the eye detection part, particularly on other users’ eyes (I will ask my friends if they want to volunteer). I also want to start the screen alignment portion, and research more about facial landmark detection in OpenCV.

Team Status Update for 10/09/2020

This week, the team continued researching and implementing their parts as laid out in the schedule and Gantt chart. The implementation of the project is split up into three main parts: facial detection, signal processing, and machine learning. Each team member is assigned to one of these parts and has been working on making progress with the relevant tasks. A change that was made to the overall design is that for facial detection, we are no longer going to be detecting posture, because there are too many unknown factors surrounding the task. There were ideas thrown around for defining “good” posture as sitting up straight (e.g., the distance between shoulders and face) or not having your head down (e.g., the mouth is missing). However, if you have long hair, for instance, shoulder detection is not possible (and we cannot force a user to tie up their hair). Additionally, if your mouth is missing, then we will be unable to detect a face at all, which is another problem. 

Jessica continued working on the facial detection portion, and was able to get the eye detection part from last week working in real-time with video. Now, when the script is run, OpenCV’s VideoCapture class is called to capture the frames of the video. The eye detection is then performed on each of these frames to attempt to detect the user’s face and eyes, and the irises/pupils within the eyes. A green circle is drawn around each iris/pupil to keep track of their locations. She is planning to get the off-center detection and initial set-up stage done next week, as well as start formally testing the eye detection. 

Mohini started researching how to best represent the audio the user submits as a finite signal. Currently, she is able to save the audio as a finite vector representing the amplitude of the signal, sampled in accordance with the Nyquist theorem. She is working on identifying patterns for different signals representing the same letter through analysis of the Fourier Transform. Additionally, Mohini reviewed her knowledge of neural networks and started working on a basic implementation. While she still has a significant amount of work left to complete the algorithm and improve the accuracy, she has a good understanding of what needs to be done. She will continue to work on and research both the signal processing and machine learning components of the project in the coming week. 

Shilika continued to work on the web application, the question databases, and the signal processing portion. She was able to complete the profile page and the behavioral questions database. She made progress on the technical questions database and on the signal processing (speech recognition) work. She hopes to have a first round of completed input for the speech recognition neural network by next week.

Jess’ Status Update for 10/09/2020

This week, I worked on implementing the real-time portion of the facial detection part of our project. I wanted to get eye detection working with video, so when a user eventually records themselves for their interview practice, we are able to track their eyes as they are recording. I was able to do this using Python’s OpenCV library, which has a VideoCapture class to capture video files, image sequences, and cameras. By utilizing this, we are able to continue reading video frames until the user quits out of the video capture. While we are reading video frames, we attempt to detect the user’s face and eyes, and then the irises/pupils within the eyes. The irises/pupils are detected using blob detection (available through the OpenCV library) and a threshold (to determine the cutoff of what becomes black and white), which allows us to process the frame to reveal where the irises/pupils are. Currently, a green circle is drawn around each iris/pupil, like so (looks slightly scary):

The eye detection works pretty well for the most part, although the user does have to be in a certain position and may have to adjust accordingly. This is why we plan on having the initial set-up phase at the beginning of the process. I believe that I am on-schedule, as getting the detection to work in real-time was a main goal for this part of the project. Next week, I plan on getting the off-center detection working as well as the initial set-up phase done. I want to give the user time to align themselves, so that the program can keep track of the “centered” eye coordinates, and then detect whether the eyes are off-center from there. I also need to start formally testing this part of the facial detection.
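For reference, here is a hedged sketch of the threshold-and-blob-detection step described above; the threshold value and the detector's default parameters are assumptions.

```python
import cv2

def detect_pupil(eye_gray, threshold=42):
    """Threshold an eye region and locate the dark blob that is the iris/pupil."""
    _, binary = cv2.threshold(eye_gray, threshold, 255, cv2.THRESH_BINARY)
    binary = cv2.erode(binary, None, iterations=2)    # clean up small specks
    detector = cv2.SimpleBlobDetector_create()        # detects dark blobs by default
    return detector.detect(binary)

# Each returned keypoint has a center (kp.pt) and size, which is what the
# green circle is drawn around:
#   for kp in detect_pupil(eye_gray):
#       cv2.circle(frame, (int(kp.pt[0]), int(kp.pt[1])),
#                  int(kp.size / 2), (0, 255, 0), 2)
```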

Team Status Update for 10/02/2020

This past week, the team mostly did initial set-up and began the research/implementation process. We wanted to get all of our environments up and running, so that we could have a centralized platform for implementing features. We decided to create a GitHub repository for everyone to access the code and make changes. Each team member is working on their own branch and will make a pull request to master when they are ready to merge. One of the risks that could jeopardize the success of the project would be frequent merge conflicts, where team members overwrite each other’s code. By making pull requests from our individual branches, we ensure that master only contains the most up-to-date, working code. No major changes were made to the existing design of the system, as this week was mostly spent familiarizing ourselves with the project environment and getting started with small components of the project. 

Jessica started researching and implementing the facial detection part for the behavioral interview portion. She followed the existing design of the system, where a user’s eyes will be detected and tracked to ensure that they make eye contact with the camera. She used the Haar cascades from the OpenCV library in Python to detect a face and eyes in an image. She is planning to complete the real-time portion next week, where eye detection is done on a video stream. 
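A minimal sketch of Haar cascade face and eye detection on a single image, using the classifier files that ship with OpenCV; the test image path is a placeholder.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

img = cv2.imread("test_face.jpg")                    # hypothetical test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
    face_roi = gray[y:y + h, x:x + w]                # search for eyes inside the face
    for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(face_roi):
        cv2.rectangle(img, (x + ex, y + ey), (x + ex + ew, y + ey + eh), (0, 255, 0), 2)

cv2.imwrite("test_face_detected.jpg", img)
```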

Mohini started designing the basic web app pages and connecting the pages together through various links and buttons. A good portion of her time was dedicated to CSS and making the style of each element visually appealing. Towards the end of the week, she started researching different ways to extract the recorded audio from the user as well as the best way to analyze it.

Shilika focused on creating the navigation bars that will appear across the web pages, which will allow the user to easily go from one page to another. She also began creating databases for the behavioral and technical interview questions and began preliminary steps of the speech processing algorithm. Next week, she will continue working on the web pages on the application and populating the database.

Some photos of the initial wireframe of our web app:

Shilika’s Status Report for 10/02/2020

This week, I created the navigation bars that will be used across the pages in our web application. The top navigation bar has three main components:

  1. The menu button opens the side navigation bar. It has two additional buttons, one that leads you to the behavioral interview page and the other that leads you to the technical interview page.
  2. The profile button leads you to your profile page.
  3. The help button leads you to a page in which our web application’s features are explained.

The side navigation bar has two buttons that lead you to the behavioral and technical interview pages.

I also began creating the behavioral and technical question databases. I used online resources to gather common questions that are asked in behavioral and technical interviews for software engineering roles. 

Lastly, I researched the steps of our speech processing algorithm to detect letters that the user will speak. So far, I have been able to successfully read the audio, convert it to an integer array, and graph the audio. These preliminary steps are the foundation of creating the data we will feed into our neural network.
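A short sketch of those preliminary steps (the file name is a placeholder):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile

sample_rate, audio = wavfile.read("sample_word.wav")   # audio comes back as an integer array
time_axis = np.arange(len(audio)) / sample_rate

plt.plot(time_axis, audio)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Recorded audio in the time domain")
plt.show()
```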

I believe that my progress is on schedule. Next week, I aim to complete the CSS and HTML for the user profile page, finish collecting questions for the databases, and get a solid understanding of how the Fourier transform can be used in Python to pre-process the audio signal we are receiving.