Weekly Status Report

Team Status

We successfully integrated our system and ran it end-to-end with decent results for the April 4 demo. The two major accomplishments were:

1. Integrating all of the individual processors and persisting speaker data (i.e., speech embeddings) for easy lookup in our database. We faced a number of challenges completing this process due to race conditions and serialization problems, but eventually we got the system running correctly.

2. Implementing speaker classification by training per-speaker models on our own data (we recorded friends and family to get roughly 10-12 internal speakers) along with the external held-out set from Voxceleb.

The Yolo app currently accepts “registration” of a new user and “authentication” of an already registered user. There has been little change to our system diagram. We updated our schedule slightly to accommodate setting up Baidu Deep Speech for speaker presence verification and fine-tuning our model to improve speaker EER and accuracy.

Our current registration page. It is functional but we are working on a version that is easier to navigate and use.
The login (authentication) page. We will update the look and UI to make it easier to navigate and use.
The embeddings for our internal dataset are generally (non-linearly) separable in 2D. We expected that in the original 64-dimensional embedding space the logistic regression model would be able to find a separating hyperplane for each speaker, and our demo results confirm this. Note that some colors are reused across speaker labels.
The registration logic. Our processor abstraction allows us to write modular and easy to read code.
The authentication logic.
Example of the Yolo system logger. In this snapshot, you can see the database loading a set of embeddings and then training a series of logistic regression models.

 

After our demo, and based on where we stand at the moment, we updated some tasks in our Gantt Chart to keep everything on track for our Final Project Demo in May.

Updated Gantt Chart

Team Member Status

Nikhil Rangarajan

Prior to the demo, Ryan and I worked extensively on integrating the identification and verification inference steps and tying them together with the Logistic Regression model mentioned in the previous status report. We experimented with many of the LR parameters to see what worked best and gave us the best separation and results.

For registration, our Web App displays a random paragraph that the user reads aloud, and this audio file is hashed and stored in our database. Whenever a new speaker is added, a new set of Logistic Regression parameters is learned for each internal speaker in our database (one-vs-rest binary classification), and these weights are stored. At that point the user is registered.
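For illustration, here is a minimal sketch of what this per-speaker one-vs-rest training could look like with scikit-learn. The function names, data layout, and hyperparameters are placeholders rather than our exact implementation.

```python
# Sketch: retrain one-vs-rest logistic regression models when a new speaker registers.
# Names (train_speaker_models, embeddings_by_speaker) are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_speaker_models(embeddings_by_speaker, external_embeddings):
    """embeddings_by_speaker: dict of speaker name -> (n_i, 64) array of embeddings.
    external_embeddings: (m, 64) array of held-out Voxceleb embeddings used as negatives."""
    models = {}
    for name, pos in embeddings_by_speaker.items():
        # Negatives: every other internal speaker plus the external held-out set.
        neg = np.vstack([emb for other, emb in embeddings_by_speaker.items() if other != name]
                        + [external_embeddings])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        models[name] = LogisticRegression(max_iter=1000).fit(X, y)
        # In the real system the learned weights are serialized and stored in the database.
    return models
```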

For login, a user utters five randomized words (which we will probably change in the coming weeks to more words, and to words that fit together rather than being completely independent of each other), and the embeddings for this utterance are scored against the Logistic Regression parameters of every speaker. If the probability for a particular speaker is above a certain threshold, that speaker's name is output; otherwise None is returned.
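Correspondingly, a rough sketch of the login decision is below; the way the utterance score is aggregated and the per-speaker thresholds are applied is illustrative, not our exact logic.

```python
# Sketch: score a login utterance against every speaker's logistic regression model
# and accept the best match only if it clears that speaker's threshold.
def authenticate(utterance_embeddings, models, thresholds):
    """utterance_embeddings: (k, 64) embeddings from the spoken words.
    models: dict of speaker name -> fitted LogisticRegression.
    thresholds: dict of speaker name -> per-speaker probability threshold."""
    best_name, best_score = None, 0.0
    for name, clf in models.items():
        # Average the positive-class probability over the utterance's embeddings.
        score = clf.predict_proba(utterance_embeddings)[:, 1].mean()
        if score >= thresholds[name] and score > best_score:
            best_name, best_score = name, score
    return best_name  # None if no speaker cleared their threshold
```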

Going forward, we need to fine-tune many of our parameters to see what works best and gives the highest accuracy, and we also need to implement Baidu Deep Speech for speaker presence verification.

Richa Ravi

I worked on changing the layout and design of the UI: I changed a lot of the HTML and CSS and added some animations from CodePen. I also changed the usage so that the user only needs to click one button to start and stop recording themselves speaking. I wrote code to create the database tables for our data, using the peewee framework to define the tables and classes; peewee is very similar to Django models, so it was easy to understand and use. I wrote some Python code to wait for the Redis queue to return a result about whether a user was found during login; if a user is returned, I display that user on the template. I also added a check for whether a username already exists during registration. To do this, I created a new model on the frontend to store all the usernames. If the username already exists, an error message is displayed and the username is not registered.
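For illustration, a minimal sketch of the username check and the Redis result polling is below; the model, field, and key names are placeholders rather than our actual schema.

```python
# Sketch: peewee model for registered usernames plus a simple poll for the
# processor's login result in Redis. Model, field, and key names are placeholders.
import time
import redis
from peewee import Model, CharField, SqliteDatabase

db = SqliteDatabase("app.db")

class RegisteredUser(Model):
    username = CharField(unique=True)
    class Meta:
        database = db

db.create_tables([RegisteredUser])  # create the table if it does not exist

def username_taken(name):
    return RegisteredUser.select().where(RegisteredUser.username == name).exists()

def wait_for_login_result(request_id, timeout=10.0):
    """Poll Redis until the processor writes a result for this request (or time out)."""
    r = redis.Redis()
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = r.get("result:" + request_id)
        if result is not None:
            return result.decode()  # the matched username, or a sentinel for "not found"
        time.sleep(0.2)
    return None
```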

Ryan Brigden

I integrated and tested the end-to-end system for the demo we had on April 4. To reach this milestone and demonstrate a working, though not final or polished, product, I accomplished the following tasks:

  1. Wrote the audio processor, which converts the raw audio data received from the web backend into WAV format, which is then processed into a mel-spectrogram. Optionally, the audio processor can split a single audio sample into N evenly sized chunks. This is useful for registration, when we want a number of different speech embeddings from a single long utterance and it would be inconvenient for the user to stop and start recording individual samples. (A sketch of this step follows the list.)
  2. Wrote the speaker embedding processor, which efficiently performs network inference for our speaker model (ConvNet) using JIT compilation for acceleration.
  3. Wrote the database schema in the Peewee ORM. Specifically, I created the serialization tooling that lets us efficiently store the speaker embeddings (fixed-length vectors).
  4. Developed the system initialization sequence, which performs two critical tasks:
    1. Loads the internal dataset (our collected set of speech samples from friends), generates the embeddings for these speakers, and adds these speakers to the database (serialize and store their embeddings and relate them to the respective User record).
    2. Loads the external data by first randomly sampling from a set of valid speech samples and then generating and storing embeddings for the sampled utterances, which serve as the external held-out set.
  5. Helped develop the speaker classification processor with Nikhil. This pipeline learns logistic regression parameters for each speaker added to the database. We encountered a number of challenges getting this to work. Initially, I used dimensionality reduction to visualize both the internal and external embeddings in the database to see how separable the data was. Once we confirmed that it was separable, I realized that we needed to hold out a subset of the data for each speaker in order to find a threshold for each speaker model that achieves minimum EER. After tuning the held-out proportion and the sampling process, we achieved decent performance.
  6. Added a global logging system that aggregates logging info from all of the processors and the database into a single logging file.
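As referenced in item 1 above, here is a minimal sketch of the audio-processor step. It assumes librosa, a 16 kHz sample rate, and illustrative spectrogram parameters rather than our exact settings.

```python
# Sketch: load a WAV file, compute a log-mel-spectrogram, and optionally split it
# into N evenly sized chunks. Parameter values are assumptions, not our exact ones.
import librosa

def wav_to_mel_chunks(wav_path, n_chunks=1, sr=16000, n_mels=64, n_fft=512, hop_length=160):
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)              # shape: (n_mels, n_frames)
    if n_chunks <= 1:
        return [log_mel]
    # Drop any remainder frames so every chunk has the same width.
    width = log_mel.shape[1] // n_chunks
    return [log_mel[:, i * width:(i + 1) * width] for i in range(n_chunks)]
```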

 

Weekly Status Report (03/09 & 03/23)

In two weeks of work, we as a team have made significant progress toward our minimum viable product (MVP) that we will demonstrate on April 4. The primary objective is to have the entire system integrated by that time so that we can spend the remainder of our development time optimizing individual components until the system achieves sufficient performance.

The three most significant achievements this week were implementing the speaker classification (logistic regression) module, implementing the bare-bones processor (backend), and finishing a working web application that can accept audio data over the network.

The biggest unknown that we resolved was designing the specifics of the speaker classification system. The entirety of the speaker classification (logistic regression) system was built and tested in the past two weeks. In brief, the system learns a distinct set of logistic regression parameters for each speaker in the database. The motivation for this is twofold. For one, it helps us leverage both internal (from our speaker database) and external (from 3rd party datasets) data to derive a confidence score and threshold tuned to that speaker. Secondly, it is a simple and transparent model with low resource cost to run.

The greatest challenge we face in the next week is integrating the speaker embedding system with the speaker classification system and testing it on our own collected dataset.

Team Member Status

Ryan Brigden

I made progress on both the speaker embedding model and the processor (backend) system these past two weeks, and also worked with Nikhil to develop the speaker classification (logistic regression) training and evaluation process. I also collected two-minute audio samples from 10 individuals, which have been stored alongside those collected by other team members.

Here are some more specific updates on what I have completed.

Speaker Identification System

While this module is currently the most mature, we have made some more progress on refining the current model. The new features are:

  1. An alternative bidirectional LSTM-based model that exploits the sequential structure of the data. It achieved results on Voxceleb comparable to our ConvNet model once we reduced the complexity of the original LSTM model described in the prior status report. The benefit of the LSTM model is that it handles variable-length sequences more simply and efficiently. If it continues to improve with refinement, we may use it instead of the ConvNet.
  2. Enforcing our embeddings to be unit vectors. Our metric for speaker similarity is cosine similarity (the inverse of cosine distance), so we decided to normalize the embedding-layer activations during testing so that the learned embeddings lie on a unit hypersphere and can only be differentiated by the relative angle between one another. In practice this improves the discriminative quality of the embeddings, yielding an improvement of approximately 2% EER on Voxceleb. (A short sketch of the normalization follows this list.)
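As a small illustration of the normalization described in item 2, here is a sketch in PyTorch; tensor names and shapes are illustrative.

```python
# Sketch: L2-normalize embedding-layer activations so that embeddings lie on the
# unit hypersphere and cosine similarity reduces to a dot product.
import torch
import torch.nn.functional as F

def normalize_embeddings(emb):
    """emb: (batch, embedding_dim) raw embedding-layer activations -> unit vectors."""
    return F.normalize(emb, p=2, dim=1)

def cosine_similarity(a, b):
    # With unit-norm embeddings, cosine similarity is simply the elementwise dot product.
    return (normalize_embeddings(a) * normalize_embeddings(b)).sum(dim=1)
```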

Processor (Backend) System

This past week I built the bare-bones backend system, which we now call the processor to disambiguate it from the web application’s backend. The bare-bones processor:

  1. Dequeues a request (JSON) from the Redis requests queue (currently hosted locally) and reads the binary audio data from the same Redis server using the key carried in the request.
  2. Converts the audio to WAV format.
  3. Converts the WAV file to a mel-spectrogram using the parameters we settled on during model development.
  4. Stores dummy results back in Redis using the same ID passed in the initial request, which the web server reads and returns to the user.

Notably, the system is currently missing the internal "guts" of the processing, although these have all now been implemented independently. This bare-bones system validates that we can successfully queue request information from the web server, pass that data through our inference system, and write back a result that the web server can read.
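For illustration, a rough sketch of this bare-bones loop is below. The queue and key names, the result format, and the two helper stubs are placeholders rather than our exact protocol.

```python
# Sketch of the bare-bones processor loop. Queue/key names, the result format,
# and the helper stubs are placeholders.
import json
import redis

r = redis.Redis()

def convert_to_wav(audio_bytes):
    # Placeholder: the real processor converts the raw browser audio to WAV here.
    return audio_bytes

def wav_to_mel(wav_bytes):
    # Placeholder: the real processor computes the mel-spectrogram here.
    return wav_bytes

def process_one_request():
    _, raw = r.blpop("requests")                    # 1. dequeue a JSON request
    request = json.loads(raw)
    audio_bytes = r.get(request["audio_key"])       #    read the binary audio blob by key
    mel = wav_to_mel(convert_to_wav(audio_bytes))   # 2-3. WAV, then mel-spectrogram
    result = {"status": "ok"}                       # 4. dummy result for now
    r.set("result:" + request["id"], json.dumps(result))
```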

Nikhil Rangarajan

 

  • What did you personally accomplish this week on the project?

Implemented “One vs Rest” Logistic Regression on a sample of utterances from 34 speakers from our VoxCeleb Dataset. Learnt weight parameters for each speaker.

We use an L1 penalty, with 100 iterations.

We then tested on our held-out set to see how each utterance performs against the learned parameters.

A probability value is output for each class (in this case, the probability that an utterance belongs to a given speaker and the probability that it does not).

After testing against our learned Logistic Regression models and obtaining probability scores for each utterance in the held-out set, we pass these scores into our equal error rate (EER) function to measure performance.

We obtained EERs between 0.01% and 10% depending on the speaker. Some speakers' models performed better than others; we will need to explore this and fine-tune further.
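For reference, a generic sketch of computing the EER (and the threshold that achieves it) from scores and labels is below; it is not necessarily the exact function we use.

```python
# Sketch: equal error rate from per-utterance scores and binary labels.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for the target speaker, 0 otherwise; scores: model probabilities."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # point where FAR and FRR cross
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, thresholds[idx]              # the EER and the threshold achieving it
```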

  • What deliverables do you hope to complete in the next week?

Tune the parameters of our CNN as well as our Logistic Regression models to see what gives us the best results. Additionally, we hope to train and obtain the EER and required threshold for speakers outside of our VoxCeleb dataset. We hope to obtain 20-30 recordings from our friends on campus to see how our model performs on internal data.

Richa Ravi

I finished the MVP web application over the past two weeks and can successfully record and push audio to a remote web server. I have deployed the application to EC2 and have demonstrated that users can record and submit audio given a phrase prompt. Although the application will ultimately be used for login and registration, we are already making use of the deployed system to collect data for our own dataset.

 

Weekly Status Report (03/02)

Team Status

The system diagram that we finalized this week.

This week was heavy on solidifying our final design and beginning work on some of the core components of the project. Our current development strategy is to build skeleton modules for each component of the overall system first, integrate these modules, and then iteratively improve each module individually. By skeleton module, we mean a barebones implementation of the component such that it can interface with other components in the same way as the final product, but may not have fully developed internals.

On that note, this week we have finished our design of the backend system. The major design change associated with this finalization was settling on a Redis-based queueing system (using a Python wrapper aptly called Redis Queue) to manage the background processing tasks and choosing a remote MySQL instance to maintain the state of the system. The goal is to have a robust backend application that can be deployed and torn down on AWS by running a script. A baseline version of the system will be tested on AWS by the end of the coming week.
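As a small illustration, enqueueing a background task with Redis Queue could look roughly like the following; the queue name and the dotted path to the job function are hypothetical.

```python
# Sketch: enqueue a background processing job with Redis Queue (RQ).
# "tasks.process_login_audio" is a hypothetical job function that the worker
# instance would import; the queue name is also illustrative.
from redis import Redis
from rq import Queue

q = Queue("speech", connection=Redis())

# RQ accepts the job function as a dotted-path string, so the web app does not
# need to import the processing code itself.
job = q.enqueue("tasks.process_login_audio", "audio:1234")
print(job.get_id())
```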

The most significant risk we currently face is showing that the ability of our models on external datasets transfers to our own data, which we need in order to demonstrate that the system works at demo time. To that end, we previously listed data collection as an important task, and we are currently behind on its timeline. To catch up, we are spinning up a web app on AWS this weekend that presents users with a text prompt, which they read aloud while the application records them. At the end of recording, we write the audio file and user ID into a database. We plan to collect this data over the next week and over break, allowing us to get back on track. We will be able to collect data remotely thanks to having a web application that can be used anywhere.

Team Member Status

Richa

This week I spent a lot of time thinking about how to design the backend cloud system. I did some research about how to use Redis as a data store. 

For the backend of the cloud system, the final design involves having an EC2 instance for the webapp that is connected, through Redis, to another EC2 instance (which will host the GPU). Since each login and registration will need access to all the data in the database (for each login the voice recording is compared against n binaries for n users), we decided that both EC2 instances should have direct access to the database.

I also spent some time debugging the AJAX call that passes the voice recording from the JavaScript in the HTML to Django views and stores it in a Django model object. I am currently working on creating an EC2 instance for the webapp and switching the database from SQLite to MySQL.

Ryan

This week I:

  • Developed and trained a bi-pyramidal LSTM model for speaker verification. I hoped that imposing temporal structure on the network and using a base architecture that is popular for speech recognition could help performance. The model did not perform as well as hoped on mel features; I plan on experimenting with MFCCs this coming week.
  • Prepared NIST-SRE data for training. This took longer than expected because the data is large and we needed to find a way to preprocess it.
  • Tested contrastive training and discovered that it does not improve performance out of the box and can in fact degrade the model without careful tuning. This experimentation leads me to believe that more work needs to be done on tuning the verification model sooner rather than later.

Nikhil

This week I worked on implementing a critical component of the speech ID pipeline: using logistic regression to build a one-vs-all binary classification model for each speaker in the embedding space learned for verification.

I spent most of the time designing the algorithm and writing the baseline code. I plan to test the performance on real data from Voxceleb and NIST-SRE this coming week.

Weekly Status Report 02/23

Team Status

This week we made progress in demonstrating that our speaker verification system learns meaningful discriminative features beyond the dataset it was trained on. To do so, we showed that we can record audio in the browser, process it with our speech pipeline, and generate embeddings from the utterance spectrogram using our ConvNet model. We then used PCA and t-SNE dimensionality reduction techniques to visualize the 512 dimensional speaker embeddings in 2D space.

Speech embeddings from our verification model visualized in two dimensions.
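For illustration, the dimensionality reduction step could look roughly like this with scikit-learn; the PCA pre-reduction size and the t-SNE settings are assumptions, not our exact choices.

```python
# Sketch: project 512-dimensional speaker embeddings down to 2D for visualization,
# using PCA first and then t-SNE. Parameter values are assumptions.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_2d(embeddings, pca_dims=50):
    """embeddings: (n, 512) array of speaker embeddings -> (n, 2) points to plot."""
    reduced = PCA(n_components=pca_dims).fit_transform(embeddings)
    return TSNE(n_components=2, perplexity=30).fit_transform(reduced)
```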

The main system design change this week was a shift from a KNN one-shot identification model to learning separate logistic regression parameters for each speaker in our database. The reason for the change is that it is difficult to reason about confidence with KNN and to leverage data beyond our system's database. The logistic regression is a binary classification between a speaker in our database and all other speakers in our database, as well as a held-out set (which our model is not trained on) from an external dataset. This method allows us to reason about confidences and develop individual thresholds for each speaker (i.e., the threshold that achieves EER on held-out data).

Currently, the most significant risk to the project is integrating the cloud processing system with the web backend. We plan on implementing a skeleton system that passes “dummy” data around this week so that we can verify the concept. Another risk is that we don't fully understand how our verification metrics on external datasets extend to our system's performance. We need to get a baseline system up and running as soon as possible (before spring break) so that we can perform this evaluation.

Team Member Status

Nikhil

  • A week ago, while extracting the mel coefficients and mel spectrogram from each voice recording, we were using a number of default parameters for our signal processing (such as the sampling rate of the incoming signal and the number of FFTs to compute). Since the human ear is sensitive to frequencies between 20 Hz and 20 kHz, we decided to evaluate performance after changing the sampling rate from the default (22050 Hz) to 16 kHz. There was a significant improvement in our EER after making this change. There is a lot of scope over the next few weeks to adjust numerous parameters and see what gives us the best results.
  • Additionally, over the next week, the goal is to implement the one-vs-all Logistic Regression model on our speaker embeddings. The idea is to learn a new set of Logistic Regression weights for each user in our database in order to get a threshold for each speaker versus the rest of the speakers (using our held-out set as well).

Ryan

  • Adjusted model architecture and added contrastive training to improve our verification performance on the Voxceleb dataset.
  • Added logic to process and perform embedding inference on our own speech samples.
  • Performed dimensionality reduction on our embeddings to visualize them in 2 dimensions so that we can quickly know if the model is learning meaningful features on our own data.
  • Came up with a new one-shot identification method using N binary logistic regressions for the N users in the database, which will give us a way to reason about the confidence of our classifications as well as the ability to leverage data beyond our own database at test time.

Richa

  • Worked on converting the webapp into a website that can be used for data collection. I added an input box where a user can enter their name; when the user clicks the stop-recording button, the HTML calls a function from Django views to store the recording along with the user's name. To store this data I created a Django model with two fields: a FileField to store the audio file and a CharField to store the name of the user. The database I'm currently using is SQLite, the default database for a Django webapp. (A sketch of this model follows the list.)
  • Next week, I want to work on creating the backend for the webapp. This will include setting up an EC2 instance for the webapp, which will then queue up tasks on Redis and send the individual tasks to another instance that hosts the GPU. Each task will be processed on the GPU instance, which returns either a probability of the speaker being similar to a speaker in the database (for login) or a task-complete indicator (for register).
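As referenced in the first bullet above, a minimal sketch of the data-collection model inside a Django app could look like this; the model and field names are illustrative.

```python
# Sketch: Django model for the data-collection page. Names are illustrative.
from django.db import models

class Recording(models.Model):
    name = models.CharField(max_length=100)             # the name the user entered
    audio = models.FileField(upload_to="recordings/")   # the recorded audio file
```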

Introduction and Project Summary

The goal of our project is One-Shot Speech Identification and Speech Recognition via a Web Application. Our project can be broken into two parts:

  1. Speech Identification to identify if a particular voice utterance belongs to an authorized speaker.
  2. Speech Recognition as a security measure to ensure that a particular message is not pre-recorded and does indeed belong to the identified speaker.

Our project spans Software and Signals & Systems areas.

A user will utter a randomly generated phrase into our Web Application, which will in turn identify the speaker from our database. There will also be a verification check to ensure that the utterance was made at the time of authentication rather than pre-recorded.