Team Status Report 11/19/2022

This week, we decided to pivot away from our physical interface. This is unfortunate news, as we were making progress in several areas. However, after ordering a first round of materials for testing, we realized our final batch of supplies would not arrive in time for the end of the semester.

Luckily, we accounted for this in our initial proposal, and can now pivot towards a fully virtual implementation — one that uses the web application John has been working on to display the results of Angela and Marco’s work so far.

To that end, we’ve listed the reworked scope of our project below:

  • Note Scheduling:
    • Pivot from 5 discrete volume levels to a finer-grained range of volume levels
    • Take advantage of newfound dynamic range at quieter volumes: no longer limited by 5N minimum threshold
    • Latency from input to output: 2% of audio length
    • Threshold for replaying a key: 15% of max volume between each timestamp (see the sketch after this list)
  • Web App:
    • Take in recorded audio from the user (either a new recording or an uploaded file)
    • ‘Upload’ recording to audio processing and note scheduler
      • (Stretch goal) → Save a CSV file on the backend (between audio processing and the note scheduler) for re-selection in the future.
    • Upon completion of audio processor, web app displays graphs of audio processing pipeline/progress
    • Run ‘Speech to Text’ on audio file and support captions for virtual piano.
      • Likely run in conjunction with the audio processing so that we can display the virtual piano as soon as processing finishes.
    • Show the virtual piano on a new page that plays back the audio while notes ‘rain’ down onto the keys, taking inspiration from ‘Pianolizer’ (https://github.com/creaktive/pianolizer)
    • In order to optimize latency → the web app will prioritize processing just the audio and playing it back on the virtual piano
      • On a separate tab, we will show graphs
    • Metrics:
      • Latency
        • Time between submitting audio recording to processing and return of graphs / audio for virtual piano
        • Defined as a function of input audio length
  • Signal Processing
    • Take in recorded audio from web app backend
    • Generate three files
      • A series of plots that give information about the incoming audio and frequencies perceived by the system
      • A digital reconstruction of the original wav file using the frequencies extracted by our averaging function
      • The originally promised csv file
    • Metrics
      • Audio fidelity
        • Using the reconstructed audio from the signal processing module, we can ask listeners whether they can understand what the reconstructed audio is trying to say. This gives us insight into how perceivably functional the generated audio is (reported as the number of successful reports divided by the total number of reports).
        • Generate information on what percentage of the original frequency samples is lost by the averaging function (reported as the percentage of captured information relative to the original information)
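
To make the note-scheduling thresholds above concrete, here is a minimal sketch of the replay decision. The function name and the reading of the 15% threshold (a key is re-pressed when its volume rises by more than 15% of the maximum volume between consecutive timestamps) are our working assumptions, not final implementation details.

```python
# Sketch only: assumes volumes are normalized so that 1.0 is the maximum volume.
REPLAY_THRESHOLD = 0.15  # 15% of max volume between consecutive timestamps

def key_action(prev_volume: float, curr_volume: float, silence: float = 0.01) -> str:
    """Decide what to do with one key between two consecutive timestamps."""
    if curr_volume < silence:
        return "release"                       # frequency has effectively died out
    if curr_volume - prev_volume > REPLAY_THRESHOLD:
        return "replay"                        # volume jumped enough to re-press the key
    return "hold"                              # otherwise sustain the note

# Example: a key that jumps from 10% to 40% of max volume gets re-pressed.
print(key_action(0.10, 0.40))                  # -> "replay"
```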

John’s Status Report for 11/19

This week, after settling on implementing the virtual piano interface instead of the physical piano, we rescoped the project and agreed upon some new parameters and features.

In terms of the web app, we decided that it would be the start and end point for the user: we offer an interface for recording audio or uploading a .wav file, the uploaded audio is sent to the audio processing and note scheduling modules, and the results are displayed back on the web app, where the audio is recreated on a virtual piano interface.
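
As a rough illustration of this start-to-end flow, a Django view could look something like the sketch below. The names process_audio and schedule_notes are stand-ins for Marco's and Angela's modules, and the template names are placeholders; none of this is our actual interface.

```python
# Minimal sketch of the record/upload -> process -> display flow (Django).
from django.core.files.storage import FileSystemStorage
from django.shortcuts import render

def process_audio(wav_path):       # placeholder for the audio processing module
    return {"graphs": [], "frequencies": []}

def schedule_notes(frequencies):   # placeholder for the note scheduling module
    return []

def upload_recording(request):
    """Accept a recorded or uploaded .wav file, process it, and render the results."""
    if request.method == "POST" and request.FILES.get("audio"):
        storage = FileSystemStorage()
        name = storage.save(request.FILES["audio"].name, request.FILES["audio"])
        results = process_audio(storage.path(name))
        schedule = schedule_notes(results["frequencies"])
        return render(request, "results.html", {"results": results, "schedule": schedule})
    return render(request, "upload.html")
```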

In terms of new features and parameters introduced, we are planning to add a speech-to-text library to create captions and have an objective way of testing fidelity of the output audio. Creating captions would help with interpreting the words ‘spoken’ by the piano and, by toggling them on and off, we can run experiments to see how easy it is to interpret the audio with and without the captions. Introducing a speech-to-text library would also allow us to take a speech-to-text reading of the initial, spoken audio and also one of the piano output, which we could compare to see if an objective program can understand the piano output.
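
One simple way to make the fidelity check objective: run whichever speech-to-text library we settle on over both the original recording and the piano playback, then compare the two transcripts. The sketch below only covers the comparison step, using Python's standard difflib; the transcripts themselves would come from the (not yet chosen) STT library.

```python
import difflib

def transcript_similarity(original_text: str, piano_text: str) -> float:
    """Word-level similarity (0..1) between the STT transcript of the original
    speech and the STT transcript of the piano playback."""
    original_words = original_text.lower().split()
    piano_words = piano_text.lower().split()
    return difflib.SequenceMatcher(None, original_words, piano_words).ratio()

# Example with made-up transcripts: one of two words survives intact.
print(transcript_similarity("hello world", "hello word"))   # 0.5
```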

This week, after coming together as a team to discuss these changes, I have mainly been focused on implementing a virtual piano interface on the web app. So far, I have found a well-documented repository (found here: https://github.com/creaktive/pianolizer) that emulates a system very similar to what we want. I have begun picking it apart to understand how to create something similar and add it to our own app. I am planning to reuse the piano design and note visualization scheme from the repository, since this would save us a lot of time and it is mostly CSS and image creation, which is a bit removed from the engineering concepts we could implement more deeply in other areas of the project.

Next up is testing the full loop of taking in audio, processing it, and seeing it played back. For this, we need more testing of the other modules, ‘glue’ code to tie everything together, and fine-tuning of our testing parameters.

Marco’s Status Report 11/12/2022

This week I started designing the PCB for the physical interface and sourcing parts from JLCPCB.com, which is where we’ll be manufacturing the PCB. Here is a list of all the parts I found. For the PCB, I had to design the receptacle that will hold the shift registers we ordered; below is a screenshot of the 3D model I generated for the receptacle.

This week we also presented our interim demo, after which we talked about some issues we’ve run into with the audio processing module. I’ll try to introduce the issue here, but some further investigation might be necessary if the reader is unfamiliar with certain signal processing concepts. I’ll try to add some links to further information where I can!

Our physical interface has a play rate (the rate at which we can play keys with a solenoid) of 14 times per second. This rate dictates the number of samples we can extract from the original audio signal, which is recorded at a sampling rate of 48 kHz. We call this span of samples our window, which works out to 48,000 / 14 ≈ 3428 samples per window. These are the samples we can use to perform the Fast Fourier Transform (FFT). One thing to note about the FFT is that the window size dictates how many frequency bins we have access to within a given window. Frequency bins are evenly spaced points along the frequency axis that divide up the range of possible recorded frequencies. In our case, with a range of 0 Hz to 48 kHz and a window size of 3428 samples, there are 3428 frequency bins, which gives us a step size of 48 kHz / 3428 ≈ 14 Hz. This means adjacent frequency bins in the FFT of our window are separated by 14 Hz. This is unfortunate because piano key frequencies are specified to fractions of a hertz (e.g., a low key with a fundamental frequency of 29.135 Hz), and adjacent keys in the bass register are separated by far less than 14 Hz.
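
To sanity-check these numbers, here is a small NumPy sketch that computes the window size and the resulting FFT bin spacing; the values match the 3428 samples and roughly 14 Hz discussed above.

```python
import numpy as np

fs = 48_000        # sampling rate (Hz)
play_rate = 14     # key presses per second -> one FFT window per press

window_size = fs // play_rate                  # 3428 samples per window
bins = np.fft.rfftfreq(window_size, d=1/fs)    # center frequencies of the FFT bins
bin_spacing = fs / window_size                 # ~14 Hz between adjacent bins

print(window_size, round(bin_spacing, 2))      # 3428 14.0
print(bins[:4])                                # [ 0.  14.002  28.005  42.007] (approx.)
```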

We’re currently investigating solutions to this issue, some of which include:

  • Rounding the piano key frequencies to their nearest integer, giving our piano key domain a step size of 1 Hz. With that, we can interpolate the 3428 frequency bins onto the range [0, 5000] Hz with a step size of 1 Hz.
  • Zero-padding our time-domain window with 0’s, which increases the number of FFT frequency bins (see the sketch after this list)
  • Reducing the sample rate to around 16 kHz, since it would help us work with smaller datasets, achieve faster computation, and better isolate the frequencies we care about
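
As a quick illustration of the zero-padding option above (a sketch, not our final approach): padding the 3428-sample window out to 48,000 samples makes the FFT report bins 1 Hz apart. Zero-padding interpolates the spectrum rather than adding genuinely new frequency information, but it does give us values on a grid fine enough to line up with piano-key frequencies.

```python
import numpy as np

fs = 48_000
window = np.random.randn(3428)                 # stand-in for one 3428-sample audio window

padded_len = fs                                # pad out to 48,000 samples -> 1 Hz bin spacing
spectrum = np.fft.rfft(window, n=padded_len)   # rfft zero-pads the input to padded_len
freqs = np.fft.rfftfreq(padded_len, d=1/fs)

print(freqs[1] - freqs[0])                     # 1.0 Hz between adjacent bins after padding
```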

I’ll be implementing these avenues and investigating if they help us with our issue at hand.

John’s Status Report for 11/12

Early this week, I was focused on polishing the web app for the demo. I was able to integrate the audio processing code such that when users upload an audio file, the Python audio processing code accesses it, performs the processing, and outputs the audio with the piano note frequencies. In the process, I discovered some issues with the file format of the audio recorded from the web app. The Python processing code expects the audio to be in .wav format with a bit depth of 16 (16 bits used for the amplitude range); however, the web app recorded in the webm format (a file format supported by major browsers). I attempted to convert the webm file to a wav directly using the native Django and JavaScript libraries, but there were still issues getting the file header into the .wav format. Luckily, we discussed the problem as a group and came across a JavaScript library called ‘Recorderjs’ (https://github.com/mattdiamond/Recorderjs). This allowed us to record the audio directly to .wav format (bypassing the webm format) with the correct bit depth and sample rate (48 kHz). With this library, I was able to glue the web app code to the audio processing code and get the web app taking in the audio and displaying all the graphs of the audio through the processing pipeline.
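
Since the processing code is picky about the input format, a small check like the sketch below (using Python's standard wave module) could confirm that an uploaded file really is a 16-bit, 48 kHz .wav before it enters the pipeline. The function name and defaults are illustrative.

```python
import wave

def check_wav(path: str, expected_rate: int = 48_000, expected_width_bytes: int = 2) -> bool:
    """Return True if the file has the expected sample rate and 16-bit depth."""
    with wave.open(path, "rb") as wav_file:
        ok_rate = wav_file.getframerate() == expected_rate
        ok_depth = wav_file.getsampwidth() == expected_width_bytes   # 2 bytes = 16-bit
        return ok_rate and ok_depth

# Example (hypothetical path):
# print(check_wav("media/recording.wav"))
```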

We were not able to get the final processed audio played back due to difficulty performing the inverse Fourier transform with the data we had. In an effort to better understand Fourier transforms and our audio processing ideas, I talked to Professor Tom Sullivan after one of my classes with him, and he explained the advantages of using Hamming windows for the processing and how we could potentially modify our sampling rate to save processing time and better isolate the vocal range for a higher-resolution Fourier transform.
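
Applying a Hamming window amounts to multiplying each sample window by the window function before taking the FFT; a minimal sketch of how that could look in our pipeline:

```python
import numpy as np

window = np.random.randn(3428)              # stand-in for one audio window

hamming = np.hamming(len(window))           # tapers the edges to reduce spectral leakage
spectrum = np.fft.rfft(window * hamming)    # FFT of the windowed samples
```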

With this information, we are in the process of configuring our audio processing to allow for modular changes to many parameters (FFT window, sample rate, note scheduling thresholds, etc.). I am also fixing the audio playback so we can hear the output and get an idea of the performance of our processing.

My plan for the upcoming week is to work with the group to set up our testing loop (input audio with different parameters, hear what it sounds like, see how long the processing takes, evaluate, then iterate). I will also be integrating the note scheduling code with our backend so that we can control the stream of data sent to the Raspberry Pi via sockets.

Angela’s status report, 2022/11/11

At the beginning of this week, I helped my teammates prepare for the demo. We debugged an issue where the audio file was being saved as a .webm file instead of a .wav file. This opportunity allowed me to read a lot of Django documentation which I’m sure will be helpful when it comes to writing glue code between further parts of the system in the upcoming weeks.

After discussion with the professors during the Wednesday demo and discussions with my teammates afterwards, I began to reconsider the way I was implementing volume in the key pressing. Initially, I had made some assumptions:

1. Since 5N was the minimum force for the sound to be heard, any increment less than 5N would not result in a noticeable difference in sound.

2. Discrete levels of volume were “good enough” for the purposes of our project.

Upon further consideration, I realized that this was unnecessarily limiting the dynamic volume range of our system. Since we are using PWM to control the solenoid force, we can have any force level from 0 to 100% (0 to 25 N). Since we need at least 5 N to sound a key, this gives us a usable range of 5 to 25 N. I also adjusted the volume level calculations to better reflect the relationship between force and volume. Since the audio processing output gives us amplitude in pascals (newtons per square metre) and the distance is constant, the volume parameter is linear in newtons. Previously, I had mistakenly assumed that the volume was in decibels and implemented a logarithmic relationship between the two.
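
A minimal sketch of this linear amplitude-to-force mapping, assuming amplitudes are normalized against the loudest sample in the recording (the normalization and silence cutoff are working assumptions, not the final calculation):

```python
MIN_FORCE_N = 5.0    # minimum solenoid force that produces audible sound
MAX_FORCE_N = 25.0   # force at 100% PWM duty cycle

def amplitude_to_force(amplitude_pa: float, max_amplitude_pa: float) -> float:
    """Map an amplitude (Pa) linearly onto the usable 5-25 N solenoid range.
    Returns 0.0 (no press) for effectively silent samples."""
    if max_amplitude_pa <= 0 or amplitude_pa <= 0:
        return 0.0
    norm = min(amplitude_pa / max_amplitude_pa, 1.0)
    return MIN_FORCE_N + norm * (MAX_FORCE_N - MIN_FORCE_N)

def force_to_duty_cycle(force_n: float) -> float:
    """PWM duty cycle (0..1) corresponding to a requested force."""
    return force_n / MAX_FORCE_N
```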

Team status report, 2022/11/11

At the beginning of this week we worked on our glue code to get ready for the demo. We have now connected the web app input module and the signal processing module. The system is able to record user speech and start processing it, as well as display graphs that represent the processing steps. In the upcoming week, we will complete the glue code for the note scheduler as well as its output to the Raspberry Pi.

On Monday, we met with Byron to discuss planning for future weeks as well as testing. We have started to plan testing for different parameters, both quantitative and qualitative.

We ran into a problem with the resolution of our audio processing filtering. Due to our FFT window size, our frequency resolution is fairly coarse: the FFT gives us bins spaced 14 Hz apart. However, the spacing between adjacent piano keys grows with frequency, and at the lowest bass notes the keys are less than 14 Hz apart. As a result, in the lower frequencies, we do not have frequency bins mapping to some of the keys.
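
To see the mismatch concretely, the sketch below computes the standard 88-key piano frequencies, f(n) = 440 * 2^((n - 49) / 12), and the spacing between adjacent keys; in the bass register that spacing is well under the 14 Hz bin width.

```python
import numpy as np

keys = np.arange(1, 89)                   # standard 88-key numbering, key 49 = A4 = 440 Hz
freqs = 440.0 * 2 ** ((keys - 49) / 12)   # fundamental frequency of each key
spacing = np.diff(freqs)                  # gap between adjacent keys

print(freqs[0], freqs[1])                 # ~27.5 Hz and ~29.135 Hz (lowest two keys)
print(np.sum(spacing < 14))               # how many adjacent-key gaps are narrower than 14 Hz
```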

We have discussed solving this problem with Prof. Sullivan, who suggested a Hamming window. This method would also allow us to lower our sample rate and reduce the size of the dataset we have to work with.

Marco’s Status Report 11/5/2022

Hello,

On Monday I prepared for our ethics discussion. Some really interesting points were brought up surrounding adversarial use of our project and what we would do in those situations. I spent the remainder of the weekend away from campus at the Society of Hispanic Professional Engineers national convention in Charlotte, North Carolina. Now that I’m back in Pittsburgh, I’ll be working on finishing our deliverables for the demo on Wednesday of this week.

Angela’s status report, 2022/11/05

This week I continued to work on the note scheduling module. Last week I completed all the main functions, but I was unhappy with the state of the syllable and phoneme recognition. (Note: a phoneme is an atomic unit of speech, such as a vowel sound or a consonant sound in English).

Phoneme recognition is important for our project as it allows us to know when to lift or press piano keys that have already been pressed, and when to keep them pressed to sustain the sound. This allows for fluid speech-like sounds, as opposed to a stutter.

First, I read about how speech recognition systems handle syllable recognition. I learned that it is done through volume amplitudes: when someone speaks, the volume dips between syllables. I discussed using this method with my team, but we realized it would fail to account for phonemes. For example, the words “flies” and “bear” are both monosyllabic, but require multiple phonemes.

I’ve now implemented two different methods for phoneme differentiation.

Method 1. Each frequency at each time interval has its volume compared to its volume in the previous time interval. If it’s louder by a certain threshold, it is pressed again. If it’s the same volume or slightly quieter, it’s held. If it’s much quieter or becomes silent, the key is either lifted and re-pressed with a lower volume or just lifted.

Method 2. At each time interval, the frequencies and their amplitudes are abstracted into a vector. We calculate the multidimensional difference between the vectors at different time intervals. If the difference is larger than a threshold, it will be judged to be a new phoneme and keys will be pressed again.
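
A minimal sketch of Method 2, assuming each time interval is represented as a NumPy vector of per-key amplitudes (the threshold value is a placeholder to be tuned experimentally):

```python
import numpy as np

NEW_PHONEME_THRESHOLD = 0.5   # placeholder; to be tuned in testing

def is_new_phoneme(prev_frame: np.ndarray, curr_frame: np.ndarray) -> bool:
    """Treat each frame as a vector of per-key amplitudes and flag a new phoneme
    when the Euclidean distance between consecutive frames exceeds the threshold."""
    return float(np.linalg.norm(curr_frame - prev_frame)) > NEW_PHONEME_THRESHOLD

# Example with two made-up 3-key frames: the amplitudes shift enough to count as new.
print(is_new_phoneme(np.array([0.1, 0.8, 0.0]), np.array([0.7, 0.1, 0.6])))   # True
```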

In the upcoming weeks we will implement ways to create the sounds from the key scheduling module and test both these methods, as well as other methods we think of, on volunteers to determine the best method for phoneme differentiation.

Team status report for 11/5

This week, our team focus has been on establishing the requirements for our upcoming demo. We hope that by setting up our project for this demo, we can also have a platform built to test out parameters of our project and evaluate the results.

For our demo, we are hoping to have a web app that allows users to record audio and send it to the audio processing module. The audio processing performs a Fourier Transform on the audio to determine which frequencies comprise it. We then take the frequencies that correspond to the keys of the piano and filter the recorded audio so that the only remaining frequencies are those the piano can produce. This filtered set of frequencies represents which piano notes can be pressed to reproduce the snippet of audio. These snippets of frequencies (at this point filtered to only contain those the piano can produce) are then sent to the note scheduler, which takes in which notes to play and outputs a schedule of when to play each key on the piano. This note scheduling module is important because it identifies whether a particular note’s frequency is present in consecutive samples, and thus handles whether we need to keep a note pressed across many samples, re-press a note to achieve a new, higher volume, or release a note so that the frequency dies out. Lastly, these audio samples, now containing only piano frequencies, are stitched together and played back through the web app. The audio processing module also creates graphical representations of the original recording (as time vs. amplitude) and the processed recording (as time vs. filtered amplitude, representing the piano keys) and displays them to the user.
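
A hedged sketch of the "filter to piano frequencies" step described above: for each of the 88 keys, average the FFT magnitudes in a narrow band around that key's fundamental frequency. The half-semitone band and the key-frequency formula are assumptions about how we might implement this, not the final code.

```python
import numpy as np

def piano_key_amplitudes(samples: np.ndarray, fs: int) -> np.ndarray:
    """Average FFT magnitudes in a narrow band around each of the 88 key frequencies."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1/fs)

    keys = np.arange(1, 89)
    key_freqs = 440.0 * 2 ** ((keys - 49) / 12)

    amplitudes = np.zeros(88)
    for i, f in enumerate(key_freqs):
        half_band = f * (2 ** (1 / 24) - 1)              # roughly a quarter tone on either side
        in_band = (freqs >= f - half_band) & (freqs <= f + half_band)
        if np.any(in_band):
            amplitudes[i] = spectrum[in_band].mean()     # average the bins near this key
    return amplitudes
```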

Our goal with this demo is not only to show off a snippet of our project pipeline, but also to provide a framework to test parameters and evaluate the results. We have many parameters in our project, such as the window of time for the FFT, the window of frequencies over which we average for each note, and the level of amplitude difference between consecutive frequencies that determines whether we re-press a note or keep it pressed down, to name a few. We hope that with this demo we can fine-tune these parameters and listen to the output audio to determine whether each parameter is well tuned. The output playback from our demo is the sound we would hear if the piano were able to perfectly replicate the sound with its available frequencies, so using the demo to find an optimal set of parameters will help the piano recreate the sound as faithfully as possible.

We have also placed an order through AliExpress for a batch of solenoids, set to arrive by the 11th of November. We will evaluate the effectiveness and quality of the solenoids from the AliExpress order, and if they are good, we will place an order for the rest of the solenoids we need.

In the coming weeks, we are planning to define tests we can perform with our demo program to tune our parameters. We will also plan how to test the incoming batch of solenoids and what will constitute a successful test.

John’s Status Report for 11/5

This week, our group was focused on defining the requirements for our interim demo. For my part, I am working on the web app that users will interface with to record audio, upload it to the audio processing module, and then see the results of the audio processing and piano key mapping displayed in graph form.

So far, I have added more to the web app, allowing users to visit a recording page and speak into their computer microphone to record their speech. As of now, I am using the videojs-record JavaScript package to record the user audio and display the live waveform of recording. This package includes many features and plugins that allow for efficient audio recording and easy implementation within our Django web app framework.

Currently, the user records the audio and the file gets stored in a local media folder as well as in a model on the SQLite backend. This allows different pages of the web app to access these models so that the user can play back the audio or select a previously recorded clip to send to the piano.
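
For reference, the storage side of this can be as small as a Django model with a FileField; a sketch (field and model names are illustrative, not our exact schema):

```python
# models.py sketch: recorded clips saved under MEDIA_ROOT/recordings/
# and tracked as rows in the SQLite database.
from django.db import models

class Recording(models.Model):
    audio_file = models.FileField(upload_to="recordings/")
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return f"Recording {self.pk} ({self.created_at:%Y-%m-%d %H:%M})"
```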

Next is connecting the audio processing and note scheduling modules written by Marco and Angela to the web app. Once we have uploaded their portions to the web app, we can work on passing the recorded audio into these modules and then displaying the outputs back on the web app.