Angela’s status report, 2022/11/19

This week, my teammates and I decided to fully commit to the virtual piano. For my implementation of the note scheduler, this required small adjustments. First, I removed the limitation of 5 discrete volume levels, as well as the minimum volume imposed by the 5 N force threshold, since we no longer have solenoids. This will let us reach a wider range of volumes, both by lowering the floor of the quietest volume and by allowing finer granularity.

Furthermore, I’ve started to read documentation in preparation for writing another part of the project. My teammates and I discussed further testing and presentation methods for our final product, and we’ve decided to use speech recognition both as a testing method and as a way to present our work. We plan to run speech recognition on both the initial input and the final output as a way to measure fidelity. We will also use speech-to-text modules to create captioning when presenting our product to the user, in order to allow for easier recognition of what the piano is “saying”. I’ve examined the Uberi module, which seems appropriate for our project. An alternative is wav2letter, which is written in C++ and offers lower latency. I will discuss latency issues with my teammates at our next meeting to determine where the bottleneck is.
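As a rough illustration of how this fidelity test could look with the Uberi package, here is a sketch; the file names and the choice of the Google recognizer are placeholders rather than decisions we have made.

    import difflib
    import speech_recognition as sr  # the Uberi "SpeechRecognition" package

    def transcribe(wav_path):
        """Transcribe a .wav file to lowercase text."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)  # read the whole file
        # recognize_google needs an internet connection; recognize_sphinx is an offline alternative
        return recognizer.recognize_google(audio).lower()

    # Hypothetical file names for the original speech and the piano reconstruction.
    original_text = transcribe("input_speech.wav")
    piano_text = transcribe("piano_output.wav")

    # Similarity ratio in [0, 1] as a crude fidelity score.
    fidelity = difflib.SequenceMatcher(None, original_text, piano_text).ratio()
    print(f"transcript similarity: {fidelity:.2%}")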

Marco’s Status Report for 11/19/2022

This week, we decided to pivot away from the physical interface — more information on that can be found in our team status report for this week. In light of this, I’ve been working with John and Angela to figure out how my work changes.

Here are my expected contributions “re-scoped” for our virtual piano interface:

  • Take in recorded audio from web app backend
  • Generate three files
    • A series of plots that give information about the incoming audio and frequencies perceived by the system
    • A digital reconstruction of the original wav file using the frequencies extracted by our averaging function
    • The originally promised csv file
  • Metrics
    • Audio fidelity
      • Using the reconstructed audio from the signal processing module, we can interview people on whether they can understand what the reconstructed audio is trying to say. This provides insight into how intelligible the generated audio is (reported as successful reports / total reports, as a percentage).
      • Generate information on what percentage of the original frequency samples is lost by the averaging function (reported as captured information / original information, as a percentage)
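As a concrete (if simplified) sketch of the second metric: this treats “information” as spectral energy, which is only a stand-in for whatever loss measure we settle on for the averaging function, and it assumes mono, 16-bit .wav files with matching sample rates.

    import numpy as np
    from scipy.io import wavfile

    def captured_information_pct(original_wav, reconstructed_wav):
        """Rough proxy: spectral energy of the reconstruction relative to the original."""
        rate_a, a = wavfile.read(original_wav)
        rate_b, b = wavfile.read(reconstructed_wav)
        assert rate_a == rate_b, "expect matching sample rates"
        n = min(len(a), len(b))
        spec_a = np.abs(np.fft.rfft(a[:n].astype(float)))
        spec_b = np.abs(np.fft.rfft(b[:n].astype(float)))
        return 100.0 * float(np.sum(spec_b ** 2) / np.sum(spec_a ** 2))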

Team Status Report 11/19/2022

This week, we decided to pivot away from our physical interface. This is unfortunate news, as we were making progress in various areas. However, after ordering a first round of materials for testing, we realized that our final batch of supplies would not arrive before the end of the semester.

Luckily, we accounted for this in our initial proposal, and can now pivot towards a fully virtual implementation — one that uses the web application John has been working on to display the results of Angela and Marco’s work so far.

To that end, we’ve listed out some of the re-worked scopes of our project below:

  • Note Scheduling:
    • Pivot from 5 discrete volume levels to more volume levels
    • Take advantage of newfound dynamic range at quieter volumes: no longer limited by 5N minimum threshold
    • Latency from input to output: 2% of audio length
    • Threshold for replaying a key: 15% of max volume between each timestamp
  • Web App:
    • Take in recorded audio from user (either a new recording or an uploaded file)
    • ‘Upload’ recording to audio processing and note scheduler
      • (Stretch goal) → Save csv file on backend (in between audio processing and note scheduler) for re-selection in future.
    • Upon completion of audio processor, web app displays graphs of audio processing pipeline/progress
    • Run ‘Speech to Text’ on audio file and support captions for virtual piano.
      • Probably run in conjunction with audio processing such that we can more immediately display the virtual piano upon finishing the processing.
    • Show the virtual piano on a new page that takes the audio playback and shows notes ‘raining’ down onto the keys, drawing inspiration from ‘Pianolizer’ (https://github.com/creaktive/pianolizer)
    • In order to optimize latency → Web app would prioritize processing just the audio and playing it back on the virtual piano
      • On a separate tab, we will show graphs
    • Metrics:
      • Latency
        • Time between submitting audio recording to processing and return of graphs / audio for virtual piano
        • Defined as a function of input audio length (a rough timing sketch follows this list)
  • Signal Processing
    • Take in recorded audio from web app backend
    • Generate three files
      • A series of plots that give information about the incoming audio and frequencies perceived by the system
      • A digital reconstruction of the original wav file using the frequencies extracted by our averaging function
      • The originally promised csv file
    • Metrics
      • Audio fidelity
        • Using the reconstructed audio from the signal processing module, we can interview people on whether they can understand what the reconstructed audio is trying to say. This provides insight into how intelligible the generated audio is (reported as successful reports / total reports, as a percentage).
        • Generate information on what percentage of the original frequency samples is lost by the averaging function (reported as captured information / original information, as a percentage)
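For the web app latency metric above, a minimal timing sketch could look like the following; “process” is a placeholder for whatever function ends up wrapping the full processing pipeline.

    import time
    import wave

    def latency_pct(wav_path, process):
        """Processing time reported as a percentage of the input audio's length."""
        with wave.open(wav_path, "rb") as wf:
            audio_seconds = wf.getnframes() / wf.getframerate()
        start = time.perf_counter()
        process(wav_path)                       # placeholder for the real pipeline call
        elapsed = time.perf_counter() - start
        return 100.0 * elapsed / audio_seconds  # e.g., compare against the 2% of audio length target above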

John’s Status Report for 11/19

This week, after settling on implementing our virtual piano interface instead of the physical piano, we rescoped our project and agreed upon some new parameters and features.

In terms of the web app, we decided that it would be the start and end point for the user. What this means is that we offer an interface for recording audio or uploading a .wav file; this uploaded audio then gets sent to the audio processing and note scheduling modules; and lastly, the results get displayed back on the web app, where the audio is recreated using a virtual piano interface.

In terms of new features and parameters, we are planning to add a speech-to-text library to create captions and to have an objective way of testing the fidelity of the output audio. Creating captions would help with interpreting the words ‘spoken’ by the piano and, by toggling them on and off, we can run experiments to see how easy it is to interpret the audio with and without captions. Introducing a speech-to-text library would also allow us to take a speech-to-text reading of both the initial spoken audio and the piano output, which we could compare to see whether an objective program can understand the piano output.

This week, after coming together as a team to discuss these changes, I have mainly been focused on implementing a virtual piano interface on the web app. So far, I have found a well-documented repository (found here: https://github.com/creaktive/pianolizer) that emulates a system very similar to what we want. I have begun picking it apart to understand how to create something similar and add it to our own app. I am planning to extract the piano design and note visualization scheme used by the repository, since this would save us a lot of time and it is mostly CSS and image creation, which is a bit removed from the engineering concepts that we could implement more deeply in other areas of the project.

Next up is testing the full loop of taking in audio, processing it, and hearing it played back. For this, we need more testing of the other modules, plus ‘glue’ code to tie everything together and fine-tune our testing parameters.

Marco’s Status Report 11/12/2022

This week I started designing the PCB for the physical interface and sourcing parts from JLCPCB.com, which is where we’ll be manufacturing the PCB. Here is a list of all the parts I found. For the PCB, I had to design the receptacle that will hold the shift registers we ordered; below is a screenshot of the 3D model I generated for the receptacle.

This week we also presented our interim demo, after which we talked about some issues we’ve run into with the audio processing module. I’ll try to introduce the issue here, but some further investigation might be necessary if the reader is unfamiliar with certain signal processing concepts. I’ll try to add some links to further information where I can!

Our physical interface has a play rate, i.e., the rate at which we can play keys with a solenoid, of 14 times per second. This rate dictates how many samples we can devote to each key press from the original audio signal, which is recorded at a sampling rate of 48 kHz. We call this our window size, and it works out to 48,000 / 14 ≈ 3428 samples per window. These are the samples we can use to perform the Fast Fourier Transform (FFT). One thing to note about the FFT is that the window size dictates how many frequency bins we have access to within a given window. Frequency bins are evenly spaced points along the frequency domain that divide up the range of recordable frequencies. In our case, with a 48 kHz sample rate and a window size of 3428 samples, there are 3428 frequency bins, which gives us a step size of 48 kHz / 3428 ≈ 14 Hz. This means that each bin is separated by 14 Hz in the resultant array we get from the FFT of our window. This is unfortunate because piano key frequencies do not fall neatly on 14 Hz steps (e.g., key 1 has a fundamental frequency of 29.135 Hz), and the lowest keys are spaced less than 14 Hz apart, so adjacent bins cannot distinguish them.
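To see the numbers concretely, here is a quick NumPy check of the bin spacing, just a sketch using the figures above:

    import numpy as np

    SAMPLE_RATE = 48_000                       # Hz
    PLAY_RATE = 14                             # key plays per second
    WINDOW_SIZE = SAMPLE_RATE // PLAY_RATE     # ≈ 3428 samples per window

    # Frequencies of the FFT bins for one window of real-valued audio.
    bins = np.fft.rfftfreq(WINDOW_SIZE, d=1.0 / SAMPLE_RATE)
    print(f"window size: {WINDOW_SIZE} samples")
    print(f"bin spacing: {bins[1]:.2f} Hz")    # ≈ 14 Hz, coarser than the spacing of the low piano keys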

We’re currently investigating solutions to this issue, some of which include:

  • Rounding the piano key frequencies to their nearest integer, giving our piano key domain a step size of 1 Hz. With that, we can interpolate the 3428 frequency bins onto the range [0, 5000] Hz with a step size of 1 Hz.
  • Zero-padding our time-domain window (filling it with 0’s)
  • Reducing the sample rate to around 16 kHz, since it would help us work with smaller datasets, compute faster, and better isolate the frequencies we care about

I’ll be implementing these approaches and investigating whether they resolve the issue at hand.
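As a rough NumPy sketch of the second and third options (the random array just stands in for one recorded window): zero-padding interpolates the spectrum to a finer bin spacing but adds no genuinely new frequency information, while lowering the sample rate mainly buys us smaller datasets, since the bin spacing depends only on the window’s duration.

    import numpy as np

    SAMPLE_RATE = 48_000
    WINDOW_SIZE = 3428
    window = np.random.randn(WINDOW_SIZE)      # stand-in for one window of recorded audio

    # Option 2: zero-pad the time-domain window before the FFT.
    # Padding to 4x the length interpolates the spectrum to ~3.5 Hz bin spacing.
    padded_len = 4 * WINDOW_SIZE
    spectrum = np.abs(np.fft.rfft(window, n=padded_len))
    print(f"zero-padded bin spacing: {SAMPLE_RATE / padded_len:.2f} Hz")

    # Option 3: at 16 kHz, the same 1/14 s window holds 16000 / 14 ≈ 1142 samples,
    # and the bin spacing (sample rate / window size) is still about 14 Hz.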

John’s Status Report for 11/12

Early this week, I was focused on polishing the web app for the demo. I was able to integrate the audio processing code such that when users upload an audio file, it is accessed by the Python audio processing code, which performs the processing and outputs the audio with the piano note frequencies. In the process, I discovered some issues with the file format of the audio recorded by the web app. The Python processing code expects the audio to be in .wav format with a bit depth of 16 (16 bits used for the amplitude range); however, the web app recorded in the webm format (a format supported by major browsers). I attempted to convert the webm file to wav directly using the native Django and JavaScript libraries, but there were still issues getting the file header into the .wav format. Luckily, we discussed this as a group and came across a JavaScript library called ‘Recorderjs’ (https://github.com/mattdiamond/Recorderjs). This allowed us to record the audio directly to .wav (bypassing the webm format) with the correct bit depth and sample rate (48 kHz). With this library, I was able to successfully glue the web app code to the audio processing code and get the web app taking in the audio and displaying all the graphs of the audio through the processing pipeline.
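As an illustration of the kind of check the backend can now run before handing an upload to the processing code, here is a sketch with the standard-library wave module; the function name is hypothetical and this is not our actual view code.

    import wave

    def validate_wav(path, expected_rate=48_000, expected_bit_depth=16):
        """Reject uploads the audio processing code cannot handle."""
        # wave.open raises wave.Error on non-wav containers such as webm.
        with wave.open(path, "rb") as wf:
            rate = wf.getframerate()
            bit_depth = wf.getsampwidth() * 8
        if rate != expected_rate or bit_depth != expected_bit_depth:
            raise ValueError(
                f"expected {expected_rate} Hz / {expected_bit_depth}-bit wav, "
                f"got {rate} Hz / {bit_depth}-bit"
            )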

We were not able to get the final processed audio played back due to difficulty performing the inverse Fourier transform with the data we had. In an effort to better understand Fourier transforms and our audio processing ideas, I talked to Professor Tom Sullivan after one of my classes with him, and he explained the advantages of using Hamming windows for the processing and how we could potentially lower our sampling rate to save processing time and better isolate the vocal range for a higher-resolution Fourier transform.
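As a small sketch of what applying a Hamming window before the FFT looks like (where exactly this sits in our pipeline is still open):

    import numpy as np

    def window_spectrum(samples):
        """Magnitude spectrum of one analysis window with a Hamming taper applied."""
        tapered = samples * np.hamming(len(samples))  # tapers the edges to reduce spectral leakage
        return np.abs(np.fft.rfft(tapered))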

With this information, we are in the process of making our audio processing configurable, allowing modular changes to many parameters (FFT window, sample rate, note scheduling thresholds, etc.). I am also fixing the audio playback so we can hear the output and get a sense of how well our processing performs.

My plan for the upcoming week is to work with the group to identify how we will set up the testing loop (input audio with different parameters, hear what it sounds like, see how long the processing takes, evaluate, then iterate). I will also be integrating the note scheduling code with our backend so that we can control the stream of data sent to the Raspberry Pi via sockets.

Angela’s status report, 2022/11/11

At the beginning of this week, I helped my teammates prepare for the demo. We debugged an issue where the audio file was being saved as a .webm file instead of a .wav file. This opportunity allowed me to read a lot of Django documentation which I’m sure will be helpful when it comes to writing glue code between further parts of the system in the upcoming weeks.

After discussion with the professors during the Wednesday demo and discussions with my teammates afterwards, I began to reconsider the way I was implementing volume in the key pressing. Initially, I had made some assumptions:

1. Since 5N was the minimum force for the sound to be heard, any increment less than 5N would not result in a noticeable difference in sound.

2. Discrete levels of volume were “good enough” for the purposes of our project.

Upon further consideration, I realized that this was unnecessarily limiting the dynamic volume range of our system. Since we are using PWM to control the solenoid force, we can have any force level from 0 to 100% (0 to 25 N). Since we need at least 5 N to sound a key, this gives us 5 to 25 N. I also adjusted the volume level calculations to better reflect the relationship between force and volume. Since the audio processing output gives us amplitude in pascals (newtons per square metre) and the distance is constant, the volume parameter is linear in newtons. Previously, I had mistakenly assumed that the volume was in decibels and had implemented a logarithmic relationship between the two.
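As a sketch of the updated mapping (the normalization by a maximum amplitude is an assumption here; the real scheduler may normalize differently):

    MIN_FORCE_N = 5.0     # minimum force that audibly sounds a key
    MAX_FORCE_N = 25.0    # force at a 100% PWM duty cycle

    def amplitude_to_duty_cycle(amplitude_pa, max_amplitude_pa):
        """Map a linear amplitude (Pa) onto a PWM duty cycle percentage.

        Assumes volume is linear in force, so the amplitude is scaled directly
        into the usable 5-25 N range instead of through a logarithmic curve.
        """
        if amplitude_pa <= 0:
            return 0.0    # silence: do not press the key
        scale = min(amplitude_pa / max_amplitude_pa, 1.0)
        force = MIN_FORCE_N + (MAX_FORCE_N - MIN_FORCE_N) * scale
        return 100.0 * force / MAX_FORCE_N   # 20% duty cycle at 5 N, 100% at 25 N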

Team status report, 2022/11/11

At the beginning of this week we worked on our glue code to get ready for the demo. We have now connected the web app input module and the signal processing module. The system is able to record user speech, start processing it, and display graphs that represent the processing steps. In the upcoming week, we will complete the glue code for the note scheduler as well as its output to the Raspberry Pi.

On Monday, we discussed planning for future weeks, as well as testing, with Byron. We have started to plan tests for different parameters, both quantitative and qualitative.

We ran into a problem with the resolution of our audio processing filtering. Due to our FFT window size, the resolution of our frequency data is a bit low: the FFT gives us frequency bins in steps of 14 Hz. However, the spacing between adjacent piano keys shrinks as frequency decreases, and at the lowest bass notes the keys are less than 14 Hz apart. As a result, in the lower frequencies, some keys have no frequency bin mapping to them.

We have discussed solving this problem with Prof. Sullivan, who suggested a Hamming window. This method would also allow us to lower our sample rate and reduce the size of the dataset we have to work with.

Marco’s Status Report 11/5/2022

Hello,

On Monday I prepared for our ethics discussion. We had some really interesting points brought up surrounding adversarial use of our project and questions about what we would do in those situations. The remainder of the weekend I was away from campus at the Society of Hispanic Professional Engineers national convention in Charlotte, North Carolina. Now that I’m back in Pittsburgh I’ll be working on finishing our deliverables for the demo on Wednesday of this week.

Angela’s status report, 2022/11/05

This week I continued to work on the note scheduling module. Last week I completed all the main functions, but I was unhappy with the state of the syllable and phoneme recognition. (Note: a phoneme is an atomic unit of speech, such as a vowel sound or a consonant sound in English).

Phoneme recognition is important for our project as it allows us to know when to lift or press piano keys that have already been pressed, and when to keep them pressed to sustain the sound. This allows for fluid speech-like sounds, as opposed to a stutter.

First, I read about how speech recognition systems handle syllable recognition. I learned that it is typically done through volume amplitudes: when someone speaks, the volume dips between syllables. I discussed using this method with my team, but we realized that it would fail to account for phonemes. For example, the words “flies” and “bear” are both monosyllabic but require multiple phonemes.

I’ve now implemented two different methods for phoneme differentiation.

Method 1. Each frequency at each time interval has its volume compared to its volume in the previous time interval. If it’s louder by a certain threshold, it is pressed again. If it’s the same volume or slightly quieter, it’s held. If it’s much quieter or becomes silent, the key is either lifted and re-pressed with a lower volume or just lifted.

Method 2. At each time interval, the frequencies and their amplitudes are abstracted into a vector. We calculate the multidimensional difference between the vectors at different time intervals. If the difference is larger than a threshold, it will be judged to be a new phoneme and keys will be pressed again.
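As a sketch of Method 2 (the frame layout and threshold here are placeholders; Method 1 works similarly but compares each frequency’s volume on its own):

    import numpy as np

    def new_phoneme_flags(frames, threshold):
        """Method 2 sketch: flag time intervals that start a new phoneme.

        frames: 2-D array with one row per time interval and one column per
                tracked frequency, holding that frequency's amplitude.
        threshold: tunable distance above which the interval counts as a new phoneme.
        """
        flags = [True]  # the first interval always starts a phoneme
        for prev, curr in zip(frames[:-1], frames[1:]):
            distance = np.linalg.norm(curr - prev)  # multidimensional difference of the two vectors
            flags.append(distance > threshold)
        return flags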

In the upcoming weeks we will implement ways to create the sounds from the key scheduling module and test both these methods, as well as other methods we think of, on volunteers to determine the best method for phoneme differentiation.