Angela’s Status Report, 2022/10/08

At the beginning of the week, I helped prepare the slides and studied the different aspects of the project in preparation for questions. Even though I’m not directly in charge of areas like the audio processing, re-familiarizing myself with those concepts was worthwhile, both for fielding questions and for working on the project collaboratively.

I also began the coding process. I have now outlined all the functions of the note scheduling module, along with the inputs, outputs, preconditions and postconditions for each. This was possible because Marco and I discussed the format of the audio processing module’s output last week. I have also written pseudo-code for some of the functions.

I expressed interest earlier on in helping with some of the audio processing, especially with writing our custom FFT function. I thought it would be an interesting problem to work on, as the FFT reduces the Fourier transform’s time complexity from O(n^2) to O(n log n). Since we will have access to AWS resources, we would also be able to further speed up the process with parallelism. I will investigate later whether this is necessary for near-real-time speech-to-piano. It’s possible that the bottleneck will be elsewhere, but if it is in the audio processing, it’s good to know we have options for improving the timing here.
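
For reference, here is a minimal sketch of the recursive radix-2 Cooley-Tukey idea behind that speedup, assuming a power-of-two input length. It is only an illustration of the algorithm, not the custom implementation we plan to write:

    # A minimal sketch of the recursive radix-2 Cooley-Tukey FFT, only to
    # illustrate where the O(n log n) speedup over the naive O(n^2) DFT
    # comes from. Assumes the input length is a power of two; this is not
    # our final implementation.
    import cmath

    def fft(x):
        n = len(x)
        if n == 1:
            return list(x)
        even = fft(x[0::2])   # transform of even-indexed samples
        odd = fft(x[1::2])    # transform of odd-indexed samples
        out = [0j] * n
        for k in range(n // 2):
            w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
            out[k] = even[k] + w
            out[k + n // 2] = even[k] - w
        return out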

Marco’s Status Report 10/1/2022

This week I met with the team to finish discussing the design choices that needed to be made regarding our individual contributions. These are some of the questions we answered regarding my work and the project at large:

Where is everything going to be hosted?

We have three options:

  • A large portion of the processing and scheduling can be hosted on AWS
    • The remote server communicates with a Raspberry Pi that sends power to the solenoids that press the keys
    • Pro: We gain access to more compute power
    • Con: Communication between the user’s computer and the Raspberry Pi will be bottlenecked by the internet upload and download speeds of each device
  • All of the computation is hosted on a Raspberry Pi
    • Pro: We are no longer bottlenecked by transmission rates between devices
    • Con: We lose the compute power necessary to host everything
  • All of the computation is hosted on a Jetson Nano
    • Pro: We are no longer bottlenecked by transmission rates between devices
    • Pro: We gain access to more compute power

Were we completely set on implementing the physical device, we would build everything on a Jetson Nano. However, if our proof-of-concept experiment doesn’t go well and we pivot towards a virtual piano, the need for speed goes away and AWS becomes the best option. With this in mind, we’ve chosen to go with the AWS model: it provides the compute power necessary to host the majority of the processes, and it leaves us room to migrate those processes onto a Jetson if the physical interface does eventually get built.

How are the frequencies for each sample going to be gathered?

  • The audio file contains information in the time domain
  • We need to sample the audio file for some time in order to collect the frequencies that make up the sounds played within that time
    • If that window is too short, we get an inaccurate picture of which frequencies make up that sound
    • If the window is too long, we won’t process the incoming sounds at a rate that is pleasant to the user
    • A natural compromise arises from the sampling rate of the audio file and the play rate of the piano keys. An audio file is sampled at 44.1 kHz (i.e., one sample every ~0.0227 ms), and a piano key can be pressed at most 15 times per second (i.e., once every 66.67 ms); check out Angela’s post for more information on how we arrived at that number! At those rates, there are 44,100 / 15 = 2,940 audio samples between the moments we can play a key. This window is large and within our timing constraints — exactly what we were looking for! (A quick check of this arithmetic is sketched below.)
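
The arithmetic above is simple enough to check in a couple of lines. The names below are only illustrative, not taken from our codebase:

    # Sanity-checking the window size: 44,100 audio samples per second,
    # and a piano key can be pressed at most 15 times per second.
    SAMPLE_RATE_HZ = 44_100
    MAX_PRESSES_PER_SECOND = 15

    samples_per_window = SAMPLE_RATE_HZ // MAX_PRESSES_PER_SECOND   # 2940 samples
    window_ms = 1000 * samples_per_window / SAMPLE_RATE_HZ          # ~66.67 ms

    print(samples_per_window, "samples per window,", round(window_ms, 2), "ms")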

Are we ‘punching holes’ in the audio or ‘blurring’ it around the frequencies of the keys?

  • 5 kHz is typically the highest frequency considered in speech perception research, and 80 Hz is typically the lowest fundamental frequency of the adult voice [1]. That’s a range of 5000 – 80 = 4920 Hz that the human voice could be made up of. With only 69 keys (i.e., 69 distinct frequencies), if we simply picked out the energy at exactly those 69 frequencies from the human speech input, we’d at best be able to collect 69/4920 ≈ 1.4% of the frequencies that make up the input. This is what I refer to as ‘punching holes’ through the input.
  • Instead, we’ll use the 69 distinct key frequencies as centers and take an average of the energy at the nearby frequencies around each one — this is what I refer to as ‘blurring’ the input. By blurring the input as opposed to punching holes in it, we’ll be able to collect more information about the frequencies that make up the incoming speech. A rough sketch of this averaging is shown below.
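
Here is a rough sketch of what that blurring could look like in Python with NumPy. The function name, the ±10 Hz band width, and the plain mean are placeholder choices for illustration, not final design decisions:

    import numpy as np

    def key_energies(window, sample_rate, key_freqs, half_band_hz=10.0):
        """Average spectral magnitude in a small band around each key frequency."""
        spectrum = np.abs(np.fft.rfft(window))                 # magnitude spectrum
        freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
        energies = []
        for f in key_freqs:
            in_band = (freqs >= f - half_band_hz) & (freqs <= f + half_band_hz)
            energies.append(spectrum[in_band].mean() if in_band.any() else 0.0)
        return np.array(energies)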

References

  1. Monson, B. B., Hunter, E. J., Lotto, A. J., & Story, B. H. (2014). The perceptual significance of high-frequency energy in the human voice. Frontiers in Psychology. Retrieved October 1, 2022, from https://www.frontiersin.org/articles/10.3389/fpsyg.2014.00587/full

Angela’s Status Report, 2022/10/01

This week I met with my teammates to talk about the specifics of implementing our project. We also received our first parts and are excited to start working on our proof-of-concept for the physical piano interface soon.

Something I thought a lot about, as I am in charge of scheduling the keys, is how often to play them. We know from piano manufacturer Yamaha that a piano’s inner mechanism allows each key to be pressed up to 15 times in one second. This is limited by the hammer, the element that strikes the strings to create sound: it takes approximately 1/15 of a second to leave the string after each strike.

Even though we will be working on a digital piano and not an acoustic one, digital pianos are made to imitate acoustic pianos (when on the “piano” setting). Therefore, we can assume that a digital piano’s keys can also be played approximately 15 times per second. I first decided that I would schedule the keys at 15 Hz. Upon further consideration, I realized that human speech produces phonemes at a far slower rate than 15 per second. I also realized that striking a key at 15 Hz for as long as its frequency is present would result in a “stutter”: instead of “Hello” in 2 syllables, we would hear many syllables. This would render the speech both inaccurate and unintelligible.

I decided that the keys should be scheduled as follows: at each time period, we compare the status of each frequency to its status at the previous time period. These statuses are encoded as booleans: 0 if the frequency is not heard, and 1 if it is. If the frequency goes from 0 to 1, we press the key. If it goes from 1 to 0, we release the key. Otherwise, the key retains its former position. This should result in acceptable fidelity to human speech. The length of the time periods has not been decided yet; we will experiment with different update rates to determine the best one for evoking speech.
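
A minimal sketch of that press/release rule, assuming the statuses arrive as one 0/1 flag per key for each time period (the exact data format is still to be agreed on with Marco):

    def schedule_keys(prev_status, curr_status):
        """Return (keys_to_press, keys_to_release) given per-key 0/1 statuses."""
        press, release = [], []
        for key, (before, now) in enumerate(zip(prev_status, curr_status)):
            if not before and now:        # 0 -> 1: frequency appeared, press the key
                press.append(key)
            elif before and not now:      # 1 -> 0: frequency disappeared, release it
                release.append(key)
            # otherwise the key keeps its previous position
        return press, release

    # Example: key 0 turns on, key 2 turns off, key 1 is held.
    print(schedule_keys([0, 1, 1], [1, 1, 0]))   # ([0], [2])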

Team Status Report for Oct. 1

This week we furthered our research and planning process. We thought carefully about the design choices and implementation details, and considered several alternative ways to implement different functionalities.

Prof. Yu asked many questions during meetings regarding the “why” behind our design choices, so we decided to think hard and justify them – and change them if we couldn’t.

Firstly, we reconsidered the hosting of our web back-end, audio processing and note scheduling logic. We originally planned for it to be on AWS, a decision that would give us access to substantial computing power. However, the performance would be bottlenecked by the internet upload and download speeds of each device. An alternative idea is to host it locally on hardware, either a Raspberry Pi or an NVIDIA Jetson Nano. The former is a more familiar device for all three of us, while the latter would allow for more computing power. Both devices would free us from the transmission-latency bottleneck. However, we are not yet certain about the physical piano. If our proof-of-concept is unsuccessful and we decide to instead commit to a virtual piano, AWS becomes the best option. Should we stick with the physical piano, we would already have the logic implemented and ready to be migrated onto a Jetson.

We also considered the information retention rate of the audio processing. Initially, we planned to determine frequencies by sampling the spectrum at exactly the frequencies of the keys. Upon further consideration, taking into account the number of keys and the total frequency range, we realized that this would mean losing over 98% of the frequency content. We will instead take a smoothed average of the amplitudes of the frequencies surrounding each piano key.

John’s Status Report for 10/1/22

This week, I have been mainly focused on hashing out the specifics of the project with the rest of the team and gathering ideas on what exact requirements and metrics are needed from my part of the project (web app interface). After having a very productive conversation with Professor Byron on Wednesday about how to best go about testing our metrics and requirements, I went back to the drawing board with the team and we were able to not only gain a better understanding of our own design, but also have more clarity on the next steps of testing and implementation.

We ordered parts early this week and they have already come in. Now that we have planned how to set up our physical proof of concept, we will build it out this coming week and set up a series of experiments to understand our optimal circuitry and communication requirements. We have already begun to draft ideas for experiments to determine our PWM period for the solenoids, how fast we can press a piano key, how to encode our data signal so that notes can be sustained over many samples, how much incoming audio to collect before taking each Fourier transform, and so on. Hopefully, in the coming week we will have answers to many of these questions that will guide us in our design.
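
As one example of what the solenoid experiments could look like, here is a rough sketch of sweeping PWM duty cycles on a Raspberry Pi with the RPi.GPIO library. The pin number, carrier frequency, and duty cycles are placeholders, and we haven’t committed to this library or to driving the solenoids this exact way:

    import time
    import RPi.GPIO as GPIO

    SOLENOID_PIN = 18                  # placeholder BCM pin number

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(SOLENOID_PIN, GPIO.OUT)

    pwm = GPIO.PWM(SOLENOID_PIN, 100)  # 100 Hz carrier, to be tuned experimentally
    pwm.start(0)
    try:
        for duty in (25, 50, 75):      # sweep duty cycles and watch the key response
            pwm.ChangeDutyCycle(duty)
            time.sleep(1.0)
    finally:
        pwm.stop()
        GPIO.cleanup()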

Team Status Report for 9/24

This week we finished preparing our Gantt chart, which, after the feedback we received from our proposal presentation, was refactored to include less work within the first two weeks of development. John went out to the piano rooms in the Hall of Arts to gather measurements of the piano we’re planning to work on. Below are some sample images of the measurements he took.

The team met on Friday to discuss the feedback from our presentation. One of the questions in our feedback asked, “Why are you interested in using all 88 keys if you’re replicating speech?”. This was an interesting question; initially we had assumed we’d need all 88 keys. However, Angela and John noticed that if you look at the Mark Rober video we presented in class, the leftmost 1/4 of the keys on the piano aren’t being played! To play music on the piano, yes, we might need all 88 keys, but as it turns out, adult speech has a much smaller frequency range.

The voiced speech of a typical adult male has a fundamental frequency from 85 to 155 Hz, and that of a typical adult female from 165 to 255 Hz [1]. From the recording of Marco’s voice, we can also see that the frequencies that make up his voice seem to lie in a much smaller range than the one covered by a piano’s 88 keys. So, while the original question was “Do we need all 88 keys?”, the answer might be no! That is exciting because it means we may be able to look into smaller keyboards (which are also cheaper!).
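
As a quick sanity check of how small that range is, the standard 88-key numbering (A4 = 440 Hz = key 49) lets us map those fundamental-frequency ranges onto key numbers. This is just a back-of-the-envelope script, not project code:

    import math

    def nearest_piano_key(freq_hz):
        """Nearest key number on a standard 88-key piano (A4 = 440 Hz = key 49)."""
        return round(12 * math.log2(freq_hz / 440.0) + 49)

    # Typical adult fundamental-frequency ranges quoted above [1].
    for label, lo, hi in [("male", 85, 155), ("female", 165, 255)]:
        print(label, "voice: keys", nearest_piano_key(lo), "to", nearest_piano_key(hi))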

This question will be investigated and hopefully answered by the end of this following week.

References:

1. Baken, R. J. (2000). Clinical Measurement of Speech and Voice, 2nd Edition. London: Taylor and Francis Ltd. (pp. 177), ISBN 1-5659-3869-0.

Marco’s Status Report for 9/24

This week I prepared our proposal slides for Wednesday’s presentation with John’s help, and I met with John on Tuesday to rehearse before the presentation. After the feedback we received, the team and I met to reshape our project timeline on the Gantt chart. We’ve been contacting suppliers for the solenoids and have started gathering quotes. We’re planning on buying a small batch of parts (5 solenoids, some MOSFETs, etc.) in order to build the proof-of-concept physical interface.

Going into next week I will be making some concept drawings for the frame, and building a prototype of the circuit that drives the solenoids. Tomorrow, we will be meeting again to discuss any implementation details we have questions about before we work on our individual parts.

Angela’s Status Report for 9/24

The beginning of this week was focused on preparing the slides and deciding on the content of the proposal presentation. On Sunday we met to discuss this and spent time working on the PowerPoint as well as the Gantt chart.

Unfortunately, I became and still am ill with strep throat, so I wasn’t able to be as productive as I had originally planned. Furthermore, the TA team had concerns about the scope of our project, so I met with my teammates to discuss that. We also decided that our Gantt chart was too demanding – it was very rushed, and every week carried a heavy workload. Theoretically, the workload I had assigned myself was possible if I didn’t have other classes and commitments. I decided to scale back the amount of weekly work and reformatted my Gantt chart to reflect this.

I also thought about the modularity of our system. Since we’re all working in parallel and expect (hope!) that the pieces come together to form a fully functioning system, I suggested to my teammates that we come up with some sort of diagram that outlines all software, hardware, and physical systems, as well as their inputs/outputs and the formats they should be in. This way, we can give each other a standard to work from when it comes to integrating and interfacing with the different parts of the design. I have begun writing a list of modules for the performance scheduler, as well as the inputs and outputs of each. Next week I hope to help my teammates come up with such a list for their parts and compile all this information into a diagram we can use as a point of reference.
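
One lightweight way to make that standard concrete, beyond a diagram, would be small typed records for each module boundary. The field names below are hypothetical placeholders until we actually agree on the formats:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AudioWindowResult:
        """Hypothetical output of the audio processing module for one time window."""
        window_index: int
        key_active: List[bool]      # one flag per piano key: frequency heard or not

    @dataclass
    class KeyCommand:
        """Hypothetical input to the physical/virtual piano interface."""
        key_index: int              # 0-based key number
        press: bool                 # True = press, False = release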

John’s Status Report for 9/24

This week I helped prepare our proposal presentation and presented it to the class. Reading through the feedback really helped realign some of our team design goals and put our project into perspective. I have begun formulating ideas for our piano-playing interface proof of concept. To do this, I went to the music technologies classroom in the basement of HOA and took many measurements of the piano we will be building our system around. These gave us some insight into the overall size of our build. I have also begun planning the website that will host the UI controls for our project. This has involved prioritizing basic playback (pause/play) features and organizing our Gantt chart to reflect our priorities. Luckily, we are on schedule since it’s still early on, and this next week I hope to finalize our proof-of-concept circuit design, order some physical interface parts, and begin the website.