Team Report for April 24

We performed our demo last Wednesday and received generally positive feedback from Professor Sullivan and Abha. We had some technical difficulties at the beginning of the demo, so we have since been adding robustness to the affected routines; the issue occurred when two microphones were connected to a meeting and one of them was disconnected. We also worked on testing the AWS deployment and formalizing the testing scripts, added the ability to download meeting transcripts as PDFs, and started working on our final presentation slides.
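
As a rough illustration of the PDF download feature, here is a minimal sketch of rendering a transcript to a PDF, assuming the reportlab library (the library choice, function name, and formatting here are illustrative, not necessarily what our implementation uses):

# Hypothetical sketch: render (speaker, text) transcript lines into a simple PDF.
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def transcript_to_pdf(lines, path="transcript.pdf"):
    c = canvas.Canvas(path, pagesize=letter)
    width, height = letter
    y = height - 72                       # start one inch below the top of the page
    for speaker, text in lines:
        c.drawString(72, y, f"{speaker}: {text}")
        y -= 18                           # move down one line
        if y < 72:                        # start a new page when the current one is full
            c.showPage()
            y = height - 72
    c.save()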

Since we are performing our tests now, any major issue that requires an implementation overhaul would force us to scramble to complete it in time.

Below is an image of our contraption. It has a Raspberry Pi, microphone, ReSpeaker, and custom ReSpeaker case.

Below is an image of the transcript of a meeting in progress.

Ellen’s Status Report for April 24

We’ve made a lot of progress over the last two weeks. Our demo last Wednesday went well, and afterwards, based on the feedback we received, we decided to revamp the speaker ID setup process to make it easier for users to understand. I worked on that all weekend, and then we tested it a bit on Monday while testing the AWS deployment.

Then on Tuesday and Thursday I worked on outlining and fleshing out our final presentation slides. I also came up with a potential method for transcript latency testing.

Since we’ve already moved on to validating requirements (with high confidence that they’ll pass), I’d say we’re on schedule according to our Gantt chart. This week we’ll finish gathering and evaluating test data; I’ll probably be calculating a lot of transcript-related error rates. We’ll also prepare our final presentation and work on our video.
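
As one example of the kind of transcript-related error rate we expect to compute, here is a minimal word error rate (WER) sketch based on word-level edit distance; the exact metrics and tooling we end up using may differ:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)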

Cambrea’s Status Report for April 10

Last week I completed the streaming code and AWS server code responsible for sending and receiving audio over the network. The ReSpeaker can detect whether a user is speaking via its is_voice() parameter. I tested this capability over the weekend and found that output audio filtered using this information is too choppy to be intelligible to the listener. We are currently testing whether packets tagged as voice contain enough data to produce a usable transcript when fed to the transcript generator.
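
A simplified sketch of the idea we are testing, in which only packets tagged as voice are kept for transcription (the packet format and names here are placeholders, not our exact code):

def collect_voiced_audio(packets):
    # Each packet is a (chunk, is_voice_flag) pair; the flag is set on the device
    # using the ReSpeaker's voice-activity output.
    voiced_audio = b""
    for chunk, is_voice_flag in packets:
        if is_voice_flag:          # keep only chunks tagged as speech
            voiced_audio += chunk
    return voiced_audio            # fed to the transcript generator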

This week we started integrating the individual systems, so we were working on campus on HH D level. On Monday and Tuesday I added the Raspberry Pis to the CMU-DEVICE WiFi. We had issues connecting the devices to the network, so we reflashed the OS onto the Raspberry Pis’ SD cards and reconfigured the WiFi; the devices now connect to CMU-DEVICE correctly.

On Wednesday we finished integrating the audio streaming on the audio devices with the transcript generation on the server. For this integration we are currently using Ellen’s computer as the server so that we can complete the integration for the demo before migrating to the AWS server. We are continuing to develop the speaker identification to make sure it correctly recognizes different speakers.

This week we will start the tests for transcript accuracy, prepare for the demo, and also start the migration to AWS.

Mitchell’s Status Report for April 10

This week I worked on debugging the transcript streaming and working with my group. The consumer was changed to a fully asynchronous model, and the webpage now properly updates when the transcript is fed in. There were also websocket instabilities that had to be debugged. For the group work, we met Mon., Wed., Fri., and Sat., and will meet Sunday, for multiple hours each time. I mostly helped test the system and make small tweaks like transcript autoscrolling.
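
A rough sketch of what the asynchronous consumer looks like, assuming Django Channels (the group and event names here are illustrative, not our exact ones):

import json
from channels.generic.websocket import AsyncWebsocketConsumer

class TranscriptConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # Subscribe this socket to the meeting's transcript group.
        await self.channel_layer.group_add("transcript", self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard("transcript", self.channel_name)

    async def transcript_update(self, event):
        # Push the new or rewritten transcript section down to the webpage.
        await self.send(text_data=json.dumps({"text": event["text"]}))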

From now on my schedule is the same as the group’s schedule. We think we have a demo prepared at this point, but we will perform further stress testing on the demo and on multi-mic interaction.

Team Report for April 10th

Last week our team finished our individual components: Cambrea finished audio streaming between the Raspberry Pi and the server, Ellen finished the speech-to-text and speaker identification ML, and Mitchell finished the database and website setup and transcript streaming. This week our team has focused heavily on complete system integration and testing. We have completed the connections between all components, and real transcripts of users’ input audio now stream to the website. We are currently working on improving our speaker identification during the setup phase of the meeting.

Our main risk right now is that the direction-of-arrival (DOA) data for a speaker can fluctuate slightly when speakers change or even while a single speaker is talking. This fluctuation mainly affects our setup phase, when we register speakers to be identified by the speaker identification ML. During the setup phase we rely on the DOA to determine whether a new speaker should be registered with the system. We are fixing and testing this by registering a new speaker only if a new DOA is detected and a significant amount of audio is coming from that direction, which should ignore small fluctuations of DOA.
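
A minimal sketch of that registration rule (the threshold values are placeholders we are still tuning):

DOA_TOLERANCE_DEG = 15     # treat DOAs within this range as the same direction (placeholder)
MIN_VOICED_CHUNKS = 20     # require this much voiced audio before registering (placeholder)

def angular_diff(a, b):
    d = abs(a - b) % 360
    return min(d, 360 - d)  # DOA wraps around at 360 degrees

def should_register(new_doa, voiced_chunk_count, registered_doas):
    # Register a new speaker only for a genuinely new direction with enough audio behind it.
    is_new_direction = all(angular_diff(new_doa, doa) > DOA_TOLERANCE_DEG
                           for doa in registered_doas)
    has_enough_audio = voiced_chunk_count >= MIN_VOICED_CHUNKS
    return is_new_direction and has_enough_audio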

Updated Gantt Chart Link

https://docs.google.com/spreadsheets/d/1eeHeut41JF_Ju9Ys14n_sOLiZGYdpl4HtEale2ySasY/edit?usp=sharing

Per our updated Gantt chart, we completed the task "Integration of RPi + Transcript + ML" half a week early, so we are currently working on the task "Complete System Debugging / Testing". Our current "complete system" for the demo uses a computer as the intermediary server that handles the speaker ML and transcript streaming. Ellen is working on some ML improvements for integration and testing at this time. After the demo, Cambrea and Mitchell will start migrating to the AWS server instead of our local computer for the task "AWS Migration". We will complete our latency testing and standard testing during weeks 12 and 13, after we have migrated to AWS. We will start revising our final report during week 11 and work on our final presentation and video during weeks 12-14.

Ellen’s Status Report for April 10

This week has been a busy one for our team! Over the weekend I wrote audio accumulation for the UDP side of the webserver, accumulating raw audio instead of files. Then from Monday forward our focus was on integration. After every meeting I had a bunch of to-do items to debug or improve in the server processing and transcript generator. We met Mon., Wed., Fri., and Sat. for multiple hours each time. My work has included small (but important) tweaks like an improved file-naming system, improved audio accumulation, improved error reporting, changing what does and does not count as a “speaker change” in the transcript and in the backend, and others I can’t recall at this point. Larger changes included changing the way speech-to-text results are accepted from Google (only accepting “final” and not interim results) and adding more branches and states to the state machine of the microphone setup.
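
The “final results only” change amounts to roughly the following when consuming Google Cloud Speech streaming responses (simplified; our actual server code handles more bookkeeping):

def collect_final_transcripts(responses):
    # responses is the iterator returned by the streaming recognize call
    finals = []
    for response in responses:
        for result in response.results:
            if result.is_final:                       # skip interim, still-changing results
                finals.append(result.alternatives[0].transcript)
    return finals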

From now on my schedule is the same as the group’s schedule as a whole. We think we have a demo prepared at this point, but when we meet tomorrow we’ll do more testing on it, as well as test how two mics in the same meeting work together. By the weekend we’ll be done integrating and debugging our two-mic setup and we’ll have some idea of how well the transcript is matching our requirements/targets.

Mitchell’s Status Report for April 3

This week I worked on integrating the transcript streaming and fabricating a case for the ReSpeaker. The case was modeled in SolidWorks, using VMware Horizon to access the Windows lab cluster. The transcript streaming was interfaced as a producer in the transcript-updating phase of the meeting manager, using asynchronous-to-synchronous communication to the channels layer. The consumer then appends or rewrites the changed sections and pushes those updates over the websocket, which updates the webpage.
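
A sketch of that producer side, assuming Django Channels’ channel layer (the group and event names are illustrative, not our exact ones):

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

def push_transcript_update(text):
    # Called from synchronous meeting-manager code; bridges into the async channel layer.
    channel_layer = get_channel_layer()
    async_to_sync(channel_layer.group_send)(
        "transcript",                                  # group the consumer subscribes to
        {"type": "transcript_update", "text": text},   # routed to the consumer's handler
    )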

At this point we’re reaching a period of schedule ambiguity. I do not have any more individually-assigned tasks. I will continue testing the transcript streaming and try to get a full system working locally on our computers.

Ellen’s Status Report for April 3

This week I worked intensively on the speaker identification subsystem. If you’ve already read the team status report you’ll know that we decided to ditch the speaker diarization packages we had previously identified as possible solutions – crucially, none of the packages provided audio streaming abstractions – and proceed with Google Cloud speaker diarization, which could be added to the current Google speech-to-text request by simply tweaking a configuration variable. This created the opportunity to integrate the speech-to-text and speaker ID modules, so I had a lot of code to write and rewrite both to get the speaker ID module integrated and to add in the processing the way we wanted it (including the DOA augmentation). Initial tests (using random vocal samples I found online) suggest to me that the system is going to work well and that DOA augmentation is actually going to be quite valuable – but I conducted this testing just yesterday, so the jury is still out, I suppose.
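
The configuration tweak amounts to roughly the following in the google-cloud-speech Python client (the speaker-count bounds and audio settings here are placeholders, not our exact configuration):

from google.cloud import speech

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,   # diarize within the same speech-to-text request
    min_speaker_count=2,               # placeholder bounds on expected participants
    max_speaker_count=6,
)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,           # placeholder; set to match the ReSpeaker capture rate
    language_code="en-US",
    diarization_config=diarization_config,
)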

At this point we’re reaching a period of schedule ambiguity. I would consider speaker identification to be finished as of yesterday, which is a good half a week early. I don’t have any more individually-assigned coding tasks; now I’ll be supporting integration through debugging and revision. As for what I want to accomplish this upcoming week: I want to rewrite the network-to-transcription queueing interface to accumulate multiple packets (if available) before sending the audio off to transcription, and I also want to help Mitchell get transcript streaming fully working.

Team Status Report for April 3

This week we started to wrap up our individual modular code-writing and looked towards integration and demoing. As we tested and composed some of the most complex sections of the solution, we had to do some problem solving that led to design changes and improvements.

There are two notable design changes. The first concerns audio I/O on the Raspberry Pi. Our initial design passed audio through multiple queue data structures as processing was performed on it and it was prepared for networking. However, we found that this caused latency issues that resulted in sporadic, choppy audio. Because the queueing was too slow, the audio stream no longer goes through any intermediary data structures before being networked, and audio processing is isolated on the webserver as transcript preprocessing.
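
A simplified sketch of the new approach on the Pi, where each captured chunk is sent straight to the server over UDP with no intermediary queue (the server address, sample rate, and chunk size here are placeholders):

import socket
import pyaudio

SERVER_ADDR = ("example-server-hostname", 5005)   # placeholder server address and port
CHUNK = 1024

p = pyaudio.PyAudio()
stream = p.open(rate=16000, format=pyaudio.paInt16, channels=1,
                input=True, frames_per_buffer=CHUNK)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    data = stream.read(CHUNK)        # capture one chunk from the microphone
    sock.sendto(data, SERVER_ADDR)   # send immediately; no intermediate queueing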

The other change pertains to the speech-to-text and speaker identification ML modules inside the transcript generator. We decided to use Google Cloud speaker identification because it allows us to do all of the ML processing in a single integrated transaction, which in turn allowed us to merge the two modules into one. This also decreases the amount of multithreading.

Now we’re looking towards full integration of all the components. We’re doing local testing before deploying to AWS, which creates the risk that what works on our own computers won’t work once we deploy properly.

Cambrea’s Status Report for March 27

This week I worked on fixing some audio issues with the audio device code. The issue was that when I went to play the output audio on the speaker (this should be the output audio from users), it was not the correct audio and instead was just computer noise. I was trying to play the raw audio using PyAudio like this:

import pyaudio  # the RESPEAKER_* constants and CHUNK are defined elsewhere in our device code

p = pyaudio.PyAudio()
stream = p.open(
    rate=RESPEAKER_RATE,
    format=p.get_format_from_width(RESPEAKER_WIDTH),
    channels=RESPEAKER_CHANNELS,
    output=True,
    frames_per_buffer=CHUNK,
    input_device_index=RESPEAKER_INDEX,
)
# data holds the raw audio bytes received over the network
stream.write(data)
The stream.write(data) call was not working with the raw audio, but it did work when I instead wrote the audio to a WAV file and called stream.write() on the data read back from the file.
I think the raw audio data was not in the correct format or type to be played through the stream, but when I print the type of the raw data it shows “bytes”, which is ambiguous; the stream documentation says it is meant to play bytes-like objects.
Going forward I chose to write the raw data to a WAV file and then play the WAV file through the stream, since that gives clear audio.
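
The workaround looks roughly like this (simplified; the WAV parameters reuse the same ReSpeaker constants as above, and the file name is a placeholder):

import wave

# Write the raw network audio to a WAV file so it carries explicit format information.
with wave.open("playback.wav", "wb") as wf:
    wf.setnchannels(RESPEAKER_CHANNELS)
    wf.setsampwidth(RESPEAKER_WIDTH)
    wf.setframerate(RESPEAKER_RATE)
    wf.writeframes(data)

# Read it back and play it through the PyAudio stream, which gives clear audio.
with wave.open("playback.wav", "rb") as wf:
    stream.write(wf.readframes(wf.getnframes()))
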
This week I also received the equipment for the second audio device and assembled it. I have downloaded the necessary code onto the device and am currently testing using my computer as the server (by this I mean I am running basic server code on my computer with both devices connected to it). I am testing that the devices send and receive the correct data to each other.
I am currently on track with my progress. Next week I will work on using the AWS server for the audio devices and on integrating this code with Ellen’s speech-to-text code.