Team Status Report for March 13

This week we were each busy with our individual programming responsibilities: Cambrea was working on networking, Mitchell on audio processing and the website, and Ellen on transcription. We also worked towards completing the design report document that’s due on Wednesday.

The risk we’ve been talking about recently is mismatched interfaces. While we write our separate modules, we have to be aware of what the other members might require from them. We have to discuss the integration of the individual parts and, if we discover that something different is required, we have to be ready to jump in and change the implementation. For example, Ellen made the transcript output a single text file per meeting. However, when Cambrea starts writing the transcript streaming, she might discover that she wants it in a different format; so we just have to recognize that risk and be prepared to modify the code.
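
As an illustration of the kind of contract we’d have to agree on, here’s a minimal sketch of what a shared transcript-update format could look like (the field names here are hypothetical, just meant to make the idea concrete, not what’s actually in the code):

```python
# Hypothetical shared contract between the transcription and streaming modules.
# Field names are illustrative; the real interface is still up for discussion.
from dataclasses import dataclass

@dataclass
class TranscriptUpdate:
    meeting_id: str    # which meeting this text belongs to
    speaker_tag: str   # e.g. "Speaker 1" from the speaker ID module
    text: str          # the transcribed words
    timestamp: float   # seconds since the meeting started

def format_line(update: TranscriptUpdate) -> str:
    """Render one update as a line of the per-meeting text file."""
    return f"[{update.timestamp:8.2f}] {update.speaker_tag}: {update.text}"
```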

Our schedule hasn’t changed, other than the progress we’ve made through its tasks.

Ellen’s Status Report for March 13

This was a pretty productive week on my end. Over the weekend, I got Google speech-to-text working (made an account, got credentials, added it to the code, etc.) to great success! It just seems way more accurate than the other two options I had implemented originally. (This is based on the same little paragraph snippet Cambrea recorded on the respeaker for some initial testing.)
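
For the curious, the non-streaming version of the Google call is roughly this simple. This is just a minimal sketch, assuming credentials are supplied via the GOOGLE_APPLICATION_CREDENTIALS environment variable and 16 kHz mono 16-bit clips like the respeaker recordings:

```python
# Minimal sketch of Google Cloud Speech-to-Text on one short clip.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key
# and the clip is 16 kHz, mono, 16-bit PCM (LINEAR16).
from google.cloud import speech

def transcribe_clip(path: str) -> str:
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```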

Also over the weekend (if I’m recalling correctly) I coded up our first version of speaker identification (the no-ML, no-moving version). At that point it was gratifying to see simulated transcript results with both speaker tags and voice-to-text!
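
Roughly, the idea is: since speakers stay put, each direction-of-arrival angle that isn’t close to one we’ve already seen gets a brand-new speaker tag. Here’s a simplified sketch of that logic (the angle threshold and the exact DOA source are placeholders, not the real values from my code):

```python
# Simplified sketch of the no-ML speaker ID idea: speakers don't move, so any
# new direction-of-arrival angle (e.g. from the respeaker) that isn't close to
# a known speaker's angle becomes a new speaker tag.
ANGLE_THRESHOLD = 20.0  # degrees; placeholder value

class SimpleSpeakerId:
    def __init__(self):
        self.known_angles = []  # one entry per speaker tag discovered so far

    def tag_for_angle(self, doa_degrees: float) -> str:
        for i, angle in enumerate(self.known_angles):
            # compare on the circle so that 359 and 2 degrees count as close
            diff = abs((doa_degrees - angle + 180) % 360 - 180)
            if diff <= ANGLE_THRESHOLD:
                return f"Speaker {i + 1}"
        self.known_angles.append(doa_degrees)
        return f"Speaker {len(self.known_angles)}"
```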

And my final weekend task was preparing for the design presentation, which I delivered on Monday.

Speaking of design materials, I worked a lot on the design report document. Since I’m the group member who likes writing the most, I zipped through first drafts of a bunch of the sections which the others are going to proofread and modify for the final version. And in the trade-studies and system-description sections, I just wrote the technical bits that I was responsible for. It’s nice having this document pretty close to finished!

Finally, I started the meeting management module. This takes a new transcript update and actually updates the file corresponding to the correct meeting. I’ve finished most of it, except for the bits that interface with the database – I had to confer with the rest of the team about that.
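
In outline, the file-handling part amounts to something like the sketch below (the database hooks are left out since that interface is still being worked out with the team, and the names are illustrative):

```python
# Sketch of the meeting management idea: route each transcript update to the
# text file for its meeting. Database interaction is omitted because that
# interface is still being decided; names here are illustrative.
import os

class MeetingManager:
    def __init__(self, transcript_dir: str):
        self.transcript_dir = transcript_dir
        os.makedirs(transcript_dir, exist_ok=True)

    def _path_for(self, meeting_id: str) -> str:
        return os.path.join(self.transcript_dir, f"{meeting_id}.txt")

    def apply_update(self, meeting_id: str, line: str) -> None:
        """Append one formatted transcript line to the correct meeting's file."""
        with open(self._path_for(meeting_id), "a", encoding="utf-8") as f:
            f.write(line + "\n")
```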

In terms of the schedule, I’m partly on track and partly ahead. I’m on track for writing the meeting manager (self-assigned due date of Monday), but as for my schedule item after that, “transcript support for multi-mic meeting,” I’ve actually been building that into the transcription the entire time, so it looks like I’ll be able to start my actual next task earlier than planned.

Next week I’m scheduled to deliver the meeting management module. The on-website meeting setup flow, which is my next responsibility, will also be partially completed.

Ellen’s Status Report for March 6

This week I did a real mishmash of stuff for the project. I finished the design presentation slides and prepared my remarks for the design presentation, which I then ran through for the team to get feedback.

I finished coding the transcript generator sans speaker ID — this included a multithreaded section (so I had to learn about Python threads), audio preprocessing (downsampling to the correct rate and lowpass filtering), and going back through my previous code to make the data structure accesses thread-safe.
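
For context, the preprocessing step boils down to something like the sketch below, assuming scipy/numpy and 16-bit PCM input (the exact filter order, cutoff, and target rate in my code may differ):

```python
# Rough sketch of the audio preprocessing: lowpass-filter, then downsample the
# incoming audio to the rate the speech-to-text engines expect (16 kHz here).
# Assumes 16-bit PCM samples in a numpy array; exact parameters may differ.
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

TARGET_RATE = 16000

def preprocess(samples: np.ndarray, input_rate: int) -> np.ndarray:
    if input_rate == TARGET_RATE:
        return samples
    # lowpass below the new Nyquist frequency to limit aliasing
    b, a = butter(N=4, Wn=(TARGET_RATE / 2) / (input_rate / 2), btype="low")
    filtered = lfilter(b, a, samples.astype(np.float32))
    # rational-ratio resample down to the target rate
    downsampled = resample_poly(filtered, up=TARGET_RATE, down=input_rate)
    return downsampled.astype(np.int16)
```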

Since we received the respeaker mic in the mail, Cambrea recorded some audio clips on it and sent them to me so I could test the two speech-to-text models I had implemented in the transcript generator. The performance of DeepSpeech was okay – it made a lot of spelling errors, and sometimes it was easy to tell what the word actually ought to have been, sometimes not so easy. (If we decide to go with DS s2t, maybe a spelling-correction postprocessing system could help us achieve better results!) CMU PocketSphinx’s output was pretty much gibberish, unfortunately. While DS’s approach was to emulate the sounds it heard, PS tried to basically map every syllable to an English word, which didn’t work out in our favor. Since PS is basically ruled out, I’m going to try to add Google Cloud Speech-to-Text to the transcript generator. The setup is going to be a bit tricky because it’ll require setting up credentials.
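
If we did go that route, even a plain dictionary-based pass might be a starting point. Here’s a toy sketch using just the standard library (the vocabulary source is a placeholder; a serious version would need something more language-aware):

```python
# Toy sketch of the spelling-correction idea for DeepSpeech output: snap each
# out-of-vocabulary word to the closest known word. The vocabulary source is a
# placeholder, and a real system would want a smarter, context-aware approach.
import difflib

def correct_transcript(text: str, vocabulary: set) -> str:
    corrected = []
    for word in text.split():
        if word in vocabulary:
            corrected.append(word)
        else:
            matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```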

So far I haven’t fallen behind where I’m supposed to be, but what’s interesting is that some past tasks (like integrating speech-to-text models) aren’t actually in the rearview mirror but require ongoing development as we run certain tests. I kind of anticipated this, though, and I think I have enough slack time built into my personal task schedule to handle this looking backwards as well as working forwards.

This week my new task is an initial version of speaker ID. This one does not use ML, does not know the number or identity of speakers, and assumes speakers do not move. Later it’ll become the basis of the direction-of-arrival augmentation of the speaker ID ML. I’m also giving the design presentation this week and working more on the design report. And by the end of next week, Google s2t integration doesn’t have to be totally done but I can’t let the task sit still either; I’ll have made some progress on it by the next status report.

Ellen’s Status Report for Feb. 27

This week I worked on speech-to-text ML and on design materials. I created a speech-to-text module that implements two different s2t engines – we can choose which one to run, and once our mic arrives we can test both to find which works better. Unfortunately for me, there was a lot of installation work to be done for both engines. The code itself was less time-consuming to write than the installation and research required to enable it. The engines are Mozilla DeepSpeech and CMU PocketSphinx; both of them have “stream” constructs which allow predictions to be formed from multiple pieces of audio that are input sequentially. I paid a lot of attention to the interface I was creating with this code, since I was simultaneously working on the overall software design of the project.
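
Roughly, the interface I was going for looks like the sketch below (model paths and engine setup are simplified, and this reflects my reading of the two Python APIs rather than the exact code):

```python
# Sketch of the two-engine wrapper idea: pick an engine with a flag, feed audio
# chunks in sequentially, and read back the prediction at the end. Setup details
# are simplified; paths and defaults here are placeholders.
import numpy as np
import deepspeech
from pocketsphinx import Decoder

class SpeechToText:
    def __init__(self, engine: str, ds_model_path: str = "deepspeech.pbmm"):
        self.engine = engine
        if engine == "deepspeech":
            self.stream = deepspeech.Model(ds_model_path).createStream()
        else:  # "pocketsphinx"
            self.decoder = Decoder()  # default acoustic and language models
            self.decoder.start_utt()

    def feed(self, chunk: bytes) -> None:
        """Push one chunk of 16 kHz, 16-bit mono audio into the chosen engine."""
        if self.engine == "deepspeech":
            self.stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
        else:
            self.decoder.process_raw(chunk, False, False)

    def finish(self) -> str:
        """Close out the stream and return the final text prediction."""
        if self.engine == "deepspeech":
            return self.stream.finishStream()
        self.decoder.end_utt()
        hyp = self.decoder.hyp()
        return hyp.hypstr if hyp else ""
```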

In terms of design materials, I started enumerating the interfaces between the more high-level software modules. I also used extra time that I had after finishing the s2t code to draft the Introduction and Design Requirement sections of our design paper. I’ve volunteered to be the presenter for the design review, so I tried to identify the areas of the presentation we needed to flesh out, and I scripted what I wanted to say in the first half of the presentation.

I feel that I’m on schedule, or maybe slightly ahead. The s2t work didn’t take the full week, so I got ahead on the design materials. By next week, I’ll have finished the transcript-module code that envelops the s2t and speaker-identification subsections. Since our team will have finished our presentation outline and slides, I’ll also have started preparing to deliver the presentation and will have planned the second half of the presentation script.

Team Status Report for Feb. 20

This week our team worked on design and planning in order to prepare our project proposal. We researched our requirements and technology solutions, divided up the work, made presentation slides, and drew up a schedule in the form of a Gantt chart. There are a couple of risks that arise from this. First, there’s the risk that, not yet understanding how much work some aspects of the project might entail, we divided the work in an unbalanced way. Here, we just have to be flexible and prepared to change up the division of labor if such issues arise. Second, there’s the risk that our schedule is unrealistic and doesn’t match what will actually happen — but this is counteracted by the nature of the document as something that will keep changing over time.

Since we were creating our design this week, we can’t really say that it changed from before; but our ideas were solidified and backed up by the research we did. Some of the requirements we outline in our proposal are different from those in our abstract because of this research. For example, in our abstract we specified a mouth-to-ear latency of one second, but after researching voice-over-IP user-experience standards, we changed this value to 150 ms.

We’ve just finished drawing up our schedule. You can find it below. We’ll point out ways that it changes in subsequent weeks. 

[Gantt chart schedule]

Ellen’s Status Report for Feb. 20

This week my efforts were focused on research and on preparing slides for our project proposal. On the research side, I examined a bunch of the requirements we included in our abstract and went digging around the internet for papers and standards documents that could shed light on specific measurements of a good user experience. This was easier to do for some requirements than for others. Machine-learning papers usually focused more on what was possible to achieve with the technology than on what a user might desire from it. But in the end our list of requirements was solidified.

I went on a separate research quest to find viable ML speech-to-text and speaker-diarization solutions and the academic papers associated with them. Comparing solutions based on metrics reported in papers is an interesting problem; the datasets on which the performance measures are calculated are mostly all different, and there are different performance measures, too (for example, “forgiving” word error rate vs. “full” word error rate on some datasets)! My task was basically to search for solutions that did “well” — I might need to evaluate them myself later when we have our hardware.
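
For anyone unfamiliar, word error rate is basically word-level edit distance divided by the number of words in the reference transcript; here’s a quick sketch of the standard version (leaving aside whatever normalization the “forgiving” variants apply):

```python
# Quick sketch of the standard word error rate: (substitutions + deletions +
# insertions) / number of reference words, via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```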

Currently, I’d say that I’m on schedule in terms of progress. This comes from the fact that we just came up with our schedule this week! In this next week I’m working on getting an initial version of our speech-to-text up and running. In the end I want to have a module that’ll take in an audio file and output some text, running it through a different ML solution depending on a variable that’s set. Near the end of next week I will also start on the pre-processing for getting audio packets into the correct form to be passed into the speech-to-text module.