Weekly Update #11 (4/28 – 5/4)

Team

This week was spent on final polishing and integration to prepare for the final demo. We ran into a pretty big issue with our UI/script integration, so we decided to move to the Flask framework to fix it. We also added quality-of-life features like displaying saved videos and images for the user to review before corrections, and options like redoing a move if they were unsatisfied with how they performed. We also tested timing differences between different AWS instances, as well as different flags for the various functions, to give the fastest corrections without sacrificing accuracy.

Kristina

After realizing that switching over to our Node.js framework was getting over-complicated, and that the way I was calling a test Python script wasn’t going to work with this implementation since the local file system couldn’t be accessed, we decided to shift our framework again. Luckily Brian had done more research and had time, so he took over moving the framework for the second time so that I could focus on making the UI design more presentable and polished. Before, I was focused on making sure all the elements were there (viewing a demonstration of the pose or move, check; web cam access, check; viewing the web cam feed mirrored, check; etc.), but now I had to focus on not making it an eyesore. I worked on making sure all pages could be navigated simply and intuitively, and focused on styling and making elements look nice. I also helped test the final product, specifically by editing the styling to make sure that everything displayed nicely on a different laptop size and still worked with the user flow, where the user has to step away from the laptop screen in order to perform their movement. It’s wild that the semester is over already and that demos are so soon!

Brian

We realized this week that we had created the UI for the different poses and were able to run the scripts separately and display their results in the UI, but we were not able to run them together. This was because our UI could not access files in our local file system, which was an issue since we needed to download the user images and videos and send them over to be processed on AWS. After some quick searching, I decided that a Flask framework would solve our issues easily. I therefore ported over our existing UI and defined all the functions necessary for our website to access and interact with local files.

I ensured that each page had its own access to the files it needed, and that all user files were disposed of after being used in order to prepare for the next batch. To make the website work with the way Flask calls functions, I had to make slight changes to the structure of the website, but I was able to integrate it in a way that wasn’t noticeable to the end user.
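
For anyone curious, here is a minimal sketch of the kind of Flask setup described above; the route, folder, and function names are illustrative placeholders rather than our actual code:

    # Minimal sketch of the Flask approach described above; the route and
    # folder names are illustrative, not our actual ones.
    import os
    from flask import Flask, request, redirect, url_for

    UPLOAD_DIR = "user_uploads"  # hypothetical local folder for user captures
    os.makedirs(UPLOAD_DIR, exist_ok=True)

    app = Flask(__name__)

    @app.route("/upload", methods=["POST"])
    def upload():
        # Save the user's webcam capture locally so the correction scripts
        # (and the AWS transfer step) can read it.
        f = request.files["capture"]
        f.save(os.path.join(UPLOAD_DIR, f.filename))
        return redirect(url_for("results"))

    @app.route("/results")
    def results():
        return "corrections would be rendered here"

    def cleanup_uploads():
        # Dispose of user files once corrections are done, ready for the next batch.
        for name in os.listdir(UPLOAD_DIR):
            os.remove(os.path.join(UPLOAD_DIR, name))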

Finally, I did a lot of the testing of the final integrated product and caught a few small errors that would have messed up the execution during the demo.

Umang

This week was the final countdown to demo day. Our goal was to get an end-to-end pipeline up and running, fully integrated with the UI. While others worked on the front end, I wanted to optimize the backend for an even smoother experience. Rather than taking 29 seconds for a video and 15 seconds for an image, I wanted to break sub-20 for a video and sub-10 for an image. The best place to shave off time was in the pose estimation, by increasing the speed of AlphaPose and decreasing the frame rate of the original video.

It turned out that the UI saved the video as a *.webm file, and AlphaPose did not accept this type. As such, I had to automatically call a conversion function (ffmpeg was the tool I picked) to convert it from *.webm to *.mp4. Unfortunately, this conversion actually expanded the video instead of compressing it, which led to even slower pose estimation by AlphaPose. By setting a reduced frame rate flag, I was able to subsample the video and then run the shorter video through the pose estimation network (with relaxed confidence, a lower number of people to detect, and an increased detection batch). With these changes, I got the video estimation down to 1 second for a 4-second-long *.webm file (plus the added time for the ffmpeg call doing the subsampling).
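
Roughly, the conversion and subsampling step looked like the sketch below; the frame rate and file names here are illustrative, not our final settings:

    # Sketch of the webm-to-mp4 conversion with frame subsampling; the frame
    # rate and file names are placeholders, not our exact settings.
    import subprocess

    def webm_to_subsampled_mp4(src="capture.webm", dst="capture.mp4", fps=10):
        # Setting -r on the output subsamples to the given frame rate, so
        # AlphaPose has fewer frames to estimate.
        subprocess.run(["ffmpeg", "-y", "-i", src, "-r", str(fps), dst], check=True)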

This updated video pipeline ran in ~11 seconds total (including the up-down to the instance and the longer ffmpeg call) and in ~8 seconds for an image. Unfortunately, the AWS instance we used for this was a P3 instance, which has an unsustainable cost of $12/hr. So we settled for a normal P2 instance at a much more reasonable $0.90/hr. This pipeline on the P2 ran a video through in ~15 seconds and an image through in ~9 seconds. Both of these times far surpass our original metrics. We look forward to the demo 🙂

Weekly Update #10 (4/21 – 4/27)

Team

Integration obviously means bugs, problems, and uncovered issues, as we were warned so many times at the beginning of the semester. This week, we continued the integration process by tackling the problems that arose. We realized that our UI needed a framework change in order to integrate with the way our back end was implemented and its inputs and dependencies, so we focused on fixing the front and back end so that they could be properly integrated. We also continued looking into other ways to get the speedup we required, continuing our investigation into AWS and looking into possibly using a different pose estimator to get accurate but more efficient results. Our final presentations are next week, so we also spent time working on and polishing our slides.

Kristina

After realizing my big mistake of not looking ahead to HOW the back end would connect to the UI when designing and starting work, I had to move our front end code to a different framework. Initially I was using basic HTML/JavaScript/CSS to create web pages, but I realized that calling our Python correction script and getting results back wouldn’t really work, so I decided to use Node.js to create a server-side application that could call the correction algorithm when the user triggers an event. I honestly just chose the first framework that seemed the simplest to migrate to, and it ended up not working out as well as I had hoped. I ran into a lot of problems getting my old code to work server-side and still need to fix a lot of issues next week. Since I’m also the one giving our presentation next week, I spent some time preparing for it and practicing, since I’m not great at presentations.

Brian

This week was a constant tug and pull between the UI side of the project and the backend. Some of the outputs that we thought were going to be necessary ended up needing to change to accommodate recent changes in our app structure. A lot of the week was spent getting things formatted the right way to pass between the separate parts of the project. In particular, I was having issues sending information to and from the AWS instance, but in the end I was able to solve it with a lot of googling. I also worked on refactoring the code to be more editable and understandable, as well as on the final presentation.

Umang

This week I helped with the final presentation. Then I decided to run a comparison between AlphaPose and OpenPose for pose estimation on the AWS instance. We have an ~8 second up-and-down time that is irreducible, but any other time is added by the slow pose estimation. As such, I wanted to explore whether OpenPose is faster for our use case on a GPU. OpenPose runs smoothly but has a lot of overhead if we want to retrofit it to optimize for our task. Running vanilla OpenPose led to a respectable estimation time for a 400-frame video (on one GPU), but it was still over our desired metric. Though the times were comparable at first, when we added flags to our AlphaPose command to reduce the detection batch size, set the number of people to look for, and reduce the overall joint estimate confidence, we were able to get blazing fast estimation from AlphaPose (~20 seconds + ~8 seconds for the up-down). This means we hit our metric of doing end-to-end video pose correction in under 30 seconds 🙂 Final touches to come next week!
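
For reference, the comparison itself was just a matter of timing each estimator run with its flags; something along these lines, where the command shown is a placeholder and not the exact AlphaPose or OpenPose invocation:

    # Rough sketch of timing an estimator run; the command list is a
    # placeholder, not the exact AlphaPose/OpenPose invocation we used.
    import subprocess
    import time

    def time_estimator(cmd):
        start = time.time()
        subprocess.run(cmd, check=True)
        return time.time() - start

    # e.g. elapsed = time_estimator(["python", "estimate.py", "--video", "user.mp4"])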

Weekly Update #9 (4/14 – 4/20)

Team

After succeeding in finishing most of the work for the video corrections aspect of the project, we decided that it was time to start integrating what we had done so far in order to get a better picture of how our project was going to look. Additionally, we realized that the amount of time our corrections were taking was way too much for a user to justify, so we wanted to find ways to speed up the pose processing. To do this, we focused on:

  1. Looking into AWS as a way to boost our processing speeds
  2. Merging the existing pipelines with the UI and making it look decent

It’s also our in-lab demo next week, so we had to spend some time polishing up what our demo would look like. Since we only started integration this week, we still have problems to work through, so our in-lab demo will most likely not be fully connected or fully functional.

Kristina

This week I spent more time polishing the UI and editing the implementation to reflect the recent changes made to our system. This involved being able to save a video or picture to files that could then be accessed by a script, being able to display the captured video or picture, being able to redo a movement, and ensuring that the visualization on the screen acted as a mirror facing the user. The latter is one of those details that seems small until it doesn’t work, and we didn’t even notice it until now. It’s so natural to look at a mirror; that’s how a dance class is conducted, and even more importantly, that’s how we’re used to seeing ourselves. Since our application is replacing a class in a dance studio, it was important that videos and pictures were mirrored correctly. Also, because we underestimated the time necessary for each task, we realized that a text-to-speech element for the corrections wasn’t strictly necessary, and we could replace it with a visualization, which would probably be more effective for the user since dance is a very visual art to learn.

Brian

Since I finished creating the frame matching algorithm and helped with the pipelining last week, I decided to fine-tune some of what we had done to make it work more smoothly. Since we were only printing corrections to the terminal last week, I wanted to find a way to visualize the corrections in a way that made it apparent what the user needed to fix. To do this, I created functions to graph the user poses and put them next to the instructor pose, with red circles highlighting the necessary correction. I also displayed the correction text in the same frame. I figured this would be an easy way to show all of the corrections in a form that would be easy to translate to the UI.
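
A rough sketch of that kind of side-by-side plot, assuming matplotlib and a simplified skeleton; the joint indices and layout are placeholders:

    # Sketch of the side-by-side correction visualization; the SKELETON edge
    # list and joint indices are simplified placeholders.
    import matplotlib.pyplot as plt

    SKELETON = [(0, 1), (1, 2), (2, 3)]  # pairs of joint indices to connect

    def plot_correction(user_pose, instructor_pose, bad_joint, text):
        # user_pose / instructor_pose: lists of (x, y) joint coordinates
        fig, axes = plt.subplots(1, 2, figsize=(8, 4))
        for ax, pose, title in zip(axes, [user_pose, instructor_pose], ["You", "Instructor"]):
            xs, ys = zip(*pose)
            ax.scatter(xs, ys)
            for i, j in SKELETON:
                ax.plot([pose[i][0], pose[j][0]], [pose[i][1], pose[j][1]])
            ax.invert_yaxis()  # image coordinates grow downward
            ax.set_title(title)
        # Red circle around the joint that needs fixing, plus the correction text.
        x, y = user_pose[bad_joint]
        axes[0].scatter([x], [y], s=300, facecolors="none", edgecolors="red")
        fig.suptitle(text)
        plt.show()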

Umang

This week was all about speedup. Unfortunately, pose estimation on CPUs is horridly slow. We needed to explore ways to get our estimation under our desired metric of running pose estimation on a video in under 30 seconds. As such, I decided to explore running our pose estimation on a GPU, where we would get the speedup we need to meet the metric. I worked on getting an initial AlphaPose implementation up on AWS. Just as when AlphaPose runs locally, I run it over a set of frames and give the resulting JSONs to Brian to visualize as a graph. I also refactored a portion of last week’s pipeline to make it easier to pull results from the JSON; the conflicting local file systems made this messy. I hope to compare pose estimation techniques (on GPUs) and continue to refactor code next week.
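
For context, pulling the joints back out of the results JSON looks roughly like the sketch below; the schema here (a list of detections, each with an "image_id" and a flat "keypoints" array of x, y, score triples) is my shorthand and may not match the actual output exactly:

    # Sketch of reading per-frame keypoints from a pose-estimation results JSON.
    # Assumes a COCO-style layout: a list of detections, each with an "image_id"
    # and a flat "keypoints" array of x, y, score triples. Adjust to the real output.
    import json

    def load_keypoints(json_path):
        with open(json_path) as f:
            detections = json.load(f)
        frames = {}
        for det in detections:
            kps = det["keypoints"]
            # Group the flat list into (x, y) pairs, dropping the scores.
            frames[det["image_id"]] = [(kps[i], kps[i + 1]) for i in range(0, len(kps), 3)]
        return frames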

Weekly Update #8 (4/7 – 4/13)

Team

This week we decided to have the video corrections finished by the end of the week before carnival. In order to do this, we focused on finishing up a couple things:

  1. Finishing and testing the frame matching algorithm to identify key frames within the user video
  2. Pipelining the data from when the video is taken until the corrections are made
  3. Creating UI pages to accommodate video capabilities

Kristina

With the data gathered from last week, I worked with Brian and Umang to identify the key points, or key frames, of each movement. For a lot of dance movements, especially in classical ballet, which all of our chosen movements are taken from, there are positions that dancers move through every time and that are important to performing the movement correctly. A dance teacher can easily identify and correct these positions, which must always be hit, when teaching a student. We take advantage of this aspect of dance in our frame matching algorithm to accommodate different video speeds, and in our correction algorithm to give the user feedback. This is why you’ll probably hear us talk about “key frames” a lot when talking about this project. I also spent some time this week updating the UI to allow for video capture from the web camera. Unfortunately (for the time I have to work on capstone; fortunately for me personally!), Carnival weekend also means build week, so I had a lot less time to work on capstone since I was always out on midway building/wiring my booth. I didn’t get as much of the UI implemented as I would have hoped, so I will focus on that a lot more next week.

Brian

This week I finished working on the frame matching algorithm. Last week I focused on finding the distance metrics that yielded the best and most intuitive results and decided on a simple L2 distance metric; this week I used this metric to actually match the frames. I started by converting the video to its angle domain and then scanning the video with the key frame, calculating the distance at each point. Then, simply by taking the minimum of this distance, I found the frame that best matched the key frame.

This method, however, has the issue that it may match a frame in any part of the video, since it does not take into account when the frame occurs. To correct this, I calculated the positions of the top k most similar frames and then went through them in temporal order to find the earliest good match. Given n key frames, I run this algorithm n times, each time only giving it the frames that it hasn’t already seen as candidates to match to the key frame.
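
A compact sketch of that matching procedure, assuming NumPy and angle-domain frames; the value of k is an illustrative choice:

    # Sketch of the frame matching described above: L2 distance in angle space,
    # then the earliest of the k closest frames after the previous match.
    import numpy as np

    def match_key_frames(video_angles, key_frame_angles, k=5):
        # video_angles: (num_frames, num_angles) array in temporal order;
        # key_frame_angles: list of (num_angles,) arrays, one per key frame.
        matches = []
        start = 0
        for key in key_frame_angles:
            dists = np.linalg.norm(video_angles[start:] - key, axis=1)
            top_k = np.argsort(dists)[:k] + start  # indices of the k closest frames
            match = int(np.min(top_k))             # earliest of those close matches
            matches.append(match)
            start = match + 1                      # only later frames for the next key frame
        return matches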

Manually testing this on the key frames that Kristina identified, we had an extremely high success rate in detecting the proper pose within a video.

Umang

This week was a short one due to Carnival. I worked on getting an end-to-end video pipeline up. Given an mp4 video, I was able to ingest it into a format that can be fed into AlphaPose locally and then send the resulting JSONs to be frame matched against the ground truth (which I also helped create this week). The ground truth was an amalgam of pose estimates from different ground truth videos that Kristina captured (I had to run this as a batch process before we started our pipeline so we would have access to the means and variances of the joints for a particular move). With the key frames identified (by Kristina), I was able to provide corrections (after calling Brian’s frame matching algorithm); however, this process takes upwards of three minutes to run locally on my machine. As such, I need to explore ways to speed up the entire pipeline to hit our time metric.
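
The ground-truth batch step boils down to per-joint statistics over the instructor videos; a sketch, where the array shapes are assumptions:

    # Sketch of the ground-truth batch step: per-angle means and variances at
    # each key frame, aggregated across the instructor videos. Shapes assumed.
    import numpy as np

    def ground_truth_stats(key_frame_angles_per_video):
        # key_frame_angles_per_video: (num_videos, num_key_frames, num_angles)
        stacked = np.asarray(key_frame_angles_per_video)
        means = stacked.mean(axis=0)      # (num_key_frames, num_angles)
        variances = stacked.var(axis=0)   # used to judge how strict to be per joint
        return means, variances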

Weekly Update #7 (3/31 – 4/6)

Team

After the midpoint demo this week, where we demoed running our correction algorithm on stills, we started work on corrections on videos. Before our design report and review, we had roughly designed our method to be able to match a user’s video to an instructor’s video from data collection regardless of the speed of the videos, but now we had to actually implement it. Like before, we spent some time adjusting the design of our original plan to account for problems encountered with the correction algorithm on stills before implementation. We will continue to work on frame matching and altering the correction algorithm to work with videos as well in the next week.

Kristina

I spent some time this week gathering more data, since initially I had only gotten data for the poses. I focused on taking videos of myself and a couple of other dancers in my dance company doing a port de bras and a plié, which are the two moves we’ve decided to implement, but I also gathered more data for our poses as well (fifth position arms, first arabesque tendu, and a passé, since I realized I’ve never written the specific terms on the blog). Also, the current UI is only set up for stills right now, so I spent a little bit of time redesigning it to work with videos as well. In the upcoming weeks, I hope to have a smoother version of the UI up and running.

Brian

I spent the first part of the week thinking of the best way to do corrections for videos. A couple of options came to mind, but most of them were infeasible due to the amount of time it takes to process pose estimations. In the end, we decided to correct videos by extending our image corrections to “key frames”. Key frames are the poses within a video that we deem to be the defining poses necessary for the proper completion of the move. For example, for a push-up to be “proper”, the user must have proper form at both the top and the bottom. By isolating these frames and comparing them to the instructor’s top and bottom poses, we can correct the video.

In order to do this, we need to be able to match the instructor’s key frames to those of the user with a frame matching algorithm. I decided it would be best to implement this matching by looking at the distance between the frame we want to match and the corresponding user pose. This week Kristina and I experimented with a bunch of different distance metrics such as L1, L2, cosine, and max, and manually determined that the L2 distance yielded values that best aligned with how similar Kristina judged two poses to be.
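
For reference, the candidates we compared on the angle feature vectors were the usual ones; a quick sketch:

    # The candidate distance metrics, computed on angle feature vectors.
    import numpy as np

    def l1(a, b):
        return np.sum(np.abs(a - b))

    def l2(a, b):
        return np.linalg.norm(a - b)

    def cosine_distance(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def max_distance(a, b):
        return np.max(np.abs(a - b))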

I will be using this metric to finalize the matching algorithm next week.

Umang

After a wonderful week in London, I spent this week working on video pipelining. In particular, I ironed out a script that allows us to run estimation on images end to end locally, and then started a pipeline to get pose estimates from a video (*.mp4 file), which will enable Brian’s frame matching algorithm to run. Working with him to devise a scheme to identify key frames made it a smooth process to run pose estimation locally; the problem was that certain frames were estimated incorrectly (due to glitches in the estimation API) and needed to be dropped. A more pressing issue is that pose estimation is really hard to run locally since it is so computationally expensive. I hope to complete the video pipeline and think about ways to speed this process up next week.
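
Dropping the glitched frames is a simple filter; something like the sketch below, where the score field and threshold are placeholders for whatever the estimator actually reports:

    # Sketch of dropping glitched frames; the "score" field and threshold are
    # placeholders, not the estimator's real output fields.
    def drop_bad_frames(frames, min_score=0.5):
        # frames: list of dicts with a per-frame confidence "score" and "joints"
        return [f for f in frames if f.get("score", 0.0) >= min_score and f.get("joints")]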

Weekly Update #3 (2/24 – 3/2)

Team

This week was focused on completing the design and thinking through many of its important aspects. We didn’t make any sweeping changes that affect the course of the project, and we are still on track. We just need to start implementing some of the planned ideas this week.

Kristina

This week, I worked with Brian and Umang on refining our design presentation as well as our design report. I didn’t work a ton on the actual implementation of the project, but I helped with many design decisions as we finalized our design report. This upcoming week, I hope to finish gathering the expert data needed for the project.

Brian

Most of this week was spent on the design specifications for our project. We still had a couple of implementation details to think through, so that was the main concern. A big realization was that we need to transform the data into angle space rather than look at the points themselves; this lets us easily account for the scaling and translation of the person within the image. Next week I would like to have an implementation running for a still image so we can move on to movement afterwards.
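
A small sketch of what that point-to-angle transform looks like, where the parent/joint/child triple is whichever pair of limbs meets at the joint:

    # Sketch of the angle-space transform: the angle at a joint comes from the
    # two limbs meeting there, which removes scale and translation.
    import numpy as np

    def joint_angle(parent, joint, child):
        # parent, joint, child: (x, y) keypoints; returns the angle at `joint` in degrees.
        u = np.asarray(parent, dtype=float) - np.asarray(joint, dtype=float)
        v = np.asarray(child, dtype=float) - np.asarray(joint, dtype=float)
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))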

Umang

This week I had to travel for grad school visit days; nonetheless, I contributed to the design document. Moreover, during a collaboration session with Brian, we realized that we need to map the limbs from Euclidean space to the angular domain. As such, our feature vector would be the six angles at a person’s joints. Using a rule-based system, we can map each angle to a movement (given two angle feature vectors from adjacent frames) and then prompt the user with a directional correction phrase (via Mozilla TTS). Next week, I hope to have built the angle feature vectors.
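
A toy sketch of the rule-based mapping; the threshold, direction logic, and wording are placeholders, since the real rules depend on the joint and the move:

    # Illustrative rule-based mapping from an angle difference to a directional
    # correction phrase; threshold, direction logic, and wording are placeholders.
    def correction_phrase(joint_name, user_angle, instructor_angle, tol=10.0):
        diff = user_angle - instructor_angle
        if abs(diff) <= tol:
            return None  # close enough, no correction needed
        direction = "close" if diff > 0 else "open"
        return f"{direction} your {joint_name} by about {abs(diff):.0f} degrees"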

Introduction and Project Summary

Hello, we are Team KUB, and our members are Umang Bhatt, Kristina Banh, and Brian Davis. Our goal is to build an application that is able to teach dancing on the go! We realize that traditional methods of learning are expensive and often inefficient. In order to work around these issues, we are developing a platform that is able to capture your movements and offer corrections after they are completed. We eventually hope to provide a useful alternative for people who may not have the means to learn dancing the traditional way.