Team D4: KUB – Carnegie Mellon ECE Capstone, Spring 2019 Umang Bhatt, Kristina Banh, Brian Davis

May 4, 2019 (updated May 6, 2019)

Weekly Update #11 (4/28 – 5/4)

Team

This week was spent on final polishing and integration to prepare for the final demo. We ran into a pretty big issue with our UI/script integration, so we decided to move to the Flask framework to fix it. We also added quality-of-life features, like displaying saved videos and images for the user to review before corrections, and options like redoing a move if they were unsatisfied with how they performed. We also tested timing differences between running on different AWS instances, as well as different flags for the various functions, to give the fastest corrections without sacrificing accuracy.

Kristina

After realizing that our Node.js framework was getting over-complicated to switch over to, and that the way I was calling a test Python script wasn’t going to work with this implementation since the local file system couldn’t be accessed, we decided to shift our framework again. Luckily Brian did more research and had time, so he took over moving the framework for the second time so that I could focus on making the UI design more presentable and polished. Before, I was focusing on getting all the elements in place (viewing a demonstration of the pose or move, check; webcam access, check; viewing the webcam feed mirrored, check; etc.), but now I had to focus on not making it an eyesore. I worked on making sure all pages could be navigated to simply and intuitively, and focused on styling and making elements look nice. I also helped with testing the final product, specifically editing the styling to make sure that everything displayed nicely on a different laptop size and still worked with the user flow, where the user has to step away from the laptop screen in order to perform their movement. It’s wild that the semester is over already and that demos are so soon!

Brian

We realized this week that we had created the UI for the different poses and were able to run the scripts separately and display them in the UI, but were not able to run them together. This was because our UI could not access files in our local file system. Since we needed to download the user images and videos and send them over to be processed on AWS, this was an issue. After doing some quick searching, I decided that a Flask framework would solve our issues easily. Therefore, I ported over our existing UI and defined all the functions necessary to get our website to access and interact with local files.

I ensured that each page had separate accesses, and that all user files were disposed of after being used in order to prepare for the next batch. In order to make the website work with the way Flask calls functions, I had to make slight changes to the structure of the website, but was able to integrate it in a way that wasn’t noticeable to the end user.
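
The "dispose of user files after use" step can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not our actual Flask code: the names `process_upload` and `run_correction`, and the `kub_upload_` prefix, are hypothetical stand-ins for whatever a route handler would call.

```python
import shutil
import tempfile
from pathlib import Path


def process_upload(data, filename, run_correction):
    """Save one upload to a private scratch directory, process it, then delete it.

    `run_correction` is a hypothetical callable that takes the saved file's
    path and returns the result to show the user.
    """
    workdir = Path(tempfile.mkdtemp(prefix="kub_upload_"))
    try:
        # Keep only the base name so a crafted filename cannot escape workdir.
        target = workdir / Path(filename).name
        target.write_bytes(data)
        return run_correction(target)
    finally:
        # Runs even if run_correction raises, so nothing leaks into the next batch.
        shutil.rmtree(workdir, ignore_errors=True)
```

Giving each request its own scratch directory is one way to keep pages from stepping on each other's files; a shared upload folder that gets emptied after each run would be another.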

Finally I did a lot of the testing of the final integrated product, and caught a few small errors that would have messed up the execution during the demo.

Umang

This week was the final countdown to demo day. Our goal was to get an end-to-end pipeline up and running, while fully integrating it with the UI. While others worked on the front end, I wanted to optimize the backend for an even smoother experience. Rather than taking 29 seconds for a video and 15 seconds for an image, I wanted to break sub-20 for a video and sub-10 for an image. The best place to shave off time was in the pose estimation, by increasing the speed of AlphaPose and decreasing the frame rate of the original video.

It turned out that the UI saved the video as a *.webm file and AlphaPose did not take in this type. As such, I had to automatically call a conversion tool (ffmpeg was the one I picked) to convert it from *.webm to *.mp4. Unfortunately, this conversion actually expanded the video instead of compressing it, which led to even slower pose estimation by AlphaPose. By setting a reduced frame rate flag, I was able to subsample the video and then run the shorter video through the pose estimation network (with the relaxed confidence, a lower number of people, and an increased detection batch). With these changes, I got the video estimation down to 1 second for a 4 second long *.webm file (with added time for the ffmpeg call with the subsampling).
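
For reference, here is a rough sketch of what the conversion call could look like from Python. The exact command we ran is not reproduced here: the 10 fps value and the function names are assumptions for illustration. `-r` (as an output option) and `-y` are standard ffmpeg options.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: str, fps: int = 10) -> List[str]:
    """Build an ffmpeg command that converts src to .mp4 at a reduced frame rate.

    The default fps is an assumed example value, not the one from our pipeline.
    """
    dst = str(Path(src).with_suffix(".mp4"))
    # -y overwrites an existing output file; -r placed before the output path
    # sets the output frame rate, so ffmpeg drops frames to subsample the video.
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), dst]


def convert(src: str, fps: int = 10) -> str:
    """Run the conversion (requires ffmpeg on PATH) and return the output path."""
    cmd = build_ffmpeg_command(src, fps)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Calling `convert("clip.webm")` would write `clip.mp4` next to the input. Building the command as a list, rather than a shell string, avoids quoting problems with odd filenames.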

This updated video pipeline ran in ~11 seconds total (including up-down to the instance and the longer ffmpeg) and ran in ~8 seconds for an image. Unfortunately, the AWS instance we used for this was a P3 instance (which had an unsustainable cost of $12/hr). So we settled for the normal P2 instance (which had a cheap cost of $0.90/hr). This pipeline on the P2 ran a video through in ~15 seconds and an image through in ~9 seconds. Both of these times far surpassed our original metrics. We look forward to the demo 🙂

April 27, 2019 (updated May 6, 2019)

Weekly Update #10 (4/21 – 4/27)

Team

Integration obviously means bugs, problems, and uncovering issues, as we were warned so many times at the beginning of the semester. This week, we continued the integration process by tackling the problems that arose. We realized that our UI needed a framework change in order to integrate with the way our back end was implemented and with its inputs and dependencies, so we focused on fixing the front and back end so that they could be properly integrated. We also continued looking into other ways to get the speedup we required, so we continued our investigation into AWS and looked into possibly using a different pose estimator in order to get accurate but more efficient results. Our final presentations are next week, so we also spent time working on and polishing our slides.

Kristina

After realizing my big mistake of not looking ahead to HOW the back end would connect to the UI when designing and starting work, I had to move our front end code to a different framework. Initially I was using basic HTML/JavaScript/CSS to create web pages, but I realized that calling our Python correction script and getting results back wouldn’t really work, so I decided to use Node.js to create a server-side application that could call the correction algorithm when a user-initiated event happens. I honestly just chose the first framework that seemed the simplest to migrate to, and this ended up not working out as well as I had hoped. I ran into a lot of problems getting my old code to work server-side, and still need to fix a lot of issues next week. Since I’m also the one giving our presentation next week, I spent some time preparing for it and practicing, since I’m not great at presentations.

Brian

This week was a constant tug and pull between the UI side of the project and the backend. Some of the outputs that we thought were going to be necessary ended up needing to be changed to accommodate some of the recent changes in our app structure. A lot of the week was trying to get things formatted in the right way to pass between our separate parts of the project. In particular I was having issues with sending information to and from the AWS instance, but in the end was able to solve it with a lot of googling. I also worked on refactoring the code to be more editable and understandable, as well as on the final presentation.

Umang

This week I helped with the final presentation. Then, I decided to run a comparison between AlphaPose and OpenPose for running pose estimation on the AWS instance. We have a ~8 second up and down time that is irreducible, but any other time is added due to the slow pose estimation. As such, I wanted to explore if OpenPose is faster for our use case on a GPU. OpenPose runs smoothly but has a lot of overhead if we want to retrofit it to optimize for our task. Running vanilla OpenPose led to a respectable estimation time for a 400 frame video (on one GPU), but was still over our desired metric. Though the times were comparable at first, when we added flags to our AlphaPose command to reduce the detection batch size, set the number of people to look for, and reduce the overall joint estimate confidence, we were able to get blazing fast estimation from AlphaPose (~20 seconds + ~8 seconds for the up down). This means we hit our metric of doing end-to-end video pose correction in under 30 seconds 🙂 Final touches to come next week!

April 20, 2019 (updated May 6, 2019)

Weekly Update #9 (4/14 – 4/20)

Team

After succeeding in finishing most of the work for the video corrections aspect of the project, we decided that it was time to start integrating what we had done so far in order to get a better picture of how our project was going to look. Additionally, we realized that our corrections were taking far longer than a user would be willing to wait. Therefore we wanted to find ways to speed up the pose processing. In order to do this, we focused on:

Looking into AWS as a way to boost our processing speeds
Merging the existing pipelines with the UI and making it look decent

It’s also our in-lab demo next week, so we had to spend some time polishing up what our demo would look like. Since we only started integration this week, we still have problems to work through, so our in-lab demo will most likely not be fully connected or fully functional.

Kristina

This week I spent more time polishing up the UI and editing the implementation to show the recent changes made to our system. This involved being able to save a video or picture to files that could then be accessed by a script, being able to display the taken video or picture, being able to redo a movement, and ensuring that the visualization on the screen acted as a mirror facing the user. That last part is an important aspect that seems so small until it doesn’t work, and we didn’t even realize it until now. It’s so natural to look at a mirror; that’s how a dance class is conducted, and even more importantly, that’s how we’re used to seeing ourselves. Since our application is replacing dance class in a dance studio, it was important that the mirrored aspect of video and pictures worked. Also, because we underestimated the time necessary for each task, we realized that adding a text-to-speech element for the correction wasn’t the most necessary and we could replace it with a visualization, which would probably be more effective for the user since dance is a very visual art to learn.

Brian

Since I finished up on creating the frame matching algorithm, as well as helping with the pipelining last week, I decided to fine tune some of the things that we did to make it work more smoothly. Since we were only printing corrections on the terminal last week, I wanted to find a way to visualize the corrections in a way that made it apparent what the user needed to fix. In order to do this, I created functions to graph the user poses, and put them next to the instructor pose, with red circles highlighting the necessary correction. I also displayed the correction text in the same frame. I figured this would be an easy method to show all of the corrections in a way that would be easy to translate to the UI.

Umang

This week was all about speedup. Unfortunately, pose estimation using CPUs is horridly slow. We needed to explore ways to get our estimation under our desired metric of doing the pose estimation on a video in under 30 seconds. As such, I decided to explore running our pose estimation on a GPU, where we would get the speedup we need to meet the metric. I worked on getting an initial pose estimation implementation of AlphaPose up on AWS. Similar to when AlphaPose runs locally, I run AlphaPose over a set of frames and give the resulting JSON files to Brian to visualize as a graph. I also refactored a portion of the pipeline from last week to make it easier to pull the results from the JSON. The conflicting local file systems made this messy. I hope to compare pose estimation techniques (on GPUs) and continue to refactor code this next week.

April 13, 2019 (updated May 6, 2019)

Weekly Update #8 (4/7 – 4/13)

Team

This week we decided to have the video corrections finished by the end of the week before carnival. In order to do this, we focused on finishing up a couple things:

Finishing and testing the frame matching algorithm to identify key frames within the user video
Pipelining the data from when the video is taken until the corrections are made
Creating UI pages to accommodate video capabilities

Kristina

With the data gathered from last week, I worked with Brian and Umang to identify the key points, or key frames, of each movement. For a lot of dance movements, especially in classical ballet which all of our chosen movements are taken from, there are positions that dancers move through every time which are important to perform the movement correctly. A dance teacher can easily identify and correct these positions that must always be hit when teaching a student. We take advantage of this aspect of dance in our frame matching algorithm in order to accommodate different speeds of videos and in our correction algorithm in order to give the user feedback. This is why you’ll probably hear us talk about “key frames” a lot when talking about this project. I also spent some time this week updating the UI to allow for video capture from the web camera. Unfortunately (for the time I have to work on capstone, fortunately for me personally!), Carnival weekend also means build week, so I had a lot less time this week to work on capstone since I was always out on midway building/wiring my booth. I didn’t get as much of the UI implemented as I would have hoped, so I will be focusing on that a lot more next week.

Brian

This week I finished working on the frame matching algorithm. Last week I focused on finding the distance metrics that yielded the best and most intuitive results, and decided on a simple L2 distance metric, so this week I used this metric to actually match the frames. I started by converting the video to its angle domain, and then scanning the video with the key frame, calculating the distance at each point. Then, simply by taking the minimum of this distance, I found the frame that best matched the key frame.
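
As a rough illustration of the idea (not the project's actual code), the sketch below turns keypoints into joint angles and picks the frame with the smallest L2 distance to a key frame. The function names and the choice to represent each frame as a plain list of angles are assumptions for the example.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b (in radians) formed by the segments b-a and b-c.

    Each point is an (x, y) pair, e.g. shoulder, elbow, wrist.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two keypoints coincide
    # Clamp to [-1, 1] to guard against floating point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def l2_distance(u, v):
    """Euclidean distance between two equal-length lists of angles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def best_match(video_angles, key_angles):
    """Index of the video frame whose angles are closest to the key frame."""
    distances = [l2_distance(frame, key_angles) for frame in video_angles]
    return min(range(len(distances)), key=distances.__getitem__)
```

Working with angles rather than raw pixel coordinates means the comparison does not depend on where the dancer stands in the frame or how large they appear.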

This method, however, has the issue that it may detect a frame in any part of the video, and does not take into account where in the video the frame occurs. In order to correct this, I calculated the positions of the top k most similar frames, and then went through them in temporal order to find the best earliest match. Given n key frames, I would run this algorithm n times, each time only giving the frames that the algorithm hadn’t seen yet as candidates to match to the key frame.
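
One way to read the ordered matching described above is sketched here. It is an interpretation, not the original implementation: in particular, "best earliest match" is taken to mean the earliest frame among the top k candidates, and `match_key_frames` and the default `k=3` are made up for the example.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length lists of angles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def match_key_frames(video_angles, key_frames, k=3):
    """Match each key frame, in order, to a frame later than the previous match.

    Returns one frame index per matched key frame, strictly increasing.
    """
    matches = []
    start = 0  # first frame the algorithm has not "seen" yet
    for key in key_frames:
        remaining = range(start, len(video_angles))
        if not remaining:
            break  # ran out of video before matching every key frame
        # Rank the unseen frames by similarity to this key frame.
        ranked = sorted(remaining, key=lambda i: l2_distance(video_angles[i], key))
        # Assumption: among the k most similar frames, take the earliest one.
        chosen = min(ranked[:k])
        matches.append(chosen)
        start = chosen + 1
    return matches
```

In this sketch k is a tradeoff: a larger k leans harder toward early frames but can let in a weaker match, while k=1 reduces to the plain minimum-distance search over the unseen frames.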

Manually testing this on the keypoints that Kristina identified, we had an extremely high success rate in detecting the proper pose within a video.

Umang

This week was a short one due to Carnival. I worked on getting an end-to-end video pipeline up. Given an mp4 video, I was able to ingest it into a format that can be fed into AlphaPose locally and then send the resulting JSON files to be frame matched with the ground truth (which I also helped create this week). The ground truth was the amalgam of pose estimates from different ground truth videos that Kristina captured (we had to run this as a batch process before we started our pipeline so we would have access to the means and variances of the joints for a particular move). With the key frames identified (by Kristina), I was able to provide corrections (after calling Brian’s frame matching algorithm); however, this process takes upwards of three minutes to run locally on my machine. As such, I need to explore ways to speed up the entire pipeline to optimize for our time metric.

April 6, 2019 (updated May 6, 2019)

Weekly Update #7 (3/31 – 4/6)

Team

After the midpoint demo this week, where we demoed running our correction algorithm on stills, we started work on corrections on videos. Before our design report and review, we had roughly designed our method to be able to match a user’s video to an instructor’s video from data collection regardless of the speed of the videos, but now we had to actually implement it. Like before, we spent some time adjusting the design of our original plan to account for problems encountered with the correction algorithm on stills before implementation. We will continue to work on frame matching and altering the correction algorithm to work with videos as well in the next week.

Team

With the upcoming design presentation, we knew we had to make some important decisions. We’ve decided to use PoseNet and create a web application, which are two major changes from our original proposal. This is because we discovered that our original design, which was using OpenPose in a mobile application, would run very slowly. However, this change will not affect the overall schedule/timeline, as it is more of a lateral movement than a setback. Our decision to abandon the mobile platform could jeopardize our project; to adjust, we decided to offload processing to a GPU, which will make our project faster than it would have been on mobile.

Kristina

This week, I worked with Brian and Umang to test the limits of PoseNet so we could decide which joint detection model to use. I also started creating the base of our web application (just a simple Hello World application for now to build off of). I haven’t done any web development in a while, so creating the incredibly basic application was also a good way to review my rusty skills. Part of this was also trying to integrate PoseNet into the application, but I ran into installation issues (again…like last week. Isn’t setup like the worst part of any project?) so I ended up just spending a lot of time trying to get TensorFlow.js and PoseNet on my computer. Also, since this upcoming week is going to be a bit busier for me, I made a really simple, first-draft sketch of a UI design to start from. For this next week, my goals are to refine the design, create a simple application we can use to start gathering the “expert” data we need, and to start collecting the data.

Simple first draft of the UI design – very artistic right?!! I’m an aspiring stick figure artist.

Brian

This week I attempted to find the mobile version of OpenPose and have it run on an iPhone. Similarly to last week, I ran into some issues during installation, and decided that since we already had a web version running, it was better to solidify our plan to create a web app and trash the mobile idea.

Afterwards, I decided to get a better feel for the joint detection platform, and play around with tuning some of the parameters to see which ones yielded the best accuracy. This was mainly done by manual observation of the real-time detection as I tracked what I assumed were dancelike movements. I also took a look at the raw output of the algorithm, and started thinking about the frame matching algorithm that we would like to use to account for the difference in speed between the user and training data. I also worked on creating the design documents. For the next week, I would like to work more with the data, and see if I can get something that can detect the difference between joints in given frames.

Umang

This week I worked with Brian to explore the platform options for our application. We found that mobile versions would be all too slow (2-3 fps without any speed-up to the processing) for our use case. We then committed to making a web app instead. For the web version, we used a lite version of Google’s pretrained PoseNet (for real-time estimation) to explore latency and estimation methods. With simple dance moves, I am able to get the estimate of twelve joints; however, when twirls, squats, or other scale/orientation variants are introduced, this light PoseNet variant loses estimates. As such, this coming week, I want to explore running the full PoseNet model on a prerecorded video. If we can do the pose estimation post hoc, then I can send the recorded video to an AWS instance with a GPU for quicker processing with the entire model and then send down the pose estimates.

I still need to work on the interpolation required to frame match the user’s video (or frame) with our collected ground truth. To evade this problem, we are going to work with stills of Kristina to generate a distribution over the ground truth. We can then query this distribution at inference time to see how far the user’s joint deviates from the mean. I hope to have the theory and some preliminary results of this distributional aggregation within the next two weeks.
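
To make the distribution idea concrete, here is a minimal sketch under some assumptions: each still is reduced to one number per joint (an angle, say), the ground truth is summarized by a per-joint mean and standard deviation, and a joint is flagged when it sits more than a chosen number of standard deviations from the mean. The function names and the threshold of 2.0 are illustrative, not results from the project.

```python
import statistics


def fit_joint_distribution(samples):
    """Per-joint mean and standard deviation over the ground truth stills.

    `samples` is a list of stills, each a list with one value per joint.
    """
    per_joint = list(zip(*samples))  # regroup values by joint
    means = [statistics.mean(values) for values in per_joint]
    stdevs = [statistics.pstdev(values) for values in per_joint]
    return means, stdevs


def deviations(user_values, means, stdevs, eps=1e-6):
    """How many standard deviations each user joint is from the mean."""
    # eps avoids dividing by zero when a joint never varied in the samples.
    return [(u - m) / max(s, eps) for u, m, s in zip(user_values, means, stdevs)]


def joints_to_correct(user_values, means, stdevs, threshold=2.0):
    """Indices of joints that deviate by more than `threshold` deviations."""
    scores = deviations(user_values, means, stdevs)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

For example, with three stills whose first joint measures 90, 100, and 110, a user value of 130 for that joint would be flagged, while 101 would not.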