Finalizing InFrame’s Perception Stack
This week, I worked mainly on finalizing the CV design and on researching benchmarks for the Jetson Nano’s performance on popular deep learning models.
One of the biggest discussions we’ve had so far is about how much compute our system will need. A Raspberry Pi 3 is a very powerful SBC, but I found that a Jetson Nano outperforms it by roughly 30x across the board on deep learning inference. Since object tracking is a core feature of InFrame, we don’t want to limit our real-time tracking capabilities, so we will proceed with the Jetson Nano as our core compute platform.
In terms of achieving real-time object tracking, research on consumer electronics such as TVs and gaming consoles has found that input lag only starts to become noticeable to humans at around 60 ms. So, for the system to appear to track in real time, our CV decisions need to happen within 60 ms per frame (roughly 16 FPS).

To get there, we will use the lightest object detection model that still covers our use cases (following a lecturer and following popular sports objects). One such model is SSD MobileNet-V2, developed by Google AI as a next-generation on-device object detection network; it runs at approximately 39 FPS on a Jetson Nano with 300×300 images. We believe MobileNet-V2 could work well for our system because it was trained on the COCO dataset, which contains classes that cover our use cases (people, bicycles, frisbees, skateboards, balls, etc.) and achieves 88% accuracy on those classes.

If this network doesn’t perform as well as we’d like, there are larger versions of the model that operate on bigger images, or we could move to a state-of-the-art network like YOLOv3. The reason I believe we can get away with a heavier network is that we only need to run detection on a single frame. The user will take that frame and the detection results, choose one of the returned bounding boxes, and that box becomes the input to a tracking algorithm like Lucas-Kanade (although we are also considering Deep-LK, a deep learning optimization of Lucas-Kanade that deals better with occlusions and larger gaps between frames). A rough sketch of the single-frame detection step is below.
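To get a feel for what that one detection pass looks like, here is a minimal sketch using OpenCV’s DNN module with a TensorFlow SSD MobileNet-V2 COCO export. The model file names are placeholders, and on the Nano we would more likely run the network through TensorRT, but the flow is the same: one forward pass on one frame, then filter the returned boxes by confidence.

```python
import cv2
import numpy as np

# Placeholder paths for a frozen SSD MobileNet-V2 COCO export and the
# matching OpenCV graph description; swap in the real files.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v2_coco.pbtxt")

def detect(frame, conf_threshold=0.5):
    """Run one SSD MobileNet-V2 pass and return (class_id, conf, box) tuples."""
    h, w = frame.shape[:2]
    # The network expects 300x300 inputs, matching the benchmark above.
    blob = cv2.dnn.blobFromImage(frame, size=(300, 300), swapRB=True)
    net.setInput(blob)
    out = net.forward()  # shape (1, 1, N, 7): [_, class_id, conf, x1, y1, x2, y2]
    boxes = []
    for det in out[0, 0]:
        conf = float(det[2])
        if conf < conf_threshold:
            continue
        class_id = int(det[1])
        x1, y1, x2, y2 = (det[3:7] * np.array([w, h, w, h])).astype(int)
        boxes.append((class_id, conf, (x1, y1, x2 - x1, y2 - y1)))
    return boxes
```

The user’s chosen box from this list is what would get handed off to the tracker.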
One thing I still need to think through is whether Lucas-Kanade can track an object that changes shape well enough (e.g. a diver curling up into a ball mid-dive). Traditional Lucas-Kanade doesn’t handle lighting changes or large displacements between frames. The latter isn’t a problem at the camera’s high frame rate, but lighting changes and objects changing shape could be. I’ll be looking into OpenCV’s implementation of Lucas-Kanade to see how well it performs, and I may also consider implementing it from scratch with deep learning optimizations. A quick sketch of what that OpenCV experiment might look like follows.
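Here is a minimal sketch of that experiment using OpenCV’s pyramidal Lucas-Kanade implementation: goodFeaturesToTrack seeds corner points inside the chosen box, and calcOpticalFlowPyrLK follows them frame to frame. The box coordinates and parameter values are placeholder assumptions; the point is just to see how the stock implementation holds up under shape and lighting changes before writing anything custom.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Placeholder: in the real pipeline this box comes from the user's
# selection among the SSD MobileNet-V2 detections.
x, y, w, h = 200, 150, 120, 160
mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255

# Seed Lucas-Kanade with corner features inside the selected box.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                              qualityLevel=0.01, minDistance=5, mask=mask)

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           30, 0.01))

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                  pts, None, **lk_params)
    good = new_pts[status.flatten() == 1]
    if len(good) == 0:
        break  # track lost, e.g. under a sudden lighting change
    # Use the tracked points' bounding box as the new object location.
    bx, by, bw, bh = cv2.boundingRect(good.astype(np.float32))
    prev_gray, pts = gray, good.reshape(-1, 1, 2)
```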