Demo Feedback and Speeding up the Perception Pipeline

After a successful demo 2 in which Ike and I showed off the merged perception and CSM functionalities, I focused on speeding up the perception pipeline in an effort to get closer to our real-time tracking goal of making motor movement decisions in under 60 ms. I tried everything from multithreading the frame reads and pushing them onto a thread-safe producer queue, to downsampling the image, to reducing the frequency at which object detection is run. I even looked into overclocking the Jetson Nano, only to find that the clocks already come fully maxed out in its default power mode when you use a 5V 4A power supply like the one I’ve been using.
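
For reference, the threaded frame-read experiment boiled down to something like the sketch below. This is a minimal version using OpenCV and Python's queue module; the camera index, queue size, and 300×300 downsample are placeholders rather than the exact values from our code.

```python
import queue
import threading

import cv2

# Small buffer so the consumer always works on a recent frame
frame_queue = queue.Queue(maxsize=2)

def frame_producer(camera_index=0):
    """Continuously read frames and push them onto the thread-safe queue."""
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frame_queue.full():
            try:
                frame_queue.get_nowait()  # drop the oldest frame instead of blocking
            except queue.Empty:
                pass
        frame_queue.put(frame)
    cap.release()

# Producer runs in the background; the main thread is the consumer
threading.Thread(target=frame_producer, daemon=True).start()

while True:
    frame = frame_queue.get()                # blocks until a frame is available
    small = cv2.resize(frame, (300, 300))    # downsample before detection
    # ... run object detection on `small` (possibly only every Nth frame) ...
```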

As a last resort, I looked at the benchmarks that NVIDIA reports for the ssd-mobilenet-v2 model we’re using for object detection: up to 39 FPS (roughly 25 ms per frame) on 300×300 images. I looked into how they achieve that and found that they use a proprietary SDK called TensorRT, which maximizes inference performance on the Nano’s GPU through optimizations of the inference engine like precision calibration and kernel auto-tuning. By switching to using only their provided tools (i.e. capturing and pre-processing images with their libraries instead of reading them myself and transforming them with OpenCV), I was able to get the whole perception pipeline, running on its own, up to an average of 13.75 FPS. That translates to roughly 72 ms per frame, which is only 12 ms away from our goal! In our design proposal, we cited research on input lag putting 60 ms as the upper bound before any lag becomes noticeable, and it was very satisfying to see for myself that when you’re this close to 60 ms, the lag is indeed barely noticeable and the detection appears to be happening in real time.
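
Concretely, the switch amounted to letting NVIDIA's jetson-inference library own the whole capture-and-detect loop. Below is a minimal sketch of that kind of pipeline, assuming the jetson-inference Python bindings; the camera URI and threshold are placeholders, and the exact module and call names vary a bit between library versions.

```python
import jetson.inference
import jetson.utils

# Load SSD-Mobilenet-v2; jetson-inference builds/loads a TensorRT engine for it
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)

# Capture frames with NVIDIA's utilities (CUDA memory) instead of OpenCV on the CPU
camera = jetson.utils.videoSource("csi://0")   # or e.g. "/dev/video0" for a USB camera

while True:
    img = camera.Capture()            # frame stays in GPU-mapped memory
    detections = net.Detect(img)      # TensorRT-accelerated inference
    for det in detections:
        print(det.ClassID, det.Confidence, det.Left, det.Top, det.Right, det.Bottom)
    print("{:.1f} FPS".format(net.GetNetworkFPS()))
```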

What this means, however, is that in a somewhat surprising turn of events, object detection now runs faster than object tracking. As a result, I am no longer using object tracking and am instead using object detection itself as the tracker. The drawback of this approach is that there can only be one target of a given class ID in the frame at any time: when the user selects a target, we save its class ID and look for the bounding box with that class ID in every frame (see the sketch below). This works fine for our use case of tracking a lecturer, since there is usually only one person in the frame, but if another person appeared, the tracker would arbitrarily switch between the two with every new frame. There are certainly ways around this (like saving the bounding box patch and comparing it against the detections in every future frame to find the most similar one), but with this being the last week and so many presentations and reports coming up, it is unlikely that I’ll get a chance to improve it. In any case, things are working really, really well and honestly really fast, so I’m very happy with how it all turned out <3.
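
For the curious, the detection-as-tracker logic reduces to something like the sketch below. It assumes the detection objects carry a ClassID, a Confidence, and a bounding box (as jetson-inference detections do); picking the highest-confidence match is just one way to break ties, and it is still subject to the switching problem described above when two objects share the saved class ID.

```python
def pick_target(detections, target_class_id):
    """Tracking by detection: return the bounding box of the saved class ID.

    If several detections share that class ID (e.g. two people in frame),
    this just picks the most confident one, which is why the "tracker" can
    jump between them from frame to frame.
    """
    candidates = [d for d in detections if d.ClassID == target_class_id]
    if not candidates:
        return None  # target not visible in this frame
    best = max(candidates, key=lambda d: d.Confidence)
    return (best.Left, best.Top, best.Right, best.Bottom)

# Usage each frame, after running detection:
#   box = pick_target(net.Detect(img), saved_class_id)
#   if box is not None:
#       hand the box (or its center) to the CSM / motor-decision code
```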

