This week, my team and I focused on finalizing our design review presentation and beginning work on the design review report. I personally contributed to these goals by providing feedback while our team prepped for the presentation, and by planning, outlining, and drafting the sections of the report covering our system architecture and the computer vision component. We divided the report into sections to draft individually, which we plan to integrate into one cohesive draft tomorrow.
I also spent some time experimenting with the Jetson-Inference library for object detection inferencing. Running the real-time demo included with the library, with a pre-trained Single Shot Detector (SSD) model trained on 91 object classes, the NVIDIA Jetson AGX consistently performed inference at around 90-140 frames per second, even while running on its lowest power setting (“15W Desktop”). That is to say, the stock object detection configuration for this library already meets our latency targets. We achieved similar performance with both the Python and C++ demos, demonstrating that the library is sufficiently optimized in both languages and alleviating concerns we had at the beginning of the project about potential bottlenecks stemming from Python’s interpreter.
Real-time object detection from the Jetson
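For reference, the following is roughly what the library's Python demo boils down to. This is only a minimal sketch: the model name, camera URI, and threshold are the defaults we tested with, and the exact camera/display helpers may differ depending on the library version.

```python
import jetson.inference
import jetson.utils

# Load the pre-trained SSD-MobileNet-v2 model (91 COCO classes)
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)

camera = jetson.utils.videoSource("csi://0")       # MIPI CSI camera on the AGX
display = jetson.utils.videoOutput("display://0")  # on-screen window

while display.IsStreaming():
    img = camera.Capture()         # grab the next frame
    detections = net.Detect(img)   # run inference and overlay detections
    display.Render(img)
    display.SetStatus("Object Detection | {:.0f} FPS".format(net.GetNetworkFPS()))
```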
Finally, I began planning out an approach for customizing the SSD model to use object classes more consistent with our intended use case (small, user-manipulable household objects). Collectively, my team and I narrowed down the three objects we wish to identify: a coffee mug, a stapler, and a jar of mayonnaise. None of these is recognized by the current model, so our approach will be to use transfer learning (aided by the Jetson-Inference library) to fine-tune the weights of the pre-trained model on a new dataset of these three objects; a short sketch of this idea follows the list below. The benefits of this approach are twofold:
- We retain the pre-trained weights of the MobileNet base network, which was originally trained on the 330,000 images in the COCO dataset.
- We reduce training time as compared to training an entirely new model from scratch.
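To make the transfer learning plan concrete, the snippet below illustrates the core idea in plain PyTorch/torchvision. This is not the Jetson-Inference training tooling we expect to actually use; the model variant, class count, and hyperparameters are placeholders for illustration only.

```python
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

NUM_CLASSES = 4  # coffee mug, stapler, mayonnaise jar, plus background

# weights_backbone="DEFAULT" keeps the pre-trained backbone weights, while
# num_classes creates a fresh detection head sized for our custom classes.
model = ssdlite320_mobilenet_v3_large(
    num_classes=NUM_CLASSES,
    weights_backbone="DEFAULT",
)

# Freeze the backbone so only the new detection head is updated when we
# fine-tune on our small dataset of household objects.
for param in model.backbone.parameters():
    param.requires_grad = False

trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.01, momentum=0.9)
```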
To build this new dataset, I plan to combine labeled images from the Open Images dataset for the classes it already covers, and to supplement them with new images, which we will label ourselves in the coming weeks, of the classes not present in that dataset (or any other), like the jar of mayonnaise. One possible way to pull the Open Images portion is sketched below.
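As a sketch of that first step, the FiftyOne toolkit can download labeled Open Images samples for specific classes and export them in a standard detection format. The class-name strings, sample count, and export path below are assumptions that we would still need to verify against the Open Images label map.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download labeled detections for the Open Images classes that (hopefully)
# match our objects; exact class names still need to be verified.
dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="train",
    label_types=["detections"],
    classes=["Coffee cup", "Stapler"],  # the mayonnaise jar will be hand-labeled
    max_samples=500,
)

# Export in Pascal VOC format so it can be merged with our own labeled images.
dataset.export(
    export_dir="data/household_objects",
    dataset_type=fo.types.VOCDetectionDataset,
    label_field="detections",
)
```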
With respect to the computer vision component of the project, we are on schedule. Though we did not initially plan on collecting and labeling our own training data, we can perform this task in parallel with the tasks on our critical path. That is, customizing our object classes should not hinder our ability to build a working product.
Next week, I plan to continue working on our design review report. I also plan to begin writing the program for potential contact detection, which predicts which object the user may be touching. As part of this, I will test the two approaches for tracking the user’s hand that we have previously discussed: running object detection on a glove, and Lucas-Kanade tracking of a point on the glove after calibration (a rough sketch of the latter approach is included below). Finally, I hope to write a detailed plan for collecting the image data, which can be parallelized across group members over spring break.
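To get a head start on that second tracking approach, here is a minimal sketch of Lucas-Kanade point tracking with OpenCV. The camera index and initial point coordinates are placeholders; in practice, the point would be selected on the glove during calibration.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)  # placeholder camera index
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Point on the glove selected during calibration (placeholder coordinates)
point = np.array([[[320.0, 240.0]]], dtype=np.float32)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Track the calibration point from the previous frame into the current one
    new_point, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, point, None, winSize=(21, 21), maxLevel=3)

    if status[0][0] == 1:
        x, y = new_point.ravel()
        cv2.circle(frame, (int(x), int(y)), 6, (0, 255, 0), -1)
        point = new_point

    prev_gray = gray
    cv2.imshow("glove tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```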