This week I worked on many things: getting frames from the RPi camera video stream to pass to the gesture classifier, collecting more images to train it (through crowdsourcing on social media, which was a very interesting experience – we now have a total of around 1200 real images in our dataset, I believe), writing a script to relabel the data easily, and, most importantly, boosting the classifier's validation accuracy from ~80% to 90.36% as of Neeti's last run. Yay!
I tried quite a few things to improve our accuracy:
- Adding more images to the old model – this did not really help beyond a point; the net would just overfit on the training dataset and validation accuracy would not climb past 81% at best
- Adding Dropout regularization to the old model to reduce the overfitting (a rough sketch of this appears after this list)
- Tweaking the old model's hyperparameters (batch size, an adaptive learning rate in Keras, different optimizers – Adam turned out to be the best – the training:test split, etc.) – this finally maxed out the classifier's validation accuracy at around 85%
- I suspected that the variation in image backgrounds (plentiful, since we got these images from SO many different people) was keeping the validation accuracy low. So my goal became eliminating the background entirely in each image and feeding only the hand portion to the net. I implemented a skin detection algorithm using online resources and figured out the HSV color thresholds that worked best. The algorithm essentially does a kind of clustering: it thresholds the image on skin color and finds contours in the resulting mask. It ended up working really well – I verified it on our images (see the skin-detector sketch after this list)
- I also thought it might be a good idea to turn the skin detector's output into pure black and white using Otsu binarization (something I learned in a computational biology class last semester for counting cells in microscope images) – black background and white hand, to increase the uniformity of the images being fed in
- However, the above two steps did not help the old neural net model; it stayed maxed out, fluctuating between 80% and 85% on average. I then decided it would be worth trying a different neural net architecture
- I researched common CNN architectures used for industrial gesture classifiers and came across VGG16, which is widely regarded as one of the best for image classification. However, it is also huge and ends up learning about 138 million weights in total. A net like that is built for large-scale data processing on powerful GPUs with lots of compute and memory; it is not possible to run it on a normal laptop or the RPi. That said, it was still the best architecture available, and I found that using a subset of the VGG16 layers, with reduced node counts in the dense layers but a similar overall pattern, was effective enough for our purpose (see the model sketch after this list)
- Finally, using this model + the skin detector's output (without Otsu binarization, which didn't help much) gave a consistent validation accuracy of 87.88% over multiple runs on the 923 images we had at the time
- After we added more images today, Neeti ran it again and it hit 90.36%
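
For the curious, here is roughly what the Dropout + optimizer changes to the old model looked like in Keras. The layer sizes, input shape, and gesture-class count below are illustrative guesses, not our exact configuration:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.callbacks import ReduceLROnPlateau

# Illustrative stand-in for the "old" model -- layer sizes, input
# shape, and class count here are assumptions, not our real config.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                    # regularization to curb overfitting
    Dense(5, activation='softmax'),  # assumed 5 gesture classes
])

# Adam was the optimizer that worked best for us.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Adaptive learning rate: halve the LR when validation loss plateaus.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
# model.fit(x_train, y_train, batch_size=32, epochs=50,
#           validation_split=0.2, callbacks=[reduce_lr])
```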
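And a sketch of the skin detector with the optional Otsu step, assuming OpenCV. The HSV thresholds here are common starting points from online tutorials, not necessarily the exact values I ended up with:

```python
import cv2
import numpy as np

# Hypothetical HSV skin thresholds -- tune these for your lighting.
LOWER_SKIN = np.array([0, 48, 80], dtype=np.uint8)
UPPER_SKIN = np.array([20, 255, 255], dtype=np.uint8)

def isolate_hand(bgr, use_otsu=False):
    """Mask out the background, keeping only the largest skin-colored
    contour (assumed to be the hand). Optionally Otsu-binarize the
    result into a black-background / white-hand image."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_SKIN, UPPER_SKIN)
    mask = cv2.GaussianBlur(mask, (5, 5), 0)  # smooth out speckle noise

    # OpenCV 4.x return signature (3.x returns an extra image first).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # no skin-colored region found

    hand = max(contours, key=cv2.contourArea)
    clean = np.zeros_like(mask)
    cv2.drawContours(clean, [hand], -1, 255, thickness=cv2.FILLED)
    result = cv2.bitwise_and(bgr, bgr, mask=clean)

    if use_otsu:
        gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
        # The threshold value 0 is ignored when THRESH_OTSU is set;
        # Otsu picks the background/hand split automatically.
        _, result = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return result
```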
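Finally, here is the flavor of the slimmed-down, VGG16-style net: stacked 3x3 conv blocks with doubling filter counts, then dense layers far smaller than VGG16's 4096-unit ones. The exact depths and widths below are illustrative, not our final configuration:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Block 1
    Conv2D(32, (3, 3), activation='relu', padding='same',
           input_shape=(64, 64, 1)),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    # Block 2
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    # Block 3
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    # Classifier head, much narrower than VGG16's
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(5, activation='softmax'),  # assumed 5 gesture classes
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```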
Crowdsourcing was a wild experience. Overall I think it has been a highly productive week and we are on track. 🙂