This week has been devoted to cementing our choice of ML model for the project, as well as assessing the potential of a purely Computer Vision approach for our card detection module. Our advisors proposed the following pipeline:
- Use a traditional feature descriptor to identify cards.
- Take a snapshot of the table’s state.
- Wait 300 ms and perform steps 1-2 again.
- Given these two images, pass a gradient (convolutional) filter over each.
- Subtract the resulting gradient images to detect change.
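The differencing steps above can be sketched in a few lines of NumPy (a minimal sketch; the threshold value is an assumption, and a real version would also denoise the frames first):

```python
import numpy as np

def gradient_magnitude(frame):
    # Per-pixel gradient magnitude via central differences.
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)

def change_mask(frame_a, frame_b, thresh=0.25):
    # Subtract the two gradient images; pixels whose edge strength
    # changed by more than `thresh` are flagged as "something moved".
    diff = np.abs(gradient_magnitude(frame_b) - gradient_magnitude(frame_a))
    return diff > thresh
```

A new card entering the frame introduces fresh edges, so `change_mask` lights up around it, while two identical frames produce an all-False mask.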
The idea is that this pipeline should let us cheaply and efficiently detect cards being placed on the table; once we are sure a new card has entered the frame, we can simply use the output of the feature descriptor to keep track of the cards. However, this algorithm makes too many simplifying assumptions to work in the context of Blackjack, and is unlikely to be a good fit, for several reasons.
Firstly, we note that the best feature descriptor for the task is ORB (Oriented FAST and Rotated BRIEF), which provides fast, scale- and rotation-invariant features for object detection. However, ORB is known to fail when an object is heavily occluded, and it is also not robust to lighting changes; both are real issues for us. For occlusions, note that since cards are dealt one over another, the first card dealt into a player's hand will be largely hidden from the camera. ORB is also less robust to lighting changes than some proprietary feature descriptors. One of our user requirements is that there be no arduous or rigid setup to use our product, so constraining camera position or lighting is not a trade-off we are willing to make. Given the nature of Blackjack, then, ORB will struggle as a feature descriptor, making step 1 of this pipeline problematic from the start.
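To put a rough number on the occlusion problem: with standard poker-size cards (63.5 mm by 88.9 mm) and a hypothetical deal offset of about 15 mm between stacked cards (both figures are illustrative assumptions), only about a third of the first card stays visible:

```python
CARD_W, CARD_H = 63.5, 88.9  # poker-size card, mm (assumed dimensions)

def visible_fraction(offset_mm):
    # Fraction of the bottom card NOT covered when the next card is
    # dealt on top of it, shifted by `offset_mm` along both axes.
    covered = max(CARD_W - offset_mm, 0) * max(CARD_H - offset_mm, 0)
    return 1 - covered / (CARD_W * CARD_H)

print(round(visible_fraction(15.0), 2))  # ≈ 0.37
```

With roughly two thirds of the card hidden, most of its keypoints are simply gone, which is exactly the regime where ORB matching breaks down.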
There would also be noise issues with this pipeline that conflict with our user requirements. We already mentioned that the pipeline's accuracy would be shaky with ORB, which is a red flag: we need at least 90% accuracy to guarantee a smaller deviation from the true count than a typical professional card counter. We would also require a fixed camera angle. If we rely only on gradients and edge detection, we cannot distinguish a playing card from any other rectangular object with numbers or letters on it: ORB is a hand-crafted feature descriptor with no learned processing on top, so it is susceptible to adversarial inputs. We could easily place a book in frame, and its edges plus the letter "K" would be detected as a King! The only way around this would be to fix the camera angle so that we know what size the cards should be, but again, this violates a user requirement.
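A quick back-of-the-envelope calculation shows why 90% per-card accuracy is a floor rather than a comfortable target (illustrative assumptions: a six-deck shoe of 312 cards, with each misread card perturbing the running count):

```python
SHOE_CARDS = 6 * 52  # six-deck shoe (assumed table configuration)

def expected_misreads(per_card_accuracy, n_cards=SHOE_CARDS):
    # Expected number of misclassified cards over one shoe.
    return (1 - per_card_accuracy) * n_cards

print(expected_misreads(0.90))  # ~31 misread cards per shoe
print(expected_misreads(0.99))  # ~3
```

At exactly 90% accuracy we would already be absorbing dozens of count perturbations per shoe, so any descriptor whose accuracy is "shaky" around that line is disqualifying.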
Finally, we would spend a lot of time on this pipeline for little gain. Even with libraries such as NumPy that use BLAS under the hood, we are CPU-bound on critical (and embarrassingly parallel!) operations such as the convolution. We would have to write our own CUDA kernel for the convolutions, a large time expenditure to develop and debug for minimal savings in either time or memory, while all of the issues above would remain. All in all, this Computer Vision pipeline seemed like a sensible idea, but at the end of the day the numbers don't justify it.
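For reference, this is the kind of operation we would be paying for on the CPU: every output pixel of a 2D convolution is an independent dot product (a minimal 'valid'-mode sketch; real code would call an optimized library routine rather than this loop):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Naive 'valid'-mode 2D cross-correlation. Each output pixel is an
    # independent dot product -- embarrassingly parallel work that a
    # CPU still executes (mostly) serially.
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-x kernel responds to horizontal intensity gradients.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
```

The per-pixel independence is exactly what a GPU exploits; matching that on our own would mean writing and debugging a custom CUDA kernel, which is the time sink we want to avoid.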
For the machine learning side of things, we have finally settled on the exact YOLO model we want to use: YOLOv11 Large, for the foreseeable future. It offers a good accuracy-speed trade-off for us; we need accuracy far more than speed, and we can get 300-500 ms inference times with this model, with far better guarantees on the ability to detect cards. This lines up well with our user requirements, since we will not be streaming data or performing real-time inference.
This keeps us on track to meet our schedule and requirements. By next week, we hope to have compiled a slightly larger dataset and to have begun setting up a training environment on the ECE clusters for fine-tuning.