This week, I did some more research on the models for Optical Character Recognition. Here are some of the sources I looked at:
Optical Character Recognition Wiki
I learned more about which algorithms specifically allow OCR to work: OCR uses a combination of image correlation and feature extraction (both of which are computer vision methods that utilize filters) to recognize characters.
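To make the image-correlation idea concrete for myself, here is a minimal sketch that compares a small glyph bitmap against stored templates and picks the best match. The 3x3 bitmaps, the `TEMPLATES` table, and the function names are all made up for illustration, and the per-pixel agreement score is only a simple stand-in for a real correlation measure.

```python
# Toy matrix matching: compare a glyph bitmap against stored templates.
# The bitmaps and names below are hypothetical, for illustration only.

TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def pixel_agreement(glyph, template):
    """Fraction of pixels that are identical in two equally sized bitmaps."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += int(g == t)
    return same / total


def match_glyph(glyph):
    """Return (label, score) for the template most similar to the glyph."""
    scores = {label: pixel_agreement(glyph, tpl) for label, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# An "L" with one pixel missing from its bottom row.
noisy_l = ["100",
           "100",
           "110"]
label, score = match_glyph(noisy_l)
```

Even with one pixel wrong, the noisy glyph agrees with the "L" template on 8 of 9 pixels, so it is still matched correctly.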
I also learned that certain OCR systems, such as Tesseract (which is open-source, and which we mentioned in our proposal), use a two-pass approach. On the first pass, the system performs character recognition as usual. On the second pass, it uses the characters it recognized with high confidence on the first pass to help recognize the characters it could not recognize the first time.
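The two-pass idea can be sketched with a toy nearest-prototype classifier. This is not Tesseract's actual implementation: the 2-D feature points, the `max_distance` threshold, and the function names are assumptions I chose to show how confident first-pass results can be reused on the second pass.

```python
import math


def nearest(sample, prototypes):
    """Return (label, distance) of the prototype closest to the sample."""
    label, proto = min(
        ((lab, p) for lab, protos in prototypes.items() for p in protos),
        key=lambda item: math.dist(sample, item[1]),
    )
    return label, math.dist(sample, proto)


def two_pass_recognize(samples, prototypes, max_distance=1.5):
    """Label samples in two passes; samples that stay unrecognized map to None."""
    # Copy so the caller's prototypes are not modified.
    adapted = {lab: list(protos) for lab, protos in prototypes.items()}
    results = [None] * len(samples)

    # Pass 1: accept only confident matches against the original prototypes,
    # and remember each accepted sample as an extra prototype for its label.
    for i, sample in enumerate(samples):
        label, dist = nearest(sample, prototypes)
        if dist <= max_distance:
            results[i] = label
            adapted[label].append(sample)

    # Pass 2: retry the unrecognized samples using the adapted prototypes.
    for i, sample in enumerate(samples):
        if results[i] is None:
            label, dist = nearest(sample, adapted)
            if dist <= max_distance:
                results[i] = label
    return results


prototypes = {"a": [(0.0, 0.0)], "b": [(10.0, 0.0)]}
samples = [(2.0, 0.0), (1.0, 0.0), (9.0, 0.0), (5.0, 5.0)]
labels = two_pass_recognize(samples, prototypes)
```

Here the first sample is too far from the original "a" prototype to be accepted on pass 1, but it is close to the second sample, which was accepted with confidence, so pass 2 can label it. The last sample is far from everything and stays unrecognized.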
I also looked into post-processing techniques for OCR, which we might try in order to improve accuracy. Post-processing takes the extracted text and compares it against some kind of dictionary, acting as a sort of ‘spell check’ that corrects inaccurately recognized words. This may be harder to do when the text contains proper nouns that don’t appear in the dictionary, so I’d like to try an implementation with and without post-processing and compare the accuracy.
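A minimal sketch of the dictionary-based ‘spell check’ described above, using `difflib.get_close_matches` from the Python standard library. The tiny vocabulary, the `cutoff` value, and the function names are placeholders I picked for illustration. A real version would need a proper word list and would have to handle punctuation and capitalization.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    # Words with no close match (for example proper nouns) are left alone.
    return matches[0] if matches else word


def correct_text(text, vocabulary, cutoff=0.8):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w, vocabulary, cutoff) for w in text.split())


vocabulary = ["optical", "character", "recognition"]
fixed = correct_text("optical character recogniti0n", vocabulary)
unchanged = correct_word("Tesseract", vocabulary)
```

The `cutoff` controls the trade-off mentioned above: a lower value corrects more OCR errors but is more likely to "fix" a proper noun into an unrelated dictionary word.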
My progress is on schedule, as this week was meant for researching OCR models.
For next week, I will create a small test project using Tesseract and play around with the hyperparameters and training set, in order to ascertain the current level of accuracy that this system can achieve.
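To put a number on accuracy, one option is the character error rate: the edit distance between the OCR output and the known correct text, divided by the length of the correct text. The sketch below is just one possible metric, not something the project has settled on, and the function names are my own.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


cer = character_error_rate("hello world", "hell0 world")
```

Running the same metric on the raw OCR output and on the post-processed output would give a direct with/without comparison.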