Shameless plug: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team, and build datasets super quick. Check us out.
Detecting and reading text from photos has many use cases, from snapping a picture of printed text and automatically converting it into a digital file to the newer application of reading bills and invoices.
Other interesting use cases include deep image search, understanding local business listings from street-view images, or, combined with text translation, taking a picture of a billboard in a foreign country and having it converted into your native language. The possibilities are limitless.
Image text recognition is a class of computer vision problems that includes, among other things, OCR (optical character recognition), text detection (finding printed text in images), and handwritten text recognition.
With the advancement of deep learning we have become substantially better at text recognition, but even the best companies in the business have a lot of ground to cover before this problem can be considered solved.
Most of the major technology companies and cloud services provide APIs to recognize text in an image. They all offer the same functionality, so if you are looking to integrate with one of them, you might want to know how they stack up with respect to accuracy and recall.
We compared the text recognition APIs from Google Cloud Vision, Microsoft Cognitive Services, and AWS Rekognition to answer exactly that question.
Neither Dataturks nor I have any affiliation with these providers (except that I worked at Microsoft and Amazon in the past), and we have tried to be a completely unbiased third party that just wanted to independently evaluate how these APIs stack up.
The target use case for the comparison was: take images of business names or movie names and read the text from them. The images can be shot from various angles, use varied fonts, and have varied backgrounds.
Reading text off images is generally a two-step process: first, text is detected in the image (a bounding box), and then that bounding box is interpreted as characters, words, lines, etc. To simplify the task, we took images already cropped at the bounding boxes, so as to focus on the character-reading part. Also, each image had one predominant word, which was the target word to be extracted.
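Since the dataset images were already cropped at the bounding boxes, the only preprocessing needed on our side was making sure a box stays inside the image before cropping. A minimal sketch in pure Python; the `(left, top, right, bottom)` box format is an assumption, matching what Pillow's `Image.crop` expects:

```python
def clamp_box(box, width, height):
    """Clamp a (left, top, right, bottom) bounding box to the image bounds
    so a crop never falls outside the image."""
    left, top, right, bottom = box
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))

# With Pillow installed, the actual crop would then be:
#   image.crop(clamp_box(box, *image.size))
```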
Example images:
We used a subset of the IIIT 5K-word dataset, which has 5,000 labeled images (many images in the dataset are too small to work with the APIs, which restrict the size of input images). We tested a random sample of 500 images on each of the Google Vision API, Microsoft Cognitive Services, and AWS Rekognition.
Here is the open-dataset we used: (Open to download)
An example data-item from the dataset
Given an image, each API returns the detected text as a list of lines. Each line is further broken into a list of words, and all three APIs return a bounding box for each component (word, line, section).
If an API is not able to detect any text, it returns an empty response. Only AWS Rekognition gives a confidence score for each detection.
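To make that shared response shape concrete, here is a small sketch that pulls the single predominant word out of a mocked-up response. The field names (`lines`, `words`, `boundingBox`) are simplified assumptions for illustration, not any provider's real schema:

```python
# Hypothetical, simplified response: lines -> words, each with a bounding box.
response = {
    "lines": [
        {
            "text": "FROST",
            "boundingBox": [0, 0, 120, 40],
            "words": [{"text": "FROST", "boundingBox": [0, 0, 120, 40]}],
        }
    ]
}

def predominant_word(response):
    """Return the longest detected word, or None when the API
    returned an empty response (no text detected)."""
    words = [w["text"] for line in response.get("lines", []) for w in line["words"]]
    return max(words, key=len) if words else None

print(predominant_word(response))  # FROST
```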
Provider | #Correct (C) | #Wrong (W) | #NoResult (N) | Precision (C/(C+W)) | Recall ((C+W)/total) |
---|---|---|---|---|---|
Microsoft Cognitive Services | 142 | 76 | 283 | 65% | 44% |
Google Cloud Vision | 322 | 80 | 99 | 80% | 80% |
AWS Rekognition | 58 | 213 | 230 | 21% | 54% |
Similar to how one might use these APIs in combination, we also tested the best possible combined performance. The combination worked like this: for each image, if two of the APIs returned the same text, we used that as the detected text; otherwise we chose the text returned by one of the APIs, in the preference order of Google Vision, then Microsoft Cognitive Services, then AWS Rekognition (the order is based on the accuracy of each API as given in the table above).
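The voting-plus-preference rule described above can be sketched in a few lines of Python; the provider keys are made-up labels for this sketch, not real API identifiers:

```python
PREFERENCE = ["google", "microsoft", "aws"]  # ordered by standalone accuracy

def combine(results):
    """results maps provider -> detected text (None for an empty response).
    If any two providers agree on a non-empty text, use it; otherwise fall
    back to the first provider in the preference order that returned text."""
    texts = [t for t in results.values() if t]
    for t in texts:
        if texts.count(t) >= 2:
            return t
    for provider in PREFERENCE:
        if results.get(provider):
            return results[provider]
    return None

print(combine({"google": "FROST", "microsoft": "FROST", "aws": "FR0ST"}))  # FROST
```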
Results for the combination:
Provider | #Correct (C) | #Wrong (W) | #NoResult (N) | Precision (C/(C+W)) | Recall ((C+W)/total) |
---|---|---|---|---|---|
Combining all APIs | 325 | 115 | 61 | 74% | 88% |
As the two tables show, combining these APIs does not yield better overall performance, since the Google Vision API is substantially better than the other two. In both precision and recall, the Google Vision API is the clear winner, by a wide margin.
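For reference, the precision and recall columns in both tables follow directly from the counts; a quick sketch of the arithmetic as the tables define it:

```python
def metrics(correct, wrong, total):
    """Precision and recall exactly as defined in the tables:
    precision = C / (C + W), recall = (C + W) / total."""
    returned = correct + wrong
    return correct / returned, returned / total

p, r = metrics(322, 80, 500)  # Google Cloud Vision row: p ≈ 0.80, r ≈ 0.80
```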
We have made the code and dataset freely available for anyone to validate the results.
There were 33 cases out of 500 where all three APIs recognized the text correctly.
Provider | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Microsoft CS: | twilight | Advertise | Brady | FROST |
Google vision: | twilight | Advertise | Brady | FROST |
AWS rekognition: | twilight | Advertise | Brady | FROST |
There were 10 images where all three APIs got it wrong.
Provider | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Microsoft CS: | 4841!12022 | %0M21Ⅶ0 | sess | Jachi% |
Google vision: | MKE NEWELL UM | Cutrency | WKS | aciie |
AWS rekognition: | MIKE NEWELLF | currenecy | AWEA A | aclie |
There were 60 cases where none of the APIs were able to detect any text.
Provider | ![]() | ![]() | ![]() | ![]() |
---|---|---|---|---|
Microsoft CS: | None | None | None | None |
Google vision: | None | None | None | None |
AWS rekognition: | None | None | None | None |
Provider | ![]() | ![]() | ![]() |
---|---|---|---|
Microsoft CS: | 17d | BULLPEN | |
Google vision: | Gooğle | None | BULLPE |
AWS rekognition: | oogle | None | None |
Provider | ![]() | ![]() | ![]() |
---|---|---|---|
Microsoft CS: | speohed | None | None |
Google vision: | Payloads | DONT | BEN |
AWS rekognition: | None | None | N |
Provider | ![]() | ![]() | ![]() |
---|---|---|---|
Microsoft CS: | None | 041eЭ | None |
Google vision: | None | G OTHA M | FUT U R IE |
AWS rekognition: | GUN | GOTHAM | FUTURE |
All three APIs are quite simple to use and to integrate into apps. The only drawback of the AWS Rekognition API is that it only takes an image stored as an AWS S3 object as input, while the other APIs work with any image stored on the web.
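As an illustration of that S3-only input restriction, here is a sketch of calling Rekognition's `DetectText` via boto3. The bucket and key names are made up, and actually running the call requires valid AWS credentials and network access:

```python
def s3_image_ref(bucket, key):
    """Build the S3Object reference that DetectText requires as input."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}

def detect_words(client, bucket, key):
    """Return the WORD-level detections from a DetectText response."""
    response = client.detect_text(Image=s3_image_ref(bucket, key))
    return [d["DetectedText"] for d in response["TextDetections"]
            if d["Type"] == "WORD"]

# Usage (requires credentials; bucket/key are hypothetical):
# import boto3
# client = boto3.client("rekognition", region_name="us-east-1")
# print(detect_words(client, "my-bucket", "images/frost.jpg"))
```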
As per our experiment, when it comes to detecting text in images, the Google Vision API is miles ahead of Microsoft and AWS in both precision and recall. On the other hand, even today the state of the art in computer vision has a long way to go, with many gaps to fill, as seen from the less-than-satisfactory performance of these APIs on a fairly simple use case.
Here is the open-dataset we used.
If you liked this, here is our blog post comparing the best face recognition APIs.
If you have any queries or suggestions, I would love to hear them. Please write to me at mohan@dataturks.com.