Image text recognition APIs showdown. 

Google Cloud Vision vs Microsoft Cognitive Services vs AWS Rekognition.


Shameless plug: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team and build datasets super quick. Check us out.

Detecting and reading text in photos has many use cases, from taking a picture of printed text and automatically converting it into a digital file to the newer application of reading bills and invoices.

Other interesting use cases include deep image search and understanding local business listings from street view images. Combined with text translation, it also lets you take a picture of a billboard in a foreign country and have it converted to your native language. The possibilities are limitless.

Image text recognition is a class of computer vision problems that includes, among other things, OCR (optical character recognition), text detection (finding printed text in images), and handwritten text recognition.

With the advancement of deep learning, text recognition has become substantially better, but even the best companies in the business still have a long way to go before this problem can be considered solved.

Image text recognition APIs:

Most of the major technology companies and cloud services provide APIs to recognize text in an image. Each of them provides broadly the same functionality, so if you are looking to integrate with any of them, you might want to know how they stack up with respect to precision and recall.

We compared the text recognition APIs of Google Cloud Vision, Microsoft Cognitive Services, and AWS Rekognition to answer exactly that question.

Neither Dataturks nor I have any affiliation with these providers (except that I worked at Microsoft and Amazon in the past), and we have tried to be a completely unbiased third party that just wanted to evaluate independently how these APIs stack up.

Target use case:

The target use case for the comparison was to take images of business names or movie names and read the text from them. The images can be shot from various angles, use various fonts, and have varied backgrounds.

Generally, reading text off images is a two-step process: first the text is detected in the image (as a bounding box), and then the contents of that bounding box are interpreted as characters, words, lines, etc. To simplify the task, we used images already cropped at the bounding boxes, to focus on the character-reading part. Each image also had one predominant word, which was the target word to be extracted.

Example images:

Setup:

We used a subset of the IIIT 5K-word dataset, which has 5K labeled images (a lot of images in the dataset are too small to work with the APIs, which restrict the size of the images they accept). We tested a random sample of 500 images on each of Google Cloud Vision, Microsoft Cognitive Services, and AWS Rekognition.
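A random sample like the one described above can be drawn reproducibly. The following is a minimal sketch, not the code used in the experiment: the file names, the seed, and the helper name `sample_images` are all hypothetical.

```python
import random

def sample_images(image_names, k=500, seed=42):
    """Return k distinct image names chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample can be reproduced
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(image_names), k)

# Hypothetical file names standing in for the labeled word images.
all_images = [f"word_{i:04d}.png" for i in range(5000)]
subset = sample_images(all_images)
print(len(subset))
```

Fixing the seed means that anyone re-running the comparison tests the same images on every provider.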

Results:

Given an image, each API returns the text detected in the image as a list of lines. Each line is further broken into a list of words, and all three APIs return a bounding box for each component (word, line, section).

If an API is not able to detect any text, it returns an empty response. Only AWS Rekognition gives a confidence score for the detection.
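To illustrate how a confidence score can be used, here is a small sketch that pulls words out of a Rekognition-style text detection response and drops low-confidence ones. The field names (`TextDetections`, `DetectedText`, `Type`, `Confidence`) follow the response shape of Rekognition's text detection as we understand it, so check them against the current documentation. The sample response, the threshold, and the helper name are made up for illustration.

```python
def words_from_rekognition(response, min_confidence=0.0):
    """Return detected WORD strings at or above min_confidence ([] if nothing was detected)."""
    words = []
    for det in response.get("TextDetections", []):
        # Lines and words are reported as separate entries; keep only words here.
        if det.get("Type") == "WORD" and det.get("Confidence", 0.0) >= min_confidence:
            words.append(det["DetectedText"])
    return words

# Made-up response for illustration; not real API output.
sample_response = {
    "TextDetections": [
        {"DetectedText": "GOTHAM", "Type": "LINE", "Confidence": 97.1},
        {"DetectedText": "GOTHAM", "Type": "WORD", "Confidence": 97.1},
        {"DetectedText": "G0", "Type": "WORD", "Confidence": 41.5},
    ]
}

print(words_from_rekognition(sample_response, min_confidence=80.0))
print(words_from_rekognition({}))  # an empty response yields no words
```

Raising the threshold trades recall for precision: fewer images get an answer, but the answers that remain are more likely to be right.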

| Provider | #Correct (C) | #Wrong (W) | #NoResult (N) | Precision (C/(C+W)) | Recall ((C+W)/total) |
|---|---|---|---|---|---|
| Microsoft Cognitive Services | 142 | 76 | 283 | 65% | 44% |
| Google Cloud Vision | 322 | 80 | 99 | 80% | 80% |
| AWS Rekognition | 58 | 213 | 230 | 21% | 54% |
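The two metrics in the table can be recomputed from the raw counts. This sketch uses the definitions from the table header: precision is C/(C+W), and recall is (C+W)/total, that is, the share of images for which the API returned any answer. The function name is ours.

```python
def precision_recall(correct, wrong, no_result):
    """Precision = C/(C+W); recall = (C+W)/total, as defined in the table header."""
    answered = correct + wrong
    total = answered + no_result
    precision = correct / answered if answered else 0.0
    recall = answered / total if total else 0.0
    return precision, recall

# Counts copied from the table: (correct, wrong, no result).
counts = {
    "Microsoft Cognitive Services": (142, 76, 283),
    "Google Cloud Vision": (322, 80, 99),
    "AWS Rekognition": (58, 213, 230),
}

for provider, (c, w, n) in counts.items():
    p, r = precision_recall(c, w, n)
    print(f"{provider}: precision {p:.0%}, recall {r:.0%}")
```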

Since one might use these APIs in combination, we also tested the best possible performance we could get that way. The combination worked as follows: for each image, if two of the APIs returned the same text, we used that as the detected text; otherwise we chose the text returned by one of the APIs in the preference order Google Cloud Vision, then Microsoft Cognitive Services, then AWS Rekognition (the preference is based on the accuracy of each API as given by the table above).
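The combination rule can be sketched in a few lines. This is an illustration under assumptions, not the code used in the experiment: the provider keys and the function name are ours, and "same text" is taken to mean an exact string match, since the comparison rule is not spelled out.

```python
from collections import Counter

# Preference order from the text: Google first, then Microsoft, then AWS.
PREFERENCE = ["google", "microsoft", "aws"]

def combine(results):
    """results maps provider name -> detected text, or None when nothing was detected."""
    # Non-empty answers, already in preference order.
    texts = [results.get(p) for p in PREFERENCE if results.get(p)]
    if not texts:
        return None
    # If at least two providers agree, use the agreed text.
    text, votes = Counter(texts).most_common(1)[0]
    if votes >= 2:
        return text
    # Otherwise fall back to the most preferred provider that answered.
    return texts[0]

print(combine({"google": "A", "microsoft": "B", "aws": "B"}))
print(combine({"google": None, "microsoft": None, "aws": "GUN"}))
```

With three providers at most one text can collect two votes, so the agreement step never has to break a tie.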

Results for the combination:

| Provider | #Correct (C) | #Wrong (W) | #NoResult (N) | Precision (C/(C+W)) | Recall ((C+W)/total) |
|---|---|---|---|---|---|
| Combining all APIs | 325 | 115 | 61 | 74% | 88% |

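As seen from the above two tables, combining these APIs does not give meaningfully better performance: the combination returns an answer for more images (88% recall versus 80% for Google alone), but the number of correct results barely changes (325 versus 322), because the Google Cloud Vision API is substantially better than the other two. In both precision and recall, the Google Cloud Vision API is a clear winner by a wide margin.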

We have made the code and dataset freely available for anyone to validate the results.

Examples

Where all three APIs work well.

There were 33 out of 500 cases where all three APIs recognized the text correctly.

Provider
Microsoft CS: twilight Advertise Brady FROST
Google vision: twilight Advertise Brady FROST
AWS rekognition: twilight Advertise Brady FROST

Where all three APIs got it wrong.

There were 10 images where all three APIs got it wrong.

Provider
Microsoft CS: 4841!12022 %0M21Ⅶ0 sess Jachi%
Google vision: MKE NEWELL UM Cutrency WKS aciie
AWS rekognition: MIKE NEWELLF currenecy AWEA A aclie

Where none of the APIs detected anything.

There were 60 cases where none of the APIs were able to detect any text.

Provider
Microsoft CS: None None None None
Google vision: None None None None
AWS rekognition: None None None None

Microsoft CS worked and others failed.

Provider
Microsoft CS: Google 17d BULLPEN
Google vision: Gooğle None BULLPE
AWS rekognition: oogle None None

Google vision worked and others failed.

Provider
Microsoft CS: speohed None None
Google vision: Payloads DONT BEN
AWS rekognition: None None N

AWS rekognition worked and others failed.

Provider
Microsoft CS: None 041e\u042d None
Google vision: None G OTHA M FUT U R IE
AWS rekognition: GUN GOTHAM FUTURE

Ease of use:

All three APIs are quite simple to use and integrate with apps. The only drawback with AWS Rekognition is that it does not accept an image URL: the image has to be sent as raw bytes or referenced as an AWS S3 object, while the other two APIs work with any image stored on the web.
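As an illustration of how little is needed to point an API at a web-hosted image, the sketch below builds a request body for Google Cloud Vision text detection. The body shape follows the Cloud Vision REST `images:annotate` method as we understand it, so verify it against the current documentation. The image URL and the helper name are hypothetical, and actually sending the request (with credentials) is not shown.

```python
import json

def vision_text_request(image_url):
    """Build a text detection request body for one web-hosted image."""
    return {
        "requests": [
            {
                # The image is referenced by URL rather than uploaded.
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }

# Hypothetical image URL, for illustration only.
body = vision_text_request("https://example.com/storefront.jpg")
print(json.dumps(body, indent=2))
```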

Conclusion:

As per our experiment, when it comes to detecting text in images, the Google Cloud Vision API is miles ahead of Microsoft and AWS in both precision and recall. On the other hand, even today the state of the art in computer vision has a long way to go, and there are many gaps to be filled, as seen from the less than satisfactory performance of these APIs on a fairly simple use case.

If you like this, here is our blog on the comparison of best face recognition APIs.

If you have any queries or suggestions, I would love to hear them. Please write to me at mohan@dataturks.com.

