 Tesseract, OCR

narf03
post Sep 29 2015, 11:21 PM

Look at all my stars!!
*******
Senior Member
4,547 posts

Joined: Dec 2004
From: Metro Prima, Kuala Lumpur, Malaysia, Earth, Sol


I attempted that a while ago, but failed. Many OCR engines try to recognize dictionary words, so if you do number plates, names, etc., it will be a big failure. Changing the font doesn't really matter much.
narf03
post Sep 30 2015, 03:55 PM


QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit it to only some characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file can be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.
*
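The command line option being remembered above is most likely the `tessedit_char_whitelist` config variable, passed with `-c`. Here is a rough sketch in Python of how a receipt-number setup like the one described might look. It only builds the command and parses text; it does not run Tesseract. The function names, the file name `receipt.png`, the digits-only whitelist and the "6 or more digits" rule are all my own assumptions for illustration, not the original project's code. The `load_system_dawg` / `load_freq_dawg` switches are Tesseract config variables that turn off its word lists, which may help with non-dictionary text like plates and names. How well the whitelist is honoured depends on the Tesseract version and engine, so test it on your own install.

```python
import re


def build_tesseract_command(image_path, output_base="stdout",
                            whitelist="0123456789", use_dictionary=False):
    """Build an argv list for the tesseract CLI (nothing is executed here)."""
    cmd = ["tesseract", image_path, output_base]
    # Restrict recognition to the characters in `whitelist`.
    cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    if not use_dictionary:
        # Turn off the built-in word lists so non-dictionary strings
        # (receipt numbers, plates, names) are not pulled toward real words.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def extract_receipt_number(ocr_text, min_digits=6):
    """Return the first run of at least `min_digits` digits, or None.

    The digit-count rule is a hypothetical stand-in for whatever
    pattern real receipts follow.
    """
    match = re.search(r"\d{%d,}" % min_digits, ocr_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(" ".join(build_tesseract_command("receipt.png")))
    print(extract_receipt_number("Receipt No: 00123456 Total 12.50"))
```

If Tesseract is installed, the list from `build_tesseract_command` can be handed to `subprocess.run(cmd, capture_output=True, text=True)` and the resulting `stdout` passed to `extract_receipt_number`.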
From Wikipedia:
https://en.wikipedia.org/wiki/Tesseract_(software)

QUOTE
The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, Dutch, English, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Spanish, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.

 
