 Tesseract, OCR

TSCounteReborn
post Sep 29 2015, 02:37 AM, updated 11y ago

Getting Started
**
Junior Member
205 posts

Joined: Nov 2012


Hello Buddies,

I've recently been working with the Tesseract OCR engine, and it doesn't seem to work very well for me. It outputs a bunch of garbage words. Of course, with a cleaner font it scans maybe 70% correctly?

When it comes to ICs, all I get is junk. Disappointing.

Does anyone have experience with this?
Mind sharing your experience with me, bro?
TSCounteReborn
post Oct 1 2015, 09:37 PM


QUOTE(malleus @ Sep 29 2015, 10:55 AM)
some fonts do indeed scan better than others. this is not a tesseract problem, but is a common problem for all OCR, including the commercial ones. although they do differ in terms of output quality still

what's the quality of your input image like? do you do image cleanups on it?

you can probably try something like this: http://www.fmwconcepts.com/imagemagick/textcleaner/

to clean up the image to make the text clearer for the OCR to process

apart from that, have you tried doing tesseract training?
*
Nope, haven't tried that yet. Mind giving a brief overview of it? There seems to be a lot to understand.
https://code.google.com/p/tesseract-ocr/wik...iningTesseract3


QUOTE(zeb kew @ Sep 29 2015, 11:01 AM)
CounteReborn, post your image here and let us see.

If by "IC" you mean the Mykad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that tesseract only sees the number in front of a blank background.

Restrict the recognition to numbers only, to improve the recognition rate.
*
Yes, by IC I mean the MyKad.
I can't restrict recognition to numbers only; I need to get the names and address as well. But the other words and the background colours distract the engine when it tries to extract them.
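
For illustration, the background filtering zeb kew describes (keep the black text, drop the blue pattern) could be sketched roughly as below. This is a minimal pure-Python sketch, not a tested MyKad pipeline: the function name `keep_dark_pixels`, the nested-list pixel format and the threshold of 80 are all assumptions, and a real image would be loaded and saved with an imaging library before being handed to Tesseract.

```python
def keep_dark_pixels(pixels, threshold=80):
    """Return a black-on-white copy of an RGB image.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples
    with channel values in 0..255. The threshold is an assumed value
    and would need tuning against real scans.
    """
    cleaned = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            # Count a pixel as text only if every channel is dark.
            # A blue background pixel has a high blue channel, so it fails
            # this check and is turned white.
            is_text = max(r, g, b) < threshold
            new_row.append(0 if is_text else 255)
        cleaned.append(new_row)
    return cleaned


# Tiny 2x2 example: dark text pixels mixed with blue and white background.
sample = [
    [(20, 20, 25), (60, 90, 200)],
    [(230, 235, 240), (10, 10, 10)],
]
print(keep_dark_pixels(sample))  # [[0, 255], [255, 0]]
```

Because the check uses the brightest channel, coloured background patterns are removed even when they are fairly dark overall, while near-black ink survives. Printed names and addresses in black should come through the same filter as the number does.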


QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit it to only some characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file can be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.
*
Didn't know Tesseract was already well known a few years back, probably because it's open source...
May I know how you limit it to reading the text in a certain area only?
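
One way to approach this, as a rough sketch only: crop the image down to the field of interest first, then run Tesseract on that crop with a character whitelist where the field allows it (for example digits for the IC number, no whitelist for the name and address). The helper names `crop` and `tesseract_command` are made up for this example. `tessedit_char_whitelist` and page segmentation mode 7 (single text line) are existing Tesseract settings, but the flag spelling depends on the version (`-psm` in 3.x, `--psm` in 4.x and later), and the whitelist is reportedly not honoured by the LSTM engine in 4.0, so check the documentation for the version installed.

```python
def crop(pixels, left, top, width, height):
    """Return the rectangular region of a row-major pixel grid."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def tesseract_command(image_path, whitelist=None, single_line=False,
                      legacy_flags=False):
    """Build (but do not run) a tesseract command line as a list of args.

    legacy_flags=True uses the Tesseract 3.x spelling `-psm`.
    Writing to `stdout` as the output base is assumed to be supported
    by the installed version.
    """
    cmd = ["tesseract", image_path, "stdout"]
    if single_line:
        # PSM 7: treat the image as a single line of text.
        cmd += ["-psm" if legacy_flags else "--psm", "7"]
    if whitelist:
        # Only let the engine output these characters.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


# Crop the middle of a 3x3 grid, then build a digits-only command
# for a (hypothetical) file holding the cropped IC number field.
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(crop(grid, left=1, top=1, width=2, height=2))
print(tesseract_command("ic_number.png", whitelist="0123456789-",
                        single_line=True))
```

The resulting list can be passed to `subprocess.run` once the cropped region has been saved to a file. Running one command per field lets the number field be digits-only while the name and address fields stay unrestricted.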


QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
Yes, you are right. There are a lot of engines now that support different languages. But sadly, in my case I've tried Malay as well, and it doesn't seem to work any better.

TSCounteReborn
post Oct 5 2015, 06:31 PM


Is there any other open-source OCR API you'd recommend besides Tesseract? I'd like to give one a try.

 
