Tesseract, OCR

zeb kew
post Sep 29 2015, 11:01 AM

CounteReborn, post your image here and let us see.

If by "IC" you mean the Mykad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that tesseract only sees the number in front of a blank background.

Restrict the recognition to numbers only, to improve the recognition rate.
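Here's a minimal sketch of that numbers-only restriction, assuming the tesseract command-line tool (a recent 3.x build, where the -c flag is available; older versions take the same variable from a config file) driven from Python. The file name is made up, and the image is assumed to be already cropped down to just the number.

CODE
import subprocess

# Limit Tesseract to digits (plus the dash used in the IC number format)
# via the tessedit_char_whitelist config variable.
subprocess.run([
    "tesseract", "ic_number.png", "ic_number",   # output goes to ic_number.txt
    "-c", "tessedit_char_whitelist=0123456789-",
])

with open("ic_number.txt") as f:
    print(f.read().strip())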
zeb kew
post Sep 30 2015, 10:21 AM

QUOTE(narf03 @ Sep 29 2015, 11:21 PM)
Attempted to do that a while ago, but failed; many OCR engines try to recognize dictionary words, so number plates, names, etc. will be a big failure. It doesn't really matter much if you change the font.
*
I think there is a way to force Tesseract to restrict itself to only certain characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only there to pick out the receipt number so that the image file could be stored under the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels); it's in the documentation. Too small or too large, and the recognition rate drops.

This post has been edited by zeb kew: Sep 30 2015, 10:22 AM
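On the character-size point, here is a rough sketch of rescaling with Python/Pillow before handing the image to Tesseract. The target height is an assumption (the docs of that era suggested something on the order of 300 DPI, with capital letters at least ~20 px tall); measure your own scans and tune it.

CODE
from PIL import Image  # Pillow

TARGET_CHAR_HEIGHT = 30    # assumption -- tune against your own scans
measured_char_height = 12  # measured from a sample image (hypothetical value)

img = Image.open("receipt_number.png")
w, h = img.size
scale = TARGET_CHAR_HEIGHT / float(measured_char_height)
img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
img.save("receipt_number_scaled.png")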
zeb kew
post Sep 30 2015, 04:14 PM

QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
Boy, am I really outdated!

When I used it, it didn't use a dictionary. It was definitely not version 3; I can't remember if it was 1 or 2. When I downloaded it, I noticed that the project had been inactive, with no updates for a long time. Looks like somebody has picked it up again.

It definitely couldn't do Tamil or Thai back then. Only Roman/Latin characters at the time.

Not sure if the languages in the paragraph you quoted refer to it using dictionaries or merely to the text/characters. It could be referring merely to the characters recognized, because different languages have different characters. E.g., even French and Spanish have many characters that are not present in English.
zeb kew
post Oct 2 2015, 04:05 PM

QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Yes, I'm referring to the MyKad as the IC.
I cannot restrict recognition to numbers only; I need to get the name and address as well. But there are other words and colours distracting the engine from extracting them.

Yes you can. The IC number appears at a specific position on the IC; it is always in the same place. So your code will crop out everything else, leaving only the number. Pass THAT image to Tesseract, telling it to recognize numbers only.

Then your code takes the original image, crops out everything else, leaving only the name and address, and passes THAT to Tesseract, telling it to recognize alphanumerics and some punctuation like "/" and ".".
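A sketch of that two-pass approach, assuming Python with the Pillow library plus the tesseract command-line tool (used here purely for illustration). The crop boxes are placeholders, not real MyKad coordinates; measure them once from a sample scan, since the layout is fixed.

CODE
import subprocess
from PIL import Image

def ocr_region(image_path, box, whitelist, out_base):
    """Crop a fixed region, grayscale it, and OCR it with a restricted character set."""
    region = Image.open(image_path).crop(box).convert("L")
    region.save(out_base + ".png")
    subprocess.run([
        "tesseract", out_base + ".png", out_base,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ])
    with open(out_base + ".txt") as f:
        return f.read().strip()

# Placeholder boxes (left, upper, right, lower) -- not real MyKad coordinates.
IC_NUMBER_BOX = (40, 60, 420, 110)
NAME_ADDR_BOX = (40, 260, 620, 420)

ic_no = ocr_region("mykad.png", IC_NUMBER_BOX, "0123456789-", "ic_number")
name_addr = ocr_region("mykad.png", NAME_ADDR_BOX,
                       "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/.,-", "name_addr")
print(ic_no)
print(name_addr)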

You should filter out the background using the colour. The background near the text is blue and white; the text is black. Open the image in Photoshop and check the Red, Green and Blue channels to see which one reduces the clutter the most. You might also adjust the brightness and contrast, and scale the image so that it is the proper size (in terms of pixel height for each character) to present to Tesseract. Then you have to replicate what you did in Photoshop programmatically, using whichever language and tool you're most familiar with. Fred's ImageMagick scripts are very good, but can sometimes be a bit hard to understand. Or you could use ImageMagick directly. Note that the ImageMagick project forked many years back due to a developer dispute, and there is an alternative named GraphicsMagick.
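The channel trick can be replicated with Pillow instead of Photoshop/ImageMagick; a sketch (the threshold value is a guess, eyeball it per scan):

CODE
from PIL import Image

img = Image.open("mykad.png").convert("RGB")
r, g, b = img.split()

# Save each channel as grayscale and eyeball which one suppresses
# the blue/white background pattern the most.
r.save("channel_red.png")
g.save("channel_green.png")
b.save("channel_blue.png")

# Threshold the chosen channel so black text stays black and
# everything lighter is pushed to white.
THRESHOLD = 140  # guess -- tune per scan
clean = r.point(lambda p: 0 if p < THRESHOLD else 255)
clean.save("mykad_clean.png")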

Yes, this is a lot of work. The payoff is that you get very high recognition rates if you can get it to work. Otherwise, if you simply pass the entire image to Tesseract, photo and all, you're likely to get garbage back out.

For testing, you can just do the cropping and tweaking of the image in Photoshop, then pass the image to Tesseract. Do a few tests and see if the result is acceptable. Note that you should convert the image to grayscale; that makes it easier for you to see how Tesseract "sees" the image.

QUOTE
May I know how you can limit it to get text from a certain area only?

What I did was to crop the image and send only that portion to Tesseract. I used PHP since I was more familiar with it and its GD tools.

QUOTE
Yes, you are right. There are a lot of engines now supporting different languages. But sadly, for my case I've tried Malay as well, and it doesn't seem to work any better.
*

Dictionaries are worse than useless if you're trying to recognize names and addresses. Turn them off. They'll mess up the recognition.
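In Tesseract 3.x the dictionaries can be switched off with config variables; a sketch (variable names as I recall them, so double-check against your version's docs):

CODE
import subprocess

# Disable Tesseract's built-in word lists so names/addresses are not
# "corrected" toward dictionary words.
subprocess.run([
    "tesseract", "name_addr.png", "name_addr",
    "-c", "load_system_dawg=0",
    "-c", "load_freq_dawg=0",
])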
zeb kew
post Oct 2 2015, 04:23 PM

QUOTE(malleus @ Oct 1 2015, 10:30 PM)
Yup, it's quite a bit to digest initially, and it can be quite some work to do the training as well.

In a nutshell: you scan a sample, check the results, correct the incorrectly recognised characters, then feed the corrected data back to the OCR engine to train it, so hopefully it'll be smarter the next time.

But like I said, it's tedious work, which is why we gave up on it and went with ABBYY OCR instead.
*

Does ABBYY come with a command-line client or some kind of API so that we can bolt it on as part of a larger project?

One thing I was concerned about is that while the commercial products' recognition rates are significantly higher, they're also much heavier (in terms of resources consumed, load times, etc.). So if we have 10,000 images to process each day, the computer would choke just on loading and unloading the program for each image.

Any idea which of the commercial OCR programs have an API or command-line client?

And comparing Nuance and ABBYY, which has the better recognition rate when it comes to random data (names, numbers, etc.), that is, raw recognition before any dictionary lookup? Thinking of using it on old experimental papers (lots of names and numbers). I'm also wondering if it would help to run the same thing through 2 or 3 independent OCR programs and then compare the outputs to detect differences. Would this reduce the net error rate? I'm wondering how to push the error rate down to near zero.
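On the multi-engine comparison idea, a toy sketch of the reconciliation step, assuming you already have each engine's output for a field as plain strings (the values below are invented). Fields where the engines disagree get flagged for manual review, which is where most of the residual errors should end up:

CODE
from collections import Counter

def reconcile(readings):
    """Majority-vote across OCR engines; flag the field if there is no clear winner."""
    value, votes = Counter(readings).most_common(1)[0]
    if votes > len(readings) // 2:
        return value, False   # clear majority, accept it
    return None, True         # engines disagree -- send for manual review

# Hypothetical outputs for one field from three independent engines.
readings = ["INV-10482", "INV-10482", "1NV-10482"]

value, needs_review = reconcile(readings)
print(value if not needs_review else "MANUAL REVIEW", readings)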
zeb kew
post Oct 5 2015, 11:41 AM

QUOTE(malleus @ Oct 2 2015, 07:41 PM)
ABBYY does have a Linux command-line version. In fact, that's the only reason I picked it.

Not sure about Nuance though, never tried it.
*
Hey, thanks a lot!
BTW: Nuance is the current owner of OmniPage.

 
