QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Yes, I'm referring MyKad as IC.
I cannot restrict only numbers recognition, I need to get names and address as well. But there are some others words and colors distracting the engine to extract them out.
Yes, you can. The IC number is always printed at the same position on the IC. So your code crops out everything else, leaving only the number. Pass THAT image to Tesseract, telling it to recognize digits only.
Then your code takes the original image again, crops out everything except the name and address, and passes THAT to Tesseract, telling it to recognize alphanumerics and some punctuation like "/" and ".".
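Here is a rough Python sketch of that crop-then-recognize idea, not a finished implementation. The region fractions in FIELDS are made-up placeholders (measure the real ones from your own scans), the function names are mine, and it assumes Pillow and the tesseract command-line tool are installed. Older Tesseract 3.x builds spell the page segmentation flag "-psm" instead of "--psm", and some 4.0 builds ignore the whitelist when the LSTM engine is used, so check your version.

```python
import string
import subprocess

# Hypothetical field positions as fractions (left, top, right, bottom) of the
# full card image. These numbers are placeholders, not real MyKad measurements.
FIELDS = {
    "ic_number": {
        "region": (0.05, 0.25, 0.55, 0.35),
        "whitelist": string.digits + "-",
        "psm": 7,  # treat the crop as a single line of text
    },
    "name_address": {
        "region": (0.05, 0.60, 0.65, 0.95),
        "whitelist": string.ascii_letters + string.digits + "/.,-",
        "psm": 6,  # treat the crop as one uniform block of text
    },
}


def to_pixel_box(region, width, height):
    """Convert a fractional (left, top, right, bottom) region to pixel coordinates."""
    left, top, right, bottom = region
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def crop_field(image_path, region, out_path):
    """Save only the given region of the card to out_path."""
    from PIL import Image  # Pillow, imported lazily so the helpers work without it

    img = Image.open(image_path)
    img.crop(to_pixel_box(region, img.width, img.height)).save(out_path)


def tesseract_command(image_path, whitelist, psm):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout",
            "--psm", str(psm),
            "-c", "tessedit_char_whitelist=" + whitelist]


def read_field(image_path, field_name):
    """Crop one field out of the card and run Tesseract on the crop only."""
    field = FIELDS[field_name]
    crop_path = field_name + ".png"
    crop_field(image_path, field["region"], crop_path)
    cmd = tesseract_command(crop_path, field["whitelist"], field["psm"])
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

You would call read_field("card.png", "ic_number") and read_field("card.png", "name_address") separately, so each crop gets its own character set.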
You should filter out the background using the colour. The background around the text is blue and white, while the text itself is black. Open the image in Photoshop and check the Red, Green and Blue channels one at a time to see which one reduces the clutter most. You might also adjust the brightness and contrast, and scale the image so that each character has a suitable pixel height for Tesseract.
Then you have to replicate what you did in Photoshop programmatically, using whichever language and tool you're most familiar with. Fred's ImageMagick scripts are very good, but can sometimes be a bit hard to understand. Or you could use ImageMagick directly. Note that the ImageMagick project forked many years back due to a developer dispute, and there is an alternative named GraphicsMagick.
Yes, this is a lot of work. The payoff is that you get very high recognition rates if you can get it to work. Otherwise, if you simply pass the entire image to Tesseract, photo and all, you're likely to get mostly garbage back.
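If you end up doing this in Python rather than ImageMagick, the channel check, contrast stretch and scaling could look roughly like the sketch below. It assumes Pillow is installed. The function names, the sample pixel values and the 30 pixel target character height are my own assumptions for illustration, so tune them against your own images.

```python
def flattest_channel(background_pixels):
    """Return the channel index (0=R, 1=G, 2=B) in which the sampled
    background pixels vary least, i.e. where the background looks most uniform."""
    spreads = []
    for ch in range(3):
        values = [pixel[ch] for pixel in background_pixels]
        spreads.append(max(values) - min(values))
    return spreads.index(min(spreads))


def scale_factor(char_height_px, target_height_px=30):
    """How much to enlarge the image so characters reach the target pixel height.
    The default of 30 is an assumed starting point, not a fixed rule."""
    return target_height_px / char_height_px


def prepare_for_ocr(path, out_path, background_pixels, char_height_px):
    """Keep one colour channel, stretch its contrast, and rescale it."""
    from PIL import Image, ImageOps  # Pillow, imported lazily

    img = Image.open(path).convert("RGB")
    # split() returns one grayscale image per channel; keep the cleanest one.
    channel = img.split()[flattest_channel(background_pixels)]
    channel = ImageOps.autocontrast(channel)
    factor = scale_factor(char_height_px)
    new_size = (max(1, round(channel.width * factor)),
                max(1, round(channel.height * factor)))
    channel.resize(new_size, Image.LANCZOS).save(out_path)
```

For example, with background samples of a mid blue (40, 90, 200) and a near white (250, 250, 250), flattest_channel picks the blue channel, because both colours are bright there and only the black text stays dark.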
For testing, you can just try doing the cropping and tweaking of the image in Photoshop. Then pass the image to Tesseract. Do a few tests and see if the result is acceptable. Note that you should convert the image to grayscale. That makes it easier for you to see how Tesseract "sees" the image.
QUOTE
May I know how can you limit it to get certain area's text only?
What I did was crop the image and send only that portion to Tesseract. I used PHP since I was more familiar with it and its GD tools.
QUOTE
Yes, you are right. There are a lot of engines now supporting different languages. But sadly, for my cases I've tried Malay as well. It doesn't seems to work better anyway
Dictionaries are worse than useless if you're trying to recognize names and addresses. Turn them off. They'll mess up the recognition.
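As far as I know, the two Tesseract variables that control the word lists are load_system_dawg and load_freq_dawg. One way to turn both off is a small config file, sketched below in Python. The file name "nodict" and the function names are just my choices, and whether these variables can also be set with "-c" on the command line depends on your Tesseract version, so treat this as a starting point.

```python
import os

# Config file contents: one "variable value" pair per line.
NO_DICT_CONFIG = "load_system_dawg 0\nload_freq_dawg 0\n"


def write_no_dict_config(directory):
    """Write a Tesseract config file that disables both dictionaries."""
    path = os.path.join(directory, "nodict")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write(NO_DICT_CONFIG)
    return path


def tesseract_command(image_path, config_path):
    """Build a tesseract CLI call that loads the config file given by path."""
    return ["tesseract", image_path, "stdout", config_path]
```

Run the resulting command on the same cropped image with and without the config file and compare the output on a few real cards before deciding.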