Tesseract

Lowyat.NET forums

Tesseract, OCR

CounteReborn (TS)	Sep 29 2015, 02:37 AM, updated 11y ago | Post #1
Getting Started, Junior Member, 205 posts, Joined: Nov 2012

Hello buddies,

I've recently been working with the Tesseract OCR, and it doesn't seem to be working well for me. A bunch of funny words get scanned out. Of course, if I use a better font it scans maybe 70% correctly? When it comes to the IC, I see junk scanned out. Disappointed.

Does anyone have experience with this? Please share your experience with me, bro.

CounteReborn (TS)	Oct 1 2015, 09:37 PM | Post #2
Getting Started, Junior Member, 205 posts, Joined: Nov 2012

QUOTE(malleus @ Sep 29 2015, 10:55 AM)
some fonts do indeed scan better than others. this is not a tesseract problem, but is a common problem for all OCR, including the commercial ones. although they do differ in terms of output quality still.
what's the quality of your input image like? do you do image cleanups on it? you can probably try something like this: http://www.fmwconcepts.com/imagemagick/textcleaner/ to clean up the image to make the text clearer for the OCR to process.
apart from that, have you tried doing tesseract training?

Nope, haven't tried that yet. Mind giving a brief explanation of it? Seems like a lot to understand.
https://code.google.com/p/tesseract-ocr/wik...iningTesseract3

QUOTE(zeb kew @ Sep 29 2015, 11:01 AM)
CounteReborn, post your image here and let us see.
If by "IC" you mean the MyKad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that Tesseract only sees the number in front of a blank background. Restrict the recognition to numbers only, to improve the recognition rate.

Yes, I'm referring to the MyKad as IC. I can't restrict recognition to numbers only, because I need to get the names and address as well. But there are other words and colours on the card that distract the engine from extracting them.

QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit it to only some characters. And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor. Can't remember the command line option now.
We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file could be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts.
A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.

Didn't know Tesseract was already famous a few years back, probably because it's open source ...
May I know how you limit it to get the text from a certain area only?

QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
From wiki
https://en.wikipedia.org/wiki/Tesseract_(software)

Yes, you are right. There are a lot of engines now supporting different languages. But sadly, in my case I've tried Malay as well, and it doesn't seem to work any better.
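
For anyone landing on this thread later, here is a rough sketch of the two suggestions quoted above: filter out the coloured background so only the dark text is left, then restrict the characters Tesseract is allowed to output. It is only an illustration under some assumptions: the image is a plain list of rows of (R, G, B) tuples rather than a real image file, the darkness threshold of 80 is a guess that would need tuning on real MyKad scans, and the names `keep_dark_pixels`, `digits_only_command` and `ic_number.png` are made up for this example. `tessedit_char_whitelist` is the Tesseract config variable for limiting the recognised characters, passed with `-c`; how well it is honoured can vary between Tesseract versions.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def keep_dark_pixels(image: Sequence[Sequence[RGB]], threshold: int = 80) -> List[List[int]]:
    """Return a black-on-white copy of the image.

    Pixels whose luminance is below `threshold` (dark ink) become 0 (black);
    everything else, such as a blue background pattern, becomes 255 (white).
    The threshold value is an assumption and needs tuning per scan.
    """
    cleaned = []
    for row in image:
        new_row = []
        for r, g, b in row:
            # Standard luma weights for converting RGB to grayscale.
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(0 if luminance < threshold else 255)
        cleaned.append(new_row)
    return cleaned


def digits_only_command(image_path: str, whitelist: str = "0123456789") -> List[str]:
    """Build (but do not run) a tesseract command limited to `whitelist`.

    "stdout" as the output base makes tesseract print the text instead of
    writing a file. Run the list with subprocess.run(...) if tesseract is
    installed.
    """
    return [
        "tesseract",
        image_path,
        "stdout",
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


if __name__ == "__main__":
    # Tiny 2x2 "scan": near-black ink on a bluish background.
    sample = [
        [(10, 10, 10), (90, 140, 220)],
        [(200, 220, 255), (0, 0, 0)],
    ]
    print(keep_dark_pixels(sample))
    print(" ".join(digits_only_command("ic_number.png")))
```

Since names and addresses are needed too, the whitelist could be widened (for example to capital letters, digits and a few punctuation marks) for those fields, with the digits-only list kept for the IC number line.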

CounteReborn (TS)	Oct 5 2015, 06:31 PM | Post #3
Getting Started, Junior Member, 205 posts, Joined: Nov 2012

Is there any other open-source OCR API you'd recommend besides Tesseract? I'd like to give one a try.