Tesseract

Lowyat.NET forums


Tesseract, OCR


CounteReborn (TS)	Sep 29 2015, 02:37 AM | Post #1
Hello buddies,

Recently I've been working with the Tesseract OCR engine, and it doesn't seem to be working well for me. A bunch of funny words get scanned out. Of course, if I use a better font it scans maybe 70% correctly? When it comes to the IC, all I see scanned out is junk. Disappointed.

Does anyone have experience with this? Please share your experience with me, bro.

malleus	Sep 29 2015, 10:55 AM | Post #2
Some fonts do indeed scan better than others. This is not a Tesseract problem; it is a common problem for all OCR engines, including the commercial ones, although they do still differ in output quality.

What's the quality of your input image like? Do you do any image cleanup on it? You can try something like this to clean up the image and make the text clearer for the OCR to process:
http://www.fmwconcepts.com/imagemagick/textcleaner/

Apart from that, have you tried Tesseract training?
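To make "image cleanup" a little more concrete, here is a minimal sketch of one common cleanup step: turning a grayscale image into pure black and white with Otsu's method. This is a swapped-in, generic technique, not what the textcleaner script linked above does internally. It assumes the image is already a plain list of rows of 0-255 gray values, and the function names are made up for illustration.

```python
def otsu_threshold(gray_values):
    """Pick the gray level that best separates dark pixels from light ones.

    gray_values is a flat list of integers in the range 0..255.
    """
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1

    total = len(gray_values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_dark = 0      # number of pixels at or below the candidate level
    sum_dark = 0         # sum of their gray values
    best_level = 0
    best_score = -1.0

    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are far apart.
        score = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score = score
            best_level = level
    return best_level


def binarise(gray):
    """Return a copy of the image where every pixel is either 0 or 255."""
    flat = [v for row in gray for v in row]
    level = otsu_threshold(flat)
    return [[255 if v > level else 0 for v in row] for row in gray]
```

In practice the binarised image, not the raw scan, is what would be handed to the OCR engine. Uneven lighting usually needs a local (adaptive) threshold instead of a single global one.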

zeb kew	Sep 29 2015, 11:01 AM | Post #3
CounteReborn, post your image here and let us see.

If by "IC" you mean the MyKad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that Tesseract only sees the number in front of a blank background. Restrict the recognition to numbers only to improve the recognition rate.

narf03	Sep 29 2015, 11:21 PM | Post #4
I attempted that a while ago but failed. Many OCR engines try to recognize dictionary words, so number plates, names, etc. will be a big failure. It doesn't really matter much if you change the font.

zeb kew	Sep 30 2015, 10:21 AM | Post #5
QUOTE(narf03 @ Sep 29 2015, 11:21 PM)
I attempted that a while ago but failed. Many OCR engines try to recognize dictionary words, so number plates, names, etc. will be a big failure. It doesn't really matter much if you change the font.

I think there is a way to force Tesseract to limit recognition to only some characters. And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor. Can't remember the command line option now.

We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file could be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts.

A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.

This post has been edited by zeb kew: Sep 30 2015, 10:22 AM
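For the "limit it to only some characters" idea, Tesseract has a configuration variable called tessedit_char_whitelist that can be set from the command line with -c. The sketch below only builds and runs such a command; the file name and helper names are made up, and the page-segmentation flag is an assumption (newer releases spell it --psm, the 3.x releases of the time used -psm). How strictly the whitelist is honoured depends on the Tesseract version and engine, so a plain post-filter is included as a fallback.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, whitelist, single_line=True):
    """Build a tesseract command line that restricts the allowed characters."""
    cmd = ["tesseract", image_path, "stdout"]
    if single_line:
        # Page segmentation mode 7: treat the image as one line of text.
        cmd += ["--psm", "7"]
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def keep_allowed(text, allowed):
    """Fallback filter: drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


def run_ocr(image_path, whitelist):
    """Run tesseract (if installed) and return the filtered text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, whitelist),
        capture_output=True,
        text=True,
        check=True,
    )
    return keep_allowed(result.stdout.strip(), whitelist)
```

A call such as run_ocr("receipt_number.png", "0123456789") would then return digits only, assuming the cropped image really contains just the number.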

narf03	Sep 30 2015, 03:55 PM | Post #6
QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit recognition to only some characters. And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor. Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file could be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.

From wiki: https://en.wikipedia.org/wiki/Tesseract_(software)

QUOTE
The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, Dutch, English, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Spanish, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.

zeb kew	Sep 30 2015, 04:14 PM | Post #7
QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
From wiki: https://en.wikipedia.org/wiki/Tesseract_(software)

Boy, really outdated am I! When I used it, it didn't use a dictionary. It definitely was not version 3. Can't remember if it was 1 or 2. When I downloaded it, I noticed that the project had been inactive and there had been no updates for a long time. Looks like somebody has picked it up again.

It definitely couldn't do Tamil or Thai back then. Only Roman/Latin characters at the time.

Not sure if the languages in the paragraph you quoted refer to it using dictionaries or merely to the text/characters. It could be referring only to the characters recognized, because different languages have different characters. E.g., even French and Spanish have many characters that are not present in English.

CounteReborn (TS)	Oct 1 2015, 09:37 PM | Post #8
QUOTE(malleus @ Sep 29 2015, 10:55 AM)
Some fonts do indeed scan better than others. This is not a Tesseract problem; it is a common problem for all OCR engines, including the commercial ones, although they do still differ in output quality. What's the quality of your input image like? Do you do any image cleanup on it? You can try something like this to clean up the image and make the text clearer for the OCR to process: http://www.fmwconcepts.com/imagemagick/textcleaner/ Apart from that, have you tried Tesseract training?

Nope, haven't tried that yet. Mind giving me a brief explanation of it? Seems like a lot to understand.
https://code.google.com/p/tesseract-ocr/wik...iningTesseract3

QUOTE(zeb kew @ Sep 29 2015, 11:01 AM)
CounteReborn, post your image here and let us see. If by "IC" you mean the MyKad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that Tesseract only sees the number in front of a blank background. Restrict the recognition to numbers only to improve the recognition rate.

Yes, I'm referring to the MyKad as the IC. I cannot restrict recognition to numbers only; I need to get names and addresses as well. But there are other words and colours that distract the engine from extracting them.

QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit recognition to only some characters. And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor. Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file could be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.

Didn't know Tesseract was famous a few years back, probably because it's open source... May I know how you can limit it to get the text from a certain area only?

QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
From wiki: https://en.wikipedia.org/wiki/Tesseract_(software)

Yes, you are right. There are a lot of engines now supporting different languages. But sadly, in my case I've tried Malay as well. It doesn't seem to work any better.

malleus	Oct 1 2015, 10:30 PM | Post #9
QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Nope, haven't tried that yet. Mind giving me a brief explanation of it? Seems like a lot to understand.

Yup, it's quite a bit to digest initially, and the training itself can be quite some work as well.

In a nutshell: you scan a sample, check the results, correct the characters that were recognised incorrectly, then feed the corrected data back to the OCR to train it, so hopefully it'll be smarter the next time.

But like I said, it's tedious work, which is why we gave up on it and went with ABBYY OCR instead.

zeb kew	Oct 2 2015, 04:05 PM | Post #10
QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Yes, I'm referring to the MyKad as the IC. I cannot restrict recognition to numbers only; I need to get names and addresses as well. But there are other words and colours that distract the engine from extracting them.

Yes you can. The IC number always appears at the same position on the IC. So your code crops out everything else, leaving only the number. Pass THAT image to Tesseract, telling it to recognize numbers only.

Then your code takes the original image, crops out everything except the name and address, and passes THAT to Tesseract, telling it to recognize alphanumerics and some punctuation like "/" and ".".

You should filter out the background using the colour. The background near the text is blue and white. The text is black. Open the image in Photoshop and check the Red, Green and Blue channels. See which one reduces the clutter. You might also adjust the brightness and contrast, and scale the image so that it is the proper size (in terms of pixel height for each character) to present to Tesseract.

Then you have to replicate what you did in Photoshop programmatically, using whichever language and tool you're most familiar with. Fred's scripts are very good, but can sometimes be a bit hard to understand. Or you could use ImageMagick directly. Note that the ImageMagick project forked many years back due to a developer dispute, and there is an alternative named GraphicsMagick.

Yes, this is a lot of work. The payoff is that you get very high recognition rates if you can get it to work. Otherwise, if you simply pass the entire image to Tesseract, photo and all, you're likely to get garbage back out.

For testing, you can just do the cropping and tweaking of the image in Photoshop, then pass the image to Tesseract. Do a few tests and see if the result is acceptable.

Note that you should convert the image to grayscale. That makes it easier for you to see how Tesseract "sees" the image.

QUOTE
May I know how you can limit it to get the text from a certain area only?

What I did was to crop the image and send only that portion to Tesseract. I used PHP since I was more familiar with it and its GD tools.

QUOTE
Yes, you are right. There are a lot of engines now supporting different languages. But sadly, in my case I've tried Malay as well. It doesn't seem to work any better.

Dictionaries are worse than useless if you're trying to recognize names and addresses. Turn them off. They'll mess up the recognition.
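The crop-then-pick-a-channel routine described above can be sketched without any imaging library. Everything here is an assumption made for illustration: the image is a list of rows of (R, G, B) tuples, the colours and the crop box are invented, and a real MyKad scan will have different coordinates and less tidy colours. The point is only to show why one colour channel can make a blue background disappear while another turns it into clutter.

```python
def crop(image, left, top, width, height):
    """Keep only the rectangle that holds the field we want to read."""
    return [row[left:left + width] for row in image[top:top + height]]


def channel(image, index):
    """Extract one colour channel (0 = red, 1 = green, 2 = blue) as grayscale."""
    return [[pixel[index] for pixel in row] for row in image]


def to_black_and_white(gray, threshold=128):
    """Pixels darker than the threshold become 0 (ink), the rest 255 (paper)."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]


# Made-up colours: blue pattern, white card surface, black printed text.
BLUE, WHITE, BLACK = (30, 60, 200), (255, 255, 255), (10, 10, 10)

card = [
    [WHITE, WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, BLUE,  BLACK, WHITE, BLACK, BLUE],
    [WHITE, WHITE, BLACK, BLUE,  BLACK, WHITE],
    [WHITE, WHITE, WHITE, WHITE, WHITE, WHITE],
]

# Hypothetical position of the number field on the card.
region = crop(card, left=1, top=1, width=5, height=2)

# In the red channel the blue pattern is dark, so it survives as clutter.
red_view = to_black_and_white(channel(region, 0))

# In the blue channel the pattern is as bright as the paper, so only text remains.
blue_view = to_black_and_white(channel(region, 2))
```

With these invented colours the blue channel leaves only the two text columns black. On a real scan it is worth looking at all three channels, as the post suggests, before settling on one.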

zeb kew	Oct 2 2015, 04:23 PM | Post #11
QUOTE(malleus @ Oct 1 2015, 10:30 PM)
Yup, it's quite a bit to digest initially, and the training itself can be quite some work as well. In a nutshell: you scan a sample, check the results, correct the characters that were recognised incorrectly, then feed the corrected data back to the OCR to train it, so hopefully it'll be smarter the next time. But like I said, it's tedious work, which is why we gave up on it and went with ABBYY OCR instead.

Does ABBYY come with a command line client or some kind of API so that we can bolt it on as part of a larger project?

One thing I was concerned about is that while the commercial products' recognition rate was significantly higher, they're also way heavier (in terms of resources consumed by the program, load times, etc.). So if we have 10,000 images to process each day, the computer would just choke on loading and unloading the program for each image.

Any idea which of the commercial OCR programs have an API or command line client? And comparing Nuance and ABBYY, which has the better recognition rate when it comes to random data (names, numbers, etc.), meaning raw recognition before any dictionary lookup? I'm thinking of using it on old experimental papers, which contain lots of names and numbers.

I'm also wondering if it would help to run the same thing through 2 or 3 independent OCR programs, and then compare the outputs to detect differences. Would this reduce the net error rate? I'm wondering how to push the error rate down to near zero.
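The "run it through 2 or 3 engines and compare" idea can be sketched as a simple vote on each extracted field. This is only an illustration: the readings below are made-up strings, not output from any real engine, and the function name is hypothetical. A field is accepted when a strict majority of engines agree, and otherwise flagged for a human to check.

```python
from collections import Counter


def vote(readings):
    """Return the reading most engines agree on, or None if there is no majority.

    readings is a list with one string per OCR engine for the same field.
    """
    counts = Counter(readings)
    winner, hits = counts.most_common(1)[0]
    if hits * 2 > len(readings):
        return winner
    return None  # no strict majority: send this field for manual review


# Made-up example: two engines agree, one confuses 0 with O.
accepted = vote(["R-10234", "R-10234", "R-1O234"])

# Made-up example: every engine reads something different.
flagged = vote(["R-10234", "R-1O234", "R-10284"])
```

Voting of this kind only helps to the extent that the engines make different mistakes. If all of them misread the same smudged character the same way, the error passes unnoticed.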

malleus	Oct 2 2015, 07:41 PM | Post #12
QUOTE(zeb kew @ Oct 2 2015, 04:23 PM)
Does ABBYY come with a command line client or some kind of API so that we can bolt it on as part of a larger project? One thing I was concerned about is that while the commercial products' recognition rate was significantly higher, they're also way heavier (in terms of resources consumed by the program, load times, etc.). So if we have 10,000 images to process each day, the computer would just choke on loading and unloading the program for each image. Any idea which of the commercial OCR programs have an API or command line client? And comparing Nuance and ABBYY, which has the better recognition rate when it comes to random data (names, numbers, etc.), meaning raw recognition before any dictionary lookup? I'm thinking of using it on old experimental papers, which contain lots of names and numbers. I'm also wondering if it would help to run the same thing through 2 or 3 independent OCR programs, and then compare the outputs to detect differences. Would this reduce the net error rate? I'm wondering how to push the error rate down to near zero.

ABBYY does have a Linux command line version. In fact, that's the only reason I picked it.

Not sure about Nuance though, never tried it.

zeb kew	Oct 5 2015, 11:41 AM | Post #13
QUOTE(malleus @ Oct 2 2015, 07:41 PM)
ABBYY does have a Linux command line version. In fact, that's the only reason I picked it. Not sure about Nuance though, never tried it.

Hey, thanks a lot!

BTW: Nuance is the current owner of OmniPage.

CounteReborn (TS)	Oct 5 2015, 06:31 PM | Post #14
Is there any other open-source OCR API you would recommend besides Tesseract? I would like to give it a try.

Palindromes	Apr 23 2020, 08:20 AM | Post #15
I want to understand whether it is possible to do OCR in C# without any third-party library. I have had limited success. For example, reading this image gives:

Input: THANK YOU VERY MUCH 250 BUCK
Output: THRAANKKYGOCDUVEEEEEEEERRFYMUOCCHR222Z55509GEBBUOCCKK

Because I scan the image (pixel by pixel) at positions too close to each other, there are duplicate characters. Moreover, some characters look nearly identical to each other, such as 2 and Z, or O and C.

What I did is a very primitive form:
1. Get the font bitmap (A..Z, a..z, 0..9)
2. Read the input image
3. Convert it to a B/W bitmap
4. Scan the B/W bitmap and compare it with the font bitmap

But it is difficult even to recognize the font bitmap generated by the C# program itself. So far it is only able to recognize 55 chars out of 62 in total.

Limitations:
Input: must have spacing between characters; only 18-point Arial is allowed (depends on the constant value in C#)
Output: no punctuation, no whitespace, etc.

So, I am having fun creating a basic form of OCR. What do you know about the logic behind OCR? I know OCR is far more complex than I thought, or than what I have achieved so far.
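One way to avoid the duplicated characters described above is to cut the image into one slice per character first, and only then compare each slice against the font bitmaps, instead of sliding a window across pixel by pixel. The sketch below is not the poster's C# program. It assumes tiny made-up 3x3 glyphs drawn with '#', at least one blank column between characters (the same limitation the post mentions), and hypothetical function names.

```python
# Made-up 3x3 "font bitmaps"; '#' is ink, '.' is background.
TEMPLATES = {
    "L": ["#..", "#..", "###"],
    "T": ["###", ".#.", ".#."],
    "O": ["###", "#.#", "###"],
}


def split_columns(rows):
    """Return (start, end) column ranges, one per run of columns containing ink."""
    width = len(rows[0])
    inked = [any(row[x] == "#" for row in rows) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                      # a character begins here
        elif not has_ink and start is not None:
            spans.append((start, x))       # blank column: the character ended
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def mismatch(glyph, template):
    """Count pixels that differ; glyphs of a different size never match."""
    if len(glyph) != len(template) or len(glyph[0]) != len(template[0]):
        return float("inf")
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(rows):
    """Segment the image, then pick the closest template for each segment."""
    text = []
    for start, end in split_columns(rows):
        glyph = [row[start:end] for row in rows]
        best = min(TEMPLATES, key=lambda ch: mismatch(glyph, TEMPLATES[ch]))
        text.append(best)
    return "".join(text)


IMAGE = [
    "###.###.#..",
    ".#..#.#.#..",
    ".#..###.###",
]
```

Because each segment yields exactly one character, nothing is reported twice. Picking the template with the fewest differing pixels, rather than demanding an exact match, is also a simple way to cope with look-alike pairs such as 2 and Z, although it will still guess wrong when the two bitmaps are nearly identical.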

basilpaschal	Apr 23 2020, 08:22 AM | Post #16
Optical character recognition (OCR): use the Google Translate camera.

jibpek	Apr 23 2020, 08:27 AM | Post #17
It is possible, but don't waste your time. Your text counts as very nice, clean and perfectly aligned. 99% of the time your image will be dirty, misaligned or warped.

Palindromes	Apr 23 2020, 11:19 PM | Post #18
QUOTE(basilpaschal @ Apr 23 2020, 08:22 AM)
Optical character recognition (OCR): use the Google Translate camera.

That's instant OCR... amazing. They have polished their product very well over the years.

QUOTE(jibpek @ Apr 23 2020, 08:27 AM)
It is possible, but don't waste your time. Your text counts as very nice, clean and perfectly aligned. 99% of the time your image will be dirty, misaligned or warped.

Yes, you're definitely right.

aBcD-|	Apr 23 2020, 11:24 PM | Post #19
First, use Tesseract 4.0, which supports C++ primarily and Python secondarily. The output accuracy is quite decent; most of the time the work is in pre-processing the image so that you give the Tesseract engine a clean sample.

Second, the methodology of image/text processing is not right, because the sample you provided is an ideal test case (see the comment above). You can refer to OpenCV for that.

Third, when it comes to image processing tasks, consider C++ as the primary choice. You need to deal with pipeline processing rather than object-oriented tasks. So yeah, it's pretty much a waste of time to implement something that already exists in the first place.

Palindromes	Apr 23 2020, 11:28 PM | Post #20
I think I will just throw this OCR project into the code museum... haha

If you find it interesting, here's the OCR.cs C# source file that I wrote from last night until early this morning (renaming OCR.txt to OCR.cs / Program.cs will do). Except for the part that gets the font bitmap (g.DrawString(,myFont,,,)....), which was taken from StackOverflow, the rest of the code is my own work.

Attachment: OCR.txt (12.11k)

This C# code creates 62 different font bitmaps (A..Z, a..z, 0..9) in memory (with an option to save each of them as an image file on the user's desktop), converts them to B/W bitmaps, then matches each font bitmap against the 62 characters. So far, you'll see the code manages to recognize only 55 of the 62 characters. (It loops by itself until the end, where you'll be prompted to press Enter to exit.)

It is not advisable to use this C# code to perform OCR on even the simplest image with characters, because (1) it is slow and (2) it is inaccurate. But at least I had fun creating this basic form of OCR to recognize the 62 characters it created itself.
