 Tesseract, OCR

TSCounteReborn
post Sep 29 2015, 02:37 AM, updated 11y ago

Getting Started
**
Junior Member
205 posts

Joined: Nov 2012


Hello Buddies,

Recently I've been working with the Tesseract OCR engine, and it doesn't seem to be working well for me. It scans out a bunch of funny words. Of course, if I use a cleaner font it will scan maybe 70% of them correctly?

When it comes to IC, I see junk scanned out. Disappointed.

Does anyone have experience with this?
Please share your experience with me, bro.
malleus
post Sep 29 2015, 10:55 AM

Look at all my stars!!
*******
Senior Member
2,096 posts

Joined: Dec 2011
some fonts do indeed scan better than others. this is not a tesseract problem, but is a common problem for all OCR, including the commercial ones. although they do differ in terms of output quality still

what's the quality of your input image like? do you do image cleanups on it?

you can probably try something like this: http://www.fmwconcepts.com/imagemagick/textcleaner/

to clean up the image to make the text clearer for the OCR to process

apart from that, have you tried doing tesseract training?
zeb kew
post Sep 29 2015, 11:01 AM

Look at all my stars!!
*******
Senior Member
2,325 posts

Joined: Sep 2015
CounteReborn, post your image here and let us see.

If by "IC" you mean the MyKad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that Tesseract only sees the number in front of a blank background.

Restrict the recognition to numbers only, to improve the recognition rate.
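
For reference, restricting Tesseract to a set of characters is normally done by passing the `tessedit_char_whitelist` config variable with `-c`. A minimal sketch, assuming the file names below (they are placeholders): it only builds the command line and does not run Tesseract. How strictly the whitelist is honoured varies between Tesseract versions, so check the documentation for the one you have installed.

```python
def tesseract_digits_command(image_path, output_base, whitelist="0123456789"):
    """Build (but do not run) a tesseract command limited to `whitelist`.

    General form: tesseract <image> <outputbase> [options]
    `image_path` and `output_base` are placeholder names for illustration.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


cmd = tesseract_digits_command("ic_number.png", "ic_number")
print(" ".join(cmd))
```

To actually run it you could hand `cmd` to `subprocess.run(cmd, check=True)`, provided the `tesseract` binary is on your PATH.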
narf03
post Sep 29 2015, 11:21 PM

Look at all my stars!!
*******
Senior Member
4,547 posts

Joined: Dec 2004
From: Metro Prima, Kuala Lumpur, Malaysia, Earth, Sol


I attempted that a while ago but failed. Many OCR engines try to recognize dictionary words, so if you do number plates, names, etc. it will be a big failure. It doesn't really matter much if you change the font.
zeb kew
post Sep 30 2015, 10:21 AM

QUOTE(narf03 @ Sep 29 2015, 11:21 PM)
Attempt to do that a while ago, but failed, many of the OCR try to recognize dictionary words, so if you do number plates, name, etc will be big failure. It doesnt really matter much if you change font.
*
I think there is a way to force Tesseract to limit it to only some characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why its recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file can be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.

This post has been edited by zeb kew: Sep 30 2015, 10:22 AM
narf03
post Sep 30 2015, 03:55 PM



QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit it to only some characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why it's recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file can be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.
*
From wiki
https://en.wikipedia.org/wiki/Tesseract_(software)

QUOTE
The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, Dutch, English, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Spanish, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.
zeb kew
post Sep 30 2015, 04:14 PM

QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
Boy, really outdated am I!

When I used it, it didn't use a dictionary. It definitely was not in version 3. Can't remember if it was 1 or 2. When I downloaded it, I noticed that the project had been inactive and there had been no updates for a long time. Looks like somebody has picked it up again.

It definitely couldn't do Tamil or Thai back then. Only roman/latin characters at the time.

Not sure if the languages in the paragraph you quoted refer to it using dictionaries or merely to the text/characters. It could be referring only to the characters recognized, because different languages have different characters. E.g., even French and Spanish have many characters that are not present in English.
TSCounteReborn
post Oct 1 2015, 09:37 PM



QUOTE(malleus @ Sep 29 2015, 10:55 AM)
some fonts do indeed scan better than others. this is not a tesseract problem, but is a common problem for all OCR, including the commercial ones. although they do differ in terms of output quality still

what's the quality of your input image like? do you do image cleanups on it?

you can probably try something like this: http://www.fmwconcepts.com/imagemagick/textcleaner/

to clean up the image to make the text clearer for the OCR to process

apart from that, have you tried doing tesseract training?
*
Nope, haven't tried that yet. Mind giving me a brief explanation? It seems like a lot to understand.
https://code.google.com/p/tesseract-ocr/wik...iningTesseract3


QUOTE(zeb kew @ Sep 29 2015, 11:01 AM)
CounteReborn, post your image here and let us see.

If by "IC" you mean the Mykad, there are blue coloured patterns behind the number. The IC number itself is black. You can filter out the background so that tesseract only sees the number in front of a blank background.

Restrict the recognition to numbers only, to improve the recognition rate.
*
Yes, I'm referring to the MyKad as IC.
I cannot restrict recognition to numbers only; I need to get names and addresses as well. But there are other words and colours that distract the engine from extracting them.


QUOTE(zeb kew @ Sep 30 2015, 10:21 AM)
I think there is a way to force Tesseract to limit it to only some characters.

And IIRC, Tesseract does not use dictionary lookup/spellcheck, which is why it's recognition rate for normal text is pretty poor.

Can't remember the command line option now. We used Tesseract a couple of years back in a project to scan receipts. The OCR was only to pick out the receipt number so that the image file can be stored with the appropriate filename/receipt number. With additional code to pick out the receipt number and clean up the area around it, the error rate was less than 10 per 10,000 receipts. A lot depends on your image quality. And there is an optimal size for the characters (in number of pixels). It's in the documentation. Too small or too large, and the recognition rate drops.
*
Didn't know Tesseract was famous a few years back, probably because it's open source...
May I know how you limit it to get the text from a certain area only?


QUOTE(narf03 @ Sep 30 2015, 03:55 PM)
Yes, you are right. There are a lot of engines now supporting different languages. But sadly, in my case I've tried Malay as well, and it doesn't seem to work any better.

malleus
post Oct 1 2015, 10:30 PM

QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Nope, haven't try that yet. mind give some brief understanding on that? Seems lot to understand.
yup, it's quite a bit to digest initially. and the training itself can be quite some work as well.

in a nutshell: you scan a sample, then check the results, correct the characters that were recognised incorrectly, then feed the corrected data back to the OCR to train it, so hopefully it'll be smarter the next time.

but like I said, it's tedious work, which is why we gave up on it and went with abbyy OCR instead.
zeb kew
post Oct 2 2015, 04:05 PM

QUOTE(CounteReborn @ Oct 1 2015, 09:37 PM)
Yes, I'm referring MyKad as IC.
I cannot restrict only numbers recognition, I need to get names and address as well. But there are some others words and colors distracting the engine to extract them out.

Yes you can. The IC number appears at a specific position on the IC. It is always at the same position. So, your code will crop out everything else, leaving only the number. Pass THAT image to Tesseract, telling it to recognize numbers only.

Then, your code takes the original image, crops out everything except the name and address, and passes THAT to Tesseract, telling it to recognize alphanumerics and some punctuation marks like "/" and ".".
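
A rough sketch of that two-pass idea, assuming the image is held as a plain list of pixel rows. The field boxes and the tiny stand-in "card" are made up for illustration; on a real scan you would measure the coordinates yourself and crop with whatever imaging library you use.

```python
import string

# Hypothetical field boxes as (left, top, right, bottom) in pixels, paired
# with the character set each region should be recognized with.
# The coordinates are invented for this toy example.
FIELDS = {
    "ic_number": {"box": (0, 0, 4, 1), "whitelist": string.digits},
    "name_address": {
        "box": (0, 1, 4, 3),
        "whitelist": string.ascii_letters + string.digits + " /.",
    },
}


def crop(image, box):
    """Return the sub-image inside box; right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def split_fields(image, fields=FIELDS):
    """Cut one scanned card into a separate image per field."""
    return {name: crop(image, spec["box"]) for name, spec in fields.items()}


# A 4x3 stand-in for a scanned card; each number represents one pixel.
card = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
parts = split_fields(card)
print(parts["ic_number"])     # [[1, 2, 3, 4]]
print(parts["name_address"])  # [[5, 6, 7, 8], [9, 10, 11, 12]]
```

Each cropped piece would then go to Tesseract on its own, with the matching whitelist from `FIELDS`.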

You should filter out the background using colour. The background near the text is blue and white, and the text is black. Open the image in Photoshop and check the Red, Green and Blue channels to see which one reduces the clutter. You might also adjust the brightness and contrast, and scale the image so that each character has the proper pixel height to present to Tesseract.

Then you have to replicate what you did in Photoshop programmatically, using whichever language and tool you're most familiar with. Fred's scripts (the textcleaner link malleus posted) are very good, but can sometimes be a bit hard to understand. Or you could use ImageMagick directly. Note that the ImageMagick project forked many years back due to a developer dispute, and there is an alternative named GraphicsMagick.
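
To show what the channel check is doing, here is a toy sketch in plain Python. The pixel values are invented (a bluish background, white, and near-black text) and the cutoff of 128 is an assumption; a real pipeline would do the same thing with ImageMagick, GD or similar.

```python
CHANNELS = {"red": 0, "green": 1, "blue": 2}


def extract_channel(image, channel):
    """Keep a single colour channel of an RGB image (rows of (r, g, b))."""
    idx = CHANNELS[channel]
    return [[pixel[idx] for pixel in row] for row in image]


def threshold(gray, cutoff=128):
    """Values darker than cutoff become 1 (ink); the rest become 0 (paper)."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


# Made-up sample colours: blue pattern, white gap, black text.
BLUE, WHITE, BLACK = (40, 90, 220), (255, 255, 255), (10, 10, 10)
image = [
    [BLUE, WHITE, BLACK, BLUE],
    [WHITE, BLACK, BLACK, WHITE],
]

for name in CHANNELS:
    print(name, threshold(extract_channel(image, name)))
```

With these sample colours the blue pattern is dark in the red and green channels, so it survives thresholding as "ink", while in the blue channel it is bright and drops out, leaving only the black text. On a real scan, which channel works best is something to check by eye, as described above.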

Yes, this is a lot of work. The payoff is that you get very high recognition rates if you can get it to work. Otherwise, if you simply pass the entire image to Tesseract, photo and all, you're likely to get garbage back out.

For testing, you can just try doing the cropping and tweaking of the image in Photoshop. Then pass the image to Tesseract. Do a few tests and see if the result is acceptable. Note that you should convert the image to grayscale. That makes it easier for you to see how Tesseract "sees" the image.

QUOTE
May I know how can you limit it to get certain area's text only?

What I did was to crop the image and send only that portion to Tesseract. I used PHP since I was more familiar with it and its GD tools.

QUOTE
Yes, you are right. There are a lot of engines now supporting different languages. But sadly, for my cases I've tried Malay as well. It doesn't seems to work better anyway
*

Dictionaries are worse than useless if you're trying to recognize names and addresses. Turn them off. They'll mess up the recognition.
zeb kew
post Oct 2 2015, 04:23 PM

QUOTE(malleus @ Oct 1 2015, 10:30 PM)
yup, its quite a bit to digest initially. and it can be quite some work to do on the training as well.

in a nutshell: you scan a sample, then check the results, do correction on the incorrect characters that's recognised, then feed the corrected data back to the OCR to train it, so hopefully it'll be smarter the next time.

but like I said, its tedious work, which is why we gave up on it and went with abbyy OCR instead.
*

Does Abbyy come with a command line client or some kind of API so that we can bolt it on as part of a larger project?

One thing I was concerned about is that while the commercial stuff's recognition rate was significantly higher, they're also way heavier (in terms of resources consumed by the program, load times, etc). So if we have 10,000 images to process each day, the computer would just choke on simply loading and unloading the program for each image.

Any idea which of the commercial OCR programs have API / command line clients?

And comparing Nuance and Abbyy, which has the better recognition rate when it comes to random data (names, numbers, etc.), i.e. raw recognition before any dictionary lookup? I'm thinking of using it on old experimental papers, which contain lots of names and numbers. I'm also wondering if it would help to run the same thing through 2 or 3 independent OCR programs and then compare the outputs to detect differences. Would this reduce the net error rate? I'm wondering how to push the error rate down to near zero.
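
The comparison idea could be sketched as a character-by-character majority vote. This is only an illustration under strong assumptions: the outputs are already aligned to the same length, and the engines make mostly independent mistakes. The sample strings are invented, not real OCR output.

```python
from collections import Counter


def vote(outputs):
    """Character-wise majority vote over equal-length OCR outputs.

    Returns (merged_text, disputed), where disputed lists the positions at
    which the engines did not all agree.
    """
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("outputs must have equal length; align them first")
    merged, disputed = [], []
    for position, chars in enumerate(zip(*outputs)):
        best, count = Counter(chars).most_common(1)[0]
        merged.append(best)
        if count < len(outputs):
            disputed.append(position)
    return "".join(merged), disputed


# Three made-up outputs for the same field; the second engine read 2 as Z.
text, disputed = vote(["A1234", "A1Z34", "A1234"])
print(text, disputed)  # A1234 [2]
```

The disputed positions are probably the more useful result: rather than trusting the vote, they can be sent for manual checking. If all engines make the same mistake the vote cannot catch it, so this lowers the error rate only to the extent that their errors differ.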
malleus
post Oct 2 2015, 07:41 PM

QUOTE(zeb kew @ Oct 2 2015, 04:23 PM)
Does Abbyy comes with a command line client or some kind of API so that we can bolt it on as part of a larger project?

One thing I was concerned about is that while the commercial stuff's recognition rate was significantly higher, they're also way heavier (in terms of resources consumed by the program, load times, etc). So if we have 10,000 images to process each day, the computer would just choke on simply loading and unloading the program for each image.

Any idea which of the commercial OCR programs have API / command line clients?

And comparing Nuance and Abbyy, which have better recognition rate when it comes to random data (names, numbers, etc) (raw recognition before using dictionary lookup)? Thinking of using it on old experimental papers (contains lots of names and numbers). I'm also wondering if it would help to run the same thing through 2 or 3 independent OCR programs, and then comparing the outputs to detect differences. Would this reduce the nett error rate? I'm wondering how to push down the error rate to near zero.
*
abbyy does have a linux command line version. in fact that's the only reason why I chose it

not sure about nuance though, never tried it
zeb kew
post Oct 5 2015, 11:41 AM

QUOTE(malleus @ Oct 2 2015, 07:41 PM)
abbyy does have a linux command line version. in fact that's the only reason why I picked to use that

not sure about nuance though, never tried it
*
Hey, thanks a lot!
BTW: Nuance is the current owner of Omnipage.
TSCounteReborn
post Oct 5 2015, 06:31 PM



Is there any other open-source OCR API you would recommend besides Tesseract? I would like to give it a try.
Palindromes
post Apr 23 2020, 08:20 AM

Getting Started
**
Validating
157 posts

Joined: Jan 2020
I want to understand if it is possible to do OCR in C# without any third-party library.

I have had limited success. For example, reading this image results in: THRAANKKYGOCDUVEEEEEEEERRFYMUOCCHR222Z55509GEBBUOCCKK

user posted image

user posted image

Input: THANK YOU VERY MUCH 250 BUCK
Output: THRAANKKYGOCDUVEEEEEEEERRFYMUOCCHR222Z55509GEBBUOCCKK

Because I scan the image (pixel by pixel) too close to each other, there are duplicate characters.

Moreover, some characters are identical to each other, such as 2 and Z, O and C.

What I did is a very primitive form of OCR:
1. Get the font bitmap (A..Z, a..z, 0..9)
2. Read the input image
3. Convert it to B/W bitmap
4. Scan B/W bitmap and compare with font bitmap
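
The four steps above could be sketched like this in Python (not the original C# code). It is a toy under stated assumptions: the 3x3 glyphs are invented, characters are separated by at least one blank column, and every glyph has exactly the same size as its template. Cutting the line into glyphs first means each character is compared once, which avoids the duplicate characters caused by scanning overlapping positions.

```python
def parse(rows):
    """Turn rows of '#' and '.' into a B/W bitmap of 1s and 0s."""
    return [[1 if cell == "#" else 0 for cell in row] for row in rows]


def split_glyphs(bitmap):
    """Cut a B/W bitmap into glyphs at fully blank columns."""
    width = len(bitmap[0])
    inked = [any(row[x] for row in bitmap) for x in range(width)]
    glyphs, start = [], None
    for x, has_ink in enumerate(inked + [False]):  # sentinel closes last glyph
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append([row[start:x] for row in bitmap])
            start = None
    return glyphs


def distance(a, b):
    """Count differing pixels; infinite when the two shapes differ in size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return float("inf")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(bitmap, templates):
    """Label each glyph with the template that has the fewest differences."""
    return "".join(
        min(templates, key=lambda ch: distance(glyph, templates[ch]))
        for glyph in split_glyphs(bitmap)
    )


# Invented 3x3 "font"; a real one would be rendered from the actual typeface.
TEMPLATES = {
    "I": parse(["###", ".#.", "###"]),
    "L": parse(["#..", "#..", "###"]),
    "T": parse(["###", ".#.", ".#."]),
}
line = parse([
    "###.#...###",
    ".#..#....#.",
    ".#..###.###",
])
print(recognize(line, TEMPLATES))  # TLI
```

Look-alike pairs such as 2/Z or O/C end up only a few pixels apart under this kind of distance, which is one reason exact bitmap matching struggles on them.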

But it is difficult even to recognize the font bitmap generated by the C# program itself. So far only able to recognize 55 chars out of 62 in total.

Limitation:
Input: Must have spacing between characters, only 18-point Arial font type allowed (depends on the constant value in C#)
Output: No punctuation, no whitespace, etc.

So, I am having fun creating the basic form of OCR. What do you know about the logic behind OCR? I know OCR is far more complex than I thought or what I have achieved now.

basilpaschal
post Apr 23 2020, 08:22 AM

Getting Started
**
Junior Member
178 posts

Joined: May 2013


optical character recognition (ocr) - use google translate camera
jibpek
post Apr 23 2020, 08:27 AM

Enthusiast
*****
Junior Member
708 posts

Joined: Jul 2012
It is possible, but don't waste your time.

Your text is very nice, clean and perfectly aligned.

99% of the time your image will be dirty, misaligned or warped.
Palindromes
post Apr 23 2020, 11:19 PM

QUOTE(basilpaschal @ Apr 23 2020, 08:22 AM)
optical character recogbition (ocr) - use google translate camera
*
That's instant OCR.... amazing. They have polished their product very well over the years.


QUOTE(jibpek @ Apr 23 2020, 08:27 AM)
It is possible, but don't waste your time.

You text is consider very nice, clean and perfectly aligned.

99% of the time your image will be dirty, not align or warp
*
Yes, you're definitely right.


aBcD-|
post Apr 23 2020, 11:24 PM

Enthusiast
*****
Senior Member
935 posts

Joined: Dec 2010
First, use Tesseract 4.0, which supports C++ primarily and Python secondarily. The output accuracy is quite decent, and most of the time you have to pre-process the image and provide a clean sample to the Tesseract engine.

Second, the image/text processing methodology is not right, because the sample you provided is an ideal test case (see the comment above). You can refer to OpenCV for that.

Third, when it comes to image processing tasks, consider C++ as the primary choice. You need to deal with pipeline processing rather than object-oriented tasks, so yeah, it's pretty much a waste of time to implement something that already exists in the first place.
Palindromes
post Apr 23 2020, 11:28 PM

I think I will just throw this OCR project into the code museum... haha

If you find it interesting, here's the OCR.cs C# source file (renaming OCR.txt to OCR.cs or Program.cs will do) that I wrote from last night until early this morning.

Except for the part that gets the font bitmap (g.DrawString(,myFont,,,)....), which was taken from StackOverflow, the rest of the code is my own work.

Attached File  OCR.txt ( 12.11k ) Number of downloads: 19


This C# code creates 62 different font bitmaps (A..Z, a..z, 0..9) in memory (with an option to save each of them as an image file on the user's desktop), converts each one to a B/W bitmap, then matches each font bitmap against the 62 characters.

So far, you'll see the code managed to recognize 55 out of total 62 characters only. (It will loop by itself, until the end, where you'll be prompted to press Enter to exit)

It is not advisable to use this C# code to perform OCR on even the simplest image with characters....because (1) it is slow, (2) it is inaccurate.

But at least I had fun creating this basic form of OCR that recognizes the 62 characters it generates itself.



