Linux OCR: is it worth it?

Date: 22 Jul, 2010
Posted by: admin
In: hints & tips|linux, open source & software

Here’s a review of the current state of OCR on Linux distros from a user’s perspective. I’ve looked at OCR for Linux briefly before, when considering PDF editing and OCR of text-as-image in PDF documents, but that isn’t really relevant here.

Note this article is not yet finished!

Optical Character Recognition

OCR – What is it?

OCR applications take an image of text as their input and output the textual equivalent, or at least the application’s best guess. Simples. Applications for this sort of thing should be readily apparent: editing documents that you scan in, performing automated content tagging, indexing images by textual content, reading receipt data, that sort of thing.

Popular Linux OCR options

The most popular OCR projects under Linux are, in no particular order:

  • Ocrad (from the Ocrad manual)
    GNU Ocrad is an OCR (Optical Character Recognition) program and library based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats. The pbm, pgm and ppm formats are collectively known as pnm. Ocrad includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.
  • GOCR, aka JOCR, is a SourceForge-hosted project that has a Tcl frontend; it was last updated 2009-08-04 but is under active development according to its tracker.
  • Tesseract, for which frontends include gimagereader, is available for Linux, Windows and Mac OS X. From the Google Code page:
    The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.
  • easy-ocr

At least the four above are available in the standard Ubuntu repos and so can be installed on Ubuntu and related distros using apt.

  • ABBYY FineReader, along with ABBYY’s other offerings, is only available for MS Windows and Mac systems, despite ABBYY offering a Linux-based OCR engine for other developers to use.
  • Readiris is for MS Windows and Mac only.
  • Gamera is not an app but a toolkit/library/framework for use in OCR, developed with support from Johns Hopkins University.
  • Kooka is/was a raster image scanning program that was in the KDE graphics kit but was dropped due to inactivity in development when KDE4 was released. It used Ocrad as the default backend. KDE now has libkscan for image apps to hook into for scanning duties.
  • CuneiForm, aka OpenOCR, for which there is a Qt frontend, is an open-sourced project available in Russian and English language versions.
  • OCRopus is not a desktop tool but is developed for high throughput, large volume, OCR systems. From the Google Code page:
    OCRopus(tm) is a state-of-the-art document analysis and OCR system [sponsored by Google], featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
  • Pixel Engine OCR from Vividata is a command-line OCR app designed for high-throughput needs and targeted at Linux and Solaris.
  • Asprise OCR provides several SDKs for different platforms; amongst these is a Java offering that itself includes a demo application that can perform OCR on image inputs.

It seems that greyscale 400dpi scans are best received and so any testing I’ve done (towards the end) has been using this initial scanning format.

Previous reviews and studies

The Information Science Research Institute (ISRI) at the University of Nevada Las Vegas (UNLV) has performed studies in the past of various pieces of OCR software with a focus on open source offerings. Some of their studies, from 1992-1996, their test samples and current OCR testing kit (OCRtk) can be found on their website. The latest entry there is for 2007 so any results there are likely out of date by now (2010).

As an additional resource I’ve block-quoted a document previously found at http://groundstate.ca/ocr which is no longer available there. It was written in 2007 and did quite detailed testing of some of the options here. Clearly things have moved on but I just wanted to archive the resource here in case anyone finds it useful (I did).


Linux OCR: A review of free optical character recognition software
Submitted by Austin on Sat, 2007/05/19 – 15:02

* Tech/Geek

by Austin Acton

I’ve used Linux as my full-time desktop for seven years now. I have almost no reason to use Windows (other than stupid ExamSoft), and even when I do, I don’t have much Windows software available. The one “hole” in my workflow has been OCR. For years, people have been able to scan a document and have it converted into real text. One of my old printers even came with OCR software included – for Windows of course. But when I’ve really needed OCR, I’ve just assumed that there were no high quality packages available for Linux.

Recently I decided to find out for myself (a complete OCR virgin) what is available, how to use it, and what the results are like. I installed every free OCR package I could find, and systematically tested them. They all work very differently, so I tried to design a simple test for my specific needs.

As a test subject, I entered the following line into a word processor:

The quick brown Métis jumped over the fluffy Finance Manager.

I specifically included an accented character, some capital letters, and the “fl” and “ff” combos which tend to overlap in serif fonts. I then printed the sentence sixteen times: in each of Times New Roman, Vera Serif, Arial, and Vera SansSerif, at 10, 12, 14, and 16 point size. My assumption was that larger text would be better interpreted by the OCR software. More on that later. Then I simply scanned the printout at various resolutions using XSane, in both black and white (binary) and greyscale.

The printouts looked a little something like this:

Black and White (Binary)

Greyscale

Many of the OCR results were totally surprising to me. Let’s take a look.

Resolution and File Size

My scanner can handle up to 9600 DPI, which is overkill for just about anything. However, I wasn’t sure what resolution would produce optimal OCR results. I assumed that the higher, the better. If that were true, file size would begin to be a problem at high resolutions. Here’s a comparison of the file sizes in the resolutions I scanned. (Note that I only scanned half the page.)

As you can see, they increase exponentially, and greyscale images are significantly larger.
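As a rough check on how scan size scales with resolution, here is a minimal sketch that estimates the uncompressed size of a raster scan. It assumes half a US Letter page (8.5 × 5.5 inches), 8 bits per pixel for greyscale and 1 bit per pixel for binary, and it ignores file headers and row padding; the page dimensions and the name `raw_scan_bytes` are illustrative assumptions, not figures from the review. For an uncompressed image the pixel count, and therefore the size, grows with the square of the DPI, so doubling the resolution roughly quadruples the file.

```python
def raw_scan_bytes(dpi, width_in=8.5, height_in=5.5, bits_per_pixel=8):
    """Estimate the uncompressed size in bytes of a scan (headers ignored).

    The default page size is an assumption: half of a US Letter sheet.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (100, 200, 400, 800, 1600):
        grey = raw_scan_bytes(dpi, bits_per_pixel=8)
        binary = raw_scan_bytes(dpi, bits_per_pixel=1)
        print(f"{dpi:>5} DPI  grey {grey / 1e6:8.1f} MB  binary {binary / 1e6:8.1f} MB")
```

Under these assumptions a 400 DPI greyscale half page comes to about 7.5 MB, four times the 200 DPI figure.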

To test the effect of resolution on OCR accuracy, I used the default OCR application for Linux, gocr. I also used the Times New Roman line for this test, as that’s what I’m most often having to scan. The results were very surprising. Here are the accuracies at various font sizes:

Times New Roman 10:

Times New Roman 12:

Times New Roman 16:

Much to my surprise, there is an optimal resolution of approximately 400-600 DPI, above and below which the accuracy decreases. In fact, at very high resolutions, gocr was unable to read any text at all. Maybe this is a limitation of the software; I’m not sure.

Greyscale scans were slightly better interpreted by the software. The test made me feel confident that using 400 DPI greyscale would be a fair standard to compare programs with, and that it was the best tradeoff between file size and quality.

Typeface

Using gocr at 400 DPI greyscale again, I compared the results of each font. Here’s a table of the number of errors in each line:
Font Size   Times   Vera Serif   Arial   Vera SansSerif
10          6       5            3       2
12          4       6            4       2
14          6       6            2       3
16          8       6            5       2

Clearly, sans-serif fonts are much more reliable, which makes sense, as all those little serifs all over the place are bound to confuse the OCR engine at some point. It’s also interesting that the free Vera fonts consistently performed better than the standard MS fonts. I assume this is because they are a bit wider, but again, I’m not sure.
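Summing the error table by typeface and by point size makes the pattern easier to see. The sketch below only re-adds the numbers from the table above; the names `ERRORS`, `totals_by_typeface` and `totals_by_size` are illustrative.

```python
# Error counts copied from the table above: keys are point sizes,
# inner keys are typefaces.
ERRORS = {
    10: {"Times": 6, "Vera Serif": 5, "Arial": 3, "Vera SansSerif": 2},
    12: {"Times": 4, "Vera Serif": 6, "Arial": 4, "Vera SansSerif": 2},
    14: {"Times": 6, "Vera Serif": 6, "Arial": 2, "Vera SansSerif": 3},
    16: {"Times": 8, "Vera Serif": 6, "Arial": 5, "Vera SansSerif": 2},
}


def totals_by_typeface(table):
    """Add up the errors for each typeface across all point sizes."""
    totals = {}
    for row in table.values():
        for face, count in row.items():
            totals[face] = totals.get(face, 0) + count
    return totals


def totals_by_size(table):
    """Add up the errors for each point size across all typefaces."""
    return {size: sum(row.values()) for size, row in table.items()}


if __name__ == "__main__":
    print(totals_by_typeface(ERRORS))
    # Times 24, Vera Serif 23, Arial 14, Vera SansSerif 9
    print(totals_by_size(ERRORS))
    # 10pt 16, 12pt 16, 14pt 17, 16pt 21
```

The totals show the two sans-serif faces well ahead of the two serif faces, with little to choose between Times and Vera Serif overall.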

I was also very surprised that the size of the font did not have the effect I predicted, i.e. larger fonts are not interpreted any better than the smaller ones. I was happy to see that for Times New Roman, the font I most frequently scan, size 12 is one of the best sizes, which is also the size I most frequently scan. Unfortunately though, Times is clearly the least reliable typeface.

Again I should note that these numbers are likely affected by the choice of gocr for the test. It may be “tuned” to 12 point font or not, but again, it made me feel confident that Times 12 was a good standard for the rest of the tests.

So now, on to the software…

gocr

Gocr has been the de facto standard for free OCR for some time now; for example, xsane can interact directly with it. It has very few bells or whistles however.

Gocr was easy to install, being included with Mandriva 2007 (Spring Edition)*.

The command line interface is dead easy to use. A simple:

$ gocr file.pnm

dumps the text to the screen. Couldn’t be much simpler than that. CLI interfaces make batch processing very easy, and allow for integration with other applications like xsane.

Various advanced options are possible, including grey level, dust size, space width, and certainty level.

Gocr does come with a crude graphical interface written in tcl/tk.

It does not provide any functionality unavailable on the command line, and I didn’t really enjoy using it.
Name: gocr
Location: http://jocr.sourceforge.net/
Version: 0.44
Input Format: pnm
Accuracy: 94%
Ease of Use: 4/5

Clara

Next up is something totally different. Clara OCR has a history as long as gocr’s, but the implementation is as different as can be. First off, there is no command line interface. This makes batch processing difficult and eliminates the possibility of server-side use. But more notably, it has to be “trained”. In other words, to use it, one must scan some text, and then teach the software character-by-character. This can obviously lead to higher accuracies than ready-to-use OCR engines, but the lack of out-of-the-box usability made me rule this option out for my own use, and hence leave it out of the actual review.

Regardless, it may be an option for you – provided you have a very specific, repetitive task requiring high accuracy, and a lot of patience.

Just as a warning, the GUI is horrible – one of the worst I’ve ever seen, or used. Not only is it ugly, but it’s totally unintuitive, including things like tabs which cycle through multiple clicks, and a confusing side-menu. It can’t even open documents outside of the directory it’s launched from. Here’s a screenshot for interest’s sake:

By the way, clara was included with Mandriva 2007, so required no compiling on my part.
Name: Clara OCR
Location: http://www.geocities.com/claraocr/
Version: 20031214
Input Format: pbm
Ease of Use: 0/5

Ocre

Ocre was not currently available in Mandriva Linux so I had to compile it myself. This was fairly easy, requiring only gtk2 and aspell development libraries to be installed. I had to download two tarballs, and run make. No big deal there.

Ocre is an interesting combination of concepts from gocr and Clara. It is run from the command line, and outputs the results directly to the terminal, but a rudimentary gtk2 interface pops up when there is an unknown character, asking for a suggestion. I’m not sure if that actually “teaches” the program to interpret that character in the future, or simply fills in the missing holes. The web site is short on information, and written in poor English. I do know that it comes with a lot of pre-packaged information for the OCR engine to use in different languages, and can also integrate with the aspell spell checker for help.

I tried the software on various files, and it crashed constantly. The gui would pop up, asking for a keypress or click, but entering any information resulted in a segfault. Trying to run the program with spell checking support resulted in an immediate segfault. This was a shame, since I’m very curious to know what the accuracy would have been – in other words, how many characters would be interpreted without user-input.

Ocre is a very interesting project, but unstable, and under-developed.
Name: ocre
Location: http://lem.eui.upm.es/ocre.html
Version: 0.026
Input Format: pgm/pbm
Ease of Use: 3

Ocrad

Ocrad is the GNU implementation of an OCR package. It does not come with its own graphical interface, and the command line interface is very similar to gocr. One enters a filename, and the text is dumped to standard output. There are fewer customization options than gocr: mainly only grey level.

The OCR results were similar to gocr as well. Scans at 200 DPI and lower and at 1200 DPI and higher dropped significantly in accuracy of interpretation. Grey and binary results were very similar to each other. And again, sans serif fonts were slightly more reliable than serif.

Installation was extremely simple, as it’s included with Mandriva; compiling should be simple enough too, and it’s only a single binary. There’s an info page and no man page, which I found a bit strange, and running it without any arguments just results in a silent hang. Any daily Linux user should be smart enough to try a [-h] or [--help], but still, an oversight.

On the Times 12 test, accuracy topped out at 97%. Very nice – with one caveat. One of the two unrecognized characters was the é – perhaps it only recognizes US-ASCII. Otherwise, it may perform 98%+.
Name: Ocrad
Location: http://www.gnu.org/software/ocrad/ocrad.html
Version: 0.15
Input Format: pbm/pgm
Accuracy: 97%
Ease of Use: 4/5

Tesseract

Recently, HP (one of Free software’s good friends, when convenient) has released some OCR code they developed between 1985 and 1994 called Tesseract, under the Apache license. A group of volunteers have created a new home for it at Google Code, and it is now under active development again. According to the bundled self-promotion, it was one of the best performing OCR engines of its day, so if the development is continued, it looks as if it may become a great gift to the community.

As Tesseract is too new to have been included with Mandriva 2007, I had to compile it myself. Other than tiff-devel, it required very few development libraries.

I downloaded the tarball from the Google Code site, and was pleased to see that the build is based on GNU autotools. (Not that I love autotools, but many old packages just have an ugly, edit-it-yourself makefile.) A simple `./configure; make` had it building. The build failed at several points, seemingly caused by the very recent version of gcc we are using, but patches were available, easily applied, and the build continued without problems.

Tesseract is an engine and framework for other software to build upon, so it only supports a single column of very horizontal text. It does however seem to include some extra code for viewing, training, and spell checking the interpreted text.

Usage is a bit rough, but understandably so for such young and active software. There is no man page, no [-h] or [--help] option, and it crashes if launched without proper arguments. Again, the Google Code website provided much help. It seems to only support tiff at the moment, so I had to use ImageMagick to convert all of my scanned pnm’s to tiffs, with:

$ for i in *.pnm; do convert "$i" "$i.tiff"; done

Running the software required a simple:

$ /usr/local/bin/tesseract g400.tiff g400.txt
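For larger batches, the two steps above can be chained in a small script. The following is a minimal sketch, assuming that ImageMagick’s `convert` and `tesseract` are both on the PATH, and that tesseract treats its second argument as an output base name and appends `.txt` itself (which is how the releases I know of behave). The function names `ocr_commands` and `ocr_directory` are illustrative. Passing each command as an argument list rather than a shell string means file names containing spaces need no extra quoting.

```python
import subprocess
from pathlib import Path


def ocr_commands(pnm_path):
    """Return the (convert, tesseract) argument lists for one .pnm scan."""
    src = Path(pnm_path)
    tiff = src.with_suffix(".tiff")
    # Assumption: tesseract adds ".txt" to this base name on its own.
    out_base = src.with_suffix("")
    return (
        ["convert", str(src), str(tiff)],
        ["tesseract", str(tiff), str(out_base)],
    )


def ocr_directory(directory="."):
    """Convert and OCR every .pnm file in a directory, one at a time."""
    for src in sorted(Path(directory).glob("*.pnm")):
        for cmd in ocr_commands(src):
            # check=True stops the batch at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    ocr_directory(".")
```

Keeping the command construction in its own function makes it easy to print the commands first and inspect them before running anything.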

Much to my surprise, the output was near perfect. There were a few errors at font size 10 and 16. In no case was the é interpreted properly. However, disregarding the é, font sizes 12 and 14 were perfectly interpreted in all four typefaces. Very impressive. I’m not sure if the unaccented e is a result of English-only spell checking or not.

While Tesseract is the least user-friendly of the command line applications, it is by far the most accurate, the most active, and the most promising.
Name: Tesseract
Location: http://code.google.com/p/tesseract-ocr/
Version: 1.04b
Input Format: tiff
Accuracy: 99%
Ease of Use: 2/5

Ocropus

Ocropus is the mother lode of Free OCR. It began as a combination of a handwriting analysis engine and a layout analysis engine. Tesseract has been integrated as the OCR engine, but it allows for other OCR engines to be plugged in as well. The project is now sponsored by Google.

According to the Google Code site:
OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

Anticipated future functionalities include:

* statistical, trainable layout analysis
* efficient OCR and layout correction, active learning
* a web services interface
* PDF, camera, and screen OCR
* support for additional languages
* integration with Beagle, Spotlight, Google Desktop Search
* GUI frontends for Gnome, Windows, Macintosh
* packaging for Ubuntu, Fedora, and other platforms

Wow. Those are some lofty ambitions! Although it is in a pre-alpha state currently, I couldn’t help but try it out.

As a tarball is not available, I checked out the development tree:

$ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus

In addition to tesseract, a few development libraries had to be installed, but this is a breeze with urpmi:

# urpmi jam aspell-devel libtiff-devel libpng-devel libjpeg-devel

Compiling was easy:

$ ./configure; jam

There is very little documentation bundled with the project. The only test I performed was a direct OCR of my 400 DPI greyscale standard. The default output of ocropus is HTML, with some embedded OCR-specific tags. As expected, the results were the same quality as Tesseract, with the é being the only incorrectly interpreted character. I’ve posted the resulting HTML file in case you’re interested.

Ocropus is easily the most interesting and promising project here, and with Google’s interest in OCR, their sponsorship may be very helpful.
Name: Ocropus
Location: http://code.google.com/p/ocropus/
Version: svn (20070523)
Input Format: many
Accuracy: 99%
Ease of Use: 1/5

Asprise OCR

Just for fun, I thought I’d like to compare all of this free software to something commercial. My subconscious Linux insecurities made me assume they would far outperform the free software. There was only one piece of commercial software that I could find which had a downloadable demo and ran on Linux (via Java VM in this case).

Asprise OCR is meant to be a development kit, but it does come with a (rather crappy) GUI as a demo. It was unable to read either pnm or pbm files, but luckily could interpret tiff. How were the results? Awful. Worse than gocr even.

Times 12.. The *uìck brown Métisjumped over the flurfy Finance Manager.
VeraS 12: The *uick brown Métís jumped over the fluf Finance Manager.
Arial 12.. The *uíck brown Métis jumped over the flu y Finance Manager.
VeraSS l2: The *uick brown Métis \code(0135)umped over the fluffy fínance Manager.

This result, as compared to Tesseract, reminds us that open can be better than closed, and Free can be better than freeware. Of course I realize that there are many commercial Windows applications that can do flawless OCR on good quality scans of Times 12, but that is not what this article is about.
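The review doesn’t say how its accuracy percentages were calculated. One common convention, assumed here, is character accuracy: one minus the edit distance between the expected and the recognised text, divided by the length of the expected text. The sketch below applies it to the Vera Serif 12 line quoted above; `edit_distance` and `character_accuracy` are illustrative names, and the result need not match the article’s own figure, which may have been averaged over several lines or computed differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr_output):
    """Share of the reference text recovered, floored at zero."""
    errors = edit_distance(truth, ocr_output)
    return max(0.0, 1 - errors / len(truth))


TRUTH = "The quick brown Métis jumped over the fluffy Finance Manager."
VERA_SERIF_12 = "The *uick brown Métís jumped over the fluf Finance Manager."

if __name__ == "__main__":
    print(edit_distance(TRUTH, VERA_SERIF_12))
    print(f"{character_accuracy(TRUTH, VERA_SERIF_12):.1%}")
```

For this line the distance is 4 (the q, the í, and the two characters missing from “fluffy”), which works out to roughly 93% by this definition.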
Name: Asprise OCR
Location: http://asprise.com/product/ocr/index.php?lang=java
Version: 3.0
Input Format: tiff/pdf
Accuracy: 91.5%
Ease of Use: 3/5

In Closing

The good news is that there are solutions available on Linux right now which interpret documents at up to 99% accuracy. The bad news is that 99% is not 100%, and that anything other than a high quality 400-600 DPI scan of 12-14 point font drops off very quickly in accuracy. The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.

I learned a lot about OCR in writing this. I realize it’s not a perfectly scientific review, but I hope it can be useful to other Linux OCR newbies as well as reminding people that Free software alternatives are available and need your help.

* Note: I am not employed by, nor do I have a financial interest in Mandriva Inc.
Thank you for your job. I’ve
Submitted by stephane (not verified) on Wed, 2007/05/23 – 13:29.

Thank you for your job. I’ve been looking at OCR programs in linux a few years ago and was disappointed. I’m happy to learn that new alternatives exist, like Tesseract and Ocropus. Good job.

Stephane

Thanks for the informative
Submitted by Ali (not verified) on Wed, 2007/05/23 – 20:16.

Thanks for the informative review. BTW, the KB vs dpi curve is quadratic, not exponential.

No, it is exponential for
Submitted by Pyroguy (not verified) on Tue, 2007/10/30 – 19:37.

No, it is exponential for standard, uncompressed raster images. Perhaps the success rate versus DPI, but not file size. He was right in the article. There is no parabola there at all; just a steady exponential increase.

Last I checked, the area of a
Submitted by Samuel Bronson (not verified) on Fri, 2008/02/22 – 23:17.

Last I checked, the area of a pixel was given res^-2, and the number of pixels needed to fill a region is of course the area of the region divided by the area of the individual pixels. Thus, the number of pixels on a page is quadratic in the resolution.

You might want to try OCRAD,
Submitted by David (not verified) on Wed, 2007/05/23 – 21:10.

You might want to try OCRAD, it`s better than GOCR and worse than Tesseract in my tests, but comes with layout analysis engine built in.

http://www.gnu.org/software/ocrad/ocrad.html

Also, Kooka is a nice KDE front end to GOCR and OCRAD.

A Linux commercial OCR
Submitted by Wendell Dingus (not verified) on Wed, 2007/05/23 – 21:35.

A Linux commercial OCR product you might have missed:

http://www.vividata.com/index.html

Hmmm?? Some of the features
Submitted by sCANNER (not verified) on Wed, 2007/05/23 – 21:55.

Hmmm?? Some of the features I want an OCR package are:

Scanner Integration to allow direct input to the OCR package. Who wants to have hundreds of tiffs littering the hard disk.

Good recognition of different fonts and languages. Recognition of math formulas would be even better.

Preservation of page layout and images.

Direct saving to word-processor formats.

On the fly correction and training.

No open source OCR even comes close to something you can use for real-life situations.

There are a few companies
Submitted by Matthew Lenz (not verified) on Wed, 2007/05/23 – 22:01.

There are a few companies with commercial OCR products that run on Linux. We use a product called Pixel Engine OCR from a company called Mentalix. Although it supports text OCR, we use it to read EAN13 bar codes in fax tiffs for auto indexing documents. It works well but took a couple months to develop the scripts to make it do everything we needed. The engine interprets TCL scripts to handle all the image processing. It allows you to manipulate (various filters, rotate, flip etc) images and convert them to different formats as well. The software isn’t cheap (over 1000USD per license) but it does work rather well. Might be worth your time until ocrapus “gets there”

Would be good to see, how
Submitted by Wojciech Halicki-Piszko (not verified) on Wed, 2007/05/23 – 22:25.

Would be good to see, how well those engines perform for non-latin 1 characters – I come from Poland and we all know Polish isn’t the only language with all those funny letters. I think the benchmark test was a little too short :). Great job! Thanks for taking your time to test all those programs.

Do any of these programmes
Submitted by Bernard (not verified) on Wed, 2007/05/23 – 22:35.

Do any of these programmes handle multiple columns or other formatting issues?

Great post! I’ve been
Submitted by Jay (not verified) on Wed, 2007/05/23 – 22:54.

Great post! I’ve been thinking of using some OCR programs to create a database from some old articles scanned as an alternative to some of the proprietary reference managers out there for Linux and this saves me some research. I’m glad the OSS programs outperformed the closed source ones.

Just as an aside, the file size is proportional to the square of the DPI, so the relationship between file size and DPI is not exponential.

Thanks for coming up with a
Submitted by Gaveen (not verified) on Thu, 2007/05/24 – 01:14.

Thanks for coming up with a review of these interesting packages. I wasn’t aware of some of these packages. Caught this article on OSNews.com

It was really good to learn about some quality OCR software for Linux. Thank you.

Hi! Thanks for this great
Submitted by EthraZa (not verified) on Thu, 2007/05/24 – 01:36.

Hi!
Thanks for this great review!
I would like to point out that a tesseract-ocr package is now included with Mandriva 2007.1 Spring.

Note: I am not employed by, nor do I have a financial interest in Mandriva Inc. too, but oh god, it is the best Linux out there ;)

Great job and nice overview
Submitted by maarten (not verified) on Thu, 2007/05/24 – 02:52.

Great job and nice overview

A very good and informative
Submitted by Jakob Malm (not verified) on Thu, 2007/05/24 – 03:07.

A very good and informative review! About five months ago I had a bunch of documents in Swedish to scan and convert to text. If I remember correctly I used gocr, and I had to do quite a bit of editing afterwards. I hope OCRopus will come along well!

If you want to compare to
Submitted by A.C. (not verified) on Thu, 2007/05/24 – 03:09.

If you want to compare to commercial OCR software, you should have tried FineReader (abbyy.com), which is as good as commercial OCR for printed text gets. It runs well in WINE, although installation is rather non-trivial, and scanners will not work. Demo version has limitations on exporting and saving.

If you give me the links to your images, I can test it.

If you would send me the
Submitted by Richard Philips (not verified) on Thu, 2007/05/24 – 03:16.

If you would send me the tiff file to my E-mail address, I could run it against ABBYY Finereader 8.0, a state of the art OCR software for Vista. I’ll send you the results back.

Thank you for review.

Thx a lot! Hmm.. And what
Submitted by AnonymousII (not verified) on Thu, 2007/05/24 – 04:09.

Thx a lot!

Hmm.. And what about recognition of Russian chars, for example?

I think within a few years we’ll have powerful OCR software.

It’s interesting if these
Submitted by Andrei (not verified) on Thu, 2007/05/24 – 04:15.

It’s interesting if these programs can recognise docs in other languages? Russian, for example.

thanks for your review.. i
Submitted by bernhard (not verified) on Thu, 2007/05/24 – 05:20.

thanks for your review.. i am planning to switch to linux (zenwalk, sabayon or xubuntu) and i want it with all the features that i use in win xp now..
your report is really very helpful

Hi, Many thanks for your OCR
Submitted by bersace (not verified) on Thu, 2007/05/24 – 05:22.

Hi,

Many thanks for your OCR review. Very helpful since one goal of my SoC is to implement OCR in Gnome with gnome-scan :)

Étienne.

Well done. Recently a friend
Submitted by graffic (not verified) on Thu, 2007/05/24 – 05:27.

Well done. Recently a friend was looking for information about OCR in linux. Your article will help him and it has helped me to know more about OCR in linux. Thank you.

Graf.

Thank you for great post,
Submitted by virens (not verified) on Thu, 2007/05/24 – 06:13.

Thank you for great post, really useful!

Great review, I’ve thought of
Submitted by Anonymous? (not verified) on Thu, 2007/05/24 – 07:08.

Great review, I’ve thought of testing the options myself but this is a great resource. Glad to hear there are some good options also.

It’s funny how many of them have 80s-looking Unix UIs; of course it’s OCR so who cares. Just looks odd.

The state of OCR software is
Submitted by Jeroen Ruigrok van der Werven (not verified) on Thu, 2007/05/24 – 07:43.

The state of OCR software is sad indeed. Especially when working with accented characters, which is unavoidable in Europe (where I live) as well as South America.

Let alone trying to scan Thai, Tibetan, Chinese, Korean, or Japanese. That area is sorely lacking. And unfortunately an area I do a lot in.

> The state of OCR software
Submitted by A.K.T. (not verified) on Sun, 2007/05/27 – 08:16.

> The state of OCR software is sad indeed. … accented characters … Europe … South-American … Thai, Tibetan, Chinese, Korean, or Japanese.

The commercial software has that area covered. The programs I know well — CuneiForm and FineReader worked fine with all European languages 10 years ago. FineReader 8 supposedly works with hieroglyphs too. And there are programs that claim to be better.

Open-source seems to be catching up. Tesseract promised support for non-English alphabets by the end of 2007.

Thanks! This helped a lot!!
Submitted by Kazuo (not verified) on Thu, 2007/05/24 – 08:11.

Thanks! This helped a lot!!

And Kooka???
Submitted by Rex Bachmann (not verified) on Thu, 2007/05/24 – 09:08.

And Kooka???

Yes, this is going to be an
Submitted by BuddyG (not verified) on Thu, 2007/05/24 – 09:15.

Yes, this is going to be an incredible resource for a lot of people. Thanks and excellent work! -MG

Nice article! But please!
Submitted by James Jiblets (not verified) on Thu, 2007/05/24 – 10:46.

Nice article! But please! Useless use of subshell:

for i in `ls *.pnm` …

is a wasteful version of:

for i in *.pnm

That’s all. :)

Fixed, thanks.
Submitted by Austin on Thu, 2007/05/24 – 11:13.

Fixed, thanks.

..never realised it could be
Submitted by Anonymous_ajj9 (not verified) on Wed, 2008/02/06 – 17:17.

..never realised it could be done that way – thanks..

i couldn’t help but snicker
Submitted by donkey willits (not verified) on Thu, 2007/05/24 – 11:17.

i couldn’t help but snicker when i read ocrapus oh crap us in the comments. man, part of me is still 4.. but it was funny all the same.

Concerning the effects of
Submitted by ACC (not verified) on Thu, 2007/05/24 – 12:10.

Concerning the effects of resolution, what really matters is the optical resolution of your scanner. The 9600 dpi that the marketing guys quote to the poor users is an interpolated figure. The same is true for some lower resolutions: the firmware interpolates the results.

From your results I would say that your scanner has an optical resolution of 600 dpi, at least in one direction. It is common to have different resolutions in X and Y direction, one is defined by the CCDs the other by the step motor that drives the head or the paper in case of auto-feeder.

“While Tesseract is the
Submitted by Anonymous2 (not verified) on Thu, 2007/05/24 – 12:15.

“While Tesseract is the least user-friendly of the command line applications, it is by far the most accurate, the most active, and the post promising.”

Looks like the OCR-ing failed reading “most” as “post”… :P

Well, that all looks good,
Submitted by ddc (not verified) on Thu, 2007/05/24 – 13:24.

Well, that all looks good, but AFAIK none of the OCR programs in question has anything like acceptable support for text in non-ISO-8859-1 character sets. When I was choosing an OCR program in Fall ’06, I found exactly zero OCRs that could recognize Cyrillic in any of my samples. AFAIR, none even had an option for selecting a symbol set.

To be correct OCR is and was
Submitted by Anonymous coward (not verified) on Thu, 2007/05/24 – 15:49.

To be correct, OCR is, and was, developed using a few special OCR typefaces. The engine used a bitmap overlay on each character; since the spacing was not kerned, it was almost easy.

Today billions of packages and addresses are read by delivery and postal machines.
They basically find the address region, try to break it up into lines of print, then try to split the lines into single characters. They then combine the confidence levels from 3 to 6 character engines, each of which uses its own approach. One is a software copy of an original hardware computer designed almost 30 years ago. Oddly, the modern engines have not been able to fully beat the old system. Most of the newer designs use a mathematical model based on either pixels or shapes/curves to determine the character.

One little gotcha with
Submitted by JT (not verified) on Thu, 2007/05/24 – 21:37.

One little gotcha with Tesseract is that it just produces nothing with 16-bit tiffs, such as you get from gimp if you rotate the image to correct bad alignment on the scanner and don’t then remember to flatten the image. Once that’s fixed it looks to have outperformed gocr by at least an order of magnitude on some tables of data in a courier font[*].

[*] gocr repeatedly confused 4 with q or Q and 8 and 9 with g — that was the worst.

Great article. Someone
Submitted by Marsolin (not verified) on Thu, 2007/05/24 – 23:29.

Great article. Someone complained about wanting to be able to OCR as they scanned. Kooka is capable of doing that using either the Ocrad or gocr engines. Hopefully they’ll add tesseract someday.

A couple other projects I thought I’d mention are ocube and VueScan. ocube is a CLI wrapper for tesseract that aims to make it easier to use. VueScan is a commercial OCR package.

Chad
http://linuxappfinder.com

Nitpicking “As you can see,
Submitted by mmebane (not verified) on Fri, 2007/05/25 – 00:08.

Nitpicking

“As you can see, they increase exponentially”

This is incorrect. They actually increase polynomially – quadratically, to be exact. Every time the DPI doubles, the file size quadruples.
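
The quadratic growth is easy to check with a little arithmetic. The sketch below assumes an uncompressed 8-bit greyscale scan (one byte per pixel) of an 8.5 × 11 inch page; those figures and the helper name scan_bytes are assumptions for illustration, not taken from the article's measurements.

```shell
# Bytes needed for an uncompressed 8-bit greyscale scan at a given DPI.
# Page size is given in tenths of an inch so integer arithmetic suffices;
# the default is 8.5 x 11 inches. scan_bytes is a hypothetical helper name.
scan_bytes() {
    dpi=$1
    width_tenths=${2:-85}
    height_tenths=${3:-110}
    echo $(( (width_tenths * dpi / 10) * (height_tenths * dpi / 10) ))
}

# Each doubling of the DPI doubles both pixel dimensions, so the size quadruples.
for dpi in 150 300 600 1200; do
    printf '%5d dpi: %d bytes\n' "$dpi" "$(scan_bytes "$dpi")"
done
```

Under these assumptions the page takes 2,103,750 bytes at 150 dpi and 8,415,000 bytes at 300 dpi, a factor of exactly four.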

Good to see Google
Submitted by FreeBSD (not verified) on Fri, 2007/05/25 – 04:01.

Good to see Google supporting the Tesseract Project. OCR and speech recognition are 2 areas where open source is still behind. Lot of catching up to do.

Yep. The third area free
Submitted by Austin on Fri, 2007/05/25 – 05:45.

Yep. The third area free software is seriously lacking is pen-based computing and handwriting recognition.

Thanks! A very helpful
Submitted by Rusty (not verified) on Fri, 2007/05/25 – 09:14.

Thanks! A very helpful review. I think OCR and handwriting recognition are an important direction to take. The better computers become at interpreting “human” input, the better the man-machine interface will become. And Open Source, better yet!

Really interesting, thank
Submitted by Giulio (not verified) on Fri, 2007/05/25 – 15:20.

Really interesting, thank you!

I tried gocr and ocrad in the past (and between these two I chose ocrad), and I remember gocr was a lot slower, by some orders of magnitude.

What about timing the commands (with the time command, e.g. time /usr/local/bin/tesseract g400.tiff g400.txt)?
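
One way to collect such timings is sketched below. It uses date +%s rather than the time command so that the result can be captured in a variable, and it assumes a date that supports %s (GNU and BSD do). Resolution is whole seconds, which is coarse but enough to see order-of-magnitude differences. The helper name time_cmd is made up for this example.

```shell
# Print how many whole seconds a command takes to run.
# time_cmd is a hypothetical helper; the command's own output is discarded.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

For example, `time_cmd /usr/local/bin/tesseract g400.tiff g400.txt` would print the elapsed seconds for the run suggested above, and the same wrapper could be pointed at gocr or ocrad for comparison.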

Thanks for the review. I had
Submitted by What is Linux? (not verified) on Fri, 2007/05/25 – 16:07.

Thanks for the review. I had to scan some documents recently, and didn’t even try OCR on my home Fedora machine. I went straight to the university’s XP machines and used whatever program was installed (I really don’t remember the name). It had about 80% accuracy: in other words, useless. Too bad my HP all-in-one scanner/printer/fax can only scan single pages and not books, else I’d look into Tesseract.

ocrad and gocr have always
Submitted by taupist (not verified) on Sun, 2007/05/27 – 12:55.

ocrad and gocr have always been a bust for me.

Just out of curiosity, is the middle scroll box in the Clara interface a text editor? It would be nice to have the image in a box over the editor. I don’t think it looks that bad, kind of like a HUD (heads-up display).

Thank you for this very
Submitted by Jean-François (not verified) on Wed, 2007/05/30 – 04:50.

Thank you for this very interesting review. It’s true OCR remains a (the?) weak point in Linux systems, although the commercial OCR vendors behind FineReader and Readiris (the best ones?) both offer toolkits for Linux.
Obviously building OCR software requires MUCH MORE work than building a zip tool. Why don’t the meritorious programmers join forces?
Regards
Jean-François

Do you know Gamera ? This is
Submitted by Anonymousse (not verified) on Fri, 2007/06/01 – 14:49.

Do you know Gamera? It is an interesting framework for document recognition. You can easily recognize an unknown language by using a supervised learning algorithm.
Homepage: http://ldp.library.jhu.edu/projects/gamera/

Aside from the technical
Submitted by P.H. (not verified) on Fri, 2007/06/01 – 15:12.

Aside from the technical issues with OCR, I’d like to point you to a fantastic new project in this area. You know that there are many large book digitization projects around (one of the best known being Google Books), and all suffer from the limitations of OCR software… You know that every time you leave a comment on a blog you’ve got to go through stupid captchas…

Here comes ReCAPTCHA !

Basically it’s a new captcha system taking advantage of human intelligence in order to help decipher difficult words. Learn more here :

http://recaptcha.net/learnmore.html

Good review. ocre version
Submitted by luisjc (not verified) on Fri, 2007/06/08 – 08:20.

Good review.

ocre version 0.027 is on.

Can I help you?

Luis Cearra


Copyright in all content belongs to the author. Groundstate is not responsible for nor endorses submitted content or comments.

10 Responses to "Linux OCR is it worth it?"

rankpulse says:

Haven’t you tried Free OCR? It’s a free OCR service and the accuracy is good.

admin says:

On the face of it, it seems less capable than ocrconvert.

I’m suspicious that both of the links, ocrconvert and goodocr, were suggested out of the blue like that. I’m guessing that these sites don’t take too much to put together and give their owners a chance to look into others’ business docs for juicy info …?

ilnli says:

Also http://www.ocrconvert.com is a good free online optical character recognition tool.

admin says:

Thanks, will check it out.

Steven says:

I just tested a service and I wish to share it: OCR online. It’s not perfect (the service is in English and no emphasis is recognized), but otherwise most of the work is done …

ilnli says:

I’m not sure about goodocr, but OCRconvert explicitly states in its Terms of Service that users must not upload any confidential material and that it will not be responsible for such material, so I don’t think OCRconvert actually wants to look into others’ business docs.

Anyway, it’s just what I think.

Nick says:

Tried the FC 12 version of tesseract, just the rpm, no mods. Initially I got no results, until I realized the resolution had to be 300×300 or higher and the image grayscale. Tried it on some really old documents, printed and typed, ranging from 50 to 100 years old. Was amazed at the accuracy of the results. (Beats gocr, by the way.)

[…] There are a few open sourced OCR programs for linux. You can't just scan to OO exactly and be able to edit. http://www.splitbrain.org/blog/2010-…are_comparison http://alicious.com/linux-ocr/ […]

Sath.Linux says:

Lios (formerly known as easy-ocr)

Lios is free and open source software for converting print into text using either a scanner or a camera. It can also produce text from scanned images from other sources, such as a PDF, an image, or a folder containing images. The program is fully accessible to visually impaired users. Lios is written in Python, and we release it under the GPL3 license. Lios works with Debian-based operating systems. There are a great many possibilities for this program, and feedback is the key to it. Expecting your feedback: Nalin.x.Linux@Gmail.com

Download the latest deb file from the Download Page, open it, and install.

Lilou says:

For an all-in-one application to edit PDF files and OCR, please consider our application called PDF Studio:

http://www.qoppa.com/pdfstudio/

It supports review / markups, form filling, assembling, signing and content editing.

It is a commercial application.

