Need advice: Ubuntu OCR techniques

Mon Oct 10 04:19:22 UTC 2011

On Sun, Oct 9, 2011 at 6:34 PM, NoOp <glgxg at sbcglobal.net> wrote:

> On 10/09/2011 02:39 PM, Kevin O'Gorman wrote:
> > On Sun, Oct 9, 2011 at 1:09 PM, Kevin O'Gorman <kogorman at gmail.com> wrote:
> >
> >> On Sun, Oct 9, 2011 at 11:10 AM, Icarus Alive <icarus.alive at gmail.com> wrote:
> >>
> >>> On Sun, Oct 9, 2011 at 11:04 PM, Kevin O'Gorman <kogorman at gmail.com> wrote:
> >>> > I'm new to OCR (optical character reading), have never done it before.
> >>> > Suddenly I have a need.
> >>> >
> >>> > I've been diving through old papers and have found hard-copy (appears to be
> >>> > real Courier font, laser printed on white background) of a program I wrote
> >>> > decades ago on a Macintosh 512K in Lightspeed C.  I thought I had lost it
> >>> > completely.  I would like to recover it from the hard-copy without typing
> >>> > ~100 pages of code.  I have a scanner, and full Acrobat CS5 on a Windows
> >>> > machine, plus all the FOSS of Ubuntu (tesseract, gocr, plus anything useful
> >>> > in multiverse).  Does anybody know the fastest way to usable code from this
> >>> > situation?
> >>>
> >>> Use the power-of-the-cloud... Google Docs can do OCR. For English-language
> >>> printed text, scanned well, it works pretty well.
> >>> http://docs.google.com/support/bin/answer.py?answer=176692
> >>>
> >>> Icarus (may your wings stay on),
> >>
> >> Great idea.  I'll check it out.
> >>
> > I was unable to make it work.  I scanned one of the files as a 3-page TIFF
> > file with Irfanview, and uploaded it to Google Docs.  I marked all the
> > checkboxes for conversion, but did not get a text document.  I've marked it
> > shared to all, and the link (for me) is
> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B6pbHEZND52eZWNlZGQ4MmUtMTgwZi00MTQ3LWJkMTUtNzIzOTIwMWRlOWJk&hl=en_US
> > (modulo any folding)
> ...
>
> Does:
> $ tesseract crystal.h1.tif crystal
> Tesseract Open Source OCR Engine
> Page 1
> Page 2
> $ gedit crystal.txt
> not work for you?
>

Funny you should mention that.  I just installed tesseract after finding
that gocr(1) could not deal with multipage TIF files.  Apart from whitespace,
it gets about 99% of the text right, which still leaves a lot of
proofreading and re-indenting.

On the subject of multipage TIF files, I created one this morning using
Irfanview for the scanning, but have been unable to repeat that since.
I've since started using the -append flag of convert(1) to build a
document's worth of images.
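
For anyone following along, here is a rough sketch of the convert-then-tesseract
workflow as two small shell functions.  It assumes ImageMagick's convert(1) and
tesseract(1) are installed; the function names and the page file names in the
usage note are made up for illustration.  One caveat: -append stacks its inputs
into a single tall image, whereas listing several inputs before one .tif output
(ImageMagick's default -adjoin behaviour) writes one page per input.  The sketch
uses the latter, to match the multipage TIF discussed above.

```shell
#!/bin/sh
# Sketch only: function names and file names are hypothetical.

# Combine page images into one multipage TIFF.
# Usage: make_multipage OUT.tif PAGE1.tif PAGE2.tif ...
make_multipage() {
    out=$1
    shift
    # Several inputs written to one .tif output give one page per input.
    convert "$@" "$out"
}

# OCR a TIFF with tesseract; the text is written to <basename>.txt,
# e.g. crystal.h1.tif -> crystal.h1.txt.
ocr_tiff() {
    tif=$1
    base=${tif%.*}    # strip the final extension
    tesseract "$tif" "$base"
}
```

Typical use would be something like `make_multipage crystal.tif page-1.tif
page-2.tif page-3.tif` followed by `ocr_tiff crystal.tif`, then proofreading
crystal.txt by hand.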

Still, I wonder what I forgot to do with Irfanview.

Anyway, it appears I have a way to proceed, so this question is solved.
Thanks to all.
-- 
Kevin O'Gorman, PhD