SOLVED: Need advice: Ubuntu OCR techniques

Mon Oct 10 04:20:01 UTC 2011

On Sun, Oct 9, 2011 at 9:19 PM, Kevin O'Gorman <kogorman at gmail.com> wrote:

> On Sun, Oct 9, 2011 at 6:34 PM, NoOp <glgxg at sbcglobal.net> wrote:
>
>> On 10/09/2011 02:39 PM, Kevin O'Gorman wrote:
>> > On Sun, Oct 9, 2011 at 1:09 PM, Kevin O'Gorman <kogorman at gmail.com> wrote:
>> >
>> >> On Sun, Oct 9, 2011 at 11:10 AM, Icarus Alive <icarus.alive at gmail.com> wrote:
>> >>
>> >>> On Sun, Oct 9, 2011 at 11:04 PM, Kevin O'Gorman <kogorman at gmail.com> wrote:
>> >>> > I'm new to OCR (optical character recognition) and have never done it
>> >>> > before.  Suddenly I have a need.
>> >>> >
>> >>> > I've been diving through old papers and have found hard-copy (appears to be
>> >>> > real Courier font, laser printed on white background) of a program I wrote
>> >>> > decades ago on a Macintosh 512K in Lightspeed C.  I thought I had lost it
>> >>> > completely.  I would like to recover it from the hard-copy without typing
>> >>> > ~100 pages of code.  I have a scanner, and full Acrobat CS5 on a Windows
>> >>> > machine, plus all the FOSS of Ubuntu (tesseract, gocr, plus anything useful
>> >>> > in multiverse).  Does anybody know the fastest way to usable code from this
>> >>> > situation?
>> >>>
>> >>> Use the power-of-the-cloud... Google Docs can do OCR. For English
>> >>> language printed text, scanned well, it works pretty well.
>> >>> http://docs.google.com/support/bin/answer.py?answer=176692
>> >>>
>> >>> Icarus (may your wings stay on),
>> >>
>> >> Great idea.  I'll check it out.
>> >>
>> > I was unable to make it work.  I scanned one of the files as a 3-page TIFF
>> > file with Irfanview, and uploaded it to Google Docs.  I marked all the
>> > checkboxes for conversion, but did not get a text document.  I've marked it
>> > shared to all, and the link (for me) is
>> > https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B6pbHEZND52eZWNlZGQ4MmUtMTgwZi00MTQ3LWJkMTUtNzIzOTIwMWRlOWJk&hl=en_US
>> > (modulo any folding)
>> ...
>>
>> Does:
>> $ tesseract crystal.h1.tif crystal
>> Tesseract Open Source OCR Engine
>> Page 1
>> Page 2
>> $ gedit crystal.txt
>> not work for you?
>>
>
> Funny you should mention that.  I just installed tesseract after finding
> that gocr(1) could not deal with multipage TIF files.  It works about 99%
> other than whitespace, which still leaves a lot of proofreading and
> indenting.
> On the subject of multipage TIF files, I created one this morning using
> Irfanview for the scanning, but have been unable to do that since then.
> I've since started using the -append flag of convert(1) to build a
> document's worth of images.
>
> Still, I wonder what I forgot to do with Irfanview.
>
> Anyway, it appears I have a way to proceed, so this question is solved.
> Thanks to all.
> --
> Kevin O'Gorman, PhD
>
>
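
For anyone finding this thread later: a minimal sketch of a per-page
route that avoids needing a multi-page TIFF at all. It OCRs each scanned
page on its own and joins the text in order. It relies only on the
"tesseract <image> <outputbase>" form quoted above, which writes
<outputbase>.txt. The function name ocr_pages, the OCR_CMD override and
the page-NNN.tif file names are assumptions made for illustration; none
of them come from the thread or from tesseract itself.

```shell
#!/bin/sh
# Sketch: OCR each scanned page separately and join the text in page order.
# ocr_pages and OCR_CMD are hypothetical names, not part of tesseract.
# OCR_CMD lets you swap in another command that follows the same
# "<image> <outputbase>" calling convention; it defaults to tesseract.

ocr_pages() {
    out=$1
    shift
    : > "$out"                      # start with an empty output file
    for img in "$@"; do
        base=${img%.*}              # page-001.tif -> page-001
        # tesseract writes its result to "$base.txt"
        "${OCR_CMD:-tesseract}" "$img" "$base" || return 1
        cat "$base.txt" >> "$out"
    done
}
```

Usage would be something like

$ ocr_pages crystal.txt page-*.tif

with zero-padded page numbers (page-001.tif, page-002.tif, ...) so the
shell glob sorts them in reading order, and an output name that does not
collide with any of the per-page .txt files. Indentation still has to be
restored by hand afterwards.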

-- 
Kevin O'Gorman, PhD