Need advice: Ubuntu OCR techniques
Patton Echols
p.echols at comcast.net
Mon Oct 10 01:26:14 UTC 2011
On 10/09/2011 10:34 AM, Kevin O'Gorman wrote:
> I'm new to OCR (optical character reading), have never done it
> before. Suddenly I have a need.
>
> I've been diving through old papers and have found hard-copy (appears
> to be real Courier font, laser printed on white background) of a
> program I wrote decades ago on a Macintosh 512K in Lightspeed C. I
> thought I had lost it completely. I would like to recover it from the
> hard-copy without typing ~100 pages of code. I have a scanner, and
> full Acrobat CS5 on a Windows machine, plus all the FOSS of Ubuntu
> (tesseract, gocr, plus anything useful in multiverse). Does anybody
> know the fastest way to usable code from this situation?
>
> --
> Kevin O'Gorman, PhD
>
Tesseract works well if you have single-page TIFFs, but apparently not as
well with multipage ones. If your scanner will spit out a stack of TIFFs,
then try something like:
$ for i in *.tif; do tesseract "$i" "${i%.tif}"; done
Then cat together the resulting text files.
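
The two steps above (OCR each page, then join the text) could be rolled
into one small shell function. This is only a sketch under assumptions: the
name ocr_pages is made up for illustration, the pages are single-page files
matching *.tif in the current directory, file names contain no newlines, and
sort -V from GNU coreutils is available so that page10 sorts after page2.

```shell
# Sketch only: OCR every single-page TIFF in the current directory and
# join the resulting text in page order.
# "ocr_pages" is a hypothetical helper name, not part of tesseract.
ocr_pages() {
    out=$1
    : > "$out"                           # start with an empty output file
    # sort -V (GNU coreutils) orders page2.tif before page10.tif
    printf '%s\n' *.tif | sort -V | while IFS= read -r tif; do
        [ -e "$tif" ] || continue        # skip the literal pattern if nothing matched
        base=${tif%.tif}                 # strip the .tif suffix
        tesseract "$tif" "$base"         # tesseract writes "$base.txt"
        cat "$base.txt" >> "$out"        # append this page's text
    done
}
```

It would be called with the name of the combined file, for example
"ocr_pages recovered.txt". The per-page .txt files are left in place so a
badly recognized page can be checked against its scan.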
I am sure there is a good way to split apart a multipage TIFF, but I
don't know it. If the scanner puts all the pages into a PDF, then you
can use the pdf2tif script here:
http://www.groklaw.net/articlebasic.php?story=20061210115516438
That page also has the script "ocr.sh", which will automate the whole
process except for catting together the results.
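
On splitting a multipage TIFF: one possibility, untested here, is the
tiffsplit tool from the libtiff utilities (package libtiff-tools on Ubuntu),
which writes one file per page by appending aaa, aab, and so on to a prefix.
A minimal sketch, where split_tiff and the default prefix page_ are names
invented for the example:

```shell
# Sketch only: split a multipage TIFF into single-page files.
# "split_tiff" and the "page_" prefix are hypothetical names; tiffsplit
# itself comes from the libtiff tools and must be installed separately.
split_tiff() {
    in=$1
    prefix=${2:-page_}            # output files become page_aaa.tif, page_aab.tif, ...
    tiffsplit "$in" "$prefix"
}
```

Because the generated suffixes sort alphabetically, a plain *.tif loop
should then visit the pages in their original order.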