Need advice: Ubuntu OCR techniques

Mon Oct 10 01:26:14 UTC 2011

On 10/09/2011 10:34 AM, Kevin O'Gorman wrote:
> I'm new to OCR (optical character reading), have never done it 
> before.  Suddenly I have a need.
>
> I've been diving through old papers and have found hard-copy (appears 
> to be real Courier font, laser printed on white background) of a 
> program I wrote decades ago on a Macintosh 512K in Lightspeed C.  I 
> thought I had lost it completely.  I would like to recover it from the 
> hard-copy without typing ~100 pages of code.  I have a scanner, and 
> full Acrobat CS5 on a Windows machine, plus all the FOSS of Ubuntu 
> (tesseract, gocr, plus anything useful in multiverse).  Does anybody 
> know the fastest way to usable code from this situation?
>
> -- 
> Kevin O'Gorman, PhD
>

Tesseract works well if you have single-page TIFFs, but apparently not 
as well with multipage ones.  If your scanner will spit out a stack of 
TIFFs, then try something like:

$ for i in *.tif; do tesseract "$i" "${i%.tif}"; done

Then cat together the resulting text files.
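
A rough sketch of that concatenation step, assuming GNU sort (for -V) 
and that tesseract has written one .txt file per page.  A plain 
"cat *.txt" puts page10.txt ahead of page2.txt, so this sorts the 
names numerically first.  The helper name join_pages and the file 
names in the usage line are made up for illustration.

```shell
# Hypothetical helper: join per-page OCR output into one file, in page order.
# Usage: join_pages OUTPUT PAGE.txt...
# Assumes GNU sort (-V) and file names that contain no newlines.
join_pages() {
    out=$1
    shift
    # sort -V compares embedded numbers, so page2.txt sorts before page10.txt.
    printf '%s\n' "$@" | sort -V | while IFS= read -r f; do
        cat -- "$f"
    done > "$out"
}
```

Something like "join_pages recovered.c page*.txt" would then build 
the combined file.  Pick an output name that the glob does not match, 
or the output ends up among its own inputs.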

I am sure there is a good way to split apart a multipage TIFF, but I 
don't know it.  If the scanner puts all the pages into a PDF, then you 
can use the pdf2tif script here:
http://www.groklaw.net/articlebasic.php?story=20061210115516438

That page also has the script "ocr.sh", which will automate the whole 
process except for catting together the results.
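
For the splitting step, one possible approach (untested against your 
scans, and not taken from the scripts on that page) is tiffsplit from 
libtiff-tools for a multipage TIFF, or Ghostscript for a PDF.  The 
helper name split_pages, the 300 dpi resolution and the greyscale 
output device are assumptions chosen for illustration.

```shell
# Hypothetical helper: split a multipage scan into single-page TIFFs.
# Usage: split_pages INPUT PREFIX
split_pages() {
    in=$1
    prefix=$2
    case $in in
        *.tif|*.tiff)
            # tiffsplit (libtiff-tools) writes PREFIXaaa.tif, PREFIXaab.tif, ...
            tiffsplit "$in" "$prefix"
            ;;
        *.pdf)
            # Ghostscript renders each page to its own greyscale TIFF.
            # 300 dpi is an assumed resolution; adjust to suit the scan.
            gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
               -sOutputFile="${prefix}%03d.tif" "$in"
            ;;
        *)
            echo "split_pages: unsupported file type: $in" >&2
            return 1
            ;;
    esac
}
```

For example, "split_pages scan.pdf page-" should leave page-001.tif, 
page-002.tif and so on in the current directory, ready for the 
tesseract loop.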