trying to OCR a simple tif file with tesseract-ocr

Sat Mar 26 12:56:10 UTC 2011

On Sat, 26 Mar 2011, Nicolae Ghimbovschi wrote:

> Hi,
>
> First, try pre-processing the tiff file with unpaper; you'll need the
> netpbm utils to convert tiff to pgm and back to tiff.
>
> Stage 1:
>
> $  tifftopnm -respectfillorder -verbose file.tiff > out.pgm
> $  unpaper -v --layout single out.pgm unpaper_out.pgm
> $  pnmtotiff unpaper_out.pgm > unpaper_out.tiff
>
>
> Then ocr it with tesseract:
>
> Stage 2:
> $ tesseract unpaper_out.tiff ocr.txt -l eng

  that doesn't appear to make any difference -- still an empty output
file.  but i have to admit, i don't understand *why* unpaper would
make any difference.  as i read the man page, its purpose is to
increase the quality of scanned book pages, perhaps in preparation for
OCR processing.  however, if i *start* with perfect-quality tif images
taken from a screenshot, what value would running unpaper have?
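
for reference, the two quoted stages can be chained into one script that
stops at the first failing step and flags an empty result.  this is only
a sketch: ocr_pipeline is a hypothetical helper name, the intermediate
file names are taken from the quoted commands, and the tool flags are the
ones quoted above (minus the verbose options).  it assumes that tesseract
appends ".txt" to the output base it is given, which is how the releases
i know of behave.

```shell
#!/bin/sh
# sketch only: ocr_pipeline is a hypothetical helper, not part of any tool.

ocr_pipeline() {
    in=$1      # input tiff, e.g. file.tiff
    base=$2    # output base name, e.g. ocr (assumed to become ocr.txt)

    # stop early if any tool from the quoted commands is missing
    for tool in tifftopnm unpaper pnmtotiff tesseract; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done

    # stage 1: tiff -> pgm -> unpaper -> tiff, stopping at the first failure
    tifftopnm -respectfillorder "$in" > out.pgm || return 2
    unpaper --layout single out.pgm unpaper_out.pgm || return 2
    pnmtotiff unpaper_out.pgm > unpaper_out.tiff || return 2

    # stage 2: tesseract is assumed to write its text to "$base.txt"
    tesseract unpaper_out.tiff "$base" -l eng || return 3

    # report an empty result file instead of silently accepting it
    [ -s "$base.txt" ] || { echo "empty output: $base.txt" >&2; return 4; }
}
```

calling "ocr_pipeline file.tiff ocr" should leave the text in ocr.txt.
if the assumption about the appended suffix holds, the quoted stage 2
command, which passes ocr.txt as the base, would write to ocr.txt.txt
instead, so it is worth checking which file is actually being inspected.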

rday

-- 

========================================================================
Robert P. J. Day                               Waterloo, Ontario, CANADA
                        http://crashcourse.ca

Twitter:                                       http://twitter.com/rpjday
LinkedIn:                               http://ca.linkedin.com/in/rpjday
========================================================================