trying to OCR a simple tif file with tesseract-ocr
Robert P. J. Day
rpjday at crashcourse.ca
Sat Mar 26 12:56:10 UTC 2011
On Sat, 26 Mar 2011, Nicolae Ghimbovschi wrote:
> Hi,
>
> Firstly try to pre-process the tiff file with unpaper, you'll need
> netpbm utils to convert tiff to pgm and back to tiff
>
> Stage 1:
>
> $ tifftopnm -respectfillorder -verbose file.tiff > out.pgm
> $ unpaper -v --layout single out.pgm unpaper_out.pgm
> $ pnmtotiff unpaper_out.pgm > unpaper_out.tiff
>
>
> Then ocr it with tesseract:
>
> Stage 2:
> $ tesseract unpaper_out.tiff ocr.txt -l eng
that doesn't appear to make any difference -- still an empty output
file. but i have to admit, i don't understand *why* unpaper would
make any difference. as i read the man page, its purpose is to
increase the quality of scanned book pages, perhaps in preparation for
OCR processing. however, if i *start* with perfect quality tif images
as a result of a screenshot, what value would running unpaper have?
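
for anyone following along, here is a rough sketch of the two quoted stages as a single shell function, with a check for whether any text actually came out. the function names (ocr_output_path, ocr_has_text, ocr_tiff) are made up for illustration, and the intermediate file names are just the ones from the quoted commands. one thing worth checking, as far as i know: tesseract treats its second argument as an output *base* and appends ".txt" itself, so a base of "ocr.txt" would put the text in "ocr.txt.txt", not "ocr.txt".

```shell
#!/bin/sh
# Sketch only: the two quoted stages as one function, plus a result check.
# All function names here are hypothetical, chosen for illustration.

# tesseract takes an output *base* and appends ".txt" to it,
# so this prints the file name where the recognized text should land.
ocr_output_path() {
    printf '%s.txt\n' "$1"
}

# Succeeds only if the text file for the given base exists and is non-empty.
ocr_has_text() {
    [ -s "$(ocr_output_path "$1")" ]
}

# Stage 1 (unpaper clean-up) and stage 2 (tesseract), as in the quoted steps.
# Usage: ocr_tiff file.tiff ocr
ocr_tiff() {
    src=$1
    base=$2
    tifftopnm -respectfillorder -verbose "$src" > out.pgm &&
    unpaper -v --layout single out.pgm unpaper_out.pgm &&
    pnmtotiff unpaper_out.pgm > unpaper_out.tiff &&
    tesseract unpaper_out.tiff "$base" -l eng &&
    ocr_has_text "$base"
}
```

calling "ocr_tiff file.tiff ocr" should leave the text in ocr.txt and return non-zero if any stage fails or the result is empty. this only makes an empty result easier to pin down; it says nothing about whether unpaper helps on a clean screenshot.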
rday
--
========================================================================
Robert P. J. Day Waterloo, Ontario, CANADA
http://crashcourse.ca
Twitter: http://twitter.com/rpjday
LinkedIn: http://ca.linkedin.com/in/rpjday
========================================================================