trying to OCR a simple tif file with tesseract-ocr

Sat Mar 26 18:17:57 UTC 2011

On 03/26/2011 06:28 AM, Nicolae Ghimbovschi wrote:
> It was a tif file, imageshack has converted it to png.

Copied the png and converted it to .tif with Gimp. On the first try I'd
forgotten to flatten & decompress:

$ tesseract ocrtest.tif testocr
Tesseract Open Source OCR Engine
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32
Segmentation fault
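
The 32 in that message is consistent with an unflattened RGBA image (four 8-bit channels). As a rough pre-check before running tesseract, the TIFF header can be inspected directly. This is only a sketch: the function names are made up for illustration, the set of supported depths is copied from the error message rather than from Tesseract's source, and the "uncompressed only" rule is an assumption based on the decompress step having been needed here.

```python
import struct

# Tag numbers from the TIFF 6.0 specification.
TAG_BITS_PER_SAMPLE = 258
TAG_COMPRESSION = 259
TAG_SAMPLES_PER_PIXEL = 277
WANTED_TAGS = (TAG_BITS_PER_SAMPLE, TAG_COMPRESSION, TAG_SAMPLES_PER_PIXEL)

# TIFF field types handled here: 3 = SHORT (2 bytes), 4 = LONG (4 bytes).
FIELD_TYPES = {3: ("H", 2), 4: ("I", 4)}

# Assumption: copied from the error message, not from Tesseract's source.
ASSUMED_SUPPORTED_BPP = {1, 2, 4, 5, 6, 8}


def read_tiff_summary(data):
    """Return (bits_per_pixel, compression) for the first image in a TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian file
    elif data[:2] == b"MM":
        endian = ">"   # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    bits, samples, compression = [1], 1, 1   # defaults from the TIFF spec
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
        if tag not in WANTED_TAGS or ftype not in FIELD_TYPES or n == 0:
            continue
        code, size = FIELD_TYPES[ftype]
        if size * n <= 4:
            value_offset = entry + 8   # values stored inline in the entry
        else:
            (value_offset,) = struct.unpack_from(endian + "I", data, entry + 8)
        values = struct.unpack_from("%s%d%s" % (endian, n, code), data, value_offset)
        if tag == TAG_BITS_PER_SAMPLE:
            bits = list(values)
        elif tag == TAG_COMPRESSION:
            compression = values[0]
        else:
            samples = values[0]

    # One BitsPerSample value per channel is the norm; fall back otherwise.
    bpp = sum(bits) if len(bits) == samples else bits[0] * samples
    return bpp, compression


def looks_acceptable(bits_per_pixel, compression):
    """Guess whether the file avoids the error above (compression 1 = none)."""
    return bits_per_pixel in ASSUMED_SUPPORTED_BPP and compression == 1


def check_file(path):
    """Read a TIFF from disk and return (bpp, compression, acceptable)."""
    with open(path, "rb") as handle:
        bpp, compression = read_tiff_summary(handle.read())
    return bpp, compression, looks_acceptable(bpp, compression)
```

Calling check_file("ocrtest.tif") returns the bit depth, the compression code, and the guess as a tuple. It only looks at the first image in the file and ignores every other tag.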

Flattened & decompressed:
$ tesseract ocrtest2.tif testocr
Tesseract Open Source OCR Engine
$ ls *.txt
testocr.txt

And the result:
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer at website.com is spam.
Der ,,schnelle" braune Fuchs springt
uber den faulen Hund. Le renard brun
<<rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marron répido salta sobre el perro
perezoso. A raposa marrom répida
salta sobre o cio preguicoso.

$ apt-cache policy tesseract-ocr
tesseract-ocr:
  Installed: 2.04-2
  Candidate: 2.04-2
  Version table:
 *** 2.04-2 0
        500 http://us.archive.ubuntu.com/ubuntu/ maverick/universe i386 Packages
        100 /var/lib/dpkg/status