scanner optimization and usage

Tue Aug 12 16:56:15 UTC 2008

I am in the process of converting a number of high-end math
and physics texts to PDF.  I want to be able to sit with a laptop
and not have to run back and forth to the bookcases lugging
heavy books and winding up with a pile of them on the floor.

I am using Kubuntu 8.04.1 with xsane and a Canon LiDE 25 scanner.

I scan pages at 300 DPI to PS files named 001.ps, 002.ps, ...
I think I need 300 DPI to get clean, crisp text and images
at 400% zoom.

After I have scanned the book into many PS files, I use a script, page_txt,
containing the line

gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH -sOutputFile=- -q \
"$1" | ocrad > "$(basename "$1" .ps).txt"

and then, from the command line in the directory, I run:

for i in *.ps
do
page_txt "$i"
done
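
The script and the loop above could also live in one file.  What follows
is only a sketch: the names ocr_all, txt_name and ocr_page are made up
for illustration, and it assumes gs and ocrad are installed and called
exactly as in my page_txt line.

```shell
#!/bin/sh
# ocr_all (hypothetical name): page_txt and the for loop in one script.
# Assumes gs and ocrad are on the PATH, as in the commands above.

# Derive the text file name from a PostScript file name: 001.ps -> 001.txt
txt_name() {
    printf '%s.txt\n' "$(basename "$1" .ps)"
}

# Rasterize one page at 300 DPI and pipe the image into ocrad.
ocr_page() {
    gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH \
       -sOutputFile=- -q "$1" | ocrad > "$(txt_name "$1")"
}

# Process every file named on the command line; skip unreadable ones.
for f in "$@"
do
    if [ -r "$f" ]; then
        ocr_page "$f"
    else
        echo "skipping $f: not readable" >&2
    fi
done
```

It would be run as "sh ocr_all *.ps" from the directory with the scans.
Quoting "$1" and "$f" keeps file names with spaces in one piece.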

This gives me pages with the OCR text.  I tried gocr and tesseract and
did not get as good results as with ocrad.  

Then I run the following

gs -q -dNOPAUSE -dBATCH  -sDEVICE=pdfwrite -sOutputFile=out.pdf *.ps

to consolidate all the .ps files into one PDF file.  I can live with this.
I can NFS mount several TB of disc space, so that is not an issue at
this time.  :-)
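
To have numbers to compare before and after any change, the sizes of the
PS pages and of the PDF can be totalled.  This is a rough sketch only;
total_bytes is a helper name I made up, and it relies on nothing beyond
wc and the shell.

```shell
#!/bin/sh
# total_bytes (hypothetical helper): print the combined size, in bytes,
# of the regular files given as arguments.  Missing files count as 0.
total_bytes() {
    total=0
    for f in "$@"
    do
        [ -f "$f" ] || continue
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}
```

With that function defined, "total_bytes *.ps" and "total_bytes out.pdf"
give the two figures to compare.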

Is there a way to further compress the files at any point without
losing the desired resolution, using only software that comes with
Kubuntu or is available from the Kubuntu repos?  Inquiring minds want
to know.

I may have reinvented the wheel or gone about this all wrong, but
education is expensive no matter how you get it.

Thanks in advance,

chuck