linux program to convert PDF to text
sktsee
sktsee at tulsaconnect.com
Thu Feb 11 17:07:30 UTC 2010
On Wed, 10 Feb 2010 20:39:52 -0800, Robert Swanson wrote:
>> $ pdftotext -layout -eol unix -nopgbrk Novell-629.pdf novell-629.txt
>>
>> Is this what you are needing, or something that adds additional
>> formatting?
>>
>> --
>> sktsee
>
> Sktsee,
> The command worked almost perfectly except that it will not convert
> quotes,
> dashes, and apostrophes. It takes a little proof reading, but it is a
> great improvement over what I had.
> Thanks,
> Bob
Hmm, that's odd. The text file I got from converting the same PDF,
Document 629, does contain quotes, apostrophes, and dashes. Here's the
first paragraph (without double-spacing; watch for wrap):
SCO claims that Novell slandered SCO’s alleged title to the UNIX
copyrights by falsely
stating that SCO did not own the UNIX copyrights. (Second Am. Compl. ¶¶
9-10, 91-92, Dkt.
No. 96.) Novell has asserted the First Amendment as a defense. (Novell’s
Answer ¶ 136, Dkt.
No. 115.) The First Amendment protects corporations, as the Supreme Court
recently
confirmed. Citizens United v. FEC, 2010 U.S. LEXIS 766 (U.S. Jan. 21,
2010) (Ex. 2A). Novell
moves for a ruling that First Amendment defenses apply to SCO’s slander
of title claim, because
the First Amendment applies to any claim based on an alleged “injurious
falsehood.”
That's a straight copy-n-paste from the created text file, and as you can
see it has apostrophes, dashes, quotation marks, and even paragraph marks
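(at least it does on my machine). I'm wondering if there is an encoding
issue here. The quotes and apostrophes in this document are typographic
("curly") characters, not plain ASCII ones. By default, pdftotext uses
UTF-8 character encoding, so whatever program you use to view the text
file needs to understand UTF-8, or it may drop or garble them.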
You can also tell pdftotext to use a different encoding with the "-enc
<encoding name>" option. You can type "pdftotext -listenc" to get a list
of available encodings pdftotext can use. You might try "-enc Latin1", or
"-enc ASCII7" and see if you get different results (ASCII7 loses
paragraph marks, though).
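To compare encodings side by side, something like the following might save some typing. It is only a sketch: convert_with_encodings is a made-up helper name, and the output file names (the PDF's base name plus the encoding) are my own choice, not anything pdftotext does for you. The options are the same ones from the command quoted earlier in this thread, with -enc added.

```shell
#!/bin/sh
# Sketch only: convert_with_encodings is a hypothetical helper, not part of pdftotext.
# It converts one PDF three times, once per encoding, so the results can be compared.
convert_with_encodings() {
    pdf=$1

    # Stop early if pdftotext is not installed or the PDF cannot be read.
    command -v pdftotext >/dev/null 2>&1 || { echo "pdftotext not found" >&2; return 1; }
    [ -r "$pdf" ] || { echo "cannot read: $pdf" >&2; return 1; }

    base=${pdf%.pdf}
    for enc in UTF-8 Latin1 ASCII7; do
        # Output name is an assumption for illustration, e.g. Novell-629.Latin1.txt
        pdftotext -layout -eol unix -nopgbrk -enc "$enc" "$pdf" "$base.$enc.txt" || return 1
    done
}
```

Running convert_with_encodings Novell-629.pdf should leave three text files next to the PDF; opening each in the same viewer should show which encoding that viewer handles.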
Other than that, I'm not sure what's going on. Can you provide a link to
a pdf document that has missing punctuation after being converted? I'll
convert it on my machine and see if I get the same result that you did.
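In the meantime, one way to tell a viewer problem from a conversion problem is to check whether the converted file contains any bytes outside plain ASCII at all. A rough sketch, assuming a grep that honors LC_ALL=C (count_non_ascii_lines is just a name I made up):

```shell
#!/bin/sh
# Sketch only: count_non_ascii_lines is a hypothetical helper name.
# Prints the number of lines that contain at least one byte that is neither
# printable ASCII nor ASCII whitespace. In the C locale every byte of a UTF-8
# curly quote, dash, or paragraph mark falls into that category.
count_non_ascii_lines() {
    LC_ALL=C grep -c '[^[:print:][:space:]]' "$1"
}
```

For example, count_non_ascii_lines novell-629.txt. A count above zero would suggest the punctuation made it into the file and the viewer is the likely culprit; a count of zero would suggest it was lost during conversion.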
--
sktsee