linux program to convert PDF to text

sktsee sktsee at tulsaconnect.com
Thu Feb 11 17:07:30 GMT 2010


On Wed, 10 Feb 2010 20:39:52 -0800, Robert Swanson wrote:

>> $ pdftotext -layout -eol unix -nopgbrk Novell-629.pdf novell-629.txt
>>
>> Is this what you are needing, or something that adds additional
>> formatting?
>>
>> --
>> sktsee
> 
> Sktsee,
> 	The command worked almost perfectly except that it will not convert
> quotes, dashes, and apostrophes.  It takes a little proof reading, but it is a
> great improvement over what I had.
> Thanks,
> Bob

Hmm, that's odd. The pdf file I converted, Document 629, contains quotes, 
apostrophes and dashes in the text file. Here's the first paragraph 
(without double-spacing): watch for wrap

        SCO claims that Novell slandered SCO’s alleged title to the UNIX copyrights by falsely
stating that SCO did not own the UNIX copyrights. (Second Am. Compl. ¶¶ 9-10, 91-92, Dkt.
No. 96.) Novell has asserted the First Amendment as a defense. (Novell’s Answer ¶ 136, Dkt.
No. 115.) The First Amendment protects corporations, as the Supreme Court recently
confirmed. Citizens United v. FEC, 2010 U.S. LEXIS 766 (U.S. Jan. 21, 2010) (Ex. 2A). Novell
moves for a ruling that First Amendment defenses apply to SCO’s slander of title claim, because
the First Amendment applies to any claim based on an alleged “injurious falsehood.”

That's a straight copy-n-paste from the created text file, and as you can 
see it has apostrophes, dashes, quotation marks, and even paragraph marks 
(at least it does on my machine). I'm wondering if there is an encoding 
issue here. By default, pdftotext uses UTF-8 character encoding, so 
whatever program you use to view the text file needs to understand UTF-8. 
You can also tell pdftotext to use a different encoding with the "-enc 
<encoding name>" option. You can type "pdftotext -listenc" to get a list 
of available encodings pdftotext can use. You might try "-enc Latin1", or 
"-enc ASCII7" and see if you get different results (ASCII7 loses 
paragraph marks, though).
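
If it helps narrow things down, here is a rough sketch of how I would check for an encoding mismatch from the shell. The helper names (is_utf8, non_ascii_bytes) and the sample.txt file are made up for illustration, and the output file names in the loop are just placeholders. The first half builds a tiny stand-in for pdftotext output and checks it; the second half only runs if pdftotext and the PDF are actually present.

```shell
#!/bin/sh
# Sketch only: helper names and file names are illustrative assumptions.

# Succeed if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Print the number of bytes outside 7-bit ASCII (0 means plain ASCII,
# i.e. no typographic quotes, dashes, or paragraph marks survived).
non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Stand-in for pdftotext output: "SCO’s" with the apostrophe U+2019
# written as its UTF-8 bytes E2 80 99 (octal 342 200 231).
printf 'SCO\342\200\231s\n' > sample.txt

if is_utf8 sample.txt; then
    echo "sample.txt is valid UTF-8"
fi
echo "non-ASCII bytes: $(non_ascii_bytes sample.txt)"

# The viewer or terminal has to agree with the file's encoding.
echo "locale charset: $(locale charmap 2>/dev/null || echo unknown)"

# Try the encodings mentioned above, if the tool and the PDF are here.
pdf=Novell-629.pdf
if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
    for enc in UTF-8 Latin1 ASCII7; do
        out="novell-629.$enc.txt"
        pdftotext -layout -eol unix -nopgbrk -enc "$enc" "$pdf" "$out"
        echo "$enc: $(non_ascii_bytes "$out") non-ASCII bytes"
    done
fi
```

If the UTF-8 output shows a non-zero count of non-ASCII bytes but the punctuation still looks wrong on screen, the problem is most likely in the viewer's encoding setting rather than in the conversion.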

Other than that, I'm not sure what's going on. Can you provide a link to 
a pdf document that has missing punctuation after being converted? I'll 
convert it on my machine and see if I get the same result that you did.

-- 
sktsee




