One more advocate needed for python-nltk

rmunn at
Mon Feb 15 20:16:02 GMT 2010

Hi all,

After the Ubuntu Developer Week, I got excited about doing packaging and started looking for a piece of software that needed to be packaged for Ubuntu, so that I could get some practice on a "real" project after the learning exercises during Daniel Holbach's Packaging 101 class. I'd recently been a technical reviewer for a new O'Reilly book, "Natural Language Processing with Python", so I figured I'd package the NLTK library that the book is about. After about a week's worth of work, I had a package ready to go. So I set out to create a needs-packaging bug and discovered that someone else had independently worked on the same package. I got in touch with him and we agreed that I would probably have more time to maintain the package, so I took over the needs-packaging bug (LP #514936) and uploaded python-nltk to REVU.

It took a little while to resolve all the problems with the package -- some were classic mistakes from a new packager, while others were problems in the upstream package that I had to sort out -- but I finally fixed them all, and a few days ago I got my first MOTU advocate. (Thanks, sistpoty!) I just need one more advocate before Feature Freeze this Thursday so that the python-nltk package can go into Lucid.

I'd really like to get this into Lucid. NLTK is a rather famous package in its field (famous enough that O'Reilly wanted to publish a book about it), so the fact that "apt-get install python-nltk" is not possible in Ubuntu is a gap in Ubuntu's package list that I'd like to plug. But I've done just about everything I can do by now, and the rest is up to the current MOTUs, who are naturally busy with Feature Freeze coming up. I hope to join the MOTU ranks someday, but for now I need to impose on your time. Could someone please review and be a second advocate for the package before Thursday?

In the interest of saving you some time when you review the package, here are some areas that have previously been brought up for discussion on IRC, and the conclusions that were reached about them:

* There's a binary (nltk.jar) included in the upstream tarball. It contains compiled Java classes whose code is also included in the upstream tarball, under javasrc/org/nltk/mallet/*.java. There are no licensing or copyright issues involved in redistributing this binary in the .orig.tar.gz file, since its content is entirely NLTK code, copyright by the NLTK Project, and explicitly licensed under the Apache-2.0 license (which permits redistributing both source and binaries as long as they're accompanied by a copy of the license). Including the upstream-provided nltk.jar in the binary .deb package would pose a security risk (how do you prove that a Trojan horse wasn't slipped into the .jar?), so my package deletes nltk.jar and re-creates it from the source files.

* That is, the package *would* re-create nltk.jar from source if its dependencies were in Debian or Ubuntu. But the Java classes in nltk.jar depend on an old version of Mallet, version 0.4. Mallet is open-source, but has never been packaged for Debian, and has had an API change with Mallet 2.0 that would make NLTK's Mallet interface obsolete. (The nltk.jar classes import from "edu.umass.cs.mallet.base", but Mallet 2.0 moved its classes to the "cc.mallet" location.) After discussing the problem with upstream, I decided, as a temporary measure, to remove the NLTK-Mallet interface and make its functions and classes raise a NotImplementedError, since getting Mallet packaged properly *and* converting the NLTK code to use Mallet's new API would have taken too long and missed the window for including python-nltk in Lucid. Upstream agreed, and I made the change in the python-nltk package. In the process, I discovered that the Mallet interface wasn't being imported in NLTK's files, so most users would never have even realized it was there. Thus, my removing it from the Ubuntu package for Lucid will go unnoticed by the vast majority of people who use NLTK, since the Mallet interface doesn't work as documented on Windows or Mac systems either!

* Another issue that was raised at one point was a copyright quirk. NLTK uses the ElementTree code that comes standard with Python 2.5, but only requires Python 2.4 (or later) -- so they imported the ElementTree code from the Python 2.5 codebase. The license for ElementTree says that "[b]y obtaining, using, and/or copying this software and/or its associated documentation, you agree that you have read, understood, and will comply with the following terms and conditions" (and the rest is BSD-style boilerplate -- see the debian/copyright for nltk/etree/* in my package if you want to read the whole thing). One person on #ubuntu-motu brought up the fact that a "By using ... this software, you agree that you have read ... [the license]" clause would seem to require a debconf hack to display the license during package installation, which tends to annoy users. However, others pointed out that a license you implicitly agree to by "obtaining ... this software" would be unenforceable in pretty much any jurisdiction, especially since the copyright holder's own download page doesn't display this license anywhere and so most people would obtain the software without being aware of the license. And then there's the fact that this same code is found in the python and python-elementtree packages, both of which have been in Debian and Ubuntu for years without any copyright or licensing issues being raised. In the end, it was decided that a debconf hack to display the license is NOT needed in this particular case.

* There was also a false positive from lintian about "build-depends-without-arch-dep", which refers to the fact that python-yaml is listed in Build-Depends (and lintian thought it should go in Build-Depends-Indep). However, python-yaml is required for the clean step of debian/rules and thus Debian policy says it belongs in Build-Depends; furthermore, current versions of lintian don't give the warning. Only the older version of lintian found in REVU was raising this warning. After discussing it on #ubuntu-motu, I decided to follow policy rather than lintian's false-positive warning, and I left python-yaml in Build-Depends.
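
As a rough illustration of the Mallet change described above (not the actual patch in the package), stubbing out an interface so that it raises NotImplementedError could look like the Python sketch below. The names call_mallet_stub and MalletTaggerStub and the wording of the message are hypothetical, chosen only for illustration.

```python
# Hypothetical sketch: these names are illustrative only and are not the
# actual identifiers used in NLTK or in the python-nltk package.
_MALLET_REMOVED_MSG = (
    "The NLTK-Mallet interface is not available in this package: "
    "it depends on Mallet 0.4, which is not packaged."
)


def call_mallet_stub(*args, **kwargs):
    """Stand-in for a function that would have invoked Mallet."""
    raise NotImplementedError(_MALLET_REMOVED_MSG)


class MalletTaggerStub(object):
    """Stand-in for a class that would have wrapped a Mallet model."""

    def __init__(self, *args, **kwargs):
        # Fail at construction time so callers get a clear error early.
        raise NotImplementedError(_MALLET_REMOVED_MSG)


if __name__ == "__main__":
    try:
        call_mallet_stub("some-command")
    except NotImplementedError as err:
        print("Mallet support disabled:", err)
```

Keeping the names importable while failing loudly on use means that code which merely imports the module keeps working, and only code that actually calls into Mallet sees the error.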

As you can see, I've put quite a bit of work into this already. :-) So unless your review identifies a brand-new problem that neither I nor the 4-5 people who've already looked at this have spotted, I believe it should be a fairly quick review process. So who would like to improve Lucid by advocating for this package? Now, several days before Feature Freeze, is the time to do it -- that will give me time to respond to any other issues.

I can usually be found hanging out on #ubuntu-motu as rmunn, and that's probably the quickest way to tell me about any issues that might still need to be ironed out on this package. You can also reach me through this list or on the ubuntu-motu-sponsors list, where I recently sent out a message introducing myself.

Thanks in advance to whoever ends up reviewing this package, and thus helping me improve Lucid by making more open-source software available to its users.

Robin Munn
rmunn at
GPG key 0x4543D577
