[Bug 1683428] Re: read_csv on bzip2 file unzips only the first block
Julian Taylor
jtaylor.debian at googlemail.com
Mon Apr 17 17:29:19 UTC 2017
** Also affects: python2.7 (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to python2.7 in Ubuntu.
https://bugs.launchpad.net/bugs/1683428
Title:
read_csv on bzip2 file unzips only the first block
Status in pandas package in Ubuntu:
New
Status in python2.7 package in Ubuntu:
New
Bug description:
It seems that read_csv() suffers from the same problem as, e.g., the
early boost implementations; see
https://svn.boost.org/trac/boost/ticket/3853 for details. A bz2
file can be composed of many concatenated bz2 blocks, which
have to be treated as one continuous stream.
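As a rough illustration of treating the concatenated pieces as one stream, here is a minimal sketch using only Python 3's standard bz2 module. What the report calls blocks are, in a pbzip2 or cat-joined file, complete bz2 streams placed back to back. The helper name decompress_all_streams is hypothetical, and this is not the code path pandas itself uses.

```python
import bz2


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every bz2 stream in `data`, not just the first one.

    Hypothetical helper for illustration only.
    """
    chunks = []
    while data:
        # A BZ2Decompressor handles exactly one stream; bytes that follow
        # the end of that stream are exposed as `unused_data`.
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bz2 stream")
        # Continue with whatever came after the stream that just ended.
        data = decomp.unused_data
    return b"".join(chunks)
```

A reader that creates only one decompressor and never looks at unused_data stops after the first stream, which matches the symptom described here.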
How to test: create a large csv file, much larger than 900k. Compress
it with pbzip2 (each process creates one bz2 block). Alternatively, create
many such csv files, bzip2 them individually and then run
cat *.bz2 > joined.bz2
read_csv() will uncompress and read only the first block.
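The second recipe can be approximated without pbzip2. The sketch below assumes Python 3.3 or later and uses only the standard library; the tiny csv fragments and the helper names write_joined_bz2 and read_first_stream_only are made up for illustration. It does not call pandas, so it only imitates the reported truncation; whether a particular pandas and Python combination shows it is not tested here.

```python
import bz2
import os
import tempfile


def write_joined_bz2(path, parts):
    """Compress each part on its own and concatenate, like `cat *.bz2`."""
    with open(path, "wb") as out:
        for part in parts:
            out.write(bz2.compress(part))


def read_first_stream_only(path):
    """Mimic a reader that stops at the end of the first bz2 stream."""
    with open(path, "rb") as f:
        raw = f.read()
    # A single decompressor ignores everything after the first stream.
    return bz2.BZ2Decompressor().decompress(raw)


# Made-up csv fragments standing in for the individually compressed files.
parts = [b"a,b\n1,2\n", b"3,4\n", b"5,6\n"]
path = os.path.join(tempfile.mkdtemp(), "joined.bz2")
write_joined_bz2(path, parts)

truncated = read_first_stream_only(path)
with bz2.open(path, "rb") as f:  # multi-stream aware since Python 3.3
    complete = f.read()
```

Comparing truncated with complete shows the difference between stopping at the first stream and reading the whole file; the same joined.bz2 can then be passed to read_csv() to check a given installation.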
Note that this is a severe bug, since parallel bzip2 is becoming
increasingly common on multi-core systems.
ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: python-pandas 0.17.1-3ubuntu2
ProcVersionSignature: Ubuntu 4.8.0-42.45-generic 4.8.17
Uname: Linux 4.8.0-42-generic x86_64
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
CurrentDesktop: XFCE
Date: Mon Apr 17 18:42:52 2017
InstallationDate: Installed on 2014-10-21 (909 days ago)
InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
PackageArchitecture: all
SourcePackage: pandas
UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pandas/+bug/1683428/+subscriptions