[Bug 1683428] Re: read_csv on bzip2 file unzips only the first block

Julian Taylor jtaylor.debian at googlemail.com
Mon Apr 17 17:29:19 UTC 2017


** Also affects: python2.7 (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to python2.7 in Ubuntu.
https://bugs.launchpad.net/bugs/1683428

Title:
  read_csv on bzip2 file unzips only the first block

Status in pandas package in Ubuntu:
  New
Status in python2.7 package in Ubuntu:
  New

Bug description:
  It seems that read_csv() suffers from the same problem as, e.g., early
  Boost implementations; see
  https://svn.boost.org/trac/boost/ticket/3853 for details. A bz2
  file can be composed of many concatenated bz2 blocks, which
  have to be treated as one continuous stream.
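
  As an illustration of what "treated as one continuous stream" means, here
  is a minimal sketch using only the Python standard library. It assumes
  Python 3.3 or later (for BZ2Decompressor.eof); the helper name
  decompress_multistream is hypothetical and this is not the code pandas
  or python2.7 uses.

```python
import bz2


def decompress_multistream(data: bytes) -> bytes:
    """Decompress every concatenated bz2 block in `data` (hypothetical helper)."""
    out = []
    while data:
        # One decompressor object handles exactly one bz2 block.
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bz2 data")
        # Bytes after the end of this block belong to the next one.
        data = decomp.unused_data
    return b"".join(out)


# Two independently compressed pieces, joined like `cat a.bz2 b.bz2`.
joined = bz2.compress(b"a,b\n1,2\n") + bz2.compress(b"3,4\n")
result = decompress_multistream(joined)
```

  The loop restarts a fresh decompressor on the leftover bytes until the
  input is exhausted, so the output is the concatenation of all blocks.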

  How to test: create a large CSV file, much larger than 900k. Compress
  it with pbzip2 (each process creates one bz2 block). Alternatively, create
  many such CSV files, bzip2 them individually, and then run cat *.bz2
  >joined.bz2

  read_csv() will uncompress and read only the first block.
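
  The symptom can be sketched without pandas, using only the standard
  library bz2 module. This is an assumption-laden illustration (Python 3.3
  or later, tiny in-memory data instead of a large file), not a trace of
  what read_csv() does internally: a single decompressor object stops at
  the end of the first block and leaves the rest untouched.

```python
import bz2

rows_a = b"x,y\n1,2\n"
rows_b = b"3,4\n"

# Equivalent to compressing two files separately and running `cat *.bz2`.
joined = bz2.compress(rows_a) + bz2.compress(rows_b)

# A single decompressor object only understands one block.
single = bz2.BZ2Decompressor()
first_only = single.decompress(joined)
leftover = single.unused_data  # still-compressed bytes of the second block

# On Python 3.3+, the one-shot bz2.decompress() handles multi-stream input.
everything = bz2.decompress(joined)
```

  Here first_only holds just the rows from the first block, while
  everything holds the rows from both.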

  Note that this is a severe bug, since parallel bzip2 (pbzip2) is becoming
  increasingly common on multi-core systems.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.10
  Package: python-pandas 0.17.1-3ubuntu2
  ProcVersionSignature: Ubuntu 4.8.0-42.45-generic 4.8.17
  Uname: Linux 4.8.0-42-generic x86_64
  ApportVersion: 2.20.3-0ubuntu8.2
  Architecture: amd64
  CurrentDesktop: XFCE
  Date: Mon Apr 17 18:42:52 2017
  InstallationDate: Installed on 2014-10-21 (909 days ago)
  InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
  PackageArchitecture: all
  SourcePackage: pandas
  UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pandas/+bug/1683428/+subscriptions


