[Bug 737234] Re: too much data transferred making a new stacked branch

Fri Jun 10 16:11:36 UTC 2011

** Changed in: bzr (Ubuntu Natty)
   Importance: Undecided => High

** Changed in: bzr (Ubuntu Natty)
     Assignee: (unassigned) => Jelmer Vernooij (jelmer)

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to bzr in Ubuntu.
https://bugs.launchpad.net/bugs/737234

Title:
  too much data transferred making a new stacked branch

Status in Bazaar Version Control System:
  Fix Released
Status in Bazaar 2.3 series:
  Fix Released
Status in “bzr” package in Ubuntu:
  Fix Released
Status in “bzr” source package in Natty:
  In Progress

Bug description:
  In thread "Linaro bzr feedback" John writes:

  Note, I just did 'bzr branch lp:gcc-linaro', and it transferred about
  500MB, about 457MB on disk. (Not bad considering lp:emacs transferred
  400-500MB and was only 200MB on disk.)

  I then ran 'bzr serve' and 'bzr branch --stacked bzr://localhost:...'.
  What was scary was:

  8141442kB 24128kB/s / Finding Revisions
  ...
  > Grepping the .bzr.log file in question, I do, indeed see about 8.1GB of
  > data transferred before we read the first .tix.
  > If my grep fu is strong, then we only read 30MB of .cix data. Which
  > leaves us with 8GB of .pack content, or actual CHK page content.

  This is a change which drops the 8GB down to 150MB:

  === modified file 'bzrlib/inventory.py'
  --- bzrlib/inventory.py 2010-09-14 13:12:20 +0000
  +++ bzrlib/inventory.py 2011-03-17 15:38:40 +0000
  @@ -736,6 +736,13 @@
              specific_file_ids = set(specific_file_ids)
          # TODO? Perhaps this should return the from_dir so that the root is
          # yielded? or maybe an option?
  +        if from_dir is None and specific_file_ids is None:
  +            # They are iterating from the root, assume they are iterating
  +            # everything and preload all file_ids into the
  +            # _fileid_to_entry_cache. This doesn't build things into .children
  +            # for each directory, but that will happen later.
  +            for _ in self.iter_just_entries():
  +                continue
          if from_dir is None:
              if self.root is None:
                  return

  Basically, iter_entries_by_dir yields entries in directory order, which
  doesn't match the order in which they are stored in the repository.
  'iter_just_entries' loads everything in repository order and puts it
  into CHKInventory._fileid_to_entry_cache, and the rest of the requests
  are then fed from there.
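
  The preload idea can be illustrated with a toy model. This is only a
  sketch, not bzrlib code: ToyInventory, build_toy_inventory,
  count_page_reads and the page layout are hypothetical names made up for
  the illustration, and the model assumes no page cache at all (the case
  where the inventory is too large for the caches), so every entry-cache
  miss costs one page read.

```python
# Toy model only: none of these names are bzrlib API.


class ToyInventory:
    def __init__(self, pages, dir_order):
        # pages: list of {file_id: path} dicts, in "repository order".
        # dir_order: file ids in the order a by-directory walk wants them.
        self._pages = pages
        self._dir_order = dir_order
        self._page_of = {fid: n for n, page in enumerate(pages) for fid in page}
        self._entry_cache = {}
        self.page_reads = 0

    def _read_page(self, n):
        self.page_reads += 1  # stands in for a disk or network round trip
        return self._pages[n]

    def iter_just_entries(self):
        # Walk pages in storage order, caching every entry on the way.
        for n in range(len(self._pages)):
            for fid, path in self._read_page(n).items():
                self._entry_cache[fid] = path
                yield fid, path

    def get_entry(self, fid):
        if fid not in self._entry_cache:
            # Miss: fetch the whole page but keep only the requested entry.
            page = self._read_page(self._page_of[fid])
            self._entry_cache[fid] = page[fid]
        return self._entry_cache[fid]

    def iter_entries_by_dir(self, preload):
        if preload:
            # Same shape as the patch: drain the storage-order iterator first.
            for _ in self.iter_just_entries():
                continue
        for fid in self._dir_order:
            yield fid, self.get_entry(fid)


def build_toy_inventory(num_pages=3, per_page=4):
    # Entry i lives on page i % num_pages, so a walk over f0, f1, f2, ...
    # touches a different page on every step.
    total = num_pages * per_page
    pages = [{} for _ in range(num_pages)]
    for i in range(total):
        pages[i % num_pages]["f%d" % i] = "path/%02d" % i
    dir_order = ["f%d" % i for i in range(total)]
    return ToyInventory(pages, dir_order)


def count_page_reads(preload):
    inv = build_toy_inventory()
    list(inv.iter_entries_by_dir(preload))
    return inv.page_reads


if __name__ == "__main__":
    print("page reads without preload:", count_page_reads(False))
    print("page reads with preload:", count_page_reads(True))
```

  In this toy setup the walk without preload reads a page for each of the
  12 entries, while the preloaded walk reads each of the 3 pages once. The
  real ratio depends on page sizes and cache behaviour, which the model
  does not attempt to reproduce.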

  We don't usually notice this effect, because of the
  chk_map._thread_caches.page_cache and the GCCHKRepository block cache.
  Once the inventory is large enough to not be in the bytes cache, we have
  to load it from the repository again.
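
  Why the effect only shows up past a certain inventory size can be
  sketched with a generic LRU cache. This is an assumption-laden
  illustration, not the actual chk_map or GCCHKRepository cache code:
  count_page_misses is a hypothetical helper, the cache is counted in
  pages rather than bytes, and the access pattern (entry i lives on page
  i % num_pages) is chosen to mimic an iteration order that does not
  match storage order.

```python
from collections import OrderedDict


def count_page_misses(num_pages, entries_per_page, cache_size):
    """Count page loads for a walk that cycles through all pages.

    Hypothetical model: entry i is stored on page i % num_pages, and the
    cache holds at most cache_size pages with least-recently-used eviction.
    """
    cache = OrderedDict()  # page number -> True, oldest first
    misses = 0
    for i in range(num_pages * entries_per_page):
        page = i % num_pages
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
        else:
            misses += 1  # page has to be loaded from the repository again
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    print("cache fits all pages:", count_page_misses(8, 4, cache_size=8))
    print("cache too small:", count_page_misses(8, 4, cache_size=4))
```

  With 8 pages and room for all 8, each page is loaded once (8 misses).
  With room for only 4, the cyclic access pattern evicts every page just
  before it is needed again, so all 32 accesses miss.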

  I just checked, and this also has a large effect for local
  repositories.

  'time list(rev_tree.inventory.iter_entries_by_dir())'
  drops from 4m30s down to 13s with the patch.

  So we certainly should think about other ramifications, but short term
  it looks quite good.
