Rev 34: Another disk-format bump. in http://bazaar.launchpad.net/%7Ebzr/bzr-groupcompress/trunk

Thu Mar 5 17:21:06 GMT 2009

At http://bazaar.launchpad.net/%7Ebzr/bzr-groupcompress/trunk

------------------------------------------------------------
revno: 34
revision-id: john at arbash-meinel.com-20090305172017-mefnbegtuk4vt99i
parent: john at arbash-meinel.com-20090304223810-agw3duzy5tul01da
parent: john at arbash-meinel.com-20090305165238-o5be2o7v8wzewnlk
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: trunk
timestamp: Thu 2009-03-05 11:20:17 -0600
message:
  Another disk-format bump.

  Move the labels/sha1 information into a pre-header. This also makes it
  easier to decide to enable/disable the headers, as we can support
  both with the same deserialising code (at least until we remove
  the extra info from the indexes.)

  This also makes a fulltext record stream start with 'f' and a delta
  record stream start with 'd', which makes them more self describing.
  The next step would probably be to write the base128 length of the
  encoded bytes, which would make them fully independent, though
  you wouldn't know what content they refer to.

  This also brings in an update to .compress() which allows us to
  see that we overflowed our group, roll back and start a new one.
  This seems to give better compression in a 'more stable' manner.
  Still open to tweaking, though.

  Also introduce the 'gcc-chk255-big' format, which uses 64k leaf pages
  rather than 4k leaf pages. Initial results show a smaller compressed
  size at a small (10%) increase in uncompressed size, and a full-level
  decrease in the tree depth.

  No-labels decreases the inv size by approx 300kB, and big-page decreases
  the inv size by another 300kB, not to mention the 116kB decrease in the
  .cix index, just from not having the extra pages.

  Having both no-labels and big inv pages brings the repo from 11023kB
  down to 9847kB (1176kB savings, or about 10.7% overall).

  For now, leave the default with labels, but consider changing it.
removed:
  equivalence_table.py           equivalence_table.py-20080723225607-fk4rlr7rm1wln8w4-1
modified:
  __init__.py                    __init__.py-20080705181503-ccbxd6xuy1bdnrpu-6
  errors.py                      errors.py-20080705181503-ccbxd6xuy1bdnrpu-7
  groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
  repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
  tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 32.1.16
    revision-id: john at arbash-meinel.com-20090305165238-o5be2o7v8wzewnlk
    parent: john at arbash-meinel.com-20090305154227-41elarat0xs75c1p
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Thu 2009-03-05 10:52:38 -0600
    message:
      Make sure we don't inter-pack to GCCHKBig repos.
      Change the code so that we can branch from a source that has no labels
      even if we don't have _NO_LABELS set locally.
      Restore labels and slow as the default.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 32.1.15
    revision-id: john at arbash-meinel.com-20090305154227-41elarat0xs75c1p
    parent: john at arbash-meinel.com-20090305132400-k1i3iw0vz53oywy0
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Thu 2009-03-05 09:42:27 -0600
    message:
      Implement a 'bigpage' version of the chk serializer, which uses 64kB
      pages for leaf nodes. (This is approx 255 leaf entries, similar to
      the internal fan-out.)
    modified:
      __init__.py                    __init__.py-20080705181503-ccbxd6xuy1bdnrpu-6
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
    ------------------------------------------------------------
    revno: 32.1.14
    revision-id: john at arbash-meinel.com-20090305132400-k1i3iw0vz53oywy0
    parent: john at arbash-meinel.com-20090305042604-9d9sl2idrw3lvlqu
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Thu 2009-03-05 07:24:00 -0600
    message:
      Fix a bug in 'FAST' handling.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 32.1.13
    revision-id: john at arbash-meinel.com-20090305042604-9d9sl2idrw3lvlqu
    parent: john at arbash-meinel.com-20090305040549-1egrt0x9kqzl3d7j
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 22:26:04 -0600
    message:
      Bring back the code that handles _NO_LABELS.
      Basically, we omit the header, and just hold the content.
      This drops the chk from 1.5MB => 1.1MB, and the texts from 8.1MB => 7.7MB.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 32.1.12
    revision-id: john at arbash-meinel.com-20090305040549-1egrt0x9kqzl3d7j
    parent: john at arbash-meinel.com-20090305034657-t3qbsogy187yul4z
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 22:05:49 -0600
    message:
      Add a single byte to indicate whether the following text is a fulltext
      or a delta.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
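
The single-byte marker described in revno 32.1.12 can be sketched roughly as follows. This is a minimal illustration, assuming the marker is simply prepended to the record body; the 'f' and 'd' values come from the revno 34 message, while the function names `tag_record` and `split_record` are hypothetical and not taken from groupcompress.py.

```python
from typing import Tuple

# Marker bytes named in the commit message: 'f' for fulltext, 'd' for delta.
FULLTEXT_MARKER = b"f"
DELTA_MARKER = b"d"


def tag_record(kind: str, body: bytes) -> bytes:
    """Prefix a record body with a one-byte marker (hypothetical helper)."""
    if kind == "fulltext":
        return FULLTEXT_MARKER + body
    if kind == "delta":
        return DELTA_MARKER + body
    raise ValueError("unknown record kind: %r" % (kind,))


def split_record(record: bytes) -> Tuple[str, bytes]:
    """Read the marker byte back and return (kind, body)."""
    marker, body = record[:1], record[1:]
    if marker == FULLTEXT_MARKER:
        return "fulltext", body
    if marker == DELTA_MARKER:
        return "delta", body
    raise ValueError("unknown record marker: %r" % (marker,))
```

With a marker like this, a reader can tell how to interpret a record without consulting an external index, which is what makes the stream more self-describing.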
    ------------------------------------------------------------
    revno: 32.1.11
    revision-id: john at arbash-meinel.com-20090305034657-t3qbsogy187yul4z
    parent: john at arbash-meinel.com-20090305032949-ffww56phklv1vhbj
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 21:46:57 -0600
    message:
      Slightly different handling of large texts.

      We should only use 2*max_fulltext as a minimum size if we are still working
      on the same file. That allows us to avoid packing all following texts
      into the same group after a large file such as an ISO.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 32.1.10
    revision-id: john at arbash-meinel.com-20090305032949-ffww56phklv1vhbj
    parent: john at arbash-meinel.com-20090304223243-xrg48jyhczvpkjxc
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 21:29:49 -0600
    message:
      Play around with detecting compression breaks.
      Trying to get tricky with whether the last insert was a fulltext or delta
      did not pay off well (yet).
      However, using similar logic actually shows some of the best results yet.
      The main difference is probably that we detect overflow and roll back.
      In the past, if we got a big fulltext that pushed us over the line,
      we would leave it alone (poorly compressed in the last group),
      and start a new group, which would start off with a new fulltext.
      Now we roll the insert back and start the new group with it.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
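
The overflow-and-rollback idea from revno 32.1.10 can be sketched as below. This is only an illustration under stated assumptions: the class `GroupPacker` and its attributes are hypothetical, group size is measured as the raw sum of text lengths, and no delta compression is modeled, so it does not reflect the actual .compress() logic or its thresholds.

```python
from typing import List


class GroupPacker:
    """Toy packer: detect group overflow, roll back, start a new group."""

    def __init__(self, max_group_size: int) -> None:
        self.max_group_size = max_group_size  # hypothetical size limit
        self.finished: List[List[bytes]] = []  # closed groups
        self.current: List[bytes] = []         # group being filled
        self.current_size = 0

    def add(self, text: bytes) -> None:
        # Tentatively insert the text into the current group.
        self.current.append(text)
        self.current_size += len(text)
        # A text that is alone in its group stays there even if oversized.
        if self.current_size > self.max_group_size and len(self.current) > 1:
            # Overflow: undo the insert, close the group, and let the
            # text become the first entry of a fresh group.
            self.current.pop()
            self.current_size -= len(text)
            self.flush()
            self.current.append(text)
            self.current_size = len(text)

    def flush(self) -> None:
        # Close the current group if it holds anything.
        if self.current:
            self.finished.append(self.current)
        self.current = []
        self.current_size = 0
```

In this sketch the text that caused the overflow opens the next group instead of sitting at the tail of the previous one.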
    ------------------------------------------------------------
    revno: 32.1.9
    revision-id: john at arbash-meinel.com-20090304223243-xrg48jyhczvpkjxc
    parent: john at arbash-meinel.com-20090304214211-rg22q09z8queeer0
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 16:32:43 -0600
    message:
      Add some benchmark results for various flush sizes.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 32.1.8
    revision-id: john at arbash-meinel.com-20090304214211-rg22q09z8queeer0
    parent: john at arbash-meinel.com-20090304212250-xcvwt1yx4zt76pev
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 15:42:11 -0600
    message:
      Fix up the tests. Mostly it was just changing things to
      no longer include the labels.
      It also means we get a positive compression ratio :).
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 32.1.7
    revision-id: john at arbash-meinel.com-20090304212250-xcvwt1yx4zt76pev
    parent: john at arbash-meinel.com-20090304210622-ur7wz2dz0w4lhzn3
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 15:22:50 -0600
    message:
      Have the GroupCompressBlock decide how to compress the header and content.
      It can now decide whether they should be compressed together or not.
      As long as we make the to_bytes() function match the from_bytes() one, we should be fine.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
    ------------------------------------------------------------
    revno: 32.1.6
    revision-id: john at arbash-meinel.com-20090304210622-ur7wz2dz0w4lhzn3
    parent: john at arbash-meinel.com-20090304183131-p433dz5coqrmv8pw
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 15:06:22 -0600
    message:
      (tests broken) Implement the basic ability to have a separate header.
      This puts the labels/sha1/etc together, and then has the actual content deltas
      combined later on.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      repofmt.py                     repofmt.py-20080715094215-wp1qfvoo7093c8qr-1
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 32.1.5
    revision-id: john at arbash-meinel.com-20090304183131-p433dz5coqrmv8pw
    parent: john at arbash-meinel.com-20090304182042-yo1m7n2i2bpdldfl
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 12:31:31 -0600
    message:
      Now using a zlib compressed format.
      We encode the length of the compressed and uncompressed content,
      and then compress the actual content.
      Need to do some testing with real data to see if this is efficient
      or if another structure would be better.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
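
The layout described in revno 32.1.5 (record both lengths, then the zlib-compressed content) might look something like the sketch below. The exact on-disk layout is not shown in this email, so the two newline-terminated decimal length lines are an assumption, and `block_to_bytes` and `block_from_bytes` are hypothetical names, not the plugin's API.

```python
import zlib


def block_to_bytes(content: bytes) -> bytes:
    """Serialise as: compressed length, uncompressed length, zlib payload.

    The two decimal length lines are an assumed layout for illustration.
    """
    compressed = zlib.compress(content)
    header = b"%d\n%d\n" % (len(compressed), len(content))
    return header + compressed


def block_from_bytes(data: bytes) -> bytes:
    """Parse the assumed layout and return the original content."""
    z_len_end = data.index(b"\n")
    len_end = data.index(b"\n", z_len_end + 1)
    z_len = int(data[:z_len_end])
    length = int(data[z_len_end + 1:len_end])
    payload = data[len_end + 1:len_end + 1 + z_len]
    if len(payload) != z_len:
        raise ValueError("compressed payload is truncated")
    content = zlib.decompress(payload)
    if len(content) != length:
        raise ValueError("uncompressed length does not match header")
    return content
```

Storing both lengths lets a reader skip over a block without decompressing it and check the result after decompressing it.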
    ------------------------------------------------------------
    revno: 32.1.4
    revision-id: john at arbash-meinel.com-20090304182042-yo1m7n2i2bpdldfl
    parent: john at arbash-meinel.com-20090304180240-xbl3a604h819an7y
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 12:20:42 -0600
    message:
      We at least have the rudimentary ability to encode and decode values.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 32.1.3
    revision-id: john at arbash-meinel.com-20090304180240-xbl3a604h819an7y
    parent: john at arbash-meinel.com-20090304170218-c3thty7hh2yfrnye
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: groupcompress
    timestamp: Wed 2009-03-04 12:02:40 -0600
    message:
      Add encode/decode base128 functions.

      Not entirely sure if I'll use them, but they may come in handy.
    modified:
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
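
For readers unfamiliar with base128 integers, here is a minimal sketch of one common variant: 7 bits per byte, least-significant group first, with the high bit set on every byte except the last. The email does not show the plugin's functions, so the names `encode_base128_int` and `decode_base128_int` and the byte order are assumptions.

```python
from typing import Tuple


def encode_base128_int(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while value >= 0x80:
        # Emit the low 7 bits and set the continuation bit.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_base128_int(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one integer; return (value, number of bytes consumed)."""
    value = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated base128 integer")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos - offset
        shift += 7
```

Because the decoder reports how many bytes it consumed, a length written this way can be followed directly by the bytes it describes, which is the kind of self-contained record the revno 34 message suggests as a next step.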
    ------------------------------------------------------------
    revno: 32.1.2
    revision-id: john at arbash-meinel.com-20090304170218-c3thty7hh2yfrnye
    parent: john at arbash-meinel.com-20090304165605-zbap3q69laok4o6p
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 11:02:18 -0600
    message:
      First cut at meta-info as text form.
    modified:
      errors.py                      errors.py-20080705181503-ccbxd6xuy1bdnrpu-7
      groupcompress.py               groupcompress.py-20080705181503-ccbxd6xuy1bdnrpu-8
      tests/test_groupcompress.py    test_groupcompress.p-20080705181503-ccbxd6xuy1bdnrpu-13
    ------------------------------------------------------------
    revno: 32.1.1
    revision-id: john at arbash-meinel.com-20090304165605-zbap3q69laok4o6p
    parent: john at arbash-meinel.com-20090304161119-wjb6l5idp2k9niwq
    committer: John Arbash Meinel <john at arbash-meinel.com>
    branch nick: internal_index
    timestamp: Wed 2009-03-04 10:56:05 -0600
    message:
      Fully remove the equivalence table for now.
    removed:
      equivalence_table.py           equivalence_table.py-20080723225607-fk4rlr7rm1wln8w4-1
-------------- next part --------------

Diff too large for email (1081 lines, the limit is 1000).