[PACKS] Performance opportunities.

John Arbash Meinel john at arbash-meinel.com
Thu Aug 30 18:05:24 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I haven't made these changes yet, but I've been profiling where the time
> goes during commit. Timings in this tree are on my laptop; the tree is
> an export of the HEAD of the mozilla sample tree we converted to bzr
> some time back - 550MB of data, 55K files. The baseline is 4m3s of user
> time, and 4m23 elapsed.
> 
> Three things so far stand out as things we don't /need/ to spend time
> on. 
> 
> One is having gzip files in the pack, rather than raw zlib. There is a
> massive difference here - GzipFile takes 86 seconds, using zlib directly
> a trivial implementation takes 38 seconds, to compress a 550MB tar. (the
> gzip command line takes 36 seconds). Possibly we can fix up GzipFile,
> but I have looked closely at it before, so I'm not convinced that it's
> worth doing this - packs are not zcattable, unlike .knit files which had
> no delimiters between gzip objects. So we need our own debug tools
> anyway. A rough and ready change to this shaved 30s off commit.
> 

Well, I know that zlib can directly decompress gzip streams. You just have to
pass the right initializers. I'm doing it in my "pyrex_knit_extract" branch. I
would guess that you could do the same to generate a gzip stream.

I found a pretty big benefit for working around GzipFile when doing
extractions. I'm guessing we could get something similar for doing it when
compressing.
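For reference, the "right initializers" are zlib's `wbits` parameter. Here is a minimal sketch (not the actual pyrex_knit_extract code) of producing and consuming gzip-framed streams with zlib directly, skipping GzipFile's pure-Python overhead:

```python
import zlib

# A gzip stream is just a raw deflate stream plus header/trailer metadata;
# wbits = 16 + MAX_WBITS asks zlib to emit/expect that gzip framing.
GZIP_WBITS = 16 + zlib.MAX_WBITS

def gzip_compress(data, level=6):
    # Produce a gzip-compatible stream without going through GzipFile.
    c = zlib.compressobj(level, zlib.DEFLATED, GZIP_WBITS)
    return c.compress(data) + c.flush()

def gzip_decompress(data):
    # Decompress a gzip stream directly with zlib.
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data) + d.flush()
```

The output round-trips with the stdlib gzip module, so existing readers keep working.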

> Secondly we sha the working tree twice on an initial commit (bzr init;
> bzr add; bzr commit) because everything is a miss - that's only
> ~3 seconds, but 3 seconds on a 263-second run is still > 1%.
> 

It may be a bit more than that, because we are reading each file twice, which
could certainly defeat the OS's caching. (We read each file once when building
up the inventory, and then a second time when we go to commit the texts.)

We've discussed this one a bit, and I think the answer is to change the
"DirState.update_entry()" call, so that it doesn't produce a sha1sum if it
doesn't have one. It takes a little bit of thinking to get that change to
propagate up the stack correctly. But it could be beneficial.

(In my mind, we might allow _iter_changes to return three possibilities:
definitely changed, definitely not changed, and 'maybe changed'.)
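A minimal sketch of that three-way answer, assuming `update_entry()` is allowed to skip producing a sha1 (names here are illustrative, not bzrlib's real API):

```python
from enum import Enum

class Change(Enum):
    UNCHANGED = 0       # stat fingerprint matches the cached entry
    CHANGED = 1         # content sha1 is known to differ
    MAYBE_CHANGED = 2   # stat differs but no sha1 was computed; caller decides

def classify(stat_matches, cached_sha1, current_sha1=None):
    # Hypothetical helper mirroring the proposal: when update_entry() has
    # not hashed the file, _iter_changes reports 'maybe changed' and the
    # commit code hashes the text only once, when it actually reads it.
    if stat_matches:
        return Change.UNCHANGED
    if cached_sha1 is None or current_sha1 is None:
        return Change.MAYBE_CHANGED
    if current_sha1 != cached_sha1:
        return Change.CHANGED
    return Change.UNCHANGED
```

The point of the third state is that the expensive hash is deferred to whoever is going to read the file content anyway.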

> Thirdly the way we store annotations has quite some overhead at the
> moment. Turning our knit storage to use the PlainFactory rather than the
> Annotated one saves 30 seconds.
> 
> So I have a prototype branch (it doesn't convert data, so its quite
> un-interoperable as yet) where I have commit at:
> 
> no-anno, zlib direct:
> real    3m24.990s
> user    3m1.215s
> sys     0m11.377s
> 
> no anno:
> real    3m50.336s
> user    3m34.897s
> sys     0m10.941s
> 
> baseline (my normal packs branch off of bzr.dev):
> real    4m23.884s
> user    4m3.963s
> sys     0m11.649s
> 
> Thats a 25% saving of userspace time.
> 
> Concretely, I plan to switch to using zlib directly in packs. I'll also
> look at making the annotation cache be separate and disable-able.

Getting the C version of PatienceDiff may provide a bigger benefit than just
disabling annotations - well, for commits after the first one; obviously for
the first commit, there are other benefits to be had. I think we could
genuinely decrease the cost of annotating a first commit: since you know that
all the lines must come from the given commit, it could be treated as one big
hunk, rather than having to add extra data to each line.
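To illustrate the two representations (a sketch, not bzrlib's actual storage format): an annotated knit tags every line with its origin revision, whereas a first commit can carry one origin for the whole text:

```python
def annotate_per_line(lines, revision_id):
    # Annotated-knit style: the origin revision is repeated on every line,
    # which is the per-line overhead being discussed.
    return [(revision_id, line) for line in lines]

def annotate_first_commit(lines, revision_id):
    # On an initial commit every line necessarily originates in that commit,
    # so a single (origin, lines) hunk carries the same information with no
    # per-line tagging and no diff against a parent text.
    return (revision_id, list(lines))
```

Reconstructing the per-line form from the one-hunk form is trivial when a later annotation actually needs it.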

> 
> I'm looking for critiques and 'good idea', 'bad idea' comments on this.
> As well as suggestions for other things we can do in the short time
> remaining before I'll need to start solidifying packs for 0.92 - when I'd
> like to release the first user-exposed format.
> 
> -Rob

I would be really interested to see a different delta operation for packs, but
the ones I have in mind would probably require getting rid of annotations first.

zlib/gzip... I'm okay with losing gzip, though I'm not 100% convinced we need
to. It does bloat the compressed streams a little bit, and is slightly slower.
It honestly shouldn't be much slower, since gzip is just zlib with a bit of
extra metadata in the stream; the fact that it *is* slower is probably just an
artifact of GzipFile. But then again, GzipFile lets us stick to the Python
stdlib, rather than implementing our own.

On the flip side, why not parameterize the compression? Then, when sending a
pack file over the wire, you could actually bzip2 it. Well, actually you would
want to send the raw texts and bzip2 the whole pack file, not just each
section. (Like what we do now for the bundle in merge directives.)
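As a sketch of what parameterized compression might look like (the registry and function names are hypothetical, not bzrlib code): storage defaults to zlib, and a transfer can recompress the payload with bz2 for the wire.

```python
import bz2
import zlib

# Hypothetical codec registry keyed by name, so the pack layer and the
# network layer can each pick a codec independently.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def recompress(payload, src="zlib", dst="bz2"):
    # Decode with the storage codec, re-encode with the transport codec.
    decompress = CODECS[src][1]
    compress = CODECS[dst][0]
    return compress(decompress(payload))
```

Recompressing the whole stream at once, as suggested above for wire transfers, gives the compressor more context than re-coding each section individually.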

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG1vjTJdeBCYSNAAMRAjU2AKDV/Pdp5SOK1K4A+szC/wTsBQQFtwCfSnkk
l7FEpSbMOO1XesIl/nxi/5I=
=zFZN
-----END PGP SIGNATURE-----


