Excess data size for a single revision

Mon Jan 23 11:22:37 UTC 2012

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> The pack is this:
> 
> -rw-rw-rw- 1 eliz eliz  28040427 Jan 19 06:35 36bfdda5be84a32615e6db8f9eaabed3.pack
> 
> I verified (by looking at .bzr.log) that there was no repacking
> since then.

So 28MB on disk vs. the ~58MB you said it downloaded. I can certainly
come up with a scenario where this could happen, namely:
  https://bugs.launchpad.net/bzr/+bug/402669

I believe the emacs main repository has a lot of people committing
directly to it, so you probably get a fair number of single-commit
pack files. Coupled with bug #402669, that means each of those
single-commit packs holds the fulltext of every text it contains, and
all of those fulltexts get transmitted. Once you have received the 35
revisions locally, the texts of any file that was changed multiple
times can be combined, which would explain why the pack on disk is
smaller than the download.
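
As a rough illustration of why that matters, here is a toy model in
Python. It is not bzr's actual storage or wire format; the file
contents, the 500-line size, and the use of zlib plus unified diffs are
all assumptions made up for the sketch. It just compares sending 35
independent fulltexts of one file against one fulltext plus 34 deltas:

```python
import difflib
import random
import zlib

random.seed(0)

def make_line(i):
    # Pseudo-random content so zlib cannot shrink the file to nothing.
    noise = "".join(random.choice("abcdefghij") for _ in range(40))
    return "line %d %s\n" % (i, noise)

# Hypothetical 500-line file; 35 revisions, each changing one line.
revisions = [[make_line(i) for i in range(500)]]
for rev in range(1, 35):
    new = list(revisions[-1])
    new[rev] = "changed in rev %d\n" % rev
    revisions.append(new)

def compressed_size(lines):
    return len(zlib.compress("".join(lines).encode("utf-8")))

# Case 1: every revision travels as its own compressed fulltext.
fulltext_total = sum(compressed_size(r) for r in revisions)

# Case 2: one fulltext, then a line delta against the previous revision.
delta_total = compressed_size(revisions[0])
for old, new in zip(revisions, revisions[1:]):
    delta = list(difflib.unified_diff(old, new, n=0))
    delta_total += compressed_size(delta)

print("35 independent fulltexts: %d bytes" % fulltext_total)
print("1 fulltext + 34 deltas:   %d bytes" % delta_total)
```

The exact numbers mean nothing, but the first total grows with the
number of revisions times the file size, while the second grows only
with the size of the changes.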

> 
>> 2) Use the name of that file to then inspect the associated
>> index files (tix = text content, rix = revision, cix = inventory
>> stuff, iix/six are probably not very interesting). For example,
>> my most recent file in bzr is:
>> c0ba9a41c20d1b447d3b603361b63bbf.pack
>> 
>> You can use
>> 
>> head -n5 .bzr/repository/indices/c0ba9a41c20d1b447d3b603361b63bbf.tix
>> 
>> To get the summary information, and you can use this to get the
>> detail:
>> 
>> bzr dump-btree [--raw] .bzr/repository/indices/c0ba9a41c20d1b447d3b603361b63bbf.tix
>> 
>> Note that it will probably be a bit verbose, but if you look
>> around in it, you can see how many files are affected, etc. In my
>> case, a 2.7MB bzr pack file had 1712 entries in the .tix (1700
>> files were affected), 549 entries in .rix (549 total revisions),
>> and 2,159 entries in .cix (which has to do with inventory
>> management.)
>> 
>> With the raw data, you can start working out what the size on
>> disk actually consists of.
> 
> What am I looking for, though?  E.g., the .rix index corresponding
> to the above pack has 35 revisions, while the corresponding .tix
> file has 4088 texts.  Is the latter unusually large?

Averaged across a lot of histories (bzr, mysql, linux kernel, emacs I
think), a good heuristic is <10 texts changed per commit. The pack above
averages about 117 texts per commit (4088 texts / 35 revisions), or
more than 10x normal. So yes, it is larger than expected.

repo    # texts   # revs   t/r
bzr      172249    63446   2.7
emacs    264859   118524   2.2
mysql    388608    74779   5.2
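
If you want to redo the arithmetic for another pack, a minimal Python
sketch follows. The table figures and the 4088 texts / 35 revisions
come from this thread; the function and variable names are just
placeholders for the example:

```python
# (repo, texts, revisions) taken from the table above.
history_stats = [
    ("bzr", 172249, 63446),
    ("emacs", 264859, 118524),
    ("mysql", 388608, 74779),
]

def texts_per_rev(texts, revs):
    """Average number of texts changed per revision."""
    return texts / revs

for name, texts, revs in history_stats:
    print("%-6s %.1f texts/rev" % (name, texts_per_rev(texts, revs)))

# The pack under discussion: 4088 entries in .tix, 35 in .rix.
pack_ratio = texts_per_rev(4088, 35)
emacs_ratio = texts_per_rev(264859, 118524)
print("pack   %.1f texts/rev (%.0fx the emacs average)"
      % (pack_ratio, pack_ratio / emacs_ratio))
```

Substitute the .tix and .rix entry counts of whichever pack you are
looking at in the last call.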

Now that is averaged over a lot of history, and development workflow
impacts this a lot. Merges, in particular, can swing it high or low.
MySQL tends to do a lot more merges that touch the same files, so a
merge creates one commit that touches lots of files. bzr tends to do
merges that are orthogonal, so the merge commit itself doesn't
introduce new content and looks more like a commit that doesn't
change much.

> 
>> 3) If you want to test the fetch again, you can create a new 
>> repository, and branch your old revision into it (so it shouldn't
>> copy any new data) and then do the fetch again. So something
>> like:
>> 
>> bzr branch -r 106888 . ../../somewhere-not-in-the-shared-repo --no-tree
>> cd ../../somewhere-not-in-the-shared-repo
>> bzr pull bzr+ssh://eliz@bzr.savannah.gnu.org/emacs/trunk -Dhpss
> 
> Will do, thanks.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8dQv0ACgkQJdeBCYSNAAMc1ACgp+1JR7+8DU0RK47Epz+Gh8vH
YRYAn0tSp4QmYT/LVzCwG3Vqc8uioSeN
=sIXn
-----END PGP SIGNATURE-----