[MERGE] Packs. Kthxbye.
John Arbash Meinel
john at arbash-meinel.com
Wed Oct 17 23:39:52 BST 2007
John Arbash Meinel has voted comment.
Status is now: Semi-approved
Comment:
General design question:
Are the indexes purely supplemental data (can they be rebuilt from the
.pack file)? I know for Knits this is not the case, but I was hoping you
had a chance to fix that.
Why did you feel the need to implement both a Knit1 format and a Knit3
format? It would make more sense to me to just have one format going
forward. Though I suppose the time to convert from K1 => K3 might be a
bit slow.
+    def get_format_string(self):
+        """See RepositoryFormat.get_format_string()."""
+        return "Bazaar Experimental no-subtrees\n"
+
+    def get_format_description(self):
+        """See RepositoryFormat.get_format_description()."""
+        return "Experimental no-subtrees"
Obviously you'll want to have a name other than 'experimental', and these
strings should end up having the version of Bazaar that introduced them.
I'm sure you know, I just wanted to remind you about it.
Are we going to call these just 'pack'? (bzr init --format=pack)
  def insert_data_stream(self, stream):
+        """XXX What does this really do?
+
+        Is it a substitute for fetch?
+        Should it manage its own write group?
+        """
^- This is part of Andrew's new work to stream data across using the
smart server.
So, AFAICT it is indeed a substitute for fetch. You ask the source for a
stream, and hand that off to the target.
+    def _find_file_ids_from_xml_inventory_lines(self, line_iterator,
+        revision_ids):
^- Was there a real need to factor this out of
fileids_altered_by_revision_ids(self, revision_ids)?
Is it re-used somewhere? Or is this just to decrease the level of
indenting?
+class Pack(object):
+    """An in memory proxy for a pack and its indices.
+
+    This is a base class that is not directly used, instead the classes
+    ExistingPack and NewPack are used.
+    """
+
...
+
+    def inventory_index_name(self, name):
+        """The inv index is the name + .iix."""
+        return self.index_name('inventory', name)
^- It would probably be good to have a
    def index_name(self, kind, name):
        raise NotImplementedError
In general, I prefer to see any functions the abstract class will use be
at least defined, since it makes it more obvious (you don't have to
search for all the implementations and see if they agree on how it
should work).
Actually, I don't even see the implementation of it on ExistingPack, so
I'm wondering if these should even exist on the base Pack object.
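Sketched concretely (the class shapes and the suffix mapping are assumptions pieced together from snippets elsewhere in this review, not the submitted code), the pattern being asked for looks like:

```python
class Pack(object):
    """An in-memory proxy for a pack and its indices (sketch)."""

    def index_name(self, kind, name):
        """Map an index kind and pack name to an index file name.

        Declared here so the contract is visible on the base class;
        subclasses must override it.
        """
        raise NotImplementedError(self.index_name)

    def inventory_index_name(self, name):
        """The inv index is the name + .iix."""
        return self.index_name('inventory', name)


class ExistingPack(Pack):
    # suffix mapping inferred from the .rix/.iix/.tix/.six suffixes
    # seen later in this patch
    _suffixes = {'revision': '.rix', 'inventory': '.iix',
                 'text': '.tix', 'signature': '.six'}

    def index_name(self, kind, name):
        return name + self._suffixes[kind]
```

With the base method declared, a subclass that forgets to implement index_name fails loudly with NotImplementedError rather than with a confusing AttributeError.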
+class NewPack(Pack):
+ """An in memory proxy for a pack which is being created."""
...
+    def __init__(self, upload_transport, index_transport, pack_transport,
+        upload_suffix=''):
+        """Create a NewPack instance.
+
+        :param upload_transport: A writable transport for the pack to be
+            incrementally uploaded to.
+        :param index_transport: A writable transport for the pack's
+            indices to be written to when the pack is finished.
+        :param pack_transport: A writable transport for the pack to be
+            renamed to when the upload is complete. This *must* be the
+            same as upload_transport.clone('../packs').
^- Passing in 3 separate transport objects seems a bit odd to me. I
suppose this is because you want to stage the files in 'upload' and then
rename them into 'packs', and then write the index? Though it would seem
to me that you should upload the file and the index, and then rename
them into place. (Obviously the .pack before the .*ix.)
Also, passing in a target transport which *must* be clone('../packs')
seems a bit odd. Why not clone it yourself if it must be in a fixed
relative position?
+        # - refresh the pack-list to see if the pack is now absent
+        self.upload_transport.rename(self.random_name,
+            '../packs/' + self.name + '.pack')
+        self._state = 'finished'
^- I realize that this works, but I would honestly prefer it if
Transports were not allowed to use '..' for anything but cloning.
Since you are hard-coding the transport locations anyway, you could just
take an 'upload_transport', and then create a
    self.parent_transport = upload_transport.clone('..')
    self.pack_transport = self.parent_transport.clone('packs')
And then your rename becomes
    self.parent_transport.rename(self.upload_name + self.random_name,
        'packs/' + self.name + '.pack')
Just a thought.
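The suggested rearrangement can be sketched with a toy stand-in for bzrlib's Transport. FakeTransport and its clone/rename are minimal hypothetical versions, just enough to make the path arithmetic visible; they are not the real bzrlib API:

```python
import posixpath

class FakeTransport(object):
    """Toy stand-in for a Transport rooted at `base`."""

    def __init__(self, base, log=None):
        self.base = posixpath.normpath(base)
        self.log = log if log is not None else []

    def clone(self, relpath):
        # cloning is the only place '..' ever appears
        return FakeTransport(posixpath.join(self.base, relpath), self.log)

    def rename(self, rel_from, rel_to):
        # record the paths a real transport would touch
        self.log.append((posixpath.join(self.base, rel_from),
                         posixpath.join(self.base, rel_to)))

upload_transport = FakeTransport('.bzr/repository/upload')
parent_transport = upload_transport.clone('..')
# the packs transport is derived, not passed in as a third argument
pack_transport = parent_transport.clone('packs')

# the rename itself no longer needs '..' in its arguments:
parent_transport.rename('upload/a1b2.random', 'packs/deadbeef.pack')
```

The point of the sketch: deriving parent_transport and pack_transport once, up front, removes both the '..' from the rename call and the *must*-be-clone('../packs') contract on the constructor.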
+    def add_writable_index(self, index, pack):
+        """Add an index which is able to have data added to it.
+
+        :param index: An index from the pack parameter.
+        :param pack: A Pack instance.
+        """
+        assert self.add_callback is None
^- Why does add_callback need to be None at this point? If there is a
specific reason, it would be good to have it documented in the
docstring.
Also, it isn't 100% clear why you have an AggregatedIndex on top of a
CombinedIndex. I'm sure there is a reason, but they sure sound like they
would be the same class.
It would be nice to see a docstring at the top of
bzrlib/repofmt/pack_repo.py which defines the overall layout of a pack
repository. For example, it seems like you have a directory with
'packs', a directory with 'indexes', and a directory for uploading.
Why aren't the indexes next to the packs?
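A sketch of the sort of module docstring being asked for. The directory roles are inferred from comments in this thread; the exact names (especially 'indices') are assumptions:

```python
# Hypothetical top-of-module docstring for bzrlib/repofmt/pack_repo.py
PACK_REPO_LAYOUT = """\
Repository format storing data in .pack files with external indexes.

Layout under .bzr/repository/:

  upload/   in-progress packs, written under random names and renamed
            into packs/ once complete
  packs/    finished, immutable .pack files
  indices/  the .rix/.iix/.tix/.six indexes, one set per pack, written
            only after the pack itself is in place
"""
```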
+class RepositoryPackCollection(object):
+ """Management of packs within a repository."""
+
+    def __init__(self, repo, transport, index_transport, upload_transport,
+        pack_transport):
+        """Create a new RepositoryPackCollection.
+
+        :param transport: Addresses the repository base directory
+            (typically .bzr/repository/).
+        :param index_transport: Addresses the directory containing
+            indexes.
+        :param upload_transport: Addresses the directory into which packs
+            are written while they're being created.
+        :param pack_transport: Addresses the directory of existing
+            complete packs.
+        """
+        self.repo = repo
+        self.transport = transport
+        self._index_transport = index_transport
+        self._upload_transport = upload_transport
+        self._pack_transport = pack_transport
+        self._suffix_offsets = {'.rix':0, '.iix':1, '.tix':2, '.six':3}
+        self.packs = []
+        # name:Pack mapping
+        self._packs = {}
^- You are using 'self.packs', which is a list of packs, and
'self._packs', which is a dictionary mapping the pack name to the pack
objects. I would at least call them "self.packs" and "self._pack_map".
Why is one public and the other private?
Why are repo, transport, packs, and the indexes public?
Why do you have:
+    def all_packs(self):
+        """Return a list of all the Pack objects this repository has.
+
+        Note that an in-progress pack being created is not returned.
+
+        :return: A list of Pack objects for all the packs in the
+            repository.
+        """
+        result = []
+        for name in self.names():
+            result.append(self.get_pack_by_name(name))
+        return result
if you made self.packs() public? It seems silly to go through
self.names() and another get_pack_by_name() call for all of that.
Do we need 'get_pack_by_name()' and "names()" at all?
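One way to collapse the list/dict duplication being questioned here. This is a hypothetical simplification (class and method names invented), assuming every loaded pack lives in a single name-to-Pack map:

```python
class PackCollectionSketch(object):
    """Keep a single name -> Pack map and derive everything else."""

    def __init__(self):
        self._pack_map = {}

    def add_pack(self, name, pack):
        self._pack_map[name] = pack

    def all_packs(self):
        # no round-trip through names() + get_pack_by_name()
        return list(self._pack_map.values())
```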
During autopack:
+        pack_distribution = self.pack_distribution(total_revisions)
+        existing_packs = []
+        for pack in self.all_packs():
+            revision_count = pack.get_revision_count()
+            if revision_count == 0:
+                # revision less packs are not generated by normal
+                # operation, only by operations like sign-my-commits, and
+                # thus will not tend to grow rapidly or without bound like
+                # commit containing packs do - leave them alone as packing
+                # them really should group their data with the relevant
+                # commit, and that may involve rewriting ancient history -
+                # which autopack tries to avoid. Alternatively we could
+                # not group the data but treat each of these as having a
+                # single revision, and thus add one revision for each to
+                # the total revision count, to get a matching
+                # distribution.
+                continue
+            existing_packs.append((revision_count, pack))
^- This would seem to indicate that counting by number of revisions may
not be the best method. I would probably treat them as 1-count files,
though you make some decent points here.
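The "1-count" alternative is a small change to the loop above; sketched here with a stand-in pack object (get_revision_count is the only method assumed, everything else is invented for the illustration):

```python
def weigh_packs(packs):
    """Give revision-less packs a weight of 1 instead of skipping
    them, so they still participate in the size distribution."""
    existing_packs = []
    for pack in packs:
        revision_count = pack.get_revision_count()
        existing_packs.append((max(revision_count, 1), pack))
    return existing_packs


class StubPack(object):
    """Minimal stand-in exposing only get_revision_count()."""

    def __init__(self, revision_count):
        self._count = revision_count

    def get_revision_count(self):
        return self._count
```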
Your "plan_autopack_combinations" seems interesting. I'm not sure if the
algorithm is 100% correct.
Take this example: existing_packs = 50, 40, 12, 7, 3
50+40+12+7+3 = 112
pack_distribution = 100, 10, 1, 1
I tried to walk through what it would do, but it seems to do very weird
things when trying to add them in.
Also, in the above scenario, you are doing
+            if next_pack_rev_count >= pack_distribution[0]:
But you are never decrementing pack_distribution[0], which means that in
the above you will pack everything into the '100' pack, even though it
overflows a lot. (You never get a pack > 100, so you always stay in the
+                # add the revisions we're going to add to the next output pack
portion.)
At the very least, I think when you add a pack to the current pack, you
need to decrement the pack distribution, or something. I could just be
misunderstanding what your algorithm was trying to do.
If you are trying not to split existing packs, I would have thought
100, 10, 1, 1
would have at least tried to do
90, 12, 10
or maybe
112, 10
# (overflowing slightly on one is better than not packing)
I do think the algorithm tries to pack big ones into bigger ones before
trying to pack small ones in. Which is probably correct. (Small ones ==
recent ones, and you want recent ones to be fast to access, with small
indexes.)
But I do think you need to consider a few more arrangements of packs,
and make sure it is doing what you think it is. (Maybe it is, but it
isn't doing what *I* think it should be doing.)
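To make the worked example concrete, here is a sketch of a digit-based pack_distribution (consistent with the 112 -> 100, 10, 1, 1 example above) plus the decrement-as-you-go combination strategy suggested in this review. Both functions are illustrations of the review comment, not the code under review:

```python
def pack_distribution(total_revisions):
    """One target pack per decimal digit: 112 -> [100, 10, 1, 1]."""
    digits = []
    n = total_revisions
    while n:
        digits.append(n % 10)
        n //= 10
    sizes = []
    for exponent, digit in reversed(list(enumerate(digits))):
        sizes.extend([10 ** exponent] * digit)
    return sizes


def plan_combinations(existing_sizes, distribution):
    """Greedy plan that consumes the distribution as buckets fill.

    Largest packs go into the largest bucket first; once a bucket
    reaches (or slightly overflows) its target, move on to the next
    target size instead of piling everything into the first one.
    """
    plans = []
    targets = list(distribution)
    current, current_count = [], 0
    for size in sorted(existing_sizes, reverse=True):
        current.append(size)
        current_count += size
        if targets and current_count >= targets[0]:
            plans.append(current)
            current, current_count = [], 0
            targets.pop(0)
    if current:
        plans.append(current)
    return plans
```

With the example above, plan_combinations([50, 40, 12, 7, 3], [100, 10, 1, 1]) yields [[50, 40, 12], [7, 3]]: one slightly-overflowing 102-revision pack and one 10-revision pack, close to the "112, 10" outcome suggested.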
+ 100 every 100, 1000 every 1000
....
Caching and writeing of data
============================
^- At the very end you have this, which is pretty obviously not correct
(though you didn't write it).
Sorry I didn't get past pack_distribution(). I'll try to review more
later.
For details, see:
http://bundlebuggy.aaronbentley.com/request/%3C1192614619.31212.3.camel%40lifeless-64%3E