[MERGE] Packs. Kthxbye.
Robert Collins
robertc at robertcollins.net
Fri Oct 19 03:35:16 BST 2007
On Wed, 2007-10-17 at 18:39 -0400, John Arbash Meinel wrote:
> John Arbash Meinel has voted comment.
> Status is now: Semi-approved
> Comment:
> General design question:
> Are the indexes purely supplemental data (can they be rebuilt from the
> .pack file)? I know for Knits this is not the case, but I was hoping
> you had a chance to fix that.
No. I wanted to do so but time was not sufficient.
> why did you feel the need to implement both a Knit1 format and a Knit3
> format? It would make more sense to me to just have 1 format going
> forward. Though I suppose the time to convert from K1 => K3 might be a
> bit slow.
Having only a Knit3 format would make it a watershed event for early
adopters. Packs are not yet suitable as a default format, so this would
impede experimentation.
> + def get_format_string(self):
> + """See RepositoryFormat.get_format_string()."""
> + return "Bazaar Experimental no-subtrees\n"
> +
> + def get_format_description(self):
> + """See RepositoryFormat.get_format_description()."""
> + return "Experimental no-subtrees"
>
> Obviously you'll want to have a name other than 'experimental' and these
> strings should end up having the version of Bazaar that introduced them.
> I'm sure you know, I just wanted to remind you about it.
> Are we going to call these just 'pack'? (bzr init --format=pack)
I think 'pack' is ok, though it's somewhat generic, and only the first
of many I expect. Perhaps 'pack-knits' or something would be better.
>
> def insert_data_stream(self, stream):
> + """XXX What does this really do?
> +
> + Is it a substitute for fetch?
> + Should it manage its own write group ?
> + """
>
> ^- This is part of Andrew's new work to stream data across using the
> smart server.
> So, AFAICT it is indeed a substitute for fetch. You ask the source for
> a stream, and hand that off to the target.
This was a hint to get Andrew to write a docstring :).
> + def _find_file_ids_from_xml_inventory_lines(self, line_iterator,
> + revision_ids):
> ^- Was there a real need to factor this out of
> fileids_altered_by_revision_ids(self, revision_ids)?
> Is it re-used somewhere? Or is this just to decrease the level of
> indenting?
Yes, it's used by the pack create_pack_from_packs function.
>
> +class Pack(object):
> + """An in memory proxy for a pack and its indices.
> +
> + This is a base class that is not directly used, instead the classes
> + ExistingPack and NewPack are used.
> + """
> +
> ...
>
> +
> + def inventory_index_name(self, name):
> + """The inv index is the name + .iix."""
> + return self.index_name('inventory', name)
>
> ^- It would probably be good to have a
>
> def index_name(self, kind, name):
> raise NotImplementedError
>
> In general, I prefer to see any functions the abstract class will use
> be at least defined. Since it makes it more obvious (you don't have to
> search for all the implementations, and see if they agree on how it
> should work.)
> Actually, I don't even see the implementation of it on ExistingPack,
> so I'm wondering if these should even exist on the base Pack object.
The overall refactoring I was heading towards, and have not reached
yet, was to remove all outside users of these functions, so that they
become just 'revision_index_name()', at which point they would be
supported on both NewPack and ExistingPack. I'm ok with us adding a
NotImplementedError stub.
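Something like this on the base class would make the contract explicit
(a sketch only, not the final code):

    class Pack(object):
        """An in memory proxy for a pack and its indices."""

        def index_name(self, kind, name):
            """Return the index name for the 'kind' index of pack 'name'.

            Subclasses (NewPack, ExistingPack) must implement this.
            """
            raise NotImplementedError(self.index_name)

        def inventory_index_name(self, name):
            """The inv index is the name + .iix."""
            return self.index_name('inventory', name)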
>
>
> +class NewPack(Pack):
> + """An in memory proxy for a pack which is being created."""
> ...
>
> +    def __init__(self, upload_transport, index_transport, pack_transport,
> +                 upload_suffix=''):
> +        """Create a NewPack instance.
> +
> +        :param upload_transport: A writable transport for the pack to be
> +            incrementally uploaded to.
> +        :param index_transport: A writable transport for the pack's
> +            indices to be written to when the pack is finished.
> +        :param pack_transport: A writable transport for the pack to be
> +            renamed to when the upload is complete. This *must* be the
> +            same as upload_transport.clone('../packs').
>
> ^- Passing in 3 separate transport objects seems a bit odd to me. I
> suppose this is because you want to stage the files in 'upload' and
> then rename them into 'pack', and then write the index? Though it
> would seem to me that you should upload the file and the index, and
> then rename them into place. (Obviously the .pack before the .*ix).
Well, indices before the pack, because today the indices are required
data. Once the indices are regenerable, the pack should go first,
because anyone can then generate an index on demand.
> Also, passing in a target transport which *must* be clone('../packs')
> seems a bit odd.
> Why not clone it yourself if it must be in a fixed relative position?
pack_transport is used by base class methods. We could clone it
ourselves, but actually I'd like to fix the Transport.relpath bug that
makes it impossible to get a relative path like '../foo' out of
'PREFIX/bar'.relpath('PREFIX/foo').
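To illustrate the behaviour I want (a hand-rolled sketch, not the
actual Transport API):

    def relpath(base, target):
        """Relative path from base to target, allowing '..' segments."""
        base_parts = base.strip('/').split('/')
        target_parts = target.strip('/').split('/')
        # Strip the common prefix, then climb out of what remains.
        while (base_parts and target_parts
               and base_parts[0] == target_parts[0]):
            base_parts.pop(0)
            target_parts.pop(0)
        return '/'.join(['..'] * len(base_parts) + target_parts)

    relpath('PREFIX/bar', 'PREFIX/foo')  # -> '../foo'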
> + # - refresh the pack-list to see if the pack is now absent
> + self.upload_transport.rename(self.random_name,
> + '../packs/' + self.name + '.pack')
> + self._state = 'finished'
>
> ^- I realize that this works, but I would honestly prefer it if
> Transports were not allowed to use '..' for anything but cloning.
> Since you are hard-coding the transport locations anyway, you could
> just take an 'upload_transport', and then create a
> self.parent_transport = upload_transport.clone('..')
> self.pack_transport = self.parent_transport.clone('packs')
>
> And then your rename becomes
> self.parent_transport.rename(self.upload_name + self.random_name,
> 'packs/'+self.name+'.pack')
>
> Just a thought.
I'd prefer
    self.upload_transport.rename(self.random_name,
        self.upload_transport.relpath(self.pack_transport.base) +
        self.file_name())
> + def add_writable_index(self, index, pack):
> + """Add an index which is able to have data added to it.
> +
> + :param index: An index from the pack parameter.
> + :param pack: A Pack instance.
> + """
> + assert self.add_callback is None
>
> ^- Why does add_callback need to be None at this point? If there is a
> specific reason, it would be good to have it documented in the
> docstring.
It was a guard to prevent having two writable indices at once - which
would indicate a bug.
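I'm happy to document that; roughly (a sketch of the docstring, names
as in the quoted code):

    def add_writable_index(self, index, pack):
        """Add an index which is able to have data added to it.

        There can be at most one writable index at a time; a second one
        appearing before the first is cleared indicates a bug, so we
        assert rather than silently replace it.

        :param index: An index from the pack parameter.
        :param pack: A Pack instance.
        """
        assert self.add_callback is None, \
            'there is already a writable index registered'
        # ... (registration of the index elided)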
> Also, it isn't 100% clear why you have an AggregatedIndex on top of a
> CombinedIndex. I'm sure there is a reason, but they sure sound like
> they would be the same class.
CombinedIndex is the GraphIndex layer. AggregatedIndex holds the state
for the mapping back to knits.
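Very roughly, the layering looks like this (a simplified sketch; the
real classes carry more state):

    from bzrlib.index import CombinedGraphIndex

    class AggregatedIndex(object):
        """Knit-mapping state layered over the GraphIndex machinery."""

        def __init__(self):
            # The GraphIndex layer: one query interface fanned out
            # over the per-pack indices.
            self.combined_index = CombinedGraphIndex([])
            # Knit-mapping state: where new records are written, and
            # how index entries map back to data in the .pack files.
            self.add_callback = None
            self.data_access = None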
> It would be nice to see a docstring at the top of
> bzrlib/repofmt/pack_repo.py which defines the overall layout of a
> pack repository.
> For example, it seems like you have a directory with 'packs', a
> directory with 'indexes', and a directory for uploading.
> Why aren't the indexes next to the packs?
Well, I intended to have indices be regenerable, so it made sense to
me to put the derived data in a different directory. I thought I noted
much of this in the repository format documentation, but I agree that
there should be more of a docstring. I'm not sure how much should be in
the module docstring, and how much in the format/class docstrings.
> +class RepositoryPackCollection(object):
> + """Management of packs within a repository."""
> +
> +    def __init__(self, repo, transport, index_transport,
> +                 upload_transport, pack_transport):
> ...
> ^- You are using 'self.packs' which is a list of packs and
> 'self._packs' which is a dictionary mapping the pack name to the pack
> objects. I would at least call them "self.packs" and
> "self._pack_map". Why is one public and the other private?
> Why are repo, transport, packs, and the indexes public?
> Why are repo, transport, packs, and the indexes public?
Public is rather vague :). Certainly, they are not marked as private,
but they aren't public to users of bzrlib, as they are on an internal
object only exposed as a Repository's _packs attribute.
>
> Why do you have:
>
> +    def all_packs(self):
> +        """Return a list of all the Pack objects this repository has.
> +
> +        Note that an in-progress pack being created is not returned.
> +
> +        :return: A list of Pack objects for all the packs in the
> +            repository.
> +        """
> +        result = []
> +        for name in self.names():
> +            result.append(self.get_pack_by_name(name))
> +        return result
>
> if you made self.packs() public?
> It seems silly to go through self.names() and another
> get_pack_by_name() call for all of that.
Well...
> Do we need 'get_pack_by_name()' and "names()" at all?
Possibly less so now than before.
There should be one and only one place where pack ordering is done.
(Currently the ordering is by name, which is bad; it should be by
size, smallest first, as the smallest packs are most likely to contain
new data.)
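i.e. something like (a sketch, assuming a hypothetical pack_size
attribute):

    def _sorted_packs(self):
        # Smallest first: small packs are the most likely to contain
        # newly committed data.
        return sorted(self.all_packs(), key=lambda pack: pack.pack_size)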
> During autopack:
>
> if revision_count == 0:
> # revision less packs are not generated by normal operation,
> # only by operations like sign-my-commits, and thus will not
> # tend to grow rapidly or without bound like commit containing
> # packs do - leave them alone as packing them really should
> # group their data with the relevant commit, and that may
> # involve rewriting ancient history - which autopack tries to
> # avoid. Alternatively we could not group the data but treat
> # each of these as having a single revision, and thus add
> # one revision for each to the total revision count, to get
> # a matching distribution.
> continue
> existing_packs.append((revision_count, pack))
>
> ^- This would seem to indicate that counting by number of revisions
> may not be the best method. I would probably treat them as 1 count
> files, though you make some decent points here.
I'm open - as that comment indicates. I don't think this sort of policy
tweaking needs to be decided right now though.
> Your "plan_autopack_combinations" seems interesting. I'm not sure if the
> algorithm is 100% correct.
> Take this example: existing_packs = 50, 40, 12, 7, 3
> 50+40+12+7+3 = 112
> pack_distribution = 100, 10, 1, 1
>
> I tried to walk through what it would do, but it seems to do very weird
> things
> when trying to add them in.
There are 112 revisions, so the ideal histogram is [1x100, 1x10, 2x1]
(given by pack_distribution(112)). The actual assigned operations will
be different, because we want to avoid shuffling large amounts of
data: it's better to keep an existing overly large pack than to split
it, and having an overly large pack reduces round trips, so it is good
to keep it.
What we'll do is start with the biggest bucket and fill it:
  50 + 40 + 12 = 102.
Then the next:
  7 + 3 = 10.
So we'll end up with one 102 pack and one 10 pack, as expected.
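In sketch form (a standalone illustration, not the bzrlib code itself):

    def plan_combinations(existing_packs, pack_distribution):
        """Fill the distribution buckets, biggest packs first.

        :param existing_packs: (revision_count, name) tuples.
        :param pack_distribution: e.g. [100, 10, 1, 1] for 112 revisions.
        :return: [[total_revision_count, [pack names]], ...]
        """
        pack_operations = [[0, []]]
        for rev_count, name in sorted(existing_packs, reverse=True):
            # Accumulate into the current output pack rather than
            # decrementing pack_distribution[0].
            pack_operations[-1][0] += rev_count
            pack_operations[-1][1].append(name)
            if (pack_distribution and
                    pack_operations[-1][0] >= pack_distribution[0]):
                # Bucket filled (possibly overfull); move to the next.
                pack_distribution.pop(0)
                pack_operations.append([0, []])
        return [op for op in pack_operations if op[1]]

    plan_combinations([(50, 'a'), (40, 'b'), (12, 'c'), (7, 'd'), (3, 'e')],
                      [100, 10, 1, 1])
    # -> [[102, ['a', 'b', 'c']], [10, ['d', 'e']]]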
> Also, in the above scenario, you are doing
> +            if next_pack_rev_count >= pack_distribution[0]:
> But you are never decrementing pack_distribution[0], which means that
> in the above you will pack everything into the '100' pack, even
> though it overflows a lot. (you never got a pack > 100, so you always
> stay in the
> +            # add the revisions we're going to add to the next output pack
> portion.)
We don't modify pack_distribution[0], because we can't tell what we
were aiming for if we do that; instead we accumulate into
pack_operations[-1][0]. We special case over-packed packs to shrink
the size of the next pack we create, though perhaps we don't need to
do that - it just seemed like a good idea at the time.
> At the very least, I think when you add a pack to the current pack,
> you need to decrement the pack distribution, or something.
> I could just be misunderstanding what your algorithm was trying to do.
> If you are trying not to split existing packs, I would have thought
> 100, 10, 1, 1
> would have at least tried to do
> 90, 12, 10
> or maybe
> 112, 10
> # (overflowing slightly on one is better than not packing)
>
> I do think the algorithm tries to pack big ones into bigger ones
> before trying to pack small ones in. Which is probably correct.
> (small ones == recent ones, and you want recent ones to be fast to
> access, with small indexes.)
>
> But I do think you need to consider a few more arrangements of packs,
> and make sure it is doing what you think it is. (Maybe it is, but it
> isn't doing what *I* think it should be doing.)
Hmm, I think the tests for it are solid; if you have specific ones you
want to add to convince yourself it's correct, that's fine by me :).
> + 100 every 100, 1000 every 1000
>
> ....
>
> Caching and writeing of data
> ============================
>
> ^- At the very end you have this, which is pretty obviously not
> correct (though you didn't write it).
>
> Sorry I didn't get past pack_distribution(). I'll try to review more
> later.
No probs, it is a big patch :).
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.