[MERGE] Packs. Kthxbye.

Robert Collins robertc at robertcollins.net
Fri Oct 19 05:40:59 BST 2007


On Fri, 2007-10-19 at 13:11 +1000, Martin Pool wrote:
> ok, I've read the whole patch.  I could not spot the intentional
> mistake, if any.  The most important things, which are pretty
> important but need not block landing at least as an experimental
> format, are already filed as bugs: having readers cope with pack
> rearrangements, and avoiding ever-growing duplicated data.
> 
> === added file 'bzrlib/repofmt/pack_repo.py'
> --- bzrlib/repofmt/pack_repo.py	1970-01-01 00:00:00 +0000
> +++ bzrlib/repofmt/pack_repo.py	2007-10-17 09:39:41 +0000
> @@ -0,0 +1,1709 @@
> +class Pack(object):
> +    """An in memory proxy for a pack and its indices.
> 
> Odd to use 'indices' in some places and 'indexes' in others.  Either is ok,
> indexes might be more obvious.  I dread having an object with .indices
> and .indexes :)

I prefer indices.


> +class ExistingPack(Pack):
> +    """An in memory proxy for an exisiting .pack and its disk indices."""
> +
> 
> Worth saying, slightly redundantly, that these are readonly?


Perhaps. I'd be inclined not to though.

> +    def __init__(self, pack_transport, name, revision_index, inventory_index,
> +        text_index, signature_index):
> +        """Create an ExistingPack object.
> +
> +        :param pack_transport: The transport where the pack file resides.
> +        :param name: The name of the pack on disk in the pack_transport.
> +        """
> +        Pack.__init__(self, revision_index, inventory_index, text_index,
> +            signature_index)
> +        self.name = name
> +        self.pack_transport = pack_transport
> +        assert None not in (revision_index, inventory_index, text_index,
> +            signature_index, name, pack_transport)
> +
> +    def __eq__(self, other):
> +        return self.__dict__ == other.__dict__
> +
> +    def __ne__(self, other):
> +        return not self.__eq__(other)
> 
> I think on complex objects it's better to use an explicit method name that says
> just what is being compared.  Otherwise, there is a tendency to
> compare the whole dict
> for ExistingPack, but just some attributes for others and a bug-prone situation
> results.  If it's intended for use only in testing then spelling out the
> method name is not too burdensome.

I agree that the situation with __eq__ etc can be complex and lead to
confusion with non-trivial objects. However, ExistingPack and GraphIndex
both act as value objects - they are immutable with only public query
methods.

In this case, the need to compare the indices *and* the pack dict
together makes a specific test method somewhat ugly because it's going
to be two levels deep - and it cannot just compare the dict in the
ExistingPack instance unless GraphIndex has its __eq__ still - and in
that case why object to this specific occurrence :)

> 
> +
> +    def __repr__(self):
> +        return "<bzrlib.repofmt.pack_repo.Pack object at 0x%x, %s, %s" % (
> +            id(self), self.transport, self.name)
> +
> +
> +class NewPack(Pack):
> +    """An in memory proxy for a pack which is being created."""
> +
> +    # A map of index 'type' to the file extension and position in the
> +    # index_sizes array.
> +    indices = {
> +        'revision':('.rix', 0),
> +        'inventory':('.iix', 1),
> +        'text':('.tix', 2),
> +        'signature':('.six', 3),
> +        }
> 
> pep8 says or implies there should be a space after the colon, iirc.

Yup.

> This is not really a list of indexes - because it's also visible on instances I
> think it's good to make that clear - maybe index_definitions or something.

Sounds good.

> The point of the numbers is not just that they index the array but they give
> the order in the names list (?)

An index size array has the sizes of each index in bytes for a single
pack. The names list is currently sorted by the pack name, but it would
be an improvement to sort by the size of the revision index for each
pack (with smallest first).
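
For the record, the sorting improvement I mention could be sketched like
this (the index_sizes layout - revision index at position 0 - is from the
patch; the helper name is mine):

```python
def order_packs_smallest_first(packs):
    """Order packs by the size of their revision index, smallest first.

    Assumes each pack exposes index_sizes, a four-tuple of index byte
    sizes where position 0 is the revision index (as in the patch).
    """
    return sorted(packs, key=lambda pack: pack.index_sizes[0])
```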

> This list is effectively repeated several places; it should be combined.

Agreed - this dict is where I was combining them into.

> 
> +
> +    def __init__(self, upload_transport, index_transport, pack_transport,
> +        upload_suffix=''):
> +        """Create a NewPack instance.
> +
> +        :param upload_transport: A writable transport for the pack to be
> +            incrementally uploaded to.
> +        :param index_transport: A writable transport for the pack's indices to
> +            be written to when the pack is finished.
> +        :param pack_transport: A writable transport for the pack to be renamed
> +            to when the upload is complete. This *must* be the same as
> +            upload_transport.clone('../packs').
> 
> Well, if it must be the same, remove the parameter and clone it yourself?  I
> guess constructing the new transport would be a bit inefficient as the caller
> already has them, and this is only called by RepositoryPackCollection.

That's part of it; the other part I discussed in my reply to John's
review.

> +    ..
> +        # where should the new pack be opened
> +        self.upload_transport = upload_transport
> +        # where are indices written out to
> +        self.index_transport = index_transport
> +        # where is the pack renamed to when it is finished?
> +        self.pack_transport = pack_transport
> +        # tracks the content written to the .pack file.
> +        self._hash = md5.new()
> +        # a four-tuple with the length in bytes of the indices, once the pack
> +        # is finalised. (rev, inv, text, sigs)
> +        self.index_sizes = None
> 
> This would be simpler if you just made index_sizes a dictionary keyed
> by index_type.

Simpler *where*? Serialisation and parsing would be made more complex.

> +        self._buffer = [[], 0]
> 
> can you comment _buffer please, you were doing so well :)

It's a list of byte sequences, plus the aggregate size of them, which is
then string-joined when flushing the buffer.
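
Something like this sketch, in other words (class and attribute names
are mine, not the patch's):

```python
class ChunkBuffer:
    """A write buffer held as [list_of_chunks, total_length].

    Chunks accumulate until the aggregate size passes a cap, then they
    are string-joined and written out in one go.
    """

    def __init__(self, limit=1024 * 1024):
        self._buffer = [[], 0]  # [chunks, aggregate size in bytes]
        self._limit = limit
        self.flushed = []  # stands in for the real write stream

    def write(self, data):
        self._buffer[0].append(data)
        self._buffer[1] += len(data)
        if self._buffer[1] > self._limit:
            self.flush()

    def flush(self):
        joined = ''.join(self._buffer[0])  # string-join on flush
        if joined:
            self.flushed.append(joined)
        self._buffer[:] = [[], 0]
```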


> +        # create a callable for adding data
> +        def _write_data(bytes, flush=False, _buffer=self._buffer,
> +            _write=self.write_stream.write, _update=self._hash.update):
> +            _buffer[0].append(bytes)
> +            _buffer[1] += len(bytes)
> +            # 1MB buffer cap
> +            if _buffer[1] > self._cache_limit or flush:
> +                bytes = ''.join(_buffer[0])
> +                _write(bytes)
> +                _update(bytes)
> +                _buffer[:] = [[], 0]
> +        # expose this on self, for the occasion when clients want to add data.
> +        self._write_data = _write_data
> 
> Is there a reason this cannot just be a regular method?

Yes: you can't bind the parameters like that in a regular method, and
the binding is done for performance.
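
To spell out the trick: default arguments bind the attribute lookups
(self._buffer, self.write_stream.write, self._hash.update) once at
definition time, instead of resolving them on every call. A minimal
illustration (names are mine):

```python
def make_writer(stream_write, hash_update):
    buffer_ = [[], 0]  # [chunks, total length]

    # Default arguments bind the callables at definition time, so each
    # call skips the attribute lookups a plain method would repeat.
    def write_data(data, flush=False, _buffer=buffer_,
                   _write=stream_write, _update=hash_update):
        _buffer[0].append(data)
        _buffer[1] += len(data)
        if _buffer[1] > 4 or flush:  # tiny cap, purely for illustration
            joined = ''.join(_buffer[0])
            _write(joined)
            _update(joined)
            _buffer[:] = [[], 0]
    return write_data
```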

> It seems like the buffer would be better off as a separate object,
> perhaps later.
> 
> +    def data_inserted(self):
> +        """True if data has been added to this pack."""
> +        return 0 != sum((self.get_revision_count(),
> +            self.inventory_index.key_count(),
> +            self.text_index.key_count(),
> +            self.signature_index.key_count(),
> +            ))
> 
> or write
> 
>    bool(x() or y() or z() ...)

Sure. Perhaps just:
return x() or y() or z() etc.

> +        # XXX: should rename each index too rather than just uploading blind
> +        # under the chosen name.
> 
> Well, you're using put_file which I think does upload and then rename.  Is more
> than that needed?

Yes, we should upload them all to temp names, then rename them all, so
that a link failure during upload - the largest time window - is least
likely to leave some files in place.
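
Roughly this shape, using a plain directory to stand in for the
transport (sketch only; names are mine and real bzrlib transports are
not local filesystems):

```python
import os
import tempfile

def publish_files(dirpath, files):
    """Write every file under a temporary name, then rename them all.

    The upload (write) is the long window, so a failure there leaves
    only temp files behind; the renames at the end are quick, which
    minimises the window in which only some final names exist.
    """
    pending = []
    for final_name, content in files.items():
        fd, tmp_path = tempfile.mkstemp(dir=dirpath)
        with os.fdopen(fd, 'w') as f:
            f.write(content)
        pending.append((tmp_path, os.path.join(dirpath, final_name)))
    # second pass: all uploads succeeded, now expose the final names
    for tmp_path, final_path in pending:
        os.rename(tmp_path, final_path)
```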


> +    def make_index(self, index_type):
> +        """Construct a GraphIndex object for this packs index 'index_type'."""
> +        setattr(self, index_type + '_index',
> +            GraphIndex(self.index_transport,
> +                self.index_name(index_type, self.name),
> +                self.index_sizes[self.index_offset(index_type)]))
> 
> Would this be better called _load_index?  It does more than just make one.

It does not read any data. _load to me implies reading data.

> Rather than setattr tricks, why not have
> 
>   self.indexes[index_type] = ....
> 
> not much longer and simplifies some other code too.

If that was self.indices, sure. Oh, and the parsing code (which affects
every pack object * count of packs) would have to work that way too - it
will be somewhat slower.
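
For what it's worth, Martin's suggestion would look roughly like this
(a sketch; the real GraphIndex construction is elided and the class name
is mine):

```python
class PackIndices:
    """Store per-type indices in a dict instead of via setattr."""

    # index type -> (file extension, position in the index_sizes array)
    INDEX_DEFINITIONS = {
        'revision': ('.rix', 0),
        'inventory': ('.iix', 1),
        'text': ('.tix', 2),
        'signature': ('.six', 3),
    }

    def __init__(self):
        self.indices = {}

    def make_index(self, index_type, index_object):
        # dict assignment replaces
        # setattr(self, index_type + '_index', ...)
        self.indices[index_type] = index_object
```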


> +    def index_offset(self, index_type):
> +        """Get the position in a index_size array for a given index type."""
> +        return NewPack.indices[index_type][1]
> 
> That docstring doesn't seem very helpful.

Sorry, I'm not sure what would be clearer there.

> As 'label' is only used in the trace message I'd suggest it's simpler to just
> omit the parameter and use 'index_type' instead.  Also the docstring is out of
> date with the parameters, which I suppose comes from an earlier version that
> necessitated label.
> 
> In all these places, -Dfetch does not seem like the appropriate flag; why
> not -Dpack or -Dpackrepo?

Refactoring for the win. I was mentally tossing up what to do about
this. Thanks for the reminder. I think -Dpack for everything in this
file might be reasonable.


> +class AggregateIndex(object):
> +    """An aggregated index for the RepositoryPackCollection.
> +
> +    AggregateIndex is reponsible for managing the PackAccess object,
> +    Index-To-Pack mapping, and all indices list for a specific type of index
> +    such as 'revision index'.
> +    """
> 
> Can you explain more /why/ not /how/?

Hmm, this is what more than how, but yeah some why would be good. I made
this object to remove duplication between the
revision/signature/inventory/file text thunks that expose 'knits' based
on the underlying pack data.

> +    def __init__(self):
> +        """Create an AggregateIndex."""
> +        self.index_to_pack = {}
> +        self.combined_index = CombinedGraphIndex([])
> +        self.knit_access = _PackAccess(self.index_to_pack)
> 
> Comment that combined_index is the combined index of this type(?)
> across all packs.

That's accurate. E.g. _packs.revision_index.combined_index is the

> Should add_callback be set to None here, and documented?  

Yes.

> Rather than
> generically "add_callback" the name could say more what it does.
> However I'm not sure a bound method reference is really the best thing to
> use here: the callers seem like they would be happier getting a reference
> to the writable index, rather than just the add callback.

Not sure why you think that.

> +    def add_writable_index(self, index, pack):
> +        """Add an index which is able to have data added to it.
> +
> +        :param index: An index from the pack parameter.
> +        :param pack: A Pack instance.
> +        """
> 
> doc that there can only be one at a time.

Agreed.


> +    def remove_index(self, index, pack):
> +        """Remove index from the indices used to answer queries.
> +
> +        :param index: An index from the pack parameter.
> +        :param pack: A Pack instance.
> +        """
> +        del self.index_to_pack[index]
> +        self.combined_index._indices.remove(index)
> +        if (self.add_callback is not None and
> +            getattr(index, 'add_nodes', None) == self.add_callback):
> +            self.add_callback = None
> +            self.knit_access.set_writer(None, None, (None, None))
> 
> So this for example could be just self.writing_index and avoid the getattr, etc.

The getattr is to see if the index being removed is a writable index.
It's true this function would be cleaner if we had a self.writing_index.
OTOH code that currently looks for .add_callback would need to guard
against writing_index being None, which is why it's currently using
add_callback - I simplified a bunch of code moving to the current
structure. As always, if someone makes it cleaner, change++.



> +        self.packs = []
> +        # name:Pack mapping
> +        self._packs = {}
> 
> having .packs and ._packs as different variables is poor.

Yes, this is some of the older code, and I'm not going to defend it :).
IIRC self.packs was meant to be ordered, and self._packs is a name:Pack
mapping.

> +        # the previous pack-names content
> +        self._packs_at_load = None
> 
> more docs - this is the packs that existed when this object was created - when
> we first read the pack names list?

When we last read or wrote it. It's used to do a three-way diff when we
re-read [not implemented] or write [implemented].
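
The three-way diff amounts to a set merge along these lines (a sketch
under the assumption that pack names are plain strings; the helper name
is mine):

```python
def merge_pack_names(at_load, on_disk, in_memory):
    """Three-way merge of pack-name sets.

    Names another process added since we loaded stay; names we removed
    go; names we added are kept.
    """
    base = set(at_load)      # what pack-names said when we last read/wrote it
    theirs = set(on_disk)    # what pack-names says now
    ours = set(in_memory)    # what this process currently believes
    # start from their view, apply our removals and additions relative
    # to the common base
    return (theirs - (base - ours)) | (ours - base)
```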
...
> +        :param pack: A Pack object.
> +        """
> +        self.packs.append(pack)
> +        assert pack.name not in self._packs
> +        self._packs[pack.name] = pack
> +        self.revision_index.add_index(pack.revision_index, pack)
> +        self.inventory_index.add_index(pack.inventory_index, pack)
> +        self.text_index.add_index(pack.text_index, pack)
> +        self.signature_index.add_index(pack.signature_index, pack)
> 
> Precondition assertions should come first.

Oops, yes.


> +        if self._max_pack_count(total_revisions) >= total_packs:
> +            return False
...
> +        return True
> 
> I guess this should return False if no pack_operations were planned?

It does :).


> +        else:
> +            # eat the iterator to cause it to execute.
> 
>   ... which as a side effect copies the inventory data
> 
> +            list(inv_lines)
> +            text_filter = None
> 
> The mostly-repeated code for copying each type of object could be split out.

Yeah, it's there as an itch to find the right abstraction.

> +    def remove_pack_from_memory(self, pack):
> +        """Remove pack from the packs accessed by this repository.
> +
> +        Only affects memory state, until self._save_pack_names() is invoked.
> +        """
> +        self._names.pop(pack.name)
> +        self._packs.pop(pack.name)
> +        self._remove_pack_indices(pack)
> 
> Should be private?
> 
> Just del would be clearer than pop as you don't want the result.

I think you mean 'remove' ? Sure.

> +    def _make_index_map(self, index_suffix):
> +        """Return information on existing indexes.
> +
> +        :param suffix: Index suffix added to pack name.
> +
> +        :returns: (pack_map, indices) where indices is a list of GraphIndex
> +        objects, and pack_map is a mapping from those objects to the
> +        pack tuple they describe.
> +        """
> +        # TODO: stop using this; it creates new indices unnecessarily.
> +        self.ensure_loaded()
> +        suffix_map = {'.rix':'revision_index',
> +            '.six':'signature_index',
> +            '.iix':'inventory_index',
> +            '.tix':'text_index',
> +        }
> 
> The pack definitions are duplicated (triplicated?) again.

I want to nuke _make_index_map completely; I've been removing callers
incrementally.

> +    def release_names(self):
> +        """Release the mutex around the pack-names index."""
> +        self.repo.control_files.unlock()
> 
> Should be unlock_names to match.

Sure.


> +        # reused revision and signature knits may need updating
> +        if self.repo._revision_knit is not None:
> +            self.repo._revision_knit._index._add_callback = \
> +                self.revision_index.add_callback
> +        if self.repo._signature_knit is not None:
> +            self.repo._signature_knit._index._add_callback = \
> +                self.signature_index.add_callback
> +        # create a reused knit object for text addition in commit.
> +        self.repo._text_knit = self.repo.weave_store.get_weave_or_empty(
> +            'all-texts', None)
> 
> I don't understand this.  Why only those two knits?

Hysterical raisins. Client code in bzrlib grabs those knits outside of
write groups and then mutates them inside the write group.

> +class GraphKnitRevisionStore(KnitRevisionStore):
> +    """An object to adapt access from RevisionStore's to use GraphKnits.
> +
> +    This should not live through to production: by production time we should
> +    have fully integrated the new indexing and have new data for the
> +    repository classes; also we may choose not to do a Knit1 compatible
> +    new repository, just a Knit3 one. If neither of these happen, this
> +    should definately be cleaned up before merging.
> 
> Is this true?  I'm not sure what 'new' means here.

I've moved the comment.

> +
> +    This class works by replacing the original RevisionStore.
> +    We need to do this because the GraphKnitRevisionStore is less
> +    isolated in its layering - it uses services from the repo.
> +    """
> +
> +    def __init__(self, repo, transport, revisionstore):
> +        """Create a GraphKnitRevisionStore on repo with revisionstore.
> +
> +        This will store its state in the Repository, use the
> +        indices FileNames to provide a KnitGraphIndex,
> +        and at the end of transactions write new indices.
> +        """
> 
> FileNames?

Oh, this comment needs sanitising too.


> +class GraphKnitTextStore(VersionedFileStore):
> +    """An object to adapt access from VersionedFileStore's to use GraphKnits.
> 
> no apostrophe
> 
> The docstring might be clearer as
> 
>   Presents a TextStore abstraction on top of packs.

Yes.

> if that's correct.
> 
> +
> +    This should not live through to production: by production time we should
> +    have fully integrated the new indexing and have new data for the
> +    repository classes; also we may choose not to do a Knit1 compatible
> +    new repository, just a Knit3 one. If neither of these happen, this
> +    should definately be cleaned up before merging.
> 
> 
> again

agreed.

> +    get_weave = get_weave_or_empty
> +
> +    def __iter__(self):
> +        """Generate a list of the fileids inserted, for use by check."""
> 
> /a list of file_ids in this repository/

sure.

> +class GraphKnitRepository(KnitRepository):
> +    """Experimental graph-knit using repository."""
> 
> (comment) I've probably done this too, but it's better not to say
> 'experimental' in docstrings or class names because they can hang around after
> the code's considered mature.


We need to rename GraphKnitRepository to PackRepository and change
docstrings and format signatures too.



> (comment) we really need a better exposure in the ui of the different
> components; a 'dirstate format repository' meaning a 'repository in the
> versioned used when dirstate came out' is confusing.
> 
> I wonder then if rather than this eventually being called --format=packs
> it should be just 0.92 and 0.92-subtree.
> 
> Obviously the text needs to be updated when this is finally
> merged, as well as getting new names.

Yah. I don't know that 0.92 is all that good either, but I share the
concern that just 'packs' isn't really useful.
...
> I think providing __eq__ on objects that are not value objects is not a good
> idea, because it's unclear what is or isn't compared.  This is not checking all
> the members of the class, so I would prefer instead renaming it to e.g.
> has_same_location() as we did for Repository(?).
> 
> Also as far as I can see this is not tested.

Possibly true about the testing :). This is, however, a value object.

> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py	2007-10-12 05:26:46 +0000
> +++ bzrlib/knit.py	2007-10-17 09:39:41 +0000
> @@ -1950,7 +1950,8 @@
> 
>      def set_writer(self, writer, index, (transport, packname)):
>          """Set a writer to use for adding data."""
> -        self.indices[index] = (transport, packname)
> +        if index is not None:
> +            self.indices[index] = (transport, packname)
>          self.container_writer = writer
>          self.write_index = index
> 
> 
> It looks like this is being done so that you can pass None, None, (None, None)
> to disable writing to this object.  If so, you should document it.
> But maybe it'd
> be clearer and safer to add a different method remove_writer() instead.

Not quite: we already pass None, None, (None, None). I added this change
so that doing so doesn't add None to the underlying index. Actually I
think a good change to make is to remove the index, transport and
packname parameters completely.



> +    XXX: The index corruption that _fix_text_parents performs is needed for
> +    packs, but not yet implemented. The basic approach is to:
> +     - lock the names list
> +     - perform a customised pack() that regenerates data as needed
> +     - unlock the names list
> +    """
> 
> You should add a bug number here. s/corruption/inconsistency/ because the index
> is fine in itself.

Sure. It is corrupt though, not inconsistent. That is the index is
correct for the data, but the data is wrong.
https://bugs.edge.launchpad.net/bzr/+bug/154173

> +    def fileids_altered_by_revision_ids(self, revision_ids):
> +        """Find the file ids and versions affected by revisions.
> +
> +        :param revisions: an iterable containing revision ids.
> +        :return: a dictionary mapping altered file-ids to an iterable of
> +        revision_ids. Each altered file-ids has the exact revision_ids that
> +        altered it listed explicitly.
> +        """
> +        assert self._serializer.support_altered_by_hack, \
> +            ("fileids_altered_by_revision_ids only supported for branches "
> +             "which store inventory as unnested xml, not on %r" % self)
> +        selected_revision_ids = set(revision_ids)
> +        w = self.get_inventory_weave()
> 
> (query) And not only that, but it also needs them to be able to do
> iter_lines_added_or_present_in_versions, which is not supported by
> the base VersionedFile class.  I guess this is moot because
> the current code works for all formats and you get a decent error
> message anyhow.

Hmm, iter_lines_added_or_present_in_versions *is* required in the
VersionedFile api, by the versioned file interface tests.



> +    @needs_read_lock
> +    def missing_revision_ids(self, revision_id=None):
> +        """See InterRepository.missing_revision_ids()."""
> +        if revision_id is not None:
> +            source_ids = self.source.get_ancestry(revision_id)
> +            assert source_ids[0] is None
> +            source_ids.pop(0)
> +        else:
> +            source_ids = self.source.all_revision_ids()
> +        source_ids_set = set(source_ids)
> +        # source_ids is the worst possible case we may need to pull.
> +        # now we want to filter source_ids against what we actually
> +        # have in target, but don't try to check for existence where we know
> +        # we do not have a revision as that would be pointless.
> +        target_ids = set(self.target.all_revision_ids())
> +        actually_present_revisions = target_ids.intersection(source_ids_set)
> +        required_revisions =
> source_ids_set.difference(actually_present_revisions)
> +        required_topo_revisions = [rev_id for rev_id in source_ids if
> rev_id in required_revisions]
> +        return required_topo_revisions
> +
> +
> 
> I think you can avoid the intermediate sets and just say
> 
>    return [r for r in source_ids if (r not in target_ids)]

Well, this really needs to be fixed to not use all_revision_ids calls at
all.
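
(Martin's simplification, for the record, assuming source_ids is already
in topological order:)

```python
def missing_revision_ids(source_ids, target_ids):
    """Return source revisions absent from target, in source order."""
    target = set(target_ids)  # one set build beats repeated membership scans
    return [rev_id for rev_id in source_ids if rev_id not in target]
```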



> To answer Ian's question, I think this code is here because the InterModel1and2
> fetcher does the insertion of root ids when moving from one format to another?
> A comment might help.
> 
> So, what will happen if you try to fetch from a knit1 into a pack3?

Now, it will work.

> This doesn't seem to be directly tested and maybe it should be.



> +    def test_concurrent_writers_merge_new_packs(self):
> +        format = self.get_format()
> +        self.make_repository('.', shared=True, format=format)
> +        r1 = repository.Repository.open('.')
> +        r2 = repository.Repository.open('.')
> +        r1.lock_write()
> +        try:
> +            # access enough data to load the names list
> +            list(r1.all_revision_ids())
> +            r2.lock_write()
> +            try:
> +                # access enough data to load the names list
> +                list(r2.all_revision_ids())
> +                r1.start_write_group()
> +                try:
> +                    r2.start_write_group()
> +                    try:
> +                        self._add_text(r1, 'fileidr1')
> +                        self._add_text(r2, 'fileidr2')
> +                    except:
> +                        r2.abort_write_group()
> +                        raise
> +                except:
> +                    r1.abort_write_group()
> +                    raise
> +                # both r1 and r2 have open write groups with data in them
> +                # created while the other's write group was open.
> +                # Commit both which requires a merge to the pack-names.
> +                try:
> +                    r1.commit_write_group()
> +                except:
> +                    r2.abort_write_group()
> +                    raise
> +                r2.commit_write_group()
> +                # tell r1 to reload from disk
> +                r1._packs.reset()
> +                # Now both repositories should now about both names
> +                r1._packs.ensure_loaded()
> +                r2._packs.ensure_loaded()
> +                self.assertEqual(r1._packs.names(), r2._packs.names())
> +                self.assertEqual(2, len(r1._packs.names()))
> +            finally:
> +                r1.unlock()
> +        finally:
> +            r2.unlock()
> 
> The finally blocks are mismatched.  It might be simpler just to do
> add_cleanup().

Good catch.

> (comment) Maybe unlock() should implicitly abort a pending write group (does it
> error at the moment?)

It errors.



> +class TestPack(TestCaseWithTransport):
> +    """Tests for the Pack object."""
> +
> +    def assertCurrentlyEqual(self, left, right):
> +        self.assertTrue(left == right)
> +        self.assertTrue(right == left)
> +        self.assertFalse(left != right)
> +        self.assertFalse(right != left)
> +
> +    def assertCurrentlyNotEqual(self, left, right):
> +        self.assertFalse(left == right)
> +        self.assertFalse(right == left)
> +        self.assertTrue(left != right)
> +        self.assertTrue(right != left)
> 
> 'Currently' is an odd name, maybe 'thoroughly'?  But really, are these
> comparisons used other than in this test?

Well, there's a series of these calls made, and I want to make it clear
how this differs from assertEqual: we're not testing that a *might* be
equal to b, we're asserting that it is, and that __eq__ and __ne__ line
up correctly. We would benefit from these helpers in several other
tests.


-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.