versioned file api audit

Robert Collins robertc at
Mon Mar 17 23:28:16 GMT 2008

On Mon, 2008-03-17 at 10:16 -0500, John Arbash Meinel wrote:
> Hash: SHA1
> Robert Collins wrote:
> > so I just audited the public interface of versioned file. Here's what I
> > propose we keep/discuss/remove from it.
> > 
> > 
> > vf apis to keep:
> > check_not_reserved_id
> > versions
> > add_lines
> >   (with ghosts version/perhaps flag)
> I'm not sure if we did this yet, but 'add_lines' should also return the
> sha1 sum if it had to compute it. (I think we just did it and said
> 'screw backwards compatibility', but I'm not positive.)
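For illustration, a minimal sketch of an add_lines that hands back the sha1 it had to compute anyway (the signature and the dict-backed store are hypothetical, not bzrlib's actual API):

```python
import hashlib

def add_lines(store, key, parents, lines):
    """Sketch: record a text and return the sha1 we computed while storing it.

    'store' is a plain dict standing in for real storage; this is an
    illustrative shape, not VersionedFile.add_lines itself.
    """
    text = b''.join(lines)
    sha1 = hashlib.sha1(text).hexdigest()
    store[key] = (tuple(parents), list(lines))
    # Returning the digest means callers never have to re-hash the same text.
    return sha1, len(text)
```

The point is simply that the hash is a free byproduct of insertion, so throwing it away forces callers to pay for it twice.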


> I think you mean 'get_parent_map'


> I'm not sure what you mean by 'iter_parents' being better, but I'm
> willing to discuss it.

If you have a 10MB index, get_parent_map(all_keys) will load the entire
thing and then give you a big dict. iter_parents would start processing
the index and give results as soon as any were determined; this can
lower peak memory usage (by streaming), and makes it easier to provide
progress feedback in the UI, but introduces stack thrashing as you
enter/exit the iterator a lot. So I think it's better overall to use
get_parent_map, but there is a tradeoff.
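To make the tradeoff concrete, here is a toy sketch of the two shapes being compared (the index is just a dict here; a real index bisects on disk):

```python
def get_parent_map(index, keys):
    """Batch shape: one complete dict comes back.

    Peak memory is the whole result, but the caller re-enters no iterator.
    Absent (ghost) keys are simply omitted.
    """
    result = {}
    for key in keys:
        if key in index:
            result[key] = index[key]
    return result

def iter_parents(index, keys):
    """Streaming shape: (key, parents) pairs appear as each is resolved.

    Lower peak memory and easy progress reporting, at the cost of
    entering and leaving the generator once per result.
    """
    for key in keys:
        if key in index:
            yield key, index[key]
```

Both return the same information; the difference is purely when the caller sees it and how much is resident at once.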

> IIRC, iter_parents was calling into get_parents() which has the problem
> that it is actually doing "get_parents_no_ghosts()" which is a
> *significant* performance penalty. At that level the pack index is doing
> 0 caching, so it is a bisect 2x.

Well, I'm fixing that with this cleanup: all the non-ghost-aware APIs are
being evicted with prejudice.

> The common way of going through history is to ask the index, come back
> with the next parents, go back and ask the index again based on which
> parents you want, etc.
> It would be nice if we could stream in the nodes we want and stream out
> the nodes to get. Maybe an "iter_ancestry()" style call with flags for
> lefthand_only, include_ghosts, etc?

I think it's fine, based on the index layer experience, to just call in
repeatedly; the index has enough data to exploit locality of reference.
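The repeated-query pattern John describes looks roughly like this (iter_ancestry and its shape are a hypothetical sketch, not an existing API):

```python
def iter_ancestry(get_parent_map, heads):
    """Walk ancestry in waves: ask the index for one batch of parents,
    then ask again for whichever parents we have not seen yet.

    Ghosts never appear in the returned map, so they fall out naturally.
    """
    pending = set(heads)
    seen = set()
    while pending:
        parent_map = get_parent_map(pending)
        pending = set()
        for key, parents in parent_map.items():
            if key in seen:
                continue
            seen.add(key)
            yield key, parents
            pending.update(parents)
        # Never re-query keys we have already emitted.
        pending -= seen
```

Each iteration of the while loop is one round trip to the index, so the index gets a whole wave of keys at once and can exploit their locality.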

> > annotate_iter (perhaps rename to iter_annotations)
> For the moment, I really think it makes more sense to have a plain
> "annotate()" which just returns the lines.
> All implementations basically do this and then add "iter(lines)" to
> return a real iterator.
> Conceptually, I like the idea that we wouldn't have to load a full-text
> into memory including annotations. Pragmatically, we are a *long* way
> away from it.

Which is faster? Which do we want the rest of the library code using?
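John's point that current implementations build the list and then wrap it can be sketched like so (storage is a toy dict of precomputed annotations):

```python
def annotate(storage, key):
    """List shape: the full annotated text, [(origin_revision, line), ...]."""
    return list(storage[key])

def annotate_iter(storage, key):
    """Iterator shape: today this just wraps the list, so it buys nothing.

    A genuinely streaming annotator would have to avoid materialising
    the list at all, which no implementation currently does.
    """
    return iter(annotate(storage, key))
```

As long as the iterator form is implemented this way, the two have identical peak memory and the list form is the simpler API to call.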

> > iter_lines_added_or_present_in_versions
> > weave_merge
> > 
> > apis to discuss:
> > plan_merge (perhaps keep this)
> > 
> > apis to remove:
> > has_version
> I think we do need a way to probe sometimes. I know it is potentially
> expensive as a LBYL. I think the problems I ran into were when you have
> APIs which take a list of revisions. So you may end up doing
> get(revision_list)
> And 1 of them in the middle fails because it isn't present. It is pretty
> hard to recover cleanly from that situation. I would certainly like to
> structure code (as much as possible) to avoid it, though.

get_parent_map(revision_list) lets you answer presence/absence queries;
we don't need a dedicated function, and keeping has_version around really
does encourage slow code.
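A sketch of the replacement idiom: probe a whole batch at once via the parent map rather than LBYL per key (check_present is a hypothetical helper, not an existing call):

```python
def check_present(get_parent_map, keys):
    """Batch presence probe: one index query answers every key at once.

    Returns the set of missing keys, so callers can decide up front
    how to proceed instead of failing partway through a get().
    """
    present = set(get_parent_map(keys))
    return set(keys) - present
```

This keeps the per-key has_version pattern out of hot paths while still letting code that takes a list of revisions detect gaps before starting work.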

> > copy_to
> gone
> > access_mode in constructor
> > finished
> At one point I think we had an "open()" call on VF, which isn't really
> used, so certainly it can be gone.

Yah, there was the concept that a versioned file was a _file_ on disk,
which is sufficiently gone now that we can clean this up.

> > iter_parents
> I thought you wanted to keep this up above.

Either/or. I'd rather start with one and add the other if/when we
actually need it.

> > join
> > annotate
> > get_sha1
> > has_ghost
> > add_lines_non_ghost
> > enable_cache
> > clear_cache
> > clone_text
> > create_empty
> > get_suffixes
> > get_text
> > get_texts
> > get_lines
> > get_ancestry
> > get_ancestry_with_ghosts
> > get_graph
> > get_graph_with_ghosts
> > transaction_finished
> > 
> I think a lot of these can go as we switch to Repository.get_graph().
> There are probably areas where you will pessimize Knit repos slightly,
> but I don't think anything terribly much.

