[MERGE] Fetch tweaks

John Arbash Meinel john at arbash-meinel.com
Tue Jul 29 03:18:22 BST 2008



Robert Collins wrote:
| On Mon, 2008-07-28 at 10:48 -0500, John Arbash Meinel wrote:
|> Robert Collins wrote:
|> | This allows repositories more control over their fetch operations
|> | in the generic fetching code. Doing so allows the groupcompress
|> | format to avoid having to figure out full text representations,
|> | instead getting everything as fulltext in the first place; and
|> | eliminates an unnecessary reconcile post-fetch.
|> |
|> | -Rob
|> |
|>
|> BB:approve
|>
|> I like this patch as it stands, though with one caveat: during an
|> initial branch, passing _fetch_uses_deltas = False will read the
|> entire repository into memory. It will be somewhat efficient, in
|> that it will share strings in the in-memory lists, up until the
|> point that you actually fetch a bit of text.
|
| Yes indeed.
|
|> Then it does ''.join(lines), which doubles memory consumption for
|> that text (while still caching the original lines). If the caller
|> doesn't hang onto the text, it will probably be OK.
|
| If we had the unpacked sizes in the index we could do something clever
| and simple :). We don't though. (And I think it would be a loss overall
| due to index size increasing.)
|
|> To truly scale up, we need to change the 'get_record_stream()' code
|> that blindly unpacks all of the requested keys, so that we only
|> unpack a few at a time. I don't have a good answer for how many,
|> since that is a trade-off between unpacking efficiency and memory
|> consumption.
|>
|> Anyway, this is still better than what we have (as it lets us experiment
|> with it), and it shouldn't change the behavior of anything *today*.
|
| I plan to audit the versioned file code to make sure it will do
| something nice for group compress - or can be tweaked to do so.
| Roughly, that's:
|  - adding reverse-topological ordering
|  - checking that knits group by fileid when sorting
|  - making the full text assembly a little bit more lazy (I'm
| thinking just 100-text batches). Excluding ISOs and the like, most
| texts are < 1MB, so that should be less than 100MB in the worst case.
|
| Another thing we should do in knits is discard the raw content once
| it's no longer referenced; but perhaps gc will make that irrelevant.
|
| -Rob
|

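(As an aside, here is a rough sketch of the 100-text batching idea
quoted above; the function name, the get_texts callback, and the batch
size are all illustrative, not existing bzrlib API:)

  def iter_fulltexts_in_batches(keys, get_texts, batch_size=100):
      # Assemble full texts a small batch at a time, so at most
      # ~batch_size texts are held in memory at once.
      for start in range(0, len(keys), batch_size):
          batch = keys[start:start + batch_size]
          for key, text in get_texts(batch):
              yield key, text
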
Well, simply doing:

  text_map.pop(key)

rather than

  text_map[key]

will at least let those lines be reclaimed.
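
As a minimal sketch of that -- assuming text_map is a plain dict of
key => list-of-lines built up during the fetch, and iter_texts is a
hypothetical consumer -- popping entries as they are joined lets each
lines list be reclaimed:

  def iter_texts(text_map, keys):
      # Consume the cached lines as we go, so each entry can be
      # garbage collected once its text has been yielded.
      for key in keys:
          lines = text_map.pop(key)   # drop the cached lines list
          yield key, ''.join(lines)   # only the joined text stays live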

I still think getting the streaming interface to work in "chunks"
would be best. A full text in a single string, [full_text], is a
valid chunk list, just as [lines] is.
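
(To make that concrete -- a tiny illustration, not existing API --
both representations join back to the same text:)

  full_text = 'first line\nsecond line\n'
  chunks_a = [full_text]                 # one chunk: the whole text
  chunks_b = full_text.splitlines(True)  # many chunks: one per line
  assert ''.join(chunks_a) == ''.join(chunks_b)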

I know it complicates some things, but since we already have to do
'split_lines(text)', we can just write a C implementation to convert
chunks => lines for anyone that needs to do so.
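
For reference, a pure-Python sketch of what that conversion would do
(chunks_to_lines is an illustrative name here, not a committed API; a
real C version would avoid the intermediate join):

  def chunks_to_lines(chunks):
      # Normalise any chunk list ([full_text], [lines], or anything
      # in between) to a list of lines, keeping the terminators.
      return ''.join(chunks).splitlines(True)

  # chunks_to_lines(['a\nb\n'])      => ['a\n', 'b\n']
  # chunks_to_lines(['a\n', 'b\n'])  => ['a\n', 'b\n']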

John
=:->