Bazaar, OLPC, and the "no-packs" repository format.

Lukáš Lalinský lalinsky at gmail.com
Thu Dec 20 15:16:48 GMT 2007


On Thu, 2007-12-20 at 08:57 -0600, John Arbash Meinel wrote:
> > Various logs were my primary tests, since I was playing with this
> > because of the per-file log performance bug I've reported. To get all
> > revisions it had to seek so many times that even though I had a faster
> > index than with knits, the whole operation is slower than with knits.
> 
> I'm also wondering if the requests were getting generated properly, such
> that we should be requesting several revisions at a time, and sending
> that down to the lower layers, which are then allowed to reorder the
> request in whatever fashion is fastest for them.
> 
> I would be interested in seeing some of your --lsprof results.
> 
> ...

I don't have them anymore, but I can convert some branches and try to
repeat the tests. The most expensive operations were seeking and
wrapping/unwrapping the results to/from StringIO for the pack container
reader.
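To illustrate the wrapping cost mentioned above: each record gets sliced
out and re-wrapped in a file-like object before parsing, when plain
slicing would give the same bytes without the per-record allocations.
This is just a minimal sketch of the pattern, not the pack container
reader's actual code; the function names are made up.

```python
from io import BytesIO  # cStringIO.StringIO in the Python 2 of the time

def read_records_wrapped(data, offsets):
    """Per-record wrapping: slice each record out of the byte string and
    wrap it in a temporary file-like object before reading it back."""
    records = []
    for start, length in offsets:
        f = BytesIO(data[start:start + length])
        records.append(f.read())
    return records

def read_records_direct(data, offsets):
    """Same result with plain slicing: no temporary file objects, so no
    per-record allocation overhead."""
    return [data[start:start + length] for start, length in offsets]
```

Both return identical record lists; the difference is purely the
per-record object churn, which adds up when walking thousands of
revisions.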

> I see it by looking at the parse_header_py function:
> def _parse_header_py(hash_to_segment, data, pos, offset, segment_count):
>     for i in xrange(segment_count):
>         key_count, size = struct.unpack('<BI', data[pos:pos+5])
>         pos += 5
>         hashes_size = key_count * 4
>         hashes = struct.unpack('<%di' % key_count, data[pos:pos+hashes_size])
>         pos += hashes_size
>         for key_hash in hashes:
>             hash_to_segment.setdefault(key_hash, []).append(i)
>         yield offset, size, key_count
>         offset += size
> 
> So you describe the segments in the header as just the list of hashes present.
> So you have:
> 
> HEADER = OVERALL_DESCRIP SEGMENT_DESCRIP*
> OVERALL_DESCRIP = NUM_REF_LISTS KEY_LENGTH KEY_COUNT SEGMENT_COUNT INDEX_SIZE
> SEGMENT_DESCRIP = NUM_KEYS SEGMENT_LENGTH HASH_KEY*
> 
> I'm guessing some of your variable names aren't quite right.
> 
> Like why is it called "_key_length" but you are passing it to parse_header as
> "offset".

Actually, the Python code is not in sync with the C code, which I used
as the primary version (it was just an experiment to see whether I could
make packs faster purely by making the indexing layer fast, not serious
code).
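For reference, a self-contained round-trip of the segment-descriptor
layout John quotes above (NUM_KEYS as one byte, SEGMENT_LENGTH as a
little-endian 32-bit int, then NUM_KEYS 32-bit key hashes). The function
names here are illustrative only and are not the actual C or Python
implementation:

```python
import struct

def pack_segments(segments):
    """Serialize [(segment_length, [key_hashes])] into the quoted
    layout: <B key_count><I segment_length><i hash>*key_count."""
    parts = []
    for size, hashes in segments:
        parts.append(struct.pack('<BI', len(hashes), size))
        parts.append(struct.pack('<%di' % len(hashes), *hashes))
    return b''.join(parts)

def parse_segments(data, segment_count, offset=0):
    """Inverse of pack_segments: yields (offset, size, key_count) per
    segment and builds the hash -> [segment index] map as a side table,
    matching the structure of the quoted _parse_header_py."""
    hash_to_segment = {}
    descriptors = []
    pos = 0
    for i in range(segment_count):
        key_count, size = struct.unpack_from('<BI', data, pos)
        pos += 5
        hashes = struct.unpack_from('<%di' % key_count, data, pos)
        pos += key_count * 4
        for key_hash in hashes:
            hash_to_segment.setdefault(key_hash, []).append(i)
        descriptors.append((offset, size, key_count))
        offset += size
    return descriptors, hash_to_segment
```

Writing the layout down as a pack/parse pair makes the header grammar
unambiguous even where the Python and C versions have drifted apart.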

> And I think "index_size" just refers to the hash map section.

Right.

> It also looks like your code only supports local operations. Certainly doing
> "self._transport.get(fname)" isn't a great thing to do on HTTP. Instead you
> should be doing:
> 
>  bytes = self._transport.readv(fname, [(0, 1024)], adjust_for_latency=True,
> 			       upper_limit=XXXX)

Yes, it was optimized only for local operations, where calling readv
multiple times and opening the file each time was significantly slower.
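A rough illustration of the trade-off being discussed: instead of one
get() of the whole file or one seek per record, nearby (offset, length)
requests can be coalesced into fewer larger reads, which is the idea
behind readv batching. The merging rule below is made up for
illustration and is not bzrlib's actual coalescing algorithm:

```python
def coalesce_offsets(offsets, max_gap=4096):
    """Merge (offset, length) requests whose gaps are at most max_gap
    bytes into single larger reads, trading a few wasted bytes for
    fewer seeks (cheap locally, crucial over HTTP)."""
    merged = []
    for start, length in sorted(offsets):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous read: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

Locally the gap threshold matters little, but over a high-latency
transport collapsing many small requests into a handful of ranges is
the difference John is pointing at.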

Lukas
