[MERGE] Implement and use Repository.iter_files_bytes

Mon Aug 20 06:20:30 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> Martin Pool has voted resubmit.
> Status is now: Resubmit
> Comment:
> I know you were already going to redraft it but I have some more comments:
> 
> +    def iter_files_bytes(self, desired_files):
> +        """Iterate through file versions.
> +
> +        The order in which files will be yielded is unspecified.
> +        Yields pairs of identifier, bytes_iterator.  identifier is an
> opaque
> +        value that uniquely identifies the file version.  (Example: a
> +        TreeTransform trans_id.)
> +
> +        The default implementation just does VersionedFile.get_lines().
> +        :param desired_files: a list of (file_id, revision_id, identifier)
> +            pairs
> +        """
> 
> What is 'identifier' and where does it come from?

I'm not sure whether you missed the top of the docstring where I said
it's an opaque value, and gave a TreeTransform trans_id as an example.

For build_tree, it is a TreeTransform trans_id.  For revert, it is a
(trans_id, mode_id) pair.  They come from TreeTransform._assign_id.  The
mode_id is used in revert to ensure that when a file is backed up, the
new file has the same unix mode as the backed-up (moved) file.

> Is it there just to
> match
> up the input and output parameters?

It is there to provide an identifier for the content that is unique in
the context of the operation.

> The test case does not seem to pass one at all.

Yes it does:

+        extracted = dict((i, ''.join(b)) for i, b in
+                         tree.branch.repository.iter_files_bytes(
+                         [('file1-id', 'rev1', 'file1-old'),
+                          ('file1-id', 'rev2', 'file1-new'),
+                          ('file2-id', 'rev1', 'file2'),
+                         ]))

In this test case, the identifiers are file1-old, file1-new, and file2.
 In the Tree.iter_files_bytes test case, they were 'id1', 'id2', and 'id3'.

> Maybe it would be better to just yield (file_id, revision_id, bytes_iter)
> and let the caller match them up from there?

That would mean lots of extra work.

First, RevisionTree.iter_files_bytes must match
WorkingTree.iter_files_bytes.  And WorkingTree.iter_files_bytes doesn't
have any revision.  So RevisionTree.iter_files_bytes must strip off the
revision id.  So it has to *be* an iterator, instead of merely *using*
an iterator.

Second, build_tree and revert must now look up the trans_id (and
mode_id) associated with that file_id.

So iter_files_bytes may not be perfect for *general* use, but for
*actual* uses, it's very nice.

> Please also document what bytes_iterator is - from the test, it yields.
> It looks like in some cases you are returning just a byte string, which
> can be used as an iterator, but which will then return one byte at
> a time - so ''.join(bytes_iterator) is pretty slow.

No, it never returns a string.  The bytes_iterator for
Tree.iter_files_bytes is tuple containing a single string.  The
bytes_iterator for Repository.iter_files_bytes is a list of lines.

> I'm wondering whether for new iterators we shoud document whether
> the caller is allowed to quit without consuming the whole thing.

I think for iterators the assumption is that the caller *may* quit
without consuming the whole thing.  I think violating that assumption
may not be a good idea.  Certainly if we do, we should document that loudly.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGySSd0F+nu1YWqI0RApLaAJ9bR29nN38TBazSB8hpAqRpAcqt4wCcDT2O
tNiFw6OpTeUoRv9VE5qmSeA=
=tAyL
-----END PGP SIGNATURE-----