[RFC] Change revision_id caching

Mon Mar 31 02:28:31 BST 2008

John Arbash Meinel writes:

 > If we trust most of the internals to not need it, then I think it
 > would be reasonable to switch to an LRUCache of plain 8-bit
 > strings, and not worry about the Unicode side.

I predict that if you "trust the internals", you will regret it.  This
comes from 10 years of watching GNU Emacs deal with the "\201 bug" and
repeated regressions thereof, while XEmacs never had a single instance
after Mule stabilized in 20.3, 11 years ago.  GNU Emacs insisted on
keeping APIs for dealing with bytes in the same strings and buffers as
Mule encoding, and inevitably the wrong kind of data would leak across
the boundary.

True, internally, things that trade in revision_ids should not need to
know anything about them, except that they are UUIDs, and maybe how to
parse the time stamp out of them.  But it's a good idea to *enforce*
that.  I also think that it's reasonable to assume that Unicode
strings will only be needed for the purposes of presentation to
humans, and so can be ignored.

Specify an interface for generating those ids, and make sure that that
is the only way that they are ever generated.  Insist that strings be
interned as revision_id data as soon as they are entered.  A "shoot on
sight" policy for any code that tries to manipulate a revision_id
directly is highly recommended.
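
To make that concrete, here is a minimal sketch of what I mean, not
a proposal for the actual bzr API.  The names RevisionId,
intern_revision_id and as_bytes are hypothetical, and encoding text
input as UTF-8 is an assumption on my part.  The point is only that
there is one door in, the result is not a string, and interning
makes identity comparison sufficient.

```python
import weakref

# Private marker: only intern_revision_id() passes it to the constructor.
_FACTORY_TOKEN = object()

# Canonical objects, keyed by their 8-bit form.  Weak values let unused
# ids be collected instead of accumulating forever.
_interned = weakref.WeakValueDictionary()


class RevisionId:
    """Opaque revision id (hypothetical): hashable and comparable by
    identity, but deliberately not a str or bytes subclass."""

    __slots__ = ("_raw", "__weakref__")

    def __init__(self, raw, _token=None):
        if _token is not _FACTORY_TOKEN:
            raise TypeError("use intern_revision_id() to create a RevisionId")
        self._raw = raw

    def as_bytes(self):
        """Return the 8-bit form, intended for serialization only."""
        return self._raw

    def __repr__(self):
        return "RevisionId(%r)" % (self._raw,)


def intern_revision_id(raw):
    """Single entry point: map raw input to the canonical RevisionId."""
    if isinstance(raw, str):
        # Assumption: text input is encoded as UTF-8 at the boundary.
        raw = raw.encode("utf-8")
    if not isinstance(raw, bytes):
        raise TypeError("revision id must be bytes or str, got %s"
                        % type(raw).__name__)
    rev_id = _interned.get(raw)
    if rev_id is None:
        rev_id = RevisionId(raw, _FACTORY_TOKEN)
        _interned[raw] = rev_id
    return rev_id
```

Because RevisionId defines no string operations, code that tries to
slice or concatenate one fails immediately with a TypeError instead
of quietly leaking the wrong kind of data across the boundary.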

You might also want to read the threads in Python-3000 on allowing
Unicode identifiers in Python programs (PEP 3131, I think it was).
They discuss which normalization to use, and whether to restrict the
use of easily mistaken characters (for example, Cyrillic A and Latin
A have identical glyphs but are different characters).  Following
Unicode TR#31, which covers identifier syntax, is probably a good
idea.
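
A small illustration of why those are two separate issues, using only
the standard unicodedata module.  The helper normalized_equal is a
hypothetical name for this sketch.  NFKC is the form PEP 3131 applies
to identifiers.  Normalization unifies different encodings of the
same character, but it does nothing about look-alike characters.

```python
import unicodedata

latin_a = "\u0041"      # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A, same glyph in most fonts

composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT


def normalized_equal(x, y, form="NFKC"):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


# Two spellings of the same character compare equal once normalized.
print(normalized_equal(composed, decomposed))

# Look-alikes remain distinct characters; normalization does not merge them.
print(normalized_equal(latin_a, cyrillic_a))
print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```

So a normalization step alone does not address the confusable
characters problem; that needs a separate restriction on which
characters are allowed.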