CVS migration help

Tue Oct 7 18:40:39 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thomas Manson wrote:
> Hi Brian,
>  
> on my new system :
>  
> LANG=en_US.UTF-8
>  
> thomas@home:~/temp/cvsrepo/crf-irp/Ressources/documentation$ ll
> total 32
> drwxr-xr-x 2 thomas thomas  4096 2008-10-06 18:07 .
> drwxr-xr-x 9 thomas thomas  4096 2008-10-06 18:07 ..
> -r--r--r-- 1 thomas thomas 23274 2008-01-20 00:56 Sp?cifications.doc,v
> thomas@home:~/temp/cvsrepo/crf-irp/Ressources/documentation$ ls -N | hexdump -C
> 00000000  53 70 e9 63 69 66 69 63  61 74 69 6f 6e 73 2e 64  |Sp.cifications.d|
> 00000010  6f 63 2c 76 0a                                    |oc,v.|
> 00000015
> 
> On my old system, from which the files came:
>  
> LANG=fr_FR@euro
>  

^- In your hexdump, é is stored as the single byte e9. That means the
filename *is not* in UTF-8, which would take 2 bytes (c3 a9) to encode é.

Now:

>>> print '\xe9'.decode('latin1')
é

>>> '\xe9'.decode('latin1').encode('utf-8')
'\xc3\xa9'
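
The session above is Python 2. As a rough Python 3 equivalent (a sketch
only; the byte string is copied from the hexdump, and the variable names
are just for illustration), the same round trip looks like this:

```python
# Python 3 keeps bytes and text as separate types, so the raw filename
# has to be written as a bytes literal.
raw = b"Sp\xe9cifications.doc,v"   # the bytes shown by hexdump -C

name = raw.decode("latin-1")       # interpret byte 0xe9 as Latin-1 "é"
print(name)

utf8 = name.encode("utf-8")        # "é" becomes the two bytes c3 a9
print(utf8)
```

The UTF-8 form is one byte longer than the original, because é goes from
one byte to two.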


Anyway, *most* current systems assume that paths are in UTF-8. (Linux
doesn't actually specify an encoding; a filename is just a NUL-terminated
byte string.) That causes problems, because we have to "guess" what
encoding the bytes are really in.

In this case, your filename is probably in Latin-1 encoding.
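
One common way to do that "guessing" is to try UTF-8 first and fall back
to Latin-1 when the bytes are not valid UTF-8. The helper below is a
hypothetical sketch in Python 3, not something cvsps-import does; the
function name and the choice of fallback are assumptions:

```python
def decode_filename(raw, fallback="latin-1"):
    """Decode a filename given as bytes (hypothetical helper).

    Tries UTF-8 first; if the bytes are not valid UTF-8, assumes the
    fallback encoding instead. Latin-1 accepts any byte sequence, so the
    fallback never fails, but it is still only a guess.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# The lone byte 0xe9 is not valid UTF-8, so this takes the Latin-1 path.
old_name = decode_filename(b"Sp\xe9cifications.doc,v")

# The same name already encoded as UTF-8 decodes on the first try.
new_name = decode_filename(b"Sp\xc3\xa9cifications.doc,v")
```

Both calls give the same text, but the heuristic can be wrong: a name in
some other 8-bit encoding (say, Latin-2) would decode without error and
still come out with the wrong characters.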

This is partially why cvsps-import doesn't support non-ASCII filenames:
we don't really know what encoding to use for them. (Mostly, though, it
is because nobody had non-ASCII filenames and wanted us to make it work.)

For example, code like this *could* do what you want:

=== modified file 'cvsps/parser.py'
--- cvsps/parser.py     2007-02-08 22:33:44 +0000
+++ cvsps/parser.py     2008-10-07 17:39:30 +0000
@@ -174,6 +174,7 @@
         if ':' not in line:
             return
         fname, version = line[1:].rsplit(':', 1)
+        fname = fname.decode(self._encoding)
         fname = self._cache(fname)
         versions = version.split('->')
         assert len(versions) == 2

It just uses the same encoding for filenames that we use for the log
content and the committer names.
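
To show the effect outside the plugin, here is a standalone Python 3
sketch of the same parsing step. It is not the actual parser code:
parse_member_line is a hypothetical name, the "NAME:OLD->NEW" line shape
is inferred from the diff above, and the sample revision numbers are made
up for illustration.

```python
def parse_member_line(line, encoding="latin-1"):
    """Split a raw member line into (filename, old_rev, new_rev).

    Hypothetical stand-in for the patched code: the line is bytes, the
    first byte is skipped (as line[1:] does in the diff), and only the
    filename is decoded with the caller-supplied encoding.
    """
    if b":" not in line:
        return None
    fname, version = line[1:].rsplit(b":", 1)
    fname = fname.decode(encoding)
    # Revision numbers are plain ASCII such as "1.1->1.2".
    versions = version.decode("ascii").strip().split("->")
    assert len(versions) == 2
    return fname, versions[0], versions[1]


# Illustrative input only; the revisions 1.1 and 1.2 are invented.
sample = b"\tRessources/documentation/Sp\xe9cifications.doc:1.1->1.2\n"
result = parse_member_line(sample)
```

Passing the wrong encoding here would not raise an error for Latin-1, it
would just produce a wrong name, so the encoding has to match what the
old system actually used.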

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjrnxcACgkQJdeBCYSNAAPWhwCgy/4VbBRxWIcb0JzJxz1xURW+
MuUAoKqtfapED0UniQd7vn4Nv6fAEFOt
=w//u
-----END PGP SIGNATURE-----