Incorrect unicode handling

Goffredo Baroncelli kreijack at tiscalinet.it
Thu Jul 5 20:35:01 BST 2007


On Thursday 05 July 2007, you (Vasily Sulatskov) wrote:
> Hi Goffredo,
> 
> Perhaps I didn't apply your patch properly (I tried to patch latest
> webserve-dev) but it didn't work for me.

Which repository did you pull the source from? Try:
http://goffredo-baroncelli.homelinux.net/bazaar/webserve-repository/bazaar-webserve-dev


> 
> I created a test repository which uses cp1251 encoding. It's in an
> attached file.
See below.

> 
> Here's the traceback I get from webserve-dev patched with a patch from
> your email (patch didn't work perfectly it had some rejections so I
> had to fix it by hand):
[...]

OK, I reworked my/your idea a bit and decided that:
1) everything is UTF-8 except the file content/diff
2) the code which shows the file content/diff is responsible for decoding
from the branch encoding to the Python internal representation
3) so the write2 function only has to encode from the internal unicode
representation to UTF-8
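
A minimal sketch of the three points above, written in Python 3 syntax. All
names except write2 are hypothetical, the cp1251 default is only an assumed
configuration value, and this is not the actual webserve code:

```python
import io

# Assumption: the branch encoding would come from configuration.
BRANCH_ENCODING = "cp1251"


def decode_content(raw_bytes, branch_encoding=BRANCH_ENCODING):
    """Point 2: decode file content/diff bytes to the internal unicode form."""
    # errors="replace" keeps the page rendering even if a byte is invalid.
    return raw_bytes.decode(branch_encoding, errors="replace")


def write2(out, text):
    """Point 3: encode only from the internal unicode form to UTF-8."""
    out.write(text.encode("utf-8"))


# Usage: content stored as cp1251 is sent to the browser as UTF-8.
raw = "Привет".encode("cp1251")
buf = io.BytesIO()
write2(buf, decode_content(raw))
```

With this split, write2 never needs to know the branch encoding; only the
code that reads the file content/diff does.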

Actually, I don't know whether the comment/user-id is stored as UTF-8 in the
Bazaar repository. For the moment I assume it is.
I have tried to set my terminal to iso8859-15, but I don't know how to set my
LANG environment variable... Do you have any suggestions for performing tests?


In any case, you can see an example of a file encoded in cp1251 at:

http://goffredo-baroncelli.homelinux.net/bazaar-dev/cp1251test?cmd=content;rev=redvasily%40gmail.com-20070705163936-0lq263r08odcy615;pathrevid=redvasily%40gmail.com-20070705163936-0lq263r08odcy615;path=test.txt
 
> 
> > you can specify which test you want to perform. For example:
> >
> >    ghigo at venice:~$ bzr selftest webserve
> >        bzr: /home/ghigo/bazaar/bzr.dev/bzr
> >     bzrlib: /home/ghigo/bazaar/bzr.dev/bzrlib
> >
> >    [37/37 in 9s]    bzrlib.plugins.webserve.test_webserve.TestW....
> >    -----------------------------------------------------------------
> >    Ran 37 tests in 9.695s
> >
> >    OK
> >    tests passed
> >    ghigo at venice:~$
> >
> >
> > so you can perform only the tests related to webserve. In any case
> > "bzr help selftest" will give you all the info.
> 
> Hmm, I ran bzr-0.14 until recently, I
> 
> 
> > Moreover, I am reviewing your patch, and I am inclined to accept it. The
> > only change that I want is to put in the config file/command line the
> > encoding of the files content and the encoding of the html page.
> 
> 
> > So we can have cp1251 as encoding of the file content and iso8859-1 as
> > encoding of the html file...
> 
> Well, that wouldn't hurt, but I don't think it will provide lots of
> benefits. HTML is viewed with browsers and now all browsers work with
> UTF8 just fine.
> 
> May I suggest extracting the source file encoding not only from an
> environment variable but also from webserve configuration files, so that
> different projects hosted under one webserve instance could use different
> encodings.
Done
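
A rough sketch of how such a lookup could work, in Python 3 syntax. The
function name, the "encoding" settings key and the utf-8 default are all
hypothetical, and the precedence (project settings first, then the
BZR_ENCODING environment variable, then a default) is an assumption, not
necessarily what webserve does:

```python
import os


def resolve_encoding(project_settings, environ=None, default="utf-8"):
    """Pick the file-content encoding for one hosted project.

    Assumed precedence: per-project setting, then BZR_ENCODING from the
    environment, then the default.
    """
    if environ is None:
        environ = os.environ
    return (
        project_settings.get("encoding")   # hypothetical per-project key
        or environ.get("BZR_ENCODING")
        or default
    )
```

For example, resolve_encoding({"encoding": "cp1251"}) returns "cp1251"
whatever the environment says, so two projects under one webserve instance
can use different encodings.
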
> 
> Perhaps Chardet: http://chardet.feedparser.org/ could be used, but I
> am not sure how it can be integrated within webserve, as it's better
> used with large chunks of data.
> 
> So I suppose that in a perfect world webserve should use:
> 
> Chardet -> BZR_ENCODING -> Project settings
> 
> >
> > Finally, what is the meaning of
> >
> >      if isinstance(s, str):
> >          r = s.replace('@', '@')
> >      else:
> >          r = s.replace(u'@', u'@')
> >
> > can the lines above be replaced by ?
> >
> >      r = s.replace("@", "@")
> 
> You are right. Those lines can be replaced with r = s.replace("@",
> "@"), I tested it and it works. Perhaps I was afraid of automatic
> unicode conversion to ascii, so I used two replaces with string and
> unicode objects, but perhaps '@' gets promoted to u'@' automatically
> instead.
> 
> -- 
> Vasily Sulatskov
> 

Goffredo

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack at inwind.it>
Key fingerprint = CE3C 7E01 6782 30A3 5B87  87C0 BB86 505C 6B2A CFF9


