[MERGE] Using unicode Windows API to obtain command-line arguments.(#375934)

Thu May 14 09:17:55 BST 2009

John Arbash Meinel пишет:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Alexander Belchenko wrote:
>> This patch improves command line arguments handling on Windows.
>> For details see https://bugs.launchpad.net/bzr/+bug/375934
>>
>> In short: current bzr code unable to get correct unicode command-line arguments if they're outside
>> of current user encoding. Function get_unicode_argv() in win32utils.py based on the code from
>> fixutf8 extension for hg. I've added special filter for arguments when bzr used with usual python
>> interpreter (things are much simpler when standalone bzr.exe used).
>>
>> I think this patch is improvement and will make bzr's unicode support even better.
>> I've tested it manually because I can't figure how to test it with selftest.
>>
>> I hope John Meinel will be able to review this patch.
>>
> 
> So interestingly, it seems that 'cmd.exe' is better than it pretends to be.

> For example, if you do:
> C:\...\mytest>chcp
> Active code page: 437
> 
> C:\...\mytest>dir
>  Volume in drive C is C
>  Volume Serial Number is 3433-0825
> 
>  Directory of C:\Users\jameinel\dev\,tmp\test\mytest
> 
> 05/13/2009  03:54 PM    <DIR>          .
> 05/13/2009  03:54 PM    <DIR>          ..
> 05/13/2009  03:39 PM                 0 Тест.txt
>                1 File(s)              0 bytes
>                2 Dir(s)   4,409,217,024 bytes free
> 
> So in code page 437, Russian characters are not allowed. In that
> circumstance, if you do
> 
> C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st
> unknown:
>   ????.txt
> 
> If you then do:
> C:\...\mytest>chcp 1251
> C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st
> unknown:
>   Тест.txt
> 
> So while bzr is able to figure out how to write things out if the code
> page supports it, it seems that 'dir' is able to write it out regardless.

IIUC cmd.exe using unicode Win32 API to write the output directly to 
console window. IMO bzr can't use this trick because output could be 
redirected.

> Though I'll note that it always draws boxes for Arabic characters. I
> can't show that via pasting, because it actually Cut & Pastes the
> correct characters. (So Thunderbird is able to draw them correctly, etc.)
> 
> Anyway, going a bit further, it seems that if you try to do:
> C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st Тест.txt
> nonexistent:
>   ????.txt
> bzr: ERROR: Path(s) do not exist: "????.txt"
> 
> Even though my code page is 1251 and can represent those characters
> correctly.
> 
> With Alexander's patch, I get:
> C:\...\mytest>C:\Python25\python C:\...\bzr st Тест.txt
> unknown:
>   Тест.txt
> 
> So I can confirm that the change itself does mean that it 'unbreaks'
> unicode command line arguments.

Thanks. Indeed, 'unbreaks'.

> 
> Now on to the rest...
> 
> It changes the definition of the argv parameter in main(). Now, I don't
> think that main() is really the recommended api (that would be run_bzr).
> But I'm tempted to instead do:
> 
> if argv is None:
>   argv = osutils.get_unicode_argv()
> else:
>   user_encoding = osutils.get_user_encoding()
>   try:
>     argv = [a.decode(user_encoding) for a in argv[1:]]
>   except UnicodeDecodeError:
>     ...
> 
> That would change it to be:
>   If you do not supply a parameter, we will try to do the right thing
>   and decode the command line arguments correctly. (Either via sys.argv,
>   or via GetCommandLineW())
> 
>   If you *do* supply a parameter, then we will decode it as we did in
>   the past.
> 
> I'm not stuck on having it be this way, as obviously on Windows we can't
> guarantee that we can decode strings people pass in. Which would mean it
> is better to pass in Unicode, since then the caller can be exact about
> what they want to say.
> 
> I *will* say that we should at least upcast all the strings passed in to
> Unicode, rather than just attempt to decode them via ascii. So at a
> minimum I would change it to:
> 
> +        try:
> +            new_argv = []
> +            # check for presence of non-ascii strings
> +            for a in argv:
> +                if isinstance(a, unicode):
> +		     new_argv.append(a)
> +		 else:
> +		     # issue a deprecation warning here??
> +                    new_argv.append(a.decode('ascii'))
> +        except UnicodeDecodeError:
> +            raise errors.BzrError("argv should be list of unicode
> strings.")
> +        argv = new_argv
> 
> So I'll leave the choice up to you. But I would like the code here to
> ensure that argv is a list of unicode strings when it is done.

OK, I prefer the latter. I really think argv should be list of unicode 
strings, or pure ascii.

> +if sys.platform == 'win32':
> +    f = getattr(win32utils, 'get_unicode_argv', None)
> +    if f is not None:
> +        get_unicode_argv = f
> 
> ^- I don't really like this method of doing it. We are using 'getattr()'
> on a lazy import, which feels a bit off. (It should work fine, though.)
> 
> What if instead we do:
> 
> if has_ctypes and winver != 'Windows 98':
>   def get_unicode_argv():
>     ...
> else:
>   get_unicode_argv = None
> 
> and then in osutils.py we can do:
> 
> if sys.platform == 'win32':
>   if win32utils.get_unicode_argv is not None:
>     get_unicode_argv = win32utils.get_unicode_argv

Agreed, will change.

> I would also note that we already have a big:
>   if sys.platform == 'win32':
> 
> code block, and I'd rather re-use it if possible. I think that just
> means defining 'get_unicode_argv()' a bit higher.

OK.

> 
> +        argv = [pargv[i] for i in range(1, c.value)]
> +        if getattr(sys, 'frozen', None) is None:
> +            # python.exe [PYTHON_OPTIONS] bzr [BZR_OPTIONS]
> +            # manually removing python, its options and 'bzr' script name
> +            first_item = sys.argv[0]    # should be 'bzr'
> +                                        # but we cannot be 100% sure
> +            ix = argv.index(first_item)
> +            if first_item == '-c':      # python -c "..."
> +                ix += 1                 # skip python code
> +            argv = argv[ix+1:]
> 
> 
> ^- I would change the comments here a little bit

Thanks, will use it.

> +        # Skip the first argument, since we only care about parameters
> +        argv = [pargv[i] for i in range(1, c.value)]
> +        if getattr(sys, 'frozen', None) is None:
> +            # Invoked via 'python.exe' which takes the form:
> +            #   python.exe [PYTHON_OPTIONS] C:\Path\bzr [BZR_OPTIONS]
> +            # so we need to strip off more than just the first argument
> +            # first, find the expected first argument
> +            first_item = sys.argv[0]
> +            arg_start = argv.index(first_item) + 1
> +            if first_item == '-c':
> +                # python -c "..." arg1 arg2
> +                # We've skipped over '-c', we also need to skip
> +                # "...".
> +                arg_start += 1
> +            argv = argv[arg_start:]
> 
> ^- I'm not sure if we want to do this last bit. I think it is to support
> doing something like:
> 
> python -c "import bzrlib.commands; bzrlib.commands.main()" args to bzr
> 
> I suppose we can support it, it just doesn't feel very worthwhile.

Well, I've tried to DTRT here. It's not big deal but it's not hard either.

> 
> BB:tweak

I'd prefer to resubmit new patch.

> 
> John
> =:->
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (Cygwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iEYEARECAAYFAkoLPjAACgkQJdeBCYSNAAOkJQCfXLE5epOt1fWfby8zla5m/eQr
> KtwAn1qhfUvhywJctmHlZH3AMR7jDaGr
> =zjBX
> -----END PGP SIGNATURE-----
> 
>