[MERGE] Using unicode Windows API to obtain command-line arguments.(#375934)

John Arbash Meinel john at arbash-meinel.com
Wed May 13 22:40:00 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> This patch improves command line arguments handling on Windows.
> For details see https://bugs.launchpad.net/bzr/+bug/375934
> 
> In short: current bzr code unable to get correct unicode command-line arguments if they're outside
> of current user encoding. Function get_unicode_argv() in win32utils.py based on the code from
> fixutf8 extension for hg. I've added special filter for arguments when bzr used with usual python
> interpreter (things are much simpler when standalone bzr.exe used).
> 
> I think this patch is improvement and will make bzr's unicode support even better.
> I've tested it manually because I can't figure how to test it with selftest.
> 
> I hope John Meinel will be able to review this patch.
> 

So interestingly, it seems that 'cmd.exe' is better than it pretends to be.

For example, if you do:
C:\...\mytest>chcp
Active code page: 437

C:\...\mytest>dir
 Volume in drive C is C
 Volume Serial Number is 3433-0825

 Directory of C:\Users\jameinel\dev\,tmp\test\mytest

05/13/2009  03:54 PM    <DIR>          .
05/13/2009  03:54 PM    <DIR>          ..
05/13/2009  03:39 PM                 0 Тест.txt
               1 File(s)              0 bytes
               2 Dir(s)   4,409,217,024 bytes free

So in code page 437, Russian characters are not allowed. In that
circumstance, if you do

C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st
unknown:
  ????.txt

If you then do:
C:\...\mytest>chcp 1251
C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st
unknown:
  Тест.txt

So while bzr is able to figure out how to write things out if the code
page supports it, it seems that 'dir' is able to write it out regardless.

Though I'll note that it always draws boxes for Arabic characters. I
can't show that via pasting, because it actually Cut & Pastes the
correct characters. (So Thunderbird is able to draw them correctly, etc.)

Anyway, going a bit further, it seems that if you try to do:
C:\...\mytest>C:\Python25\python C:\...\bzr.dev\bzr st Тест.txt
nonexistent:
  ????.txt
bzr: ERROR: Path(s) do not exist: "????.txt"

Even though my code page is 1251 and can represent those characters
correctly.

With Alexander's patch, I get:
C:\...\mytest>C:\Python25\python C:\...\bzr st Тест.txt
unknown:
  Тест.txt

So I can confirm that the change itself does mean that it 'unbreaks'
unicode command line arguments.

Now on to the rest...

It changes the definition of the argv parameter in main(). Now, I don't
think that main() is really the recommended api (that would be run_bzr).
But I'm tempted to instead do:

if argv is None:
  argv = osutils.get_unicode_argv()
else:
  user_encoding = osutils.get_user_encoding()
  try:
    argv = [a.decode(user_encoding) for a in argv[1:]]
  except UnicodeDecodeError:
    ...

That would change it to be:
  If you do not supply a parameter, we will try to do the right thing
  and decode the command line arguments correctly. (Either via sys.argv,
  or via GetCommandLineW())

  If you *do* supply a parameter, then we will decode it as we did in
  the past.

I'm not stuck on having it be this way, as obviously on Windows we can't
guarantee that we can decode strings people pass in. Which would mean it
is better to pass in Unicode, since then the caller can be exact about
what they want to say.

I *will* say that we should at least upcast all the strings passed in to
Unicode, rather than just attempt to decode them via ascii. So at a
minimum I would change it to:

+        try:
+            new_argv = []
+            # check for presence of non-ascii strings
+            for a in argv:
+                if isinstance(a, unicode):
+		     new_argv.append(a)
+		 else:
+		     # issue a deprecation warning here??
+                    new_argv.append(a.decode('ascii'))
+        except UnicodeDecodeError:
+            raise errors.BzrError("argv should be list of unicode
strings.")
+        argv = new_argv

So I'll leave the choice up to you. But I would like the code here to
ensure that argv is a list of unicode strings when it is done.


+if sys.platform == 'win32':
+    f = getattr(win32utils, 'get_unicode_argv', None)
+    if f is not None:
+        get_unicode_argv = f

^- I don't really like this method of doing it. We are using 'getattr()'
on a lazy import, which feels a bit off. (It should work fine, though.)

What if instead we do:

if has_ctypes and winver != 'Windows 98':
  def get_unicode_argv():
    ...
else:
  get_unicode_argv = None

and then in osutils.py we can do:

if sys.platform == 'win32':
  if win32utils.get_unicode_argv is not None:
    get_unicode_argv = win32utils.get_unicode_argv


I would also note that we already have a big:
  if sys.platform == 'win32':

code block, and I'd rather re-use it if possible. I think that just
means defining 'get_unicode_argv()' a bit higher.

+        argv = [pargv[i] for i in range(1, c.value)]
+        if getattr(sys, 'frozen', None) is None:
+            # python.exe [PYTHON_OPTIONS] bzr [BZR_OPTIONS]
+            # manually removing python, its options and 'bzr' script name
+            first_item = sys.argv[0]    # should be 'bzr'
+                                        # but we cannot be 100% sure
+            ix = argv.index(first_item)
+            if first_item == '-c':      # python -c "..."
+                ix += 1                 # skip python code
+            argv = argv[ix+1:]


^- I would change the comments here a little bit
+        # Skip the first argument, since we only care about parameters
+        argv = [pargv[i] for i in range(1, c.value)]
+        if getattr(sys, 'frozen', None) is None:
+            # Invoked via 'python.exe' which takes the form:
+            #   python.exe [PYTHON_OPTIONS] C:\Path\bzr [BZR_OPTIONS]
+            # so we need to strip off more than just the first argument
+            # first, find the expected first argument
+            first_item = sys.argv[0]
+            arg_start = argv.index(first_item) + 1
+            if first_item == '-c':
+                # python -c "..." arg1 arg2
+                # We've skipped over '-c', we also need to skip
+                # "...".
+                arg_start += 1
+            argv = argv[arg_start:]

^- I'm not sure if we want to do this last bit. I think it is to support
doing something like:

python -c "import bzrlib.commands; bzrlib.commands.main()" args to bzr

I suppose we can support it, it just doesn't feel very worthwhile.

BB:tweak

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoLPjAACgkQJdeBCYSNAAOkJQCfXLE5epOt1fWfby8zla5m/eQr
KtwAn1qhfUvhywJctmHlZH3AMR7jDaGr
=zjBX
-----END PGP SIGNATURE-----



More information about the bazaar mailing list