[Bug 1422290] Re: Default charsets handling for Windows archives in CJKV+th locale

Yuan Chao 1422290 at bugs.launchpad.net
Sun Mar 1 20:33:20 UTC 2015


This is from one of my machines running Lubuntu:

$ export |grep LANG
declare -x LANG="en_US.UTF-8"

$ export |grep LC
declare -x LC_ADDRESS="en_US.UTF-8"
declare -x LC_IDENTIFICATION="en_US.UTF-8"
declare -x LC_MEASUREMENT="en_US.UTF-8"
declare -x LC_MONETARY="en_US.UTF-8"
declare -x LC_NAME="en_US.UTF-8"
declare -x LC_NUMERIC="en_US.UTF-8"
declare -x LC_PAPER="en_US.UTF-8"
declare -x LC_TELEPHONE="en_US.UTF-8"
declare -x LC_TIME="en_US.UTF-8"

$ unzip -h
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...

Use the file from here: http://www1.axfc.net/uploader/Sc/so/325701.zip
(passwd: backer) (CP932)

$ unzip celluloid.zip 
Archive:  celluloid.zip
  inflating: celluloid/readme.txt    
  inflating: celluloid/В╣ВщВчВдВ╟.ust  
  inflating: celluloid/В╣ВщВчВдВ╟2Ф╘.ust  
  inflating: celluloid/В╣ВщВчВдВ╟СхГTГrСOВйВч.ust  

$ unzip -O cp932 celluloid.zip 
Archive:  celluloid.zip
  inflating: celluloid/readme.txt    
  inflating: celluloid/せるらうど.ust  
  inflating: celluloid/せるらうど2番.ust  
  inflating: celluloid/せるらうど大サビ前から.ust  

$ unzip -O cp936 celluloid.zip 
Archive:  celluloid.zip
  inflating: celluloid/readme.txt    
  inflating: celluloid/偣傞傜偆偳.ust  
  inflating: celluloid/偣傞傜偆偳2斣.ust  
  inflating: celluloid/偣傞傜偆偳戝僒價慜偐傜.ust  

$ unzip -O cp950 celluloid.zip 
Archive:  celluloid.zip
  inflating: celluloid/readme.txt    
  inflating: celluloid/�����炤��.ust  
  inflating: celluloid/�����炤��2��.ust  
  inflating: celluloid/�����炤�Ǒ��T�r�O����.ust  

Another file from here: http://3jf.wodemo.com/file/310894 (CP936)

$ unzip -L 王妃.zip 
Archive:  王妃.zip
  inflating: ═їх·_a.ust         
  inflating: ═їх·_b.ust         

$ unzip -O cp932 王妃.zip 
Archive:  王妃.zip
  inflating: ヘ銈A.ust          
  inflating: ヘ銈B.ust          

$ unzip -O cp936 王妃.zip 
Archive:  王妃.zip
  inflating: 王妃_A.ust            
  inflating: 王妃_B.ust            

$ unzip -O cp950 王妃.zip 
Archive:  王妃.zip
  inflating: 卼漦_A.ust            
  inflating: 卼漦_B.ust            

Actually, not all of the wrong cases map to illegal UTF-8 strings (question
marks). I guess that is why auto-detection is not so straightforward?
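
To illustrate the point about auto-detection, here is a minimal Python sketch.
It is not part of unzip; the helper name decodable_charsets and the candidate
list are made up for this example. It encodes the two file names from the
transcripts above the way a Japanese or Chinese Windows would store them, then
checks which code pages accept the bytes under strict decoding.

```python
# Candidate Windows code pages used in the transcripts above (illustrative list).
CANDIDATES = ["cp932", "cp936", "cp949", "cp950"]


def decodable_charsets(raw, charsets=CANDIDATES):
    """Return {charset: decoded_text} for every charset that accepts raw."""
    results = {}
    for charset in charsets:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            pass  # this charset rejects the bytes, so it can be ruled out
    return results


# File name bytes as a CP932 (Japanese) and a CP936 (Chinese) system would store them.
japanese_raw = "せるらうど".encode("cp932")
chinese_raw = "王妃".encode("cp936")

for label, raw in [("CP932 name", japanese_raw), ("CP936 name", chinese_raw)]:
    accepted = ", ".join(decodable_charsets(raw))
    print(label, "is accepted by:", accepted)
```

Each name is accepted by more than one code page, so "does it decode without
errors" alone cannot pick the right charset; a detector would need extra
hints such as the user's locale or character frequency.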

-- 
You received this bug notification because you are a member of Ubuntu
Sponsors Team, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1422290

Title:
  Default charsets handling for Windows archives in CJKV+th locale

Status in unzip package in Ubuntu:
  Triaged
Status in unzip package in Debian:
  Confirmed

Bug description:
  With the current unzip package in Ubuntu, we need to specify the charset
  explicitly to extract zip files sent from localized Windows systems.

  For example, for zip files sent from a Japanese-localized Windows system:
  $ zipinfo -O CP932 sent-from-localized-windows.zip
  $ unzip -O CP932 sent-from-localized-windows.zip

  This method won't work for GUI applications like file-roller, because users
  have no way to specify the charset from the GUI.

  The attached branch adds default charset handling for Windows archives in
  CJKV+th locales, inspired by the Ubuntu Kylin approach.

  As a result of bug #580961, two options have been added as an Ubuntu patch.
  > -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
  > -I CHARSET specify a character encoding for UNIX and other archives
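
The idea behind -O can be sketched outside of unzip as well. The Python
example below is only an illustration and not how the Ubuntu patch is
implemented: it relies on the fact that Python's zipfile module decodes entry
names as CP437 when the UTF-8 flag (general purpose bit 11) is not set, so
the original bytes can be recovered and decoded again in the charset the
user names. The helper names recover_name and list_names are hypothetical.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the entry name is stored as UTF-8


def recover_name(name, flag_bits, charset):
    """Reinterpret a name that zipfile decoded as CP437 in the given charset."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8, nothing to fix
    # CP437 maps every byte value, so this round trip restores the raw bytes.
    return name.encode("cp437").decode(charset)


def list_names(archive_file, charset):
    """List entry names, assuming names without the UTF-8 flag use charset."""
    with zipfile.ZipFile(archive_file) as archive:
        return [
            recover_name(info.filename, info.flag_bits, charset)
            for info in archive.infolist()
        ]
```

For example, list_names("celluloid.zip", "cp932") would play the role of
"zipinfo -O cp932 celluloid.zip" for the name listing only. A wrong charset
can still raise UnicodeDecodeError or silently produce other characters.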

  Then Ubuntu Kylin added default encodings as environment variables for their distribution.
  http://bazaar.launchpad.net/~ubuntukylin-members/ubuntukylin-default-settings/trunk/revision/171

  Now, in Ubuntu, we can go further in a better way:
   - per-user settings based on the user's locale instead of global settings
   - no interference with other locales, thanks to a locale guard

  I only add "-O", so there is no behavior change for zip files created on
  Ubuntu or other Linux/UNIX systems. This branch just handles zip files
  created on localized Windows systems seamlessly.

  The charset list is taken from:
  https://msdn.microsoft.com/en-us/goglobal/bb964654
  and
  msdos/msdos.c in unzip package:
     case 932: /* Japanese */
     case 949: /* Korean */
     case 936: /* Chinese, simple */
     case 950: /* Chinese, traditional */
     case 874: /* Thai */
     case 1258: /* Vietnamese */

  (Copied from @nobuto's branch description.)
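
The per-user default with a locale guard described above could look roughly
like the following Python sketch. This is not the code from the branch. The
code pages come from the list above, but the locale prefixes they are keyed
on, the environment variable precedence, and the function names
default_charset and unzip_command are assumptions made for illustration.

```python
import os

# Code pages from the list above. The locale keys are an assumption of this
# sketch; the actual branch may match locales differently.
DEFAULT_WINDOWS_CHARSET = {
    "ja": "CP932",
    "ko": "CP949",
    "zh_CN": "CP936",
    "zh_TW": "CP950",
    "th": "CP874",
    "vi": "CP1258",
}


def default_charset(environ=os.environ):
    """Return the assumed Windows code page for the user's locale, or None."""
    # Assumed precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    name = environ.get("LC_ALL") or environ.get("LC_CTYPE") or environ.get("LANG") or ""
    name = name.split(".")[0].split("@")[0]  # drop ".UTF-8" and "@modifier"
    language = name.split("_")[0]
    # Try the full "zh_CN" form first, then the bare language such as "ja".
    return DEFAULT_WINDOWS_CHARSET.get(name) or DEFAULT_WINDOWS_CHARSET.get(language)


def unzip_command(archive, environ=os.environ):
    """Build an unzip argument list, adding -O only inside the guarded locales."""
    charset = default_charset(environ)
    options = ["-O", charset] if charset else []
    return ["unzip", *options, archive]
```

With LANG=en_US.UTF-8, as on the machine in the comment above, no option is
added and unzip behaves as before; with LANG=ja_JP.UTF-8 the command becomes
"unzip -O CP932 <archive>".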



