[Bug 875713] [NEW] cut fails to handle correctly utf-8

Sun Oct 16 13:23:32 UTC 2011

Public bug reported:

1) I'm using Lucid :
$ lsb_release -rd
Description:	Ubuntu 10.04.3 LTS
Release:	10.04

2) The version of coreutils (that contains the 'cut' utility)
$ apt-cache policy coreutils
coreutils:
  Installé : 7.4-2ubuntu3
  Candidat : 7.4-2ubuntu3
 Table de version :
 *** 7.4-2ubuntu3 0
        500 http://fr.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        100 /var/lib/dpkg/status
     7.4-2ubuntu2 0
        500 http://fr.archive.ubuntu.com/ubuntu/ lucid/main Packages

3) What I expect to happen, as the man page says:
$ man cut
(...)
  -b, --bytes=LIST
    select only these bytes
  -c, --characters=LIST
    select only these characters
(...)

I expect -b and -c to behave differently in multibyte character
encodings such as UTF-8, UTF-16, etc.

wc makes the same distinction, as its man page says:
$ man wc
(...)
  -c, --bytes
     print the byte counts
  -m, --chars
     print the character counts
(...)

So with this in my environment:
$ env | grep 'LANG'
LANG=fr_FR.UTF-8
GDM_LANG=fr_FR.UTF-8

I can do:
$ printf '%s' 'déjà vu' | wc -c
9
$ printf '%s' 'déjà vu' | wc -m
7

That is CORRECT (with wc): as the environment variable says, I'm using
UTF-8, so 'é' and 'à' each count as 1 character but 2 bytes.
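
To make the byte/character distinction concrete independently of any
locale setup, here is a minimal sketch in Python (not part of the
original transcript; the variable names are only illustrative). It
counts the same string both ways, mirroring wc -c and wc -m:

```python
# "déjà vu", written with escapes so the source encoding cannot matter.
text = "d\u00e9j\u00e0 vu"

# Encode to UTF-8 to obtain the bytes that wc -c would count.
encoded = text.encode("utf-8")

byte_count = len(encoded)  # analogous to wc -c
char_count = len(text)     # analogous to wc -m (counts code points)

print(byte_count, char_count)
```

The two counts differ by 2 because 'é' and 'à' each take 2 bytes in
UTF-8 while the other five characters take 1 byte each.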

4) What happens instead: I get wrong results with 'cut'
$ printf '%s' 'déjà vu' | cut -b 1-4 | hd
00000000  64 c3 a9 6a 0a                                    |d..j.|
$ printf '%s' 'déjà vu' | cut -c 1-4 | hd
00000000  64 c3 a9 6a 0a                                    |d..j.|

(I piped it to 'hd' to give a better view of what's happening: -c 1-4
returns the same 4 bytes as -b 1-4 instead of the 4 characters 'déjà'.)

It can be even worse and produce invalid UTF-8 output:
$ printf '%s' 'déjà vu' | cut -c 1-5 | hd
00000000  64 c3 a9 6a c3 0a                                 |d..j..|

That is because it cuts in the middle of a UTF-8 sequence; followed by the appended \n, the result is an invalid UTF-8 sequence, as we can prove:
$ printf '%s' 'déjà vu' | cut -c 1-5 | iconv -f utf-8 -t iso8859-1
d�jiconv: séquence d'échappement non permise à la position 4
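
The difference between cutting bytes and cutting characters can also
be sketched in Python. This is only an illustration of the behaviour I
would expect from a character-aware cut -c, under the assumption that
the input is UTF-8; it is not how coreutils is implemented, and
is_valid_utf8 is a hypothetical helper written for this example:

```python
# "déjà vu", written with escapes so the source encoding cannot matter.
text = "d\u00e9j\u00e0 vu"
data = text.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What cut -c 1-5 does in this report: keep the first 5 bytes.
# The slice ends on 0xc3, the lead byte of 'à', so the sequence is truncated.
byte_cut = data[:5]

# What a character-aware cut -c 1-5 would keep: the first 5 characters,
# re-encoded as UTF-8.
char_cut = text[:5].encode("utf-8")

print(byte_cut.hex(), is_valid_utf8(byte_cut))
print(char_cut.hex(), is_valid_utf8(char_cut))
```

The byte slice reproduces the 64 c3 a9 6a c3 seen in the hd output above
and fails to decode, while the character slice ends on a complete
sequence.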

We can clearly see that -b or -c makes NO difference at all, but it should, as with 'wc', because we are in UTF-8.

So either:
-Option a) the cut program is buggy: it confuses the concepts of 'byte' and 'character', and does the wrong thing when not in a 1-byte charset.
-Option b) the help/man page is wrong, and there is no difference between handling bytes and chars (but then why do we have two different options?)

Note that some other GNU utilities seem to mix the concepts of 'byte' and 'char', but at least the confusion is visible in the man/help, for example with 'head':
$ man head
(...)
 -c, --bytes=[-]N
     print the first N bytes of each file; with the leading `-', print all but the last N bytes of each file

The -b option is unused for 'head', and yet they chose to use -c as
the short form of --bytes. It looks like they meant -c as --chars, but
at the last moment decided to handle only bytes!

** Affects: coreutils (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to coreutils in Ubuntu.
https://bugs.launchpad.net/bugs/875713

Title:
  cut fails to handle correctly utf-8

Status in “coreutils” package in Ubuntu:
  New

