[storm] Latin-1 Character Set

Stuart Bishop stuart at stuartbishop.net
Fri Nov 7 07:53:00 GMT 2008


On Thu, Oct 30, 2008 at 11:36 AM, James Henstridge <james at jamesh.id.au> wrote:
> [CC'ing the mailing list, since you dropped it in your reply]
>
> On Wed, Oct 29, 2008 at 6:37 PM, kevin gill <kevin at movieextras.ie> wrote:
>>> PostgreSQL should reencode input/output between the database encoding
>>> and client encoding for text/character fields.
>>>
>>>     http://www.postgresql.org/docs/8.3/static/multibyte.html
>>>
>>> Storm sets the client encoding to UTF-8, which should work with any
>>> database encoding (of course, some unicode strings passed to the
>>> database may give errors if they can't be represented, but that is
>>> what you'd expect).  Is this not happening for you?
>>
>> This is an old database which is connected to a Zope 2 site. The database
>> is SQL_ASCII, and the Zope 2 system binds to it using latin-1 (PsycopgDA
>> etc). The result is that there is data on the database encoded in latin-1
>> but PostgreSQL has no rules for handling it.
>
> That does sound like a problem.  I don't suppose you'd have the
> opportunity to dump and restore your database with a correct encoding?
>  The page I referenced above strongly recommends against use of that
> encoding.

This is a valid setup, although a non-optimal one. In Kevin's case he
might save himself future pain by rebuilding the database to
explicitly specify LATIN1 as the encoding, if he doesn't plan to
migrate to a UTF8 database in the future. Other people do use this
setup, though, when they need to store data in multiple encodings in
the same database.

The absolute worst case would be subsets of rows storing data in
different encodings. eg. a table that stores text in the original input
encoding (along with enough information that it can be decoded again!)
rather than normalizing it to a common encoding.

A slightly better case is a table that stores data in different
columns in different encodings.

Next is a database where data in different tables is stored in
different encodings.

Finally is a database where all data is stored in a particular
encoding, but the client needs to know the encoding so it can decode
it (Kevin's Latin1 database).
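To make the last scenario concrete (a sketch, not Storm code): with an
SQL_ASCII database the server does no recoding, so the client gets raw
bytes back and must decode them itself with the encoding it happens to
know the data was written in.

```python
# Raw bytes as they might sit in an SQL_ASCII database that was
# populated by a latin-1 client (the byte value is illustrative).
raw = b"Caf\xe9"

# The client that knows the encoding can decode correctly:
print(raw.decode("latin-1"))  # Café

# A client assuming UTF-8 (as Storm's client encoding does) fails,
# because 0xe9 is not a valid UTF-8 byte sequence here:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("cannot decode as UTF-8:", e.reason)
```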

If people think these legacy systems are worth supporting, I would
hope it can be done while adding minimal complexity. Perhaps an
EncodedText column type to use instead of Unicode, with the DB
encoding as a required parameter? I think this is preferable as it is
explicit, supports more scenarios and database backends, and allows
systems to gradually migrate to a UTF8-only DB.
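Roughly what I have in mind, sketched as a plain Python descriptor
rather than a real Storm Variable (the EncodedText name is from the
proposal above; everything else here is hypothetical, not Storm's
actual API):

```python
class EncodedText:
    """Exposes unicode to the application while storing raw bytes
    in a declared legacy encoding, as the database column does."""

    def __init__(self, storage_attr, encoding):
        # The encoding is a required parameter: the client must state
        # explicitly how the column's bytes are encoded.
        self._attr = storage_attr
        self.encoding = encoding

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Decode the stored bytes on the way out.
        return getattr(obj, self._attr).decode(self.encoding)

    def __set__(self, obj, value):
        # Encode application-supplied unicode on the way in.
        setattr(obj, self._attr, value.encode(self.encoding))


class Person:
    # "_name" holds the raw bytes as stored in the DB; "name"
    # presents unicode, decoded with the declared encoding.
    name = EncodedText("_name", "latin-1")

    def __init__(self, raw_name):
        self._name = raw_name


p = Person(b"Jos\xe9")          # latin-1 bytes from the database
assert p.name == "Jos\u00e9"    # application sees unicode
p.name = "R\u00e9my"
assert p._name == b"R\xe9my"    # bytes ready to write back
```

The point being that the decoding rule lives in the schema definition,
so a column at a time can later be converted and switched to Unicode.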

(I don't think the first scenario I listed can be supported by an ORM
directly without great complexity, as the ORM would need to be taught
how to deduce the encoding of columns, and how to store a valid row
when writing.)

-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/
