[delta-l] Delta files - character encoding

Thomas Kluyver takowl at gmail.com
Mon Jan 23 15:15:14 CET 2012


Thanks, Mike,

> > Text Encoding
> > --------------------
> > The directive to specify the text encoding used in DELTA files is:
> >     *TEXT ENCODING cp1252
> > New programs should be able to read files using at least cp850, cp1252
> > and utf-8. If the directive is not present, programs should assume
> > text is stored as cp1252, unless instructed otherwise.
>
> It might be better to use the form of the HTML 'charset' parameter:
>     *TEXT ENCODING windows-1252
>
> An alternative mechanism would be to incorporate the information in a
> special comment at the start of the file:
>     *COMMENT @charset=UTF-8
> This would allow the file to be used by older programs that might not need
> the encoding information.
>

I quite like this comment idea, if it avoids having to add a separate file
for the encoding. Borrowing from how the encoding is specified in Python
source code [1], I suggest that the first line of each DELTA format file,
if it is a comment, be searched for the regex "coding[:=]\s*([-\w.]+)".
This would match comments including:
    *COMMENT coding: UTF-8
    *COMMENT encoding=UTF-8

Encoding names would be case insensitive. New programs should be prepared
to accept at least 'windows-1252', 'cp850' (or is 'ibm850' better for
this?) and 'utf-8'. If the first line isn't a comment, or is a comment
without this pattern, the encoding is assumed to be 'windows-1252'.
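
To make the proposal concrete, here is a rough Python sketch of the check
described above. It is only an illustration, not a reference implementation:
the function name detect_encoding is made up, it assumes the directive is
written in upper case as "*COMMENT", and it leans on Python's codec registry
for the case-insensitive name lookup. What should happen with an unrecognised
encoding name isn't settled above, so the sketch just lets the LookupError
propagate.

```python
import codecs
import re

# Pattern borrowed from PEP 263, as suggested above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

# Fallback when no encoding comment is found.
DEFAULT_ENCODING = "windows-1252"


def detect_encoding(first_line: bytes) -> str:
    """Return the normalised Python codec name for a DELTA file,
    given the raw bytes of its first line (hypothetical helper)."""
    # latin-1 maps every byte to a character, so this decode never fails;
    # the directive and the encoding name themselves are plain ASCII.
    text = first_line.decode("latin-1")
    name = DEFAULT_ENCODING
    # Assumption: the comment directive is spelled "*COMMENT" in upper case.
    if text.lstrip().startswith("*COMMENT"):
        match = CODING_RE.search(text)
        if match:
            name = match.group(1)
    # codecs.lookup() ignores case and resolves aliases; it raises
    # LookupError for names Python does not know.
    return codecs.lookup(name).name


print(detect_encoding(b"*COMMENT coding: UTF-8"))    # utf-8
print(detect_encoding(b"*COMMENT encoding=UTF-8"))   # utf-8
print(detect_encoding(b"*SHOW Some other directive"))  # cp1252
```

Note that Python reports 'windows-1252' under its own canonical name
'cp1252', and it treats 'ibm850' as an alias of 'cp850', so either spelling
would work for a program written this way.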

[1] http://www.python.org/dev/peps/pep-0263/

Thomas