draft-zhu-apng-cc-encoding-v2-01

Network Working Group                                            HF. Zhu
Internet Draft: Chinese Character Encoding                    Tsinghua U
Document: internet-drafts/draft-apng-cc-encoding-03.8.txt         DY. Hu
                                                              Tsinghua U
                                                                ZG. Wang
                                                                    CITS
                                                                 TC. Kao
                                                                     III
                                                               WC. Chang
                                                                     III
                                                              M. Crispin
                                                            U Washington

                                                               July 1995


            Chinese Character Encoding for Internet Messages


Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   This is a draft document of APNG-CC, the Chinese Character
   sub-working group of the I18N/L10N (Internationalization and
   Localization) working group of APNG (Asia-Pacific Networking Group).
   A revised version of this draft document will be submitted to the RFC
   editor as an Informational RFC for the Internet Community.
   Discussion and suggestions for improvement are requested, and should
   be sent to apng-cc@apng.org or zhf@net.edu.cn (the coordinator).  This
   document will expire before February 30, 1996.  Distribution of this
   memo is unlimited.


Abstract

   This memo provides methods for transporting Chinese characters
   through, but not limited to, electronic mail [RFC-822] and network
   news [RFC-1036] in the Internet community.


Introduction

   As Internet use spreads among Chinese speakers around the world, the
   need to send documents containing Chinese characters over the
   Internet has grown.  The methods described in this document provide
   a means of transporting existing Chinese character sets and leave
   sufficient room for future extension.

   This document describes three groups of encodings:
      1. ISO-2022-CN and ISO-2022-CN-EXT
      2. CN-GB and CN-Big5
      3. ISO/IEC 10646/Unicode

    The first group of encodings is designed with interoperability in
    mind and is encouraged in this document; these encodings are 7-bit,
    support both simplified and traditional characters using both GB
    and CNS/Big5, and do not impose any unusual quoting requirements on
    ASCII characters.  The second group of encodings describes current
    common domestic usage.  The third group of encodings refers to the
    universal multilingual character set defined by ISO.

    Note:  ISO/IEC 10646 [ISO-10646] defines a 32-bit character space
    with the intent to encode all characters in the world.  Currently,
    only the lowest 16-bit plane of ISO 10646, the Basic Multilingual
    Plane (BMP), is defined.  The BMP is code-by-code identical to
    Unicode [Unicode 1.1].



Specification

1. 7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT

 1.1  Description

   ISO-2022-CN is based upon ISO 2022 [ISO-2022], similar to earlier
   work on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the
   Japanese and Korean languages.  It is 7-bit, and supports both
   simplified Chinese characters using GB 2312-80 [GB-2312] and
   traditional Chinese characters using the first two planes of CNS
   11643 [CNS-11643], as well as ASCII [ASCII] characters.

   ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally
   supports other GB character sets and planes of CNS 11643.

   Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do
   not require the 8-bit SMTP extensions.  ISO-2022-CN supports almost
   all the characters which appear in Big5 [BIG5] except for two duplicate
   characters which were mistakes in defining Big5.


 1.2  ISO-2022-CN

   The starting code of ISO-2022-CN is ASCII.  ASCII and Chinese characters
   are distinguished by the use of designations (ESC sequences) and shift
   functions.

   Designations define the Chinese character sets used in the text.
   There are three kinds of designations: SOdesignation, SS2designation
   and SS3designation.

   The SOdesignation is in the form ESC $ ) <F>, where <F> is the
   "final character" assigned to the character set by ISO (refer to the
   ISO registry [ISOREG] for more details).  The SS2designation is in
   the form ESC $ * <F>, and the SS3designation is in the form ESC $ +
   <F>.  A designation overrides any previous designation for
   subsequent bytes in the text.

   There are four kinds of shifts: SI, SO, SS2 and SS3.

   The shift SI (one byte with hexadecimal value 0F) declares that
   subsequent bytes are interpreted in ASCII.

   The shift SO (one byte with hexadecimal value 0E) declares that
   subsequent bytes are interpreted in the character set defined by
   SOdesignation.

   The shift SS2 (two bytes with hexadecimal values 1B 4E) declares
   that the subsequent TWO bytes are interpreted in the character set
   defined by SS2designation, after which the previous interpretation
   (from SI or SO) is restored.

   The shift SS3 (two bytes with hexadecimal values 1B 4F) declares
   that the subsequent TWO bytes are interpreted in the character set
   defined by SS3designation, after which the previous interpretation
   (from SI or SO) is restored.

   For example, the sequence:
     ESC $ ) A SO c_char1 ... c_char1 ESC $ ) G c_char2 ... c_char2 SI
   transfers mixed simplified Chinese and traditional Chinese text, in
   which c_char1s are simplified Chinese characters from GB-2312 and
   c_char2s are traditional Chinese characters from CNS-11643-plane 1.

   The escape sequence, shift function and character set used in an
   ISO-2022-CN text are as follows:

   Character sets                                       Shift in with
  --------------------------------------------------------------------
    ASCII                                                    SI
    GB 2312, CNS 11643-plane-1                               SO
             CNS 11643-plane-2                               SS2


      ESC $ ) A   Indicates the bytes following SO are Chinese characters
                  as defined in GB 2312-80, until another SOdesignation
                  appears

      ESC $ ) G   Indicates the bytes following SO are as defined in
                  CNS 11643-plane-1, until another SOdesignation appears

      ESC $ * H   Indicates the two bytes immediately following SS2 are
                  a Chinese character as defined in CNS 11643-plane-2,
                  until another SS2designation appears

   If there are any GB or CNS characters on a line, a designation for
   the corresponding character set should be used on that line, so that
   each line has its own character set information and the text can be
   displayed correctly when scrolling back in a window.  Also, there
   must be a shift to ASCII (SI) before the end of the line (i.e.,
   before the CRLF).  In other words, each line starts in ASCII and
   ends in ASCII.
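
   The per-line rules can be sketched as a small decoder that resets
   its state on every line.  This is a minimal sketch, not a complete
   ISO-2022-CN decoder: the name decode_gb2312_line is hypothetical,
   only the ESC $ ) A designation is handled, and Python's "gb2312"
   codec is assumed to accept GB 2312-80 codes with the MSB set.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def decode_gb2312_line(line: bytes) -> str:
    """Decode one ISO-2022-CN line (no CRLF) that uses only ESC $ ) A."""
    out = []
    designated = False   # designation state starts fresh on every line
    shifted = False      # every line starts in ASCII
    i = 0
    while i < len(line):
        b = line[i]
        if b == ESC:
            if line[i:i + 4] != b"\x1b$)A":
                raise ValueError("designation not handled by this sketch")
            designated = True
            i += 4
        elif b == SO:
            if not designated:
                raise ValueError("SO without a designation on this line")
            shifted = True
            i += 1
        elif b == SI:
            shifted = False
            i += 1
        elif shifted:
            pair = line[i:i + 2]
            if len(pair) != 2:
                raise ValueError("truncated two-byte character")
            # Setting the MSB gives the form the gb2312 codec expects.
            out.append(bytes(x | 0x80 for x in pair).decode("gb2312"))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    if shifted:
        raise ValueError("line does not end in ASCII (missing SI)")
    return "".join(out)
```

   A line that uses SO without its own designation, or that reaches
   the end while still shifted out, is rejected by this sketch.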

   The name given to this character encoding is "ISO-2022-CN". This name
   is intended to be used as the "charset" parameter in MIME [MIME-1,
   MIME-2] messages.

       Content-Type: text/plain; charset=iso-2022-cn

   The ISO-2022-CN encoding is already in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.
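
   As a hedged illustration, such a message could be assembled with
   Python's standard "email" package.  The body bytes below are a
   made-up example ("A", one GB 2312 character, "B"); the library has
   no built-in knowledge of ISO-2022-CN, so the body is encoded by hand
   and the charset is set only as a header parameter.

```python
from email.message import Message

# Hypothetical body: "A", then one GB 2312 character (bytes 56 50 hex
# between SO and SI), then "B".  Every octet is below 0x80.
body = b"A\x1b$)A\x0eVP\x0fB\r\n"

msg = Message()
msg["MIME-Version"] = "1.0"
msg["Content-Type"] = "text/plain; charset=iso-2022-cn"
# No Content-Transfer-Encoding header is added: the body is already 7-bit.
msg.set_payload(body.decode("ascii"))
```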

   Other restrictions are given in the "Formal Syntax of ISO-2022-CN
   and ISO-2022-CN-EXT" part at the end of this document.

 1.3  ISO-2022-CN-EXT

   ISO-2022-CN-EXT supports all characters in existing GB, Big5 and CNS
   11643 character sets.

   The escape sequence, shift function and character set used in an
   ISO-2022-CN-EXT text are as follows:

   Character sets                                       Shift in with
  --------------------------------------------------------------------
    ASCII                                                    SI
    GB 2312, GB 12345, CNS 11643-plane-1, GB 2312+GB 8565    SO
    GB 7589, GB 13131, CNS 11643-plane-2                     SS2
    GB 7590, GB 13132 or other new GBs,CNS 11643-plane-3 or  SS3
     higher planes of CNS 11643

      Note: Currently, there are some GB sets that have not been
      registered in ISO. Here <X7589>, <X7590>, <X12345>, <X13131>
      and <X13132> represent the final character that will be assigned
      by ISO for those sets.

      ESC $ ) A   Indicates the bytes following SO are Chinese characters
                  as defined in GB 2312-80, until another SOdesignation
                  appears

      ESC $ * <X7589>
                  Indicates the two bytes immediately following SS2 are
                  a Chinese character as defined in GB 7589-87
                  [GB-7589], until another SS2designation appears

      ESC $ + <X7590>
                  Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in GB 7590-87
                  [GB-7590], until another SS3designation appears


      ESC $ ) <X12345>
                  Indicates the bytes following SO are as defined in
                  GB 12345-90 [GB-12345], until another SOdesignation
                  appears

      ESC $ * <X13131>
                  Indicates the two bytes immediately following SS2 are
                  a Chinese character as defined in GB 13131-91
                  [GB-13131], until another SS2designation appears

      ESC $ + <X13132>
                  Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in GB 13132-91
                  [GB-13132], until another SS3designation appears


      ESC $ ) E   Indicates the bytes following SO are as defined in GB 2312+
                  GB 8565 [GB-8565], until another SOdesignation appears


      ESC $ ) G   Indicates the bytes following SO are as defined in
                  CNS 11643-plane-1, until another SOdesignation appears
      ESC $ * H   Indicates the two bytes immediately following SS2 are
                  a Chinese character as defined in CNS 11643-plane-2,
                  until another SS2designation appears
      ESC $ + I   Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in CNS 11643-plane-3,
                  until another SS3designation appears
      ESC $ + J   Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in CNS 11643-plane-4,
                  until another SS3designation appears
      ESC $ + K   Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in CNS 11643-plane-5,
                  until another SS3designation appears
      ESC $ + L   Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in CNS 11643-plane-6,
                  until another SS3designation appears
      ESC $ + M   Indicates the two bytes immediately following SS3 are
                  a Chinese character as defined in CNS 11643-plane-7,
                  until another SS3designation appears
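
   The designations listed above can be collected into a lookup table.
   The following Python sketch is illustrative only: the names
   DESIGNATIONS and parse_designation are hypothetical, and the GB sets
   whose final characters are not yet assigned (<X7589>, <X7590>,
   <X12345>, <X13131>, <X13132>) are deliberately left out.

```python
# (intermediate character, final character) -> (shift, character set)
DESIGNATIONS = {
    (")", "A"): ("SO", "GB 2312-80"),
    (")", "E"): ("SO", "GB 2312 + GB 8565"),
    (")", "G"): ("SO", "CNS 11643-plane-1"),
    ("*", "H"): ("SS2", "CNS 11643-plane-2"),
    ("+", "I"): ("SS3", "CNS 11643-plane-3"),
    ("+", "J"): ("SS3", "CNS 11643-plane-4"),
    ("+", "K"): ("SS3", "CNS 11643-plane-5"),
    ("+", "L"): ("SS3", "CNS 11643-plane-6"),
    ("+", "M"): ("SS3", "CNS 11643-plane-7"),
}


def parse_designation(seq: bytes):
    """Return (shift, charset) for a 4-byte ESC $ <I> <F> sequence."""
    if len(seq) != 4 or seq[:2] != b"\x1b$":
        raise ValueError("not an ESC $ designation")
    key = (chr(seq[2]), chr(seq[3]))
    if key not in DESIGNATIONS:
        raise ValueError("designation not listed for ISO-2022-CN-EXT")
    return DESIGNATIONS[key]
```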

   As in ISO-2022-CN, each line should start in ASCII, and end in ASCII,
   and should have its own designation information before any Chinese
   characters appear.

   The name given to this character encoding is "ISO-2022-CN-EXT". This name
   is intended to be used as the "charset" parameter in MIME messages.

       Content-Type: text/plain; charset=ISO-2022-CN-EXT

   The ISO-2022-CN-EXT encoding is also in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

   Other restrictions are given in the "Formal Syntax of ISO-2022-CN and
   ISO-2022-CN-EXT" part at the end of this document.


 1.4  How to Support Big5 or other internal codesets with ISO-2022-CN
      and ISO-2022-CN-EXT

   There are many different Chinese internal coding systems [CJK], such
   as Big5, GB internal code, CCCII (an encoding for library systems in
   Taiwan), and XGB (the codepage for Microsoft simplified Chinese
   Windows 95).  ISO-2022-CN and ISO-2022-CN-EXT are 7-bit and do not
   lose information when text moves between different codesets, which
   increases interoperability.  They are therefore ideal interchange
   encodings for the various internal Chinese codesets in international
   communication.

   For instance, ISO-2022-CN and ISO-2022-CN-EXT can be used to support
   Big5, because CNS-11643-plane 1 and 2 incorporate all Chinese
   characters in Big5 except two duplicate characters, which were
   mistakes made when Big5 was defined.

   Since Big5 and CNS-11643 order their codes differently, a conversion
   table is needed to convert between them.  The conversion table is
   attached as an appendix to this document.

   Public domain software (either binary or C source) is also available
   at several places on the Internet:

   1) Beijing:

    ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/
    (IP address: 166.111.1.11)

   2) Taiwan:

    ftp://tpdns.seed.net.tw/Pub/Chinese/DOS/code-convert/chcode.zip
    (IP address: 139.175.1.12)

   3) US:

    ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz
    (IP address: 128.123.1.55)

   4) Japan:
    ftp://etlport.etl.go.jp/pub/iso-2022-cn/
    (IP address: 192.31.197.99)


2. 8-bit Chinese encodings: CN-GB and CN-Big5

   The CN-GB and CN-Big5 charset names are given below.
   Among other things, these support current practice; specifically,
   CN-GB reflects the current usage for simplified Chinese e-mail,
   and CN-Big5 reflects the current usage for traditional Chinese e-mail.

     Note: the use of 8-bit character sets requires the use of
     either an 8-to-7 Content-Transfer-Encoding mechanism such as
     "BASE64" or "QUOTED-PRINTABLE" if the network is not 8-bit clean,
     or the 8-bit SMTP extensions [SMTPEXT] with the "8BIT"
     Content-Transfer-Encoding on 8-bit clean networks.  Otherwise,
     an 8-bit message which passes through a 7-bit mailer is likely
     to have the 8th bit truncated, resulting in an unreadable
     message.  Although "just send 8-bit data" has been common
     practice in the past, it is incorrect according to the
     Internet standards and causes interoperability problems.
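
   The two 8-to-7 mechanisms named in the note can be tried with
   Python's standard library.  This is a sketch under one assumption:
   that Python's "gb2312" codec produces the 8-bit form used by CN-GB.
   The sample text is made up.

```python
import base64
import quopri

# Assumption: the "gb2312" codec yields GB 2312 bytes with the MSB set.
raw = "中文 mail".encode("gb2312")

b64 = base64.b64encode(raw)     # body for Content-Transfer-Encoding: BASE64
qp = quopri.encodestring(raw)   # body for QUOTED-PRINTABLE

# Both results contain only 7-bit octets and decode back to the input.
restored_from_b64 = base64.b64decode(b64)
restored_from_qp = quopri.decodestring(qp)
```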

 2.1   CN-GB

   E-mail using GB characters is sent in this way:

   GB 2312-80 characters are used together with ASCII characters
   rather than with GB 1988-80 [GB-1988].

   GB 2312-80 is itself a 7-bit set.  To avoid conflicting with ASCII,
   the MSB (bit-8) of each byte of a GB 2312-80 character is set to 1,
   so that each byte becomes an 8-bit byte.  Any byte whose MSB is 0
   is interpreted as ASCII.  This constructs a character set named "GB
   Internal Code".

   This method is also adopted in the .gb files on the Internet.

   To use this character scheme with MIME, CN-GB is used as the value
   for the charset parameter:

      Content-Type: text/plain; charset=cn-gb
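
   The relationship between 7-bit GB 2312-80 codes and the 8-bit form
   used by CN-GB can be sketched as follows.  The function names are
   hypothetical, and the sketch assumes well-formed input (it does not
   check that a byte with the MSB set is followed by a second one).

```python
def gb2312_to_cn_gb(pair: bytes) -> bytes:
    """Set the MSB of both bytes of a 7-bit GB 2312-80 code."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("not a 7-bit GB 2312 code")
    return bytes(b | 0x80 for b in pair)


def split_cn_gb(data: bytes):
    """Split CN-GB data into ASCII bytes and two-byte GB characters."""
    units, i = [], 0
    while i < len(data):
        step = 2 if data[i] & 0x80 else 1   # MSB set: GB 2312 character
        units.append(data[i:i + step])
        i += step
    return units
```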

   GB-12345 is the traditional form of GB-2312.  The charset name given
   to this set is CN-GB-12345-90.

   There are also dependent character sets that can only be used
   together with one of the above sets.  For example, GB 8565 can only
   be used with GB 2312 or GB 12345.  In this case, "+" is permitted to
   appear in the charset name, e.g. CN-GB-2312-80+GB-8565-88.

   Like CN-GB, CN-GB-12345-90 and CN-GB-2312-80+GB-8565-88 also support
   ASCII; the MSB of each byte of a Chinese character should be set to
   1 to distinguish it from ASCII.

     Note: There are some supplementary character sets in GB, i.e. GB
     7589-87, GB 7590-87, GB 13131-91 and GB 13132-91.  Normally, they
     are not used independently of GB-2312 or GB-12345, so they do not
     necessarily need to be registered.  Characters in these standards
     can be supported with ISO-2022-CN and ISO-2022-CN-EXT.  If, in the
     future, they do need to be used with "charset" names in some
     cases, it is the responsibility of any interested third party (the
     standardization organization itself or anybody else) to write the
     necessary documents and do the IANA registration for them.  It is
     strongly encouraged that their charset names also take the form
     CN-GB-<number>-<edition>, as in CN-GB-12345-90.  Here, <number> is
     the GB standard number, and <edition> is the year of the edition,
     represented by its last two digits.  They should be coded in 8-bit
     form, as CN-GB is.

   To avoid hindering interoperability, the use of CN-GB is encouraged
   whenever possible.

 2.2   CN-Big5

   Big5 is a character set of traditional Chinese characters, widely
   used in Taiwan and overseas.  E-mail using Big5 characters is
   sent in this way:

   Big5 characters are used with ASCII characters.

   Big5 is a two-byte coding.  The MSB (bit-8) of the first byte of a
   Big5 character is always set to 1, so the first byte is an 8-bit
   byte; the second byte may have its MSB set to either 0 or 1.  Any
   byte that is not part of a Big5 character is interpreted as ASCII.
   (Big5 uses the code space: [0xa1-0xfe,0x40-0x7e] and
   [0xa1-0xfe,0xa1-0xfe], and two other user areas with the first byte
   in the range of [0x81-0xa0].)
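
   The code space given above can be expressed as a small scanner.
   This is a sketch only: the function names are hypothetical, and the
   two user areas (first byte in 0x81-0xa0) are rejected rather than
   handled.

```python
def is_big5_lead(b: int) -> bool:
    """First byte of a standard Big5 character (user areas excluded)."""
    return 0xA1 <= b <= 0xFE


def is_big5_trail(b: int) -> bool:
    """Second byte: either the 7-bit range or the 8-bit range."""
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def split_cn_big5(data: bytes):
    """Split CN-Big5 data into ASCII bytes and two-byte Big5 characters."""
    units, i = [], 0
    while i < len(data):
        if is_big5_lead(data[i]):
            if i + 1 >= len(data) or not is_big5_trail(data[i + 1]):
                raise ValueError("invalid Big5 second byte")
            units.append(data[i:i + 2])
            i += 2
        elif data[i] < 0x80:
            units.append(data[i:i + 1])   # plain ASCII
            i += 1
        else:
            raise ValueError("first byte outside the standard Big5 range")
    return units
```

   Because the second byte may fall in the ASCII range, a scanner has
   to walk the data from the start; a byte such as 0x40 cannot be
   classified without knowing whether it follows a first byte.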

   To use this character scheme with MIME, CN-Big5 is used as the value
   for the charset parameter:

      Content-Type: text/plain; charset=cn-big5


3. Universal Multilingual Character Set:  ISO/IEC-10646/Unicode

   ISO/IEC 10646's BMP (code-by-code identical to Unicode) contains a
   large repertoire of Chinese characters (it currently includes all
   the characters of GB 2312-80, GB 12345-90, GB 8565-89, CNS 11643's
   planes 1 and 2, and part of some other standards) and therefore can
   be used to transport Chinese characters in the Internet community.
   This document does not give any details on how to do this, as this
   has been done elsewhere.  For details of using Unicode with MIME,
   refer to RFC 1641 [RFC-1641] and RFC 1642 [RFC-1642].  For assigned
   names for 10646 sets, refer to STD 2, "Assigned Numbers", which is
   currently RFC 1700 [RFC-1700].  For more up-to-date assigned
   numbers, please check:

    ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets



A New MIME parameter -- "charset-variant"

   A new MIME parameter, "charset-variant", is defined as follows:

   This parameter is used after the MIME "charset" parameter, mainly in
   the form <variant>-<version> or an extension based on this form,
   in which <variant> is the product name and <version> indicates its
   version number.  It is case-insensitive and optional, and any value
   of this parameter should be registered with IANA.

   For example:
    Content-Type: text/plain; charset=CN-Big5; charset-variant=ETen-2.00.03-DOS

   This may indicate ETen's variant of Big5: ETen 2.00.03 for DOS.
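
   A receiving implementation could read the parameter with an
   ordinary MIME header parser.  The sketch below uses Python's
   standard "email" package, which treats "charset-variant" like any
   other Content-Type parameter; the message itself is a made-up
   example.

```python
from email import message_from_string

raw = (
    "Content-Type: text/plain; charset=CN-Big5; "
    "charset-variant=ETen-2.00.03-DOS\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)

charset = msg.get_content_charset()          # lower-cased by the library
variant = msg.get_param("charset-variant")   # None when the parameter is absent
```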

   The reason for defining this parameter is that some implementations
   may want to check for variants so that they can handle each one
   slightly differently and interoperate better.  Although some
   features of a variant may cause interoperability problems, variants
   will continue to exist.  Moreover, a variant may be so popular that
   it becomes a de facto industry standard, so indicating its name can
   help communication implementations handle its messages.



Background Information

1. Writing systems and their encodings in Chinese-speaking nations and regions

   The mainland provinces of China use simplified Chinese characters in
   daily life.  GB is the standard electronic character set.  It is the
   main means of communication between people around the world who
   share simplified Chinese characters.

   Taiwan uses traditional Chinese characters in daily life.  CNS-11643
   is the formal character set for information interchange in Taiwan;
   however, Big5, a widely-used character set of traditional Chinese
   characters, is the de-facto industrial standard in Taiwan.

   Hong Kong uses traditional Chinese characters in daily life, but uses
   both GB and Big5 in electronic form, because Hong Kong people often
   communicate with people in all of China's provinces.

   Singapore seldom uses Chinese characters, and uses the simplified
   form when Chinese characters are used.  In electronic form, Unicode
   is more popular, however GB is also used.

2. Miscellaneous information about Chinese character sets

   The GB 1988-80 character set is identical to ISO 646 [ISO-646] except
   for the currency symbol and the tilde, which are replaced by the Yuan
   sign and the overline.  This set is GB's variant of ISO 646.  This
   character set and CNS 5205 [CNS-5205] are not encouraged for use in
   the Internet, since ASCII combined with GB 2312 or CNS 11643-plane 1
   and plane 2 comprises all characters in them.

   The GB 2312-80 character set consists of simplified Chinese
   characters, digits, Latin, Greek and Russian alphabets, and some
   other symbols; in all, 7445 characters.  Each character is represented
   with two bytes.

   GB 13000-95 [GB-13000] is the GB's variant of ISO 10646.  However, for
   interoperability in the Internet, assigned names for ISO 10646 are
   encouraged to be used.

3. Miscellaneous implementation information

   For maximum interoperability, implementations SHOULD at least support
   sending and receiving ISO-2022-CN.  Supporting all registered character
   sets in ISO-2022-CN-EXT is greatly encouraged.

   It is also essential to be able to support CN-GB (the status quo for
   simplified Chinese e-mail) and CN-Big5 (the status quo for traditional
   Chinese e-mail).  However, sending ISO-2022-CN messages is encouraged
   whenever possible.

   To the maximum extent possible, implementations should be capable of
   receiving messages in any of the encodings introduced in this document,
   even if they only transmit messages in one form.  Preferably the
   implementation should display the characters with glyphs appropriate
   to the typographic tradition that is implied in the encoding of the
   received text.  Implementations may also translate these encodings
   to an encoding that their platform supports.

   The human user (not implementor) should try to keep lines within 80
   display columns, or, preferably, within 75 (or so) columns, to allow
   insertion of ">" at the beginning of each line in excerpts. Each
   Chinese character takes up two columns, and the shift sequences do
   not take up any columns. The implementor is reminded that Chinese
   characters take up two bytes and should not be split in the middle to
   break lines for displaying, etc.
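
   The column rule can be sketched as a function that counts display
   columns in one ISO-2022-CN or ISO-2022-CN-EXT line.  The name
   display_columns is hypothetical, and the sketch assumes a
   well-formed line: one column per ASCII character, two per Chinese
   character, and none for designations or shifts.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def display_columns(line: bytes) -> int:
    """Count display columns of one line (without CRLF)."""
    cols, i, shifted = 0, 0, False
    while i < len(line):
        b = line[i]
        if b == ESC:
            nxt = line[i + 1:i + 2]
            if nxt == b"$":
                i += 4            # designation ESC $ <I> <F>: no columns
            elif nxt in (b"N", b"O"):
                cols += 2         # SS2/SS3 plus one two-byte character
                i += 4
            else:
                raise ValueError("unexpected escape sequence")
        elif b == SO:
            shifted, i = True, i + 1
        elif b == SI:
            shifted, i = False, i + 1
        elif shifted:
            cols, i = cols + 2, i + 2   # one two-byte Chinese character
        else:
            cols, i = cols + 1, i + 1   # one ASCII character
    return cols
```

   A mail composer could compare the result against 75 before
   deciding where to break a line, always breaking between characters
   and never between the two bytes of one character.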

   Freely available fonts of Chinese characters:

     Beijing:
       ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/fonts/
     Taiwan:
       ftp://ftp.edu.tw/Chinese/ifcss/software/fonts/
       ftp://ftp.ntu.edu.tw/Chinese/ifcss/software/fonts/
     Hong Kong:
       ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/fonts/
     Singapore:
       ftp://ftp.technet.sg:/pub/chinese/fonts/
     US:
       ftp://ftp.ifcss.org/pub/software/fonts/
       http://ccic.ifcss.org/www/pub/software/fonts/



X.400 Considerations

   X.400 has the ability to carry different character sets in a
   message by using the body part "GeneralText" defined by
   ISO/IEC 10021-7 [ISO-10021].

   The X.400 ASN.1 definition of the GeneralText body part is:

      general-text-body-part EXTENDED-BODY-PART-TYPE
        PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-text
        DATA       GeneralTextData
        ::= id-et-general-text

      GeneralTextParameters ::= SET OF CharacterSetRegistration

      CharacterSetRegistration ::= INTEGER (1..32767)

      GeneralTextData ::= GeneralString

   Therefore, to use ISO-2022-CN, set the "CharacterSetRegistration"
   part as { 6 58 171 172 }, and add an ESC sequence of ESC ( B (three bytes,
   hexadecimal values: 1B 28 42) before the beginning of ISO-2022-CN text.

   Similarly, to use ISO-2022-CN-EXT, list the registered numbers of
   all character sets used in the "CharacterSetRegistration" part and
   add ESC ( B at the beginning.  For the registered numbers, please
   refer to the ISO registry [ISOREG].  In addition to the character
   sets supported by ISO-2022-CN, the currently registered numbers are:

     GB 2312+GB 8565:   165
     CNS 11643-plane 3: 183
     CNS 11643-plane 4: 184
     CNS 11643-plane 5: 185
     CNS 11643-plane 6: 186
     CNS 11643-plane 7: 187

   176 is the registered number for the BASESET of ISO/IEC 10646-1:1993
   UCS-2 with implementation level 3.  The escape sequence ESC % / E
   (four bytes, hexadecimal values 1B 25 2F 45) indicates the start of
   this codeset.

   For the CN-GB and CN-Big5 character sets, there is currently no
   formal method that can be used in X.400.

   For details about X.400 use of character sets, please refer to
   RFC 1502 [RFC-1502].



Formal Syntax of ISO-2022-CN and ISO-2022-CN-EXT

   The notational conventions used here are identical to those used in
   RFC 822.

1.  Formal Syntax of ISO-2022-CN

   body  ::= * ( ascii_line / c_line )

   ascii_line  ::= *char CRLF

   c_line ::= *char 1*(1*designation 1*(*char 1*c_text *char)) CRLF

   designation  ::= SOdesignation / SS2designation

   SOdesignation  ::= ESC "$" ")" finalchar_for_SO

   SS2designation  ::= ESC "$" "*" finalchar_for_SS2

   finalchar_for_SO  ::= "A" / "G"

   finalchar_for_SS2  ::= "H"

   c_text  ::= 1* ( SO-SI-segment / SS2segment )

   SO-SI-segment ::= SO 1*c_char *designation *( c_segment / SO-segment ) SI

   c_segment  ::= 1* ( c_char / SS2segment )

   SO-segment ::= SO 1*c_char

   SS2segment  ::= SS2 c_char

   c_char  ::= one_of_94  one_of_94

                                                     ; ( Octal, Decimal.)

   ESC             ::= <ISO-646 ESC, escape>         ; ( 33, 27.)

   SI              ::= <ASCII SI, shift in>          ; ( 17, 15.)

   SO              ::= <ASCII SO, shift out>         ; ( 16, 14.)

   SS2             ::= <ISO 2022 Single_shift two>   ; ( 33 116, 27 78.)

   SS3             ::= <ISO 2022 Single_shift three> ; ( 33 117, 27 79.)

   one_of_94       ::= <any char in 94_char_set>     ; (41-176, 33-126.)

   char            ::= <any char in 96_char_set>     ; (40-177, 32-127.)


2.  Formal Syntax of ISO-2022-CN-EXT


   body  ::= * ( ascii_line / c_line )

   ascii_line  ::= *char CRLF

   c_line ::= *char 1*(1*designation 1*(*char 1*c_text *char)) CRLF

   designation  ::= SOdesignation / SS2designation / SS3designation

   SOdesignation  ::= ESC "$" ")" finalchar_for_SO

   SS2designation  ::= ESC "$" "*" finalchar_for_SS2

   SS3designation  ::= ESC "$" "+" finalchar_for_SS3

   finalchar_for_SO  ::= "A" / <X12345> / "G" / "E"

   finalchar_for_SS2  ::= <X7589> / <X13131> / "H"

   finalchar_for_SS3  ::= <X7590> / <X13132> / "I" / "J" / "K" / "L" / "M"

   c_text  ::= 1* ( SO-SI-segment / SS2segment / SS3segment )

   SO-SI-segment ::= SO 1*c_char *designation *( c_segment / SO-segment ) SI

   c_segment  ::= 1* ( c_char / SS2segment / SS3segment )

   SO-segment ::= SO 1*c_char

   SS2segment  ::= SS2 c_char

   SS3segment  ::= SS3 c_char

   c_char  ::= one_of_94  one_of_94

                                                     ; ( Octal, Decimal.)

   ESC             ::= <ISO-646 ESC, escape>         ; ( 33, 27.)

   SI              ::= <ASCII SI, shift in>          ; ( 17, 15.)

   SO              ::= <ASCII SO, shift out>         ; ( 16, 14.)

   SS2             ::= <ISO 2022 Single_shift two>   ; ( 33 116, 27 78.)

   SS3             ::= <ISO 2022 Single_shift three> ; ( 33 117, 27 79.)

   one_of_94       ::= <any char in 94_char_set>     ; (41-176, 33-126.)

   char            ::= <any char in 96_char_set>     ; (40-177, 32-127.)



Registration of New "charset"s and New MIME parameter

1. This document defines the following MIME "charset" names for Chinese
   text:

       ISO-2022-CN, ISO-2022-CN-EXT
       CN-GB, CN-Big5
       CN-GB-12345-90
       CN-GB-2312-80+GB-8565-88

2.  This document defines a new MIME parameter:

       charset-variant


Acknowledgments

   This document is the result of cooperation in the APNG-CC, the
   Chinese Character sub-working group of the I18N/L10N
   (Internationalization and Localization) working group of APNG
   (Asia-Pacific Networking Group), coordinator Zhu Haifeng
   <zhf@net.tsinghua.edu.cn>.  The membership of APNG-CC consists
   of individuals from both sides of the Taiwan Strait, HongKong,
   and from Singapore and other countries.  The authors wish to
   thank all members of APNG-CC.

   Prof. Yao Shiquan and Ms. Lin Ning of CITS (China Information
   Technology Standardization Technical Committee), and Prof. Zhao
   Jingrong, Prof. Li Xing, and Mr. YouYue of Tsinghua University gave
   much help in the course of the work.

   Many thanks to Mr. C.J.Cherng and Mr. C.K.Fan of III (Institute for
   Information Industry), and Mr. Chang JingShin from Tsinghua University
   in Hsinchu, Taiwan.

   In particular, Mr. Masataka Ohta, who is the coordinator of
   APNG-I18N, has contributed a great deal to the work since the
   beginning of APNG-CC.

   The authors also wish to thank the following people who contributed
   in many ways towards this draft.

     Martin J. Duerst           Kenichi Handa
     Zhang Ling                 Zhang ZhouCai
     Zhu Bin                    Nelson Chin
     Lu Chin                    Ding ZyKaan
     Chen Shuyi                 Mao Yonggang
     Ken Lunde                  Lua Kim Teng
     Victor Cheng               Stephen G. Simpson
     Yuan Jiang                 Liu HuiFang
     Harald T. Alvestrand       Feng Hui


Security Considerations

   Security issues are not discussed in this memo.


Authors' Addresses


   Zhu,Hai-feng  (HF. Zhu)
   Dept. of Computer Science & Technology
   Tsinghua University
   Beijing, 100084
   China

   Tel: +86-1-2561144 ext. 3492
   Fax: +86-1-2564173
   Email: zhf@net.edu.cn, zhf@net.tsinghua.edu.cn


   Hu,Dao-yuan  (DY. Hu)
   Tsinghua Networking Center
   Tsinghua University
   Beijing, 100084
   China

   Tel: +86-1-2594016
   Fax: +86-1-2564173
   Email: hdy@tsinghua.edu.cn

   Wang,Zhi-guan  (ZG. Wang)
   Subcommittee 2 (SC2)
   China Information Technology Standardization Technical Committee
   (CITS)
   Beijing, 100083
   China

   Tel: +86-1-4012392
   Fax: +86-1-4010601


   Kao,Tien-cheu (TC. Kao)
   I.T. Promotion Division
   Institute for Information Industry(III)
   Taipei
   Taiwan

   Tel: +886-2-5631688
   Fax: +886-2-563-4209
   Email: tckao@iiidns.iii.org.tw


   Chang,Wen-chung  (WC. Chang)
   Institute for Information Industry(III)
   Taipei
   Taiwan

   Tel: +886-2-7327771
   Fax: +886-2-7370188
   Email: chung@iiidns.iii.org.tw


   Mark R. Crispin
   Networks and Distributed Computing
   University of Washington
   4545 15th Avenue NE
   Seattle, WA  98105-4527
   USA

   Tel: +1 (206) 543-5762
   Fax: +1 (206) 685-4045
   Email: MRC@CAC.Washington.EDU


Appendix -- Conversion Table for CNS-11643 and Big5

  This is a conversion table for the Chinese characters in Big5 and
  CNS-11643, including some characters specific to the ETen variant of
  Big5.  Note that this list contains only Chinese characters; symbols
  are not included.  For a more complete table, please refer to [CJK]
  or the ftp sites listed in section 1.4, where conversion programs are
  available.

 1. Big5 Level 1 correspondence to CNS 11643-1992 Plane 1:

  0xA440-0xACFD <-> 0x4421-0x5322  # Level 1 Chinese start
         0xACFE <-> 0x5753
  0xAD40-0xAFCF <-> 0x5323-0x5752
  0xAFD0-0xBBC7 <-> 0x5754-0x6B4F
  0xBBC8-0xBE51 <-> 0x6B51-0x6F5B
         0xBE52 <-> 0x6B50
  0xBE53-0xC1AA <-> 0x6F5C-0x7534
  0xC1AB-0xC2CA <-> 0x7536-0x7736
         0xC2CB <-> 0x7535
  0xC2CC-0xC360 <-> 0x7737-0x782C
  0xC361-0xC3B8 <-> 0x782E-0x7863
         0xC3B9 <-> 0x7865
         0xC3BA <-> 0x7864
  0xC3BB-0xC455 <-> 0x7866-0x7961
         0xC456 <-> 0x782D
  0xC457-0xC67E <-> 0x7962-0x7D4B  # Level 1 Chinese end
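
  As an illustration only, the ranges in table 1 can be applied with a
  small lookup routine.  The Python sketch below is not part of this
  specification, and its function and variable names are hypothetical.
  It assumes that Big5 trail octets run over 0x40-0x7E and 0xA1-0xFE
  (157 positions per lead octet) and that CNS 11643 uses a 94x94 grid
  over 0x21-0x7E, so that each range maps position by position.

```python
# Sketch only: names are hypothetical and not defined by this draft.
# Rows of table 1, as (first Big5 code, last Big5 code, first CNS code).
LEVEL1_RANGES = [
    (0xA440, 0xACFD, 0x4421), (0xACFE, 0xACFE, 0x5753),
    (0xAD40, 0xAFCF, 0x5323), (0xAFD0, 0xBBC7, 0x5754),
    (0xBBC8, 0xBE51, 0x6B51), (0xBE52, 0xBE52, 0x6B50),
    (0xBE53, 0xC1AA, 0x6F5C), (0xC1AB, 0xC2CA, 0x7536),
    (0xC2CB, 0xC2CB, 0x7535), (0xC2CC, 0xC360, 0x7737),
    (0xC361, 0xC3B8, 0x782E), (0xC3B9, 0xC3B9, 0x7865),
    (0xC3BA, 0xC3BA, 0x7864), (0xC3BB, 0xC455, 0x7866),
    (0xC456, 0xC456, 0x782D), (0xC457, 0xC67E, 0x7962),
]


def big5_index(code):
    """Linear position of a Big5 code (assumed trail-octet layout)."""
    lead, trail = code >> 8, code & 0xFF
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63  # 63 positions precede 0xA1
    else:
        raise ValueError("invalid Big5 trail octet: 0x%02X" % trail)
    return (lead - 0xA1) * 157 + col


def cns_index(code):
    """Linear position of a CNS 11643 code within its 94x94 plane."""
    return ((code >> 8) - 0x21) * 94 + ((code & 0xFF) - 0x21)


def cns_from_index(index):
    """Inverse of cns_index."""
    row, col = divmod(index, 94)
    return ((row + 0x21) << 8) | (col + 0x21)


def big5_level1_to_cns_plane1(code):
    """Return the Plane 1 code for a Big5 Level 1 code, or None."""
    for first, last, cns_first in LEVEL1_RANGES:
        if first <= code <= last:
            offset = big5_index(code) - big5_index(first)
            return cns_from_index(cns_index(cns_first) + offset)
    return None
```

  Under these assumptions, big5_level1_to_cns_plane1(0xA440) yields
  0x4421 and big5_level1_to_cns_plane1(0xACFE) yields 0x5753, matching
  the first two rows of the table.  Codes outside Level 1 return None.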

 2. Big5 Level 2 correspondence to CNS 11643-1992 Plane 2:

  0xC940-0xC949 <-> 0x2121-0x212A
         0xC94A  -> 0x4442         # duplicate of 0xA461
  0xC94B-0xC96B <-> 0x212B-0x214B
  0xC96C-0xC9BD <-> 0x214D-0x217C
         0xC9BE <-> 0x214C
  0xC9BF-0xC9EC <-> 0x217D-0x224C
  0xC9ED-0xCAF6 <-> 0x224E-0x2438
         0xCAF7 <-> 0x224D
  0xCAF8-0xD6CB <-> 0x2439-0x376E
         0xD6CC <-> 0x3E63
  0xD6CD-0xD779 <-> 0x3770-0x387D
         0xD77A <-> 0x3F6A
  0xD77B-0xDADE <-> 0x387E-0x3E62
         0xDADF <-> 0x376F
  0xDAE0-0xDBA6 <-> 0x3E64-0x3F69
  0xDBA7-0xDDFB <-> 0x3F6B-0x4423
         0xDDFC  -> 0x4176         # duplicate of 0xDCD1
  0xDDFD-0xE8A2 <-> 0x4424-0x554A
  0xE8A3-0xE975 <-> 0x554C-0x5721
  0xE976-0xEB5A <-> 0x5723-0x5A27
  0xEB5B-0xEBF0 <-> 0x5A29-0x5B3E
         0xEBF1 <-> 0x554B
  0xEBF2-0xECDD <-> 0x5B3F-0x5C69
         0xECDE <-> 0x5722
  0xECDF-0xEDA9 <-> 0x5C6A-0x5D73
  0xEDAA-0xEEEA <-> 0x5D75-0x6038
         0xEEEB <-> 0x642F
  0xEEEC-0xF055 <-> 0x6039-0x6242
         0xF056 <-> 0x5D74
  0xF057-0xF0CA <-> 0x6243-0x6336
         0xF0CB <-> 0x5A28
  0xF0CC-0xF162 <-> 0x6337-0x642E
  0xF163-0xF16A <-> 0x6430-0x6437
         0xF16B <-> 0x6761
  0xF16C-0xF267 <-> 0x6438-0x6572
         0xF268 <-> 0x6934
  0xF269-0xF2C2 <-> 0x6573-0x664C
  0xF2C3-0xF374 <-> 0x664E-0x6760
  0xF375-0xF465 <-> 0x6762-0x6933
  0xF466-0xF4B4 <-> 0x6935-0x6961
         0xF4B5 <-> 0x664D
  0xF4B6-0xF4FC <-> 0x6962-0x6A4A
  0xF4FD-0xF662 <-> 0x6A4C-0x6C51
         0xF663 <-> 0x6A4B
  0xF664-0xF976 <-> 0x6C52-0x7165
  0xF977-0xF9C3 <-> 0x7167-0x7233
         0xF9C4 <-> 0x7166
         0xF9C5 <-> 0x7234
         0xF9C6 <-> 0x7240
  0xF9C7-0xF9D1 <-> 0x7235-0x723F
  0xF9D2-0xF9D5 <-> 0x7241-0x7244
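
  The two rows marked "->" in table 2 are one-way: each Big5 code is
  noted as a duplicate of another Big5 code, so the CNS code can map
  back to only one of the pair.  The Python sketch below shows one way
  an implementation might keep the two directions apart.  It is not
  part of this specification and its names are hypothetical.  The rows
  for 0xA461 and 0xDCD1 are not listed individually in the tables; they
  are derived here from the enclosing ranges, assuming position-by-
  position mapping within a range, with 0x4442 taken to lie in Plane 1.

```python
# Sketch only: names are hypothetical and not defined by this draft.
# Each row is (Big5 code, CNS plane, CNS code, two_way).
ROWS = [
    (0xA461, 1, 0x4442, True),   # derived from range 0xA440-0xACFD
    (0xC94A, 1, 0x4442, False),  # "->" row: duplicate of 0xA461
    (0xDCD1, 2, 0x4176, True),   # derived from range 0xDBA7-0xDDFB
    (0xDDFC, 2, 0x4176, False),  # "->" row: duplicate of 0xDCD1
]

# Big5 -> CNS uses every row, including the one-way ones.
to_cns = {big5: (plane, cns) for big5, plane, cns, _ in ROWS}

# CNS -> Big5 uses only the two-way rows, so each CNS code maps back
# to a single Big5 code.
to_big5 = {(plane, cns): big5
           for big5, plane, cns, two_way in ROWS if two_way}
```

  With these tables, both 0xA461 and 0xC94A convert to the same CNS
  code, while the reverse lookup returns 0xA461 only.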

 3. Big5 Level 2 correspondence to CNS 11643-1992 Plane 3:

         0xF9D6 <-> 0x4337         # ETen-specific Chinese
         0xF9D7 <-> 0x4F50         # ETen-specific Chinese
         0xF9D8 <-> 0x444E         # ETen-specific Chinese
         0xF9D9 <-> 0x504A         # ETen-specific Chinese
         0xF9DA <-> 0x2C5D         # ETen-specific Chinese
         0xF9DB <-> 0x3D7E         # ETen-specific Chinese
         0xF9DC <-> 0x4B5C         # ETen-specific Chinese


References

   [ASCII] American National Standards Institute, "Coded character set
   -- 7-bit American National Standard Code for Information
   Interchange", ANSI X3.4-1986.

   [BIG5] Institute for Information Industry, "Chinese Coded
   Character Set in Computer", March 1984.

   [CJK] Ken Lunde, On-line documentation of Chinese/Japanese/Korean
   Information Processing, 1995, available at:
   ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf

   [CNS-5205] "Information processing -- 7-Bit Coded Character Set For
   Information Interchange", CNS-5205.

   [CNS-11643] "Chinese Standard Interchange Code", CNS-11643 version
   1992; "Standard Interchange Code for Generally-Used Chinese
   Characters", CNS 11643 version 1986.

   [GB-1988] "7-bit Coding Character Set for Information Interchange",
   GB 1988-80.

   [GB-2312] "Coding of Chinese Ideogram Set for Information Interchange
   Basic Set",  GB 2312-80.

   [GB-7589] "Code of Chinese Ideograms Set for Information Interchange,
   the 2nd Supplementary Set", UDC 681.3.048, GB 7589-87.

   [GB-7590] "Code of Chinese Ideogram Set for Information Interchange,
   the 4th Supplementary Set", UDC 681.3.048, GB 7590-87.

   [GB-8565] "Information Processing Coded Character Sets for Text
   Communication", UDC 681.3,  GB 8565-88.

   [GB-12345] "Code of Chinese Ideogram Set for Information Interchange
   Supplementary Set", GB/T 12345-90.

   [GB-13000] "Information technology -- Universal Multiple-Octet Coded
   Character Set (UCS) -- Part 1: Architecture and Basic Multilingual
   Plane", GB 13000.1.

   [GB-13131] "Code of Chinese Ideogram Set for Information Interchange,
   the 3rd Supplementary Set", GB 13131-91.

   [GB-13132] "Code of Chinese Ideogram Set for Information Interchange,
   the 5th Supplementary Set", GB 13132-91.

   [ISO-646] International Organization for Standardization (ISO),
   "Information technology -- ISO 7-bit coded character set for
   information interchange", International Standard, Ref. No. ISO/IEC
   646:1991.

   [ISO-2022] International Organization for Standardization (ISO),
   "Information processing -- ISO 7-bit and 8-bit coded character sets
   -- Code extension techniques", International Standard, Ref. No. ISO
   2022-1986 (E).

   [ISO-10021] Information Technology - Text communication -
   Message-Oriented Text Interchange Systems (MOTIS), ISO 10021,
   October 1988.

   [ISO-10646] ISO/IEC 10646-1:1993(E), "Information Technology --
   Universal Multiple-octet Coded Character Set (UCS) -- Part 1:
   Architecture and Basic Multilingual Plane".

   [ISOREG] International Organization for Standardization (ISO),
   "International Register of Coded Character Sets To Be Used With
   Escape Sequences".

   [MIME-1] Borenstein, N., and Freed, N., "MIME (Multipurpose Internet
   Mail Extensions) Part One: Mechanisms for Specifying and Describing
   the Format of Internet Message Bodies", RFC 1521, Bellcore, Innosoft,
   September 1993.

   [MIME-2] Moore, K., "MIME (Multipurpose Internet Mail Extensions)
   Part Two: Message Header Extensions for Non-ASCII Text", RFC 1522,
   University of Tennessee, September 1993.

   [RFC-822] Crocker, D., "Standard for the Format of ARPA Internet Text
   Messages", STD 11, RFC 822, University of Delaware, August 1982.

   [RFC-1036] Horton M., and Adams, R., "Standard for Interchange of
   USENET Messages", RFC 1036, AT&T Bell Laboratories, Center for
   Seismic Studies, December 1987.

   [RFC-1468] Murai J., Crispin M., and van der Poel, E., "Japanese
   Character Encoding for Internet Messages", RFC 1468, June 1993.

   [RFC-1557] Choi U., Chon K., and Park H., "Korean Character Encoding
   for Internet Messages", RFC 1557, December 1993.

   [RFC-1641] Goldsmith D., and Davis M., "Using Unicode with MIME", RFC
   1641, Taligent Inc., July 1994.

   [RFC-1642] Goldsmith D., and Davis M., "UTF-7, A Mail-Safe
   Transformation Format of Unicode", RFC 1642, July 1994.

   [RFC-1700] Reynolds J., and Postel J., "Assigned Numbers", STD 2,
   RFC 1700, ISI, October 1994.

   [SMTP] Postel, Jonathan B. "Simple Mail Transfer Protocol", STD 10,
   RFC 821, USC/Information Sciences Institute, August 1982.

   [SMTPEXT] Klensin, J.; Freed, N.; Rose, M.; Stefferud, E.; and
   Crocker, D., "SMTP Service Extensions", RFC 1651, July 1994.

   [Unicode 1.1] "The Unicode Standard, Version 1.1",
   Addison-Wesley, Reading, MA (to be published; the contents
   of this standard are currently available by combining
   [Unicode92], [Unicode93], and [Unicode4]).

   [Unicode92] The Unicode Consortium, "The Unicode Standard -
   Worldwide Character Encoding - Version 1.0", Volume 1,
   Addison-Wesley, Reading, MA, 1992 (ISBN 0-201-56788-1).

   [Unicode93] The Unicode Consortium, "The Unicode Standard -
   Worldwide Character Encoding - Version 1.0", Volume 2,
   Addison-Wesley, Reading, MA, 1992 (ISBN 0-201-60845-6).

   [Unicode4] The Unicode Consortium, "The Unicode Standard -
   Version 1.1 (Prepublication Edition)", Unicode Technical
   Report #4 (available from the Unicode Consortium).

   This is an older version of an Internet-Draft that was ultimately
   published as RFC 1922.