draft-jseng-idn-admin-01

INTERNET DRAFT                                     Editors:  James SENG
draft-jseng-idn-admin-01.txt                               John KLENSIN
18th Oct 2002                                      Authors:  K. KONISHI
Expires 18th April 2003                        K. HUANG, H. QIAN, Y. KO

     Internationalized Domain Names Registration and Administration
               Guideline for Chinese, Japanese and Korean

Status of this Memo

     This document is an Internet-Draft and is in full conformance
     with all provisions of Section 10 of RFC2026 except that the
     right to produce derivative works is not granted.

    Internet-Drafts are working documents of the Internet
    Engineering Task Force (IETF), its areas, and its working
    groups. Note that other groups may also distribute working
    documents as Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of
    six months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use Internet-
    Drafts as reference material or to cite them other than as
    "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

Abstract

Achieving internationalized access to domain names raises many complex
issues.  These include not only issues associated with basic protocol
design (i.e., how the names are represented on the network, compared,
and converted to appropriate forms) but also issues and options for
deployment, transition, registration and administration.

The IETF IDN working group focused on the development of a standards
track specification for access to domain names in a broader range of
scripts than the original ASCII.  It became clear during its efforts
that there was great potential for confusion, and difficulties in
deployment and transition, due to characters with similar appearances
or interpretations and that those issues could best be addressed
administratively, rather than through restrictions embedded in the
protocols.

This document provides guidelines for zone administrators (including
but not limited to registry operators and registrars), and information
for all domain name holders, on the administration of those domain
names which contain characters drawn from Chinese, Japanese and Korean
scripts (CJK).  Other language groups are encouraged to develop their
own guidelines as needed, based on these guidelines if that is helpful.

Comments on this document can be sent to the authors at
idn-admin@jdna.jp.

Table of Contents

0. Pre-Note for ASCII-version of this document

1. Introduction

2. Definitions

3. Administrative Framework
3.1. Principles underlying these Guidelines
3.2. Registration of IDL
3.2.1. Language character variant table
3.2.2. Formal syntax
3.2.3. Registration Algorithm
3.3. Deletion and Transfer of IDL and IDL Package
3.4. Activation and De-activation of IDL variants
3.5. Adding/Deleting language(s) association
3.6. Versioning of the language character variant tables

4. Example of Guideline Adoption

i. Notes

ii. Acknowledgements

iii. Authors

iv. Appendix A

v. Normative References

vi. Non-normative References

vii. Other Issues



0. Pre-Note for ASCII-version of this document

In order to make meanings clear, especially in examples, Han ideographs
are used in several places in this document.  Of course, these
ideographs do not appear in the ASCII form of this document.  So, for
the convenience of readers of the ASCII format and of readers not
familiar with recognizing and distinguishing Chinese characters, each
use of a particular character is accompanied by both its Unicode
code point and an "asterisk tag" giving its corresponding Chinese
Romanization [ISO7098], with the tone mark represented by a number from
1 to 4.  Those tags have no meaning outside this document; they are
intended simply to provide a quick visual and reading reference to
facilitate the combinations and transformations of characters in the
guideline and table excerpts.  Appendix A provides the Romanization of
the ideographs in Japanese (ISO 3602) and Korean (ISO 11941).

1. Introduction

Defining and specifying protocols for Internationalized Domain Names
has been one of the most controversial tasks initiated by the IETF in
recent years.  Domain names are the fundamental naming architecture of
the Internet; many Internet protocols and applications rely on the
stability, continuity, and absence of ambiguity of the DNS.

The introduction of internationalized domain names (IDN) amplifies the
difficulty of putting names into identifiers and the confusion between
scripts and languages.  It impacts many Internet protocols and
applications and creates more complexity in technical administration
and services.

While the IETF IDN working group [IDN-WG] focused on the technical
problems of IDN, administrative guidelines are also important in order
to reduce unnecessary user confusion and domain name disputes among
domain name holders.

The IDN working group has completed working group last call for the
following Internet-Drafts:

1. Preparation of Internationalized Strings [STRINGPREP]
2. Internationalizing Host Names In Applications [IDNA]
3. Punycode version 0.3.3 [PUNYCODE]
4. A Stringprep Profile for Internationalized Domain Names [NAMEPREP]

These drafts specify that the intersystem protocols that make up the
domain name system infrastructure remain unchanged.  Instead, they
introduce internationalization (I18N) [Note1] in client software
(particularly via the IDNA protocol) using an ASCII Compatible Encoding
(ACE) known as Punycode.
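
As an illustration of this client-side conversion, the following
sketch round-trips one label through its ACE form.  It assumes
Python and its built-in "idna" codec, which follows the IDNA
specification as finally published (RFC 3490) and may differ in
detail from the drafts listed above; the sample label is chosen only
for this example.

```python
# Sketch: convert one label to its ASCII Compatible Encoding and back.
label = "\u98db\u6a5f"            # U+98DB U+6A5F *fei1 ji1*

ace = label.encode("idna")        # Nameprep, then Punycode, then ACE prefix
restored = ace.decode("idna")     # back to the Unicode form

print(ace)                        # pure ASCII, usable in an unchanged DNS
print(restored == label)
```

The DNS itself only ever sees the ASCII form; the Unicode form exists
only in the client software.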

The domain name protocols [STD13] also specify that characters are to
be interpreted so that upper and lower case Latin-based characters are
considered equivalent.  But with the introduction of Unicode characters
beyond US-ASCII, and the possibility to represent a single character in
multiple ways in ISO10646/Unicode [UNICODE], a normalization process,
known as Nameprep, has been proposed to handle the more complex
problems of character-matching for those additional characters.
Nameprep is also executed by client software as described in IDNA.

While Nameprep normalizes domain names so that the users have an
improved chance of getting the right domain name from information
provided in other forms, as required for I18N, Nameprep does not handle
any localization (L10N).
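
The difference can be seen in a small sketch.  It assumes Python's
encodings.idna.nameprep function as a stand-in for the Nameprep
profile; the three sample characters are chosen only for
illustration.  Case and compatibility forms are folded, but a
traditional/simplified pair is left untouched because that
equivalence is a matter of language, not of normalization.

```python
from encodings.idna import nameprep

# I18N: case folding and compatibility normalization are applied.
folded_case = nameprep("A")            # LATIN CAPITAL LETTER A
folded_compat = nameprep("\uf967")     # compatibility form of U+4E0D

# L10N: U+98DB and U+98DE (*fei1*) are variants only in Chinese, so
# Nameprep leaves U+98DB as it is.
untouched = nameprep("\u98db")

print(folded_case == "a")
print(folded_compat == "\u4e0d")
print(untouched == "\u98de")           # False: no language knowledge
```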

This becomes significant when a domain name holder attempts to use a
Unicode string forming a "name", "word", or "phrase" that may have
certain meaning in a certain language or when used as a domain name.
Such a Unicode string may have different variants in the context of the
language or culture.

Generally, these localized variants in CJK can be classified into four
categories, as described by Halpern et al. [C2C]: [Note2]

a. Character (or Code) variants

Character (or Code) variants refer to variants that are generated by
character-by-character (or code-by-code) substitution.

An example in English would be "A" or "a" (U+0041 or U+0061).
Two examples in Chinese would be U+98DB *fei1* or U+98DE *fei1*
and U+6A5F *ji1* or U+673A *ji1*.

Note that this does not mean the choice between U+6A5F and U+673A is
always symmetric like the one between "A" and "a" -- it is a choice only
for Chinese but not for Japanese.

The variant of a particular character may be simply to drop it.  For
example, point and vowel characters in Hebrew (U+05B0 to U+05C4) and
Arabic (U+064B to U+0652) are optional; the variants for strings
containing them are constructed by simply dropping those points and
vowels.

Code variants may also occur when different code points are assigned to
what visually or abstractly are the "same" character, possibly due
to compatibility issues, typeface differences or script range.  For
example, LATIN CAPITAL LETTER A (U+0041) normally has an appearance
identical to GREEK CAPITAL LETTER ALPHA (U+0391).  CJK scripts have font
variants for compatibility (either U+4E0D or U+F967 may be used) and
"zVariant" (e.g. U+5154 and U+514E).

The difficulty lies in defining which characters are the "same" and
which are not.

b. Orthographic variants

Orthographic variants refer to variants that are generated by word-by-
word substitution.

An example in English would be "color" and "colour".

It is possible for some of these orthographic variants to be generated
by character variants. For example "airplane" in Chinese may be either
U+98DB U+6A5F *fei1 ji1* or U+98DE U+673A *fei1 ji1*.

Other orthographic variants may not be generated by character variants.
For example, in Chinese, both U+767C *fa1* and U+9AEE *fa4*
are related to U+53D1 *fa1 or fa4* depending on the word. For hair,
U+5934 U+53D1 *tou2 fa4*, the variant should be U+982D U+9AEE
*tou2 fa4* but not U+982D U+767C *tou2 fa1*.

c. Lexemic variants

Lexemic variants refer to variants that can be generated when language
is considered, by word-by-word substitution.

An example in English would be cab, taxi, or taxicab.

An example in Chinese would be U+8CC7 U+8A0A *zi1 xun4* or
U+4FE1 U+606F *xin4 xi1*.

Note that there is no relationship between U+8CC7 and U+4FE1 or U+8A0A
and U+606F, i.e., the sequence U+8CC7 U+606F *zi1 xi1* does not
exist in Chinese.

d. Contextual variants

Contextual variants refer to variants that are generated by word-by-
word substitutions with context considered.

In English, the word "plane" has different meanings and could be
replaced by different equivalent words (synonyms) such as
"airplane" or "plane" (as in a flat surface or a device for smoothing
wood) depending on context.  And, of course, "plain", which is
pronounced the same way, and is indistinguishable in speech-to-text
contexts such as computer input systems for the visually impaired, is a
different word entirely.

Similarly, the word U+6587 U+4EF6 *wen2 jian4* could be either
document U+6587 U+4EF6 *wen2 jian4* or data file U+6A94 U+6848
*dang3 an4* depending on context.

Although domain names were designed to be identifiers without any
language context, users have not been prevented from using strings in
domain names and interpreting them as "words" or "names". It is likely
that users will do this with IDN as well. Therefore, given the added
complications of using a much broader range of characters, precautions
will be required when deploying IDN to minimize confusion and fraud.

The intention of these guidelines is to provide advice about the
deployment of IDNs, with language consideration, but focusing only on
the category of character variants to increase the possibility of
successful resolution and reduced confusion while accepting inherent
DNS limitations.

2. Definitions

Unless otherwise stated, the definitions of the terms used in this
document are consistent with "Terminology Used in Internationalization
in the IETF" [I18NTERMS].

"FQDN" refers to a fully-qualified domain name and "domain name label"
refers to a label of a FQDN.

RFC3066 [RFC3066] defines a system for coding and representing
languages.

ISO/IEC 10646 is a universal multiple-octet coded character set that is
a product of ISO/IEC JTC1/SC2/WG2, Work Item JTC1.02.18 (ISO/IEC 10646).
It is a multi-part standard: Part 1, published as ISO/IEC 10646-
1:2000(E) covering the Architecture and Basic Multilingual Plane; Part
2, published as ISO/IEC 10646-2:2001(E) covers the supplementary
(additional) planes.

The Unicode Consortium publishes "The Unicode Standard -- Version 3.0",
ISBN 0-201-61633-5. In March 2002, Unicode Consortium published Unicode
Standard Annex #28.  That annex defines Version 3.2 of The Unicode
Standard, which is fully synchronized with ISO/IEC 10646-1:2000 (with
Amendment 1).

The term "Unicode character" is used here to refer to characters chosen
from The Unicode Standard Version 3.2 (and hence from ISO/IEC 10646).
In this document, the characters are identified by their positions (or
"code points"). The notation U+12AB, for example, indicates the
character at the position 12AB (hexadecimal) in the Unicode 3.2 table.

Similarly, "Unicode string" refers to a string of Unicode characters.
The Unicode string is identified by the sequence of the Unicode
characters regardless of the encoding scheme.

The term "IDN" is often used to refer to many different things: (a) an
abbreviation for "Internationalized Domain Name"; (b) a fully-qualified
domain name that contains at least one label that contains characters
not appearing in ASCII; (c) a label of a domain name that contains at
least one character beyond ASCII; (d) a Unicode string to be processed
by Nameprep; (e) an IDN Package (in the context of this document); (f) a
Nameprep processed string; (g) a Nameprep and Punycode processed string;
(h) the IETF IDN Working Group; (i) the ICANN IDN Committee; (j) other
IDN activities in other companies/organizations, etc.

Because of the potential confusion, this document shall use the term
"IDN" as an abbreviation for "Internationalized Domain Name" only.

In addition, because this document provides a guideline to be applied
on a per-zone basis, one label at a time, the term "Internationalized
Domain Name Label" or "IDL" will be used instead.

In this document, the term "registration" refers to the process by
which a potential domain name holder requests that a label be placed in
the DNS, either as an individual name within a domain or as a sub-
domain delegation from another domain name holder. A successful
registration would then lead to the label or delegation records being
placed in the relevant zone file.  The guidelines presented here are
recommended for all zones, at any hierarchy level, in which CJK
characters are to appear, not just domains at the first or second level.

CJK characters are characters commonly used in the Chinese, Japanese or
Korean languages, including but not limited to ASCII (U+0020 to U+007F),
Han Ideograph (U+3400 to U+9FAF and U+20000 to U+2A6DF), Bopomofo
(U+3100 to U+312F and U+31A0 to U+31BF), Kana (U+3040 to U+30FF), Jamo
(U+1100 to U+11FF and U+3130 to U+318F), Hangul (U+AC00 to U+D7AF and
U+3130 to U+318F) and their respective compatibility forms.
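
The ranges above can be captured in a small lookup.  The following is
a minimal sketch in Python; the names CJK_RANGES and cjk_block are
hypothetical, and the list is deliberately incomplete because the
definition is "including but not limited to" these ranges and the
compatibility forms are not enumerated here.

```python
# Ranges copied from the definition of "CJK characters"; compatibility
# forms are not included.  U+3130 to U+318F is listed once, as Jamo.
CJK_RANGES = [
    (0x0020, 0x007F, "ASCII"),
    (0x3400, 0x9FAF, "Han Ideograph"),
    (0x20000, 0x2A6DF, "Han Ideograph"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x31A0, 0x31BF, "Bopomofo"),
    (0x3040, 0x30FF, "Kana"),
    (0x1100, 0x11FF, "Jamo"),
    (0x3130, 0x318F, "Jamo"),
    (0xAC00, 0xD7AF, "Hangul"),
]

def cjk_block(ch):
    """Return the name of the listed range containing ch, or None."""
    cp = ord(ch)
    for low, high, name in CJK_RANGES:
        if low <= cp <= high:
            return name
    return None

print(cjk_block("\u98db"))   # a Han ideograph used in this document
print(cjk_block("\u0391"))   # GREEK CAPITAL LETTER ALPHA is not listed
```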

3. Administrative Framework

Zone administrators are responsible for the administration of the
domain name labels under their control. A zone administrator might be
responsible for a large zone such as a Top Level Domain (TLD), generic
or country code, or a smaller one such as a typical second or third
level domain.  A large zone would often be more complex than a smaller
one (sometimes it is just larger).  However, normally, actual technical
administrative tasks -- such as addition, deletion, delegation and
transfer of zones between domain name holders -- are similar for all
zones.

At the same time, different zones may have different policies and
processes.  For example, a pay-per-domain policy and registry/registrar
model for .COM may not be applicable to such domains as .SG or .IBM.COM.
The latter, for example, has very restricted policies about who is
permitted to have a domain name label under IBM.COM and the types of
strings that are permitted, and different procedures for obtaining those
strings.

This document only provides guidelines for how CJK characters should be
handled within a zone, how language issues should be considered and
incorporated, and how domain name labels containing CJK characters
should be administered (including registration, deletion and transfer
of labels).  It does not provide any guidance for handling non-CJK
characters or languages in zones.

Other IDN policies, such as the creation of new TLDs or the cost
structure for registrations, are outside the scope of this document.
Such discussions should be conducted in forums outside the IETF as well.

Technical implementation issues are not discussed here either.  For
example, the decision as to whether each of these guidelines should be
implemented as a registry or registrar action is left to zone
administrators, possibly differing from zone to zone.

3.1. Principles underlying these Guidelines

In many places, this document assumes "First-Come-First-Serve" (FCFS)
as the conflict policy in the event of a dispute, although FCFS is not
listed as one of the principles.  If other policies dominate
priorities and "rights", one can use these guidelines by replacing uses
of FCFS in this document with appropriate other policy rules specific to
the zone.  In other cases, some of these guidelines may not be
applicable, although some alternatives for determining rights to labels
-- such as use of UDRP or mutual exclusion -- might have little impact
on other aspects of these guidelines.

(a) Each IDL to be registered should be associated with one or more
languages.

Although some Unicode strings may be pure identifiers made up of an
assortment of characters from many languages and scripts, IDLs are
likely to be names or phrases that have certain meaning in some
language.  While a zone administration might or might not require
"meaning" as a registration criterion, the possibility of meaning
provides a useful tool when trying to avoid user confusion.

Zone administrators should administratively associate one or more
languages with each IDL.  These associations should either be
pre-determined by the zone administrator and applied to the entire zone
or chosen by the registrants on a per-IDL basis.  The latter may be
necessary for some zones, but will make administration more difficult
and will increase the likelihood of conflicts in variant forms.

A given zone might have multiple languages associated with it, or have
no language specified at all, but doing so may provide additional
opportunities for user confusion, and is therefore not recommended.

The zone administrator must also verify the validity of the IDL
requested by using information associated with the chosen language and
possibly other rules as appropriate.

(b) When an IDL is registered, all of the character variants for the
associated language(s) should be reserved for the registrant.  Each
language associated with the IDL will lead to different character
variants.

IDL reservations of the type described here normally do not appear in
the distributed DNS zone file.  In other words, these reserved IDLs do
not resolve. Domain name holders could request these reserved IDLs to
be placed in the zone file and made active and resolvable as, e.g.,
aliases or synonyms.

Since different languages may imply different sets of variants, the
IDLs reserved for one IDL may overlap those reserved for another.  In
this case, the reserved IDLs should be bound to one registration or the
other, or excluded from both, according to the applicable registration
or dispute resolution policy for the zone.

(c) For a given base language, the IDL may have one or more recommended
variants that should be suggested to the domain name holder for active
registration as synonyms.

Some language rules may prefer certain variants over others. To
increase the likelihood of correct and predictable resolution of the
IDL by end-users, the recommended variants should be active.

(d) The IDL and its reserved variants with the language(s) association
must be atomic.

The IDL and its reserved variants for the associated language(s) are to
be considered as a single unit -- an "IDL Package". For a given IDL,
that IDL package is defined by these guidelines and created upon
registration.

The IDL Package is atomic: Transfer and deletion of IDL are performed
on the IDL Package as a whole. IDL, either active or reserved, within
the IDL Package must not be transferred or deleted individually.  I.e.,
any re-registration, transfers, or other actions that impact the IDL
should also impact the reserved variants.  Separate registration or
other actions for the variants are not possible if these guidelines are
to accomplish their purpose.

Conflict policy of the zone may result in violation of the IDL Package
atomicity. In such case, the conflict policy would take precedence.

3.2. Registration of IDL

Conforming to the principles described in 3.1, the registration of an
IDL would require at least two components, i.e., the character variant
tables for the language and the registration algorithm.

3.2.1. Language character variant table

Any lines starting with, or portions of lines after, the hash
symbol ("#") are treated as comments. Comments have no significance in
the processing of the tables, nor are there any syntax requirements
between the hash symbol and the end of the line. Blank lines in the
tables are ignored completely.

Every language should have a character variant table provided by a
relevant group (or organization or other body) and based on established
standards. The group that defines a particular character variant table
should document references to the appropriate standards at the beginning
of the table, each tagged with the word "Reference" followed by an
integer (the reference number) followed by the description of the
reference.  For example,

Reference 1 CP936 (commonly known as GBK)
Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt
Reference 3 List of Simplified character Table (Simplified column)
Reference 4 zSimpVariant in Unihan.txt
Reference 5 variant that exists in GB2312, common simplified hanzi

Each language character variant table must have a version number. This
is tagged with the word "Version" followed by an integer and then
followed by the date in the format YYYYMMDD, where YYYY is the 4 digit
Year, MM is the 2 digit Month and DD is the 2 digit Day of the
publication date of the table.  For example,

Version 1 20020701      # July 2002 Version 1

The table has three fields, separated by semicolons.  The fields are:
"valid code point"; "recommended variant(s)"; and "character
variant(s)".

Only code points listed in the "valid code point" field are allowed to
be registered as part of an IDL associated with that language.

There can be one or more "recommended variant(s)" (i.e., entries in the
"recommended variant(s)" column). If the "recommended variant(s)"
column is empty, then there is no corresponding variant.

The "character variant(s)" column contains all variants of the code
point, including but not limited to the code point itself and the
"recommended variant(s)".

If the variant is composed of a sequence of code points, then the
sequence of code points is listed, separated by spaces, in the
"recommended variant(s)" or "character variant(s)" column.

If there are multiple variants, each variant must be separated by a
comma in the "recommended variant(s)" or "character variant(s)".

Any code point listed in the "recommended variant(s)" column must be
allowed, by the rules for the relevant language, to be registered.
However, this is not a requirement for the entries in the "character
variant(s)" column; it is possible that some of those entries may not
be allowed to be registered.

Every code point in the table should have a corresponding reference
number (associated with the references) specified to justify the entry.
The reference number is placed in parentheses after the code point. If
there is more than one reference, then the numbers are placed within a
single set of parentheses and separated by commas.
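
The table format described above can be read with a few lines of
code.  The following Python sketch is one possible reading of the
rules, not a reference parser: the function names and the two sample
entry lines are hypothetical, reference numbers on code points are
accepted but discarded, and no validation beyond field splitting is
attempted.

```python
import re

REFS = re.compile(r"\([0-9,\s]*\)")    # e.g. "(1)" or "(1,3)"

def parse_variants(field):
    """Split one column into variants, each a string of characters."""
    field = REFS.sub("", field)        # reference numbers are dropped here
    return ["".join(chr(int(cp, 16)) for cp in item.split())
            for item in field.split(",") if item.strip()]

def parse_table(text):
    table = {"references": {}, "version": None, "entries": {}}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # strip comments
        if not line:
            continue                             # blank lines are ignored
        if line.startswith("Reference "):
            _, number, description = line.split(None, 2)
            table["references"][int(number)] = description
        elif line.startswith("Version "):
            _, number, date = line.split()
            table["version"] = (int(number), date)
        else:
            valid, recommended, variants = line.split(";")
            table["entries"][parse_variants(valid)[0]] = (
                parse_variants(recommended), parse_variants(variants))
    return table

SAMPLE = """\
Reference 1 CP936 (commonly known as GBK)
Version 1 20020701      # July 2002 Version 1
98DB(1);98DE(1);98DB(1),98DE(1)   # hypothetical entry
98DE(1);98DE(1);98DB(1),98DE(1)   # hypothetical entry
"""

table = parse_table(SAMPLE)
print(table["version"])
```

Each entry maps a valid code point to a pair: the list of recommended
variants and the list of all character variants.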

3.2.2. Formal syntax

This section uses the IETF "ABNF" metalanguage [ABNF].

LanguageCharacterVariantTable = 1*ReferenceLine VersionLine 1*EntryLine
ReferenceLine = "Reference" SP RefNo SP RefDescription [ Comment ] CRLF
RefNo = 1*DIGIT
RefDescription = *( VCHAR / WSP )
VersionLine = "Version" SP VersionNo SP VersionDate [ Comment ] CRLF
VersionNo = 1*DIGIT
VersionDate = YYYYMMDD
EntryLine = ( VariantEntry / Comment ) CRLF
VariantEntry = ValidCodePoint [ "(" RefList ")" ] ";" RecommendedVariant
               ";" CharacterVariant [ Comment ]
ValidCodePoint = CodePoint
RefList = RefNo *( "," RefNo )
RecommendedVariant = [ CodePointSet *( "," CodePointSet ) ]
CharacterVariant = CodePointSet *( "," CodePointSet )
CodePointSet = CodePoint *( SP CodePoint )
CodePoint = 4*6HEXDIG
Comment = "#" *( VCHAR / WSP )

YYYYMMDD is an integer representing a date where YYYY is the 4 digit
year, MM is the 2 digit month and DD is the 2 digit day.

3.2.3. Registration Algorithm

(An explanation of these steps follows them)

1.      IN <= IDL to be registered and
        {L} <= Set of languages associated with IN
2.      {V} <= Set of version numbers of the language character
               variant tables derived from {L}
3.      NP(IN) <= Nameprep processed IN  and
        check availability of NP(IN).
        If not available, route to conflict policy.
4.      For each AL in {L}
4.1.      Check validity of NP(IN) in AL. If failed, stop processing.
4.2.      PV(IN,AL) <= Set of available Nameprep processed recommended
                       variants of NP(IN) in AL
4.3.      RV(IN,AL) <= Set of available Nameprep processed character
                       variants of NP(IN) in AL
4.4.    End of Loop
5.      {PV} <= Set of all PV(IN,AL) with optional processing.
6.      {ZV} <= {PV} set-union NP(IN)
7.      {RV} <= Set of all RV(IN,AL) set-minus {ZV}
8.      Create IDL Package for IN using IN, {L}, {V}, {ZV} and {RV}
9.      Put {ZV} into zone file

Explanation

Step 1 takes the IDL to be registered and the associated language(s) as
input to the process.

Step 2 extracts the set of version numbers of the tables for the
associated language(s).

Step 3 applies Nameprep to the IDL.  If the Nameprep processed IDL is
already registered or reserved, then the conflict policy is applied
here.  For example, if FCFS is used, the registration process would stop
here.

Step 4 goes through all languages associated with the proposed IDL,
checks for validity in each language, and generates the recommended
variants and the reserved variants.

In step 4.1, IDL validation is done by checking that every code point
in the Nameprep processed IDL is a code point allowed by the "valid
code point" column of the character variant table for the language. If
one or more code points are invalid, the registration process must stop
here.

Step 4.2 generates the list of recommended variants of the IDL by
forming every combination of the variants listed in the "recommended
variant(s)" column for each code point in the Nameprep processed IDL.
Generated variants must be processed with Nameprep.  If any of the
recommended variants of the IDL is registered or reserved, then the
conflict policy will be applied, although this does not prevent the IDL
from being registered.  For example, if FCFS is used, then the
conflicting variant(s) will be removed from the list.

Step 4.3 generates the list of reserved variants by forming every
combination of the variants listed in the "character variant(s)" column
for each code point in the Nameprep processed IDL.  Generated variants
must be processed with Nameprep.  If any of the variants are registered
or reserved, then the conflict policy will apply here, although this
does not prevent the IDL from being registered.  For example, if FCFS is
used, then the conflicting variants will be removed from the list.

The "combination" in Step 4.2 and Step 4.3 could be achieved by a
recursive function similar to the following pseudo code:

Function Combination(Str)
  F <= first code point of Str
  SStr <= Substring of Str, without the first code point
  NSC <= {}

  If SStr is empty Then
    For each V in (Variants of code point F)
      NSC <= NSC set-union (the string with the code point V)
    End of Loop
  Else
    SubCom <= Combination(SStr)
    For each V in (Variants of code point F)
      For each SC in SubCom
        NSC <= NSC set-union (the string with the
               first code point V followed by the string SC)
      End of Loop
    End of Loop
  Endif

  Return NSC
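
A direct transcription of this pseudo code into Python might look as
follows.  It is only a sketch: the variants argument (a mapping from
each code point to the list of its variants, taken from one column of
a table) and the sample data are assumptions made for this example,
and the input string is assumed to be non-empty.

```python
def combination(label, variants):
    """Return every string obtained by per-code-point substitution."""
    first, rest = label[0], label[1:]
    if not rest:
        return set(variants[first])
    sub_combinations = combination(rest, variants)
    return {v + sc for v in variants[first] for sc in sub_combinations}

# Hypothetical "character variant(s)" column for four code points.
VARIANTS = {
    "\u98db": ["\u98db", "\u98de"],
    "\u98de": ["\u98db", "\u98de"],
    "\u6a5f": ["\u6a5f", "\u673a"],
    "\u673a": ["\u6a5f", "\u673a"],
}

result = combination("\u98db\u6a5f", VARIANTS)
print(len(result))
```

The size of the result is the product of the number of variants of
each code point, so labels with many multi-variant code points can
generate large reserved sets.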


Step 5 generates the list of all recommended variants for all languages.
Optionally, the algorithm may reduce the list of recommended variants
by prompting the user to select the recommended variants.

Step 6 generates the list of variants, including the Nameprep processed
IDL, which are to be activated, and Step 7 generates the list of
reserved variants.

Then an "IDL Package" for the IDL is created in Step 8 with the original
IDL, the associated language(s), the list of activated IDLs and the
list of reserved variants.  The version numbers of the language
character variant tables are also stored in the IDL Package.

Lastly, in Step 9, the activated IDLs are converted using ToASCII [IDNA]
with UseSTD3ASCIIRules on and then put into the zone file. If the IDL is
a subdomain name, it will be delegated. The activated IDLs may be
delegated to a different domain name server so long as it is owned by
the same domain name holder.
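
Steps 1 to 8 can be sketched in Python as shown below.  This is an
illustration under stated assumptions, not a reference
implementation: the VariantTable and IDLPackage classes, the register
function and the sample table are hypothetical, FCFS is assumed as
the conflict policy, the optional user selection in Step 5 is
skipped, and Step 9 (ToASCII and zone file output) is left out.
Python's encodings.idna.nameprep stands in for Nameprep.

```python
from dataclasses import dataclass
from encodings.idna import nameprep
from itertools import product

@dataclass
class VariantTable:            # hypothetical in-memory form of one table
    version: int
    recommended: dict          # valid code point -> recommended variants
    variants: dict             # valid code point -> all character variants

@dataclass
class IDLPackage:
    idl: str
    languages: list
    versions: dict             # {V}, keyed by language
    active: set                # {ZV}
    reserved: set              # {RV}

def combos(label, column):
    """All Nameprep processed strings built from per-character variants."""
    return {nameprep("".join(p)) for p in product(*(column[c] for c in label))}

def register(idl, languages, tables, taken):
    """taken is the set of labels already registered or reserved."""
    versions = {lang: tables[lang].version for lang in languages}  # step 2
    np_in = nameprep(idl)                                          # step 3
    if np_in in taken:
        raise ValueError("label not available")                    # FCFS
    pv, rv = set(), set()
    for lang in languages:                                         # step 4
        table = tables[lang]
        if any(c not in table.variants for c in np_in):            # step 4.1
            raise ValueError("code point not valid for " + lang)
        pv |= combos(np_in, table.recommended) - taken             # step 4.2
        rv |= combos(np_in, table.variants) - taken                # step 4.3
    zv = pv | {np_in}                                              # step 6
    rv -= zv                                                       # step 7
    taken |= zv | rv
    return IDLPackage(idl, list(languages), versions, zv, rv)      # step 8

# Hypothetical table excerpt for the two characters of *fei1 ji1*.
FEI, FEI_S, JI, JI_S = "\u98db", "\u98de", "\u6a5f", "\u673a"
zh = VariantTable(
    version=1,
    recommended={FEI: [FEI_S], FEI_S: [FEI_S], JI: [JI_S], JI_S: [JI_S]},
    variants={FEI: [FEI, FEI_S], FEI_S: [FEI, FEI_S],
              JI: [JI, JI_S], JI_S: [JI, JI_S]},
)
taken = set()
package = register(FEI + JI, ["zh"], {"zh": zh}, taken)
print(len(package.active), len(package.reserved))
```

With this sample table, the requested label and its recommended
variant become active, and the two mixed forms are reserved.  A later
request for any of the four labels would be refused under FCFS.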

3.3. Deletion and Transfer of IDL and IDL Package

In normal domain administration, every domain name label is independent
of all other domain name labels.  Registration, deletion and transfer
of domain name labels is done on a per domain name label basis.
Depending on the zone's administrative policies, aliases (e.g., "CNAME"
entries) may be bound to particular labels with rules about whether one
can be changed without the other.  Current policies in gTLDs generally
prohibit registration of such aliases, in part to avoid needing to form
and enforce policies about these change (or binding) rules.

However, with internationalization, each IDL is bound to a list of
variant IDLs (with the list depending on the associated language),
bound together in an IDL Package.

Because all variants of the IDL should belong to a single domain name
holder, the IDL Package should be treated as a single entity.
Individual IDLs, either active or reserved, within the IDL Package must
not be deleted or transferred independently of the other IDLs.
Specifically, if an IDL is to be deleted or transferred, that action
must be taken only as part of an action that affects the entire IDL
Package.

If the local conflict policy requires an IDL to be transferred or
deleted independently of the IDL Package, the conflict policy would take
precedence. In such an event, the conflict policy should be associated
with a transfer or delete procedure that takes the IDL Package into
consideration.

When an IDL Package is deleted, all the active and reserved variants
would be available again.  IDL Package deletion does not change any
other IDL Packages, including IDL Packages that have variants that
conflict with the variants in the deleted IDL Package. This is to be
consistent with the atomicity and predictability of the IDL Package.
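
The package-level behaviour can be sketched with a small model.  The
Zone class below is hypothetical and assumes FCFS; it only tracks
which package each label is bound to, so that transfer and deletion
act on whole packages and deleting one package leaves every other
package unchanged.

```python
class Zone:
    """Hypothetical registry view: which package each label is bound to."""

    def __init__(self):
        self.package_of = {}   # label (active or reserved) -> package id
        self.holder_of = {}    # package id -> domain name holder

    def add_package(self, package_id, holder, labels):
        # FCFS: labels already bound to another package are not granted.
        granted = {l for l in labels if l not in self.package_of}
        for label in granted:
            self.package_of[label] = package_id
        self.holder_of[package_id] = holder
        return granted

    def transfer_package(self, package_id, new_holder):
        # One assignment moves every label in the package together.
        self.holder_of[package_id] = new_holder

    def delete_package(self, package_id):
        freed = {l for l, p in self.package_of.items() if p == package_id}
        for label in freed:
            del self.package_of[label]
        del self.holder_of[package_id]
        return freed

zone = Zone()
first = zone.add_package("pkg-1", "holder-a", {"x", "y"})
second = zone.add_package("pkg-2", "holder-b", {"y", "z"})   # "y" is taken
freed = zone.delete_package("pkg-1")
print(sorted(freed), zone.package_of)
```

After the deletion, "y" is available again but is not added to the
second package, which keeps exactly the labels it was created with.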

3.4. Activation and De-activation of IDL variants

As there are active IDLs and inactive IDLs within an IDL Package,
processes are required to activate or de-activate IDL variants in an
IDL Package.

The activation algorithm is described below:

1. IN <= IDL to be activated & PA <= IDL Package
2. NP(IN) <= Nameprep processed IN
3. If NP(IN) not in {RV} then stop
4. {RV} <= {RV} set-minus NP(IN) and {ZV} <= {ZV} set-union NP(IN)
5. Put {ZV} into the zone file

Similarly, the deactivation algorithm is:

1. IN <= IDL to be deactivated & PA <= IDL Package
2. NP(IN) <= Nameprep processed IN
3. If NP(IN) not in {ZV} then stop
4. {RV} <= {RV} set-union NP(IN) and {ZV} <= {ZV} set-minus NP(IN)
5. Put {ZV} into the zone file
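
The two algorithms above can be sketched in Python as follows.  This
is a minimal illustration, not a reference implementation: the
nameprep() function is a stand-in that only lower-cases its input
(real Nameprep processing does much more), the labels are
placeholders, and writing {ZV} to the zone file (step 5) is left to
the caller.

```python
def nameprep(label):
    # Stand-in for real Nameprep processing (an assumption for this
    # sketch); it only lower-cases the label.
    return label.lower()


def activate(label, zone_variants, reserved_variants):
    """Move a reserved variant into {ZV}. Returns True if anything changed."""
    prepared = nameprep(label)              # step 2
    if prepared not in reserved_variants:   # step 3
        return False
    reserved_variants.remove(prepared)      # step 4
    zone_variants.add(prepared)
    return True   # step 5: the caller republishes {ZV} in the zone file


def deactivate(label, zone_variants, reserved_variants):
    """Move an active variant back into {RV}. Returns True if anything changed."""
    prepared = nameprep(label)              # step 2
    if prepared not in zone_variants:       # step 3
        return False
    zone_variants.remove(prepared)          # step 4
    reserved_variants.add(prepared)
    return True   # step 5: the caller republishes {ZV} in the zone file


zv, rv = {"label-a"}, {"label-b", "label-c"}
activate("label-b", zv, rv)   # "label-b" moves from {RV} to {ZV}
```

Because each operation only moves a label between the two sets, the
union of {ZV} and {RV} is the same before and after, which matches the
requirement that the list of variants in a package does not change.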

3.5. Adding/Deleting language(s) association

The list of variants is generated from the IDL and tables for the
associated languages.  If the language associations are changed, then
the lists of variants have to be updated.  On the other hand, the IDL
Package is atomic and the list of variants must not be changed after
creation.

Therefore, this document recommends deleting the IDL Package followed
by a registration with the new set of languages rather than attempting
to add or delete language(s) association within the IDL Package.  Zone
administrators may find it desirable to devise procedures to prevent
other parties from capturing the labels in the IDL Package during these
operations.

3.6. Versioning of the language character variant tables

Language character variants tables are subject to change over time,
and the changes may or may not be backward compatible.  Different
versions of the language character variants tables may therefore
produce different sets of recommended variants and reserved variants.

New IDL Packages should use the latest version of the language
character variants tables.

Existing IDL Packages created using a previous version of the language
character variants tables are not affected when a new version of the
character variants tables is released.
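
One way to picture this versioning rule is to record, with each
package, the table version that was current when it was created.  The
sketch below assumes this approach; the names PackageRecord,
TableStore and register, the labels, and the second version string
are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageRecord:
    """Hypothetical record; frozen so a later table release cannot alter it."""
    label: str
    table_version: str  # version of the variants tables used at creation


class TableStore:
    """Hypothetical list of released table versions, oldest first."""

    def __init__(self):
        self.versions = []

    def release(self, version):
        self.versions.append(version)

    def latest(self):
        return self.versions[-1]


def register(label, tables):
    # New packages always use the latest released table version.
    return PackageRecord(label, tables.latest())


tables = TableStore()
tables.release("20020701")
older = register("label-a", tables)
tables.release("20030101")          # hypothetical later release
newer = register("label-b", tables)
```

Here older.table_version stays "20020701" after the second release,
while newer.table_version is "20030101".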

4. Example of Guideline Adoption

To provide a meaningful example, some language character variants
tables have to be defined.  Assume, then, that the following four
language character variants tables are defined (note that these tables
do not represent the actual tables and do not contain sufficient
entries to be used in any actual implementation):

a) language character variants tables for zh-cn and zh-sg

Reference 1 CP936 (commonly known as GBK)
Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt
Reference 3 List of Simplified character Table (Simplified column)
Reference 4 zSimpVariant in Unihan.txt
Reference 5 variant that exists in GB2312, common simplified hanzi

Version 1 20020701 # July 2002

56E2(1);56E2(5);5718(2)                 # sphere, ball, circle; mass, lump
5718(1);56E2(4);56E2(2),56E3(2)         # sphere, ball, circle; mass, lump
60F3(1);60F3(5);                        # think, speculate, plan, consider
654E(1);6559(5);6559(2)                 # teach
6559(1);6559(5);654E(2)                 # teach, class
6DF8(1);6E05(5);6E05(2)                 # clear
6E05(1);6E05(5);6DF8(2)                 # clear, pure, clean; peaceful
771E(1);771F(5);771F(2)                 # real, actual, true, genuine
771F(1);771F(5);771E(2)                 # real, actual, true, genuine
8054(1);8054(3);806F(2)                 # connect, join; associate, ally
806F(1);8054(3);8054(2),8068(2)         # connect, join; associate, ally
96C6(1);96C6(5);                        # assemble, collect together


b) language variants table for zh-tw

Reference 1 CP950 (commonly known as BIG5)
Reference 2 zVariant, zTradVariant, zSimpVariant in Unihan.txt
Reference 3 List of Simplified Character Table (Traditional column)
Reference 4 zTradVariant in Unihan.txt

Version 1 20020701 # July 2002

5718(1);5718(4);56E2(2),56E3(2)         # sphere, ball, circle; mass, lump
60F3(1);60F3(1);                        # think, speculate, plan, consider
6559(1);6559(1);654E(2)                 # teach, class
6E05(1);6E05(1);6DF8(2)                 # clear, pure, clean; peaceful
771F(1);771F(1);771E(2)                 # real, actual, true, genuine
806F(1);806F(3);8054(2),8068(2)         # connect, join; associate, ally
96C6(1);96C6(1);                        # assemble, collect together

c) language variants table for ja

Reference 1 CP932 (commonly known as Shift-JIS)
Reference 2 zVariant in Unihan.txt
Reference 3 variant that exists in JIS X0208, commonly used Kanji

Version 1 20020701 # July 2002

5718(1);5718(3);56E3(2)                 # sphere, ball, circle; mass, lump
60F3(1);60F3(3);                        # think, speculate, plan, consider
654E(1);6559(3);6559(2)                 # teach
6559(1);6559(3);654E(2)                 # teach, class
6DF8(1);6E05(3);6E05(2)                 # clear
6E05(1);6E05(3);6DF8(2)                 # clear, pure, clean; peaceful
771E(1);771E(1);771F(2)                 # real, actual, true, genuine
771F(1);771F(1);771E(2)                 # real, actual, true, genuine
806F(1);806F(1);8068(2)                 # connect, join; associate, ally
96C6(1);96C6(3);                        # assemble, collect together

d) language variants table for ko

Reference 1 CP949 (commonly known as EUC-KR)
Reference 2 zVariant in Unihan.txt

Version 1 20020701 # July 2002

5718(1);56E2(1);56E3(2)                 # sphere, ball, circle; mass, lump
60F3(1);60F3(1);                        # think, speculate, plan, consider
654E(1);6559(1);6559(2)                 # teach
6DF8(1);6E05(1);6E05(2)                 # clear
771E(1);771F(1);771F(2)                 # real, actual, true, genuine
806F(1);8054(1);8068(2)                 # connect, join; associate, ally
96C6(1);96C6(1);                        # assemble, collect together
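
The entry lines in the four sample tables above share one layout:
three fields separated by ";", each holding zero or more items of the
form CODE(reference), followed by an optional "#" comment.  The
following sketch parses one such entry line.  It is an illustration
only: the function name parse_line is hypothetical, reference numbers
are discarded, and the "Reference" and "Version" header lines are
assumed to be handled elsewhere.

```python
import re

# One "CODE(reference)" item, e.g. "56E2(4)".
ENTRY = re.compile(r"([0-9A-Fa-f]{4,6})\((\d+)\)")


def parse_line(line):
    """Parse one entry line into (valid, preferred list, variant list)."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = data.split(";")
    if len(fields) != 3:
        raise ValueError("expected three ';'-separated fields: %r" % line)
    # Convert each field to a list of integer code points; the reference
    # numbers in parentheses are ignored in this sketch.
    valid, preferred, variants = (
        [int(code, 16) for code, _ref in ENTRY.findall(part)]
        for part in fields
    )
    if len(valid) != 1:
        raise ValueError("expected exactly one valid code point: %r" % line)
    return valid[0], preferred, variants


entry = parse_line(
    "5718(1);56E2(4);56E2(2),56E3(2)         # sphere, ball, circle; mass, lump")
```

The comment is removed before the fields are split because, as in the
line used here, a comment may itself contain ";".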

Example 1: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4*
           {L} = {zh-cn, zh-sg, zh-tw}

NP(IN) = (U+6E05 U+771F U+6559)
PV(IN,zh-cn) = (U+6E05 U+771F U+6559)
PV(IN,zh-sg) = (U+6E05 U+771F U+6559)
PV(IN,zh-tw) = (U+6E05 U+771F U+6559)
{ZV} = {(U+6E05 U+771F U+6559)}
{RV} = {(U+6E05 U+771E U+6559),
        (U+6E05 U+771E U+654E),
        (U+6E05 U+771F U+654E),
        (U+6DF8 U+771E U+6559),
        (U+6DF8 U+771E U+654E),
        (U+6DF8 U+771F U+6559),
        (U+6DF8 U+771F U+654E)}
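
The reserved set of Example 1 can be reproduced with the short sketch
below.  It assumes that the reserved variants are every combination of
each code point with its listed character variants, less the zone
variants.  That assumption matches Example 1, but the authoritative
procedure is the algorithm defined earlier in this document, and other
cases may involve steps that are not modelled here.  The names
VARIANTS and reserved_variants are hypothetical.

```python
from itertools import product

# Character variants (third column) for the code points of Example 1,
# copied from the zh-cn/zh-sg and zh-tw sample tables.
VARIANTS = {
    "zh-cn": {0x6E05: [0x6DF8], 0x771F: [0x771E], 0x6559: [0x654E]},
    "zh-tw": {0x6E05: [0x6DF8], 0x771F: [0x771E], 0x6559: [0x654E]},
}
VARIANTS["zh-sg"] = VARIANTS["zh-cn"]  # zh-sg shares the zh-cn table


def reserved_variants(label, languages, zone_variants):
    """All per-code-point combinations, minus the zone variants."""
    choices = []
    for code_point in label:
        options = {code_point}
        for language in languages:
            options.update(VARIANTS[language].get(code_point, []))
        choices.append(sorted(options))
    return set(product(*choices)) - set(zone_variants)


label = (0x6E05, 0x771F, 0x6559)
zv = {label}
rv = reserved_variants(label, ["zh-cn", "zh-sg", "zh-tw"], zv)
```

Each of the three code points has two choices, giving eight
combinations; removing the single zone variant leaves the seven
reserved variants listed in Example 1.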

Example 2: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4*
           {L} = {ja}

NP(IN) = (U+6E05 U+771F U+6559)
PV(IN,ja) = (U+6E05 U+771F U+6559)
{ZV} = {(U+6E05 U+771F U+6559)}
{RV} = {(U+6E05 U+771E U+6559),
        (U+6E05 U+771E U+654E),
        (U+6E05 U+771F U+654E),
        (U+6DF8 U+771E U+6559),
        (U+6DF8 U+771E U+654E),
        (U+6DF8 U+771F U+6559),
        (U+6DF8 U+771F U+654E)}

Example 3: IDL = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4*
           {L} = {zh-cn, zh-sg, zh-tw, ja, ko}

NP(IN) = (U+6E05 U+771F U+6559) *qing2 zhen1 jiao4*
Invalid registration because U+6E05 is invalid in L = ko
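
The validity check that rejects Example 3 can be sketched as below.
The VALID sets are the first columns of the sample tables above; the
function name first_invalid is hypothetical, and the sketch checks
only code point validity, not any other registration rule.

```python
# Valid code points (first column) of the sample tables.
VALID = {
    "zh-cn": {0x56E2, 0x5718, 0x60F3, 0x654E, 0x6559, 0x6DF8,
              0x6E05, 0x771E, 0x771F, 0x8054, 0x806F, 0x96C6},
    "zh-tw": {0x5718, 0x60F3, 0x6559, 0x6E05, 0x771F, 0x806F, 0x96C6},
    "ja": {0x5718, 0x60F3, 0x654E, 0x6559, 0x6DF8,
           0x6E05, 0x771E, 0x771F, 0x806F, 0x96C6},
    "ko": {0x5718, 0x60F3, 0x654E, 0x6DF8, 0x771E, 0x806F, 0x96C6},
}
VALID["zh-sg"] = VALID["zh-cn"]  # zh-sg shares the zh-cn table


def first_invalid(label, languages):
    """Return (code point, language) for the first failure, or None."""
    for language in languages:
        for code_point in label:
            if code_point not in VALID[language]:
                return code_point, language
    return None


label = (0x6E05, 0x771F, 0x6559)
problem = first_invalid(label, ["zh-cn", "zh-sg", "zh-tw", "ja", "ko"])
```

For Example 3 the check fails at U+6E05 in ko, while the same label
with only zh-cn, zh-sg and zh-tw (Example 1) passes.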

Example 4: IDL = (U+806F U+60F3 U+96C6 U+5718)
                 *lian2 xiang3 ji2 tuan2*
           {L} = {zh-cn, zh-sg, zh-tw}

NP(IN) = (U+806F U+60F3 U+96C6 U+5718)
PV(IN,zh-cn) = (U+8054 U+60F3 U+96C6 U+56E2)
PV(IN,zh-sg) = (U+8054 U+60F3 U+96C6 U+56E2)
PV(IN,zh-tw) = (U+806F U+60F3 U+96C6 U+5718)
{ZV} = {(U+8054 U+60F3 U+96C6 U+56E2),
        (U+806F U+60F3 U+96C6 U+5718)}
{RV} = {(U+8054 U+60F3 U+96C6 U+56E3),
        (U+8054 U+60F3 U+96C6 U+5718),
        (U+806F U+60F3 U+96C6 U+56E2),
        (U+806F U+60F3 U+96C6 U+56E3),
        (U+8068 U+60F3 U+96C6 U+56E2),
        (U+8068 U+60F3 U+96C6 U+56E3),
        (U+8068 U+60F3 U+96C6 U+5718)}

Example 5: IDL = (U+8054 U+60F3 U+96C6 U+56E2)
                 *lian2 xiang3 ji2 tuan2*
           {L} = {zh-cn, zh-sg}

NP(IN) = (U+8054 U+60F3 U+96C6 U+56E2)
PV(IN,zh-cn) = (U+8054 U+60F3 U+96C6 U+56E2)
PV(IN,zh-sg) = (U+8054 U+60F3 U+96C6 U+56E2)
{ZV} = {(U+8054 U+60F3 U+96C6 U+56E2)}
{RV} = {(U+8054 U+60F3 U+96C6 U+56E3),
        (U+8054 U+60F3 U+96C6 U+5718),
        (U+806F U+60F3 U+96C6 U+56E2),
        (U+806F U+60F3 U+96C6 U+56E3),
        (U+806F U+60F3 U+96C6 U+5718),
        (U+8068 U+60F3 U+96C6 U+56E2),
        (U+8068 U+60F3 U+96C6 U+56E3),
        (U+8068 U+60F3 U+96C6 U+5718)}

Example 6: IDL = (U+8054 U+60F3 U+96C6 U+56E2)
                 *lian2 xiang3 ji2 tuan2*
           {L} = {zh-cn, zh-sg, zh-tw}

NP(IN) = (U+8054 U+60F3 U+96C6 U+56E2)
Invalid registration because U+8054 is invalid in L = zh-tw

Example 7: IDL = (U+806F U+60F3 U+96C6 U+5718)
                 *lian2 xiang3 ji2 tuan2*
           {L} = {ja,ko}

NP(IN) = (U+806F U+60F3 U+96C6 U+5718)
PV(IN,ja) = (U+806F U+60F3 U+96C6 U+5718)
PV(IN,ko) = (U+806F U+60F3 U+96C6 U+5718)
{ZV} = {(U+806F U+60F3 U+96C6 U+5718)}
{RV} = {(U+806F U+60F3 U+96C6 U+56E3),
        (U+8068 U+60F3 U+96C6 U+5718),
        (U+8068 U+60F3 U+96C6 U+56E3)}

i. Notes

1. The terms "i18n" and "l10n", sometimes used in upper-case form (i.e.,
"I18N" and "L10N"), have become popular in international standards
usage as abbreviations for "internationalization" and "localization",
respectively.  Each abbreviation is formed from the first and last
letters of the word, with the number of characters between them placed
in the middle.  For example, in "internationalization", there are 18
characters between the initial "i" and the terminal "n".
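
The derivation can be checked with a few lines of Python.  The
function name abbreviate is hypothetical and is used only for this
illustration.

```python
def abbreviate(word):
    """First letter, count of the letters in between, last letter."""
    if len(word) < 3:
        return word  # nothing in between to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


i18n = abbreviate("internationalization")
l10n = abbreviate("localization")
```

Here i18n holds "i18n" and l10n holds "l10n", since the two words have
20 and 12 letters respectively.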

2. Every human language is unique and therefore, every linguistic and
localization issue is also unique. It is difficult or impossible to
make comparisons across multiple languages or to classify them into
categories.  And any cross-language analogies are, by their very nature,
imperfect at best.

For example, to classify Traditional Chinese/Simplified Chinese as
upper/lower case makes as much sense as to classify TC/SC as "spelling
variant" like "color" and "colour". Both comparisons are potentially
useful but neither is completely correct.

3. The variants in CJK are very complex and require many different
layers of solution. This guideline is one of the solution components,
but it is not sufficient, by itself, to solve the whole problem.

ii. Acknowledgements

The authors gratefully acknowledge the contributions of:

V.CHEN, N.HSU, H.HOTTA, S.TASHIRO, Y.YONEYA and other Joint Engineering
Team members at the JET meeting in Bangkok.

Yves Arrouye, an observer at the JET meeting, for his contribution on
the IDL Package.

Soobok LEE
L.M TSENG
Patrik FALTSTROM
Paul HOFFMAN
Erin CHEN
LEE Xiaodong
Harald ALVESTRAND

iii. Author(s)

James SENG
PSB Certification
3 Science Park Drive
#03-12 PSB Annex
Singapore 118233
Phone: +65 6885-1657
Email: jseng@pobox.org.sg

Kazunori KONISHI
JPNIC
Kokusai-Kougyou-Kanda Bldg 6F
2-3-4 Uchi-Kanda, Chiyoda-ku
Tokyo 101-0047
JAPAN
Phone: +81 49-278-7313
Email: konishi@jp.apan.net

Kenny HUANG
TWNIC
3F, 16, Kang Hwa Street, Taipei
Taiwan
TEL : 886-2-2658-6510
Email: huangk@alum.sinica.edu

QIAN Hualin
CNNIC
No.6 Branch-box of No.349 Mailbox, Beijing 100080
Peoples Republic of China
Email: Hlqian@cnnic.net.cn

KO YangWoo
PeaceNet
Yangchun P.O. Box 81 Seoul 158-600
Korea
Email: newcat@peacenet.or.kr

John C KLENSIN
1770 Massachusetts Ave, No. 322
Cambridge, MA 02140
USA
Email: Klensin+ietf@jck.com

iv.  Appendix A

[How to read the Han Ideograph provided in this document. --  Will
complete this section in next revision]

v. Normative References

[ABNF]      Augmented BNF for Syntax Specifications: ABNF, RFC 2234, D.
            Crocker and P. Overell, Eds., November 1997.

[I18NTERMS] Terminology Used in Internationalization in the IETF,
            draft-hoffman-i18n-terms-07.txt, September 2002,
            Paul Hoffman, work in progress

[RFC3066]   Tags for the Identification of Languages, RFC3066,
            Jan 2001, H. Alvestrand

[IDNA]      Internationalizing Domain Names in Applications,
            draft-ietf-idn-idna, Feb 2002, Patrik Faltstrom,
            Paul Hoffman, Adam M. Costello, work in progress

[PUNYCODE]  Punycode: An encoding of Unicode for use with IDNA,
            draft-ietf-idn-punycode, Feb 2002, Adam M. Costello,
            work in progress

[STRINGPREP]Preparation of Internationalized Strings,
            draft-hoffman-stringprep, Feb 2002, Paul Hoffman,
            Marc Blanchet, work in progress

[NAMEPREP]  Nameprep: A Stringprep Profile for Internationalized
            Domain Names, draft-ietf-idn-nameprep, Feb 2002,
            Paul Hoffman, Marc Blanchet, work in progress

[UNIHAN]    Unicode Han Database, Unicode Consortium
            ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt

[UNICODE]   The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Unicode Standard Annex #28,
            (http://www.unicode.org/unicode/reports/tr28/) defines
            Version 3.2 of The Unicode Standard.

[ISO7098]   ISO 7098:1991 Information and documentation -- Romanization
            of Chinese, ISO/TC46/SC2.

vi. Non-normative References

[IDN-WG]    IETF Internationalized Domain Names Working Group,
            idn@ops.ietf.org, James Seng, Marc Blanchet.
            http://www.i-d-n.net/

[STD13]     Paul Mockapetris, "Domain names - concepts and facilities"
            (RFC 1034) and "Domain names - implementation and
            specification" (RFC 1035), STD 13, November 1987.

[C2C]       Pitfalls and Complexities of Chinese to Chinese Conversion,
            http://www.cjk.org/cjk/c2c/c2c.pdf, Jack Halpern, Jouni
            Kerman

vii. Other Issues

It is possible that many variants generated may have no meaning in the
associated language or languages.  The intention is not to generate
meaningful "words" but to generate similar variants to be reserved.

The language character variants tables are critical to the success of
the guideline.  A badly designed table may either generate too many
meaningless variants or may not generate enough meaningful variants.
The principles to be used to generate the tables are not within the
scope of this document, nor are the tables themselves.

This document recommends against registration of IDLs in a particular
language until the language character variants table for that language
is available.

Outstanding Issues

(1)     Erin suggested (if I (JcK) correctly understood her) that, if
multiple languages are associated with a given name, the recommended
variant list for a given code point be treated as the intersection of
the variant lists for each of the languages, not the union.  As I
understand the current algorithm, it effectively takes the union.
Taking the intersection has the technical advantage that it would
significantly reduce the number of variant strings that must be
reserved.  It also has the policy advantage of discouraging people
from registering with multiple languages if they don't need to -
otherwise, we will have everyone trying to register in all of the
possibly-relevant languages, which would make this effort a good deal
less effective than it might be.

Taking the intersection is also consistent with a rule that appears to
exist now.  As shown in Example 3, if an attempt is made to register a
name and associate it with multiple languages, it must be valid in all
of those languages or the registration attempt will fail.  So we
intersect the validity criteria on a language basis, and should
probably intersect the variants.

But that is an algorithm change, since we have to extract the variant
lists for each code point for each language, take the intersection,
and then process against that, rather than against each language in
turn.

[JS - I disagree with taking the intersection of the sets.  No doubt
taking the intersection would reduce the abuse of specifying multiple
languages to increase the set of reserved variants, but our goal is
precisely to reserve as many variants as possible for the domain name
holder, not vice versa.

Suppose we have a string ABC with variants ABD ACD ABF in Chinese, ABE
ACD in Japanese and CBD ACD in Korean.

Assuming a registrant registers ABC in CJK, right now he will get the
reserved set of {ABC, ABD, ACD, ABF, ABE, CBD}.

On the other hand, if we take the intersection, this set will be
reduced to {ACD}, leaving other variants like ABD, ABF, ABE and CBD
open for potential conflict.  And the only way he can protect against
this confusion is to register ABD, ABF, ABE and CBD individually,
something we are trying to prevent.]
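
The difference between the two approaches in the ABC illustration can
be checked with plain set operations.  The sketch below uses the three
variant lists exactly as stated in the illustration, so its union
contains every string listed for any of the languages; the variable
names are hypothetical.

```python
# Variants of the string ABC, per language, as given in the illustration.
variants = {
    "Chinese": {"ABD", "ACD", "ABF"},
    "Japanese": {"ABE", "ACD"},
    "Korean": {"CBD", "ACD"},
}

# Union: everything that is a variant in at least one language.
union = set().union(*variants.values())

# Intersection: only what is a variant in every language.
intersection = set.intersection(*variants.values())

# Variants that would no longer be reserved under the intersection rule.
left_open = union - intersection
```

With these lists the intersection is {ACD} alone, and the four other
variants in the union are left open.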

[Further explanation by Erin:

I'm sorry; maybe my previous suggestion was not clear enough.

I mean that if multiple languages are associated with a given name, the
range of valid code points should be the intersection over all the
associated languages.

But if multiple languages are associated with a given name, the
recommended variants should be taken as the union and put into the zone
file.  In the same way, the character variants should also be taken as
the union over the languages.]

(2)     A note went by indicating that the plan was to drop the Han
characters from the IETF-submission version of this document.  We can
post I-Ds in PDF and publish RFCs in PDF and/or Postscript, as long as
we provide ASCII.   I find having the Han characters very useful, and
trust that those of you who can read them find them even more so.  So
I would suggest that we hand off the pair of an ASCII document (with
the Han characters removed) and a PDF document (that looks like the
Word text we have been looking at) to the I-D editor.  I've got full
Acrobat here and can presumably produce the thing if needed.

(3)     We still need to sort out the issue of whether reserving a
variant that may (in a current or future table) conflict with another
character, with the possibility of activating it, is an invitation to
cybersquatting and other abuses.  That isn't clear, so let me try an
illustration: suppose we have a character X, with variants A, B, and C,
and a character Y, with variants D and C.  Now, if Y is registered
first, then its package includes {Y*, D, C}, using the symbol "*" to
denote an active name.  When X is registered, its package consists of
{X, A, B}.  X's owner can't reserve or activate C, since it was
reserved to Y.  But much of the reason for doing all of this work was
the concern that C can be confused with either Y or X.  So doesn't
this create an opportunity for Y to threaten, or extort money from, X
by threatening to activate C?

[JS -- The conflict of X & Y over C in this case could be resolved by
the existing conflict policy. The revised guideline now makes it
possible to modify the IDL Package in the event of a dispute.]

That problem gets worse, I think, if Erin's suggestion in (1) is not
adopted.  And I continue to believe that the only solution that will
work is to prevent anyone from activating C.  Or, more generally, at
any given time, there will be a set of language variant tables that
will be considered valid by the administrator of a particular zone.
The zone administrator would take the union of all of those tables,
using the 'valid code point' as the key as usual, and then permanently
reserve any character that appeared more than once in a variant column.
Small matter of programming.
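
A rough sketch of that "small matter of programming" follows.  It
reads "appeared more than once in a variant column" as "listed as a
variant of more than one valid code point after the tables are merged",
which is an interpretation rather than something the text settles.
The function name contested_variants is hypothetical, and the letters
stand in for code points as in the illustration above.

```python
from collections import Counter


def contested_variants(tables):
    """tables: iterable of dicts mapping a valid code point to its variants."""
    merged = {}
    for table in tables:
        # Union of all tables, keyed by the valid code point.
        for valid, variants in table.items():
            merged.setdefault(valid, set()).update(variants)
    # Count, for each variant, how many valid code points list it.
    counts = Counter(v for variants in merged.values() for v in variants)
    return {v for v, n in counts.items() if n > 1}


tables = [
    {"X": ["A", "B", "C"]},
    {"Y": ["D", "C"]},
]
blocked = contested_variants(tables)
```

For the X and Y illustration only C is listed under both characters,
so C is the one variant that would be permanently reserved.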

(4)     On page 9, in the paragraph starting with "The character
variant(s) column contains ..."

This seems to be saying that the code points listed in the third
column will always be a proper superset of the union of the first and
second columns.  If that is correct, it violates a fundamental
principle that I was taught about  good programming and systems design
-- minimization of duplication of information, since such duplicates
are error-prone.   And, if I have not interpreted the intent correctly,
the text needs to be fixed.  Somehow.

[JS -- correct, it is duplicated. The duplication is bad from a
system design point of view, but it makes the table 'complete' and
easy to explain.]

This is an older version of an Internet-Draft that was ultimately
published as RFC 3743 (RFC stream: Independent Submission).