Network Working Group                                       Chris Newman
Internet-Draft                                          Sun Microsystems
Intended Status: Proposed Standard                         Martin Duerst
                                                Aoyama Gakuin University
                                                        Arnt Gulbrandsen
                                                  Oryx Mail Systems GmbH
                                                              March 2007


         i;basic - Registration of Unicode Collation Algorithm
                draft-gulbrandsen-collation-basic-01.txt


Status of this Memo

    By submitting this Internet-Draft, each author represents that any
    applicable patent or other IPR claims of which he or she is aware
    have been or will be disclosed, and any of which he or she becomes
    aware will be disclosed, in accordance with Section 6 of BCP 79.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as Internet-
    Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress".

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt.  The list of Internet-
    Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

    This draft expires in September 2007.


Copyright Notice

    Copyright (C) The IETF Trust (2007).


Abstract

    The Unicode Collation Algorithm is a widely usable collation
    covering all of Unicode. It produces tolerable results for many
    locales as-is, and can be further improved using locale-specific



Newman et al.            Expires September 2007                 [Page 1]


Internet-draft                                                March 2007


    tables. This document registers the UCA in the IETF's collation
    registry.


Table of Contents

    1. Conventions Used in This Document
    2. i;basic: The Unicode Collation Algorithm
    3. Registration
    4. Security Considerations
    5. IANA Considerations
    6. Acknowledgements
    7. References
     7.1. Normative References
    8. Authors' Addresses


1.  Conventions Used in This Document

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
    document are to be interpreted as described in [RFC2119].


2.  i;basic: The Unicode Collation Algorithm

    The basic collation is intended to provide tolerable results for a
    number of languages for all three operations (equality, substring,
    and ordering), which makes it suitable as a mandatory-to-implement
    collation for protocols that include ordering support.  The
    ordering operation of the basic collation is the Unicode Collation
    Algorithm [UCAv14].

    The equality and substring operations are created as described in
    UCAv14 section 8.  Although that section is informative in UCAv14,
    it is normative for this collation specification.

    This collation is based on Unicode version 3.2.  The relevant
    tables are:

    1. For the normalization step,
       http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt is
       used.  Column 5 is used to determine the canonical decomposition,
       while column 3 contains the canonical combining classes necessary
       to attain canonical order.

    2. The table of characters which require a logical order exception
       is a subset of the table in
       http://www.unicode.org/Public/3.2-Update/PropList-3.2.0.txt and
       is included here:

            0E40..0E44    ; Logical_Order_Exception
            # Lo   [5] THAI CHARACTER SARA E..
            #          THAI CHARACTER SARA AI MAIMALAI
            0EC0..0EC4    ; Logical_Order_Exception
            # Lo   [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI

            # Total code points: 10

    3. The table used to translate normalized code points to a sort key
       is http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt.
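
    As a rough illustration of how the first two tables might be read,
    the following Python sketch parses one line in the UnicodeData
    format and tests membership in the Logical_Order_Exception table.
    The function names are hypothetical, and the two sample lines are
    typed in here for illustration; an implementation would read the
    files cited above.  Fields are counted from zero, which is the
    assumed reading of "column 3" and "column 5".  The last statement
    uses the Unicode 3.2.0 database shipped with Python as
    unicodedata.ucd_3_2_0 for comparison.

```python
import unicodedata

# Code point ranges copied from the Logical_Order_Exception table above.
LOGICAL_ORDER_EXCEPTION_RANGES = [(0x0E40, 0x0E44), (0x0EC0, 0x0EC4)]


def is_logical_order_exception(ch):
    """Return True if ch is one of the ten Thai/Lao vowels in the table."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LOGICAL_ORDER_EXCEPTION_RANGES)


def parse_unicode_data_line(line):
    """Return (code point, combining class, canonical decomposition).

    Fields are semicolon-separated and counted from zero: field 3 is
    the canonical combining class and field 5 the decomposition
    mapping.  Mappings that start with a <tag> are compatibility
    decompositions and are ignored in this sketch.
    """
    fields = line.strip().split(";")
    code_point = int(fields[0], 16)
    combining_class = int(fields[3])
    mapping = fields[5]
    if mapping and not mapping.startswith("<"):
        decomposition = [int(cp, 16) for cp in mapping.split()]
    else:
        decomposition = []
    return code_point, combining_class, decomposition


# Sample lines in the UnicodeData format (illustrative input).
A_RING = ("00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;"
          "0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;")
RING = "030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;"

# Canonical decomposition using Python's bundled Unicode 3.2.0 data.
nfd_3_2 = unicodedata.ucd_3_2_0.normalize("NFD", "\u00c5")
```

    Here parse_unicode_data_line(A_RING) yields the decomposition
    U+0041 U+030A, and the combining class 230 parsed from RING is
    the value that canonical ordering would sort on.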


    UCAv14 includes a number of configurable parameters and steps
    labelled as potentially optional.  The following list summarizes the
    defaults used by this collation:

    -  The logical order exception step is mandatory by default to
       support the largest number of languages.

    -  Steps 2.1.1 to 2.1.3 are mandatory as the repertoire of the basic
       collation is intended to be large.

    -  The second level in the sort key is evaluated forwards by
       default. This can be changed using the "direction2" variable.

    -  The variable weighting uses the "non-ignorable" option by
       default.

    -  The semi-stable option is not used by default.

    -  Support for one level of collation is the default behavior, i.e.,
       the collation is case-insensitive and ignores accents. This can
       be changed using the "strength" variable.

    -  If the collation is adjusted to be case-sensitive, the
       "casefirst" variable can be used to determine whether upper case
       sorts before or after lower case.

    -  No preprocessing step is used by the basic collation prior to
       applying the UCAv14 algorithm.  Note that an application protocol
       specification MAY require preprocessing prior to the use of any
       collations.

    -  The equality and substring algorithms use the "Whole Characters
       Only" feature described in UCAv14 section 8 by default.
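
    The effect of the "strength" and "direction2" defaults listed above
    can be sketched as follows.  This is a simplified illustration of
    sort key formation, not a complete implementation: it assumes the
    collation elements have already been looked up, handles only
    levels 1 to 3, and ignores "casefirst".  The function name and all
    weights are hypothetical and invented for the example; real
    weights come from the allkeys table.

```python
def form_sort_key(elements, strength=1, direction2="forwards"):
    """Build a sort key from (primary, secondary, tertiary) tuples.

    Levels are appended in turn, separated by a zero weight, and zero
    (ignorable) weights are skipped.  Only the first `strength` levels
    are used; level 2 is reversed when direction2 is "backwards".
    """
    if strength not in (1, 2, 3):
        raise ValueError("this sketch only handles strength 1 to 3")
    if direction2 not in ("forwards", "backwards"):
        raise ValueError("direction2 must be forwards or backwards")
    key = []
    for level in range(strength):
        if level > 0:
            key.append(0)  # level separator
        weights = [ce[level] for ce in elements if ce[level] != 0]
        if level == 1 and direction2 == "backwards":
            weights.reverse()
        key.extend(weights)
    return tuple(key)


# Hypothetical weights, invented for illustration only.
CAFE_PLAIN = [(0x10, 0x20, 0x02), (0x11, 0x20, 0x02),
              (0x12, 0x20, 0x02), (0x13, 0x20, 0x02)]
# The same letters followed by a primary-ignorable accent element.
CAFE_ACUTE = CAFE_PLAIN + [(0x00, 0x32, 0x02)]
# The same letters with a first element that differs only at level 3.
CAFE_UPPER = [(0x10, 0x20, 0x08)] + CAFE_PLAIN[1:]
```

    With the default strength of 1, all three inputs produce the same
    key, which is what makes the default collation ignore accents and
    case.  At strength 2 the accented input gets a different key, and
    at strength 3 the input with the different tertiary weight does
    too.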



    The "version" variable specifies the version of the Unicode
    Collation Algorithm to use. The default is 4.1.0. UCA versions older
    than 4.1.0 (revision 14 of the UCA report, [UCAv14]) are not legal.
    UCA versions newer than 4.1.0 are legal, although not defined at the
    time of writing.
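
    A check of the "version" variable against the minimum stated above
    might look like the following sketch.  The function names are
    hypothetical, and padding a short version such as "4.1" with
    zeros is an assumption of this example, not something this
    document specifies.

```python
MINIMUM_VERSION = (4, 1, 0)


def parse_uca_version(text):
    """Turn a dotted version such as "4.1.0" into a tuple of ints."""
    parts = tuple(int(part) for part in text.split("."))
    # Assumption: missing trailing components are treated as zero.
    return parts + (0,) * (3 - len(parts))


def is_legal_version(text):
    """Return True if the version is 4.1.0 or newer."""
    return parse_uca_version(text) >= MINIMUM_VERSION
```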

    The exact collation identifier with these defaults is "i;basic".
    When a specification states that the basic collation is mandatory-
    to-implement, only this specific identifier is mandatory-to-
    implement.

    Sort keys are generated as described in section 4.3 of the UCA
    specification. (Note that the result is not a string of characters.)


3.  Registration

    <?xml version='1.0'?>
    <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
    <collation rfc="XXXX" scope="i18n" intendedUse="common">
      <identifier>i;basic</identifier>
      <title>Basic</title>
      <operations>equality order substring</operations>
      <specification>RFC XXXX</specification>
      <owner>IETF</owner>
      <submitter>chris.newman@sun.com</submitter>
      <variable>
        <name>version</name>
        <default>4.1.0</default>
      </variable>
      <variable>
        <name>direction2</name>
        <default>forwards</default>
        <value>forwards</value>
        <value>backwards</value>
      </variable>
      <variable>
        <name>strength</name>
        <default>1</default>
        <value>1</value>
        <value>2</value>
        <value>3</value>
        <value>4</value>
        <value>5</value>
      </variable>
      <variable>
        <name>casefirst</name>
        <default>off</default>
        <value>off</value>
        <value>upper</value>
        <value>lower</value>
      </variable>
    </collation>
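
    As an illustration of how the variables in such a registration
    could be consumed, the following Python sketch reads the variable
    names, defaults and permitted values with the standard library's
    ElementTree parser.  The embedded fragment is abridged and is
    written with each element closed by its own matching end tag; the
    function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the registration, for illustration only.
REGISTRATION = """\
<collation rfc="XXXX" scope="i18n" intendedUse="common">
  <identifier>i;basic</identifier>
  <operations>equality order substring</operations>
  <variable>
    <name>version</name>
    <default>4.1.0</default>
  </variable>
  <variable>
    <name>direction2</name>
    <default>forwards</default>
    <value>forwards</value>
    <value>backwards</value>
  </variable>
  <variable>
    <name>strength</name>
    <default>1</default>
    <value>1</value>
    <value>2</value>
    <value>3</value>
    <value>4</value>
    <value>5</value>
  </variable>
  <variable>
    <name>casefirst</name>
    <default>off</default>
    <value>off</value>
    <value>upper</value>
    <value>lower</value>
  </variable>
</collation>
"""


def read_variables(xml_text):
    """Map each variable name to (default, list of permitted values)."""
    root = ET.fromstring(xml_text)
    variables = {}
    for var in root.findall("variable"):
        name = var.findtext("name")
        default = var.findtext("default")
        allowed = [value.text for value in var.findall("value")]
        variables[name] = (default, allowed)
    return variables


variables = read_variables(REGISTRATION)
```

    A variable with no value elements, such as "version", ends up with
    an empty list of permitted values in this sketch.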


4.  Security Considerations

    This document raises no security issues that are not already
    described in [RFC4790].


5.  IANA Considerations

    The IANA is requested to add the above i;basic registration to the
    collation registry, http://www.iana.org/assignments/collation/.


6.  Acknowledgements

    This document was split off from [RFC4790] during its time as a
    draft. Many of the people acknowledged in that RFC helped with this:
    Brian Carpenter, John Cowan, Dave Cridland, Mark Davis, Spencer
    Dawkins, Lisa Dusseault, Lars Eggert, Frank Ellermann, Philip
    Guenther, Tony Hansen, Ted Hardie, Sam Hartman, Kjetil Torgrim
    Homme, Michael Kay, John Klensin, Alexey Melnikov, Jim Melton and
    Abhijit Menon-Sen.


7.  References

7.1. Normative References

    [RFC2119]  Bradner, "Key words for use in RFCs to Indicate
               Requirement Levels", RFC 2119, Harvard University, March
               1997.

    [RFC4790]  Newman, Duerst, Gulbrandsen, "Internet Application
               Protocol Collation Registry", RFC 4790, February 2007.

    [UCAv14]   Davis, Whistler, "Unicode Collation Algorithm version
               14", May 2005,
               <http://www.unicode.org/reports/tr10/tr10-14.html>.


8.  Authors' Addresses

    Chris Newman
    Sun Microsystems
    3401 Centrelake Dr., Suite 410
    Ontario, CA  91761
    US

    Email: chris.newman@sun.com


    Martin Duerst
    Aoyama Gakuin University
    5-10-1 Fuchinobe
    Sagamihara
    Kanagawa
    229-8558
    Japan

    Phone: +81 42 759 6329
    Fax: +81 42 759 6495
    Email: duerst@it.aoyama.ac.jp
    Web: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

    Note: Please write "Duerst" with u-umlaut wherever possible, for
    example as "D&#252;rst" in XML and HTML.


    Arnt Gulbrandsen
    Oryx Mail Systems GmbH
    Schweppermannstr. 8
    D-81671 Muenchen
    Germany

    Fax: +49 89 4502 9758

    Email: arnt@oryx.com


Intellectual Property Statement

    The IETF takes no position regarding the validity or scope of any
    Intellectual Property Rights or other rights that might be claimed
    to pertain to the implementation or use of the technology described
    in this document or the extent to which any license under such
    rights might or might not be available; nor does it represent that
    it has made any independent effort to identify any such rights.
    Information on the procedures with respect to rights in RFC
    documents can be found in BCP 78 and BCP 79.

    Copies of IPR disclosures made to the IETF Secretariat and any
    assurances of licenses to be made available, or the result of an
    attempt made to obtain a general license or permission for the use
    of such proprietary rights by implementers or users of this
    specification can be obtained from the IETF on-line IPR repository
    at http://www.ietf.org/ipr.

    The IETF invites any interested party to bring to its attention any
    copyrights, patents or patent applications, or other proprietary
    rights that may cover technology that may be required to implement
    this standard.  Please address the information to the IETF at ietf-
    ipr@ietf.org.


Copyright Statement

    Copyright (C) The IETF Trust (2007).

    This document is subject to the rights, licenses and restrictions
    contained in BCP 78, and except as set forth therein, the authors
    retain all their rights.


Disclaimer of Validity

    This document and the information contained herein are provided on
    an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
    REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
    IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
    WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
    WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
    ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
    FOR A PARTICULAR PURPOSE.


Acknowledgment

    Funding for the RFC Editor function is currently provided by the
    Internet Society.


          (RFC Editor: Please delete everything after this point)


Open Issues

    This -00 draft is published in order to establish version history.
    Several necessary changes have NOT been made.

    The Unicode version choice needs consideration. 3.2 seems old? And
    can the ten-element table be dropped - why is it there?

    The variable names should be aligned with what
    http://unicode.org/reports/tr35/#Collation_Elements describes. IMO
    the best thing to do is to copy the CLDR names.

    The variable defaults need to be considered when doing the above
    rename.


Changes in -01:

    - Better title (suggested by Martin Duerst).

    - Struck the "uv" variable, merged "uv" and "variable", and aligned
      the result with the UCA version variable (as explained in
      http://unicode.org/reports/tr35/#Collation_Elements). Starting
      version changed to 4.0, since that's the oldest version for which
      the two can be merged.

    - Changed the default strength to 1, and called strength strength
      instead of matchLevel since that's what the UCA calls it and it
      seems sensible.

    - Added the casefirst variable from the UCA. (Several other
      variables were not added, as I'm uncertain of the right names and
      default values.)

Changes in -00:

    - No substantive changes from draft-newman-i18n-comparator.