draft-alvestrand-i18n-howto-01

Internet-Draft                                       H. Alvestrand
draft-alvestrand-i18n-howto-01.txt                   Cisco Systems
Target Category: Informational                       November 2001
                                                 Expires: May 2002

















Protocol Redesigner's Handbook - volume i18n


Guidelines for internationalization of protocols




Status of this Memo

     The file name of this memo is draft-alvestrand-i18n-howto-01.txt

     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC 2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

Discussion on this draft should be directed to the mailing list
intloc-discuss@ops.ietf.org. This is NOT an open mailing list.
Guidelines for protocol internationalization     Harald Alvestrand
draft-alvestrand-i18n-howto-01.txt                Expires May 2002

Abstract

This document attempts to give guidelines for people who have to deal
with existing protocols where issues of languages and character sets
were not considered from the beginning, and tries to help them a little
along the way. Some of the advice might also be useful for people
designing new protocols.

For new protocols, the document might help in getting
internationalization right on the first attempt; at this stage, we all
know that protocols MUST be internationalized.

Table of Contents

Status of this Memo
Abstract
1. Introduction
2. Classes of information
3. Designing Internet internationalization
 3.1  Basic concepts for the Internet
 3.2  Internationalization components outside IETF scope
 3.3  Operations likely to be impacted by internationalization
 3.4  How to tell whether to identify a script or a language
4. Specific choices in sorting, matching and canonicalization options
 4.1  Internationalized encodings
 4.2  Normalization
 4.3  Choosing limited character sets for "names"
 4.4  Comparator functions
5. Tricks to shoehorn stuff into older protocols
 5.1  Redefining "text" as UTF-8
 5.2  Example retrofits
6. Security Considerations
7. Acknowledgements
8. Author's Address
9. References




1. Introduction



Human beings on our planet have, past and present, used a number of
languages. These have been represented in a number of media using a
variety of encoding systems, most commonly in scripts built from
characters of some kind.

These days, humans want to use the Internet to communicate among
themselves and to interact with information stores on the Internet,
and they see no reason to learn a new language in order to do so.

This means that they have to use Internet protocols to communicate, and
that they will want those protocols to represent the scripts they
already use off the Internet.

And they expect the Right Thing to happen.

This document talks about what doing the Right Thing means.


2. Classes of information

Most protocols are designed with pieces that belong in various
categories:

<<this section should have examples>>

A. Protocol elements, defined by the protocol designer, which should
  never be shown to the user, and are never changed.
  Examples: Verbs in the SMTP protocol [RFC2822], SNMP object
  identifiers [SNMP]

B. Managed-namespace identifiers, defined by some orderly process,
  intended to be used by any protocol user anywhere, often through
  interfaces that hide the actual values, but sometimes directly.
  Examples: Language tags [RFC3066], URI schemes [URLREG]

C. Global-scope identifiers, intended for visibility to any user who has
  a use for them anywhere, but not completely managed by a central
  authority.
  Examples: DNS names, URLs, IP addresses, user@domain email addresses

D. Local-scope identifiers, intended for visibility to a small set of
  users, but possibly visible in several contexts.
  Reasonable usage of such identifiers means that it is possible to
  appeal to some shared context in order to decide what an identifier
  "means".
  Examples: login account names, filenames within a directory, port
  numbers on a host

E. Data elements, intended for visibility within a certain context only.
  Examples: Text of email messages, Web page content, instant messages,
  subject lines in mail

Internationalizing an identifier or a data element in this context
means making it capable of representing information relevant to any
user, no matter which script or language this user uses. This may
involve dealing with character representation, processing rules,
language tagging, language negotiation or other functions as
appropriate.

For each element to be considered, there are 3 alternatives:

1. State that the element is immutable, invisible and inviolable, and
  therefore internationalization is irrelevant (and the
  protocol/product designer should try REALLY hard to make sure the
  user never knows or needs to know the value)

2. State that the element has to be in a very limited representation
  (such as the A-Z 0-9 character repertoire) so that it can be globally
  recognized and entered (822 headers, language tags)
  (the protocol designer might reasonably want the user to get at the
  value of the element, but should not depend on the user associating
  anything meaningful with the identifier)

3. State that the element is a textual element for which the user
  decides the appropriate content.  Basically, it has to be
  internationalized.



Internationalization requirements started out with data content (MIME
for email, for instance), and are working their way up the chain. For a
long time (see [IAB-ARCH], for instance), we thought that global-scope
identifiers like DNS names should be kept in category 2 (limited
repertoire), but increasing pressure from the community of people who
do not use ASCII in their daily lives has led to a reconsideration here
(IDN).

The current thinking of the group discussing this document, which is
suggested as IETF policy, is that protocol elements (A) and most if not
all managed-namespace identifiers (B) should be treated according to
alternative 2 above; their values should be binary, numeric or
invariant-subset ASCII. This makes testing and debugging easier, and
does not limit the expressive power of any protocol.

Note: Experience in the IETF is that implementers are lousy at hiding
things from the users, and users are often very fond of finding the
things implementers think should be hidden. That most people now know
that http:// means "you can look it up in a browser" is unsurprising;
the colloquial use of "404" (the HTTP status code for "document does
not exist") as a synonym for "not where he should be" is perhaps more
so.








3. Designing Internet internationalization

3.1 Basic concepts for the Internet

The fundamental difference between common
internationalization/localization and Internet protocol
internationalization is this:

ON THE INTERNET, THE TWO ENDS OF THE COMMUNICATION CANNOT BE ASSUMED TO
BE IN THE SAME PLACE.

This means, in particular, that:

. The two ends of the communication may not share a common external
  context such as a "locale"; quite commonly, the two ends are in
  different countries, and may not even know (or care) what country the
  other end of the conversation is in.

. The two ends of the communication do not necessarily have ANY common
  knowledge except for the implementation of the protocol. With
  implementations in local networks, not even Internet access can be
  assumed, so even references to Internet-accessible resources are not
  guaranteed to work.

This means that:

. ALL information required for correct operation of the protocol must
  be specified in the protocol documentation, or be carried in the
  communication between the parties

. When user preferences are involved, and multiple values are possible,
  the specification must guarantee a least common subset of
  identifiers, and properly handle the enumeration of identifiers (for
  instance by IANA registration).

One note that has more to do with the psychology of developers than
with correct specification:

It is better to fill in a field than to specify a default in a protocol
specification.

Some protocols have stated a "default value" and added a parameter to
change that value. Sometimes, for instance with the HTTP Content-Type
field, which had a "charset" parameter for the "text/html" type,
implementers reinterpreted the absence of the parameter as "anything
goes", and let their implementations ship anything they wanted without
labeling it, leaving the recipient to guess at charsets. This had
predictably dire consequences, and has led some people to believe that
it is better to "waste" the bytes required to always specify
explicitly what a parameter is, instead of relying on
a default.
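
As an illustration of always filling in the field, a minimal Python
sketch follows. The function names are hypothetical, and using the
standard library's email.message module to read the parameter is an
assumption of this example, not something any protocol mandates. The
sender always labels its text, and the receiver refuses to guess when
the label is missing.

```python
from email.message import Message


def labeled_content_type(charset: str = "utf-8") -> str:
    """Sender side: always emit the charset, even when it is the default."""
    return f"text/html; charset={charset}"


def declared_charset(content_type: str) -> str:
    """Receiver side: return the declared charset, or refuse to guess."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter
    if charset is None:
        raise ValueError("unlabeled text: refusing to guess the charset")
    return charset
```

Whether a receiver should reject unlabeled text outright or fall back
to some documented behaviour is a design choice; the point of the
sketch is only that the sender never leaves the field empty.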


When discussing internationalization, it is also very important to use
common terminology. The terminology of this field is littered with
seemingly simple words that are used for different things by different
people, with "character set", "script" and "language" being high on the
list of abused terms. Refer to [Hoffman].

3.2 Internationalization components outside IETF scope

Internationalizing a program or a service involves much more than the
protocols. But these other matters are not IETF issues, and do not
impinge upon the IETF standards process except indirectly.

In particular:

. The IETF does not standardize user interfaces. This means that input
  methods, display methods and display characteristics are out of scope
  for the IETF. (However, information about such methods and
  characteristics may at times have to be communicated using parameters
  of IETF protocols.)

. The IETF does not standardize APIs, except for the rare case of an
  API to a protocol

This also means that the presentation of data, and conversions upon
data performed in order to do presentation, is outside the scope of
IETF standards, while conversions upon data in order to do protocol
operations are in scope (and may possibly be reused for presentation
purposes). The IETF standards are chiefly concerned with communicating
the data needed, not how the data are presented. But the separation can
be unenforceable at times; we have a long history of defining data
representations "as seen by the user" - see, for instance, RFC 1685,
which talks about how to write down X.400 email addresses.

3.3 Operations likely to be impacted by internationalization

A basic level of internationalization is text representation. A
protocol where it is not possible to send an Arabic letter SAD
(U+0635), and let the recipient recognize this as such, is useless for
communication in Arabic.

This was addressed in RFC 2277, "IETF Policy on Character Sets and
Languages".

This is sufficient for handling text where that text is not treated
further by the protocol endpoint entities.

But there are a number of common operations that require the protocol
designer to do more thinking and specification when dealing with an
internationalized context:




. Equality tests - for instance deciding whether a typed string is
  identical to a username, or (even worse) a password

. Matching. If the protocol has any operation where one party gives a
  text element, and the other party performs an action based on the
  content of that text element, matching must take place. This needs
  specification.
  Typical sources of confusion include:

  . What characters match (does a SPACE match a NON-BREAKING SPACE?
     Does A match a? Does LATIN LETTER A match GREEK LETTER ALPHA?)

  . What, if anything, is used as "units" in a match? The concept of
     "word" can get very tricky with languages like Thai, which often
     do not use word separator characters.

  . How many characters there are. This is especially a problem when
     one uses "regular expressions", which can specify (for instance)
     "A and B, with exactly one character between them" - does A
     followed by COMBINING RING ABOVE followed by B match or not?

. Collation (sometimes called sorting). If the protocol requires
  elements to appear in a consistent order, collation needs
  specification. Collation will often need far more information than
  matching in order to provide the results the user expects; a
  collation based on codepoint value ("binary sort") is useless to the
  user except for the rare case where he does not care what the order
  is, as long as it is consistent.
  A common example is the case sensitivity problem; on Unix with the
  "C" locale, "Bread" is sorted before "apples", while under Windows,
  "Bread" is sorted after "apples", because Windows disregards case
  when sorting file names.

. Canonical forms. If the protocol ever expects to binary compare two
  objects for equality, or compute checksums over the objects as done
  for digital signatures, the implementations will often want to
  increase the probability that if a human looking at the data in the
  object thinks that it is unchanged, it actually compares equal. The
  most common method of doing this is to define a single "canonical"
  form for the data.

. Field truncation. In single-byte encodings, one is guaranteed that a
  field value produced by truncating a longer value is at least a valid
  string. With multibyte encodings, this is not the case; with
  variable-length encodings like UTF-8, there is no way to know without
  inspecting the string where legal truncation points may be. (In UTF-8
  one can find a legal point by inspecting relatively few octets around
  the cut point; in ISO 2022 based encodings, it may require
  significantly more effort)

. Checks for legal and illegal characters. In some cases, one wants to
  specify things like "no spaces". One then has to consider whether
  this means no SPACE (U+0020), no space separators (Unicode general
  category Zs), or no whitespace at all (a class that includes TAB, for
  instance).

. Bi-directional issues. If a protocol element (for instance a URI or
  a domain name) contains multiple elements of different
  directionality, what is the directionality of the separator elements?
  (This makes display REALLY awkward...)
  An example treatment of this problem can be found in [IRI-BIDI].
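
The field truncation item above can be sketched in Python as follows.
The function name truncate_utf8 is hypothetical, and the sketch
assumes that its input is already valid UTF-8. It backs up from the
cut point while the first octet to be dropped is a continuation octet
(bit pattern 10xxxxxx), so only a few octets around the cut point are
inspected.

```python
def truncate_utf8(data: bytes, max_octets: int) -> bytes:
    """Cut valid UTF-8 data to at most max_octets without splitting a character."""
    if len(data) <= max_octets:
        return data
    cut = max_octets
    # data[cut] is the first octet that would be dropped. If it is a
    # continuation octet, the cut falls inside a character: back up.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]
```

Note that this only keeps encoded characters whole. It does not keep a
base character together with a following combining character, which
is a separate decision for the protocol designer.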

3.4 How to tell whether to identify a script or a language

In many applications, the application is well served as long as a
string can be entered, stored and displayed correctly to the end user.
In other applications, there is some degree of interaction between the
meaning of the string and the action to be applied to it; in these
cases information about language is critical to make a correct
decision.

Approaches to language identification usually fall into 3 categories:

  . Guess the language (this requires a reasonably large chunk of text
     for accurate determination; with closely related languages, such
     as Norwegian and Danish, the required chunk can be in the hundreds
     of words)

  . Let a recipient (human) user identify the language, and apply the
     appropriate action manually

  . Make the application language independent, dealing with "words",
     and let the user define (for instance by configuration or by
     choice of words in search interfaces) what words should be
     considered.

Which one is appropriate depends on context.

Typical operations where language information is needed:

  - Dispatching on language: Trying to route an incoming query to a
     person who can understand it.

  - High quality display: due to the nature of the Han unification
     performed in Unicode, some native speakers claim that one must use
     different fonts for representing the same character codepoint in
     Japanese and in Chinese. The same problem occurs in some languages
     for the Cyrillic fonts.

  - Text to speech processing

  - Selecting an appropriate name: "Feuerwehr" versus "Fire station"
     in a German airport; "Bruxelles" versus "Brussels" on a map of
     Belgium


Things to consider when you decide what language information you need:

  - How much does it matter if you don't know the language?

  - How precise do you need language to be? If you mark something as
     "US English", will the Right Thing happen when the recipient
     understands only "English"? If you mark it as "Nynorsk" (language
     code "nn"), will the recipient who indicates a desire for
     Norwegian ("no") or English ("en") see the right content?
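
The precision question can be made concrete with the prefix rule for
language ranges described in [RFC3066]: a range matches a tag if the
two are equal, or if the range is a prefix of the tag and the next
character of the tag is "-". A minimal Python sketch, in which the
function name is hypothetical:

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching of a language range against a language tag."""
    wanted = language_range.lower()
    offered = tag.lower()
    if wanted == "*":
        return True
    return offered == wanted or offered.startswith(wanted + "-")
```

Under this rule a recipient asking for "en" is served content tagged
"en-US", but a recipient asking for "no" or "en" is not served content
tagged "nn"; any relationship between "nn" and "no" has to be known
from somewhere else.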

Examples of things that are really script issues:

  - Displaying <alpha> and <omega> on either side of a picture; as
     long as the correct shapes are generated, the user does not care
     which language they are considered to be in

  - I have a business card with <Alef Sad> on it, and the keyboard's
     keycaps have ASCII legends, and I don't know how to use it to
     enter Arabic characters

HTML 4.01 Section 8.1, "Specifying the language of content: the lang
attribute",

http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1

gives a reasonable treatment of language tagging in the context of
HTML.

Many problem areas that turned out to have a script solution can be
regarded as solved (at protocol level) when the carrier is defined to
support ISO 10646.


4. Specific choices in sorting, matching and canonicalization options

The cardinal rule of protocol internationalization should be:

DO NOT INVENT ANYTHING IF YOU CAN AVOID IT.

There are a number of ready-made things available, and a number of
pitfalls that these things have already dealt with.
However, there is no substitute for actually understanding the tools
you are using.

(specifics: Unicode identifier definition, UTF-8, ACAP/IMAP comparator
registry, IDN nameprep, UTR-15 canonicalization, case-folding...
suggestions!)

4.1 Internationalized encodings

When you transport I18N script across the wire, you don't actually
transport the script itself. You are transporting the bits which
represent the script.

How the bits are assembled from and disassembled into scripts depends
on character sets and encodings.



I18N is not just a simple "8-bit clean" problem.

ISO10646 is a character set with a very large number of characters
(94,000+ of which have defined meanings in Unicode 3.1) and thus
"8-bit" is technically not sufficient. An encoding is how you transport
an I18N script through your constrained environment.



It is STRONGLY recommended that ISO10646, and ONLY that, be used as a
reference character repertoire.



When one encoding that is easy to retrofit into an ASCII/8-bit
environment is desired, and variable length encoding is acceptable,
UTF-8 is the preferred encoding.



In other contexts, a four-octet encoding, possibly supplemented by a
compression function, might be appropriate (UCS-4/UTF-32BE). This MUST
ONLY be used in big-endian order. (Note that functions that involve
encryption almost always include a compression function.)



UTF-16 suffers from the endianness problem (UTF-16BE vs UTF-16LE), and
from the likelihood of badly implemented surrogate support; UTF-16 is
NOT RECOMMENDED.
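
To make the differences concrete, the following Python sketch encodes
one character outside the 16-bit range, U+1D11E (MUSICAL SYMBOL G
CLEF), in each of the encoding forms discussed above. The helper name
is an assumption of this example; the codec names are those of
Python's standard library.

```python
def encoded_forms(text: str) -> dict:
    """Return the octets (as hex) of text in several ISO 10646 encodings."""
    names = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
    return {name: text.encode(name).hex() for name in names}


forms = encoded_forms("\U0001D11E")
# forms["utf-8"]     == "f09d849e"  (four octets, none of them ASCII)
# forms["utf-16-be"] == "d834dd1e"  (a surrogate pair)
# forms["utf-16-le"] == "34d81edd"  (same pair, octets swapped)
# forms["utf-32-be"] == "0001d11e"  (one four-octet unit)
```

The two UTF-16 results show both problems named above: the octet order
must be known from somewhere, and the receiver must handle surrogate
pairs to recover the single character.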



Having two encodings defined inside a single protocol is a REALLY,
REALLY BAD IDEA. DO NOT DO THIS.

If you allow multiple encodings for a piece of text, the encoding must
be labelled. The MIME protocol has shown that, while adequate, this is
a bad idea. Sending software will use obscure encodings that the
receiving software cannot handle. Worse yet, sending software will
label something with an obscure name for which there is a more common
equivalent, and this still prevents the receiver from interpreting it.

Using a single encoding avoids this problem.

4.2 Normalization

Normalization is needed when you want canonical forms of scripts,
where string input arrives from multiple sources and the strings must
be compared with, or shown alongside, each other, e.g. in cases when
you need to do matching on "functional equality", comparison or
sorting of I18N elements. If normalization is needed, a good starting
reference would be Unicode UTR-15.



UTR-15 specifies multiple forms of normalization; this document
recommends normalization form C when dealing with text, and form KC (or,
equivalently, restricting the character repertoire to some subset of
that which is invariant under KC normalization) when limiting
namespaces for identifiers.



ISO/IEC 10646 contains characters that look similar or identical to
each other. For example, U+0041 (LATIN CAPITAL LETTER A) looks just
like U+0391 (GREEK CAPITAL LETTER ALPHA) in most fonts; there are
literally hundreds of other examples. In some cases, characters that
have very similar meaning but different looks can be normalized with
minimal loss of functionality, but full normalization to prevent
visually-similar characters is not feasible without losing character
meaning and thus possibly confusing typical users.



Note that normalization is not enough to convert the matching problem
into a binary comparison problem; see section 3.3.



Do remember that normalization is a one-way function which will not
preserve the original form.
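
The effect of the two recommended forms can be tried out with Python's
standard unicodedata module, which implements the UTR-15 forms under
the names "NFC" and "NFKC". The wrapper names below are assumptions of
this sketch.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalization form C: canonical equivalents become one form."""
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    """Normalization form KC: compatibility variants are folded as well."""
    return unicodedata.normalize("NFKC", text)


# Three spellings of A WITH RING ABOVE: precomposed, A + COMBINING
# RING ABOVE, and ANGSTROM SIGN. All three are equal after NFC.
spellings = ["\u00c5", "A\u030a", "\u212b"]
normalized = {nfc(s) for s in spellings}  # a set with one element
```

Form KC goes further: nfkc("\ufb01") turns LATIN SMALL LIGATURE FI
into the two letters "fi", while nfc leaves the ligature alone.
Neither form makes LATIN CAPITAL LETTER A equal to GREEK CAPITAL
LETTER ALPHA, and once ANGSTROM SIGN has been normalized there is no
way to tell that it was ever there.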

4.3 Choosing limited character sets for "names"

In quite a few cases, there is a need to support a limited character
set for something like "names", where more characters than ASCII are
needed, but large swaths of special things (spaces, punctuation and so
on) have to be excluded.

There is a lot of work already done on this; in particular, the Unicode
"identifier" definition [REFHERE] and the limited range of characters
used in the IDN domain name definition [REFHERE] are candidates.
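
A rough feel for what such a restriction looks like can be had from
the Python sketch below. It is only a stand-in: str.isidentifier()
implements Python's own identifier syntax, which is derived from the
Unicode identifier definition but is not identical to it (it accepts
"_" and rejects a leading digit, for instance), and the function name
is hypothetical. The check for invariance under form KC follows the
recommendation in section 4.2.

```python
import unicodedata


def is_acceptable_name(candidate: str) -> bool:
    """Accept only names that are KC-invariant and identifier-like."""
    if unicodedata.normalize("NFKC", candidate) != candidate:
        return False  # not invariant under KC normalization
    return candidate.isidentifier()
```

With this sketch, a name such as "bl\u00e5b\u00e6r" is accepted,
while "two words", "semi;colon" and a name containing LATIN SMALL
LIGATURE FI (U+FB01) are rejected.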



4.4 Comparator functions

As alluded to above, deciding how to compare two strings is a hard
task.

What's more, the number of ways in which people want to compare strings
is growing, not shrinking.

This means that within a protocol that is intended to serve many
purposes, you may need a means to name ways of comparing strings. This
need has been seen before, and attempts to fill it include:

  . The ACAP/IMAP comparator registry [REFHERE]

  . The ISO Cultural Conventions registry [REFHERE]

The target of the latter is far wider than the issue of string
comparators, but it also includes this.
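
The idea of naming comparators can be sketched in Python as a table
from names to key functions. The two names used here, "i;octet" and
"i;ascii-casemap", are borrowed from the ACAP registry as we
understand it; the implementations are simplified stand-ins written
for this example, not the registered definitions, and the function
names are hypothetical.

```python
from typing import Callable, Dict


def octet_key(text: str) -> bytes:
    """Compare on the raw UTF-8 octets."""
    return text.encode("utf-8")


def ascii_casemap_key(text: str) -> bytes:
    """Map only ASCII a-z to A-Z; every other octet is left untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b
                 for b in text.encode("utf-8"))


COMPARATORS: Dict[str, Callable[[str], bytes]] = {
    "i;octet": octet_key,
    "i;ascii-casemap": ascii_casemap_key,
}


def strings_equal(comparator: str, a: str, b: str) -> bool:
    """Compare two strings under a named comparator."""
    try:
        key = COMPARATORS[comparator]
    except KeyError:
        raise ValueError(f"unknown comparator: {comparator}") from None
    return key(a) == key(b)
```

Sorting ["apples", "Bread"] with octet_key as the sort key puts
"Bread" first, while ascii_casemap_key puts "apples" first, which is
the Unix versus Windows difference mentioned in section 3.3. Neither
comparator considers "\u00c5" and "\u00e5" equal.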


5. Tricks to shoehorn stuff into older protocols

Very rarely is the protocol redesigner given a clean slate, upon which
he can deploy properly targeted internationalization.

Most of his effort must be spent in figuring out how to create
modifications to the protocol that allow the protocol to offer the
features requested by the international user community, while still not
causing undue disruption for users who use older versions of the
protocol.

5.1 Redefining "text" as UTF-8

Most protocols with text in them defined without thought for
internationalization have one of three definitions of text:

  . ASCII

  . Latin1 (ISO 8859-1); this is common for protocols developed in
     Western Europe

  . Unspecified octets said to carry text



The last category may in practice be like the first, because nothing
but ASCII was ever used, or the first may be like the last, because
people were quietly ignoring the "ASCII only" requirement.

In theory, one can shoehorn internationalized text into the first case
by defining that any non-ASCII byte be considered part of a UTF-8
character (an extension), and into the last case by defining that only
UTF-8 is legal to carry (a restriction). In practice, the issue is
fuzzier.

  . What will be the reaction of old implementations on seeing
     extended characters? Ignore, barf or crash?

  . To what degree will old implementations send non-ASCII, non-UTF-8
     data to new implementations?
     What will happen when they do?

In protocols that do version negotiation, there is a theoretical answer
that says that you "just" move to a new version of the protocol, and
negotiation will take care of it. However, this is not trivial:

  . When version upgrade has never been done before, the negotiation
     machinery is untested.

  . When version upgrade has been common, implementations may choose
     to ignore a "minor" version number difference.

  . When the strings involved are identifiers, communication between
     old and new versions is troublesome: what should one do when an
     identifier cannot be represented in the old version of the
     protocol, yet needs to be referred to?

  . When protocol violations, such as putting Latin-1 in an ASCII-only
     field, have been common in an old version, how should the new
     version behave when faced with such violations?

The problems are endemic to any protocol with versions, but are often
brought to the fore by internationalization.
This has tempted many to go the route of "just" declaring a different
interpretation of strings, without changing the protocol version number
or doing option negotiation to enable the feature.

The case of Latin-1 (or, equivalently, Shift-JIS) is especially
troublesome, because there are byte sequences that can be interpreted
either as UTF-8 or as Latin-1. This means that even implementations
ready to tackle both encodings can be "fooled" into displaying
incorrect text to their users. This is worrying.
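
The ambiguity is easy to demonstrate. In the Python sketch below (the
function name is hypothetical), every octet string has a Latin-1
reading, and some also have a UTF-8 reading; a receiver that accepts
both has no way of telling from the octets alone which one the sender
meant.

```python
def possible_readings(data: bytes) -> dict:
    """Return every way data can be read as Latin-1 and/or UTF-8."""
    readings = {"latin-1": data.decode("latin-1")}  # never fails
    try:
        readings["utf-8"] = data.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8, so only the Latin-1 reading remains
    return readings


ambiguous = possible_readings("bl\u00e5b\u00e6r".encode("utf-8"))
# Two readings: the intended word, and the same octets seen as Latin-1.
```

Text that was really sent as Latin-1, such as the octets
b"bl\xe5b\xe6r", usually fails the UTF-8 test, which is why guessing
works often enough to be tempting and fails often enough to be
dangerous.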

In protocols with "feature negotiation", such as SMTP or LDAP, the
problem of versioning grows more complex: Any extension must be
considered for its interaction with any other extension. Does the
"character set" option interact with the "regexp search" option? With
the "return results later" option? With the "foobar" option?

The effort of evaluating (and implementing) an option can quickly turn
into a function of the square of the number of options.

5.2 Example retrofits

More examples of protocol internationalization can be found in
[I18N-CASES].

5.2.1 MIME

The Multipurpose Internet Mail Extensions are probably the most widely
deployed set of retrofits of internationalization in a preexisting
protocol.

It delivered:

  . The ability to have multiple character sets in mail bodies

  . The ability to have multiple character sets in parts of mail
     headers

It failed to deliver:

  . A unique encoding of a text to a transferred string; the sender
     can make multiple encodings from the same message body. This has
     implications for attempts to use digital signatures, among other
     things.

  . A language tagging ability for mail header components. A later
     attempt to add this has failed to see visible deployment.
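
The first failure can be illustrated with a short Python sketch. The
same body is given two of the MIME content transfer encodings, and a
SHA-256 digest stands in for whatever a digital signature would be
computed over; the function names and the choice of digest are
assumptions of this example.

```python
import base64
import hashlib
import quopri


def transfer_encodings(body: bytes) -> dict:
    """Encode one message body in two MIME content transfer encodings."""
    return {
        "base64": base64.b64encode(body),
        "quoted-printable": quopri.encodestring(body),
    }


def digest(octets: bytes) -> str:
    """Stand-in for the value a signature would be computed over."""
    return hashlib.sha256(octets).hexdigest()


body = "bl\u00e5b\u00e6r\n".encode("utf-8")
encoded = transfer_encodings(body)
```

Both entries of encoded decode back to the same body, yet their
digests differ, so a signature computed over the transferred string
does not survive a gateway that changes the transfer encoding.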

In some areas, it seems that MIME has delivered "labeled
non-interoperability", giving senders the ability to specify what they
send, but not providing a means to fit that to a generally accepted
subset, or to limit the sending to what can usefully be understood by
the recipient. But it has been very widely deployed, and has improved
interoperability among internationalized mail software enormously.

A more thorough analysis is given in [I18N-CASES].

5.2.2 SNMP version 3/SMI version 2

In the original Simple Network Management Protocol, a lot of fields
were labeled "text".

In the US context, these were naturally considered ASCII; in other
contexts, usually ASCII was used, but on occasion, other charsets such
as Latin-1 or iso-2022-jp could be found.

In the course of moving SMI version 2 to Draft and Standard, two
considerations were added:

  . A DISPLAY-HINT called "t" was added, indicating that the expected
     display format of the variable was as a UTF-8 string.

  . An understanding that putting text that was neither ASCII nor
     UTF-8 into a text variable was not consistent with the protocol

In the course of updating older MIBs, there was extensive discussion
about whether to add new variables with display-hint UTF-8 or to
redefine variables that had previously been understood as "text, any
charset" or "ascii" to be UTF-8.

<<here I need the MIB folks to tell me what was actually done!!!!!>>

<<examples wanted!>>

STILL BRAIN DUMP:



Beware of third answers to what have previously been binary questions
(history: the NIS yes/no hostname answer did a really rotten job on
TEMPFAIL).

Undisplayable characters - hieroglyphs at the user interface.

Both in names and other contexts - names are worse.

The copy/paste problem - including where the paste buffer is in the
brain of the user.

"there are things better done in protocol/servers, and things better
done in UI/client software/user brain, and the harder problem is
realizing which category they belong to".


6. Security Considerations

The security implications of improperly done internationalizations can
be considerable.

For instance:

. If one does not specify whether input lengths are counted in
  characters or octets, buffer overflows are likely.

. If multiple representations of the same character are allowed,
  multiple items can appear to the user to have the same name, even
  though they are distinct. This can be used as an attack.
  (Note that this is hard to avoid - see section 4.2 for more on this.)

. Signature failures (erroneous success or erroneous failure) due to
  improper canonicalization are a security problem, too; a server
  canonicalizing a name before comparing will never be able to match on
  a certificate containing an uncanonicalized name, for instance.

. Code being forced down "interesting" code paths because a string is
  used in normalized form in part of the code and unnormalized
  elsewhere. (example: the overlong UTF-8 code sequence, where one
  encodes leading zeroes so that (for instance) a carriage return can
  be slipped past the code that checks that a command line is just one
  line. This was the reason for outlawing overlong UTF-8 sequences in
  the Unicode Standard, version 3.1, section D.36)
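
The length, normalization and overlong-sequence points above can be
sketched in Python as follows. This is a minimal illustration under
stated assumptions, not a recommended implementation: the field limit
MAX_OCTETS, the helper names and the sample strings are all
hypothetical, and NFC is chosen only as an example of a single agreed
normalization form.

```python
import unicodedata

# 1. Characters versus octets: the same string has two different lengths.
name = "Bl\u00e5b\u00e6r"        # 6 characters
octets = name.encode("utf-8")    # 8 octets; two characters take 2 octets each
MAX_OCTETS = 8                   # hypothetical field limit, stated in octets


def fits_field(text: str, max_octets: int = MAX_OCTETS) -> bool:
    """Check the limit in the unit the protocol specifies (octets here)."""
    return len(text.encode("utf-8")) <= max_octets


# 2. Multiple representations: two code point sequences, one visible name.
composed = "\u00e5"       # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"    # "a" followed by COMBINING RING ABOVE


def same_name(a: str, b: str) -> bool:
    """Compare after normalizing both sides to one form (NFC as an example)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# 3. Overlong UTF-8: carriage return (0x0D) padded out to two octets.
OVERLONG_CR = b"\xc0\x8d"


def naive_is_single_line(command: bytes) -> bool:
    """Flawed check: looks for line breaks in the raw octets only."""
    return b"\r" not in command and b"\n" not in command


def strict_is_single_line(command: bytes) -> bool:
    """Decode strictly first, then check the decoded text for line breaks."""
    try:
        text = command.decode("utf-8")  # Python rejects overlong sequences
    except UnicodeDecodeError:
        return False
    return "\r" not in text and "\n" not in text
```

The naive check accepts a command containing OVERLONG_CR, and a
lenient decoder further down the code path could then turn those two
octets into a real carriage return. Decoding once, strictly, and
checking the decoded form keeps both parts of the code looking at the
same string.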


7. Acknowledgements

This document has benefited from many rounds of review and comments in
various fora of the IETF and the Internet working groups.

Any list of contributors is bound to be incomplete; please regard the
following as only a selection from the group of people who have
contributed to make this document what it is today.

In alphabetical order:

Martin Duerst (apologies for the lack of internationalization)
Patrik Faltstrom (aftloi)
Paul Hoffman
John Klensin
James Seng (aftloi)


8. Author's Address

Harald Tveit Alvestrand
Cisco Systems
Weidemanns vei 27
7043 Trondheim
NORWAY

EMail: Harald@Alvestrand.no
Phone: +47 73 50 33 52


9. References



[ISO 639]

     ISO 639:1988 (E/F) - Code for the representation of names of
     languages - The International Organization for Standardization,
     1st edition, 1988-04-01 Prepared by ISO/TC 37 - Terminology
     (principles and coordination).

     Note that a new version (ISO 639-1:2000) is in preparation at the
     time of this writing.

[ISO 639-2]

     ISO 639-2:1998 - Codes for the representation of names of
     languages -- Part 2: Alpha-3 code  - edition 1, 1998-11-01, 66
     pages, prepared by a Joint Working Group of ISO TC46/SC4 and ISO
     TC37/SC2.



[ISO 3166]

     ISO 3166:1988 (E/F) - Codes for the representation of names of
     countries - The International Organization for Standardization,
     3rd edition, 1988-08-15.

[RFC 1521]

     MIME Part One: Mechanisms for Specifying and Describing the Format
     of Internet Message Bodies. N. Borenstein, N. Freed. September
     1993.

[RFC 2026]

     The Internet Standards Process -- Revision 3. S. Bradner. October
     1996.

[RFC 2028]

     The Organizations Involved in the IETF Standards Process. R.
     Hovey, S. Bradner. October 1996.

[RFC 2119]

     Key words for use in RFCs to Indicate Requirement Levels. S.
     Bradner. March 1997.

[RFC 2234]

     Augmented BNF for Syntax Specifications: ABNF. D. Crocker, Ed., P.
     Overell. November 1997.

[RFC 2616]

     Hypertext Transfer Protocol -- HTTP/1.1. R. Fielding, J. Gettys,
     J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee. June
     1999.

[RFC 2860]

     Memorandum of Understanding Concerning the Technical Work of the
     Internet Assigned Numbers Authority. B. Carpenter, F. Baker, M.
     Roberts. June 2000.


[IRI-BIDI]

     Internet Identifiers and Bidirectionality. Martin Duerst. Work In
     Progress
     (draft-duerst-iri-bidi-00.txt)