IETF Policy on Character Sets and Languages
RFC 2277
Document | Type |
RFC - Best Current Practice
(January 1998; Errata)
Also known as BCP 18
Was draft-alvestrand-charset-policy (individual)
|
|
---|---|---|---|
Last updated | 2013-03-02 | ||
Stream | Legacy | ||
Formats | plain text html pdf htmlized with errata bibtex | ||
Stream | Legacy state | (None) | |
Consensus Boilerplate | Unknown | ||
RFC Editor Note | (None) | ||
IESG | IESG state | RFC 2277 (Best Current Practice) | |
Telechat date | |||
Responsible AD | (None) | ||
Send notices to | (None) |
Network Working Group H. Alvestrand Request for Comments: 2277 UNINETT BCP: 18 January 1998 Category: Best Current Practice IETF Policy on Character Sets and Languages Status of this Memo This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. Distribution of this memo is unlimited. Copyright Notice Copyright (C) The Internet Society (1998). All Rights Reserved. 1. Introduction The Internet is international. With the international Internet follows an absolute requirement to interchange data in a multiplicity of languages, which in turn utilize a bewildering number of characters. This document is the current policies being applied by the Internet Engineering Steering Group (IESG) towards the standardization efforts in the Internet Engineering Task Force (IETF) in order to help Internet protocols fulfill these requirements. The document is very much based upon the recommendations of the IAB Character Set Workshop of February 29-March 1, 1996, which is documented in RFC 2130 [WR]. This document attempts to be concise, explicit and clear; people wanting more background are encouraged to read RFC 2130. The document uses the terms 'MUST', 'SHOULD' and 'MAY', and their negatives, in the way described in [RFC 2119]. In this case, 'the specification' as used by RFC 2119 refers to the processing of protocols being submitted to the IETF standards process. Alvestrand Best Current Practice [Page 1] RFC 2277 Charset Policy January 1998 2. Where to do internationalization Internationalization is for humans. This means that protocols are not subject to internationalization; text strings are. Where protocol elements look like text tokens, such as in many IETF application layer protocols, protocols MUST specify which parts are protocol and which are text. [WR 2.2.1.1] Names are a problem, because people feel strongly about them, many of them are mostly for local usage, and all of them tend to leak out of the local context at times. RFC 1958 [RFC 1958] recommends US-ASCII for all globally visible names. This document does not mandate a policy on name internationalization, but requires that all protocols describe whether names are internationalized or US-ASCII. NOTE: In the protocol stack for any given application, there is usually one or a few layers that need to address these problems. It would, for instance, not be appropriate to define language tags for Ethernet frames. But it is the responsibility of the WGs to ensure that whenever responsibility for internationalization is left to "another layer", those responsible for that layer are in fact aware that they HAVE that responsibility. 3. Definition of Terms This document uses the term "charset" to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry [REG]. (Note that this is NOT a term used by other standards bodies, such as ISO). For a definition of the term "coded character set", refer to the workshop report. A "name" is an identifier such as a person's name, a hostname, a domainname, a filename or an E-mail address; it is often treated as an identifier rather than as a piece of text, and is often used in protocols as an identifier for entities, without surrounding text. 3.1. What charset to use All protocols MUST identify, for all character data, which charset is in use. Alvestrand Best Current Practice [Page 2] RFC 2277 Charset Policy January 1998 Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text. Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track. For existing protocols or protocols that move data from existing datastores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8Show full document text