Unicode Format for Network Interchange
RFC 5198
Document type: RFC - Proposed Standard (March 2008; Errata)
Obsoletes: RFC 698
Updates: RFC 854
Was: draft-klensin-net-utf8 (individual in app area)
Authors: Michael Padlipsky, John Klensin
Last updated: 2020-01-21
Stream: IETF
WG state: (None)
Document shepherd: No shepherd assigned
IESG state: RFC 5198 (Proposed Standard)
Consensus boilerplate: Unknown
Responsible AD: Chris Newman
Send notices to: (None)
Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track


                 Unicode Format for Network Interchange

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   The Internet today is in need of a standardized form for the
   transmission of internationalized "text" information, paralleling the
   specifications for the use of ASCII that date from the early days of
   the ARPANET.  This document specifies that format, using UTF-8 with
   normalization and specific line-ending sequences.

Table of Contents

   1.  Introduction
     1.1.  Requirement for a Standardized Text Stream Format
     1.2.  Terminology
   2.  Net-Unicode Definition
   3.  Normalization
   4.  Versions of Unicode
   5.  Applicability and Stability of this Specification
     5.1.  Use in IETF Applications Specifications
     5.2.  Unicode Versions and Applicability
   6.  Security Considerations
   7.  Acknowledgments
   Appendix A.  History and Context
   Appendix B.  The ASCII NVT Definition
   Appendix C.  The Line-Ending Problem
   Appendix D.  A Note about Related Future Work
   References
     Normative References
     Informative References

1.  Introduction

1.1.  Requirement for a Standardized Text Stream Format

   Historically, Internet protocols have been largely ASCII-based and
   references to "text" in protocols have assumed ASCII text and
   specifically text in Network Virtual Terminal ("NVT") or "Network
   ASCII" form (see Appendix A and Appendix B).  Protocols and formats
   that have moved beyond ASCII have included arrangements to
   specifically identify the character set and often the language being
   used.

   In our more internationalized world, "text" clearly no longer equates
   unambiguously to "network ASCII".  Fortunately, however, we are
   converging on Unicode [Unicode] [ISO10646] as a single international
   interchange character coding and no longer need to deal with
   per-script standards for character sets (e.g., one standard for each
   of Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
   languages that are usually considered to share a script, such as
   French, German, or Swedish).

   Unfortunately, though, while it is certainly time to define a
   Unicode-based text type for use as a common text interchange format,
   "use Unicode" involves even more ambiguity than "use ASCII" did
   decades ago.  Unicode identifies each character by an integer, called
   its "code point", in the range 0-0x10ffff.  These integers can be
   encoded into byte sequences for transmission in at least three
   standard and generally-recognized encoding forms, all of which are
   completely defined in The Unicode Standard and the documents cited
   below:

   o  UTF-8 [RFC3629] defines a variable-length encoding that may be
      applied uniformly to all code points.

   o  UTF-16 [RFC2781] encodes the range of Unicode characters whose
      code points are less than 65536 straightforwardly as 16-bit
      integers, and provides a "surrogate" mechanism for encoding larger
      code points in 32 bits.

   o  UTF-32 (also known as UCS-4) simply encodes each code point as a
      32-bit integer.

   Older forms and nomenclature, such as the 16-bit UCS-2, are now
   strongly discouraged.

   As with ASCII, any of these forms may be used with different
   line-ending conventions.  That flexibility can be an additional
   source of confusion with, e.g., index (offset) references into
   documents based on character counts.
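
   The three encoding forms described above, and the way line-ending
   conventions shift character offsets, can be illustrated with a short
   Python sketch.  It is not part of the specification: the helper name
   encoded_lengths and the sample characters are assumptions chosen for
   illustration, and the big-endian codec variants are used only so that
   no byte order mark is counted.

```python
# Illustrative sketch only; the helper name and sample characters are
# assumptions for this example, not anything defined by the specification.

def encoded_lengths(ch: str) -> dict:
    """Return how many bytes each encoding form uses for one code point."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # Big-endian variants are used so no byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

# One ASCII letter, two other code points below 65536, and one code point
# above 65535 (which UTF-16 encodes as a surrogate pair).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# The same text with different line endings gives different character
# offsets for the word "two".
text_lf = "one\ntwo"
text_crlf = "one\r\ntwo"
offset_lf = text_lf.index("two")
offset_crlf = text_crlf.index("two")
print(offset_lf, offset_crlf)
```

   UTF-8 uses between one and four bytes for these samples, UTF-16 uses
   two bytes except for the code point above 65535, and UTF-32 always
   uses four.  The offset of "two" differs by one between the LF and
   CRLF versions, which is the kind of index confusion noted above.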