UTF-7 - A Mail-Safe Transformation Format of Unicode
RFC 1642
Document type: RFC, Experimental (July 1994; no errata)
Obsoleted by: RFC 2152
Was: draft-goldsmith-mime-utf7 (individual)
Authors: Mark Davis, David Goldsmith
Last updated: 2013-03-02
Stream: Legacy
Network Working Group                                       D. Goldsmith
Request for Comments: 1642                                      M. Davis
Category: Experimental                                    Taligent, Inc.
                                                               July 1994

                                 UTF-7

             A Mail-Safe Transformation Format of Unicode

Status of this Memo

   This memo defines an Experimental Protocol for the Internet
   community.  This memo does not specify an Internet standard of any
   kind.  Distribution of this memo is unlimited.

Abstract

   The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E)
   jointly define a 16 bit character set (hereafter referred to as
   Unicode) which encompasses most of the world's writing systems.
   However, Internet mail (STD 11, RFC 822) currently supports only
   7-bit US ASCII as a character set. MIME (RFC 1521 and RFC 1522)
   extends Internet mail to support different media types and character
   sets, and thus could support Unicode in mail messages. MIME neither
   defines Unicode as a permitted character set nor specifies how it
   would be encoded, although it does provide for the registration of
   additional character sets over time.

   This document describes a new transformation format of Unicode that
   contains only 7-bit ASCII characters and is intended to be readable
   by humans in the limiting case that the document consists of
   characters from the US-ASCII repertoire. It also specifies how this
   transformation format is used in the context of RFC 1521, RFC 1522,
   and the document "Using Unicode with MIME".

Motivation

   Although other transformation formats of Unicode exist and could
   conceivably be used in this context (most notably UTF-1 and UTF-8,
   also known as UTF-2 or UTF-FSS), they suffer the disadvantage that
   they use octets in the range decimal 128 through 255 to encode
   Unicode characters outside the US-ASCII range. Thus, in the context
   of mail, those octets must themselves be encoded. This requires
   putting text through two successive encoding processes, and leads to
   a significant expansion of characters outside the US-ASCII range,
   putting non-English speakers at a disadvantage.
   For example, using UTF-FSS together with the Quoted-Printable
   content transfer encoding of MIME represents US-ASCII characters in
   one octet, but other characters may require up to nine octets.

Overview

   UTF-7 encodes Unicode characters as US-ASCII, together with shift
   sequences to encode characters outside that range. For this purpose,
   one of the characters in the US-ASCII repertoire is reserved for use
   as a shift character.

   Many mail gateways and systems cannot handle the entire US-ASCII
   character set (those based on EBCDIC, for example), and so UTF-7
   contains provisions for encoding characters within US-ASCII in a way
   that all mail systems can accommodate.

   UTF-7 should normally be used only in the context of 7 bit
   transports, such as mail and news. In other contexts, straight
   Unicode or UTF-8 is preferred.

   See the document "Using Unicode with MIME" for the overall
   specification on usage of Unicode transformation formats with MIME.

Definitions

   First, the definition of Unicode:

      The 16 bit character set Unicode is defined by "The Unicode
      Standard, Version 1.1". This character set is identical with the
      character repertoire and coding of the international standard
      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
      Subset=300; Implementation Level=3.

      Note. Unicode 1.1 further specifies the use and interaction of
      these character codes beyond the ISO standard. However, any valid
      10646 BMP (Basic Multilingual Plane) sequence is a valid Unicode
      sequence, and vice versa; Unicode supplies interpretations of
      sequences on which the ISO standard is silent as to
      interpretation.
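
The expansion described under Motivation and the shift-sequence idea described under Overview can be compared with a short Python sketch. It is illustrative only: the sample character is an arbitrary choice, and Python's built-in "utf-7" codec follows the successor specification, RFC 2152, so its output may differ in detail from the format defined in this memo.

```python
import quopri

# Sample character outside US-ASCII (an arbitrary choice for illustration).
text = "\u263a"  # WHITE SMILING FACE

# Route 1: UTF-8 (called UTF-FSS in this memo), then Quoted-Printable so
# the result can pass through a 7-bit mail transport.
utf8_octets = text.encode("utf-8")          # three octets, all >= 128
utf8_qp = quopri.encodestring(utf8_octets)  # each octet becomes "=XX"

# Route 2: UTF-7 via Python's built-in codec (implements RFC 2152, the
# successor to this memo, so treat the exact bytes as an assumption).
utf7 = text.encode("utf-7")

print(len(utf8_qp), utf8_qp)  # 9 b'=E2=98=BA'
print(len(utf7), utf7)        # 5 b'+Jjo-'

# Text drawn only from letters and spaces passes through unchanged.
print("Hi Mom".encode("utf-7"))  # b'Hi Mom'
```

The three-octet UTF-8 form grows to nine octets under Quoted-Printable, which matches the worst case quoted in the Motivation section, while the shifted UTF-7 form of the same character takes five.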
   Next, some handy definitions of US-ASCII character subsets:

      Set D (directly encoded characters) consists of the following
      characters (derived from RFC 1521, Appendix B): the upper and
      lower case letters A through Z and a through z, the 10 digits 0-9,
      and the following nine special characters (note that "+" and "="
      are omitted):

               Character   ASCII & Unicode Value (decimal)
                  '           39
                  (           40
                  )           41
                  ,           44
                  -           45
                  .           46
                  /           47
                  :           58
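
As a rough sketch, the part of Set D that is listed above can be written as a Python set and used for membership checks. The names LISTED_SPECIALS, SET_D_LISTED and in_listed_set_d are hypothetical. The text announces nine special characters, but the table in this copy stops after ":", so only the eight listed characters are included and the set below is incomplete by one member.

```python
import string

# Special characters exactly as they appear in the table above.
# Assumption: the source announces nine, but this copy lists only eight,
# so this string is knowingly one character short of the full Set D.
LISTED_SPECIALS = "'(),-./:"

# Letters A-Z and a-z, digits 0-9, plus the listed special characters.
SET_D_LISTED = (
    set(string.ascii_letters) | set(string.digits) | set(LISTED_SPECIALS)
)


def in_listed_set_d(ch: str) -> bool:
    """Return True if ch is in the portion of Set D listed above."""
    return ch in SET_D_LISTED


# The decimal values agree with the table.
print([ord(c) for c in LISTED_SPECIALS])  # [39, 40, 41, 44, 45, 46, 47, 58]

# "+" and "=" are deliberately omitted from Set D.
print(in_listed_set_d("A"), in_listed_set_d("+"), in_listed_set_d("="))
```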