digitalmars.D - UTF and U+FFFE and U+FFFF

Since there appears to be some confusion on the status of U+FFFE and U+FFFF, I
thought I'd quote from the FAQ on the Unicode website itself, at URL:
http://www.unicode.org/faq/utf_bom.html

"Q: What is a UTF?

"A: A Unicode transformation format (UTF) is an algorithmic mapping from every
Unicode code point (except surrogate code points) to a unique byte sequence. The
ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the
two terms are merely synonyms for the same concept. 

"Each UTF is reversible, thus every UTF supports lossless round tripping:
mapping from any Unicode coded character sequence S to a sequence of bytes and
back will produce S again. To ensure round tripping, a UTF mapping must also
map all code points that are not valid Unicode characters to unique byte
sequences. These invalid code points are the 66 noncharacters (including FFFE
and FFFF), as well as unpaired surrogates."
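
For anyone wondering where the figure of 66 comes from, here is a small Python sketch that enumerates the noncharacters. It assumes the usual definition: the block U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes. The name is_noncharacter is my own, not something from the FAQ.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    # 32 code points in the contiguous block U+FDD0..U+FDEF ...
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # ... plus U+nFFFE and U+nFFFF for each plane n = 0..16 (34 code points).
    return (cp & 0xFFFE) == 0xFFFE


# Scan the whole code space, U+0000..U+10FFFF.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(len(noncharacters))
print(0xFFFE in noncharacters, 0xFFFF in noncharacters)
```

The list has 32 + 34 = 66 entries, and U+FFFE and U+FFFF are both among them.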


The phrase "every Unicode code point (except surrogate code points)" implies
that surrogate code points (those in the range 0xD800 to 0xDFFF) need not be
encodable (although, curiously, the second paragraph says "as well as unpaired
surrogates", which seems to contradict this). It's a moot point for UTF-16,
however, since a surrogate code point CANNOT be expressed in UTF-16 as a
character in its own right: those 16-bit values are reserved for the two halves
of a surrogate pair. The phrase "including FFFE and FFFF", on the other hand,
is quite unambiguous.
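
To make the distinction concrete, here is a rough Python 3 sketch of the round-trip test the FAQ describes. It only shows how CPython's built-in codecs behave, which is one implementation's reading of the rules and not the standard itself. The helper name round_trips is made up for this post.

```python
def round_trips(cp: int, encoding: str) -> bool:
    """Encode the single code point cp and decode it again.

    Returns True only if the original one-character string comes back.
    """
    s = chr(cp)
    try:
        return s.encode(encoding).decode(encoding) == s
    except UnicodeError:
        # CPython's strict codecs refuse lone surrogates.
        return False


# Explicit byte orders are used so that no BOM is added or stripped.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc,
          round_trips(0xFFFE, enc),   # noncharacter
          round_trips(0xFFFF, enc),   # noncharacter
          round_trips(0xD800, enc))   # unpaired surrogate
```

With these codecs, U+FFFE and U+FFFF survive the round trip in all three encodings, while the lone surrogate U+D800 is rejected at the encoding step.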

Jill
Jul 12 2004