-rw-r--r-- | Doc/library/unicodedata.rst | 6 |
-rw-r--r-- | Doc/reference/expressions.rst | 8 |
2 files changed, 12 insertions, 2 deletions
diff --git a/Doc/library/unicodedata.rst b/Doc/library/unicodedata.rst
index 017d4ee785..ec788c5f06 100644
--- a/Doc/library/unicodedata.rst
+++ b/Doc/library/unicodedata.rst
@@ -107,7 +107,7 @@ the following functions:
    based on the definition of canonical equivalence and compatibility equivalence.
    In Unicode, several characters can be expressed in various way. For example, the
    character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+   the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
 
    For each character, there are two normal forms: normal form C and normal form D.
    Normal form D (NFD) is also known as canonical decomposition, and translates
@@ -126,6 +126,10 @@ the following functions:
    (NFKC) first applies the compatibility decomposition, followed by the canonical
    composition.
 
+   Even if two unicode strings are normalized and look the same to
+   a human reader, if one has combining characters and the other
+   doesn't, they may not compare equal.
+
    .. versionadded:: 2.3
 
 In addition, the module exposes the following constant:
diff --git a/Doc/reference/expressions.rst b/Doc/reference/expressions.rst
index 3364fd6995..a1c4185dd3 100644
--- a/Doc/reference/expressions.rst
+++ b/Doc/reference/expressions.rst
@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
 
 * Strings are compared lexicographically using the numeric equivalents (the
   result of the built-in function :func:`ord`) of their characters. Unicode and
-  8-bit strings are fully interoperable in this behavior.
+  8-bit strings are fully interoperable in this behavior. [#]_
 
 * Tuples and lists are compared lexicographically using comparison of
   corresponding elements. This means that to compare equal, each element must
@@ -1328,6 +1328,12 @@ groups from right to left).
    cases, Python returns the latter result, in order to preserve that
    ``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
 
+.. [#] While comparisons between unicode strings make sense at the byte
+   level, they may be counter-intuitive to users. For example, the
+   strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+   even though they both represent the same unicode character (LATIN
+   CAPTITAL LETTER C WITH CEDILLA).
+
 .. [#] The implementation computes this efficiently, without constructing lists
    or sorting.
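
The behavior this patch documents can be checked with a minimal sketch like the one below. It assumes an interpreter that accepts ``u""`` literals (Python 2, or Python 3.3 and later). The helper name ``canonically_equal`` is hypothetical and not part of ``unicodedata``. The canonical decomposition of U+00C7 is the base letter followed by the combining mark (U+0043 U+0327), so the sketch uses that order for the decomposed string and also shows that the mark-first order written in the patch does not normalize to U+00C7.

```python
import unicodedata

composed = u"\u00C7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u"\u0043\u0327"  # LATIN CAPITAL LETTER C, then COMBINING CEDILLA
mark_first = u"\u0327\u0043"  # COMBINING CEDILLA first, as written in the patch


def canonically_equal(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Plain comparison works code point by code point, so these differ.
print(composed == decomposed)                   # False

# Normalizing both sides to the same form makes them compare equal.
print(canonically_equal(composed, decomposed))  # True

# A combining mark attaches to the character before it, so the
# mark-first sequence is not equivalent to U+00C7.
print(canonically_equal(composed, mark_first))  # False
```

Normalizing both operands to the same form (NFC here, though NFD works equally well) before comparing is one way to get the comparison a human reader would expect.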