(back to the topic of Unicode, sorry)
First, about the visibility given to UTF-8-capable functions: I wasn’t only referring to the String documentation proper, but also to the wider standard library documentation, the manual, and learning resources more generally.
Now for specifics, restricting myself to the official documentation. On the byte/character front, the sentence from the String doc you’re quoting is rather clear indeed, but it assigns a non-standard meaning to an otherwise well-known word (“character”). This is problematic because the non-standard meaning is used pervasively in places that are not hierarchically below the String module (such as the doc-string of Stdlib.input, or much earlier in the language manual). You have to know already that this definition exists, in this specific place in the documentation.
On the visibility of the UTF functions: the UTF codecs exist, but nothing points to them, not even the paragraph about UTF-8 in the header of String (so upon reading that paragraph I might be tempted to believe that, in the purest C/OCaml tradition, UTF-8 is allowed but no function is actually provided to deal with it). In Bytes there is not even a discussion of Unicode, and the UTF functions are buried below a pack of unsafe, arcane stuff. Also, there is no UTF-capable input/output function.
Luckily, some of my concerns with the documentation are quick to fix, so I turned the easy bits into a constructive PR. But I don’t have much time for a more in-depth rewrite.
I would have loved to contribute Unicode support to re (the regex library). I find this to be one of the significant missing pieces w.r.t. Unicode (because you can’t just implement it yourself). Unfortunately I expect it to be a (fun but) somewhat complex task, and my spare time is allotted to other things.
Perhaps Python has other issues I’m not aware of, but it seems to me that the specific issue you’re pointing out is that the '\uXXXX' escapes in string literals have the surprising semantics of letting you create UTF-16 surrogates (i.e. code points that are not valid Unicode scalar values). I would expect the syntax '\uD800' to trigger an error. Isn’t it easy to fix?
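To make the surprise concrete, here is a minimal sketch of the behaviour as I understand it in CPython 3 (the variable names are only for illustration): the literal is accepted silently, and the problem only shows up once the string has to be encoded.

```python
# A lone high surrogate written with a \u escape is accepted in a str literal.
s = '\ud800'
length = len(s)                          # the string holds a single code point
is_surrogate = 0xD800 <= ord(s) <= 0xDFFF

# The error only surfaces later, when the string must be encoded as UTF-8.
try:
    s.encode('utf-8')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(length, is_surrogate, encode_failed)
```

So the invalid value is not rejected where it is written, but at some arbitrary later point where the string reaches an encoder.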
(Unicode) string-indexing presupposes that you’re representing your Unicode string as an array of Unicode code points, right? That’s wasteful of memory, isn’t it?
(Diverging from the initial topic more and more:) Not necessarily. As I said, Python’s str has constant-time indexing thanks to a fixed-width encoding, but it adjusts the character width depending on the data it contains (see PEP 393): either Latin-1, UCS-2 or UCS-4, i.e. 1 byte, 2 bytes or 4 bytes per code point. Since code points greater than U+FFFF are rare (essentially: rare/ancient CJKV ideograms, antique or endangered writing systems, or fancy emojis), you’d rarely (if ever) resort to UCS-4. Even if you do need the full Unicode range, and space is a concern, you may implement a denser packing than UCS-4 (because Unicode code points are in fact 21-bit integers, not 32-bit). Also, one can imagine using a variable-width encoding such as UTF-16, where each string additionally maintains the set of (code-point-wise) indexes at which several code units are used; code-point-wise indexing would then be logarithmic-time in the worst case, and constant-time in the common case where you have no (or no more than a fixed number of) large characters. You can do the same with UTF-8 if you anticipate that most of your characters will be ASCII. If you’re serious about large portions of text that you want to move around, copy, cut, concatenate, share… you would use ropes or something like that, so you would end up with that kind of indexing-by-searching-in-a-tree anyway.
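As a rough illustration of the UTF-16-plus-index idea, here is a small Python sketch. The class IndexedUTF16 and its fields are hypothetical names of mine, not an existing library, and this is a toy under simplifying assumptions (immutable text, no slicing), not a tuned implementation. It stores 16-bit code units plus a sorted list of the code-point indexes that need two units, and uses a binary search on that list to translate a code-point index into a code-unit position.

```python
import bisect
from array import array


class IndexedUTF16:
    """Hypothetical sketch: UTF-16 code units plus an index of wide characters."""

    def __init__(self, text):
        self.units = array('H')  # 16-bit code units
        self.wide = []           # ascending code-point indexes of chars above U+FFFF
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                # Encode as a surrogate pair and remember where it happened.
                cp -= 0x10000
                self.units.append(0xD800 | (cp >> 10))
                self.units.append(0xDC00 | (cp & 0x3FF))
                self.wide.append(i)
            else:
                self.units.append(cp)
        self.length = len(text)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # k = number of wide characters strictly before code point i;
        # each of them shifts the code-unit position by one.
        k = bisect.bisect_left(self.wide, i)
        pos = i + k
        if k < len(self.wide) and self.wide[k] == i:
            hi, lo = self.units[pos], self.units[pos + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[pos])


s = "a\U0001F600b\U0001D11Ec"   # two characters outside the BMP
t = IndexedUTF16(s)
roundtrip = [t[i] for i in range(len(t))]
print(roundtrip == list(s), len(t.units), t.wide)
```

When the wide list is empty, the binary search is trivial and indexing is a plain array access; the logarithmic cost only appears for strings that actually contain characters above U+FFFF.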