"OCaml -- first impressions"

Indeed you can now iterate over code points, but your indices should iterate over scalar values. Since apparently they still hadn’t fixed c) by 2017, this is the mess you get:

> python3
Python 3.6.1 (default, Apr  4 2017, 09:40:21) 
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\uD800' # unpaired surrogate
'\ud800'
>>> '\uD800'[0]
'\ud800'
>>> '\uD800'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> '\uD83D\uDC2B' # paired surrogates representing U+1F42B
'\ud83d\udc2b'
>>> '\uD83D\uDC2B'[0]
'\ud83d'
>>> '\uD83D\uDC2B'[1]
'\udc2b'
>>> '\uD83D\uDC2B'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> '\U0001F42B'
'🐫'
>>> '\U0001F42B'[0]
'🐫'

So a python3 string doesn’t represent a sequence of Unicode scalar values, which is the minimal model you’d want to work with as a programmer. (Swift made a bolder, more programmer-friendly move for certain scripts: it indexes strings by grapheme clusters.)
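
To make the distinction concrete, here is a minimal sketch of a check for whether a python3 str is a sequence of scalar values. The names `is_scalar_value` and `is_scalar_string` are made up for illustration and are not part of any library; the only thing assumed is the Unicode definition of a scalar value, i.e. any code point outside the surrogate range U+D800..U+DFFF.

```python
# Hypothetical helpers, not a standard API: a Unicode scalar value is any
# code point in 0..0x10FFFF except the surrogate range 0xD800..0xDFFF.

def is_scalar_value(cp: int) -> bool:
    """Return True if the integer cp is a Unicode scalar value."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_scalar_string(s: str) -> bool:
    """Return True if every element of s is a scalar value.

    A python3 str is indexed by code point, so ord() on each element
    can yield a lone surrogate; this check rejects such strings.
    """
    return all(is_scalar_value(ord(ch)) for ch in s)
```

With these definitions, `is_scalar_string('\uD800')` and `is_scalar_string('\uD83D\uDC2B')` are both False, while `is_scalar_string('\U0001F42B')` is True. That is the invariant you would want the string type itself to enforce, rather than discovering the problem when `.encode('utf-8')` raises.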
