A Uutf decoder works fine for this. It will do encoding guessing, line normalization and scalar value position tracking for you. Some people prefer to use lexer generators; I don’t, but you can have a look at ulex if you are into that sort of thing.
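To make that concrete, here is a minimal sketch of such a decoding loop. It assumes uutf 1.0 or later (where scalar values are the standard library's `Uchar.t`) and a string input; the function name `decode_with_positions` and the choice to substitute U+FFFD for malformed sequences are mine, not something Uutf prescribes.

```ocaml
(* Assumes the uutf library is available (e.g. `opam install uutf`). *)

(* Decode [s] and return every scalar value together with the line and
   column the decoder reports right after decoding it.
   [decode_with_positions] is a hypothetical helper name. *)
let decode_with_positions (s : string) : (int * int * Uchar.t) list =
  (* Normalize newline sequences (CR, LF, CRLF, ...) to U+000A. *)
  let nln = `Readline (Uchar.of_int 0x000A) in
  (* No ?encoding argument is given, so the decoder guesses it. *)
  let d = Uutf.decoder ~nln (`String s) in
  let pos () = (Uutf.decoder_line d, Uutf.decoder_col d) in
  let rec loop acc =
    match Uutf.decode d with
    | `Uchar u ->
        let line, col = pos () in
        loop ((line, col, u) :: acc)
    | `Malformed _ ->
        (* Assumption: replace malformed input with U+FFFD and go on. *)
        let line, col = pos () in
        loop ((line, col, Uutf.u_rep) :: acc)
    | `End -> List.rev acc
    | `Await -> assert false (* only happens with a `Manual source *)
  in
  loop []
```

Calling `decode_with_positions "a\r\nb\n"` should yield four scalar values, with the CRLF pair folded into a single U+000A. A real lexer would dispatch on each `Uchar.t` inside the loop instead of accumulating a list.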
No more and no less than in other languages, I’m afraid. Just because another language has a Unicode string data structure doesn’t mean that structure isn’t broken…