Unicode handling in Sexp.to_string

Hello! I am trying to understand how ppx_sexp_conv handles Unicode characters. Consider the following type:

type token = TEXT of string [@@deriving sexp]

If I run the following commands in the toplevel I get what I expect.

# #require "ppx_jane";;
# let x = TEXT "hello";;
# Sexp.to_string (sexp_of_token x);;
- : string = "(TEXT hello)"

Similarly, if I run the same command but define the value using Unicode escape sequences in the string literal, I get the same output.

# let y = TEXT "\u{0068}\u{0065}\u{006C}\u{006C}\u{006F}";;
# Sexp.to_string (sexp_of_token y);;
- : string = "(TEXT hello)"

However, if I put in the character あ (U+3042), I get output that I don’t really understand.

# let y = TEXT "\u{3042}";;
val y : token = TEXT "あ"
# Sexp.to_string (sexp_of_token y);;
- : string = "(TEXT\"\\227\\129\\130\")"

Maybe I am missing something really obvious, but I tried to decode those code points and they don’t resolve to the character I put in. I would really appreciate it if someone could shed some light on this!

This seems to be simply the UTF-8 encoding of your string, with each byte written as a decimal escape (\ddd):

# let a = "\227\129\130";;
val a : string = "あ"

# String.is_valid_utf_8 a;;
- : bool = true

# String.get_utf_8_uchar a 0 |> Uchar.utf_decode_uchar |> Uchar.to_int |> Printf.sprintf "0x%x";;
- : string = "0x3042"
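To see the same escaping without Sexplib in the picture, here is a minimal sketch using only the standard library. It assumes (I have not checked Sexplib's source) that Sexp.to_string quotes non-ASCII bytes in the same OCaml-style decimal form that String.escaped produces. The names s and escaped are just for illustration.

```ocaml
(* The string literal holds the three UTF-8 bytes of U+3042. *)
let s = "\u{3042}"

(* String.escaped rewrites every byte outside printable ASCII as a
   three-digit decimal escape, following OCaml's lexical conventions. *)
let escaped = String.escaped s

(* Scanf.unescaped reverses the escaping, so no information is lost. *)
let roundtrip = Scanf.unescaped escaped

let () =
  print_endline escaped;
  Printf.printf "round trip ok: %b\n" (String.equal roundtrip s)
```

So the escaped form is just another spelling of the same bytes; reading it back gives the original string.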


Thanks! I didn’t know you could define a string like that. I was using some online decoder and messed up there somehow.

あ (U+3042) is encoded in UTF-8 as the three bytes E3 81 82. Converted to decimal, those are 227 129 130, which is exactly what the escapes in your output show.
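If you would rather not trust an online decoder, the byte values can be checked from OCaml itself. This is a small sketch using only the standard library (OCaml 4.07 or later); utf8_bytes is a hypothetical helper name, not something from Sexplib.

```ocaml
(* Encode a single Unicode scalar value as UTF-8 and return its bytes
   as integers. utf8_bytes is a helper made up for this example. *)
let utf8_bytes (u : Uchar.t) : int list =
  let buf = Buffer.create 4 in
  Buffer.add_utf_8_uchar buf u;
  Buffer.contents buf |> String.to_seq |> Seq.map Char.code |> List.of_seq

let () =
  (* Print each byte of U+3042 in decimal and in hexadecimal. *)
  utf8_bytes (Uchar.of_int 0x3042)
  |> List.iter (fun b -> Printf.printf "%d (0x%02X)\n" b b)
```

This prints 227 (0xE3), 129 (0x81) and 130 (0x82), one per line, matching the three escapes in the Sexp output.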