Js_of_ocaml - unescaping text for dynamic html generation

I’ve been trying out js_of_ocaml, and it’s been going pretty well so far, but I have run into an issue that I can’t seem to get around.

I want to convert a string (obtained dynamically from a Yojson JSON response) to an HTML element and insert it into a Tyxml structure. Unfortunately, no matter what I try, the HTML tags end up being encoded, i.e. the final output text would be:

Some text... &lt;a href="http://www.fsf.org"&gt;fsf&lt;/a&gt;

Which means that the browser doesn’t render the link correctly, and instead displays the tags explicitly.

To generate the node, I was initially using:

div [txt text]

where text is the extracted text from the JSON response.

Seeing as this didn’t work, I then tried using the Unsafe.data operation which is documented as inserting a raw string:

div [Unsafe.data text]

but the issue persists.

I’ve also tried using Js.decodeURI and Js.unescape to preprocess the string I extract from the JSON (in case the issue arises from JSON encoding), but this doesn’t help.

Rather annoyingly, when I try to print the text to the console, it seems to auto-hide the escape sequences, meaning that I can’t actually work out whether the error is due to Tyxml or to the JSON text itself.

Is there something I’m doing wrong? Or is this operation not supported by Tyxml?

Unsafe.data should work for including arbitrary non-escaped HTML (I use it every day with the output of omd and pandoc).

If entities like &lt; are displayed by the browser when using div [txt text], it means that they are already there in the text (Tyxml.Html.txt escapes everything properly), so it’s likely that the entities are already in the JSON you receive.

You can also try what happens with something like div [Unsafe.data (text ^ " <b>hello world</b>")]
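One way to check whether the entities are already present in the JSON text, without relying on how the console displays it, is to search the string directly. This is only a minimal sketch using the standard library; the names contains and has_encoded_entities are made up for illustration, and the list of entities is an assumption about what the server might send.

```ocaml
(* Returns true if [needle] occurs somewhere in [haystack]. *)
let contains ~needle haystack =
  let n = String.length needle and h = String.length haystack in
  let rec go i =
    i + n <= h && (String.sub haystack i n = needle || go (i + 1))
  in
  n = 0 || go 0

(* Hypothetical helper: true if the text already carries HTML entities,
   i.e. the escaping happened before the string reached Tyxml. *)
let has_encoded_entities text =
  List.exists
    (fun entity -> contains ~needle:entity text)
    [ "&lt;"; "&gt;"; "&amp;"; "&quot;" ]

let () =
  let text = "Some text... &lt;a href=x&gt;fsf&lt;/a&gt;" in
  if has_encoded_entities text then
    print_endline "entities are already in the string"
  else
    print_endline "no entities found in the string"
```

If this reports entities for the string extracted from the JSON, the escaping happened upstream rather than in Tyxml.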

div [Unsafe.data (text ^ "<b> hello </b>")] ends up also escaping the <b> tags.

This still happens if "<b> hello </b>" is the only argument passed to Unsafe.data or if running Unsafe.data ("<div><b>hello world!</b></div>").

For reference, I am using Tyxml_js.To_dom.of_element to convert the Tyxml term to a js_of_ocaml dom element at a later point - is this the cause of the issue?

Additionally, I don’t think this is an aliasing issue due to some other module shadowing Unsafe, because using merlin locate, the data function aliases to a function named Xml.encodedpcdata.

Digging more into the js_of_ocaml code, I’ve found that Html.Unsafe.data calls Xml.encodedpcdata, which is implemented in JavaScript using document.createTextNode. This causes the raw text to be inserted as a text node, so the browser displays the tags literally, as if they had been escaped with HTML entities. This seems to violate the documentation, which suggests that Html.Unsafe.data should insert unescaped data.

Is there a way to create HTML elements without calling out to document.createTextNode?

Ok, solved. It turns out it was more of a JavaScript problem than a Tyxml problem: in order to convert a string into a Tyxml element, the best way to do it seems to be through the facilities provided by the browser, i.e. create a Dom_html element with the string as its contents, and then cast this element back to a Tyxml element.

For future reference, if someone else runs into this problem, the following function solved my problems:

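  (* Assumes Dom_html, Js and Tyxml_js are in scope (e.g. via open Js_of_ocaml). *)
  (* Let the browser parse an HTML string, then wrap the result as a Tyxml element. *)
  let string_to_html txt =
    let module T = Tyxml_js in
    let div = Dom_html.createDiv Dom_html.window##.document in
    (* Setting innerHTML parses the markup; only do this with trusted input. *)
    div##.innerHTML := Js.string txt;
    T.Of_dom.of_div div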
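
For anyone wanting to see it in context, here is a rough sketch of how the result could be placed inside a larger Tyxml tree and attached to the page. It assumes a recent js_of_ocaml built with js_of_ocaml-ppx and js_of_ocaml-tyxml (hence the Js_of_ocaml and Js_of_ocaml_tyxml opens, which older versions may not need), and that the script runs after the document body exists. The heading text and the sample string are placeholders.

```ocaml
(* Assumption: recent js_of_ocaml with js_of_ocaml-ppx and js_of_ocaml-tyxml. *)
open Js_of_ocaml
open Js_of_ocaml_tyxml

(* Same approach as string_to_html above, repeated so this compiles on its own. *)
let string_to_html txt =
  let div = Dom_html.createDiv Dom_html.document in
  div##.innerHTML := Js.string txt;
  Tyxml_js.Of_dom.of_div div

let () =
  let open Tyxml_js.Html in
  (* The parsed fragment is an ordinary Tyxml div and composes like any other. *)
  let fragment =
    string_to_html "Some text... <a href=\"http://www.fsf.org\">fsf</a>"
  in
  let page = div [ h1 [ txt "Result" ]; fragment ] in
  (* Convert back to a DOM node and attach it to the document body. *)
  Dom.appendChild Dom_html.document##.body (Tyxml_js.To_dom.of_div page)
```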

What you want is to parse the HTML syntax and turn it into HTML trees. Unsafe.data creates text nodes. It’s a mistake to assume that HTML textual syntax and HTML trees are the same, and Tyxml will not do the conversion automagically for you. It happens to work when the backend is the textual representation (aka, the Tyxml.Html module), but the API as a whole provides no such guarantee, and naturally it doesn’t work with Tyxml_js.
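
To illustrate the difference with the textual backend, here is a small sketch that prints the same string through txt and through Unsafe.data. It assumes the tyxml package (version 4.3 or later, for txt) is installed; render is a helper name made up for this example.

```ocaml
(* Assumes the tyxml package (>= 4.3, for [txt]) is installed. *)

(* Hypothetical helper: print a Tyxml.Html element to a string. *)
let render elt = Format.asprintf "%a" (Tyxml.Html.pp_elt ()) elt

let () =
  let open Tyxml.Html in
  let raw = "<b>hello</b>" in
  (* [txt] builds a text node; the printer escapes it on output. *)
  print_endline (render (div [ txt raw ]));
  (* [Unsafe.data] is emitted verbatim, but only by this textual backend. *)
  print_endline (render (div [ Unsafe.data raw ]))
```

With Tyxml_js there is no printing step, since both calls produce DOM text nodes, which is why the tags show up literally in the browser.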

You can do this conversion in OCaml land in a generic way with lambdasoup+tyxml. For DOM trees specifically, you can indeed use innerHTML in JS land as you did. In all cases, it’s a code injection waiting to happen, so I suggest you be careful and sanitize your inputs. :)


Thanks for the lambdasoup reference - it seems to be a better, more generic solution than the one I had found, and would definitely be my first choice in the future.

With regard to the code injection, I’m working under the assumption that the external API I’m calling will already have validated and sanitized the HTML it sends back to me, but I guess at some point I should add some sanitization of my own.