String option vs strings which can be empty?

It’s usually quite easy to decide in favor of using an option type when we are talking about numbers (since the absence of a value is different from, say, the value 0), but for strings the choice is not that obvious.

I’d like some opinions on why one would ever want to use a string option, taking into consideration performance, ease of coding and the possibility of bugs.

Consider a binding to a C string. That binding may allow NULL, and NULL is different from the empty C string (a pointer to a \x00 byte).

So in this case you really want a string option, None for NULL, and (Some "") for the empty C string.

In general, deciding whether to use string vs string option needs to be framed by this question: Do I need to distinguish between the absence of a string and the presence of a string that can be empty? What kind of difficulties may clients run into if they can’t tell the difference?

This case definitely fits the bill and I’d think that it will be obvious to most people. My question was more in the spirit of “should I feel guilty if I don’t do it in my case”, which, from what you write seems to be quite a definite “no”.

Also, what about performance? Is it possible that string option performs better than empty strings (if I have many such cases in my software)?

No. It depends on the context, and since I don’t know yours, it’s not a definite “no”.

Performance in the abstract is meaningless: it depends on the operations you do, the actual data and your hardware; there’s no substitute for measurements. The only thing for sure is that with a string option you will end up allocating more, but whether this is a problem in practice depends, again, on all of the above.
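To make the allocation remark concrete, here is a small sketch that inspects how the standard OCaml runtime represents option values. It uses the Obj module purely for illustration (it is not meant for production code), and it shows representation only, not speed; measuring remains the only way to know whether it matters.

```ocaml
(* None is an immediate value: it needs no heap block. *)
let none_is_immediate : bool =
  Obj.is_int (Obj.repr (None : string option))

(* Some s is a heap block that wraps the string. *)
let some_is_block (s : string) : bool =
  Obj.is_block (Obj.repr (Some s))

(* That block has exactly one field: the pointer to the string. *)
let some_size (s : string) : int =
  Obj.size (Obj.repr (Some s))
```

So each Some costs one extra small block on top of the string itself, while None costs nothing.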

The big advantage of moving the ‘empty’ case into an option type is that you’re offloading work onto the type system, and the type system is what checks you’re not doing anything wrong. With a plain string there’s no guarantee that you’ll always handle the empty string case, but with an option type the type system forces you to check for it.
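A minimal sketch of that difference, assuming a hypothetical event record (the type and function names are illustrative, not from anyone's actual code):

```ocaml
(* Hypothetical record: names are illustrative only. *)
type event = { name : string; description : string option }

(* The compiler reports a non-exhaustive match if either case is left out,
   so the "no description" case cannot be forgotten silently. *)
let describe (e : event) : string =
  match e.description with
  | None -> e.name
  | Some d -> e.name ^ ": " ^ d

(* With a plain string nothing reminds the author to treat "" specially,
   so an empty description yields a dangling ": ". *)
let describe_plain (name : string) (description : string) : string =
  name ^ ": " ^ description
```

Here describe_plain "launch" "" quietly returns "launch: ", whereas describe cannot be written without deciding what None means.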

Furthermore, there’s a significant difference in many situations between the empty string and the non-existent string.

Say you have a function which extracts a field from a database row. Saying that the row exists but the field is "" is quite different from saying the row does not exist.
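As a rough illustration, assuming a toy in-memory "table" in place of a real database (all names here are made up for the example):

```ocaml
(* A toy table mapping row ids to one text field; purely illustrative. *)
let table : (int * string) list = [ (1, "alice"); (2, "") ]

(* None means "no such row"; Some "" means "row exists, field is empty". *)
let field_of_row (id : int) : string option = List.assoc_opt id table

(* Collapsing to a plain string makes the two situations indistinguishable. *)
let field_or_empty (id : int) : string =
  Option.value (field_of_row id) ~default:""
```

With field_of_row, row 2 and the missing row 3 give different answers; with field_or_empty they both give "".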

Note that even C, which isn’t exactly a paragon of clean type systems, distinguishes these cases.

To add on to this answer, the problem described here is called the Semipredicate Problem.

The thing is that in my case it all comes from user input. The user may choose to write any of these:

2018-05-03 event_name "description"
2018-05-03 event_name ""
2018-05-03 event_name

The latter two being equivalent semantically and (I think) whatever implementation I choose, I’ll have to normalize to one of the two.

So I’ll either have

| Result.Error _ -> return None
| Result.Ok "" -> return None
| Result.Ok s -> return (Some s)

or

| Result.Error _ -> return ""
| Result.Ok s -> return s  (* this includes the parsed empty string "" *)
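Written out as complete functions, the two alternatives might look like the sketch below. It assumes the parser hands back a result whose Error case means "no description field on the line"; the function names are hypothetical.

```ocaml
(* Normalize to an option: both "missing" and "" become None. *)
let to_option (parsed : (string, 'e) result) : string option =
  match parsed with
  | Error _ | Ok "" -> None
  | Ok s -> Some s

(* Normalize to a plain string: "missing" becomes "". *)
let to_string (parsed : (string, 'e) result) : string =
  match parsed with
  | Error _ -> ""
  | Ok s -> s
```

Either way the last two input forms end up as the same value, so the choice mostly affects how downstream code has to consume the description.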

It’s not really obvious to me that I gain something from the option, as having an empty string for the description is a non-issue by design (I can’t change the requirements for the already generated files).

So the only real consideration here for me is performance, as most of the entries don’t have a description. But from what I have measured so far, there isn’t really any difference in performance.

It still depends a lot on what you need to do. If you normalize directly you won’t be able to round trip, and if the files need programmatic editing their layout might change, which might confuse or annoy users.

Assuming this is in the context of a datatype with accessors, you can store the unnormalized result and provide accessors for both the unnormalized and the normalized values.
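One possible shape for such a datatype is sketched below. The module, field and function names are assumptions made for the example, and the printer uses %S (OCaml string-literal escaping), which may not match the real file format's quoting rules.

```ocaml
module Entry : sig
  type t
  val make : date:string -> name:string -> string option -> t
  val raw_description : t -> string option  (* as written by the user *)
  val description : t -> string             (* normalized: missing = "" *)
  val to_line : t -> string                 (* prints the original layout *)
end = struct
  (* [raw] keeps the distinction between an absent and an empty description. *)
  type t = { date : string; name : string; raw : string option }

  let make ~date ~name raw = { date; name; raw }

  let raw_description e = e.raw

  let description e = Option.value e.raw ~default:""

  let to_line e =
    match e.raw with
    | None -> Printf.sprintf "%s %s" e.date e.name
    | Some d -> Printf.sprintf "%s %s %S" e.date e.name d
end
```

Most code would call Entry.description and never see the difference, while anything that rewrites the file can use Entry.to_line to leave the user's layout untouched.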

If none of the above matters, just normalize during parsing.
