Why can’t I create a project with non-ASCII characters?

Is there a technical reason why one cannot create projects whose names use non-ASCII characters? Hasn’t Latin-1 been supported since at least the ’80s, Unicode since the ’90s & UTF-8 since the ’00s? I can understand limiting names to letters & excluding symbols/emoji, but an ASCII-only rule seems both restrictive & Anglocentric/non-inclusive. If I am making a project called “แมว”, shouldn’t I be able to at least label my project just that?

$ mkdir café && cd café
$ vi café.opam
$ grep "name" café.opam 
name: "café"
$ opam lint
$PWD/café.opam: Errors.
    error  3: File format error in 'name' at line 2, column 0: while expecting pkg-name: Invalid character in package name "caf\195\169"
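
For anyone puzzled by the escaped name in that message: the é in café is stored as two UTF-8 bytes (195 and 169 in decimal), and OCaml's %S format prints any byte outside printable ASCII as a decimal escape. A minimal sketch, assuming only the standard library (the binding `name` is just for illustration):

```ocaml
(* "café" written with explicit UTF-8 bytes, so the example does not
   depend on how this file itself is encoded. *)
let name = "caf\xC3\xA9"

let () =
  (* String.length counts bytes, not user-perceived characters. *)
  Printf.printf "length in bytes: %d\n" (String.length name);
  (* Print each byte as a decimal code. *)
  String.iter (fun c -> Printf.printf "%d " (Char.code c)) name;
  print_newline ();
  (* %S escapes bytes outside printable ASCII as \ddd. *)
  Printf.printf "%S\n" name
```

So the name opam checks is five bytes long, and the last two fall outside the byte ranges it accepts.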

You need a tad more infrastructure to deal with identifier matching, spoofing, rendering etc.

Why? Limiting to letters is yet another centrism, not all writing systems are alphabetic.

Can’t look-alike characters like a and а be confusing?

They can.

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色.

Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

GitHub - whitequark/ocaml-m17n: Multilingualization for the OCaml source code

m17n seems to take a reasonable approach by warning when writing systems are mixed without separation. Even limiting each name to a single script seems reasonable.

Limiting to letters is yet another centrism, not all writing systems are alphabetic.

My bad for using an imprecise, colloquial definition of “letter”–no intent of excluding abugidas, abjads, logographies, etc. Perhaps character? Non-punctuation symbols? Ideograms? Writing system codepoint? Surely, there’s a more reasonable cut-off than a 1969 American Standard limited to 7 bits per character…

Which brings us back to “you need a tad more infrastructure” – and Unicode understanding (even for end users)…

Let’s not pretend it doesn’t open a can of worms; not to mention interoperability concerns with other package systems, once e.g. your opam package becomes say a debian or nix package.

By the way you can perfectly create projects with non US-ASCII characters, it’s just that if you want to publish it with opam you will need an ASCII rendering of its name.

This points me in the right direction actually. I had no plans to upload a project (not library) to opam, but dune is giving me a similar error. The difference here is…

  let of_string x =
    match
      OpamStd.String.fold_left (fun acc c ->
          if acc = Some false then acc else match c with
            | 'a'..'z' | 'A'..'Z' -> Some true
            | '0'..'9' | '-' | '_' | '+' -> acc
            | _ -> Some false)
        None x
    with
    | Some false ->
      failwith
        (Printf.sprintf "Invalid character in package name %S" x)
    | None ->
      failwith
        (Printf.sprintf "Package name %S should contain at least one letter" x)
    | Some true ->
      x

That is, dune refuses to run even dune build if a name is not a valid opam package name. So, using the example abugida characters from above…

$ dune init project แมว
dune: NAME argument: expected a valid dune atom
Usage: dune init project [OPTION]… NAME [PATH]
Try 'dune init project --help' or 'dune --help' for more information.

or even when manually writing the dune-project file with a string-quoted name

$ dune build
File "dune-project", line 19, characters 7-45:
19 |  (name "\224\185\129\224\184\161\224\184\167")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: "\224\185\129\224\184\161\224\184\167" is an invalid opam package
name.
Package names can contain letters, numbers, '-', '_' and '+', and need to
contain at least a letter.
Hint: p_________ would be a correct opam package name
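
As a rough illustration of the rule in that message, here is a standalone sketch that mirrors the logic of the of_string function quoted earlier in the thread. It is not opam's or dune's actual code; check_package_name is a hypothetical name, and it returns a result instead of raising:

```ocaml
(* Hypothetical standalone check, assuming the same rule as the quoted
   of_string: only ASCII letters, digits, '-', '_' and '+' are allowed,
   and at least one ASCII letter is required. *)
let check_package_name (name : string) : (string, string) result =
  (* [seen_letter] records whether an ASCII letter has been found so far. *)
  let rec go i seen_letter =
    if i = String.length name then
      if seen_letter then Ok name
      else
        Error
          (Printf.sprintf
             "Package name %S should contain at least one letter" name)
    else
      match name.[i] with
      | 'a' .. 'z' | 'A' .. 'Z' -> go (i + 1) true
      | '0' .. '9' | '-' | '_' | '+' -> go (i + 1) seen_letter
      | _ ->
        (* Any other byte, including every byte of a multi-byte UTF-8
           sequence, is rejected. *)
        Error (Printf.sprintf "Invalid character in package name %S" name)
  in
  go 0 false

let () =
  List.iter
    (fun n ->
       match check_package_name n with
       | Ok n -> Printf.printf "ok: %s\n" n
       | Error msg -> Printf.printf "error: %s\n" msg)
    [ "maeo"; "caf\195\169"; "123" ]
```

The check works byte by byte, so แมว is seen as nine non-ASCII bytes rather than three characters, which is presumably why the hint above shows one underscore per byte.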

Or, if I keep the non-US-ASCII project name but give the package an ASCII name (a romanization via ISO 11940-2) for opam compatibility, for reasons I don’t personally understand as someone new to the ecosystem

$ dune build
File "dune-project", line 18, characters 0-161:
18 | (package
19 |  (name "maeo")
20 |  (synopsis "A short synopsis")
21 |  (description "A longer description")
22 |  (depends ocaml dune)
23 |  (tags
24 |   (topics "to describe" your project)))
Error: when a single package is defined, it must have the same name as the
project name: แมว

By the way you can perfectly create projects with non US-ASCII characters, it’s just that if you want to publish it with opam you will need an ASCII rendering of its name.

Should this then be considered a bug in dune?

Depends a bit on what dune uses name for.

I don’t know, I don’t use dune, but in the rare cases I had to, I always found its interaction and relationship with opam to be quite badly designed (e.g. IIRC I was quite annoyed that it insisted you had to define an opam package for your project).

Dune uses names to correspond to OCaml modules. Since OCaml modules are used as identifiers in the language, they can’t have non-ASCII names.

Ah, ha. Which goes back to why m17n is a ppx supporting UTF-8 since the language itself can’t support non-ASCII?

Is there any sort of workaround?

Not sure what your aim is. But if you simply want to have your directory called แมว, then as far as I can tell from a test build, dune is fine with it.

Also, there are string-valued description fields where you can use any language.

Aim is having a project name match the project spelling–be that แมว, mötley_crüe, or þórr without having to resort to Romanizations or other compromises on the branding/origin/heritage. Mélange has to be melange does it not?

Dune descriptions all get character-escaped if you run its built-in formatter, so you can no longer read them, but I don’t think you have to use dune for *.opam files.

I did not realize that dune character-escapes Unicode inside string fields like synopsis etc. (while opam does not). That sounds like a bug that should be (maybe has been) filed with the dune project.

Keep in mind this is vendored code from OPAM and not the actual code used for validation of OPAM packages in dune.

That won’t be possible, as neither Dune nor OPAM supports non-ASCII characters. Whether they should is another debate.

That is correct, you don’t have to; you can use Dune without generating OPAM files. In fact, this was the original modus operandi; generating OPAM files from a dune-project file was only added in a later release.
