Code review: parsing a string which is in a given format

patrick-nicodemus · April 21, 2024, 9:41pm

A header line in the Org-mode markdown syntax (commonly used by Emacs users, as Emacs has a major mode which supports working with this markdown format)
is described in the documentation as follows:

A Heading is an unindented line structured according to the following pattern:

STARS KEYWORD PRIORITY COMMENT TITLE TAGS

where

STARS refers to a sequence of asterisks
KEYWORD is either the string “DONE” or the string “TODO”
PRIORITY is of the form “[#X]” where ‘[’, ‘#’, and ‘]’ are the literal characters and X is any alphanumeric character
COMMENT is the string “COMMENT”
TITLE is a string which does not start or end with spaces and contains no line breaks
TAGS is a list of strings “:TAG1:TAG2:TAG3:...:” where TAGi are strings containing a mix of alphanumeric characters as well as any of the special characters ‘%’, ‘@’, ‘_’, and ‘#’

All of the fields are optional except STARS.

I want to parse an Org header line and return the data. The code that I wrote for this is long and verbose and I am looking for constructive criticism.

  module Heading : sig
    type keyword =
      | TODO
      | DONE
    val pp_keyword : Format.formatter -> keyword -> unit

    val reserved_keywords : string list

    type comment = Comment 

    type priority = char

    type tag = string

    val reserved_tags : tag list

    (** Raises an exception if the input does not contain a space.
        Otherwise returns the index of the first space. *)
    val stars : string -> int
    (* type timestamp = int * int * int * string *)

    type line_data = int * keyword option * priority option * comment option * string * tag list
    val parse_headingline : string -> int * keyword option * priority option * comment option * string * tag list
    val is_headingline : string -> bool
  end = struct
    type keyword =
      | TODO
      | DONE
    
    type comment = Comment

    let reserved_tags = ["ARCHIVED"]
    let reserved_keywords = [ "TODO"; "DONE"]

    let stars str = String.index str ' '
    
    let is_headingline =
      let header_regexp = Str.regexp {|\*+ |} in
      fun str -> Str.string_match header_regexp str 0

    let count_and_strip_stars : string -> int * string =
      fun str ->
      let n = stars str in
      let strip_stars = Core.String.drop_prefix str n |> Core.String.lstrip in
      (n, strip_stars)

    let keyword_and_strip_keyword : string -> keyword option * string =
      fun str ->
      let keyword = List.find_opt
                      (fun keyword -> StringLabels.starts_with ~prefix:keyword str) reserved_keywords in
      let open Core.String in 
      match keyword with
      | Some "TODO" -> (Some TODO, drop_prefix str (String.length "TODO")  |> lstrip)
      | Some "DONE" -> (Some DONE, drop_prefix str (String.length "DONE") |> lstrip)
      | None -> (None, lstrip str)
      | _ -> failwith "Something is broken, the list reserved_keywords doesn't recognize this"

    let priority_regexp = Str.regexp {|\[#[A-Za-z0-9]\]|}

    let priority_and_strip_priority : string -> char option * string =
      let open Core.String in 
      fun str -> if (Str.string_match priority_regexp str 0) then
                   let priority = str.[2] in
                   (Some priority, lstrip (drop_prefix str 4))
                 else
                   (None, lstrip str)

    let comment_regexp = Str.regexp_string "COMMENT"

    let comment_and_strip_comment : string -> comment option * string =
      let open Core.String in
      fun str -> if (Str.string_match comment_regexp str 0) then
                   (Some Comment, lstrip @@ drop_prefix str (length "COMMENT"))
                 else
                   (None, lstrip str)

    let tag_regexp = Str.regexp {|:\([A-Za-z0-9_@#%]+:\)+ *$|}

    let tag_and_strip_tags : string -> string * string list =
      fun str ->
      let open Core.String in
      try
        (let _ = Str.search_forward tag_regexp str 0 in
         (* if (Str.search_forward tag_regexp str 0) then *)
         let tag_string = Str.matched_string str  in
         let match_index = Str.match_beginning () in
         let title = prefix str match_index in
         let tags = String.split_on_char ':' tag_string
                    |> List.filter (fun a -> not @@ Core.String.is_empty a) in
         (String.trim title, tags))
        with
          Not_found -> (str, [])

    type priority = char [@@deriving show]
    type tag = string [@@deriving show]
    (* Format.formatter -> tag -> unit *)
    type line_data = int * keyword option * priority option * comment option * string * tag list

    let parse_headingline (str: string) : line_data =
      if (not @@ is_headingline str) then
        raise (Invalid_argument "Str should be of the form \"**** \"")
      else
        let n_stars, str1 = count_and_strip_stars str in
        let keyword, str2 = keyword_and_strip_keyword str1 in
        let priority, str3 = priority_and_strip_priority str2 in
        let comment, str4 = comment_and_strip_comment str3 in
        let title, tags = tag_and_strip_tags str4 in
        (n_stars, keyword, priority, comment, title, tags)
  end

patrick-nicodemus · April 21, 2024, 9:54pm

After I did that, my next task was to extend the code to support the following. The next line after a heading can optionally be a “planning” line. A planning line is a single line containing of a series of tokens KEYWORD: TIMESTAMP where KEYWORD is one of the following string literals: SCHEDULED, CLOSED or DEADLINE and TIMESTAMP is of the form <YYYY-MM-DD XXX> where XXX is one of the following strings: “Sun”, “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”.

A planning line can contain duplicate entries, in which case only the last entry counts.

Last, I edited it to support “property tables.” A property table can follow a heading line and (optionally) a planning line as well as whitespace lines. It looks roughly like

:properties:
:`key1`: `value1`
:`key2`: `value2`
:`key3`: `value3`
:end:`

where properties and end are the actual strings “properties” and “end” and keyi, valuei are strings (keyi should not contain whitespace and valuei should not contain a newline)

Here is a more complete presentation of the code that deals with both the planning line and the property line.

module Org = struct
  module Heading : sig
    type keyword =
      | TODO
      | DONE
    val pp_keyword : Format.formatter -> keyword -> unit

    val reserved_keywords : string list

    type comment = Comment 
    val pp_comment : Format.formatter -> comment -> unit

    type priority = char
    val pp_priority : Format.formatter -> priority -> unit

    type tag = string
    val pp_tag : Format.formatter -> tag -> unit

    val reserved_tags : tag list

    (** Raises an exception if the input does not contain a space.
        Otherwise returns the index of the first space. *)
    val stars : string -> int
    (* type timestamp = int * int * int * string *)

    type line_data = int * keyword option * priority option * comment option * string * tag list
    module Planning : sig
      type timestamp = int * int * int * string [@@deriving show]
      val is_planning_line : string -> bool
      val get_scheduled : string -> timestamp option
      val get_deadline : string -> timestamp option
      val get_closed : string -> timestamp option
    end

    open Planning

    type t = {
        stars : int;
        keyword : keyword option;
        priority : priority option;
        comment : comment option;
        title : string;
        tags : tag list;
        scheduled : timestamp option;
        deadline : timestamp option;
        closed : timestamp option;
        property_drawer : (string * string) list;
        section : string list
      } [@@deriving show]

    val parse_headingline : string -> int * keyword option * priority option * comment option * string * tag list
    val is_headingline : string -> bool
    val get_property_drawer : string list -> (string * string) list * string list
    
    type planning_data = Planning.timestamp option * Planning.timestamp option * Planning.timestamp option 
  end = struct
    type keyword =
      | TODO
      | DONE [@@deriving show]
    
    type comment = Comment [@@deriving show]

    let reserved_tags = ["ARCHIVED"]
    let reserved_keywords = [ "TODO"; "DONE"]

    let stars str = String.index str ' '
    
    let is_headingline =
      let header_regexp = Str.regexp {|\*+ |} in
      fun str -> Str.string_match header_regexp str 0

    let count_and_strip_stars : string -> int * string =
      fun str ->
      let n = stars str in
      let strip_stars = Core.String.drop_prefix str n |> Core.String.lstrip in
      (n, strip_stars)

    let keyword_and_strip_keyword : string -> keyword option * string =
      fun str ->
      let keyword = List.find_opt
                      (fun keyword -> StringLabels.starts_with ~prefix:keyword str) reserved_keywords in
      let open Core.String in 
      match keyword with
      | Some "TODO" -> (Some TODO, drop_prefix str (String.length "TODO")  |> lstrip)
      | Some "DONE" -> (Some DONE, drop_prefix str (String.length "DONE") |> lstrip)
      | None -> (None, lstrip str)
      | _ -> failwith "Something is broken, the list reserved_keywords doesn't recognize this"

    let priority_regexp = Str.regexp {|\[#[A-Za-z0-9]\]|}

    let priority_and_strip_priority : string -> char option * string =
      let open Core.String in 
      fun str -> if (Str.string_match priority_regexp str 0) then
                   let priority = str.[2] in
                   (Some priority, lstrip (drop_prefix str 4))
                 else
                   (None, lstrip str)

    let comment_regexp = Str.regexp_string "COMMENT"

    let comment_and_strip_comment : string -> comment option * string =
      let open Core.String in
      fun str -> if (Str.string_match comment_regexp str 0) then
                   (Some Comment, lstrip @@ drop_prefix str (length "COMMENT"))
                 else
                   (None, lstrip str)

    let tag_regexp = Str.regexp {|:\([A-Za-z0-9_@#%]+:\)+ *$|}

    let tag_and_strip_tags : string -> string * string list =
      fun str ->
      let open Core.String in
      try
        (let _ = Str.search_forward tag_regexp str 0 in
         (* if (Str.search_forward tag_regexp str 0) then *)
         let tag_string = Str.matched_string str  in
         let match_index = Str.match_beginning () in
         let title = prefix str match_index in
         let tags = String.split_on_char ':' tag_string
                    |> List.filter (fun a -> not @@ Core.String.is_empty a) in
         (String.trim title, tags))
        with
          Not_found -> (str, [])

    type priority = char [@@deriving show]
    type tag = string [@@deriving show]
    (* Format.formatter -> tag -> unit *)
    type line_data = int * keyword option * priority option * comment option * string * tag list

    module Planning = struct
      type timestamp = int * int * int * string [@@deriving show]
      (* let planning_line_regexp = String.regexp "\(SCHEDULED:\)\|\(SCHEDULED:\)" *)
      let is_planning_line str =
        let trim = String.trim str in
        StringLabels.starts_with ~prefix:"SCHEDULED:" trim ||
        StringLabels.starts_with ~prefix:"DEADLINE:" trim ||
        StringLabels.starts_with ~prefix:"CLOSED:" trim

      let timestamp1 = {|<\([0-9]+\)-\([0-9]+\)-\([0-9]+\) \([A-Za-z]+\).*>|}
      let get_timestamp : string -> string -> (int * int * int * string) option =
        fun planning_keyword ->
        let r = (Str.regexp ( planning_keyword ^ ": *"^ timestamp1 ^ ".*")) in
        fun str ->
        let str = String.trim str in
        try
          let index = Str.search_backward r str (String.length str) in
          let str = Core.String.drop_prefix str index in
          let year = int_of_string (Str.replace_first r {|\1|} str) in
          let month = int_of_string (Str.replace_first r {|\2|} str) in
          let day = int_of_string (Str.replace_first r {|\3|} str) in
          let weekday = (Str.replace_first r {|\4|} str) in
          Some (year, month, day, weekday)
        with
          Not_found -> None

      let get_scheduled = get_timestamp "SCHEDULED"
      let get_deadline = get_timestamp "DEADLINE"
      let get_closed = get_timestamp "CLOSED"
     end

    include Planning

    let get_property_drawer ell =
      (Core.List.drop_while ~f:(fun a -> Core.String.is_empty (String.trim a)) ell)
      |> function
      | [] -> [], ell
      | hd :: tail ->
         if Base.String.Caseless.equal (String.trim hd) ":PROPERTIES:" then
           let front, tail = Base.List.split_while
                            ~f:(fun a -> not @@ Base.String.Caseless.equal (String.trim a) ":END:")
                            tail
           in
           (* assert (List.exists (fun a -> Base.String.Caseless.equal (String.trim a) ":END:") tail); *)
           match tail with
           | [] ->  failwith "Formatting error, :PROPERTIES: has no accompanying :end: block"
           | _ :: tail ->
              let property_drawer = List.map (fun a ->
                 let a = (String.trim a) in 
                 let r = Str.regexp {|:\([A-Za-z0-9]+\):\(.*\)|} in
                 if (Str.string_match r a 0) then
                   let key = Str.global_replace r {|\1|} a in
                   let value = String.trim (Str.global_replace r {|\2|} a) in
                   (key,value)
                 else (failwith "Malformed :PROPERTIES: block.")
                                     ) front
             in (property_drawer, tail)
         else [], ell

    type t = {
        stars : int;
        keyword : keyword option;
        priority : priority option;
        comment : comment option;
        title : string;
        tags : tag list;
        scheduled : Planning.timestamp option;
        deadline : Planning.timestamp option;
        closed : Planning.timestamp option;
        property_drawer : (string * string) list;
        section : string list
      } [@@deriving show]

    let parse_headingline (str: string) : line_data =
      if (not @@ is_headingline str) then
        raise (Invalid_argument "Str should be of the form \"**** \"")
      else
        let n_stars, str1 = count_and_strip_stars str in
        let keyword, str2 = keyword_and_strip_keyword str1 in
        let priority, str3 = priority_and_strip_priority str2 in
        let comment, str4 = comment_and_strip_comment str3 in
        let title, tags = tag_and_strip_tags str4 in
        (n_stars, keyword, priority, comment, title, tags)

    type planning_data = Planning.timestamp option * Planning.timestamp option * Planning.timestamp option
  end

Once this is done I can parse (at least these) parts from an org file as follows.

  type doctree = Doctree of (Heading.t * doctree) list [@@deriving show]

  let[@tail_mod_cons] rec separate (lines : string list) : doctree =
    (* let stars str = String.index str ' ' in *)
    match lines with
    | hd :: tl ->
       let stars_hd = Heading.stars hd in
       let a, b = Base.List.split_while
                    ~f:(fun a -> (not @@ Heading.is_headingline a) ||
                                   stars_hd < Heading.stars a) tl
       in
       let section, subheadings =
         Base.List.split_while
           ~f:(fun x -> not @@ Heading.is_headingline x) a in
       let planning, rest =
         match section with
         | [] -> (None, None, None), []
         | line :: tail ->
            let open Heading.Planning in
            (* let b =  in *)
            if is_planning_line line then
              let schedule = get_scheduled line in
              let deadline = get_deadline line in
              let closed = get_closed line in
              (schedule, deadline, closed), tail
            else
              (None, None, None), line :: tail
       in
       let properties, rest = Heading.get_property_drawer rest in
       let (n_stars, keyword, priority, comment, title, tags) = Heading.parse_headingline hd in
       let scheduled, deadline, closed = planning in 
       (* let scheduled, deadline, closed = properties in *)
       Doctree ( ( {
             stars = n_stars;
             keyword = keyword;
             priority = priority;
             comment = comment;
             title = title;
             tags = tags;
             scheduled = scheduled;
             deadline = deadline;
             closed = closed;
             property_drawer = properties;
             section = rest;
                   },
                   separate subheadings)
          :: let Doctree z = separate b in z)
    | [] -> Doctree []

Thanks for any feedback!

darrenldl · April 22, 2024, 12:54am

I think I share the same style for structuring my parsers and I also break the parsing problem down at similar granularity, so I don’t have much comment on the overall structuring.

For the technical stuff:

Avoid Str, since it uses global state. Use re package if you really need it, but at a glance, it seems that angstrom (which is what I typically use for almost everything) would have managed and allowed for easier compositions of parsers.
Similarly, angstrom can replace a lot of the string normalisation stuff you’re doing via String or Core.String

lindig · April 22, 2024, 8:05am

As an alternative, have you considered writing a parser using OCamlLex? For highly structured input, I believe it is a good approach because regular expressions can be named and composed. The generated code can be easily called from outside OCaml code.

chshersh · April 22, 2024, 8:33am

I really like parser combinators for defining readable parsers of such problems! In OCaml, you can use angstrom as one of such libraries.

I also like, how the format can directly translate to code. For example, this definition:

Can be written almost one-to-one using monadic parsers.

First, you define a data type representing your mode:

type keyword =
  | TODO
  | DONE

type heading =
  { stars: int;
    keyword: keyword;
    priority: char;
    comment: string;
    title: string;
    tags: string list;
  }

Then, the top-level main parser just combines multiple parsers to construct the output:

let heading_p =
    let ( let* ) = Angstrom.(>>=) in
    let* stars = stars_p in
    let* keyword = keyword_p in
    let* priority = priority_p in
    let* comment = comment_p in
    let* title = title_p in
    let* tags = tags_p in
    Angstrom.return { stars; keyword; priority; comment; title; tags }

This is a boring part but I really like how readable it is
The main logic will be happening inside those parsers for each field.

For example, for STARS it can look like this:

let stars_p = 
  let ( let* ) = Angstrom.(>>=) in
  (* parsing one or more star characters *)
  let* stars = Angstrom.(many_1 (char '*')) in
  let stars_count = List.length stars in
  (* parsing zero or more trailing spaces and ignoring them *)
  let* _ = Angstrom.(many (char ' ')) in
  Angstrom.return stars_count

lukstafi · April 22, 2024, 8:41am

Nitpick: Angstrom already contains this:

let ( let+ ) = ( >>| )
let ( let* ) = ( >>= )
let ( and+ ) = both

lukstafi · April 22, 2024, 8:46am

I never used it, but if I was tempted to use ocamllex I would now try sedlex instead – to keep my source code more compact (fewer files). Of course, unless parsing was a significant part of the project, then I would use Menhir. I only started using Angstrom a couple days ago but I think going forward it would be my default when regular expressions get too non-trivial but I don’t want to devote whole files to parsing.

chshersh · April 22, 2024, 12:20pm

Good to know! Thanks for the tip
Time to refactor my entire project

yawaramin · April 22, 2024, 1:49pm

I think Patrick said in the first message that all the fields except STARS were optional.

patrick-nicodemus · April 22, 2024, 3:45pm

There are a lot of good answers here, thank you all. I will definitely look at angstrom, I will also try to look at ocamllex if I have time to have a sense of what functionality is provided in the standard library.

chshersh · April 22, 2024, 9:28pm

My bad!

In that case, the change is to make relevant fields optional in the heading type, using the option combinator where needed.

lukstafi · April 24, 2024, 11:05am

I only learned now, the hard way, that ( let* ), (let+), (and+) are only available since Angstrom version 0.15.

Topic		Replies	Views
Recommendations for parsing structured text files with specific annotations/commands Ecosystem library , markdown	6	908	August 28, 2022
Parsing question - keywords Learning parsing	0	102	September 25, 2024
Please comment my code ... thanks! Learning	4	234	June 17, 2024
Can't get my semicolons right, I guess. Noob question-alert Learning	14	716	June 16, 2024
Need help with formatting lists Learning	3	1820	August 6, 2019

Code review: parsing a string which is in a given format

Related topics