While writing software, there is often a need to write a parser for a certain format. Recently, I was writing a parser for an IR of a language that my team was working on. When I first got the task, the suggestion was to use regular expressions, because they are very powerful. However, I went with `nom` instead, and I am very happy I did. Here is why.
The reasons parser combinators are better than regular-expression-based parsers:
- The functions you write have readable names, which makes the code way more readable.
- You have control over the minuscule implementation details and can optimize for performance.
- It is way easier to evolve the format: just keep the deprecated variant under the `alt()` combinator (see the sketch after this list).
- You can get way better parsing errors. It is way nicer to have "expected identifier" errors than just a plain old "expression did not match".
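To illustrate the point about evolving the format, here is a minimal sketch with hypothetical keywords: say `def` was the old spelling of `let`. Old inputs keep parsing because both variants live under `alt()`:

```rust
use nom::{branch::alt, bytes::complete::tag, IResult};

// Hypothetical example: `let` replaced the deprecated `def`.
// The new spelling is tried first; old inputs still parse.
fn let_keyword(input: &str) -> IResult<&str, &str> {
    alt((tag("let"), tag("def")))(input)
}
```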
Here are some specific gotchas and hints around using the `nom` crate.
Most importantly, do not use the older macros API. It is still there, but the functions are way easier to work with once you get the hang of them. If you see a `nom` tutorial using `named!`, just close the tutorial and find a newer one.
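To show the difference on a trivial parser of my own invention (the macro line reflects the pre-5.0 API as I remember it, kept as a comment since it no longer compiles on modern `nom`):

```rust
use nom::{bytes::complete::take_while1, IResult};

// Old macro style (nom <= 4), roughly:
// named!(number<&str, &str>, take_while1!(|c: char| c.is_ascii_digit()));

// Function style (nom 5+): a plain function you can call directly,
// unit-test, and pass into other combinators.
fn number(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_ascii_digit())(input)
}
```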
Do use the verbose errors in the unit tests and everywhere performance is not super critical. Combined with the `context()` combinator, you will get great error messages describing exactly what went wrong, where, and what was expected. Verbose errors do have a performance overhead, but it is fairly small in most cases.
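A minimal sketch of the pattern, assuming a hypothetical `identifier` rule that expects a `$`-prefixed name:

```rust
use nom::{
    bytes::complete::tag,
    character::complete::alpha1,
    error::{context, convert_error, VerboseError},
    sequence::preceded,
    IResult,
};

// The context() label shows up in the rendered error trace,
// so a failure reads "expected identifier" instead of a bare mismatch.
fn identifier(input: &str) -> IResult<&str, &str, VerboseError<&str>> {
    context("identifier", preceded(tag("$"), alpha1))(input)
}

fn main() {
    let input = "$1abc"; // digits are not a valid identifier here
    if let Err(nom::Err::Error(e)) = identifier(input) {
        // convert_error renders a human-readable, multi-line trace.
        println!("{}", convert_error(input, e));
    }
}
```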
Beware of big `alt()` and especially `permutation()` combinators. Those are performance killers. A nested `alt()` with a hundred branches was the biggest block on my first flamegraph when I looked into optimizing the parser. In my case, `alt()` was recognizing our supported operators by the first keyword (using `tag()`). Replacing that with a single `alphanumeric1()` parser to catch the keyword and a `match` statement to branch into each operator's dedicated parser improved performance by about 30%.
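The shape of that change, sketched with hypothetical operator names (in the real parser each `match` arm hands off to that operator's dedicated sub-parser):

```rust
use nom::{
    character::complete::alphanumeric1,
    error::{Error, ErrorKind},
    IResult,
};

#[derive(Debug)]
enum Op {
    Add,
    Sub,
    // ... dozens more in the real parser
}

// Instead of alt((tag("add"), tag("sub"), /* ~100 more */)),
// read the keyword once, then branch with a match instead of
// trying each tag in sequence.
fn op(input: &str) -> IResult<&str, Op> {
    let (rest, keyword) = alphanumeric1(input)?;
    match keyword {
        "add" => Ok((rest, Op::Add)),
        "sub" => Ok((rest, Op::Sub)),
        _ => Err(nom::Err::Error(Error::new(input, ErrorKind::Tag))),
    }
}
```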
When writing parser functions, you can pass them into other parsers instead of invoking them directly. Do not shy away from it, even if it looks complicated at first. Combined with the power of the Rust compiler, you will be able to express complex patterns with fairly simple code.
I'd like to share one snippet of code to demonstrate the last point. Here is what I have to parse vectors of any type that has a parser:
```rust
use nom::{
    bytes::complete::tag,
    error::{ContextError, ParseError},
    multi::separated_list0,
    sequence::delimited,
    IResult,
};

/// Parses a vector of items, using the supplied inner parser.
/// (`ws` is the whitespace-stripping wrapper from the nom recipes.)
fn vector<'a, F: 'a, O, E: 'a + ParseError<&'a str> + ContextError<&'a str>>(
    inner: F,
) -> impl FnMut(&'a str) -> IResult<&'a str, Vec<O>, E>
where
    F: FnMut(&'a str) -> IResult<&'a str, O, E>,
{
    delimited(tag("["), separated_list0(ws(tag(",")), inner), tag("]"))
}
```
Having that helper, I can then parse any vector with an extremely readable `vector(parse_user)`, `vector(double)`, etc.
Finally, I promised to tell how I was able to write a parser that is many times faster than `msgpack`. It is fairly simple, actually. Having control over the IR, I was able to specify that optional line breaks are treated as valid "split points". Since our application had to parse a big chunk of IR at start time, and there was nothing else useful for it to do until the IR was fully parsed, I simply split the input at line breaks into as many chunks as there were CPUs available and parsed those in parallel. I was able to do that because of the control over the IR format. We always had this control, but it is not always possible to express it to generic deserialization crates such as `msgpack`.
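Here is a minimal sketch of the splitting idea. `Item` and `parse_chunk` are stand-ins for the real IR type and the real nom entry point, and the splitter assumes ASCII input (true for our IR):

```rust
use std::thread;

type Item = String; // stand-in for the real IR node type

// Stand-in for the real nom-based parser entry point.
fn parse_chunk(chunk: &str) -> Vec<Item> {
    chunk.lines().map(str::to_owned).collect()
}

/// Split `input` into up to `n` pieces, cutting only at '\n',
/// so every piece is a well-formed run of IR items.
/// Assumes ASCII input; for UTF-8 you would snap to a char boundary.
fn split_at_line_breaks(input: &str, n: usize) -> Vec<&str> {
    let approx = input.len() / n.max(1) + 1;
    let mut chunks = Vec::with_capacity(n);
    let mut rest = input;
    while rest.len() > approx {
        // Cut at the first line break at or past the target chunk size.
        match rest[approx..].find('\n') {
            Some(i) => {
                let (head, tail) = rest.split_at(approx + i + 1);
                chunks.push(head);
                rest = tail;
            }
            None => break,
        }
    }
    if !rest.is_empty() {
        chunks.push(rest);
    }
    chunks
}

/// Parse each chunk on its own thread, preserving the original order.
fn parse_parallel(input: &str, n_threads: usize) -> Vec<Item> {
    thread::scope(|s| {
        let handles: Vec<_> = split_at_line_breaks(input, n_threads)
            .into_iter()
            .map(|chunk| s.spawn(move || parse_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```

In practice the thread count would come from `std::thread::available_parallelism()`.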