Skip to main content
  1. projects/
  2. Regular expression engine built with finite state automata/

My custom regular expression parser & engine

·584 words·3 mins

Source Code
#

The source code is publicly available on my own git server

Why?
#

I was inspired to try and make my own automaton engine after learning about them in a course offered by the ANU, COMP1600.

Prior to COMP1600, I had already learned a bit about regular grammars and regular expressions in a compiler construction course, but that course didn’t go in much depth about how they might be practically implemented.

COMP1600 showed me that it is actually incredibly simple to make a regular expression engine that will match strings in linear time complexity with respect to the string length thanks to a concept known as deterministic finite automata (DFAs).

What is this?
#

If you don’t know what regular expressions are, leave now and return when you do.

This project is currently a collection of implementations of finite automata, along with a regular expression compiler. I will primarily be highlighting its ability to compile regular expressions into easily computable structures in memory known as automaton.

Essentially, you give a regular expression, and it spits out an automaton that can recognise the regular language described by that expression. We can think of the regex string as a kind of “source code” for a machine that accepts or rejects other strings; this “machine” is called an automaton. Specifically for regular expressions and languages, these automaton can be described with a finite memory known ahead of time, and are hence called finite state automaton (FSA).

This project utilises 3 different forms of FSA in the compilation process from regular expression to the final deterministic finite automaton (DFA). See the compilation pipeline

Automaton Included
#

Currently, only finite state automaton are included, no turing automaton. (yet!)

Automaton type Primary data structure struct Typical use
\(\epsilon \operatorname{NFA}^*\) Graph (using the petgraph library) GraphENFA Programmatically Building Automaton
\(\epsilon \operatorname{NFA}\) HashMap ENFA Easy to build automaton
\(\operatorname{NFA}\) HashMap NFA Less problematic to convert to DFA
\(\operatorname{DFA}\) HashMap DFA Complete control over a deterministic automaton

Regex Compilation Pipeline
#

The typical compilation pipeline for a regular expression \(R_L\) defining the regular langauge \(L\) in the automaton library is the following: $$ R_L \mapsto \epsilon \operatorname{NFA}^*_L \mapsto \epsilon \operatorname{NFA}_L \mapsto \operatorname{NFA}_L \mapsto \operatorname{DFA}_L $$ See the automaton included for info on these data structures.

This is exactly what the regex binary does.

The mapping \(R_L \mapsto \epsilon\text{NFA}^*_L\) is not necessarily trivial, the language of regular expressions is unfortunately not regular (it would have been really nice to describe the language of regular expressions exactly with a regular expression). I implemented an LL(1) predictive recursive descent parser1 to parse the regular expression into an AST, which is then converted into an \(\epsilon\text{NFA}^*_L\) through a set of algorithms described in the COMP1600 lectures.

Conversion
#

Conversion between various finite state automaton is achieved through the algorithms described in the lectures, with not much care for the memory complexity of the resulting conversion. Conversions are also not perfectly reversible, converting an \(\text{NFA}\) to a \(\text{DFA}\) and then back will just yield a \(\text{DFA}\) in the struct of an \(\text{NFA}\).

Demo
#

See the demo I have hosted here:

Regex engine demo
Simple demo of the regex engine with associated automatons


  1. In all honesty I’m not sure if it is really LL(1), I think it may technically be nondeterministic given it has some notion of context. As of writing, it has been months since I actually implemented it and did any formal theory on grammars so I’m a bit vague. ↩︎