Stateless Syntax Highlighter

 

About a year ago I got the dumb idea of creating a syntax highlighter that uses ECMA-48 SGR sequences to enable syntax highlighting on the command line. I searched for a while to see if I could cure a little NIH Syndrome and didn't find anything, so I ventured off on my own. These are the hurdles I ran into, along with the program as it stands right now if you want to test it yourself (in both C++ and PHP).

For those of you wondering why stateful vs. stateless matters in a syntax highlighter, here's a small example. Say I'm highlighting a string: "Hello World!". This is all well and good, until the language I'm highlighting uses C-style line comments, i.e., // comment, and we run across a string like "Hello World//other stuff". With a stateful highlighter we can easily tell a string from a comment, because we enter a string state and don't leave it until we reach an unescaped delimiter. A stateless highlighter has no way to tell the two apart (without a whole lot of unnecessary logic), so the part of the string before the // ends up highlighted as a string and the bit after it as a comment, which is obviously bad.
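To make the difference concrete, here is a minimal sketch in Python (not the C++ or PHP of the actual program). The function names, the two toy rules, and the span format are all made up for illustration. The stateless version runs each rule over the line independently, while the stateful version walks the line once and remembers that it is inside a string.

```python
import re

# Toy rules, assumed for illustration only: double-quoted strings and // comments.
STRING_RE = re.compile(r'"[^"]*"')
COMMENT_RE = re.compile(r'//.*')

def stateless_spans(line):
    """Apply each rule on its own; no rule knows what the others matched."""
    spans = []
    for kind, rule in (("string", STRING_RE), ("comment", COMMENT_RE)):
        for match in rule.finditer(line):
            spans.append((kind, match.group(0)))
    return spans

def stateful_spans(line):
    """Walk the line once, staying in the string state until the closing quote."""
    spans = []
    i = 0
    while i < len(line):
        if line[i] == '"':
            j = i + 1
            while j < len(line) and line[j] != '"':
                # Skip the character after a backslash so \" does not end the string.
                j += 2 if line[j] == "\\" else 1
            end = min(j + 1, len(line))
            spans.append(("string", line[i:end]))
            i = end
        elif line.startswith("//", i):
            spans.append(("comment", line[i:]))
            break
        else:
            i += 1
    return spans

line = 'print("Hello World//other stuff");'
print(stateless_spans(line))
print(stateful_spans(line))
```

On this line the stateless version reports both a string and a comment that starts inside that string, while the stateful version reports only the string.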

Now, I mentioned in the beginning that the highlighter uses ECMA-48 SGR sequences to create color on the command line. For those unfamiliar with how this works, there is a special series of characters called a CSI, or Control Sequence Introducer, which is an escape character (written either \E or ^[) followed by [. An SGR sequence is that CSI followed by a number between 0 and 49 (with a few left out) and ended with a lowercase "m". It looks a bit like this in code: ^[[31m (this produces the color red in the foreground). With that boring description over I'll get into the problem: ECMA-48 SGR sequences don't have matching endings for control sequences, just a reset. If you don't understand what I mean, think of it in HTML terms, where there is a start tag and an end tag. With SGR there is only the control sequence and another control sequence, ^[[0m, which resets all currently assigned attributes (colors). If a sequence is started and not canceled then the color will run, or "bleed", until it runs into another color or a reset.
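A short Python sketch of what building such a sequence looks like. The helper names `sgr` and `colorize` are hypothetical and chosen only for this example; the byte values themselves (ESC is 0x1B, 31 is red foreground, 0 is reset) come from the description above.

```python
ESC = "\x1b"  # the escape character, often written \E or ^[

def sgr(code):
    """Build one SGR sequence: ESC, '[', the attribute number, then 'm'."""
    return f"{ESC}[{code}m"

RED_FG = sgr(31)  # red foreground
RESET = sgr(0)    # clears every attribute set so far

def colorize(text, code):
    """Wrap text in a color and an explicit reset so the color cannot bleed."""
    return sgr(code) + text + RESET

print(colorize("this is red", 31) + " and this is back to normal")
print(RED_FG + "no reset here, so the color keeps going")
print("still red on terminals that honor SGR" + RESET)
```

The second `print` shows the bleed: with no reset, the red carries over into the following line until the `RESET` at its end.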

The problem arises when you combine the two factors: statelessness and a reset sequence. If there is no state to exit, then there is no way to know where a reset belongs in a single pass. So, as you might have guessed (probably not), I solve this issue by taking many passes over the code, first to highlight it and then to find bleeding colors and place a reset sequence before them. This may sound slow, but it ends up being faster than you'd think. It did take a wee bit of debugging to keep it from getting stuck in an infinite loop when setting reset sequences, but I cleared that up.
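The real script's passes aren't shown in this post, so the following is only a rough Python sketch of the general idea, not the actual implementation. It assumes a "bleed" means one color sequence being followed by another with no reset in between. Each pass fixes the first bleed it finds, and the passes repeat until one changes nothing. The function names and the `max_passes` guard are made up for illustration.

```python
import re

RESET = "\x1b[0m"
SGR_RE = re.compile(r"\x1b\[(\d+)m")  # any single-number SGR sequence

def close_one_bleed(text):
    """One pass: insert a reset before the first color that starts while
    an earlier color is still open. Returns (new_text, changed)."""
    color_open = False
    for match in SGR_RE.finditer(text):
        if match.group(1) == "0":
            color_open = False
        elif color_open:
            cut = match.start()
            return text[:cut] + RESET + text[cut:], True
        else:
            color_open = True
    return text, False

def close_all_bleeds(text, max_passes=1000):
    """Repeat passes until nothing changes. The pass limit is a safety net
    against looping forever if a pass keeps reporting changes."""
    for _ in range(max_passes):
        text, changed = close_one_bleed(text)
        if not changed:
            return text
    raise RuntimeError("reset insertion did not settle")

bleeding = "\x1b[31mred\x1b[32mgreen\x1b[0m"
print(repr(close_all_bleeds(bleeding)))
```

Because an inserted reset closes the color before it, the next pass never flags the same spot again, which is what lets the loop terminate.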

If you've got this far you're either A) wayyy too bored for your own good, or B) techy enough to be interested. I'm going to cater to group B at this point as I explain the nitty-gritty of how it works. The most interesting bit I stumbled upon when writing this was the string matching expression. Turns out it's quite tricky to get strings to match when you take escape sequences into account, so I ended up with this bad boy:

([\'"])(.*?)(.{0,2})(?<![^\\\]\\\)(\1)

This expression finds the starting delimiter, then as little text as possible, then up to 2 more characters, and finally the original starting delimiter again. There is a negative look-behind after the third matching group which checks whether the 2 characters before the closing delimiter are an escape character for the delimiter or an escaped escape character, i.e., blah\' (the quote is escaped, so the string keeps going) vs. \\' (the backslash itself is escaped, so the quote really does end the string).
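To try the expression outside the script, here is a small Python sketch. It assumes the pattern above is written the way it would appear inside a PHP single-quoted string, so the look-behind reduces to `(?<![^\\]\\)` once the string escaping is removed. The name `find_strings` is hypothetical.

```python
import re

# Assumption: this is the pattern from the post with one layer of
# string escaping removed, so the look-behind reads (?<![^\\]\\).
STRING_RE = re.compile(r"""(['"])(.*?)(.{0,2})(?<![^\\]\\)(\1)""")

def find_strings(source):
    """Return every substring the pattern treats as a complete string literal."""
    return [match.group(0) for match in STRING_RE.finditer(source)]

print(find_strings('x = "Hello World!"'))  # plain string
print(find_strings(r"'it\'s' + y"))        # escaped quote inside the string
print(find_strings(r'"dir\\" + z'))        # escaped backslash before the closing quote
```

One limitation worth knowing: because the look-behind only inspects two characters, three backslashes in front of a quote are read as an escaped backslash followed by a closing quote, so the match ends one character too early in that case.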

That's it for now. I'm going to take a few posts to explain all of the details of the script, as there is a lot more than this and I don't feel like making this 10 pages long in the first go. If you want to take a look at the script up to this point you can download it here. It is well commented to an extent, but if you have questions just keep taking a peek at this blog for updates as I explain in more depth.