I want you to read that title as literally as possible. Harper is now capable of evolution.
This past week, I've been working on a system that should allow us to handle more complex grammatical cases and contexts, faster. I believe it will improve our ability to add new grammatical rules to Harper by somewhere between 500% and 1,000%.
To top it off, this system does it without slowing Harper itself down or increasing the memory footprint.
Let's get into it.
Harper employs several distinct strategies when grammar checking; which one it uses depends on the grammatical rule in question. Today, we're interested in expression rules.
For the curious, I have recently written a reflection on expression rules, as well as a guide for anyone interested in producing them. This post, however, will not recount information I've already written on this blog.
By count, expression rules make up the majority of the grammatical checks Harper currently performs. This is because they are fast, easy to write, and, most importantly, easy to review.
There are, however, occasional hiccups when I tackle a problem. English is tricky, and it frequently contradicts itself. I'll write a rule that covers one case, only to find it misses others. I can iterate, but doing so quickly becomes tedious and time-consuming.
Last week, I threw in the towel. I was tired of iterating ceaselessly towards a goal, only to have a new one to tackle after that. So I decided I would let the computer iterate for me.
Harper's expressions are essentially small programs which are able to identify the locations of given patterns in natural language. They are constructed at runtime, but they run exceedingly fast because they tend to be amenable to modern branch prediction. We can use this fact to our advantage.
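To make that concrete, here's a toy model of what an expression might look like. To be clear, this is not Harper's actual API; the names and structure here (`Step`, `Expression`, and so on) are simplified stand-ins for illustration:

```rust
/// One step of an expression: a predicate over a single word token.
/// (Illustrative names only; Harper's real types are richer.)
#[derive(Clone)]
enum Step {
    ExactWord(String), // a specific word, matched case-insensitively
    AnyWord,           // any word at all
}

/// An expression is a fixed sequence of steps. Matching is a tight,
/// predictable loop, which is why these run so fast in practice.
#[derive(Clone)]
struct Expression {
    steps: Vec<Step>,
}

impl Expression {
    /// Does the expression match the tokens starting at `start`?
    fn matches_at(&self, tokens: &[&str], start: usize) -> bool {
        tokens
            .get(start..start + self.steps.len())
            .map_or(false, |window| {
                self.steps.iter().zip(window).all(|(step, tok)| match step {
                    Step::ExactWord(w) => tok.eq_ignore_ascii_case(w),
                    Step::AnyWord => true,
                })
            })
    }

    /// Every starting index in `tokens` where the expression matches.
    fn find_all(&self, tokens: &[&str]) -> Vec<usize> {
        (0..tokens.len())
            .filter(|&i| self.matches_at(tokens, i))
            .collect()
    }
}
```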
When generating an expression that detects a particular grammatical rule, the new system (which I've called The Ripper) follows three steps:

1. Generate a population of candidate expressions.
2. Score each candidate against a small dataset of sentences the rule should and should not flag.
3. Discard the weakest candidates, mutate the strongest, and repeat from step two until the score stops improving.
That's it! We're essentially treating expressions as living creatures and subjecting them to artificial selection. It works remarkably well.
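The selection loop itself can be sketched in a few dozen lines. Again, this is a minimal caricature built on the toy `Expression` type above, not The Ripper's actual implementation; it assumes the `rand` crate (0.8-style API) for randomness:

```rust
use rand::prelude::*;

/// Score a candidate: one point for each positive example it flags
/// and each negative example it correctly leaves alone.
fn fitness(expr: &Expression, positives: &[Vec<&str>], negatives: &[Vec<&str>]) -> usize {
    let hits = positives.iter().filter(|s| !expr.find_all(s).is_empty()).count();
    let passes = negatives.iter().filter(|s| expr.find_all(s).is_empty()).count();
    hits + passes
}

/// Copy an expression with one randomly chosen step perturbed: either
/// loosened to `AnyWord` or pinned to a word from the training vocabulary.
/// (A fuller mutator would also insert, delete, and reorder steps.)
fn mutate(expr: &Expression, vocab: &[&str], rng: &mut impl Rng) -> Expression {
    let mut child = expr.clone();
    if let Some(step) = child.steps.choose_mut(rng) {
        *step = if rng.gen_bool(0.5) {
            Step::AnyWord
        } else {
            // `vocab` must be non-empty for this sketch.
            Step::ExactWord(vocab.choose(rng).unwrap().to_string())
        };
    }
    child
}

/// Artificial selection: score every candidate, keep the top half,
/// and refill the population with mutated copies of the survivors.
fn evolve(
    mut population: Vec<Expression>,
    vocab: &[&str],
    positives: &[Vec<&str>],
    negatives: &[Vec<&str>],
    generations: usize,
) -> Expression {
    let mut rng = thread_rng();
    for _ in 0..generations {
        population.sort_by_key(|e| std::cmp::Reverse(fitness(e, positives, negatives)));
        population.truncate((population.len() / 2).max(1));
        let survivors = population.clone();
        for parent in &survivors {
            population.push(mutate(parent, vocab, &mut rng));
        }
    }
    // The population was sorted at the top of the final generation,
    // so the first survivor is the best-scoring candidate.
    population.into_iter().next().expect("population must not be empty")
}
```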
Since these datasets are handcrafted (or generated by an LLM), they don't need to be large. Plus, the expressions themselves are quick to generate and test, so we can evaluate candidates at an exceptional rate.
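Here's what driving that sketch with a tiny handcrafted dataset might look like. The target rule (flagging the common error "could of") is purely a hypothetical example:

```rust
fn main() {
    // A toy whitespace tokenizer; Harper's real lexer is far richer.
    let tok = |s: &str| s.split_whitespace().collect::<Vec<_>>();

    // Sentences the rule should flag...
    let positives = vec![
        tok("I could of done it"),
        tok("she could of been there"),
    ];
    // ...and sentences it must leave alone.
    let negatives = vec![
        tok("I could have done it"),
        tok("he walked out of the room"),
    ];

    // Let the mutator draw replacement words from the positive examples.
    let vocab: Vec<&str> = positives.iter().flatten().copied().collect();

    // Seed the population with a deliberately vague two-step guess
    // and let selection refine it.
    let seed = Expression {
        steps: vec![Step::AnyWord, Step::AnyWord],
    };
    let best = evolve(vec![seed; 32], &vocab, &positives, &negatives, 200);

    // With a bit of luck, the winner pins down "could of" exactly.
    for s in &positives {
        println!("flags {:?}: {}", s, !best.find_all(s).is_empty());
    }
}
```

Harper's real candidates are, of course, much richer than two-step word sequences, but the shape of the process is the same.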
My laptop can churn through about 90 thousand candidates per second, letting us converge on an acceptable result in just a few minutes. Given more time, The Ripper can produce an expression rule more accurate than anything I could write myself.
I intend to spend some time optimizing the process, particularly for the human element. I'd like to be able to create batches of these datasets and let The Ripper take care of them all at once, overnight or on a beefy server in the cloud.
I'd also like to set up automated workflows for piping data from an LLM directly into The Ripper. Ideally, I want this system to reach a point where I can feed information from a style guide into an LLM and get a guaranteed-functioning Harper expression rule out of it.