Stupid-Simple Spell-Check

A sign advertising for antiques, but misspelled.

For the last month, I’ve been spending a lot of time replacing one key component of my writing and programming environment: my grammar checker.

Up until now, I’ve been using the eponymous LanguageTool via ltex-ls and nvim-lspconfig. Don’t get me wrong, these tools are really good, and I would recommend them to anyone and everyone. However, they come with a few key annoyances.

LanguageTool Grievances

Performance

LanguageTool is slow. I’m not exactly sure why. Every time I would run LanguageTool over my Markdown or LaTeX documents (which are reasonably sized), I would have to wait several seconds before even the rudimentary spell-checking would show up.

Additionally, I would find ltex-ls regularly becoming the most memory-hungry application on my laptop, often exceeding 4 gigabytes.

After hours of scouring their code base, I have come to no better explanation than that it is written in Java. There are a couple of questionable algorithmic decisions in there as well.

Download Size

As I said: LanguageTool is really quite good. However, to get everything it can offer, you need to install not only a Java Runtime Environment (&gt;150 MB on my system) and the actual .jar file (&gt;300 MB), but also a 16 GB n-gram dataset.

Grammarly Grievances

“But Elijah,” I hear you say, “just use Grammarly!”

No. I’m not going to drop $12 a month for something even slower and worse. Not to mention how they are likely going to use my work to train their large language models. Grammarly is a great product, just not for me.

The Algorithm

Now that I’ve thoroughly explained my reasoning for implementing a new grammar checker (one that I’m calling Harper), I’d like to recount my first, admittedly naive, attempt at spellchecking.

The first idea we need to get a grip on is Levenshtein edit distance. In essence, edit distance is the least number of single-character edits (insertions, deletions, or replacements) necessary to turn one word into another. For example, the edit distance between “cat” and “bat” is one; the only edit involves replacing the “c” with a “b”.

Similarly, the edit distance between “kitten” and “sitting” is three: replace the “k” with an “s”, replace the “e” with an “i”, and append a “g”. For this naive spellchecking, we aren’t too concerned with the exact edits (atomic errors) that occur in a given misspelling, only the magnitude of the error.

From a high-level view, here’s how the algorithm is going to work:

  1. Determine whether a given word is misspelled. If not, exit.
  2. Calculate the Levenshtein edit distance between the misspelled word and all valid English words.
  3. Pick the three words with the shortest edit distance and present them to the user as alternative spelling options.

Step 1

To determine whether a given word is misspelled, we will need a list of all the valid words in the English language. Turns out, this isn’t too easy. For today, we will just use a subset of the English language with this short list:

into
the
a
cat
tree
jumped

To check whether a given word is within the list, we can place the list into a hash set and test for membership.

use std::collections::HashSet;

let words: Vec<String> = vec!["into", "the", "a", "cat", "tree", "jumped"]
    .into_iter()
    .map(|s| s.to_string())
    .collect();

let word_set: HashSet<String> = words.iter().cloned().collect();

let word = "thw";
let word_chars: Vec<_> = word.chars().collect();

if word_set.contains(word) {
    println!("It is a valid English word!");
    return;
}

println!("Are you sure you meant to spell \"{}\" that way?", word);

The Wagner-Fischer Algorithm

Now that we know our word is actually misspelled, we can move on to finding the correct spelling. We need to find the edit distance between the misspelled word and all the words in our set.

To do this, we will be using the Wagner-Fischer algorithm.

// Computes the Levenshtein edit distance between two patterns.
// This is accomplished via the Wagner-Fischer algorithm.
fn edit_distance(source: &[char], target: &[char]) -> u8 {
    let m = source.len();
    let n = target.len();

    // Create an (m + 1)-by-(n + 1) matrix.
    let mut d = create_empty_matrix(m + 1, n + 1);

    // Since we know we can transform each word into the empty string (and vice
    // versa) by successive deletions or insertions, we can fill the first column
    // and row with values from 0..=m and 0..=n, respectively.
    for i in 0..=m {
        d[i][0] = i as u8;
    }

    for i in 0..=n {
        d[0][i] = i as u8;
    }

    for j in 1..=n {
        for i in 1..=m {
            // The edit distance between the first i characters of the source and
            // the first j characters of the target is the minimum over a deletion,
            // an insertion, and a substitution (which costs nothing if the
            // characters at the two indices are equal).

            let cost = if source[i - 1] == target[j - 1] { 0 } else { 1 };
            d[i][j] = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);
        }
    }

    // After all possible edits have been explored and minimized
    // the resulting minimum edit distance will be in the final item in the matrix.
    d[m][n]
}

// Create an empty matrix of size [m, n]
fn create_empty_matrix(m: usize, n: usize) -> Vec<Vec<u8>> {
    let mut d = Vec::with_capacity(m);

    for _ in 0..m {
        d.push(vec![0u8; n]);
    }

    d
}

This works pretty well. There are a number of optimizations we could apply to this function alone. I’ll leave those as a problem for the reader, since they aren’t particularly relevant to the meat of the larger algorithm.
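As a taste of what’s possible (a sketch of one well-known optimization, not necessarily what Harper does): each cell in the matrix depends only on the current and previous rows, so we can keep two rows instead of the full matrix, cutting memory from O(m·n) to O(n).

```rust
// A space-optimized variant of edit_distance: only two rows of the
// Wagner-Fischer matrix are kept, since row i depends solely on row i - 1.
fn edit_distance_two_rows(source: &[char], target: &[char]) -> u8 {
    let n = target.len();

    // `prev` starts as row 0 of the full matrix: 0, 1, ..., n.
    let mut prev: Vec<u8> = (0..=(n as u8)).collect();
    let mut curr: Vec<u8> = vec![0u8; n + 1];

    for (i, &source_char) in source.iter().enumerate() {
        // First column of row i + 1.
        curr[0] = i as u8 + 1;

        for j in 1..=n {
            let cost = if source_char == target[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1)
                .min(curr[j - 1] + 1)
                .min(prev[j - 1] + cost);
        }

        std::mem::swap(&mut prev, &mut curr);
    }

    // After the final swap, the last computed row lives in `prev`.
    prev[n]
}
```

The behavior is identical to the full-matrix version; only the memory footprint changes.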

Steps 2 + 3

Now that we can determine the edit distance between two words, we can perform a brute-force search. In this short example, we’re going to use sort_by_key to do this, since our data set is so small. If we were working with a larger dictionary (say, the entire English language), there would be a number of things we would need to do to reduce time and memory consumption.

let mut suggestions: Vec<(String, u8)> = words
    .into_iter()
    .filter_map(|possible_word| {
        let possible_chars: Vec<_> = possible_word.chars().collect();

        let dist = edit_distance(word_chars.as_slice(), &possible_chars);

        if dist <= 2 {
            Some((possible_word, dist))
        } else {
            None
        }
    })
    .collect();

suggestions.sort_by_key(|(_, d)| *d);

println!("Possible alternatives: ");

suggestions.iter().for_each(|(s, _)| println!("- {}", s));
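Note that this snippet keeps every candidate within distance two, while step 3 called for only the three closest words. Since the vector is already sorted by distance, a small helper (hypothetical, not part of the original snippet) can cap the list:

```rust
// Caps a sorted suggestion list at `limit` entries, per step 3.
// Assumes the input is already sorted by edit distance, as above.
fn top_suggestions(mut suggestions: Vec<(String, u8)>, limit: usize) -> Vec<String> {
    suggestions.truncate(limit);
    suggestions.into_iter().map(|(word, _)| word).collect()
}
```

With our six-word dictionary this rarely matters, but against a real dictionary a misspelling can match dozens of candidates.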

If we run the whole program again, we get an output something like:

Are you sure you meant to spell "thw" that way?
Possible alternatives:
- the

That looks pretty good to me!

If you would like to look at the whole program, and maybe try out your own inputs, go right on ahead.