This is part three in a three-part series. Go to part one.
Expression rules (or more commonly, ExprLinters) are Harper rules that use declarative expressions to find and fix grammatical errors.
They're halfway between a "phrase correction" and manually implementing Linter.
Make sure you properly set up your environment.
Before we get started, let's take a look at the ExprLinter trait.
Here's what it looks like at the time of writing this post.
/// A trait that searches for tokens that fulfil [`Expr`]s in a [`Document`].
///
/// Makes use of [`TokenStringExt::iter_chunks`] to avoid matching across sentence or clause
/// boundaries.
#[blanket(derive(Box))]
pub trait ExprLinter: LSend {
    /// A simple getter for the expression you want Harper to search for.
    fn expr(&self) -> &dyn Expr;
    /// If any portions of a [`Document`] match [`Self::expr`], they are passed through [`ExprLinter::match_to_lint`] to be
    /// transformed into a [`Lint`] for editor consumption.
    ///
    /// This function may return `None` to elect _not_ to produce a lint.
    fn match_to_lint(&self, matched_tokens: &[Token], source: &[char]) -> Option<Lint>;
    /// A user-facing description of what kinds of grammatical errors this rule looks for.
    /// It is usually shown in settings menus.
    fn description(&self) -> &str;
}
The structure of the trait reveals some of the behind-the-scenes work Harper is doing for you. There are three phases:
1. Harper calls expr to get your Expr.
2. It will iterate through the document, looking for token sequences that match your expression.
3. Each match is passed to match_to_lint. From there, you can perform optional additional validation to confirm that the tokens really do represent a grammatical error. If they don't, return None. Otherwise, return a Lint with any suggestions that may fix the problem.
Now that we've reviewed the essentials, let's implement an ExprLinter.
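As a mental model for those three phases, here is a self-contained toy sketch. Everything in it (Tag, ToyLint, ToyExprLinter, run) is a hypothetical stand-in invented for illustration; it is not Harper's real API, and Harper's actual matching is considerably more capable than a fixed tag sequence.

```rust
// Toy model only: simplified stand-ins, NOT Harper's real types or API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    Adj,
    Noun,
    Other,
}

#[derive(Debug, PartialEq)]
struct ToyLint {
    start: usize,
    end: usize,
    message: String,
}

trait ToyExprLinter {
    /// Phase 1: hand over the pattern to search for (here, a fixed tag sequence).
    fn pattern(&self) -> &[Tag];
    /// Phase 3: turn a match into a lint, or return `None` to veto it.
    fn match_to_lint(&self, matched: &[Tag], start: usize) -> Option<ToyLint>;
}

/// Phase 2: slide over the document and build a lint for every match.
fn run(linter: &dyn ToyExprLinter, doc: &[Tag]) -> Vec<ToyLint> {
    let pat = linter.pattern();
    if pat.is_empty() {
        // `windows(0)` would panic, so an empty pattern matches nothing.
        return Vec::new();
    }
    doc.windows(pat.len())
        .enumerate()
        .filter(|(_, window)| *window == pat)
        .filter_map(|(i, window)| linter.match_to_lint(window, i))
        .collect()
}

static ADJ_NOUN: [Tag; 2] = [Tag::Adj, Tag::Noun];

/// A hypothetical rule that flags every adjective directly followed by a noun.
struct AdjNoun;

impl ToyExprLinter for AdjNoun {
    fn pattern(&self) -> &[Tag] {
        &ADJ_NOUN
    }

    fn match_to_lint(&self, matched: &[Tag], start: usize) -> Option<ToyLint> {
        Some(ToyLint {
            start,
            end: start + matched.len(),
            message: "adjective followed by noun".to_owned(),
        })
    }
}

fn main() {
    let doc = [Tag::Other, Tag::Adj, Tag::Noun, Tag::Other];
    println!("{:?}", run(&AdjNoun, &doc));
}
```

The rule author only writes the pattern and the match-to-lint step; the search loop in the middle is the part the framework owns.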
Before we can write a single line of code, we need a grammatical rule of interest. I'm going to pay a visit to the Harper issue board.
After looking through a few options, I think #1513 is a good candidate. We are looking for missing prepositions between an adjective and a subject.
To get started, we'll create a file under harper-core/src/linting called missing_preposition.rs and add it to the parent Rust module.
I'll paste the template into the file:
pub struct MissingPreposition {
    expr: Box<dyn Expr>,
}

impl Default for MissingPreposition {
    fn default() -> Self {
        let expr = todo!();
        Self {
            expr: Box::new(expr),
        }
    }
}

impl ExprLinter for MissingPreposition {
    fn expr(&self) -> &dyn Expr {
        self.expr.as_ref()
    }

    fn match_to_lint(&self, matched_tokens: &[Token], _source: &[char]) -> Option<Lint> {
        unimplemented!()
    }

    fn description(&self) -> &'static str {
        unimplemented!()
    }
}
I like to start by building out a few test cases before working on the actual code. We get some for free from the GitHub issue:
#[test]
fn fixes_issue_1513() {
    assert_lint_count(
        "The city is famous its beaches.",
        MissingPreposition::default(),
        1,
    );
    assert_lint_count(
        "The students are interested learning.",
        MissingPreposition::default(),
        1,
    );
}
Obviously, these tests will fail if we try to run cargo test, but at this point you should do so anyway to make sure your toolchain is working.
The heart of this grammatical rule is the Expr (pronounced expression).
There are a number of ways to go about making one of these.
The simplest (and most common by far) is to put together a SequenceExpr.
In our case, we're looking for missing prepositions between an adjective and a noun. A good expression to start with could look like:
impl Default for MissingPreposition {
    fn default() -> Self {
        let expr = SequenceExpr::default()
            .then(UPOSSet::new(&[UPOS::ADJ]))
            .t_ws()
            .then(UPOSSet::new(&[UPOS::NOUN, UPOS::PRON, UPOS::PROPN]));
        Self {
            expr: Box::new(expr),
        }
    }
}
We're using a UPOSSet here, which is another kind of Expr that looks for specific parts of speech.
The name derives from the Universal Dependencies tag system.
Any tokens tagged with any of the options we've provided to the UPOSSet will match.
However, it's easy to create an example that this expression matches but that doesn't contain a grammatical error. We call this a false positive. Let's write one and add it to our test suite.
#[test]
fn allows_terrible_stuff() {
    assert_no_lints(
        "Either it was terrible stuff or the whiskey distorted things.",
        MissingPreposition::default(),
    );
}
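To see why the first expression over-matches, here is a small standalone sketch. It is a toy model: the Upos enum and both helper functions are hypothetical, whitespace tokens are left out, and the tags are assigned by hand rather than produced by Harper's tagger. It only illustrates the "any tag in the set matches" behaviour described above.

```rust
// Toy model: hand-assigned tags, NOT the output of Harper's tagger.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Upos {
    Adj,
    Noun,
    Pron,
    Propn,
    Aux,
    Det,
    Other,
}

/// A tag matches a set if it equals any member of that set.
fn in_set(tag: Upos, set: &[Upos]) -> bool {
    set.contains(&tag)
}

/// Index of the first adjective directly followed by a noun, pronoun, or proper noun.
fn first_adj_nominal(tags: &[Upos]) -> Option<usize> {
    const NOMINAL: [Upos; 3] = [Upos::Noun, Upos::Pron, Upos::Propn];
    tags.windows(2)
        .position(|w| in_set(w[0], &[Upos::Adj]) && in_set(w[1], &NOMINAL))
}

fn main() {
    use Upos::*;
    // "The city is famous its beaches." (a real error)
    let error = [Det, Noun, Aux, Adj, Pron, Noun];
    // "Either it was terrible stuff or the whiskey distorted things." (no error)
    let fine = [Other, Pron, Aux, Adj, Noun, Other, Det, Noun, Other, Noun];
    // Both sentences contain an adjective followed by a nominal, so both match.
    println!("{:?} {:?}", first_adj_nominal(&error), first_adj_nominal(&fine));
}
```

Under these hand-assigned tags, "famous its" and "terrible stuff" look identical to the pattern, which is why the expression needs more surrounding context before it is useful.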
From here, you should use your brain to continuously refine the expression into something that maintains a low false-positive rate while remaining useful. Here's what I settled on:
impl Default for MissingPreposition {
    fn default() -> Self {
        let expr = SequenceExpr::default()
            .then(
                AnchorStart.or(SequenceExpr::default()
                    .then(UPOSSet::new(&[UPOS::DET]))
                    .t_ws()),
            )
            .then(UPOSSet::new(&[UPOS::NOUN, UPOS::PRON, UPOS::PROPN]))
            .t_ws()
            .then(UPOSSet::new(&[UPOS::AUX]))
            .t_ws()
            .then(UPOSSet::new(&[UPOS::ADJ]))
            .t_ws()
            .then(UPOSSet::new(&[UPOS::NOUN, UPOS::PRON, UPOS::PROPN]))
            .then_optional(AnyPattern)
            .then_optional(AnyPattern);
        Self {
            expr: Box::new(expr),
        }
    }
}
Now that we have an effective expression as a base, let's fill out the remaining methods.
I found that checking for an adposition reduced the false-positive rate, and it was easiest to add it to the match_to_lint function.
impl ExprLinter for MissingPreposition {
    fn expr(&self) -> &dyn Expr {
        self.expr.as_ref()
    }

    fn match_to_lint(&self, matched_tokens: &[Token], _source: &[char]) -> Option<Lint> {
        if matched_tokens.last()?.kind.is_upos(UPOS::ADP) {
            return None;
        }

        Some({
            Lint {
                span: matched_tokens[2..4].span()?,
                lint_kind: LintKind::Miscellaneous,
                suggestions: vec![],
                message: "You may be missing a preposition here.".to_owned(),
                priority: 31,
            }
        })
    }

    fn description(&self) -> &'static str {
        "Locates potentially missing prepositions."
    }
}
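The adposition check is easier to reason about in isolation. The following is a toy sketch rather than Harper code: the Upos enum and toy_match_to_lint are hypothetical, and the tag sequences are written by hand. It mirrors only the veto step: if the match ends on an adposition, no lint is produced.

```rust
// Toy model of the veto step only; these names are NOT part of Harper.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Upos {
    Noun,
    Aux,
    Adj,
    Adp,
    Space,
}

/// Returns a message for a suspicious match, or `None` to stay silent.
fn toy_match_to_lint(matched: &[Upos]) -> Option<&'static str> {
    // `last()?` bails out with `None` when the match is empty.
    if *matched.last()? == Upos::Adp {
        // The match ran on into an adposition, so assume a preposition is present.
        return None;
    }
    Some("You may be missing a preposition here.")
}

fn main() {
    use Upos::*;
    // Noun, auxiliary, adjective, noun: nothing follows, so the lint fires.
    let bare = [Noun, Space, Aux, Space, Adj, Space, Noun];
    // Same shape, but the optional trailing tokens picked up " of", " for", etc.
    let followed = [Noun, Space, Aux, Space, Adj, Space, Noun, Space, Adp];
    println!("{:?}", toy_match_to_lint(&bare));
    println!("{:?}", toy_match_to_lint(&followed));
}
```

This is why the expression ends with two optional catch-all tokens: they let the match reach one word past the final noun so that match_to_lint has something to inspect.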
That's it! We've written our rule.
Don't forget to register your rule and add some more tests before opening a PR. Make sure you take a look at the pull request to see the finished rule.