Transformation-based Learning for POS Tagging


Harper is currently undergoing some pretty radical changes to its language analysis. These changes will improve the output of our existing rule engine, in addition to making entirely new corrections possible. This post will cover our existing NLP pipeline, the recent changes and improvements to our machine learning approach, and what comes next.

While AI is a common topic of discussion online, I don’t hear much about actual machine learning. In that light, I hope this post piques someone’s interest.

What is POS Tagging?

POS (Part-of-speech) tagging is the first step of most NLP (Natural Language Processing) pipelines. For any grammar checker worth its salt, POS tagging is essential. Apart from the basic corrections you’re capable of doing with simple string manipulation, most grammar checking directly or indirectly depends on POS tagging. High-quality tagging results in high-quality suggestions.

What is POS tagging? It is the process of identifying which possible definition of a word is being used, based on the surrounding context. For those unfamiliar with the territory, I’m certain an example is the best way to explain.

“I am going to go tan in the sun.”

Here we have a simple English sentence. In this case, it is clear the word “tan” is being used as a verb. The linguists in the audience would point out that it is specifically in the first-person future tense. Consider this similar sentence:

“I am already very tan, so I will stay inside.”

In this sentence, the word “tan” is being used as an adjective. How can we tell?
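For reference, here is roughly what a tagger should produce for these two sentences, using the Universal Dependencies (UPOS) tag set (a hand-tagged illustration, not tool output; note the differing tags on “tan”):

I/PRON am/AUX going/VERB to/PART go/VERB tan/VERB in/ADP the/DET sun/NOUN
I/PRON am/AUX already/ADV very/ADV tan/ADJ ...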

As intelligent humans, some of whom have been speaking English their entire lives, it is easy for us to determine which words are serving which roles. It’s not as easy for a computer to do the same. From an algorithmic standpoint, there are a number of ways to go about it, each with differing levels of “machine learning” required.

Before this week, Harper primarily took a dictionary-based approach. In short: we ship a “dictionary” of English words to the user’s machine and use hash table lookups to determine the possible roles each word could assume. The authors of our rule engine could then use rudimentary deductive reasoning to narrow the possibilities down. This strategy is remarkably effective, and it has scaled to tens of thousands of users with surprisingly few hiccups.
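To make that concrete, here is a minimal sketch of the idea (the dictionary contents are illustrative; Harper’s real dictionary is far larger and stores richer metadata than bare tag names):

use std::collections::HashMap;

fn main() {
    // A toy "dictionary" mapping words to their possible POS tags.
    let mut dictionary: HashMap<&str, Vec<&str>> = HashMap::new();
    dictionary.insert("tan", vec!["VERB", "ADJ", "NOUN"]);

    // One hash-table probe yields every role the word could play; the rule
    // engine then uses context to narrow the list down.
    if let Some(tags) = dictionary.get("tan") {
        println!("possible tags for \"tan\": {tags:?}");
    }
}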

That said, there are edge cases and systems (which I’ll cover next week when I discuss chunking) that require extreme specificity from POS tags. My mission: improve our POS tagging to increase the confidence of Harper’s output and open the door for more advanced algorithms.

Why Transformation-based Learning?

The literature highlights four underlying machine learning model strategies that seem to work well for POS tagging.

  • Hidden Markov Models (probabilistic models that predate the modern deep learning era)
  • Maximum Entropy Models (statistical models closely related to logistic regression)
  • Transformer-based Models (which use deep neural networks)
  • Transformation-based Rule Models (which are based on learned rules)

While I heavily considered the alternatives, particularly a transformer-based neural network (as well as an HMM or MEM), I discarded them for three reasons.

  • TRMs are typically more accurate (barely; measured in basis points).
  • TRMs are more amenable to fine-tuning.
  • TRMs are exceptionally low-latency and can be compressed quite small.

Transformation-based learning is remarkably simple. It boils down to just four steps:

  • Use a simple, stochastic model to label your data. This can be as simple as tagging each token (or other discrete component) with that variant’s most common tag. It doesn’t need to be super accurate, just enough to establish a baseline.
  • Identify the errors between the tags in your canonical data and those produced by your baseline model.
  • Using a finite list of human-defined templates, generate candidate rules that transform the output of the baseline model into something else. This is where the term “transformation-based” comes from.
  • Apply each of the candidate rules to the baseline model’s output. Check if the result is more accurate than before. If so, save the rule for future use.

These saved candidates become your model.
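As a sketch, the whole training loop might look something like this (simplified: tags are plain strings, and each candidate carries a bare match function rather than the structured criteria Harper actually uses, which appear later in this post):

/// One candidate transformation: if `matches` fires at a position whose
/// current tag is `from`, retag it as `to`.
struct Patch {
    from: &'static str,
    to: &'static str,
    matches: fn(&[&'static str], usize) -> bool,
}

/// How many positions disagree with the gold-standard tags?
fn count_errors(tags: &[&str], gold: &[&str]) -> usize {
    tags.iter().zip(gold).filter(|(a, b)| a != b).count()
}

/// Retag every position where the patch's tag and criteria both match.
fn apply(patch: &Patch, tags: &[&'static str]) -> Vec<&'static str> {
    (0..tags.len())
        .map(|i| {
            if tags[i] == patch.from && (patch.matches)(tags, i) {
                patch.to
            } else {
                tags[i]
            }
        })
        .collect()
}

/// Keep every candidate that strictly lowers the error count against the
/// gold tags; the survivors, in order, are the finished model.
fn train(gold: &[&str], mut tags: Vec<&'static str>, candidates: Vec<Patch>) -> Vec<Patch> {
    let mut model = Vec::new();
    for patch in candidates {
        let patched = apply(&patch, &tags);
        if count_errors(&patched, gold) < count_errors(&tags, gold) {
            tags = patched;
            model.push(patch);
        }
    }
    model
}

The classic Brill learner is greedier than this one-pass version, repeatedly rescoring all candidates and applying the single best rule each round, but the shape of the loop is the same.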

POS Tagging using Transformation-based Learning

Let’s apply these steps to build a POS-tagging system.

For our baseline model, we will just assign each word in our dataset the most common POS tag associated with that word. If the word is “tan”, we’ll assign its most common POS tag (verb). It will often be incorrect, but those cases will be handled by our rules.
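Building that baseline is just a frequency count over the training data. A minimal sketch (again with tags as plain strings):

use std::collections::HashMap;

/// Build the baseline model: map each word to the tag it carries most
/// often in the training data.
fn most_common_tags(training: &[(&str, &str)]) -> HashMap<String, String> {
    // Count how often each (word, tag) pair occurs.
    let mut counts: HashMap<(&str, &str), usize> = HashMap::new();
    for &(word, tag) in training {
        *counts.entry((word, tag)).or_insert(0) += 1;
    }

    // Keep only the highest-count tag per word.
    let mut best: HashMap<String, (String, usize)> = HashMap::new();
    for ((word, tag), n) in counts {
        let entry = best.entry(word.to_string()).or_insert((tag.to_string(), 0));
        if n > entry.1 {
            *entry = (tag.to_string(), n);
        }
    }

    best.into_iter().map(|(word, (tag, _))| (word, tag)).collect()
}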

To identify the baseline model’s errors, we’ll use an off-the-shelf treebank from the Universal Dependencies project.
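For reference, UD treebanks ship in the CoNLL-U format: one token per line, ten tab-separated columns, with the word form in column 2 and the UPOS tag in column 4. A sketch of pulling out (word, tag) pairs, skipping comments and multiword-token ranges:

/// Extract (form, UPOS) pairs from CoNLL-U text.
fn parse_conllu(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            let cols: Vec<&str> = l.split('\t').collect();
            // Column 1 is the token ID; ranges like "3-4" mark multiword
            // tokens, which we skip in this sketch.
            if cols.len() < 4 || cols[0].contains('-') {
                return None;
            }
            Some((cols[1].to_string(), cols[3].to_string()))
        })
        .collect()
}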

Our rule templates take the form of this PatchCriteria enum. By instantiating a candidate from any one of these variants and initializing its child fields to random values, we can cover a good number of cases.

#[derive(Debug, Clone, Serialize, Deserialize, Hash, PartialEq, Eq)]
pub enum PatchCriteria {
    /// The token at a relative offset carries a particular tag.
    WordIsTaggedWith {
        /// Which token to inspect.
        relative: isize,
        is_tagged: UPOS,
    },
    /// Any token between the current one and `max_relative` carries a
    /// particular tag.
    AnyWordIsTaggedWith {
        /// The farthest relative index to look.
        max_relative: isize,
        is_tagged: UPOS,
    },
    /// The tokens immediately before and after carry particular tags.
    SandwichTaggedWith {
        prev_word_tagged: UPOS,
        post_word_tagged: UPOS,
    },
    /// The token at a relative offset is a specific word.
    WordIs {
        relative: isize,
        word: String,
    },
    /// Not applicable to the Brill Tagger, only the chunker.
    NounPhraseAt {
        is_np: bool,
        relative: isize,
    },
    /// Both child criteria must hold.
    Combined {
        a: Box<PatchCriteria>,
        b: Box<PatchCriteria>,
    },
}
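Applying a candidate then amounts to walking the token stream and testing its criteria at each position. Here is roughly how that evaluation could work; this is a sketch, not Harper’s actual implementation, and it assumes parallel slices of words and their current tags:

impl PatchCriteria {
    /// Does this criterion hold for the token at `index`? Positions that
    /// fall outside the sentence simply fail to match.
    pub fn matches(&self, words: &[&str], tags: &[UPOS], index: usize) -> bool {
        // Resolve a relative offset to an absolute index, if in bounds.
        let at = |relative: isize| -> Option<usize> {
            index
                .checked_add_signed(relative)
                .filter(|&i| i < tags.len())
        };

        match self {
            Self::WordIsTaggedWith { relative, is_tagged } => {
                at(*relative).is_some_and(|i| tags[i] == *is_tagged)
            }
            Self::AnyWordIsTaggedWith { max_relative, is_tagged } => {
                // Check every offset between this token and `max_relative`.
                let (lo, hi) = if *max_relative < 0 {
                    (*max_relative, -1)
                } else {
                    (1, *max_relative)
                };
                (lo..=hi).any(|rel| at(rel).is_some_and(|i| tags[i] == *is_tagged))
            }
            Self::SandwichTaggedWith { prev_word_tagged, post_word_tagged } => {
                at(-1).is_some_and(|i| tags[i] == *prev_word_tagged)
                    && at(1).is_some_and(|i| tags[i] == *post_word_tagged)
            }
            Self::WordIs { relative, word } => {
                at(*relative).is_some_and(|i| words[i] == word.as_str())
            }
            // Noun-phrase criteria belong to the chunker, not the tagger.
            Self::NounPhraseAt { .. } => false,
            Self::Combined { a, b } => {
                a.matches(words, tags, index) && b.matches(words, tags, index)
            }
        }
    }
}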

Finally, we’ll apply each of the hundreds of thousands of candidates to our treebank to see if the result of their transformations has a lower error rate than the baseline.

Here are a couple of the candidates we found. Read the first as: if the word “that” is currently tagged PRON and the previous token is tagged VERB, retag it as SCONJ:

[
  {
    "from": "PRON",
    "to": "SCONJ",
    "criteria": {
      "Combined": {
        "a": {
          "WordIs": {
            "relative": 0,
            "word": "that"
          }
        },
        "b": {
          "WordIsTaggedWith": {
            "relative": -1,
            "is_tagged": "VERB"
          }
        }
      }
    }
  },
  {
    "from": "PART",
    "to": "ADP",
    "criteria": {
      "Combined": {
        "a": {
          "WordIs": {
            "relative": 1,
            "word": "there"
          }
        },
        "b": {
          "AnyWordIsTaggedWith": {
            "max_relative": -4,
            "is_tagged": "NOUN"
          }
        }
      }
    }
  }
]

That’s the whole process! With it, I was able to bring our tagging accuracy all the way up to 95% (from the previous 40%) without a meaningful change in linting latency or compiled binary size.