Transformation-based Learning for POS Tagging


Harper is currently undergoing some pretty radical changes when it comes to its language analysis. These changes will improve the output of our existing rule engine, in addition to making entirely new corrections possible. This post will cover our existing NLP pipeline, the recent changes and improvements to our machine learning approach, and what will come next.

While AI is a common topic of discussion online, I don’t hear much about actual machine learning. In that light, I hope this post piques someone’s interest.

What is POS Tagging?

POS (Part-of-speech) tagging is the first step of most NLP (Natural Language Processing) pipelines. For any grammar checker worth its salt, POS tagging is essential. Apart from the basic corrections you’re capable of doing with simple string manipulation, most grammar checking directly or indirectly depends on POS tagging. High-quality tagging results in high-quality suggestions.

What is POS tagging? It is the process of identifying which possible definition of a word is being used, based on the surrounding context. For those unfamiliar with the territory, I’m certain an example is the best way to explain.

“I am going to go tan in the sun.”

Here we have a simple English sentence. In this case, it is clear the word “tan” is being used as a verb. The linguists in the audience would point out that it is specifically in the first-person future tense. Consider this similar sentence:

“I am already very tan, so I will stay inside.”

In this sentence, the word “tan” is being used as an adjective. How can we tell?

As intelligent humans, some of whom have been speaking English their entire lives, it is easy for us to determine which words are serving which roles. It’s not as easy for a computer to do the same. From an algorithmic standpoint, there are a number of ways to go about it, each with differing levels of “machine learning” required.

Before this week, Harper primarily took a dictionary-based approach. In short: we ship a “dictionary” of English words to the user’s machine and use hash table lookups to determine the possible roles each word could assume. The authors of our rule engine could then use rudimentary deductive reasoning to narrow the possibilities down. This strategy is remarkably effective, and it has scaled to tens of thousands of users with surprisingly few hiccups.
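In spirit, the dictionary-based approach looks something like the following sketch. This is illustrative Python, not Harper’s actual Rust internals, and the lexicon entries here are made up for the example:

```python
# Illustrative sketch of a dictionary-based tagger (not Harper's real code).
# Each word maps to the set of roles it could play; a rule engine then
# narrows the candidates down using surrounding context.
LEXICON = {
    "tan": {"VERB", "ADJ", "NOUN"},
    "the": {"DET"},
    "sun": {"NOUN", "PROPN"},
}

def possible_tags(word: str) -> set[str]:
    """Hash-table lookup of every role a word could assume."""
    return LEXICON.get(word.lower(), set())
```

The lookup itself is cheap; the hard part is the deductive step that picks one tag from the returned set, which is exactly where better tagging pays off.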

That said, there are edge cases and systems (which I’ll cover next week when I discuss chunking) which require extreme specificity from POS tags. My mission: improve our POS tagging to increase the confidence of Harper’s output and open the door for more advanced algorithms.

Why Transformation-based Learning?

The literature highlights three underlying model strategies that seem to work well for POS tagging.

  • Hidden Markov Models (probabilistic sequence models over tag transitions)
  • Maximum Entropy Models (log-linear classifiers, effectively shallow neural networks)
  • Transformation-based Rule Models (which are based on learned rules)

While I heavily considered using a statistical model (either an HMM or an MEM), I discarded the technology for three reasons.

  • TRMs are typically more accurate (barely; measured in basis points).
  • TRMs are more amenable to fine-tuning.
  • TRMs are exceptionally low-latency and can be compressed quite small.

Building the Model

I took a supervised learning approach here, making use of open-source datasets from organizations like Universal Dependencies, since step one in any such endeavor is to obtain the data. Conveniently, these datasets include pre-tagged corpora, which I ingested easily using rs_conllu.
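The CoNLL-U format these corpora ship in is simple enough that a minimal reader fits in a few lines. This Python sketch is only a stand-in for what the rs_conllu crate handles (the format’s second column is the word form and the fourth is the universal POS tag):

```python
# Minimal CoNLL-U reader: a stand-in sketch, not the rs_conllu crate itself.
# Non-comment lines are tab-separated; blank lines separate sentences.
def read_conllu(text: str) -> list[list[tuple[str, str]]]:
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):          # comment lines carry metadata
            continue
        if not line.strip():              # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[1], cols[3]))  # (word form, UPOS tag)
    if current:
        sentences.append(current)
    return sentences
```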

Once that was done, I created a benchmark for Harper’s existing POS tagger. I found that it scored about 40% accuracy when 100% certainty was required. When lower levels of certainty were acceptable, I found it performed a bit better. Either way, there was plenty of room for improvement.
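The heart of such a benchmark is just token-level accuracy against the corpus’s gold tags. A sketch (not the actual benchmark harness):

```python
# Token-level accuracy: the fraction of predicted tags that match the
# gold-standard tags from the corpus.
def accuracy(predicted: list[str], gold: list[str]) -> float:
    assert len(predicted) == len(gold), "tag sequences must align"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```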

Transformation-based learning is remarkably simple.

  • Provide a base pass over the data using a simple learning technique. In our case, we assign the most common POS tag for a given word from the corpus. “Tan”, for example, might be most frequently used as a verb, so we’ll start by tagging it as such.
  • Generate a list of “patch rules” for the data. In a nutshell, these are simple criteria paired with POS transitions. For example: each time we see a token marked as an adposition sandwiched between a noun and a verb, mark it as a subordinating conjunction instead.
  • Apply each of these patch rules over the base pass and check whether the tagger’s performance improves. If so, add it to an ongoing list of “winners”.
  • Loop steps 2 and 3 until you reach a satisfying level of performance.
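The steps above can be sketched in miniature. This toy Python version assumes a single hypothetical rule shape (“if the previous tag is X and the current tag is Y, retag to Z”); real transformation-based (Brill-style) taggers draw candidates from a richer set of rule templates:

```python
# Toy transformation-based learning loop (Brill-style), assuming one
# hypothetical rule template: previous tag + current tag -> new tag.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    prev: str  # tag of the preceding token
    frm: str   # current (possibly wrong) tag
    to: str    # replacement tag

def apply_rule(rule: Rule, tags: list[str]) -> list[str]:
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == rule.frm and out[i - 1] == rule.prev:
            out[i] = rule.to
    return out

def errors(tags: list[str], gold: list[str]) -> int:
    return sum(t != g for t, g in zip(tags, gold))

def train(tags: list[str], gold: list[str]) -> list[Rule]:
    """Greedily keep every candidate patch rule that reduces total error."""
    winners = []
    improved = True
    while improved:
        improved = False
        # Candidate rules are drawn from the remaining mistakes (step 2),
        # then kept only if they lower the error count (step 3).
        for i in range(1, len(tags)):
            if tags[i] == gold[i]:
                continue
            rule = Rule(prev=tags[i - 1], frm=tags[i], to=gold[i])
            patched = apply_rule(rule, tags)
            if errors(patched, gold) < errors(tags, gold):
                winners.append(rule)
                tags = patched
                improved = True
    return winners
```

Starting from the base pass `["PRON", "AUX", "VERB"]` for a sentence whose gold tags are `["PRON", "AUX", "ADJ"]` (the “already very tan” case, where the most-frequent-tag baseline guessed verb), the loop learns a single rule retagging a verb after an auxiliary as an adjective.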

That’s the whole process! With it, I was able to bring our previous accuracy all the way up to 95% (from 40%) without a meaningful change in linting latency or compiled binary size.