Continuations on Transformation-based Learning

Steps near Clear Creek. Taken by me.

The most common type of machine learning out there takes the form of some kind of neural network. Inspired by how our own brains work, these systems act as function approximators. They are great, but they come with a few key pitfalls.

First and foremost, they start out with very little baked-in understanding of the context they live in. This is fine; usually enough data can be provided to bridge the gap. It does mean, however, that they spend an inordinate amount of time learning the fundamentals of their field. This translates to a larger model size and longer inference time (especially at the edge).

Secondly, most neural networks are initialized with randomness, which results in extremely high entropy. High entropy means that these models cannot be compressed easily (if at all).

This disqualifies neural networks from many aspects of Harper’s architecture. Harper tries to be fast and small, so it can be shipped and run wherever our users are. Neural networks (especially in the world of natural language processing) are neither fast nor small.

This is why we’ve taken an alternative approach to machine learning, as evidenced by last week’s post on transformation-based learning.

Transformation-Based Learning: A Refresher

Transformation-based learning is remarkably simple. It boils down to just four steps:

  • Use a simple, stochastic model to label your data. This can be as simple as tagging each token (or other discrete component) with that variant’s most common tag. It doesn’t need to be super accurate, just enough to establish a baseline.
  • Identify the errors by comparing the tags in your canonical data with those produced by your baseline model.
  • Using a finite list of human-defined templates, generate candidate rules that transform the output of the baseline model into something else. This is where the term “transformation-based” comes from.
  • Apply each of the candidate rules to the baseline model’s output. Check if the result is more accurate than before. If so, save the rule for future use.

These saved candidates become your model.
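To make the loop concrete, here’s a minimal sketch of the training procedure in Rust. All of the names here (`Tag`, `Rule`, `train`) are hypothetical stand-ins rather than Harper’s actual types; the point is just the shape of steps two through four.

```rust
/// A minimal sketch of the transformation-based training loop described
/// above. Every name here is an illustrative stand-in, not Harper's API.
#[derive(Clone, PartialEq)]
struct Tag(&'static str);

/// A candidate transformation: when `applies` matches a position, rewrite
/// that position's tag to `to`.
struct Rule {
    applies: fn(&[Tag], usize) -> bool,
    to: Tag,
}

impl Rule {
    fn apply(&self, tags: &mut [Tag]) {
        // Evaluate against a snapshot so rewrites earlier in this pass
        // don't change what later positions see.
        let snapshot = tags.to_vec();
        for i in 0..tags.len() {
            if (self.applies)(&snapshot, i) {
                tags[i] = self.to.clone();
            }
        }
    }
}

/// Fraction of predicted tags that agree with the canonical data.
fn accuracy(predicted: &[Tag], gold: &[Tag]) -> f64 {
    let correct = predicted.iter().zip(gold).filter(|(p, g)| p == g).count();
    correct as f64 / gold.len() as f64
}

/// Steps two through four: keep a candidate only if applying it makes the
/// current output agree with the gold tags more often than before.
fn train(mut tags: Vec<Tag>, gold: &[Tag], candidates: Vec<Rule>) -> Vec<Rule> {
    let mut model = Vec::new();
    for rule in candidates {
        let mut trial = tags.clone();
        rule.apply(&mut trial);
        if accuracy(&trial, gold) > accuracy(&tags, gold) {
            tags = trial; // The improved output becomes the new baseline.
            model.push(rule); // These saved candidates are the model.
        }
    }
    model
}
```

A fuller implementation would typically rank the surviving candidates by how much error each one removes and repeat the whole process until no rule helps, but the greedy filter above captures the core idea.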

If you’re interested in how this could be applied to POS tagging, I’ve since updated my original post on the subject to better explain the process. I’d recommend taking a look.

Nominal Phrase Chunking

It’s often useful, especially when building a grammar checker, to be able to identify the subjects and objects of sentences. Suppose, for example, that we want to insert the missing Oxford comma in a list of fruits: “I like apples, bananas and oranges”. In this trivial example, this can be done with POS tagging. If we have more complex subjects, like in the phrase “I like green apples, deliciously pernicious bananas and fresh oranges,” POS tagging starts to fall apart. Identifying multi-token subjects is the job of a nominal phrase chunker.
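For intuition, here’s roughly what the POS-only approach looks like. This is a hypothetical check using Penn-style tags, not Harper’s actual rule: scan for a window shaped like noun, comma, noun, conjunction, noun.

```rust
/// Hypothetical POS-only Oxford comma check: flag a three-item list that
/// is missing the comma before the conjunction. Tags are Penn-style.
fn missing_oxford_comma(pos_tags: &[&str]) -> bool {
    pos_tags
        .windows(5)
        .any(|w| matches!(w, ["NN" | "NNS", ",", "NN" | "NNS", "CC", "NN" | "NNS"]))
}

fn main() {
    // "I like apples, bananas and oranges" -> caught.
    assert!(missing_oxford_comma(&["PRP", "VBP", "NNS", ",", "NNS", "CC", "NNS"]));

    // "I like green apples, deliciously pernicious bananas and fresh
    // oranges" -> the adjectives and adverb break the five-tag window,
    // so the POS-only check misses it.
    assert!(!missing_oxford_comma(&[
        "PRP", "VBP", "JJ", "NNS", ",", "RB", "JJ", "NNS", "CC", "JJ", "NNS",
    ]));
}
```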

I’ve been wanting to build a nominal phrase chunker for a while, but haven’t had the tools to do so. Now that I have a pipeline in place (from last week), it should be relatively straightforward.

For the purposes of this model, we’ll be tagging each token with a boolean: it is either a member of a noun phrase, or it is not.

I started by assigning each token to a nominal phrase if a POS tagger marks it as a noun. This is our baseline model. It performs poorly because the resulting nominal phrases do not include determiners or adjectives.
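In code, the baseline is a one-liner. This is a sketch under the same hypothetical Penn-style tags as above, not Harper’s internals:

```rust
/// Baseline chunker sketch: a token is inside a nominal phrase only if
/// the POS tagger called it a noun (Penn-style: NN, NNS, NNP, NNPS).
fn baseline_chunks(pos_tags: &[&str]) -> Vec<bool> {
    pos_tags.iter().map(|tag| tag.starts_with("NN")).collect()
}
```

Running it on “the green apples” (tagged DT JJ NNS) yields `[false, false, true]`, which is exactly the weakness described above: the determiner and the adjective are left out of the phrase.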

Similar to our POS tagging model, I used a Universal Dependencies treebank to determine the accuracy of our baseline. After generating candidate rules using the same patch templates as the POS tagging system and running them against the treebank, I had a model with 90% accuracy.
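For a sense of what the learned rules look like, here’s one plausible transformation a patch template might generate for chunking. It’s a hypothetical example, not a rule the model actually learned: pull a determiner or adjective into the phrase when the token after it is already inside one.

```rust
/// Hypothetical learned transformation: absorb a determiner or adjective
/// into the nominal phrase that begins immediately after it. Applied
/// right-to-left so a chain like "the green apples" is absorbed in one pass.
fn absorb_left_modifiers(pos_tags: &[&str], in_phrase: &mut [bool]) {
    for i in (0..pos_tags.len().saturating_sub(1)).rev() {
        if matches!(pos_tags[i], "DT" | "JJ") && in_phrase[i + 1] {
            in_phrase[i] = true;
        }
    }
}
```

Starting from the baseline’s `[false, false, true]` for “the green apples”, one application of this rule turns it into `[true, true, true]`.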

I feel as though there should be more details to share, but that was pretty much it. I spent a good amount of time optimizing the training code. There’s still a lot of work left to do to incorporate it into the rest of Harper. I am also unsatisfied with the model’s current accuracy. To get it closer to 100%, I suspect I’ll need to do a good amount of data cleaning.