Markov Chains are the Original Language Models

[Image: an old car interior]

Heads Up: This article is a republished (with some tweaks to spelling, grammar, and layout) version of an article I wrote in my senior year of high school for my Linear Algebra class. As such, the publish date is not quite correct.

The AI Buzz is Boring Now

I’ve come to the conclusion that there are four stages to the current AI hype cycle in an individual person’s brain, at least as it pertains to large language models. At the very least, these are the stages I went through.

Stage One: Amazement

“Wow! This is so cool! I can converse with a computer just like a real person!”

This is where all the science fiction fantasies come to fruition. The possibilities seem endless. We can all kick back and relax now, right?

Stage Two: Frustration

“Hmm… This isn’t as effective as I originally thought.”

It seems like the brand-new technology is really only applicable to the kinds of work no one wants to do anyway. What it is able to do doesn’t provide much value to you, and it gets information and logic wrong often enough that it cannot be trusted for just about anything.

Stage Three: Confusion

After stage two, you start to forget about it. But the hype is inescapable. Your friends bring it up. Your parents ask you about it when you go home for the holidays. Even your dentist tries to extol its virtues.

Even if you moved on, no one else did. Could that mean that you were wrong?

Stage Four: Boredom

At this point, the rate at which new language models appear has become faster than the rate of new JavaScript frameworks (and just as annoying). You want to go back to your roots and start from scratch. You want the freedom of knowing the whole stack from start to finish. You don’t want any of the ineffective magic.

This is where I am right now. I want to go back to my roots. Some people work on old cars, even though they are less efficient. At the same time, though, they are more fun to work on than new cars. I’ve decided to look into Markov chains.

Markov Chains

Below is a demonstration of my implementation of auto-completion using Markov chains.

Though it is written in Rust and compiled to WebAssembly, it is not particularly efficient. To find out why, continue down the page to my detailed explanation of the implementation.

Controls

You may use either “Choose Word” or your right arrow key [→] to let the system choose the next word. Alternatively, you can tap any of the [Possible Next Words] to do so yourself.

Explanation

Markov chains, named after their inventor, Andrey Markov, are often used to model sequences of probabilistic events: systems that cannot be modeled deterministically.

Example

Alice is at the grocery store. For every hour she is there, she has a 70% chance of leaving and going to the planetarium. Conversely, she has a 30% chance of staying. If Alice is already at the planetarium, she has a 10% chance of leaving and going to the grocery store and a 90% chance of staying. We can represent these probabilities as a table, where each column belongs to a start location, and each row belongs to an end location:

                       Start at Grocery Store   Start at Planetarium
End at Grocery Store   30%                      10%
End at Planetarium     70%                      90%

If we already know Alice’s location for sure, we can simply perform table lookups to predict her most likely next move. For example, suppose we know she is at the grocery store right now. By looking at row 2, column 1, we can be 70% confident she will be at the planetarium next hour. However, this doesn’t work if we aren’t sure of her current location, or if we want to predict more than one hour in advance. In the first case, we might express her current location as another table.

Location        % Alice Present
Grocery Store   25%
Planetarium     75%

How do we estimate Alice’s location in this new plane of possibility? In particular, how likely is it that Alice will be at the planetarium next hour? Since there is a 25% probability Alice is at the grocery store, we multiply that by the probability of her transitioning to the planetarium: $25\% * 70\%$. Next, we add the result to the probability of her being at the planetarium multiplied by the probability of her staying: $75\% * 90\%$. In full, $25\% * 70\% + 75\% * 90\% = 85\%$. To see the probabilities as a table:

Next Location   Calculation                   % Alice Present
Grocery Store   $25\% * 30\% + 75\% * 10\%$   15%
Planetarium     $25\% * 70\% + 75\% * 90\%$   85%

The keen-eyed among you may have noticed that these operations look a lot like matrix multiplication. Instead of a table, we may represent these possible transitions as a matrix $T$, and Alice’s current location as a vector $\vec{s}$.

$$T = \begin{bmatrix} 0.3 & 0.1 \\ 0.7 & 0.9 \end{bmatrix} \qquad \vec{s} = \begin{bmatrix} 0.25 \\ 0.75 \end{bmatrix}$$

Note: The location of each element remains the same as in the table, even if we aren’t explicitly labeling the rows and columns.

Finding the next state vector becomes as easy as multiplying the current location vector $\vec{s}$ by $T$. To look further hours into the future, we do it more than once. For example, to estimate three hours in the future: $TTT\vec{s}$. We can condense this with an exponent, $T^3\vec{s}$, or generalize it to $n$ hours with $T^n\vec{s}$.
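As a rough sketch (this is not the code behind the demo above, and the names are mine), the whole update is a plain matrix-vector multiply:

```rust
// Advance the belief state one step: multiply the transition matrix T
// by the current state vector s.
fn step(t: [[f64; 2]; 2], s: [f64; 2]) -> [f64; 2] {
    [
        t[0][0] * s[0] + t[0][1] * s[1],
        t[1][0] * s[0] + t[1][1] * s[1],
    ]
}

fn main() {
    // Columns are start locations, rows are end locations.
    let t = [[0.3, 0.1], [0.7, 0.9]];
    // 25% grocery store, 75% planetarium.
    let mut s = [0.25, 0.75];
    s = step(t, s);
    // One hour ahead: matches the 15% / 85% worked out above.
    println!("{:?}", s);
    // n hours ahead is just repeated application of T.
    for _ in 0..2 {
        s = step(t, s);
    }
    println!("{:?}", s); // three hours ahead, i.e. T^3 * s
}
```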

Application to Text-Completion

The principles above can be applied to a variety of probabilistic situations. Most relevant to this particular webpage is text completion. We want to suggest the most likely next word to the user. Given the last word, what are the most likely next words? First, we need a dictionary.

The Dictionary

It is trivial to build a dictionary from sample text. For the purposes of the explanation, we are going to start with an arbitrary dictionary.

Index   Word
0       orange
1       fruit
2       passion
3       cheese
4       not
5       is

Building the Transition Matrix

To build our transition matrix, we need to count all the transitions that occur between possible words in our dictionary. In the interest of performance, my implementation converts the dictionary into a HashMap<String, usize>. Next, I go through the training text and match each word to its index in the dictionary, effectively transforming the String into a Vec<usize>. For example, the phrase, “passion fruit is not orange, cheese is orange,” becomes [ 2, 1, 5, 4, 0, 3, 5, 0 ]. Next, the implementation iterates through each element in this vector, counting each transition. The counts are stored in another HashMap, again in the interest of performance, but are eventually converted into a matrix $C$. Each row is the output word’s index, and each column is the input word’s index. For example, the transition "fruit" (index 1) -> "is" (index 5) occurs exactly once, so we record 1 in column 1, row 5.

$$C = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 \end{bmatrix}$$

Not a very interesting matrix, is it?
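The counting step described above can be sketched like so. This is a simplified, dependency-free version, not the actual implementation; the function names and the whitespace tokenizer are my stand-ins:

```rust
use std::collections::HashMap;

// Count word-to-word transitions: c[row][col] is the number of times
// the word at index `col` was followed by the word at index `row`.
fn count_transitions(tokens: &[usize], n: usize) -> Vec<Vec<f64>> {
    let mut c = vec![vec![0.0_f64; n]; n];
    for pair in tokens.windows(2) {
        c[pair[1]][pair[0]] += 1.0;
    }
    c
}

fn main() {
    let dictionary = ["orange", "fruit", "passion", "cheese", "not", "is"];
    // Word -> index lookup table, as described above.
    let index: HashMap<&str, usize> = dictionary
        .iter()
        .enumerate()
        .map(|(i, &w)| (w, i))
        .collect();

    let text = "passion fruit is not orange cheese is orange";
    let tokens: Vec<usize> = text.split_whitespace().map(|w| index[w]).collect();
    assert_eq!(tokens, [2, 1, 5, 4, 0, 3, 5, 0]);

    let c = count_transitions(&tokens, dictionary.len());
    // "fruit" (index 1) -> "is" (index 5) occurs once: column 1, row 5.
    assert_eq!(c[5][1], 1.0);
    println!("{:?}", c);
}
```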

Each element needs to be converted into a probability. Take the sum of each column:

$$\begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 2 \end{bmatrix}$$

Create a diagonal matrix $D$ whose entries are $\frac{1}{\text{column sum}}$. To finalize our Markov (a.k.a. transition) matrix $M$, we simply perform:

$$M = CD$$

Note that $D$ is applied on the right, since right-multiplying by a diagonal matrix scales the columns of $C$ (left-multiplying would scale its rows). The result:

$$M = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0.5 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.5 \\ 0 & 1 & 0 & 1 & 0 & 0 \end{bmatrix}$$
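The normalization can be sketched as follows. This is a stand-in, not the actual implementation: rather than building a diagonal matrix explicitly, it divides each column in place, which has the same effect.

```rust
// Divide each column of the count matrix by its column sum (skipping
// empty columns), turning raw counts into transition probabilities.
fn normalize_columns(c: &mut Vec<Vec<f64>>) {
    let n = c.len();
    for j in 0..n {
        let sum: f64 = (0..n).map(|i| c[i][j]).sum();
        if sum > 0.0 {
            for i in 0..n {
                c[i][j] /= sum;
            }
        }
    }
}

fn main() {
    // The count matrix C from above.
    let mut c = vec![
        vec![0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
        vec![0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        vec![1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        vec![0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    ];
    normalize_columns(&mut c);
    // Column 5 ("is") had a sum of 2, so its entries become 0.5.
    assert_eq!(c[0][5], 0.5);
    assert_eq!(c[4][5], 0.5);
    println!("{:?}", c);
}
```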

Using the Transition Matrix

There are two possible situations: the user is in the process of typing, or they have finished their last word. The latter is the easier to implement. Scan the user’s text, and isolate the last word. Perform a lookup on the word list to identify its index. Create a new vector containing 0s except at that index, which should contain a 1. For example, if the last word was “is”:

$$\vec{s} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

Run it through our transition matrix:

$$M\vec{s} = \begin{bmatrix} 0.5 \\ 0 \\ 0 \\ 0 \\ 0.5 \\ 0 \end{bmatrix}$$

This means the most probable next choices are at indices 0 and 4, which correspond to “orange” and “not” respectively. This is great for autocomplete: we can simply list the most probable options to the user.
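Putting the lookup together, a sketch might look like this (again with illustrative names, using the normalized matrix worked out above):

```rust
// Given a one-hot vector for the current word, M * s is just the
// matching column of M: the probability distribution over next words.
fn next_distribution(m: &[Vec<f64>], s: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(s).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let dictionary = ["orange", "fruit", "passion", "cheese", "not", "is"];
    let m: Vec<Vec<f64>> = vec![
        vec![0.0, 0.0, 0.0, 0.0, 1.0, 0.5],
        vec![0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        vec![1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
        vec![0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    ];
    // One-hot vector for the last word, "is" (index 5).
    let mut s = vec![0.0_f64; 6];
    s[5] = 1.0;
    let next = next_distribution(&m, &s);
    // List the candidates from most to least probable.
    let mut ranked: Vec<(usize, f64)> = next
        .iter()
        .cloned()
        .enumerate()
        .filter(|&(_, p)| p > 0.0)
        .collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    for (i, p) in &ranked {
        println!("{} ({:.0}%)", dictionary[*i], p * 100.0);
    }
}
```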

Text-Generation and Steady State

It would be pretty neat if we could use this method to automagically generate text, right?

The Naive Solution

Each iteration, choose the most likely word from the set. Maybe randomize it a bit: choose a random word from the top 5 options. Unfortunately, there is an issue. Markov chains like this one are guaranteed to converge on a specific probabilistic state given enough iterations. In order to get text generation to work unpredictably and without converging, we need something a bit more complex.

My Solution

Create a square diagonal matrix $R$ with a side length equal to the length of $\vec{s}$. Fill the diagonal elements with random numbers between $0$ and $1$. Then choose the word whose index corresponds with the highest value of $R\vec{s}$.
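A sketch of this selection step follows. The tiny linear congruential generator is only a dependency-free stand-in for whatever random source the real implementation uses, and the function names are mine:

```rust
// A minimal LCG producing floats in [0, 1); a stand-in for a real RNG.
fn lcg(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

// Scale each probability by a random factor in [0, 1) -- equivalent to
// forming R * s with random diagonal entries -- and take the argmax.
fn choose(s: &[f64], seed: &mut u64) -> usize {
    s.iter()
        .map(|p| p * lcg(seed))
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let dictionary = ["orange", "fruit", "passion", "cheese", "not", "is"];
    // M * s from the previous step: "orange" and "not" each at 50%.
    let s = [0.5, 0.0, 0.0, 0.0, 0.5, 0.0];
    let mut seed = 42;
    for _ in 0..3 {
        println!("chose: {}", dictionary[choose(&s, &mut seed)]);
    }
}
```

Because words with zero probability stay at zero after scaling, the argmax always lands on one of the plausible next words, but which one varies from call to call.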