The Easiest Way to Run LLMs Locally

A Goofy Lookin' Llama

LLMs

Unless you’ve been living under a rock for the past year, you already know what LLMs are. If you do happen to be one of the lucky few unaware of the current hype around these things, I’ll go through it real quick.

A large language model (or LLM) is a statistical model capable of “predicting” a subsequent word or letter, given a body of text. Essentially, it is a computer program capable of filling in the blank. If you let it predict the next word, then feed the result back in, you can get some pretty human-looking text.

Let’s Be Clear

I hold a lot of skepticism about the practical applications of LLMs as a tool. As a blanket rule, I never use LLMs or any similar technology in my education.

I know some people ask LLMs questions like “explain the fundamental theorem of calculus to me like I’m five.” While they may get good results for that kind of question, I do not want to lean on them as a crutch. College is not only an opportunity to learn the raw material, but also an opportunity to learn how to learn. If we know anything about LLMs, it’s that their ability to answer complex questions breaks down as you move to more specialized classes.

Which is all to say: I did not investigate this with the intention of using it as a tool; I just wanted to play around.

My Circumstance

I use arch, btw. While I enjoy the level of control it provides, I don’t think it’s for everybody. This is partly because some things are quite difficult to set up.

For example, GPU support is limited and finicky, especially if you run an Intel Arc card, like I do. While it works perfectly for some apps, like Blender, it doesn’t work so well for other things. My card only has 3 GB of VRAM, so it wouldn’t be able to fit most models anyway.

So when I took on the task of running an LLM on my local machine, I started by looking at CPU-only solutions.

Initially, I tried to raw-dog llama.cpp. That worked, but only barely. The command-line interface left a lot to be desired, and the process of downloading and loading various models was tedious and confusing.
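For context, here’s roughly what that workflow looked like (a sketch from memory; the exact binary names and flags have shifted between llama.cpp versions, and the model file here is just a placeholder):

# Build llama.cpp yourself
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Hunt down a quantized GGUF model (usually on Hugging Face),
# then point the binary at it, along with a prompt and a token limit
./main -m ./models/llama-2-7b.Q4_K_M.gguf -p "Hello, my name is" -n 128

Every new model meant repeating the hunt-and-download step by hand.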

Ollama

That’s when I discovered Ollama. Installing it was as easy as running:

sudo pacman -S ollama

To avoid wasting resources on multiple instances of each model, Ollama uses a server architecture. You can start the server by running:

ollama serve
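If you don’t want to keep a terminal open for this, the Arch package also ships a systemd unit (at least mine did), so you can hand the server off to systemd instead:

# Start the Ollama server now and on every boot
sudo systemctl enable --now ollama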

Then, you can download and start chatting with a model:

ollama run llama2
# Or:
ollama run mistral
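A nice side effect of the server architecture is that the CLI isn’t the only way in. The server exposes an HTTP API on localhost (port 11434 by default), so once a model has been pulled, something like this should work too:

# Ask the running server for a completion over its HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

There are also a few housekeeping subcommands worth knowing: ollama pull to download a model without chatting, ollama list to see what’s on disk, and ollama rm to reclaim the space.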

That’s It

That’s it! It really is that simple.

Again, you might have no reason to do any of this, especially if you are happy with the privacy nightmare that is OpenAI, Google, or Anthropic, or if you already have a system that works for you.