Everything a Developer Should Know About Ollama - Part 1

thumbnail-ollama-part-1

Ollama Explained: Local LLMs, Model Packaging, and Alternatives

Ollama is one of the fastest ways to run AI models locally. In this first part, we will build the mental model: what Ollama is, what it is not, and how local generative models are packaged.

โฑ๏ธ Time to Complete

Around 8-10 minutes.

๐ŸŽฏ What youโ€™ll achieve / learn

  • Understand what Ollama does for developers
  • Learn why Ollama is not the same thing as Meta Llama
  • Compare Ollama with tools like LM Studio, llama.cpp, vLLM, and Open WebUI
  • Understand model weights, tokenizers, quantization, templates, and Modelfiles
  • Know what not to expect from Ollama before you use it in a real project

Ollama runtime stack

๐Ÿง  What is Ollama?

Ollama is a local model runner for LLMs. You install it on your machine, pull a model, and talk to that model from a terminal, desktop app, browser UI, editor extension, or HTTP client.

The core idea is simple:

  • ๐Ÿ“ฆ You choose a model from the Ollama model library
  • โฌ‡๏ธ Ollama downloads the model files
  • ๐Ÿ–ฅ๏ธ Ollama runs a local server
  • ๐Ÿ”Œ Your tools talk to that server, usually at http://localhost:11434

For local development, that is a very useful shape. You can prototype an AI feature without paying for every request, test prompts privately, run models offline, or build an app against a local API before deciding whether you need a hosted model.

Ollama is especially popular because it removes setup friction. Without a tool like Ollama, you may need to manually download model weights, pick a quantized file, configure a backend, remember chat templates, tune runtime parameters, and expose an API yourself.

๐Ÿงฐ Alternatives to Ollama

Ollama is not the only option. It is the convenient local runtime option. Other tools may be better depending on what you are building.

Ollama alternatives map

ToolBest forLink
LM StudioDesktop GUI for downloading and chatting with local modelslmstudio.ai
llama.cppLow-level C/C++ inference engine and toolinggithub.com/ggml-org/llama.cpp
vLLMHigh-throughput server inference, usually for bigger deploymentsvllm.ai
JanLocal AI desktop app with a user-friendly interfacejan.ai
Open WebUIWeb UI often used with Ollamaopenwebui.com
LocalAISelf-hosted OpenAI-compatible local AI APIlocalai.io
text-generation-webuiAdvanced local model playgroundgithub.com/oobabooga/text-generation-webui

Use Ollama when you want the fast path from "I have a laptop" to "I can call a local model from code".

Use something else when you need a heavy serving stack, advanced multi-GPU deployment, custom inference tuning, or a full desktop-first model management experience.

โš ๏ธ What not to expect from Ollama

Ollama is convenient, but it is not magic.

  • ๐Ÿงฑ Do not expect Ollama to make a weak machine run a huge model well. If you pull a 70B model on a laptop with limited RAM or VRAM, it may be slow or may not fit comfortably.
  • ๐Ÿงช Do not expect cloud-model quality from every local model. A small local model can be useful, fast, and private, but it may not reason like a frontier hosted model.
  • ๐Ÿ‹๏ธ Do not expect Ollama to train models from scratch. It runs, stores, imports, and customizes models. It is not a full training framework like PyTorch or Hugging Face Transformers.
  • ๐Ÿ” Do not expect built-in production security. By default Ollama is local. If you expose it to your network, you are responsible for firewalling, authentication, proxy rules, TLS, and access control.
  • ๐Ÿงฉ Do not expect Ollama to be a vector database, agent framework, evaluation platform, prompt-management system, or full RAG product. You can connect it to tools like LangChain, LlamaIndex, Qdrant, or Chroma, but Ollama itself is mainly the model runtime and API layer.

Also, do not assume "local" always means "private" in every mode. Local models run locally, but Ollama also has cloud-related features. Know which model you are using and where requests are going.

๐Ÿฆ™ Ollama is not Llama

This is a common naming confusion.

Llama is a family of language models from Meta. Examples include Llama 3.x style models.

Ollama is software for running models. It can run Llama-family models, but it can also run many non-Llama models, such as Google Gemma, Alibaba Qwen, Mistral, DeepSeek, Microsoft Phi, embedding models, and vision-capable models when supported.

Think of it like this:

  • Llama = one model family
  • Gemma/Qwen/Mistral/etc. = other model families
  • Ollama = the local runtime that can run many of them

When you run:

ollama run llama3.2

you are asking Ollama to run a model named llama3.2. Ollama is the tool. Llama is the model family.

๐Ÿ“ฆ How generative models are packaged

A generative model is not usually a single friendly .exe file. It is a bundle of parts that need to agree with each other.

Model packaging pipeline

The important pieces are:

  • Model weights: the learned numbers from training. These are the huge files.
  • Architecture: the model shape: layers, attention style, hidden size, tokenizer expectations, and so on.
  • Tokenizer: converts text into tokens the model understands, and converts output tokens back into text.
  • Quantization: compresses weights to smaller numeric formats, such as 4-bit or 8-bit, so models can run on consumer hardware.
  • Chat template: tells the model how to format system, user, and assistant messages.
  • Runtime parameters: context length, temperature, stop tokens, top-p, repeat penalty, GPU/CPU behavior.
  • License and metadata: what you are allowed to do with the model.

Ollama's role is to package and run those pieces behind a simple interface.

When you pull a model from Ollama, Ollama stores model blobs locally and records the model definition. When you run it, Ollama loads the correct files, applies the template and parameters, starts the runner, and exposes the model through the CLI and HTTP API.

๐Ÿ“ Modelfile: Dockerfile-like config for models

Ollama has a Modelfile, which is similar in spirit to a Dockerfile for a model. The official Modelfile reference describes instructions like:

  • FROM for the base model
  • PARAMETER for runtime settings
  • TEMPLATE for prompt formatting
  • SYSTEM for the default system message
  • ADAPTER for LoRA adapters
  • LICENSE for license text

Example:

FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a concise developer assistant.

Create and run it:

ollama create dev-helper -f ./Modelfile
ollama run dev-helper

That does not train a new base model. It creates a local Ollama model definition using an existing base model plus your chosen behavior and parameters.

๐Ÿงญ Developer mental model

The simplest way to understand Ollama:

Your app / CLI / UI
        |
        v
Ollama local API on :11434
        |
        v
Model package: weights + tokenizer + template + params
        |
        v
CPU/GPU inference on your machine

You are not calling "AI in general". You are calling a specific local model through a local runtime. Model choice matters. Hardware matters. Prompt format matters. Context size matters. Quantization matters.

Ollama just makes all of that much easier to start with.

๐Ÿ‘‰ Next

In Part 2, I cover the practical side: installing Ollama, running models from terminal and UI, calling the API, storing models on a custom disk, and exposing Ollama safely to your network.

๐Ÿ”— Useful links

Related posts