
Run Gemma 4 with Ollama: A Practical Guide to Every Model Size

Learn how to run Gemma 4 locally using Ollama — from the lightweight E2B edge model to the full 31B dense powerhouse. Covers setup, model selection, multimodal image input, API usage, and performance tips.

April 6, 2026 · 8 min read

Google's Gemma 4 landed in early April 2026, and one of the fastest ways to take it for a spin on your own hardware is through Ollama. No cloud account, no API key, no billing dashboard — just a terminal and a decent internet connection to download the model weights.

This guide walks you through running each Gemma 4 variant in Ollama, picking the right size for your machine, and putting the model to work with text prompts, images, and the local REST API.

Why Ollama?

If you've never worked with local LLMs, you might wonder why Ollama keeps coming up. The short version: it removes the friction. Behind the scenes, Ollama wraps the llama.cpp inference engine and takes care of model downloading, GPU detection, quantization selection, and serving — all through a single CLI. You don't need to manually convert weights, write config files, or figure out CUDA paths.

For Gemma 4 specifically, Ollama ships pre-quantized GGUF versions of every official variant, so you can go from zero to a running model in two commands.

The Gemma 4 Lineup

Gemma 4 isn't a single model — it's a family of four, each targeting a different hardware profile. Here's the quick breakdown:

Tag           Parameters             Disk Size   Context Window   Best For
gemma4:e2b    ~2B effective          ~7.2 GB     128K tokens      Phones, tablets, Raspberry Pi-class devices
gemma4:e4b    ~4B effective          ~9.6 GB     128K tokens      Laptops, everyday development
gemma4:26b    26B (MoE, 4B active)   ~18 GB      256K tokens      Best quality-per-GB thanks to Mixture of Experts
gemma4:31b    31B dense              ~20 GB      256K tokens      Maximum quality, workstation hardware

The "E" in E2B and E4B stands for effective — these are edge-optimized models that punch above their parameter count through architectural tricks like Hybrid Sliding Window Attention (HSWA), which cuts memory usage by roughly 30% compared to earlier Gemma generations.

The 26B variant is particularly interesting: it uses a Mixture of Experts (MoE) architecture where only about 4 billion parameters are active for any given token. That means it delivers 26B-level quality while running almost as fast as a 4B model.

Not sure which variant your hardware can handle? Check our Hardware Requirements guide first.

Getting Started

1. Install Ollama

Head to ollama.com/download and grab the installer for your OS. Or if you prefer the terminal:

macOS (Homebrew):

brew install --cask ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download and run the .exe installer from the website.

After installation, confirm it's working:

ollama --version

You should see something like ollama version 0.20.x or later. Gemma 4 support requires Ollama v0.20.0+, so update if you're on an older version.
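If a script of yours depends on that minimum, a small comparison helper is enough. A minimal Python sketch; the sample strings mirror the output format shown above, so run `ollama --version` yourself for the real value:

```python
def parse_ollama_version(output: str) -> tuple[int, ...]:
    """Extract the numeric version from `ollama --version` output."""
    # Output looks like: "ollama version 0.20.3"
    version = output.strip().split()[-1]
    return tuple(int(part) for part in version.split("."))

def supports_gemma4(output: str, minimum=(0, 20, 0)) -> bool:
    """True if the reported version meets the Gemma 4 minimum (v0.20.0)."""
    return parse_ollama_version(output) >= minimum

print(supports_gemma4("ollama version 0.20.3"))  # True
print(supports_gemma4("ollama version 0.19.1"))  # False
```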

2. Pull a Gemma 4 Model

With Ollama running, downloading a model is one command:

ollama pull gemma4

This grabs the default variant (E4B) — a solid middle ground for most laptops. If you want a specific size:

# Lightweight edge model
ollama pull gemma4:e2b

# MoE model — great quality, moderate hardware
ollama pull gemma4:26b

# Full dense model — needs a beefy machine
ollama pull gemma4:31b

Verify your download:

ollama list

You'll see each model listed with its tag, size, and modification date.

3. Run It

The simplest way to interact with Gemma 4:

ollama run gemma4

This drops you into an interactive chat session. Type a message, hit Enter, and Gemma 4 responds right in your terminal. Type /bye or press Ctrl+D when you're done.

For a quick one-shot prompt without entering interactive mode:

ollama run gemma4 "What makes a sourdough starter bubbly?"

Multimodal: Feeding Images to Gemma 4

All Gemma 4 variants are natively multimodal — they understand images out of the box, not as an add-on. The model supports variable aspect ratios and resolutions up to 2.6 megapixels, using 2D positional embeddings for genuine spatial understanding (it can tell "above" from "below," "left" from "right").

To pass an image from the command line, include the file path inside the prompt text; Ollama detects the path and attaches the image:

ollama run gemma4 "What's happening in this photo? /path/to/your/photo.jpg"

A few practical examples:

# Describe a screenshot
ollama run gemma4 "Summarize the key information in this screenshot ~/Desktop/dashboard.png"

# Read text from an image
ollama run gemma4 "Extract all the text you can see ~/Documents/receipt.jpg"

# Analyze a chart
ollama run gemma4 "What trends does this chart show? ~/Downloads/sales-q1.png"

Everything stays on your machine — no image data is sent anywhere.

Edge models and media: The E2B and E4B variants also support audio and video input natively, though video support through Ollama is still evolving. Keep an eye on Ollama's release notes for updates.

Using the Local API

Ollama automatically runs a local HTTP server on port 11434. This is handy for building apps, writing scripts, or connecting other tools.

Basic API Call

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Explain quantum entanglement in plain English",
  "stream": false
}'
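With "stream": false you get a single JSON object back. Leave it out and Ollama streams newline-delimited JSON chunks instead, each carrying a "response" fragment plus a final record with "done": true. A minimal Python sketch of reassembling such a stream; the sample chunks below are illustrative, not real model output:

```python
import json

def assemble_stream(ndjson_lines):
    """Concatenate the "response" fragments from Ollama's streaming format."""
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final record signals the end of the stream
            break
    return "".join(text)

# Chunks shaped like Ollama's /api/generate streaming output
sample = [
    '{"model":"gemma4","response":"Entangled","done":false}',
    '{"model":"gemma4","response":" particles share state.","done":false}',
    '{"model":"gemma4","response":"","done":true}',
]
print(assemble_stream(sample))  # Entangled particles share state.
```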

With an Image (Base64)

For multimodal API requests, encode your image as base64:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Describe this image",
  "images": ["'"$(base64 -i /path/to/image.jpg)"'"]
}'
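The same payload can be built in Python with only the standard library: read the image bytes, base64-encode them, and put the result in the "images" array. The model name, prompt, and path here are placeholders:

```python
import base64
import json

def build_image_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for a multimodal /api/generate request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
    })

# In a real script you'd read the bytes from disk:
#   image_bytes = open("/path/to/image.jpg", "rb").read()
body = build_image_payload("gemma4", "Describe this image", b"not a real image")
print(json.loads(body)["model"])  # gemma4
```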

Chat Endpoint

For multi-turn conversations, use the /api/chat endpoint:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process plants use to convert sunlight into energy..."},
    {"role": "user", "content": "How efficient is it compared to solar panels?"}
  ],
  "stream": false
}'
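The /api/chat endpoint is stateless, so your client owns the history: append each user turn and each assistant reply to the messages list and resend the whole thing every call. A minimal sketch of that bookkeeping; the actual POST is left out since it needs a running Ollama server:

```python
import json

class ChatSession:
    """Accumulates messages for Ollama's stateless /api/chat endpoint."""

    def __init__(self, model: str = "gemma4"):
        self.model = model
        self.messages = []

    def add_user(self, content: str):
        self.messages.append({"role": "user", "content": content})

    def add_assistant(self, content: str):
        self.messages.append({"role": "assistant", "content": content})

    def request_body(self) -> str:
        """JSON body to POST to http://localhost:11434/api/chat."""
        return json.dumps(
            {"model": self.model, "messages": self.messages, "stream": False}
        )

session = ChatSession()
session.add_user("What is photosynthesis?")
session.add_assistant("Photosynthesis is how plants convert sunlight into energy.")
session.add_user("How efficient is it compared to solar panels?")
print(len(session.messages))  # 3
```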

Python Integration

Ollama also exposes an OpenAI-compatible endpoint at /v1, so you can use the standard openai Python package:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Write a haiku about debugging."}],
)

print(response.choices[0].message.content)

No API key needed — the api_key field accepts any non-empty string.

Picking the Right Variant

If you're unsure which tag to pull, here's a decision framework based on real-world use patterns:

gemma4:e2b — For constrained environments

  • You're running on a device with 8 GB RAM or less
  • You need fast responses and can accept shorter, less nuanced answers
  • Good for: chatbots, simple Q&A, mobile/embedded prototypes

gemma4:e4b — The default for a reason

  • Works well on any modern laptop with 16 GB RAM
  • Balanced speed and quality for everyday coding help, writing, and analysis
  • Good for: development, personal assistant, content drafting

gemma4:26b — The sleeper hit

  • Mixture of Experts means you get 26B-quality output at near-4B speed
  • Needs about 18 GB of RAM, making it feasible on 32 GB MacBooks or a desktop with a 24 GB GPU
  • Good for: complex reasoning, code review, detailed analysis

gemma4:31b — When quality is everything

  • Dense 31B model, no shortcuts
  • Plan for at least 20 GB of VRAM or 32+ GB of unified memory on Apple Silicon
  • Good for: research, long-form writing, tasks where you'd otherwise reach for a cloud API
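The framework above condenses into a small lookup. This helper is purely illustrative, encoding only the RAM thresholds from this guide; it knows nothing about your GPU, disk space, or workload:

```python
def pick_gemma4_variant(ram_gb: float, prioritize_quality: bool = False) -> str:
    """Suggest a Gemma 4 tag from available memory, per this guide's thresholds."""
    if ram_gb <= 8:
        return "gemma4:e2b"   # constrained devices: 8 GB or less
    if ram_gb < 32:
        return "gemma4:e4b"   # typical 16 GB laptop
    if prioritize_quality:
        return "gemma4:31b"   # dense model: 32+ GB unified memory or 20+ GB VRAM
    return "gemma4:26b"       # MoE: near-dense quality at lower compute per token

print(pick_gemma4_variant(16))        # gemma4:e4b
print(pick_gemma4_variant(32))        # gemma4:26b
print(pick_gemma4_variant(48, True))  # gemma4:31b
```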

Performance Tips

A few things that can meaningfully affect your experience:

Close memory-hungry apps before running larger models. Web browsers with dozens of tabs can easily eat 8–10 GB of RAM. Freeing that up might be the difference between the 26B variant running smoothly or swapping to disk.

Use an SSD. Model loading from a hard drive can take 3–5× longer. Once loaded, the model runs from RAM, but that initial load matters.

Check GPU utilization. On macOS with Apple Silicon, Ollama uses Metal automatically. On Linux/Windows with NVIDIA, make sure your CUDA drivers are current:

# Check NVIDIA GPU status
nvidia-smi

# Check which models are loaded
ollama ps

Try the MoE model before going dense. The gemma4:26b MoE variant often matches or comes close to the 31B dense model's quality while using less compute per token. It's worth testing before committing to the larger download.

Using Custom or Fine-Tuned Models

If you've fine-tuned a Gemma 4 model and want to run it through Ollama, you'll need to convert it to the GGUF format first. Ollama supports importing from a Modelfile that points to your GGUF weights:

FROM /path/to/your/gemma4-finetuned.gguf

Then create and run it:

ollama create my-gemma4 -f Modelfile
ollama run my-gemma4
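A Modelfile can carry more than the weights path. If you want a system prompt or sampling defaults baked into your custom model, the standard Modelfile directives (SYSTEM, PARAMETER) apply; the values below are examples, not recommendations:

```
FROM /path/to/your/gemma4-finetuned.gguf

# Optional: bake in behavior and sampling defaults
SYSTEM """You are a concise technical assistant."""
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```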

For detailed conversion instructions (Safetensors → GGUF), check the llama.cpp conversion guide.

What's Next?

You've got Gemma 4 running locally through Ollama. From here, a few directions worth exploring:

  • Add a UI — Open WebUI gives you a ChatGPT-style frontend that connects to Ollama with zero configuration
  • Build something — the OpenAI-compatible API means you can slot Gemma 4 into any project that currently calls GPT-5 or other LLMs
  • Experiment with vision — try feeding it screenshots, diagrams, or handwritten notes to see how the multimodal capabilities hold up in your workflow

Running a model this capable on your own hardware, with no internet required after the initial download, still feels a little surreal. But that's where open-source AI is now — and Gemma 4 with Ollama is one of the easiest ways to experience it firsthand.