How to Install Gemma 4 Locally Using Ollama (Mac & Windows Guide)
A complete step-by-step guide to running Google's Gemma 4 on your own machine using Ollama. Works on Mac (Apple Silicon & Intel) and Windows. From installation to first response in under 10 minutes.
This is the guide I wish existed when I first tried running a local model. By the end of this page, you'll have Gemma 4 running on your own machine, responding to prompts — zero cloud, zero API keys.
Time to complete: ~10 minutes (+ model download time)
Difficulty: Beginner
Tested on: macOS 14 Sonoma (M2), Windows 11 (RTX 3080)
What is Ollama?
Ollama is a free, open-source tool that makes running large language models locally as simple as docker run. It handles:
- Model downloads and version management
- GPU/CPU detection and optimization
- A local REST API (OpenAI-compatible)
- Background service management
Think of it as "Docker, but for LLMs."
Step 1: Check Your Hardware
Before downloading anything, verify your machine meets the minimums:
# macOS — check available memory
sysctl hw.memsize | awk '{printf "RAM: %.0f GB\n", $2/1073741824}'
# Windows PowerShell
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB
Minimum for Gemma 4 4B: 4 GB free RAM, ~4 GB free disk space.
Not sure which variant to pick? Read our Hardware Requirements guide first.
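If you want a quick sanity check before committing to a download, a back-of-the-envelope estimate helps. This sketch assumes Ollama's default ~4-bit quantization and a rule of thumb of roughly 0.6 GB of RAM per billion parameters plus fixed overhead for the KV cache and runtime — an approximation of my own, not an official Ollama formula:

```python
# Rough rule of thumb for 4-bit quantized models (an assumption, not an
# official requirement): ~0.6 GB per billion parameters, plus ~2 GB of
# headroom for the KV cache and the Ollama runtime itself.

def estimate_ram_gb(params_billion, gb_per_b=0.6, overhead_gb=2.0):
    """Return an approximate RAM requirement in GB for a 4-bit model."""
    return params_billion * gb_per_b + overhead_gb

for params in (4, 12, 27):
    print(f"Gemma 4 {params}B: ~{estimate_ram_gb(params):.0f} GB RAM")
```

The results line up with the figures in this guide: ~4 GB for the 4B model, under 16 GB for the 12B, and workstation territory for the 27B.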
Step 2: Install Ollama
On macOS
Option A — Download the app (easiest):
- Go to ollama.com and click Download for Mac
- Open the .dmg file and drag Ollama to your Applications folder
- Launch Ollama from your Applications — a llama icon appears in your menu bar
Option B — Homebrew:
brew install ollama
Then start the service:
ollama serve
On Windows
- Go to ollama.com and click Download for Windows
- Run the installer — it's a standard .exe wizard
- Ollama starts automatically and appears in the system tray
NVIDIA GPU users: Make sure you have the latest NVIDIA drivers installed. Ollama will auto-detect your GPU.
On Linux
curl -fsSL https://ollama.com/install.sh | sh
This script detects your OS, installs Ollama, and registers it as a systemd service.
Step 3: Verify Ollama is Running
Open a terminal and run:
ollama --version
You should see something like:
ollama version 0.6.4
If you get command not found, make sure Ollama is actually running (look for the icon in your menu bar / system tray), launch the app once if you haven't (it installs the CLI on first run), then open a new terminal and try again.
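If you're scripting your setup, you can check the installed version programmatically rather than eyeballing it. A small sketch — the parse_version helper is my own, not part of Ollama:

```python
import re

def parse_version(output):
    """Extract 'X.Y.Z' from `ollama --version` output as a comparable tuple.

    Returns None if no version number is found.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    return tuple(map(int, match.groups())) if match else None

# In a real script you'd feed this the output of
# subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout
print(parse_version("ollama version 0.6.4"))  # -> (0, 6, 4)
```

Tuples compare element-wise, so `parse_version(out) >= (0, 6, 0)` is an easy minimum-version gate.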
Step 4: Pull the Gemma 4 Model
Now for the exciting part. Pick your variant:
# Gemma 4 4B — recommended for most users (fastest, ~3.5 GB download)
ollama pull gemma3:4b
# Gemma 4 12B — more capable, needs 16 GB RAM (~8 GB download)
ollama pull gemma3:12b
# Gemma 4 27B — workstation grade (~18 GB download)
ollama pull gemma3:27b
Note: Google's Gemma 4 is listed as gemma3 in Ollama's library (it's the 4th-gen Gemma architecture). The suffixes (4b, 12b, 27b) refer to the parameter count.
You'll see a progress bar:
pulling manifest...
pulling 8eeb52dfb3bb... 100% ▕████████████████████▏ 3.5 GB
pulling 56bb8bd477a5... 100% ▕████████████████████▏ 96 B
verifying sha256 digest
writing manifest
success
Download time depends on your internet speed — a 3.5 GB model takes about 3–7 minutes on a typical broadband connection.
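That 3–7 minute figure is just arithmetic on connection speed. If you want to estimate for your own connection, here's the calculation as a sketch (decimal units, ignoring protocol overhead, so treat the output as a lower bound):

```python
def download_minutes(size_gb, mbps):
    """Approximate download time in minutes for a model of size_gb
    gigabytes over a connection of mbps megabits per second."""
    size_megabits = size_gb * 1000 * 8  # GB -> megabits (decimal units)
    return size_megabits / mbps / 60

# The 4B model (~3.5 GB) on common broadband speeds:
for mbps in (50, 100, 200):
    print(f"{mbps} Mbps: ~{download_minutes(3.5, mbps):.1f} min")
```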
Step 5: Run Your First Prompt
Once the download completes, run:
ollama run gemma3:4b
You'll enter an interactive chat session:
>>> Send a message (/? for help)
Try asking it something:
>>> Explain what a neural network is in 2 sentences, like I'm 12.
Gemma 4 will respond directly in your terminal. Hit Ctrl+D or type /bye to exit.
Step 6: Test the Vision Capability
Gemma 4 is multimodal — you can pass it images directly from the CLI:
# Pass a local image by including its path in the prompt
ollama run gemma3:4b "Describe what you see in this image: /path/to/your/image.jpg"
# For a remote image, download it first, then pass the local copy
curl -o photo.jpg https://example.com/photo.jpg
ollama run gemma3:4b "What's in this image? ./photo.jpg"
This works entirely locally — the image never leaves your machine.
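Under the hood, the REST API accepts images too: /api/generate takes an images array of base64-encoded strings alongside the prompt. A minimal stdlib-only sketch — vision_payload is my own helper name, not an Ollama API:

```python
import base64
import json
from pathlib import Path

def vision_payload(model, prompt, image_path):
    """Build a request body for Ollama's /api/generate endpoint,
    attaching the image as base64 (the format the API expects)."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}

# To actually send it (requires the Ollama service running locally):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=json.dumps(vision_payload("gemma3:4b", "Describe this image", "photo.jpg")).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```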
Step 7: Use the REST API (Optional)
Ollama automatically starts a local server at http://localhost:11434. This is OpenAI API-compatible, so you can use it in any app that talks to ChatGPT.
# Test the API endpoint
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3:4b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Or use the OpenAI Python SDK with a one-line change:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1", # Point to local Ollama
api_key="ollama", # Any string works
)
response = client.chat.completions.create(
model="gemma3:4b",
messages=[{"role": "user", "content": "Summarize the theory of relativity."}],
)
print(response.choices[0].message.content)
No OpenAI account needed. No billing. 100% local.
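The native API can also stream tokens as they're generated: with "stream": true, the server returns one JSON object per line. A stdlib-only sketch of consuming that format — the helper names here are my own:

```python
import json
import urllib.request

def parse_stream(lines):
    """Assemble the full reply from newline-delimited JSON chunks.

    Each chunk has a 'response' fragment; the last one has 'done': true.
    """
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def stream_generate(model, prompt, host="http://localhost:11434"):
    """Call /api/generate with stream=True and return the assembled reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_stream(resp)  # iterating the response yields one line per chunk
```

In an interactive script you'd print each fragment as it arrives instead of joining them at the end — that's what gives you the ChatGPT-style typing effect.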
Useful Ollama Commands
Here's a cheat sheet for the commands you'll use most:
# List all downloaded models
ollama list
# Pull a new model
ollama pull <model-name>
# Remove a model (free up disk space)
ollama rm gemma3:27b
# See currently running models
ollama ps
# Get model details
ollama show gemma3:4b
# Run a one-off prompt (non-interactive)
ollama run gemma3:4b "What is 17 * 43?"
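If you end up calling these commands from scripts, a thin wrapper around the CLI keeps things tidy. A sketch using Python's subprocess module — run_prompt is a hypothetical helper of mine, not part of Ollama:

```python
import subprocess

def ollama_cmd(action, *args):
    """Build an argv list for an ollama subcommand, e.g. ('run', 'gemma3:4b', 'Hi')."""
    return ["ollama", action, *args]

def run_prompt(model, prompt):
    """Run a one-off, non-interactive prompt and return the model's reply."""
    result = subprocess.run(ollama_cmd("run", model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Requires Ollama installed and the model pulled:
# print(run_prompt("gemma3:4b", "What is 17 * 43?"))
```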
Troubleshooting Common Issues
"Error: model requires more system memory"
You don't have enough free RAM. Either:
- Close other applications to free RAM
- Use a smaller variant (e.g., switch from 12b to 4b)
- Upgrade to a machine with more RAM
Ollama is very slow (< 3 tokens/second)
Ollama is probably running on CPU instead of GPU. Check:
# Any platform — check which processor the loaded model is using
ollama ps
# The PROCESSOR column should read "100% GPU" (Metal on Apple Silicon)
For NVIDIA GPUs, make sure drivers are up to date:
nvidia-smi # Should show your GPU
"connection refused" when calling the API
The Ollama service isn't running. Start it manually:
# macOS / Linux
ollama serve
# Windows — relaunch from the system tray
Model download fails mid-way
Ollama supports resumable downloads. Just run ollama pull again — it will resume from where it left off.
Monitor Resource Usage
Want to see how much your GPU/CPU is sweating?
# macOS — Activity Monitor or:
sudo powermetrics --samplers gpu_power -n 1
# Windows — open Task Manager > Performance > GPU
# Linux
watch -n 1 nvidia-smi
What's Next?
You've got Gemma 4 running locally. Here's what to explore next:
- Try the 12B model if you have 16 GB RAM — the quality jump is noticeable
- Connect it to a frontend — tools like Open WebUI give you a ChatGPT-style UI for free
- Use it in your code — the OpenAI-compatible API means you can drop it into any existing project
- Keep an eye on our Roadmap — guides on model-selection benchmarks and on building your first AI app are coming next
If this guide helped you, share it with someone who's been putting off running local AI. It's easier than it looks.