
Gemma 4 Hardware Requirements: Can Your Mac or PC Run It?

A clear breakdown of GPU memory, system RAM, and storage needed to run Gemma 4 on a laptop, desktop, or workstation — E4B, 26B MoE, and 31B dense — based on Google's official specs. Covers quantization, Apple Silicon (including Mac Mini), and NVIDIA GPU matching.

April 4, 2026 · 9 min read

You've probably heard that Google's Gemma 4 can run locally on your own hardware. The first question is always the same: "Does my machine have enough juice?"

The answer isn't one-size-fits-all. Gemma 4 is a family of models spanning from edge devices to workstations. This guide focuses on the three variants that matter for laptops, desktops, and workstations — and gives you the real numbers straight from Google's official documentation, so you can figure out exactly where your setup falls.

The Gemma 4 Models for Local Use

Gemma 4 comes in four sizes overall. The smallest — E2B — is designed primarily for phones, tablets, and IoT devices, so we won't spend much time on it here. The three models relevant to laptop and desktop users are:

| Model | Architecture | Active Parameters | Context Window |
| --- | --- | --- | --- |
| Gemma 4 E4B | Dense | ~4B effective | 128K tokens |
| Gemma 4 26B A4B | Mixture of Experts | 4B active / 26B total | 256K tokens |
| Gemma 4 31B | Dense | 31B | 256K tokens |

All three are natively multimodal (text + images) and support audio and video input. The "E" in E4B stands for effective — more on why that matters for memory in a moment.

How Much Memory Do You Need?

Here's the table that matters most. These numbers come directly from Google's Gemma 4 documentation and represent the minimum GPU/TPU memory required to load the model weights:

| Model | Q4_0 (4-bit) | SFP8 (8-bit) | BF16 (16-bit) |
| --- | --- | --- | --- |
| Gemma 4 E4B | 5 GB | 7.5 GB | 15 GB |
| Gemma 4 26B A4B | 15.6 GB | 25 GB | 48 GB |
| Gemma 4 31B | 17.4 GB | 30.4 GB | 58.3 GB |

These are the minimum memory requirements for loading model weights only — actual usage during inference will be higher. More on that below.

For most people running models locally, Q4_0 (4-bit quantization) is the practical starting point. It dramatically reduces memory usage while keeping output quality surprisingly close to full precision.

Why Does the E4B Need 5 GB for "4 Billion" Parameters?

If you do the math — 4 billion parameters at 4 bits per parameter — you'd expect roughly 2 GB. So where does the extra memory go?

Gemma 4's smaller models use a technique called Per-Layer Embeddings (PLE). Instead of sharing one big embedding table across the entire model, PLE gives every decoder layer its own small embedding lookup for each token. These tables add a lot of storage, but at inference time each one is just a fast lookup, not heavy matrix computation. The result: the model runs like a 4B-parameter model while storing considerably more data on disk.

That's why Google uses the word "effective" — the E4B is effective-4B in compute cost, but its actual weight file is closer to what you'd expect from a larger model.

The MoE Catch: 26B Loaded, 4B Active

The 26B A4B model uses a Mixture of Experts (MoE) architecture. Here's what that means in practical terms:

When you send a prompt, the model activates only about 4 billion parameters to process each token — which is why it runs at speeds comparable to a 4B dense model. But all 26 billion parameters need to sit in memory at all times, because the router needs instant access to every expert to decide which ones to activate.

So while 26B A4B runs like a small model, it loads like a large one. Plan your memory budget around the 15.6 GB minimum (at 4-bit), not around the 4B active parameter count.
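You can sanity-check Google's figure with simple arithmetic. The sketch below computes the raw size of 26 billion weights at 4 bits each; the published 15.6 GB is higher because quantization block scales and some layers kept at higher precision (commonly the embeddings, though Google hasn't itemized this) add overhead.

```shell
# Back-of-envelope: raw weight size = params × bits-per-weight / 8 bytes
awk 'BEGIN {
  params = 26e9   # all 26B parameters must be resident, not just the 4B active
  bits   = 4      # Q4_0
  printf "Raw 4-bit weights: %.1f GB\n", params * bits / 8 / 1e9
}'
# prints: Raw 4-bit weights: 13.0 GB  (vs. Google's 15.6 GB with format overhead)
```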

What the Table Doesn't Tell You

Google's official numbers cover static model weights only. In real-world use, you'll need additional memory for:

KV Cache (Context Window)

Every token in your conversation — both what you type and what the model generates — gets stored in a key-value cache. The longer the conversation, the more memory this consumes. A short prompt might add a few hundred MB, but pushing toward the full 128K or 256K context window can add several gigabytes on top of the base weights.
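A common way to estimate this is: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens cached. The layer and head counts below are illustrative placeholders only; Google hasn't published Gemma 4's internal architecture in the specs this guide draws from.

```shell
# KV-cache estimate. All architecture numbers here are hypothetical placeholders.
awk 'BEGIN {
  layers = 32; kv_heads = 4; head_dim = 128
  bytes  = 2        # fp16 cache entries
  tokens = 128000   # full E4B context window
  printf "KV cache at 128K tokens: %.1f GB\n", 2 * layers * kv_heads * head_dim * bytes * tokens / 1e9
}'
# prints: KV cache at 128K tokens: 8.4 GB  (with these placeholder numbers)
```

The exact figure depends on the real layer/head counts and on whether the runtime quantizes the cache, but the linear scaling with context length is the point: doubling the tokens you keep in play doubles this cost.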

Operating System and Runtime Overhead

Your OS, the inference runtime (Ollama, llama.cpp, etc.), and other background processes all claim memory. On a typical setup, budget an extra 2–4 GB for system overhead beyond the model itself.

A practical rule of thumb: take the Q4_0 number from the table and add 30–50% for comfortable daily use. So for the E4B at 5 GB base, plan for roughly 7–8 GB of available GPU memory or unified memory.
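Applied to all three models from the table, the 30–50% rule works out as follows (the `model:base-GB` pairs are taken from the Q4_0 column above):

```shell
# Apply the 30-50% headroom rule to each model's Q4_0 base size
for entry in "E4B:5" "26B-A4B:15.6" "31B:17.4"; do
  awk -v m="${entry%:*}" -v b="${entry#*:}" \
    'BEGIN { printf "%-8s base %4.1f GB -> plan for %.1f-%.1f GB\n", m, b, b * 1.3, b * 1.5 }'
done
# E4B      base  5.0 GB -> plan for 6.5-7.5 GB
# 26B-A4B  base 15.6 GB -> plan for 20.3-23.4 GB
# 31B      base 17.4 GB -> plan for 22.6-26.1 GB
```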

Which NVIDIA GPU Fits Which Model?

If you're on a desktop or laptop with an NVIDIA GPU, here's how common cards match up against the Q4_0 (4-bit) requirements:

| GPU | VRAM | What It Can Run |
| --- | --- | --- |
| GTX 1660 / RTX 3050 | 6 GB | E4B (tight — close other GPU apps) |
| RTX 3060 / 4060 | 8 GB | E4B (comfortable) |
| RTX 3070 / 4070 | 8–12 GB | E4B (comfortable, room for longer context) |
| RTX 3080 / 4070 Ti Super / 5080 | 12–16 GB | E4B, 26B A4B (tight at 16 GB) |
| RTX 3090 / 4090 | 24 GB | 26B A4B (comfortable), 31B (tight) |
| RTX 5090 / A6000 | 32–48 GB | 31B (comfortable) |

"Tight" means it'll load and run, but you won't have much room for long conversations or other GPU workloads. "Comfortable" means you have headroom for the KV cache and normal multitasking.
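That tight/comfortable split can be expressed as a quick check: comfortable means the card covers the base weights plus headroom (using ~1.4×, the midpoint of the 30–50% rule above); tight means it covers only the base. The example values below are one card/model pairing, edit them to match yours.

```shell
vram=12   # example: RTX 3060 12 GB
base=5    # example: E4B at Q4_0, from the table above
awk -v v="$vram" -v b="$base" 'BEGIN {
  if      (v >= b * 1.4) print "comfortable"   # weights + headroom fit
  else if (v >= b)       print "tight"         # weights fit, little else
  else                   print "will not load"
}'
# prints: comfortable
```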

# Check your NVIDIA GPU VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

AMD GPU users: ROCm support is more mature on Linux than Windows. If you're on an AMD card, Linux is the safer bet.

Apple Silicon: Unified Memory Is Your Friend

Apple's M-series chips use unified memory — the CPU and GPU share the same pool of RAM. This is a significant advantage for local AI, because the model doesn't need to be copied between separate CPU and GPU memory. Every gigabyte of RAM your Mac has can be used directly for inference.

# Check your Mac's total memory
system_profiler SPHardwareDataType | grep Memory

| Mac | Memory | Best Gemma 4 Fit |
| --- | --- | --- |
| MacBook Air M1/M2 | 8 GB | E4B (Q4_0, tight) |
| MacBook Air/Pro M2/M3 | 16 GB | E4B (comfortable), 26B A4B (tight) |
| Mac Mini M4 | 16–32 GB | E4B, 26B A4B at 24+ GB |
| Mac Mini M4 Pro | 24–48 GB | 26B A4B (comfortable), 31B at 48 GB |
| MacBook Pro M3 Pro/Max | 18–36 GB | 26B A4B (comfortable) |
| MacBook Pro M4 Max | 48 GB | 31B |
| Mac Studio / Mac Pro | 64–192 GB | All models, including BF16 |

The Mac Mini deserves a special mention — it's one of the most cost-effective ways to run local AI on Apple Silicon. The M4 base model starts at 16 GB (enough for E4B), and the M4 Pro can be configured up to 48 GB, which comfortably handles the 31B dense model. Compared to a MacBook Pro with similar specs, a Mac Mini is significantly cheaper and easier to leave running as a dedicated AI server.

Since macOS and background apps typically consume 3–5 GB, an 8 GB Mac realistically has about 4–5 GB free for the model — just enough to squeeze in E4B (5 GB at Q4_0), though it'll be tight. A 16 GB machine is a much more comfortable starting point.

CPU-Only: Possible, but Set Your Expectations

No GPU? You can still run Gemma 4 on CPU alone — the model loads into system RAM instead. But inference will be significantly slower:

  • E4B on CPU: roughly 2–5 tokens/second on a modern laptop. Workable for simple Q&A, but slow for extended chat.
  • 26B / 31B on CPU: under 1 token/second on most consumer hardware. Really only practical for batch processing or testing.

For CPU inference, you'll generally need about 2× the model's Q4_0 weight size in system RAM — one copy for the weights, plus working memory for computation. So E4B on CPU needs roughly 10 GB of free system RAM.
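The same 2× rule applied across the lineup, again using the Q4_0 sizes from the table above:

```shell
# CPU-only rule of thumb: ~2x the Q4_0 weight size in free system RAM
for entry in "E4B:5" "26B-A4B:15.6" "31B:17.4"; do
  awk -v m="${entry%:*}" -v b="${entry#*:}" \
    'BEGIN { printf "%-8s needs ~%.0f GB free system RAM on CPU\n", m, 2 * b }'
done
# E4B      needs ~10 GB free system RAM on CPU
# 26B-A4B  needs ~31 GB free system RAM on CPU
# 31B      needs ~35 GB free system RAM on CPU
```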

What Is Quantization, and Why Should You Care?

You'll see terms like "Q4_0," "SFP8," and "BF16" throughout this guide. Here's what they mean in plain language:

Every AI model stores its knowledge as billions of numbers (called parameters). The precision of those numbers — how many bits each one uses — determines both the model's quality and its memory footprint.

  • BF16 (16-bit): Full precision. Best quality, highest memory cost. Mostly for research or high-end servers.
  • SFP8 (8-bit): Half the memory of BF16, very close to full quality. A good choice if you have the VRAM to spare.
  • Q4_0 (4-bit): Quarter the memory of BF16. Some quality loss, but for most conversational and coding tasks, the difference is hard to notice. This is what most local users run.

Think of it like image compression: a JPEG at 90% quality looks nearly identical to the original, but takes a fraction of the storage. Quantization does the same thing for model weights.
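Because the size scales linearly with bits per parameter, you can estimate any model's raw footprint directly. The sketch below does this for the 31B dense model; the results land close to, but not exactly on, Google's published figures, since each format carries its own overhead (block scales for Q4_0, occasional higher-precision layers for the others).

```shell
# Raw weight-file size at each precision: params × bits / 8 bytes
awk 'BEGIN {
  params = 31e9   # Gemma 4 31B dense
  split("4 8 16", bits, " ")
  name[1] = "Q4_0"; name[2] = "SFP8"; name[3] = "BF16"
  for (i = 1; i <= 3; i++)
    printf "%-5s ~%5.1f GB raw\n", name[i], params * bits[i] / 8 / 1e9
}'
# Q4_0  ~ 15.5 GB raw   (published: 17.4 GB)
# SFP8  ~ 31.0 GB raw   (published: 30.4 GB)
# BF16  ~ 62.0 GB raw   (published: 58.3 GB)
```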

Quick Decision Guide

Not sure which model to try first? Start here:

  1. 8 GB or less → Gemma 4 E4B at Q4_0 is feasible but tight. Close unnecessary apps and keep conversations short.
  2. 16 GB → E4B comfortably, with room for longer context windows and multitasking. A great entry point.
  3. 16–24 GB → The 26B A4B MoE model becomes feasible. It gives you a big quality jump for the memory cost.
  4. 24–32 GB → 26B A4B comfortably, or stretch toward the 31B dense model.
  5. 48 GB+ → The full 31B dense model runs comfortably. This is as good as Gemma 4 gets locally.

Whichever tier you land in, it's worth confirming your total memory from the command line first:

# macOS — check available RAM
sysctl hw.memsize | awk '{printf "Total RAM: %.0f GB\n", $2/1073741824}'

# Windows PowerShell
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB

# Linux
free -h | grep Mem

What's Next?

Now that you know which variant fits your hardware, it's time to actually run it.

👉 Next guide: Run Gemma 4 with Ollama: A Practical Guide to Every Model Size

From installation to your first response — about 10 minutes on a decent connection.