Choose Llama Quantization Level for Local Deployment
Quantization reduces model file size and memory use by storing weights at lower precision, trading a small amount of accuracy for dramatically lower resource requirements. Llama 3.1 8B in full F16 precision requires 16GB of RAM, but the Q4_K_M quantized version requires only 4.7GB, making it runnable on a laptop. This example shows the quantization options available through Ollama and Hugging Face and explains the quality-performance trade-off at each level.

The GGUF format (used by Ollama and llama.cpp) supports many quantization levels, but the most commonly used are Q4_K_M (4-bit mixed quantization, the recommended default), Q5_K_M (5-bit mixed, better quality at slightly higher memory), Q8_0 (8-bit, near-full quality), and F16 (full 16-bit precision, reference quality). The K suffix indicates K-quants, a newer quantization scheme that applies different precision to different weight matrices based on their sensitivity; this produces significantly better quality than the older Q4_0 scheme at the same bit width.

For most use cases, Q4_K_M offers the best quality-per-GB ratio. The quality loss versus F16 is measurable in benchmarks but barely perceptible for conversational tasks and code generation. Q8_0 is recommended if you have the RAM to spare and want near-lossless quality without the cost of running full F16. Use F16 or BF16 only when you specifically need reference-quality output and have the VRAM to support it (16GB for 8B models).
Llama 3.1 8B quantization comparison:

| Model tag | RAM | Speed (M2) |
| --- | --- | --- |
| llama3.1:8b-instruct-fp16 | 16.1 GB | ~18 tok/s |
| llama3.1:8b-instruct-q8_0 | 8.5 GB | ~28 tok/s |
| llama3.1:8b-instruct-q5_K_M | 5.7 GB | ~35 tok/s |
| llama3.1:8b-instruct-q4_K_M | 4.7 GB | ~42 tok/s |

Ollama pull commands:

```shell
ollama pull llama3.1:8b                 # default Q4_K_M
ollama pull llama3.1:8b-instruct-q8_0   # Q8_0 version
```
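The sizes in the table follow almost directly from parameter count times average bits per weight. Below is a minimal sketch of that arithmetic in Python; the bits-per-weight values are approximations inferred from the table (K-quants mix precisions, so llama.cpp's exact averages vary by model), and `estimate_size_gb` is a hypothetical helper, not part of any library:

```python
# Rough GGUF size estimate: parameters x average bits per weight.
# Bits-per-weight values are approximate averages, not exact figures.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,    # 8-bit values plus per-block scale overhead
    "Q5_K_M": 5.7,  # approximate average across mixed-precision matrices
    "Q4_K_M": 4.7,  # approximate average; varies by model
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9

# Llama 3.1 8B has roughly 8.03B parameters.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{estimate_size_gb(8.03e9, quant):.1f} GB")
```

Note that this covers weights only; actual RAM use at inference time is somewhat higher once the KV cache and runtime buffers are allocated.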
FAQ
- Which quantization level should I use for Llama?
- Q4_K_M is the best default — it runs on 4.7GB for the 8B model with minimal quality loss. Use Q8_0 if you have 8GB+ to spare and want near-perfect quality. Avoid Q4_0 (older scheme with more quality loss than Q4_K_M at the same size).
- Does quantization affect instruction following?
- Yes, but minimally for Q4_K_M and above. Quality loss is most noticeable for complex reasoning, long-form generation, and multilingual tasks. For conversational use cases and code generation, Q4_K_M is indistinguishable from F16 in practice.
- Can I run Llama 70B locally?
- Llama 3.1 70B in Q4_K_M requires approximately 40GB RAM. This needs either a Mac with 48GB or 64GB unified memory, a high-end PC with 64GB RAM, or multiple GPUs. It is feasible but not for typical consumer hardware.