Run Llama with the Ollama API
Ollama is the most popular tool for running Llama and other open-weight models locally. Its server listens on localhost:11434, serving both a native API (/api/chat) and an OpenAI-compatible endpoint (/v1) that accepts the same request format as the OpenAI Chat Completions API, making it a drop-in replacement for development and testing without incurring API costs. This example shows a complete request to Ollama's native chat endpoint for Llama 3.1 8B, a model that runs comfortably on a Mac with 16GB of RAM and produces strong results for general tasks.

The Ollama API adds model-specific settings through the options object: num_ctx sets the context window size for the current request (default 2048, up to the model's training limit), temperature and top_p behave the same as in the OpenAI API, and num_predict caps the number of generated tokens. The stream parameter defaults to true in Ollama, returning a stream of partial responses; set it to false to receive a single JSON object.

Beyond local development, Ollama is increasingly used in production for privacy-sensitive applications where sending data to third-party APIs is not acceptable: medical records, financial data, proprietary code analysis. The cost model is fundamentally different from API-based LLMs: hardware acquisition cost is amortized over requests rather than billed per token, so for high-volume use cases self-hosting becomes economically competitive with cloud APIs.
{
  "model": "llama3.1:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful coding assistant. Provide concise, working code examples."
    },
    {
      "role": "user",
      "content": "Write a Python function that reads a CSV file and returns a list of dictionaries, one per row."
    }
  ],
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "temperature": 0.3,
    "num_predict": 512
  }
}

FAQ
- How do I install Ollama?
- Download from ollama.ai for macOS, Linux, or Windows. After installation, run "ollama pull llama3.1:8b" to download the model, then "ollama serve" to start the API server. The server listens on localhost:11434 by default.
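Once the server is running, the example request above can be sent with nothing beyond the Python standard library. This is a minimal sketch assuming the native /api/chat endpoint on the default port; the chat helper name is illustrative, not part of any SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

# The same request body as the JSON example above.
payload = {
    "model": "llama3.1:8b",
    "messages": [
        {"role": "system",
         "content": "You are a helpful coding assistant. "
                    "Provide concise, working code examples."},
        {"role": "user",
         "content": "Write a Python function that reads a CSV file and "
                    "returns a list of dictionaries, one per row."},
    ],
    "stream": False,  # one JSON object instead of a token stream
    "options": {"num_ctx": 4096, "temperature": 0.3, "num_predict": 512},
}

def chat(payload: dict) -> str:
    """POST to /api/chat and return the assistant's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # With "stream": false, the reply is in message.content.
    return body["message"]["content"]

# print(chat(payload))  # requires a running `ollama serve`
```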
- Is the Ollama API fully compatible with OpenAI SDKs?
- Largely yes. The OpenAI Python and JavaScript SDKs work with Ollama by setting base_url="http://localhost:11434/v1" and using any string as the api_key. Some advanced features like function calling have varying support depending on the model.
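As an illustration, here is a minimal sketch of pointing the official OpenAI Python SDK at a local Ollama server; it assumes `pip install openai` and that the model has already been pulled, and the ask helper is a made-up name:

```python
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"

def make_client():
    # Requires the openai package. The api_key is mandatory in the SDK
    # but ignored by Ollama, so any placeholder string works.
    from openai import OpenAI
    return OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key="ollama")

def ask(prompt: str) -> str:
    client = make_client()
    resp = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

# print(ask("Explain list comprehensions in one sentence."))
```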
- What Llama models run well on consumer hardware?
- Llama 3.1 8B runs on 8-16GB RAM with acceptable speed. Llama 3.1 70B requires 40GB+ RAM and runs much slower. For GPU acceleration on Apple Silicon, Ollama uses Metal automatically. For NVIDIA GPUs, CUDA acceleration is automatic.
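The RAM figures above can be sanity-checked with a back-of-envelope estimate. This is a common heuristic, not an Ollama formula: weight memory is roughly parameter count times bits per weight divided by 8, and Ollama's default downloads are 4-bit quantized; the 20% overhead factor here is an assumption covering the KV cache and runtime:

```python
def approx_weight_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    # Rough heuristic (an assumption, not an Ollama formula):
    # weights take params * bits/8 bytes, padded ~20% for KV cache and runtime.
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * 1.2 / 1e9, 1)

print(approx_weight_gb(8))   # Llama 3.1 8B at 4-bit -> 4.8
print(approx_weight_gb(70))  # Llama 3.1 70B at 4-bit -> 42.0
```

The 70B estimate lines up with the 40GB+ figure above, which is why that model is impractical on most consumer machines.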
Related Examples
- Count Tokens for Llama 3: Llama 3 uses a custom tokenizer based on tiktoken's BPE algorithm with a vocabul...
- Write a System Prompt for Llama 3: Llama 3 uses a structured chat template with special tokens that must be applied...
- Build an OpenAI Chat Completion Request: The Chat Completion API is the primary interface for all GPT models and the foun...