Run Llama with the Ollama API

Ollama is the most popular tool for running Llama and other open-weight models locally. Its server listens on localhost:11434 and exposes both a native API (/api/chat, used below) and an OpenAI-compatible endpoint (/v1/chat/completions) that accepts the same request format as the OpenAI Chat Completions API, making it a drop-in replacement for development and testing without incurring API costs.

This example shows a complete request to Ollama's native chat endpoint for Llama 3.1 8B, a model that runs comfortably on a Mac with 16 GB of RAM and produces strong results for general tasks. The native API adds model-specific settings through the options object: num_ctx sets the context window for the current request (2048 by default, up to the model's training limit), temperature and top_p behave as in the OpenAI API, and num_predict caps the number of generated tokens. The stream parameter defaults to true, returning a streaming response; set it to false to receive a single JSON object.

Beyond local development, Ollama is increasingly used in production for privacy-sensitive workloads where sending data to third-party APIs is unacceptable: medical records, financial data, proprietary code analysis. The cost model is fundamentally different from API-based LLMs: hardware acquisition cost is amortized over requests rather than paid per token, so for high-volume use cases self-hosting becomes economically competitive with cloud APIs.

Example
{
  "model": "llama3.1:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful coding assistant. Provide concise, working code examples."
    },
    {
      "role": "user",
      "content": "Write a Python function that reads a CSV file and returns a list of dictionaries, one per row."
    }
  ],
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "temperature": 0.3,
    "num_predict": 512
  }
}
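The request above can be sent to a locally running Ollama server with nothing beyond Python's standard library. This is a minimal sketch; the helper names (build_request, chat) are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint

def build_request(user_prompt, system_prompt,
                  num_ctx=4096, temperature=0.3, num_predict=512):
    """Assemble the same request body shown in the example above."""
    return {
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,  # one JSON object instead of a token stream
        "options": {
            "num_ctx": num_ctx,          # context window for this request
            "temperature": temperature,  # lower = more deterministic
            "num_predict": num_predict,  # cap on generated tokens
        },
    }

def chat(user_prompt, system_prompt="You are a helpful coding assistant."):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_request(user_prompt, system_prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the reply under message.content
        return json.loads(resp.read())["message"]["content"]
```

Because stream is false, the whole reply arrives in one JSON object; with streaming enabled you would instead read newline-delimited JSON chunks from the response body.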

FAQ

How do I install Ollama?
Download from ollama.ai for macOS, Linux, or Windows. After installation, run "ollama pull llama3.1:8b" to download the model, then "ollama serve" to start the API server. The server listens on localhost:11434 by default.
Is the Ollama API fully compatible with OpenAI SDKs?
Largely yes. The OpenAI Python and JavaScript SDKs work with Ollama by setting base_url="http://localhost:11434/v1" and using any string as the api_key. Some advanced features like function calling have varying support depending on the model.
What Llama models run well on consumer hardware?
Llama 3.1 8B runs on 8-16GB RAM with acceptable speed. Llama 3.1 70B requires 40GB+ RAM and runs much slower. For GPU acceleration on Apple Silicon, Ollama uses Metal automatically. For NVIDIA GPUs, CUDA acceleration is automatic.

Related Examples