Run Llama with the Ollama API
Ollama is the most popular tool for running Llama and other open-weight models locally. Its server listens on localhost:11434, serving both a native API (/api/chat) and an OpenAI-compatible endpoint (/v1) that accepts the same request format as the OpenAI Chat Completions API, making it a drop-in replacement for development and testing without incurring API costs. This example shows a complete request to Ollama's native chat endpoint for Llama 3.1 8B, a model that runs comfortably on a Mac with 16GB of RAM and produces strong results for general tasks.

The Ollama API adds model-specific settings through the options object: num_ctx sets the context window size for the current request (default 2048, up to the model's training limit), temperature and top_p behave the same as in the OpenAI API, and num_predict caps the number of generated tokens. The stream parameter defaults to true in Ollama, returning a stream of partial responses; set it to false to receive a single JSON object.

Beyond local development, Ollama is increasingly used in production for privacy-sensitive applications where sending data to third-party APIs is not acceptable: medical records, financial data, proprietary code analysis. The cost model is fundamentally different from API-based LLMs: hardware acquisition cost is amortized over requests rather than billed per token, so for high-volume use cases self-hosting becomes economically competitive with cloud APIs.
{
  "model": "llama3.1:8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful coding assistant. Provide concise, working code examples."
    },
    {
      "role": "user",
      "content": "Write a Python function that reads a CSV file and returns a list of dictionaries, one per row."
    }
  ],
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "temperature": 0.3,
    "num_predict": 512
  }
}

FAQ
- How do I install Ollama?
- Download from ollama.ai for macOS, Linux, or Windows. After installation, run "ollama pull llama3.1:8b" to download the model, then "ollama serve" to start the API server. The server listens on localhost:11434 by default.
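Once the server is running, the example request above can be sent with nothing beyond the Python standard library. This is a minimal sketch assuming the native /api/chat endpoint on the default port; the chat helper name is illustrative, not part of any SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

# The same request body as the JSON example above.
payload = {
    "model": "llama3.1:8b",
    "messages": [
        {"role": "system",
         "content": "You are a helpful coding assistant. "
                    "Provide concise, working code examples."},
        {"role": "user",
         "content": "Write a Python function that reads a CSV file and "
                    "returns a list of dictionaries, one per row."},
    ],
    "stream": False,  # one JSON object instead of a token stream
    "options": {"num_ctx": 4096, "temperature": 0.3, "num_predict": 512},
}

def chat(payload: dict) -> str:
    """POST to /api/chat and return the assistant's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # With "stream": false, the reply is in message.content.
    return body["message"]["content"]

# print(chat(payload))  # requires a running `ollama serve`
```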
- Is the Ollama API fully compatible with OpenAI SDKs?
- Largely yes. The OpenAI Python and JavaScript SDKs work with Ollama by setting base_url="http://localhost:11434/v1" and using any string as the api_key. Some advanced features like function calling have varying support depending on the model.
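As an illustration, here is a minimal sketch of pointing the official OpenAI Python SDK at a local Ollama server; it assumes `pip install openai` and that the model has already been pulled, and the ask helper is a made-up name:

```python
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"

def make_client():
    # Requires the openai package. The api_key is mandatory in the SDK
    # but ignored by Ollama, so any placeholder string works.
    from openai import OpenAI
    return OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key="ollama")

def ask(prompt: str) -> str:
    client = make_client()
    resp = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

# print(ask("Explain list comprehensions in one sentence."))
```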
- What Llama models run well on consumer hardware?
- Llama 3.1 8B runs on 8-16GB RAM with acceptable speed. Llama 3.1 70B requires 40GB+ RAM and runs much slower. For GPU acceleration on Apple Silicon, Ollama uses Metal automatically. For NVIDIA GPUs, CUDA acceleration is automatic.
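The RAM figures above can be sanity-checked with a back-of-envelope estimate. This is a common heuristic, not an Ollama formula: weight memory is roughly parameter count times bits per weight divided by 8, and Ollama's default downloads are 4-bit quantized; the 20% overhead factor here is an assumption covering the KV cache and runtime:

```python
def approx_weight_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    # Rough heuristic (an assumption, not an Ollama formula):
    # weights take params * bits/8 bytes, padded ~20% for KV cache and runtime.
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * 1.2 / 1e9, 1)

print(approx_weight_gb(8))   # Llama 3.1 8B at 4-bit -> 4.8
print(approx_weight_gb(70))  # Llama 3.1 70B at 4-bit -> 42.0
```

The 70B estimate lines up with the 40GB+ figure above, which is why that model is impractical on most consumer machines.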
Related Examples
- Count Tokens for Llama 3: Llama 3 uses a custom tokenizer based on tiktoken's BPE algorithm with a vocabul...
- Write a System Prompt for Llama 3: Llama 3 uses a structured chat template with special tokens that must be applied...
- Build an OpenAI Chat Completion Request: The Chat Completion API is the primary interface for all GPT models and the foun...