Run Llama on Groq for High-Speed Inference

Groq provides cloud inference for open-source models, including Llama 3.1, on its custom LPU (Language Processing Unit) hardware, delivering 400-900 tokens per second, roughly 10-30x faster than typical GPU inference. Groq's API is OpenAI-compatible, making it a drop-in replacement for applications that already use the OpenAI SDK. The example below shows a Groq API request for Llama 3.1 70B that would complete in under a second.

Groq's key differentiator is latency, not just throughput. Time-to-first-token (TTFT) on Groq is typically under 200ms, compared to 2-5 seconds on standard GPU inference for large models. This makes Groq particularly valuable for real-time applications such as live voice assistants, interactive coding tools, and streaming chat interfaces, where users see the response begin immediately. The free tier provides generous token allowances for testing.

Models available on Groq include Llama 3.1 8B, 70B, and 405B, as well as Mixtral, Gemma, and other open-source models. Availability changes as Groq partners with model providers, so check the models page for the current list. The API uses model IDs like "llama-3.1-70b-versatile" rather than Ollama-style format strings.
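Because the API is OpenAI-compatible, the OpenAI SDK works against Groq by changing only `base_url` and the API key. The sketch below instead uses just the Python standard library, so it has no dependencies: it builds the same request body as the example on this page and POSTs it to Groq's documented chat-completions endpoint. It assumes `GROQ_API_KEY` is set in the environment.

```python
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_request(prompt, model="llama-3.1-70b-versatile", max_tokens=600):
    """Assemble the JSON body for a Groq chat-completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a helpful assistant. Be concise and accurate."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.5,
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    # Network call only when run as a script, with a real API key.
    body = json.dumps(build_request("Explain the CAP theorem.")).encode()
    req = urllib.request.Request(
        GROQ_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

To switch an existing OpenAI SDK app over, the same body works unchanged; only the endpoint, key, and model ID differ.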

Example
{
  "model": "llama-3.1-70b-versatile",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. Be concise and accurate."
    },
    {
      "role": "user",
      "content": "Explain the CAP theorem in distributed systems and give a real-world example of a system that prioritizes availability over consistency."
    }
  ],
  "temperature": 0.5,
  "max_tokens": 600,
  "stream": true
}
[ open in LLaMA API Request Builder → ]

FAQ

How fast is Groq compared to OpenAI?
Groq delivers 400-900 tokens per second versus 40-100 tokens per second for GPT-4o on OpenAI. Time-to-first-token is under 200ms on Groq versus 1-3 seconds on OpenAI. The speed difference is most noticeable for long outputs.
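To make those numbers concrete, a back-of-the-envelope timing model: total wall-clock time is TTFT plus tokens divided by throughput. The specific figures below are illustrative mid-range values from the answer above, not benchmarks.

```python
def completion_seconds(tokens: int, tokens_per_second: float,
                       ttft_seconds: float) -> float:
    # Total wall-clock time: wait for the first token, then stream the rest.
    return ttft_seconds + tokens / tokens_per_second

# A 600-token reply (matching max_tokens in the example request above):
groq_time = completion_seconds(600, 500, 0.2)  # about 1.4 s
gpu_time = completion_seconds(600, 60, 2.0)    # about 12 s
```

The gap widens linearly with output length, which is why the difference is most visible on long completions.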
Is Groq free to use?
Yes, Groq has a free tier with generous daily token limits per model. The free tier is sufficient for development and moderate production use. Paid tiers remove rate limits for high-volume applications.
Does Groq support all OpenAI API features?
Groq supports chat completions, function calling, and JSON mode from the OpenAI API spec. It does not support fine-tuning, embeddings, or the Assistants API. The streaming format is compatible with OpenAI SDK streaming.
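Since JSON mode follows the OpenAI spec, a request can ask for a guaranteed-parseable object by adding a `response_format` field. A sketch of building such a request body (the field shape is the standard OpenAI one; per that spec, the prompt itself must also mention JSON):

```python
import json

def json_mode_body(prompt: str,
                   model: str = "llama-3.1-70b-versatile") -> dict:
    """Build a chat-completion body that requests a JSON object response."""
    return {
        "model": model,
        "messages": [
            # JSON mode requires the word "JSON" to appear in the prompt.
            {"role": "user", "content": prompt + " Reply with a JSON object."}
        ],
        "response_format": {"type": "json_object"},
    }

# Serializes like the example request above, plus the response_format field.
body = json.dumps(json_mode_body("List three CAP theorem trade-offs."))
```

The returned `message.content` can then be passed straight to `json.loads` without regex cleanup.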

Related Examples