JSONL Datasets for LLM Fine-tuning and Evaluation

JSONL (JSON Lines) is the standard format for LLM training datasets, fine-tuning files, and batch evaluation sets. Each line in a JSONL file is a self-contained JSON object representing one training example or request. This guide covers how to create, validate, and optimise JSONL datasets for OpenAI fine-tuning, Anthropic Message Batches, and custom evaluation pipelines.

What is JSONL and Why LLMs Use It

JSONL (JSON Lines) stores one complete JSON object per line, unlike standard JSON which requires the entire file to be a valid JSON value. This design is ideal for large datasets because each line is independently parseable — you can stream, sample, and process records without loading the entire file into memory. For LLM APIs, JSONL is the standard format for fine-tuning files (OpenAI, Mistral, Together AI), batch request files (OpenAI Batch API, Anthropic Message Batches), and evaluation benchmarks. Each line typically represents one training example in a messages format: a JSON object with a "messages" array containing system, user, and assistant turns.
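The line-at-a-time property described above can be seen in a minimal sketch: each line parses independently, so records stream without reading the whole file into memory.

```python
import json

# Two fine-tuning records in the messages format, one JSON object per line --
# this is the JSONL property that makes streaming and sampling cheap.
raw = (
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}\n'
    '{"messages": [{"role": "user", "content": "Bye"}, {"role": "assistant", "content": "Goodbye!"}]}\n'
)

def iter_jsonl(lines):
    """Yield one parsed record per non-empty line; each line stands alone."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_jsonl(raw.splitlines()))
print(len(records))  # 2
```

The same generator works unchanged over an open file handle, which is why multi-gigabyte training files stay processable on modest hardware.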

OpenAI Fine-tuning Format

OpenAI fine-tuning expects JSONL where each line follows the chat completions format: {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. The assistant message is the "completion" the model learns to generate given the system + user context. For function calling fine-tuning, add a "tools" array and include function call examples where the assistant message is a tool_call rather than text. Minimum dataset size is 10 examples, but 50-100 examples are typically needed for visible improvements; 500-1,000 examples for reliable specialisation. OpenAI's fine-tuning API validates the file format before training begins.
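A sketch of writing one training example in this format (the filename `train.jsonl` is illustrative): the key constraint is that each record must serialise to a single line.

```python
import json

# One training example in the chat fine-tuning format: the assistant turn is
# the completion the model learns to produce for the system + user context.
example = {
    "messages": [
        {"role": "system", "content": "You are a terse support bot."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Settings > Security > Reset password."},
    ]
}

# json.dumps with default separators never emits newlines, so each record
# stays on one line -- pretty-printed, multi-line JSON would break JSONL.
line = json.dumps(example, ensure_ascii=False)
assert "\n" not in line
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")
```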

Dataset Quality and Diversity

The quality of your fine-tuning dataset directly determines the quality of the fine-tuned model. Three dimensions of quality matter: accuracy (assistant responses must be factually correct and high-quality), diversity (examples should cover the range of inputs the model will encounter in production), and coverage (edge cases and difficult inputs should be represented, not just typical ones). Avoid datasets that are biased toward a narrow vocabulary or phrasing style — the model will overfit to that style and perform poorly on differently-phrased inputs. A useful audit: generate 20 examples with your target model (e.g., GPT-4o) and check whether they cover distinct input types; if the first 10 examples look similar, your dataset generation prompt lacks diversity.
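One way to make the diversity audit concrete is a rough heuristic (illustrative, not an official metric): if many user prompts open with the same few words, the dataset is likely biased toward one phrasing style.

```python
from collections import Counter

def opening_phrase_ratio(records, n_words=3):
    """Fraction of distinct opening phrases among user prompts.
    1.0 means every prompt opens differently; low values suggest
    the dataset is stuck in one phrasing template."""
    openings = Counter(
        " ".join(r["messages"][-2]["content"].lower().split()[:n_words])
        for r in records
    )
    return len(openings) / max(len(records), 1)

records = [
    {"messages": [{"role": "user", "content": "How do I cancel my plan?"},
                  {"role": "assistant", "content": "..."}]},
    {"messages": [{"role": "user", "content": "How do I change my email?"},
                  {"role": "assistant", "content": "..."}]},
]
print(opening_phrase_ratio(records))  # 0.5 -- both prompts open with "how do i"
```

A low ratio on a real dataset is a prompt to regenerate or augment with differently-phrased inputs before training.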

Token Counting and Dataset Size

Before uploading a fine-tuning file, count the total tokens in the dataset to estimate training cost and time. OpenAI charges per training token, so a dataset with many long examples costs significantly more than one with shorter examples of equal quality. The total training cost is: (training_tokens × number_of_epochs) × price_per_training_token. As an illustration, at $25 per million training tokens (rates vary by model and change over time — check OpenAI's current pricing page), a 500-example dataset averaging 500 tokens per example = 250,000 tokens × 3 epochs = 750,000 training tokens = $18.75. Validate token counts per example and truncate examples that exceed the model's maximum training sequence length, which varies by model and is documented in the fine-tuning guide.
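The cost arithmetic above can be sketched as follows. For accurate counts use a real tokenizer such as tiktoken; here `len(text) // 4` is a crude chars-per-token approximation used only to keep the example self-contained.

```python
def estimate_tokens(record):
    """Very rough token estimate: ~4 characters per token."""
    text = "".join(m["content"] for m in record["messages"])
    return max(len(text) // 4, 1)

def training_cost(records, epochs=3, price_per_million=25.0):
    """(dataset tokens x epochs) x price per training token."""
    total = sum(estimate_tokens(r) for r in records)
    return total * epochs / 1_000_000 * price_per_million

# 500 examples of ~500 tokens each -> 250k tokens x 3 epochs = 750k training tokens.
records = [{"messages": [{"role": "user", "content": "x" * 1000},
                         {"role": "assistant", "content": "x" * 1000}]}] * 500
print(f"${training_cost(records):.2f}")  # $18.75 at $25 per 1M training tokens
```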

Validation and Pre-processing

Before submitting a JSONL file for fine-tuning, run these validations: (1) every line must be valid JSON; (2) every record must have a "messages" key with an array containing at least one user and one assistant message; (3) roles must be "system", "user", or "assistant" — no other values; (4) content must be non-empty strings; (5) total token count per example must be within the model's maximum sequence length. OpenAI provides an official validation script in their documentation. Additionally, check for duplicate examples (duplicates waste training budget and cause overfitting), empty examples (often caused by a bug in dataset generation), and examples where the assistant message is suspiciously short (may indicate truncated data).
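The checks above can be sketched as a per-line validator (a minimal version of what OpenAI's official script covers, not a replacement for it):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_line(line, lineno):
    """Return a list of problems found in one JSONL line (empty = valid)."""
    errors = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"line {lineno}: invalid JSON ({e.msg})"]
    messages = record.get("messages")
    if not isinstance(messages, list):
        return [f"line {lineno}: missing 'messages' array"]
    roles = [m.get("role") for m in messages]
    if "user" not in roles or "assistant" not in roles:
        errors.append(f"line {lineno}: needs a user and an assistant message")
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            errors.append(f"line {lineno}: bad role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            errors.append(f"line {lineno}: empty or non-string content")
    return errors

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
bad = '{"messages": [{"role": "bot", "content": ""}]}'
print(validate_line(good, 1))  # []
print(validate_line(bad, 2))   # missing turns, bad role, empty content
```

Duplicate detection is one line more: hash each raw line into a set and flag collisions before upload.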

Evaluation Datasets and Benchmarking

Beyond fine-tuning, JSONL is the standard format for model evaluation benchmarks. An evaluation dataset pairs each input with one or more gold-standard outputs. Three evaluation strategies are common: exact match (the model's output exactly matches the expected output — suitable for structured data extraction); LLM-as-judge (a second model scores the first model's output against a rubric — suitable for open-ended generation); and human evaluation (for high-stakes applications, sample a portion of responses, e.g. 5%, for human review). Store evaluation results in JSONL too, with fields for the input, expected output, actual output, judge score, and latency — this makes it easy to sort by score and find failure patterns.
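An exact-match evaluation loop writing one JSONL result per example might look like this sketch; `fake_model` is a hypothetical stand-in for a real API client, and `eval_results.jsonl` is an illustrative filename.

```python
import json

eval_set = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]

def fake_model(prompt):
    # Hypothetical stand-in -- replace with a call to your model of choice.
    return {"Capital of France?": "Paris", "2 + 2?": "5"}[prompt]

results = []
for ex in eval_set:
    actual = fake_model(ex["input"])
    results.append({
        "input": ex["input"],
        "expected": ex["expected"],
        "actual": actual,
        "exact_match": actual == ex["expected"],
    })

accuracy = sum(r["exact_match"] for r in results) / len(results)
print(f"accuracy: {accuracy:.0%}")  # accuracy: 50%

# One result record per line: failures can later be filtered and sorted.
with open("eval_results.jsonl", "w", encoding="utf-8") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
```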


FAQ

How do I generate a fine-tuning dataset with AI assistance?
Use a frontier model to generate synthetic training examples. Provide 5-10 high-quality hand-crafted examples as few-shot context, then ask the model to generate 50-100 similar examples with varied inputs. Review the generated examples for quality and diversity before using them for fine-tuning. Synthetic datasets work well when combined with at least 20% real, curated examples.
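The few-shot approach above can be sketched as a prompt builder; the seed examples and wording are illustrative, and the actual model call is omitted — wire the returned prompt into whichever API client you use.

```python
import json

seeds = [
    {"user": "How do I export my data?", "assistant": "Settings > Export > CSV."},
    {"user": "Where do I find invoices?", "assistant": "Billing > Invoice history."},
]

def build_generation_prompt(seed_examples, n=50):
    """Embed hand-crafted seeds as few-shot context and request varied new ones."""
    shots = "\n".join(json.dumps(s) for s in seed_examples)
    return (
        "Here are hand-crafted support Q&A examples, one JSON object per line:\n"
        f"{shots}\n\n"
        f"Generate {n} more examples in the same JSONL format, covering "
        "distinct topics and varied phrasings. Output only JSONL."
    )

prompt = build_generation_prompt(seeds, n=50)
print(prompt)
```

Validate the model's output with the same per-line checks used for any fine-tuning file before mixing it with your real, curated examples.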
Can I fine-tune on my existing conversation logs?
Yes, if the existing conversations meet quality standards. Filter out conversations where the assistant's response was rated poorly or where the user submitted a follow-up correction. Only keep conversations where the model's response was accepted without correction, and ensure the dataset has diversity across input types.
How many fine-tuning epochs should I use?
OpenAI defaults to 3 epochs, which is a good starting point. More epochs can cause overfitting on small datasets. Evaluate the fine-tuned model after 1, 2, and 3 epochs on a held-out validation set and pick the epoch with the best validation performance. Early stopping is better than a fixed number of epochs.

Related Guides