Format CSV Data for AI Fine-Tuning

Fine-tuning LLMs on custom datasets requires converting raw training data into the specific JSONL format each provider expects. OpenAI fine-tuning requires a messages array with system, user, and assistant turns, and Anthropic fine-tuning uses a similar structure. Both require one training example per line as valid JSON (not a JSON array), with consistent field names and no missing required fields.

This example takes a CSV file of customer support Q&A pairs and converts each row to the correct fine-tuning format. The most common mistakes when preparing fine-tuning datasets are:

- using a JSON array instead of JSONL (one JSON object per line)
- including examples with empty assistant responses
- inconsistent system prompt formatting across examples
- control characters or newlines in values that break the JSONL format

The formatter catches all of these and reports the row number of each error.

For the best fine-tuning results, aim for at least 50 high-quality examples (more is better, but quality matters more than quantity), ensure the examples cover the full range of scenarios your model will encounter in production, and hold out a validation split of 10-20% to detect overfitting during training.
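The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the formatter itself: the function name `csv_to_openai_jsonl` is hypothetical, the column names match the example CSV below, and only the empty-assistant-response check is shown; a real formatter would also validate the other failure modes listed.

```python
import csv
import json

def csv_to_openai_jsonl(csv_path, jsonl_path):
    """Convert a CSV of Q&A pairs into OpenAI-style JSONL; return a list of row errors.

    Illustrative sketch: assumes columns user_message, assistant_response,
    system_context, as in the example CSV.
    """
    errors = []
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        # start=2 so reported numbers match the CSV file (row 1 is the header)
        for row_num, row in enumerate(csv.DictReader(src), start=2):
            assistant = (row.get("assistant_response") or "").strip()
            if not assistant:
                errors.append(f"row {row_num}: empty assistant response")
                continue
            example = {"messages": [
                {"role": "system", "content": (row.get("system_context") or "").strip()},
                {"role": "user", "content": (row.get("user_message") or "").strip()},
                {"role": "assistant", "content": assistant},
            ]}
            # json.dumps escapes embedded newlines and control characters,
            # so each example stays on exactly one line of the JSONL file
            dst.write(json.dumps(example, ensure_ascii=False) + "\n")
    return errors
```

Note that `json.dumps` handles the newline/control-character problem automatically, which is one reason to generate JSONL programmatically rather than by string concatenation.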

Example
user_message,assistant_response,system_context
How do I reset my password?,"Click the ""Forgot Password"" link on the login page and follow the email instructions.",You are a helpful customer support agent for Acme Software.
What payment methods do you accept?,"We accept Visa, Mastercard, PayPal, and bank transfers.",You are a helpful customer support agent for Acme Software.
How do I cancel my subscription?,Go to Account Settings > Billing > Cancel Subscription.,You are a helpful customer support agent for Acme Software.
Is there a free trial?,Yes! We offer a 14-day free trial with no credit card required.,You are a helpful customer support agent for Acme Software.
[ open in AI Dataset Formatter → ]
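For reference, the first CSV row above becomes a single JSONL line in the OpenAI messages format. It is shown here as one line, exactly as it must appear in the training file:

```json
{"messages": [{"role": "system", "content": "You are a helpful customer support agent for Acme Software."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Click the \"Forgot Password\" link on the login page and follow the email instructions."}]}
```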

FAQ

What is JSONL format?
JSONL (JSON Lines) is a format where each line is a separate, valid JSON object. It is preferred over a single JSON array for large datasets because it can be processed line by line without loading the entire file into memory.
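The line-by-line property is easy to see in code. This hypothetical helper streams a JSONL file one example at a time, yielding the line number alongside each parsed object so malformed lines can be reported precisely:

```python
import json

def iter_jsonl(path):
    """Yield (line_number, parsed_object) for each line of a JSONL file.

    Reads lazily, so memory use stays constant regardless of file size.
    """
    with open(path, encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate stray blank lines
            yield line_num, json.loads(line)
```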
How many examples do I need for fine-tuning?
OpenAI recommends starting with 50-100 high-quality examples and scaling up. More examples improve results up to a point, but low-quality or inconsistent examples degrade performance regardless of quantity.
Should I include a system prompt in every fine-tuning example?
Yes, if you plan to use a system prompt in production. The model learns to follow the system prompt context during fine-tuning, so including it in every training example makes the model better at following it at inference time.

Related Examples