AI Data
Prepare, clean, and transform data for AI models. Generate synthetic datasets, create fine-tuning examples, format training data, and convert between JSONL, CSV, and other ML data formats.
Split text into token-sized chunks with configurable overlap for RAG and embedding pipelines.
Convert CSV, TSV, or JSON data to JSONL format for LLM fine-tuning with role mapping.
Remove duplicate and near-duplicate lines from text using exact matching and Jaccard similarity.
Full preprocessing pipeline for LLM input: trim, normalize, strip HTML, collapse whitespace, and truncate to context window.
Clean and sanitize text for LLM input by stripping HTML, normalizing Unicode, and collapsing whitespace.
Normalize smart quotes, dashes, ligatures, and accented characters for consistent LLM input.
FAQ
- What is JSONL format used for in AI?
- JSONL (JSON Lines) is a file format where each line is a valid JSON object. It is the standard format for fine-tuning datasets for models like GPT and LLaMA — each line typically represents one training example.
- How do I create a fine-tuning dataset?
- A fine-tuning dataset consists of prompt-completion pairs (or system/user/assistant message triples for chat models). The data should represent the task you want the model to learn, with diverse, high-quality examples.
- What is synthetic data generation?
- Synthetic data is artificially generated data that mimics real data patterns. It is used to augment small datasets, create privacy-safe training data, and bootstrap evaluation benchmarks.