AI Input Preprocessor
Full preprocessing pipeline for LLM input: trim, normalize, strip HTML, collapse whitespace, and truncate to context window.
Related Tools
Clean and sanitize text for LLM input by stripping HTML, normalizing Unicode, and collapsing whitespace.
Normalize smart quotes, dashes, ligatures, and accented characters for consistent LLM input.
Split text into token-sized chunks with configurable overlap for RAG and embedding pipelines.
Remove invisible Unicode, escape injection keywords, and strip dangerous content from LLM input.
Learn More
FAQ
- What order are preprocessing steps applied in?
- Steps are applied in this order: (1) trim leading/trailing whitespace, (2) normalize typographic characters (smart quotes, dashes), (3) strip HTML tags, (4) collapse multiple spaces and newlines, (5) truncate to the selected model's context window at a word boundary.
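The five steps above can be sketched as a chain of small functions. This is an illustrative sketch, not the tool's actual implementation; the function names, the character-replacement table, and the regex-based HTML stripping are all assumptions.

```python
import re

# Hypothetical mapping of typographic characters to ASCII equivalents.
SMART_CHARS = {
    "\u2018": "'", "\u2019": "'",   # smart single quotes
    "\u201c": '"', "\u201d": '"',   # smart double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
}

def normalize_typography(text: str) -> str:
    for src, dst in SMART_CHARS.items():
        text = text.replace(src, dst)
    return text

def strip_html(text: str) -> str:
    # Naive tag removal; a production tool would likely use a real HTML parser.
    return re.sub(r"<[^>]+>", "", text)

def collapse_whitespace(text: str) -> str:
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces/tabs -> one space
    return re.sub(r"\n{3,}", "\n\n", text)    # 3+ newlines -> one blank line

def preprocess(text: str, max_words: int) -> str:
    text = text.strip()                        # (1) trim
    text = normalize_typography(text)          # (2) normalize typographic chars
    text = strip_html(text)                    # (3) strip HTML tags
    text = collapse_whitespace(text)           # (4) collapse whitespace
    words = text.split()                       # (5) truncate at a word boundary
    return " ".join(words[:max_words]) if len(words) > max_words else text
```

Note that step 3 runs after step 2, so typographic characters inside HTML attributes are also normalized before the tags are removed.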
- How does truncation to context window work?
- The tool estimates token count using a word-based approximation and cuts the text at the last word boundary before the model's context limit, so the output fits within the model's context window without a word being split in half. Because the count is an approximation rather than an exact tokenizer count, very dense text may still land slightly over or under the limit.
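A minimal sketch of word-boundary truncation, assuming a fixed tokens-per-word ratio (the tool's actual ratio and estimator are not documented, so the 1.3 figure here is purely illustrative):

```python
TOKENS_PER_WORD = 1.3  # assumed heuristic; not the tool's documented ratio

def truncate_to_context(text: str, context_limit: int) -> str:
    """Cut text at the last word boundary whose estimated token count fits."""
    words = text.split()
    est_tokens = int(len(words) * TOKENS_PER_WORD)
    if est_tokens <= context_limit:
        return text  # already fits; return unmodified
    # Keep as many whole words as the estimated token budget allows.
    max_words = int(context_limit / TOKENS_PER_WORD)
    return " ".join(words[:max_words])
```

Splitting on whitespace guarantees the cut never lands mid-word, which matches the behavior described above.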
- Can I skip individual steps?
- The pipeline runs all steps in sequence. For individual step control, use the dedicated tools: AI Text Cleaner for HTML/whitespace, AI Text Normalizer for typographic normalization, or AI Chunk Overlap for chunking long documents.
Run text through a complete preprocessing pipeline before sending to an LLM API. Steps: trim whitespace, normalize typographic characters, strip HTML tags, collapse whitespace, then truncate at word boundaries to fit the selected model's context window. Shows token counts at each pipeline stage as a vertical timeline.
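The per-stage token timeline could be modeled as shown below. This is a simplified sketch under assumptions: the stage list is abbreviated (truncation omitted), the token estimator is a word-count heuristic, and none of the names come from the tool itself.

```python
import re

def estimate_tokens(text: str) -> int:
    # Word-based approximation; an assumed ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)

# Abbreviated stage list for illustration (normalization reduced to quote fixing).
STAGES = [
    ("trim", lambda t: t.strip()),
    ("normalize", lambda t: t.replace("\u201c", '"').replace("\u201d", '"')),
    ("strip HTML", lambda t: re.sub(r"<[^>]+>", "", t)),
    ("collapse whitespace", lambda t: re.sub(r"\s+", " ", t)),
]

def run_pipeline_with_counts(text: str):
    """Run each stage and record the token estimate after it, timeline-style."""
    timeline = [("input", estimate_tokens(text))]
    for name, step in STAGES:
        text = step(text)
        timeline.append((name, estimate_tokens(text)))
    return text, timeline
```

Rendering `timeline` one entry per row gives the vertical timeline described above, making it easy to see which stage removed the most tokens.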