Clean HTML-Heavy Text for AI Processing
Web-scraped content is almost never ready to feed directly to an LLM. HTML tags, entity-encoded characters, navigation boilerplate, cookie consent banners, and JavaScript snippets all inflate token counts and distract the model from the actual content. This example shows a realistic snippet of HTML-heavy web content and demonstrates how the text cleaner strips all markup, decodes entities, and returns clean prose that tokenizes efficiently and processes reliably.

The cleaning pipeline runs five steps in order: HTML tag removal, entity decoding (&amp; → &, &nbsp; → space, &lt; → <), whitespace normalization, boilerplate pattern removal (common navigation and cookie consent phrases), and Unicode normalization. The order matters: removing tags first prevents entities inside tag attributes from being decoded into the visible text.

For RAG pipelines that ingest web content, cleaning text before chunking and embedding dramatically improves retrieval quality. Clean text produces tighter, more semantically coherent embeddings because the embedding model is not wasting capacity on HTML structure. Aim to reduce token count by at least 40% compared to the raw HTML for typical web pages.
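The five steps above can be sketched in standard-library Python. This is a minimal sketch, not a production cleaner: the function name and the boilerplate phrase list are illustrative assumptions, and real pipelines usually use an HTML parser rather than regexes.

```python
import html
import re
import unicodedata

# Illustrative phrase list -- tune this to the boilerplate in your corpus.
BOILERPLATE_PATTERNS = [
    r"accept cookies",
    r"click here to continue",
    r"share this article",
]

def clean_html_text(raw: str) -> str:
    # 1. Remove <script>/<style> blocks wholesale, then strip all remaining
    #    tags. Stripping tags before decoding entities keeps encoded markup
    #    inside attributes from leaking into the visible text.
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", " ", raw,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Decode HTML entities (&amp; -> &, &nbsp; -> space, &lt; -> <).
    text = html.unescape(text)
    # 3. Normalize whitespace (also collapses the non-breaking spaces
    #    produced by decoding &nbsp; in step 2).
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Drop any sentence containing a known boilerplate phrase.
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(rf"[^.!?]*{pattern}[^.!?]*[.!?]?", "", text,
                      flags=re.IGNORECASE)
    # 5. Unicode normalization (NFKC folds compatibility characters).
    return unicodedata.normalize("NFKC", text).strip()
```

Running this over the snippet below would keep the article prose while dropping the tracking script and the cookie notice.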
<div class="article-content">
<h1>Introduction to Machine Learning</h1>
<p>Machine learning is a subset of artificial intelligence that enables systems to learn from data.</p>
<nav><a href="/">Home</a> | <a href="/blog">Blog</a></nav>
<p>Key algorithms include:</p>
<ul>
<li><strong>Decision Trees</strong> — rule-based classifiers</li>
<li><em>Neural Networks</em> — inspired by the brain</li>
</ul>
<script>trackPageView('article-ml-intro');</script>
<p>Click "Accept Cookies" to continue browsing our site.</p>
</div>
FAQ
- Why does HTML hurt LLM performance?
- HTML tags consume tokens without adding semantic value. A tag like <div class="article-content"> uses 6+ tokens that contribute nothing to the content. For long documents this can waste 20-40% of your context window.
- Should I preserve any HTML structure?
- For most LLM tasks, no. The exception is when document structure is part of the task — for example, extracting specific sections by heading. In that case, convert headings to markdown (#, ##) and remove all other tags.
- What other artifacts should I clean from scraped content?
- Navigation menus, cookie consent notices, share buttons, related article lists, and advertisement text all appear in scraped pages. Pattern-based removal of common boilerplate phrases is effective for high-volume pipelines.
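The heading-preserving variant mentioned above can be sketched as a small converter that rewrites `<h1>`–`<h6>` to markdown before stripping the remaining tags. The function name is ours, not a library API.

```python
import re

def headings_to_markdown(raw: str) -> str:
    # Rewrite <h1>..</h1> to "# ..", <h2>..</h2> to "## ..", and so on,
    # then strip every other tag so only prose and headings remain.
    def replace(match: re.Match) -> str:
        level = int(match.group(1))
        return "\n" + "#" * level + " " + match.group(2).strip() + "\n"

    text = re.sub(r"<h([1-6])[^>]*>(.*?)</h\1>", replace, raw,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)   # drop all remaining tags
    return re.sub(r"[ \t]+", " ", text).strip()
```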
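For high-volume pipelines, pattern-based removal is often applied line by line rather than sentence by sentence: any line matching a known boilerplate phrase is dropped wholesale. A sketch, with an assumed phrase list:

```python
import re

# Illustrative boilerplate phrases; extend from samples of your corpus.
BOILERPLATE = re.compile(
    r"accept (all )?cookies|share this|related articles?"
    r"|sign up for our newsletter",
    re.IGNORECASE,
)

def drop_boilerplate_lines(text: str) -> str:
    # Keep only non-empty lines that match no boilerplate phrase.
    kept = [line for line in text.splitlines()
            if line.strip() and not BOILERPLATE.search(line)]
    return "\n".join(kept)
```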