Block AI Crawlers robots.txt Template
As large language models became commercially significant in 2023–2024, the companies building them deployed specialized web crawlers to collect training data from public websites. These crawlers operate differently from search engine bots: rather than indexing content for search results (which sends traffic back to your site), AI training crawlers extract content to train models that may generate summaries or answers that replace visits to your site entirely. Many publishers and content creators have responded by blocking these crawlers in robots.txt.
The major AI training crawlers to block include: GPTBot (OpenAI), CCBot (Common Crawl, which provides training data to many AI companies, including Meta and Mistral), ClaudeBot, Claude-Web, and anthropic-ai (Anthropic), Google-Extended (Google AI/Gemini, separate from Googlebot for search), Omgili and OmgiliBot, PerplexityBot (Perplexity AI), Bytespider (ByteDance/TikTok AI), and FacebookBot (Meta). This list expands regularly as new AI companies launch crawlers.
It is important to distinguish Google-Extended from Googlebot. Google-Extended controls whether your content is used to train Gemini and improve Google's AI products. Blocking Google-Extended does not affect your Google Search rankings — Googlebot (the search crawler) continues operating normally. This distinction allows you to preserve your organic search traffic while opting out of AI training data collection.
Blocking crawlers via robots.txt only works for compliant bots. Reputable AI companies (OpenAI, Anthropic, Google) have stated that they respect robots.txt. Common Crawl has a more mixed record. Unknown scrapers and data brokers that do not identify themselves with recognizable User-agent strings will not be stopped by robots.txt at all. For more robust protection, consider rate limiting, CAPTCHAs, or a commercial bot management solution.
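For bots that do identify themselves but ignore robots.txt, enforcement has to happen server-side. A minimal sketch of user-agent matching in Python (the token list is illustrative, not exhaustive, and real deployments usually put this rule in the web server or a bot-management service rather than application code):

```python
# Known AI crawler tokens, matched case-insensitively as substrings.
# Illustrative list only — crawler names change over time.
AI_CRAWLER_TOKENS = [
    "gptbot",
    "ccbot",
    "claude-web",
    "google-extended",
    "perplexitybot",
    "bytespider",
    "facebookbot",
]

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)

# In a request handler you would return 403 when this is True, e.g.:
# if is_ai_crawler(request.headers.get("User-Agent", "")):
#     return Response(status=403)
```

Note that this only catches bots that send an honest User-Agent header; scrapers that spoof a browser string require rate limiting or behavioral detection instead.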
The legal landscape around AI web scraping is rapidly evolving. Multiple lawsuits (New York Times v. OpenAI, class actions from book authors) are testing whether scraping for AI training constitutes copyright infringement regardless of robots.txt. Blocking crawlers in robots.txt establishes a documented opt-out that may be relevant in future legal proceedings, even if its legal significance is not yet settled.
For content businesses — news publishers, educational content sites, creative writing platforms — blocking AI crawlers is becoming standard practice. For SaaS companies, developer tools, and informational sites that derive value from broad discovery, the calculus is different: appearing in AI search results (Perplexity, ChatGPT Browse) can drive traffic, making selective blocking (training data crawlers only, not inference crawlers) more appropriate.
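A sketch of that selective approach in robots.txt (crawler names as publicly documented at the time of writing; verify each vendor's current user agents before deploying):

```
# Block training-data collection
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow inference/search crawlers that can send referral traffic
User-agent: PerplexityBot
Allow: /
```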
Template Preview
# Standard search engines — allow
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Google AI training — block (does not affect Search)
User-agent: Google-Extended
Disallow: /

# OpenAI training
User-agent: GPTBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Common Crawl (training data for many LLMs)
User-agent: CCBot
Disallow: /

# Perplexity AI
User-agent: PerplexityBot
Disallow: /

# ByteDance AI
User-agent: Bytespider
Disallow: /

# Meta AI
User-agent: FacebookBot
Disallow: /

# Default — allow all other crawlers
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
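Before deploying a template like this, you can sanity-check it with Python's built-in urllib.robotparser, which evaluates robots.txt rules the way compliant crawlers do (the rules below are a trimmed version of the template above):

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the template above
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Search crawler is allowed; AI training crawlers are not
print(parser.can_fetch("Googlebot", "/article"))        # True
print(parser.can_fetch("GPTBot", "/article"))           # False
print(parser.can_fetch("Google-Extended", "/article"))  # False
```

This is also a quick way to confirm that blocking Google-Extended leaves Googlebot's access untouched.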
Customize this template with your own details using the free generator:
FAQ
- Does blocking GPTBot prevent my content from appearing in ChatGPT answers?
- Blocking GPTBot prevents OpenAI from crawling your site to collect new training data going forward. It does not retroactively remove content that OpenAI has already crawled from existing models (the training data cutoff for those models is in the past). Live browsing in ChatGPT uses a separate user agent, ChatGPT-User, so block that agent as well if you want to prevent real-time retrieval from your site.
- Will blocking AI crawlers hurt my SEO?
- Blocking AI training crawlers does not affect your Google Search rankings because Googlebot and Google-Extended are separate crawlers. Block Google-Extended to opt out of Gemini/Google AI training while keeping Googlebot enabled for search. Blocking Perplexity may reduce your site's visibility in Perplexity AI search results. Whether Perplexity traffic is valuable to you depends on your audience — developer and researcher audiences tend to use Perplexity more.
- Is blocking AI crawlers legally enforceable?
- The legal status of robots.txt as an enforceable mechanism is unsettled. robots.txt is not a contract. Courts have generally held that scraping public content is legal absent a contractual prohibition (such as a Terms of Service that users have agreed to) or computer fraud statutes. However, courts have also found that ignoring robots.txt after receiving a specific cease-and-desist communication can strengthen a legal claim. Block crawlers in robots.txt and update your Terms of Service to explicitly prohibit AI training data collection.