Thursday, April 23, 2026


Most Major News Publishers Block AI Training & Retrieval Bots

Introduction to AI Training Bots and News Publishers

Most top news publishers block AI training bots via robots.txt, but they’re also blocking the retrieval bots that determine whether sites appear in AI-generated answers. A study by BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found that 79% block at least one training bot. More notably, 71% also block at least one retrieval or live search bot.

What The Data Shows

The study examined the top 50 news sites in each market by SimilarWeb traffic share, then deduplicated the list. Bots were grouped into three categories: training, retrieval/live search, and indexing. Among training bots, Common Crawl’s CCBot was the most frequently blocked at 75%, followed by anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%. Google-Extended, the token that opts sites out of Gemini training, was the least blocked training bot at 46% overall.

Training Bot Blocks

US publishers blocked Google-Extended at 58%, nearly double the 29% rate among UK publishers. According to Harry Clarkson-Bennett, SEO Director at The Telegraph, “Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.”
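The blocks described above are plain robots.txt directives. The following illustrative file (not taken from any specific publisher) shows how a site would disallow the four training bots named in the study while leaving all other crawlers unaffected:

```
# Illustrative robots.txt: block the training bots named in the study.
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

# All other crawlers remain unrestricted.
User-agent: *
Allow: /
```

Each `User-agent` group applies only to crawlers whose user-agent string matches that token, which is why a site can block training bots without touching Googlebot or Bingbot.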


Retrieval Bot Blocks

The study found 71% of sites block at least one retrieval or live search bot. Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot, which powers ChatGPT’s live search, was blocked by 49%. ChatGPT-User was blocked by 40%. Perplexity-User, which handles user-initiated retrieval requests, was the least blocked at 17%.

Indexing Blocks

PerplexityBot, which Perplexity uses to index pages for its search corpus, was blocked by 67% of sites. Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none.

The Enforcement Gap

The study acknowledges that robots.txt is a directive, not a barrier, and bots can ignore it. We covered this enforcement gap when Google’s Gary Illyes confirmed robots.txt can’t prevent unauthorized access. It functions more like a “please keep out” sign than a locked door. Clarkson-Bennett raised the same point in BuzzStream’s report: “The robots.txt file is a directive. It’s like a sign that says please keep out, but doesn’t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.”
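The "please keep out" framing can be made concrete: a compliant crawler parses robots.txt and checks permission before each fetch, but nothing in the protocol stops a non-compliant one. This minimal Python sketch, using the standard library's `urllib.robotparser` against a hypothetical robots.txt, shows what the honor-system check looks like from the crawler's side:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site that blocks training bots
# but leaves retrieval/live-search bots alone.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler calls can_fetch() before requesting a page;
# a disobedient one simply skips this step and fetches anyway.
print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/article"))  # True
```

The `can_fetch()` result is advisory: the server never sees it, which is exactly the enforcement gap Illyes and Clarkson-Bennett describe.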

Why This Matters

The retrieval-blocking numbers warrant attention here. In addition to opting out of AI training, many publishers are opting out of the citation and discovery layer that AI search tools use to surface sources. OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Blocking one doesn’t block the other. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval.
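Because the crawlers are separated by function, a publisher can express that distinction directly in robots.txt. This illustrative fragment opts out of OpenAI training while staying eligible for ChatGPT live-search citations:

```
# Illustrative: block training data collection, allow live search.
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```

A site that blanket-blocks every agent containing "GPT" or "OpenAI" loses the retrieval layer along with the training opt-out, which is the trade-off the study's 49% OAI-SearchBot block rate suggests many publishers have made, deliberately or not.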

Looking Ahead

Because robots.txt is only advisory, sites that want to block AI crawlers may find server- or CDN-level restrictions more effective than robots.txt alone. Cloudflare’s Year in Review found GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks rather than full blocks for Googlebot and Bingbot, reflecting the dual role Google’s crawler plays in search indexing and AI training.
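A server-level block, unlike robots.txt, is actually enforced. This nginx sketch (an assumption for illustration, not a configuration from the article or from Cloudflare) returns 403 to selected AI crawlers based on their User-Agent header:

```nginx
# Sketch: enforce the block at the server rather than requesting it
# via robots.txt. Bot names are examples; note that User-Agent strings
# can be spoofed, so this is stronger than robots.txt but not airtight.
map $http_user_agent $blocked_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*CCBot      1;
}

server {
    listen 80;
    server_name example.com;

    if ($blocked_ai_bot) {
        return 403;
    }

    # ... normal site configuration continues here ...
}
```

CDN rules such as Cloudflare’s bot-management features work on the same principle, with the added ability to verify crawlers by IP range rather than trusting the User-Agent header.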

Conclusion

For those tracking AI visibility, the retrieval bot category is the one to watch. Training blocks affect future models, while retrieval blocks affect whether content shows up in AI answers right now. A publisher that blocks OAI-SearchBot or Perplexity-User is opting out of current AI search citations, not just future training runs. Understanding which bot performs which function is the prerequisite for a robots.txt policy that protects content without also sacrificing discovery.
