Technical SEO
Chapter 06 / 09
robots.txt
What robots.txt actually controls (it doesn't prevent indexing), the syntax that matters, and the AI-engine bot rules every site needs in 2026 — ChatGPT, Gemini, Perplexity, Claude.

robots.txt is a small text file with an outsized capacity to ruin a site. It controls crawl, not indexing — a distinction that catches more teams than any other technical-SEO concept. Used right, it manages crawl budget on large sites, keeps crawlers from wasting requests on low-value paths, and tells AI-engine bots whether they’re welcome. Used wrong, it silently kills traffic.
“robots.txt is a hint to well-behaved crawlers, not an access-control system. If a URL must be private, authentication is the answer. If it must not appear in Google’s index, noindex is the answer. robots.txt is for crawl management — that’s the only job it does.”
What robots.txt does — and doesn’t
| What it DOES | What it DOES NOT |
|---|---|
| Tell well-behaved crawlers which paths they can fetch | Prevent indexing — URLs blocked from crawl can still appear in SERPs |
| Manage crawl budget by excluding low-value paths | Hide content from the public — /robots.txt is publicly readable |
| Reference the XML sitemap location | Enforce access control — bad bots ignore it |
| Differentiate behavior per user-agent (Googlebot, Bingbot, GPTBot) | Block already-indexed URLs from search — needs noindex + recrawl |
The syntax that matters
A minimal robots.txt:
```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
That’s the “everything is open, here’s where my sitemap lives” configuration. Every directive in detail:
- User-agent: — which crawler the rules apply to. * matches all bots; Googlebot matches only Googlebot; any specific bot can be named.
- Allow: and Disallow: — paths to allow or block, relative to the domain root. Wildcards (*) and the end-of-string anchor ($) are supported.
- Sitemap: — full absolute URL of the XML sitemap. Multiple Sitemap lines are allowed, e.g. for sitemap indexes split by content type.
- Crawl-delay: — Google ignores it; Bing and Yandex respect it. Manage Google’s crawl rate via Search Console instead.
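The end-of-string anchor is the piece most people haven’t seen in the wild. A quick illustration with made-up paths (not a recommendation for any particular site):

```
User-agent: *
# Wildcard: block any URL carrying a tracking parameter, wherever it appears
Disallow: /*?ref=
# End anchor: block PDF files only; without $ this would also match /whitepaper.pdf.html
Disallow: /*.pdf$
```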
Common patterns
Block /admin and internal search results
```
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /*?sort=
```
Allow everything
```
User-agent: *
Allow: /
```
Block everything (catastrophic if pushed accidentally)
```
User-agent: *
Disallow: /
```
This is the rule that kills sites when a staging robots.txt gets promoted to production. Add a deploy-time check that flags Disallow: / under User-agent: * in production.
Different rules per bot
```
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
```
The 2026 AI-engine bot list
A new surface for technical SEO in 2026: AI engines run their own crawlers, and each deserves an explicit decision about whether it is welcome. The default for most sites is to allow them all (they drive citation traffic from AI answers) and block only with a specific reason. An example policy follows the table.
| Bot | Engine / use case | Block to opt out of... |
|---|---|---|
| GPTBot | OpenAI — training and ChatGPT browse | ChatGPT training data + ChatGPT-with-search citations |
| ChatGPT-User | OpenAI — real-time browsing in ChatGPT | Live ChatGPT citations only (training unaffected) |
| Google-Extended | Google — Gemini training (not Search) | Gemini training without affecting Google Search rankings |
| ClaudeBot | Anthropic — training and Claude.ai citations | Claude training data + Claude citation surface |
| PerplexityBot | Perplexity — real-time browsing for answers | Perplexity citation surface |
| Applebot-Extended | Apple Intelligence — training | Apple Intelligence training data |
| CCBot | Common Crawl — used by many model trainers | Most public model training data |
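If a site does want a middle ground, one pattern suggested by the table is to opt out of the training-only crawlers while leaving the citation-driving bots open. Treat this as a sketch: confirm current bot names against each vendor’s documentation, and remember the split is a business decision, not a technical one.

```
# Opt out of training-only crawlers
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everything else, including citation-driving bots, stays open
User-agent: *
Allow: /
```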
The catastrophic mistakes
- Pushing staging’s Disallow: / to production. The single most common way to kill a site’s traffic. Add a deploy-time guard that fails the build if the production robots.txt blocks the root path.
- Blocking JS / CSS files Google needs to render the page. Modern sites are JS-rendered; if Googlebot can’t fetch the JS, it can’t see the rendered content. Allow /_next/static/, your CSS bundle paths, your JS bundles (a carve-out example follows this list). Audit via Search Console > URL Inspection > View tested page.
- Using robots.txt to “hide” sensitive URLs. robots.txt is publicly readable at /robots.txt — anyone curious about what you’re hiding can see the list. Use authentication for sensitive content.
- Blocking pages you want removed from the index. Doesn’t work — Google can still index URLs it can’t crawl. Use a noindex meta tag instead, and make sure crawl is allowed so Google can read the tag.
- Forgetting to reference the sitemap. Add a Sitemap: line at the bottom; it lets crawlers other than Googlebot find the sitemap without manual submission.
- One robots.txt per subdomain forgotten. m.example.com, blog.example.com, and shop.example.com each need their own. Common after subdomain migrations.
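A sketch of that carve-out. The /_next/ paths are the Next.js convention mentioned above; substitute whatever directories your bundler actually emits:

```
User-agent: *
# A broad block on a framework directory...
Disallow: /_next/
# ...with the rendering assets carved back out (for Googlebot, the longest matching rule wins)
Allow: /_next/static/
Disallow: /admin/
```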
The deploy-safety pattern
For sites where a robots.txt regression would cost real money, add this CI check before deploy:
- Fail the build if production robots.txt contains User-agent: * followed by Disallow: / with no other Allow rules.
- Fail the build if production robots.txt is empty, 404s, or 500s.
- Diff the deploy candidate against the previous live version; surface any change to a human reviewer.
- Daily smoke test that fetches /robots.txt from prod and validates it parses cleanly.
Cheap to implement, pays for itself the one time it catches a staging-to-prod accident.
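A minimal sketch of the first two checks in Python, assuming a CI step that can run a script and a hypothetical PROD_ROBOTS_URL environment variable. The parsing is deliberately simplified (real robots.txt grouping has more edge cases), and the diff-against-live step is left to your pipeline:

```python
import os
import sys
import urllib.error
import urllib.request

# Hypothetical env var set in CI, e.g. https://www.example.com/robots.txt
ROBOTS_URL = os.environ.get("PROD_ROBOTS_URL", "https://www.example.com/robots.txt")


def fetch_robots(url: str) -> str:
    """Fetch robots.txt; a network error, non-2xx status, or empty body fails the build."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        sys.exit(f"FAIL: could not fetch {url}: {exc}")
    if not body.strip():
        sys.exit(f"FAIL: {url} is empty")
    return body


def star_group_blocks_root(robots: str) -> bool:
    """True if the 'User-agent: *' group has 'Disallow: /' and no Allow rules."""
    star_rules = []     # (field, value) pairs that apply to User-agent: *
    collecting = False  # currently inside a group that includes *
    saw_rules = False   # rule lines seen since the last User-agent line
    for raw in robots.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if saw_rules:  # a new group starts here
                collecting, saw_rules = False, False
            collecting = collecting or value == "*"
        elif field in ("allow", "disallow"):
            saw_rules = True
            if collecting:
                star_rules.append((field, value))
    blocks_root = ("disallow", "/") in star_rules
    has_allow = any(f == "allow" and v for f, v in star_rules)
    return blocks_root and not has_allow


if __name__ == "__main__":
    body = fetch_robots(ROBOTS_URL)
    if star_group_blocks_root(body):
        sys.exit("FAIL: robots.txt blocks the root path for User-agent: *")
    print("OK: robots.txt fetched, parsed, and does not block the root path")
```

Wire it in as a required check on the deploy branch so a root-path Disallow can’t reach production unnoticed.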
The bottom line
robots.txt is a crawl-management hint, not an access-control system and not an indexing control. Use it to manage crawl budget, exclude low-value paths, and decide which AI bots get access. Use authentication for privacy and noindex for index removal — those are different jobs. Add a deploy guard against Disallow: / in production. The file is a one-line change away from disaster; the safety net pays for itself.
Common questions
What does robots.txt actually do?
It tells well-behaved web crawlers which paths they can fetch. That's it. It does NOT prevent indexing — Google can index a URL it can't crawl if it's linked from elsewhere, just without seeing the content. It does NOT keep content secret — robots.txt is publicly readable at /robots.txt. It does NOT enforce anything — bad bots ignore it. The single most common mistake in technical SEO is using robots.txt to 'hide' a page when you actually want noindex.