Technical SEO
Chapter 06 / 09
robots.txt
What robots.txt actually controls (it doesn't prevent indexing), the syntax that matters, and the AI-engine bot rules every site needs in 2026 — ChatGPT, Gemini, Perplexity, Claude.

robots.txt is a small text file with an outsized capacity to ruin a site. It controls crawl, not indexing — a distinction that catches more teams than any other technical-SEO concept. Used right, it manages crawl budget on large sites, keeps crawlers from wasting requests on low-value paths, and tells AI-engine bots whether they’re welcome. Used wrong, it silently kills traffic.
“robots.txt is a hint to well-behaved crawlers, not an access-control system. If a URL must be private, authentication is the answer. If it must not appear in Google’s index, noindex is the answer. robots.txt is for crawl management — that’s the only job it does.”
What robots.txt does — and doesn’t
| What it DOES | What it DOES NOT |
|---|---|
| Tell well-behaved crawlers which paths they can fetch | Prevent indexing — URLs blocked from crawl can still appear in SERPs |
| Manage crawl budget by excluding low-value paths | Hide content from the public — /robots.txt is publicly readable |
| Reference the XML sitemap location | Enforce access control — bad bots ignore it |
| Differentiate behavior per user-agent (Googlebot, Bingbot, GPTBot) | Block already-indexed URLs from search — needs noindex + recrawl |
The syntax that matters
A minimal robots.txt:
```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```
That’s the “everything is open, here’s where my sitemap lives” configuration. Every directive in detail:
- User-agent: — which crawler the rules apply to. * matches all bots; Googlebot matches only Googlebot; any specific bot can be named.
- Allow: and Disallow: — paths to allow or block, relative to the domain root. Wildcards (*) and the end-of-string anchor ($) are supported.
- Sitemap: — full absolute URL of the XML sitemap. Multiple Sitemap lines are allowed, e.g. for sitemap indexes split by content type.
- Crawl-delay: — Google ignores it; Bing and Yandex respect it. Manage Google’s crawl rate via Search Console instead.
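The end-of-string anchor is the piece most people haven’t seen in the wild. A quick illustration with made-up paths (not a recommendation for any particular site):

```
User-agent: *
# Wildcard: block any URL carrying a tracking parameter, wherever it appears
Disallow: /*?ref=
# End anchor: block PDF files only; without $ this would also match /whitepaper.pdf.html
Disallow: /*.pdf$
```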
Common patterns
Block /admin and internal search results
```
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /*?sort=
```
Allow everything
```
User-agent: *
Allow: /
```
Block everything (catastrophic if pushed accidentally)
```
User-agent: *
Disallow: /
```
This is the rule that kills sites when a staging robots.txt gets promoted to production. Add a deploy-time check that flags Disallow: / under User-agent: * in production.
Different rules per bot
```
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
```
The 2026 AI-engine bot list
A new surface for technical SEO in 2026: AI engines run their own crawlers, and each deserves an explicit decision about whether it is welcome. The default for most sites is to allow them all (they drive citation traffic from AI answers) and block only with a specific reason. An example policy follows the table.
| Bot | Engine / use case | Block to opt out of... |
|---|---|---|
| GPTBot | OpenAI — training and ChatGPT browse | ChatGPT training data + ChatGPT-with-search citations |
| ChatGPT-User | OpenAI — real-time browsing in ChatGPT | Live ChatGPT citations only (training unaffected) |
| Google-Extended | Google — Gemini training (not Search) | Gemini training without affecting Google Search rankings |
| ClaudeBot | Anthropic — training and Claude.ai citations | Claude training data + Claude citation surface |
| PerplexityBot | Perplexity — real-time browsing for answers | Perplexity citation surface |
| Applebot-Extended | Apple Intelligence — training | Apple Intelligence training data |
| CCBot | Common Crawl — used by many model trainers | Most public model training data |
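If a site does want a middle ground, one pattern suggested by the table is to opt out of the training-only crawlers while leaving the citation-driving bots open. Treat this as a sketch: confirm current bot names against each vendor’s documentation, and remember the split is a business decision, not a technical one.

```
# Opt out of training-only crawlers
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everything else, including citation-driving bots, stays open
User-agent: *
Allow: /
```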
The catastrophic mistakes
- Pushing staging’s Disallow: / to production. The single most common way to kill a site’s traffic. Add a deploy-time guard that fails the build if the production robots.txt blocks the root path.
- Blocking JS / CSS files Google needs to render the page. Modern sites are JS-rendered; if Googlebot can’t fetch the JS, it can’t see the rendered content. Allow /_next/static/, your CSS bundle paths, your JS bundles (a carve-out example follows this list). Audit via Search Console > URL Inspection > View tested page.
- Using robots.txt to “hide” sensitive URLs. robots.txt is publicly readable at /robots.txt — anyone curious about what you’re hiding can see the list. Use authentication for sensitive content.
- Blocking pages you want removed from the index. Doesn’t work — Google can still index URLs it can’t crawl. Use a noindex meta tag instead, and make sure crawl is allowed so Google can read the tag.
- Forgetting to reference the sitemap. Add a Sitemap: line at the bottom; it lets crawlers other than Googlebot find the sitemap without manual submission.
- One robots.txt per subdomain forgotten. m.example.com, blog.example.com, and shop.example.com each need their own. Common after subdomain migrations.
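A sketch of that carve-out. The /_next/ paths are the Next.js convention mentioned above; substitute whatever directories your bundler actually emits:

```
User-agent: *
# A broad block on a framework directory...
Disallow: /_next/
# ...with the rendering assets carved back out (for Googlebot, the longest matching rule wins)
Allow: /_next/static/
Disallow: /admin/
```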
The deploy-safety pattern
For sites where a robots.txt regression would cost real money, add this CI check before deploy:
- Fail the build if production robots.txt contains User-agent: * followed by Disallow: / with no other Allow rules.
- Fail the build if production robots.txt is empty, 404s, or 500s.
- Diff the deploy candidate against the previous live version; surface any change to a human reviewer.
- Daily smoke test that fetches /robots.txt from prod and validates it parses cleanly.
Cheap to implement, pays for itself the one time it catches a staging-to-prod accident.
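A minimal sketch of the first two checks in Python, assuming a CI step that can run a script and a hypothetical PROD_ROBOTS_URL environment variable. The parsing is deliberately simplified (real robots.txt grouping has more edge cases), and the diff-against-live step is left to your pipeline:

```python
import os
import sys
import urllib.error
import urllib.request

# Hypothetical env var set in CI, e.g. https://www.example.com/robots.txt
ROBOTS_URL = os.environ.get("PROD_ROBOTS_URL", "https://www.example.com/robots.txt")


def fetch_robots(url: str) -> str:
    """Fetch robots.txt; a network error, non-2xx status, or empty body fails the build."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        sys.exit(f"FAIL: could not fetch {url}: {exc}")
    if not body.strip():
        sys.exit(f"FAIL: {url} is empty")
    return body


def star_group_blocks_root(robots: str) -> bool:
    """True if the 'User-agent: *' group has 'Disallow: /' and no Allow rules."""
    star_rules = []     # (field, value) pairs that apply to User-agent: *
    collecting = False  # currently inside a group that includes *
    saw_rules = False   # rule lines seen since the last User-agent line
    for raw in robots.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if saw_rules:  # a new group starts here
                collecting, saw_rules = False, False
            collecting = collecting or value == "*"
        elif field in ("allow", "disallow"):
            saw_rules = True
            if collecting:
                star_rules.append((field, value))
    blocks_root = ("disallow", "/") in star_rules
    has_allow = any(f == "allow" and v for f, v in star_rules)
    return blocks_root and not has_allow


if __name__ == "__main__":
    body = fetch_robots(ROBOTS_URL)
    if star_group_blocks_root(body):
        sys.exit("FAIL: robots.txt blocks the root path for User-agent: *")
    print("OK: robots.txt fetched, parsed, and does not block the root path")
```

Wire it in as a required check on the deploy branch so a root-path Disallow can’t reach production unnoticed.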
The bottom line
robots.txt is a crawl-management hint, not an access-control system and not an indexing control. Use it to manage crawl budget, exclude low-value paths, and decide which AI bots get access. Use authentication for privacy and noindex for index removal — those are different jobs. Add a deploy guard against Disallow: / in production. The file is a one-line change away from disaster; the safety net pays for itself.
Common questions
What does robots.txt actually do?
It tells well-behaved web crawlers which paths they can fetch. That's it. It does NOT prevent indexing — Google can index a URL it can't crawl if it's linked from elsewhere, just without seeing the content. It does NOT keep content secret — robots.txt is publicly readable at /robots.txt. It does NOT enforce anything — bad bots ignore it. The single most common mistake in technical SEO is using robots.txt to 'hide' a page when you actually want noindex.