06

Technical SEO

Chapter 06 / 09

robots.txt

What robots.txt actually controls (it doesn't prevent indexing), the syntax that matters, and the AI-engine bot rules every site needs in 2026 — ChatGPT, Gemini, Perplexity, Claude.

8 min read · Published May 4, 2026

robots.txt is a small text file with an outsized capacity to ruin a site. It controls crawl, not indexing — a distinction that catches more teams than any other technical-SEO concept. Used right, it manages crawl budget on large sites, keeps low-value paths from wasting that budget, and tells AI engine bots whether they’re welcome. Used wrong, it silently kills traffic.

robots.txt is a hint to well-behaved crawlers, not an access-control system. If a URL must be private, authentication is the answer. If it must not appear in Google’s index, noindex is the answer. robots.txt is for crawl management — that’s the only job it does.

What robots.txt does — and doesn’t

What it DOES:

  • Tell well-behaved crawlers which paths they can fetch
  • Manage crawl budget by excluding low-value paths
  • Reference the XML sitemap location
  • Differentiate behavior per user-agent (Googlebot, Bingbot, GPTBot)

What it DOES NOT:

  • Prevent indexing — URLs blocked from crawl can still appear in SERPs
  • Hide content from the public — /robots.txt is publicly readable
  • Enforce access control — bad bots ignore it
  • Block already-indexed URLs from search — that needs noindex + recrawl
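
Several of the "does not" rows come back to the same fix: noindex. In its two standard forms (a page-level meta tag, or an HTTP response header for non-HTML files) it looks like this:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Either way the URL must stay crawlable; if robots.txt blocks it, Google never sees the directive.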

The syntax that matters

A minimal robots.txt:

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

That’s the “everything is open, here’s where my sitemap lives” configuration. Every directive in detail:

  • User-agent: — which crawler the rules apply to. * matches all bots; Googlebot matches only Googlebot; specific bots can be named.
  • Allow: and Disallow: — paths to allow or block. Paths are relative to the domain root. Wildcards (*) and end-of-string anchors ($) are supported.
  • Sitemap: — full absolute URL of the XML sitemap. Multiple Sitemap: lines are allowed, for example one per sitemap index split by content type.
  • Crawl-delay: — Google ignores it (Googlebot manages its own crawl rate); Bing and Yandex respect it.
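
For instance, a hypothetical rule set combining both operators (the paths are illustrative only):

User-agent: *
# Block any URL that ends in .pdf
Disallow: /*.pdf$
# Block any URL containing a sessionid query parameter
Disallow: /*?sessionid=

Without the $ anchor, /*.pdf would also match /whitepaper.pdf?download=1; with it, only URLs that end exactly in .pdf are blocked.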

Common patterns

Block /admin and internal search results

User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /*?sort=

Allow everything

User-agent: *
Allow: /

Block everything (catastrophic if pushed accidentally)

User-agent: *
Disallow: /

This is the rule that kills sites when a staging robots.txt gets promoted to production. Add a deploy-time check that flags Disallow: / against User-agent: * in production.

Different rules per bot

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /

The 2026 AI-engine bot list

New surface for technical SEO in 2026: AI engines have their own crawlers. Decide explicitly whether each is welcome. Default for most sites: allow them all (drives citation traffic from AI answers); block only with a specific reason.

  • GPTBot: OpenAI — training and ChatGPT browse. Block to opt out of ChatGPT training data + ChatGPT-with-search citations.
  • ChatGPT-User: OpenAI — real-time browsing in ChatGPT. Block to opt out of live ChatGPT citations only (training unaffected).
  • Google-Extended: Google — Gemini training (not Search). Block to opt out of Gemini training without affecting Google Search rankings.
  • ClaudeBot: Anthropic — training and Claude.ai citations. Block to opt out of Claude training data + the Claude citation surface.
  • PerplexityBot: Perplexity — real-time browsing for answers. Block to opt out of the Perplexity citation surface.
  • Applebot-Extended: Apple Intelligence — training. Block to opt out of Apple Intelligence training data.
  • CCBot: Common Crawl — used by many model trainers. Block to opt out of most public model training data.
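
As a sketch, a site that wants to keep search rankings and AI citation traffic but opt out of model training could publish something like this (which bots to block is a policy call, not a recommendation):

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

GPTBot and ClaudeBot stay open here because blocking them would also drop the ChatGPT and Claude citation surfaces.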

The catastrophic mistakes

  • Pushing staging’s Disallow: / to production. The single most common way to kill a site’s traffic. Add a deploy-time guard that fails the build if the production robots.txt blocks the root path.
  • Blocking JS / CSS files Google needs to render the page. Modern sites are JS-rendered; if Googlebot can’t fetch the JS, it can’t see the rendered content. Allow /_next/static/, your CSS bundle paths, and your JS bundles (see the sketch after this list). Audit via Search Console > URL Inspection > View tested page.
  • Using robots.txt to “hide” sensitive URLs. robots.txt is publicly readable at /robots.txt — anyone curious about what you’re hiding can see the list. Use authentication for sensitive content.
  • Blocking pages you want removed from the index. Doesn’t work — Google can still index URLs it can’t crawl. Use noindex meta tag instead, and make sure crawl is allowed so Google can read the tag.
  • Forgetting to reference the sitemap. Add Sitemap: line at the bottom; lets crawlers other than Googlebot find it without manual submission.
  • Forgetting that each subdomain needs its own robots.txt. m.example.com, blog.example.com, and shop.example.com each serve their own file. A common gap after subdomain migrations.
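
A sketch of the render-safety point above, assuming a Next.js-style build where framework internals live under /_next/ (the paths are illustrative; substitute your own asset locations). For Googlebot, the longer, more specific Allow rule wins over the broader Disallow:

User-agent: *
# Block the framework's internal endpoints...
Disallow: /_next/
# ...but keep the static JS/CSS bundles crawlable so Googlebot can render pages
Allow: /_next/static/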

The deploy-safety pattern

For sites where a robots.txt regression would cost real money, add this CI check before deploy:

  • Fail the build if the production robots.txt contains a User-agent: * group with Disallow: / and no Allow rules.
  • Fail the build if the production robots.txt is empty, returns a 404, or returns a 5xx error.
  • Diff the deploy candidate against the previous live version; surface any change to a human reviewer.
  • Daily smoke test that fetches /robots.txt from prod and validates it parses cleanly.

Cheap to implement, pays for itself the one time it catches a staging-to-prod accident.
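
A minimal sketch of the first two checks in Python. The production URL is a placeholder and the parsing is deliberately simplified; a real guard would use a full robots.txt parser:

#!/usr/bin/env python3
"""Deploy-time guard: fail the build if production robots.txt blocks the whole site.

Sketch only. PROD_ROBOTS_URL is a placeholder; run this as a CI step that treats
a non-zero exit code as a failed build.
"""
import sys
import urllib.error
import urllib.request

PROD_ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder


def fetch(url):
    """Fetch robots.txt; any HTTP error fails the build."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        sys.exit(f"FAIL: could not fetch {url}: {exc}")


def blocks_everything(robots):
    """True if a 'User-agent: *' group has 'Disallow: /' and no Allow rule."""
    agents = []             # user-agents of the group currently being read
    prev_was_agent = False  # consecutive User-agent lines belong to one group
    star_disallow_root = False
    star_has_allow = False
    for raw in robots.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            prev_was_agent = False
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents = agents + [value] if prev_was_agent else [value]
            prev_was_agent = True
        else:
            prev_was_agent = False
            if "*" in agents:
                if field == "disallow" and value == "/":
                    star_disallow_root = True
                elif field == "allow" and value:
                    star_has_allow = True
    return star_disallow_root and not star_has_allow


robots = fetch(PROD_ROBOTS_URL)
if not robots.strip():
    sys.exit("FAIL: production robots.txt is empty")
if blocks_everything(robots):
    sys.exit("FAIL: robots.txt blocks the root path for all user-agents")
print("OK: robots.txt passed the deploy guard")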

The bottom line

robots.txt is a crawl-management hint, not an access-control system and not an indexing control. Use it to manage crawl budget, exclude low-value paths, and decide which AI bots get access. Use authentication for privacy and noindex for index removal — those are different jobs. Add a deploy guard against Disallow: / in production. The file is a one-line change away from disaster; the safety net pays for itself.

Common questions

Quick answers to what we get asked before every trial signup.

What does robots.txt actually do?

It tells well-behaved web crawlers which paths they can fetch. That's it. It does NOT prevent indexing — Google can index a URL it can't crawl if it's linked from elsewhere, just without seeing the content. It does NOT keep content secret — robots.txt is publicly readable at /robots.txt. It does NOT enforce anything — bad bots ignore it. The single most common mistake in technical SEO is using robots.txt to 'hide' a page when you actually want noindex.