Technical SEO

Chapter 06 / 09

robots.txt

What robots.txt actually controls (it doesn't prevent indexing), the syntax that matters, and the AI-engine bot rules every site needs in 2026 — ChatGPT, Gemini, Perplexity, Claude.

8 min read · Published May 4, 2026

robots.txt is a small text file with an outsized capacity to ruin a site. It controls crawling, not indexing, a distinction that trips up more teams than any other technical-SEO concept. Used right, it manages crawl budget on large sites, keeps crawlers from wasting requests on low-value paths, and tells AI engine bots whether they’re welcome. Used wrong, it silently kills traffic.

robots.txt is a hint to well-behaved crawlers, not an access-control system. If a URL must be private, authentication is the answer. If it must not appear in Google’s index, noindex is the answer. robots.txt is for crawl management — that’s the only job it does.

What robots.txt does — and doesn’t

What it DOES:

  • Tells well-behaved crawlers which paths they can fetch
  • Manages crawl budget by excluding low-value paths
  • References the XML sitemap location
  • Differentiates behavior per user-agent (Googlebot, Bingbot, GPTBot)

What it DOES NOT do:

  • Prevent indexing: URLs blocked from crawl can still appear in SERPs
  • Hide content from the public: /robots.txt is publicly readable
  • Enforce access control: bad bots ignore it
  • Remove already-indexed URLs from search: that needs noindex plus a recrawl

The syntax that matters

A minimal robots.txt:

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

That’s the “everything is open, here’s where my sitemap lives” configuration. Every directive in detail:

  • User-agent: which crawler the rules that follow apply to. * matches all bots; a named token such as Googlebot applies only to that crawler. A crawler that has its own named group follows that group and ignores the * group.
  • Allow: / Disallow: — paths to allow or block. Paths are relative to the domain root. Wildcards (*) and end-of-string ($) are supported.
  • Sitemap: the full absolute URL of the XML sitemap. Multiple Sitemap lines are allowed, for example one per sitemap index or content type.
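  • Crawl-delay: Google ignores it; Bing respects it, and support among other crawlers varies. Google retired the Search Console crawl-rate limiter in early 2024, so to slow Googlebot down temporarily, return 503 or 429 responses.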
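
To see how these directives read from a crawler’s side, here is a minimal sketch using Python’s standard-library urllib.robotparser. The file contents and the www.example.com URLs are illustrative assumptions, and this parser only does simple prefix matching (it does not understand the * and $ wildcards), so treat it as a rough check rather than a model of how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: one group for all bots plus a sitemap reference.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://www.example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Disallow: /admin/ blocks everything under that prefix.
print(rules.can_fetch("Googlebot", "https://www.example.com/admin/users"))  # False
# No rule matches /blog/, so it is crawlable by default.
print(rules.can_fetch("Googlebot", "https://www.example.com/blog/"))  # True
# Crawl-delay is reported as parsed; whether a bot honors it is up to the bot.
print(rules.crawl_delay("bingbot"))  # 5
# Sitemap lines are collected separately from the user-agent groups.
print(rules.site_maps())  # ['https://www.example.com/sitemap.xml']
```

Note that can_fetch only answers "may this bot request this URL". It says nothing about whether the URL ends up in a search index.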

Common patterns

Block /admin and internal search results

User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /*?sort=
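
Wildcard rules like /*?sort= are easy to misread, so it can help to test them against real paths. The sketch below is a hand-rolled matcher, not any crawler’s actual implementation: it assumes the common semantics where * matches any run of characters, a trailing $ anchors the end, and everything else is a literal prefix. It checks Disallow patterns only and ignores Allow precedence. The function names and sample paths are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches the path (Allow rules ignored)."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


RULES = ["/admin/", "/search?", "/*?sort="]

for path in [
    "/admin/users",                 # blocked by /admin/
    "/search?q=shoes",              # blocked by /search?
    "/search",                      # not blocked: no "?" in the path
    "/shoes?sort=price",            # blocked by /*?sort=
    "/shoes?color=red&sort=price",  # not blocked: sort is not the first parameter
    "/blog/",                       # not blocked
]:
    print(path, is_blocked(path, RULES))
```

The fifth path shows a common gap: /*?sort= only catches sort when it is the first query parameter. Catching it in any position would need a second rule such as /*&sort=.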

Allow everything

User-agent: *
Allow: /

Block everything (catastrophic if pushed accidentally)

User-agent: *
Disallow: /

This is the rule that kills sites when a staging robots.txt gets promoted to production. Add a deploy-time check that flags Disallow: / against User-agent: * in production.

Different rules per bot

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
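
A quick way to confirm that per-bot groups behave as intended is to feed the file above to Python’s urllib.robotparser and ask it about each user-agent. This is a sketch under assumptions: the www.example.com host and the sample paths are made up, and the standard-library parser picks a group by simple substring match on the bot name, which may differ from how a given crawler selects its group.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask whether the named bot may fetch the path on a hypothetical host."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


# GPTBot uses its own group, so /premium/ is off limits to it...
print(allowed("GPTBot", "/premium/report"))     # False
# ...but everything else stays open, because nothing in its group blocks it.
print(allowed("GPTBot", "/blog/"))              # True
# Googlebot and bots without a named group can fetch /premium/.
print(allowed("Googlebot", "/premium/report"))  # True
print(allowed("Bingbot", "/premium/report"))    # True
```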

The 2026 AI-engine bot list

New surface for technical SEO in 2026: AI engines have their own crawlers. Decide explicitly whether each is welcome. Default for most sites: allow them all (drives citation traffic from AI answers); block only with a specific reason.

| Bot | Engine / use case | Block to opt out of... |
|---|---|---|
| GPTBot | OpenAI: model training | ChatGPT training data (ChatGPT search citations are handled by a separate crawler, OAI-SearchBot) |
| ChatGPT-User | OpenAI: real-time browsing in ChatGPT | Live ChatGPT citations only (training unaffected) |
| Google-Extended | Google: Gemini training (not Search) | Gemini training without affecting Google Search rankings |
| ClaudeBot | Anthropic: training and Claude.ai citations | Claude training data + Claude citation surface |
| PerplexityBot | Perplexity: real-time browsing for answers | Perplexity citation surface |
| Applebot-Extended | Apple Intelligence: training | Apple Intelligence training data |
| CCBot | Common Crawl: used by many model trainers | Most public model training data |
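
One way to make the per-bot decision explicit is to keep it as data and generate the robots.txt groups from it. The sketch below assumes a simple allow-or-block choice per bot; the policy values shown are placeholders for illustration, not a recommendation, and render_ai_bot_rules is a hypothetical helper name.

```python
# Hypothetical policy: True = welcome, False = blocked site-wide.
AI_BOT_POLICY = {
    "GPTBot": True,
    "ChatGPT-User": True,
    "Google-Extended": True,
    "ClaudeBot": True,
    "PerplexityBot": True,
    "Applebot-Extended": True,
    "CCBot": False,  # example of an explicit opt-out
}


def render_ai_bot_rules(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per bot from an allow/block policy."""
    groups = []
    for bot, is_allowed in policy.items():
        rule = "Allow: /" if is_allowed else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    # Blank lines separate groups; end the file with a newline.
    return "\n\n".join(groups) + "\n"


print(render_ai_bot_rules(AI_BOT_POLICY))
```

Keeping the policy in one reviewed place means a change of stance on any bot shows up as a one-line diff rather than a hand edit to the live file.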

The catastrophic mistakes

  • Pushing staging’s Disallow: / to production. The single most common way to kill a site’s traffic. Add a deploy-time guard that fails the build if the production robots.txt blocks the root path.
  • Blocking JS / CSS files Google needs to render the page. Many modern sites are JS-rendered; if Googlebot can’t fetch the JS, it can’t see the rendered content. Allow your JS and CSS bundle paths (/_next/static/ on a Next.js site, for example). Audit via Search Console > URL Inspection > View tested page.
  • Using robots.txt to “hide” sensitive URLs. robots.txt is publicly readable at /robots.txt — anyone curious about what you’re hiding can see the list. Use authentication for sensitive content.
  • Blocking pages you want removed from the index. It doesn’t work: Google can still index URLs it can’t crawl. Use a noindex meta tag (or an X-Robots-Tag response header) instead, and keep crawling allowed so Google can read it.
  • Forgetting to reference the sitemap. Add a Sitemap: line; it lets crawlers other than Googlebot find the sitemap without manual submission.
  • Forgetting that each subdomain needs its own robots.txt. m.example.com, blog.example.com, and shop.example.com each serve their own file; the one on the main host does not cover them. A common miss after subdomain migrations.
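
The noindex mistake in the list above can be caught mechanically: any page that carries noindex but is also disallowed in robots.txt will never have its tag read. Here is a rough sketch using urllib.robotparser. It assumes you already have a list of paths that carry noindex (the sample paths are invented), and because this parser does not understand wildcards, it only catches conflicts with plain prefix rules.

```python
from urllib.robotparser import RobotFileParser


def find_noindex_conflicts(
    robots_txt: str, noindex_paths: list[str], agent: str = "Googlebot"
) -> list[str]:
    """Return noindex pages the crawler is not allowed to fetch.

    These are conflicts: the crawler can't see a noindex tag on a page
    it is forbidden to request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in noindex_paths if not parser.can_fetch(agent, path)]


ROBOTS_TXT = """\
User-agent: *
Disallow: /old-campaign/
"""

# Hypothetical list of pages that carry a noindex meta tag.
NOINDEX_PAGES = ["/old-campaign/spring", "/thank-you"]

print(find_noindex_conflicts(ROBOTS_TXT, NOINDEX_PAGES))  # ['/old-campaign/spring']
```

The fix for each conflict is to remove the Disallow rule until the page has been recrawled and dropped from the index.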

The deploy-safety pattern

For sites where a robots.txt regression would cost real money, add this CI check before deploy:

  • Fail build if production robots.txt contains User-agent: *\nDisallow: / with no other Allow rules.
  • Fail build if production robots.txt is empty or returns a 404 or 5xx.
  • Diff the deploy candidate against the previous live version; surface any change to a human reviewer.
  • Daily smoke test that fetches /robots.txt from prod and validates it parses cleanly.

Cheap to implement, pays for itself the one time it catches a staging-to-prod accident.
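
The first two checks could look roughly like the sketch below. It is a minimal hand-written parser, not a full implementation of the robots.txt standard: it groups rules by User-agent, then flags a * group that disallows / with no Allow rules. The function names are hypothetical, and fetching the file and wiring the result into your CI system is left out.

```python
def parse_groups(robots_txt: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split robots.txt text into (user_agents, rules) groups."""
    groups = []
    agents: list[str] = []
    rules: list[tuple[str, str]] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # rules already seen, so this line starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow"):
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blocks_everything(robots_txt: str) -> bool:
    """True if the '*' group disallows '/' and has no Allow rules."""
    for agents, rules in parse_groups(robots_txt):
        if "*" not in agents:
            continue
        blocks_root = ("disallow", "/") in rules
        has_allow = any(field == "allow" and value for field, value in rules)
        if blocks_root and not has_allow:
            return True
    return False


def check_production_robots(status_code: int, body: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    if status_code != 200:
        problems.append(f"robots.txt returned HTTP {status_code}")
    elif not body.strip():
        problems.append("robots.txt is empty")
    elif blocks_everything(body):
        problems.append("robots.txt blocks the whole site for User-agent: *")
    return problems


print(check_production_robots(200, "User-agent: *\nDisallow: /\n"))
```

In a pipeline, the build step would fetch the deploy candidate’s /robots.txt, pass the status code and body to check_production_robots, and exit non-zero if the returned list is not empty.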

The bottom line

robots.txt is a crawl-management hint, not an access-control system and not an indexing control. Use it to manage crawl budget, exclude low-value paths, and decide which AI bots get access. Use authentication for privacy and noindex for index removal — those are different jobs. Add a deploy guard against Disallow: / in production. The file is a one-line change away from disaster; the safety net pays for itself.

Common questions

What does robots.txt actually do?

It tells well-behaved web crawlers which paths they can fetch. That's it. It does NOT prevent indexing: Google can index a URL it can't crawl if it's linked from elsewhere, just without seeing the content. It does NOT keep content secret: robots.txt is publicly readable at /robots.txt. It does NOT enforce anything: bad bots ignore it. The single most common mistake in technical SEO is using robots.txt to 'hide' a page when you actually want noindex.
