
Technical SEO

Chapter 05 / 09

XML sitemaps

What sitemaps actually do (and don't), what to put in them, what to leave out, and the structure that scales from a 50-page site to a 5-million-URL marketplace.

8 min read · Published May 4, 2026

XML sitemaps are simple in concept and consistently misunderstood in practice. They’re a hint to search engines about which URLs matter — nothing more, nothing less. They don’t boost rankings, don’t guarantee indexing, and don’t solve content quality problems. What they do is give search engines a curated list of canonical URLs you want crawled, and give you a clean Search Console coverage report you can actually act on.

A sitemap is not a request that Google index your URLs. It’s a hint that you think these URLs are important. Quality, duplication, and crawl budget still decide whether the index accepts them.

What sitemaps do — and what they don’t

What sitemaps DO:

  • Help search engines discover URLs they might otherwise miss.
  • Speed up crawl of new URLs added to the sitemap.
  • Surface coverage data in Search Console for systematic auditing.
  • Allow last-modified hints (lastmod) for content-refresh detection.
  • Scale via sitemap-index files for large catalogs.

What sitemaps DON'T do:

  • Force indexing — quality and duplication checks still apply.
  • Boost rankings — they're a discovery hint, not a ranking signal.
  • Replace internal linking — orphan pages still rank weakly even if in the sitemap.
  • Override noindex tags or robots.txt blocks.
  • Excuse poor architecture — they're a complement to internal linking, not a substitute.

Sitemap structure — the basics

A minimal sitemap is an XML file listing each URL inside a urlset element, with optional metadata per URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-05-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

In practice:

  • loc — required. The full canonical URL.
  • lastmod — strongly recommended. The date the content meaningfully changed. Google uses this for refresh detection. Don’t lie about it; updating lastmod on every deploy without changing content trains Google to ignore it.
  • changefreq — Google has publicly said this is mostly ignored; safe to omit.
  • priority — also mostly ignored; safe to omit.

Modern sitemaps usually include just loc and lastmod. Anything more is theatre.
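
The loc-plus-lastmod shape can be generated mechanically. A minimal sketch using Python's standard library, where the pages input and the build_sitemap name are illustrative, not part of any real framework:

```python
# Sketch: emit a minimal sitemap with only loc and lastmod,
# assuming `pages` is a list of (url, last_modified_date) tuples.
import xml.etree.ElementTree as ET

def build_sitemap(pages):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # Only set lastmod to the date the content meaningfully changed,
        # not the last deploy date.
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([("https://www.example.com/page/", "2026-05-04")])
```

Deliberately no changefreq or priority — per the list above, both are safe to omit.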

What to include — and what to exclude

The mental model: the sitemap is the curated list of every URL you want indexed and would be proud to rank. If you wouldn’t want a URL ranking on Google, it doesn’t belong in your sitemap.

Include

  • Canonical homepage and all canonical landing pages.
  • Cluster pages, hub pages, category pages.
  • All article / blog post / academy URLs.
  • Product detail pages (canonical only).
  • Service pages, location pages, comparison pages.
  • Any user-generated content you’ve decided to make crawlable (review pages, profile pages with substantive content).

Exclude

  • Pages with noindex meta tag.
  • URLs blocked by robots.txt.
  • Non-canonical duplicates — only the canonical belongs.
  • Paginated archive pages (/page/2/, /page/3/, etc.).
  • Parameter variants (sort, filter, tracking) — only the clean canonical URL.
  • Internal search result pages.
  • Login, signup, thank-you, confirmation, account pages.
  • Redirect URLs — point at the destination, not the redirect.
  • Soft-404 candidates and pages returning 4xx/5xx.
  • Print versions, AMP pages (if separate URLs), m. mobile subdomain pages.
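
The include/exclude rules above condense into a single predicate. A sketch, assuming each crawled page is a dict with url, status, noindex, and canonical fields — all field names and the parameter list are illustrative:

```python
# Sketch: decide whether a URL belongs in the sitemap,
# applying the exclude rules from the list above.
from urllib.parse import urlparse, parse_qs

# Illustrative parameter names; a real list depends on the site.
PARAM_VARIANTS = {"sort", "filter", "utm_source", "utm_medium"}

def belongs_in_sitemap(page):
    url = page["url"]
    parsed = urlparse(url)
    if page["status"] != 200:                # redirects, 4xx/5xx: out
        return False
    if page.get("noindex"):                  # noindex pages: out
        return False
    if page.get("canonical", url) != url:    # non-canonical duplicates: out
        return False
    if parse_qs(parsed.query).keys() & PARAM_VARIANTS:
        return False                         # sort/filter/tracking variants: out
    last = parsed.path.rstrip("/").split("/")[-1]
    if "/page/" in parsed.path and last.isdigit():
        return False                         # paginated archives (/page/2/): out
    return True
```

Soft-404 candidates and internal search pages need site-specific rules on top of this; the point is that the sitemap builder filters, rather than dumping every known URL.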

Sitemap-index files for larger sites

Google’s limits per sitemap file:

  • Maximum 50,000 URLs per sitemap.
  • Maximum 50 MB uncompressed file size.

Above either limit, split into multiple sitemap files referenced from a sitemap-index. The index pattern most teams use:

  • /sitemap.xml — the index, references all sub-sitemaps
  • /sitemap-articles.xml — all academy / blog articles
  • /sitemap-products.xml — all canonical product detail pages
  • /sitemap-categories.xml — all category and hub pages
  • /sitemap-locations.xml — for multi-location businesses

Splitting by content type makes the Search Console coverage report directly actionable — you can see at a glance whether the issue is in articles, products, or categories. A single monolithic sitemap forces you to filter manually.
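
Within one content type, the split at the 50,000-URL cap is mechanical. A sketch with illustrative file names — chunk the URL list and derive the index entries from the chunk count:

```python
# Sketch: plan sub-sitemaps under Google's 50,000-URL-per-file cap
# and the index entries that reference them.
MAX_URLS = 50_000

def plan_sitemaps(urls, base="https://www.example.com"):
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    index = [f"{base}/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
    return index, chunks

urls = [f"https://www.example.com/p/{i}" for i in range(120_000)]
index, chunks = plan_sitemaps(urls)
# 120,000 URLs -> three sub-sitemaps of 50,000 / 50,000 / 20,000
```

The 50 MB uncompressed limit would need a second check on serialized size, omitted here for brevity.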

Submission and validation

  • 1. Reference the sitemap in robots.txt — add Sitemap: https://www.example.com/sitemap.xml at the bottom. Tells crawlers where to look.
  • 2. Submit via Search Console — Sitemaps section. Google reads it and reports on coverage.
  • 3. Validate the format — Search Console flags syntax errors; online validators (xml-sitemaps.com) and the protocol reference at sitemaps.org/protocol.html help check well-formedness.
  • 4. Monitor Search Console > Sitemaps — see submitted vs indexed counts; investigate gaps.
  • 5. Resubmit on major content changes — Google polls the sitemap automatically; manual resubmission speeds up discovery for urgent changes.
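
Part of step 3 can run locally before submission. A sketch of a basic pre-flight check, assuming the sitemap bytes are already fetched; it tests only well-formedness and the two hard limits, not whether the listed URLs are indexable:

```python
# Sketch: local sanity check on a sitemap file before submitting it.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_bytes):
    if len(xml_bytes) > 50 * 1024 * 1024:     # 50 MB uncompressed cap
        return "too large (over 50 MB uncompressed)"
    try:
        root = ET.fromstring(xml_bytes)        # well-formedness check
    except ET.ParseError as exc:
        return f"not well-formed: {exc}"
    if len(root.findall("sm:url", NS)) > 50_000:
        return "too many URLs (over 50,000)"
    return "ok"

sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          b'<url><loc>https://www.example.com/page/</loc></url></urlset>')
result = check_sitemap(sample)
```

This complements Search Console's reporting rather than replacing it — only Google's own report shows submitted-vs-indexed gaps.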

Common sitemap mistakes

  • Including non-canonical duplicate URLs. The sitemap should be the canonical list; duplicates dilute the signal.
  • Including noindex URLs. Confuses Google — you’re saying both “index this” and “don’t index this”.
  • Stale lastmod values. Either updating on every deploy (which trains Google to ignore the field) or never updating at all (so Google gets no signal that content was refreshed).
  • Sitemap returning 4xx/5xx. Search Console flags it; the bot can’t read it; nothing gets discovered.
  • Sitemap not referenced in robots.txt. Discovery still works via Search Console, but other crawlers (Bing, AI engine bots) may miss it.
  • Forgetting to update the sitemap when content changes. Especially common on hand-written sitemaps; auto-generation via the CMS / framework solves it.

The bottom line

An XML sitemap is a curated discovery hint. It tells search engines which URLs you consider canonical and worth crawling, and it gives you a Search Console coverage report you can audit systematically. It doesn’t guarantee indexing, doesn’t boost rankings, and doesn’t replace internal linking. Build it from canonical, indexable, valuable URLs only; split by content type once you exceed the limits; reference it from robots.txt and submit it via Search Console. Then watch the coverage report — it’s where most genuine crawl issues surface first.

Common questions



What does an XML sitemap actually do?

It tells search engines which URLs you consider important enough to crawl. It's a discovery hint — not a ranking signal, not a guarantee of indexing. URLs in your sitemap can still be ignored by Google if they fail quality, duplication, or technical checks. URLs not in your sitemap can still be discovered and indexed via internal links. Treat the sitemap as one of three discovery channels alongside internal linking and external links — important, but not the whole story.