Technical SEO
Chapter 05 / 09
XML sitemaps
What sitemaps actually do (and don't), what to put in them, what to leave out, and the structure that scales from a 50-page site to a 5-million-URL marketplace.

XML sitemaps are simple in concept and consistently misunderstood in practice. They’re a hint to search engines about which URLs matter — nothing more, nothing less. They don’t boost rankings, don’t guarantee indexing, and don’t solve content quality problems. What they do is give search engines a curated list of canonical URLs you want crawled, and give you a clean Search Console coverage report you can actually act on.
“A sitemap is not a request that Google index your URLs. It’s a hint that you think these URLs are important. Quality, duplication, and crawl budget still decide whether the index accepts them.”
What sitemaps do — and what they don’t
| What sitemaps DO | What sitemaps DON'T do |
|---|---|
| Help search engines discover URLs they might otherwise miss | Force indexing — quality and duplication checks still apply |
| Speed up crawl of new URLs added to the sitemap | Boost rankings — they're a discovery hint, not a ranking signal |
| Surface coverage data in Search Console for systematic auditing | Replace internal linking — orphan pages still rank weakly even if in the sitemap |
| Allow last-modified hints (lastmod) for content-refresh detection | Override noindex tags or robots.txt blocks |
| Scale via sitemap-index files for large catalogs | Excuse poor architecture — they're a complement to internal linking, not a substitute |
Sitemap structure — the basics
A minimal sitemap is an XML file listing each URL with optional metadata. A valid file needs the XML declaration and the `<urlset>` wrapper around the `<url>` entries:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-05-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
In practice:
- `loc` — required. The full canonical URL.
- `lastmod` — strongly recommended. The date the content meaningfully changed. Google uses this for refresh detection. Don't lie about it; updating `lastmod` on every deploy without changing content trains Google to ignore it.
- `changefreq` — Google has publicly said this is mostly ignored; safe to omit.
- `priority` — also mostly ignored; safe to omit.
Modern sitemaps usually include just `loc` and `lastmod`. Anything more is theatre.
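As a minimal sketch of that "just `loc` and `lastmod`" approach, the following generates a sitemap with Python's standard library. The URLs and dates are placeholders, not taken from any real site:

```python
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a sitemap string from (url, last_modified_date) pairs.

    Only <loc> and <lastmod> are emitted; changefreq and priority
    are omitted because Google mostly ignores them.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = last_modified.isoformat()
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(
        urlset, encoding="unicode"
    )

# Placeholder URLs for illustration:
xml = build_sitemap([
    ("https://www.example.com/", date(2026, 5, 4)),
    ("https://www.example.com/blog/technical-seo/", date(2026, 4, 12)),
])
print(xml)
```

Generating the file from your CMS's canonical URL list (rather than hand-writing it) is what keeps `lastmod` honest.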
What to include — and what to exclude
The mental model: the sitemap is the curated list of every URL you want indexed and would be proud to rank. If you wouldn’t want a URL ranking on Google, it doesn’t belong in your sitemap.
Include
- Canonical homepage and all canonical landing pages.
- Cluster pages, hub pages, category pages.
- All article / blog post / academy URLs.
- Product detail pages (canonical only).
- Service pages, location pages, comparison pages.
- Any user-generated content you’ve decided to make crawlable (review pages, profile pages with substantive content).
Exclude
- Pages with a `noindex` meta tag.
- URLs blocked by robots.txt.
- Non-canonical duplicates — only the canonical belongs.
- Paginated archive pages (`/page/2/`, `/page/3/`, etc.).
- Parameter variants (sort, filter, tracking) — only the clean canonical URL.
- Internal search result pages.
- Login, signup, thank-you, confirmation, account pages.
- Redirect URLs — point at the destination, not the redirect.
- Soft-404 candidates and pages returning 4xx/5xx.
- Print versions, AMP pages (if separate URLs), m. mobile subdomain pages.
Sitemap-index files for larger sites
Google’s limits per sitemap file:
- Maximum 50,000 URLs per sitemap.
- Maximum 50 MB uncompressed file size.
Above either limit, split into multiple sitemap files referenced from a sitemap-index. The index pattern most teams use:
- `/sitemap.xml` — the index, references all sub-sitemaps
- `/sitemap-articles.xml` — all academy / blog articles
- `/sitemap-products.xml` — all canonical product detail pages
- `/sitemap-categories.xml` — all category and hub pages
- `/sitemap-locations.xml` — for multi-location businesses
Splitting by content type makes the Search Console coverage report directly actionable — you can see at a glance whether the issue is in articles, products, or categories. A single monolithic sitemap forces you to filter manually.
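A rough sketch of how the splitting works under the 50,000-URL limit, assuming a hypothetical product catalog and a `sitemap-products-N.xml` naming scheme (the names are illustrative, not a standard):

```python
import xml.etree.ElementTree as ET

MAX_URLS_PER_SITEMAP = 50_000  # Google's per-file URL limit

def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into sitemap-sized chunks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(base, n_files):
    """Build a sitemap-index string referencing n_files sub-sitemaps."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    index = ET.Element("sitemapindex", xmlns=ns)
    for i in range(1, n_files + 1):
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = f"{base}/sitemap-products-{i}.xml"
    return ET.tostring(index, encoding="unicode")

# 120,000 placeholder product URLs -> 3 sitemap files + 1 index
urls = [f"https://www.example.com/product/{i}/" for i in range(120_000)]
parts = chunk(urls)
index_xml = build_index("https://www.example.com", len(parts))
print(len(parts))  # 3
```

Each chunk would be written out as its own sitemap file; only the index is submitted to Search Console, and Google follows it to the sub-sitemaps.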
Submission and validation
1. Reference the sitemap in robots.txt — add `Sitemap: https://www.example.com/sitemap.xml` at the bottom. This tells crawlers where to look.
2. Submit via Search Console — Sitemaps section. Google reads it and reports on coverage.
3. Validate the format — Search Console flags syntax errors. Lighthouse and online validators (`xml-sitemaps.com`, `sitemaps.org/protocol.html`) check well-formedness.
4. Monitor Search Console > Sitemaps — compare submitted vs indexed counts and investigate gaps.
5. Resubmit on major content changes — Google polls the sitemap automatically; manual resubmission speeds up discovery for urgent changes.
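The cheapest validation can happen before submission, in your build pipeline. This sketch checks the two failures that are trivial to catch locally — malformed XML and `<url>` entries missing a `<loc>` — using only the standard library (it does not replace Search Console's checks):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_text):
    """Return (url_count, problems) for a sitemap string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return 0, [f"not well-formed: {exc}"]
    problems = []
    urls = root.findall("sm:url", NS)
    for i, node in enumerate(urls):
        loc = node.find("sm:loc", NS)
        if loc is None or not (loc.text or "").strip():
            problems.append(f"entry {i} has no <loc>")
    return len(urls), problems

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><lastmod>2026-05-04</lastmod></url>
</urlset>"""

count, problems = check_sitemap(sample)
print(count, problems)  # 2 ['entry 1 has no <loc>']
```

Failing the build when `problems` is non-empty prevents the "sitemap returning 4xx/5xx or broken XML" class of incident entirely.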
Common sitemap mistakes
- Including non-canonical duplicate URLs. The sitemap should be the canonical list; duplicates dilute the signal.
- Including
noindexURLs. Confuses Google — you’re saying both “index this” and “don’t index this”. - Stale
lastmodvalues. Either updating on every deploy (trains Google to ignore the field) or never updating at all (Google never re-crawls fresh content). - Sitemap returning 4xx/5xx. Search Console flags it; the bot can’t read it; nothing gets discovered.
- Sitemap not referenced in robots.txt. Discovery still works via Search Console, but other crawlers (Bing, AI engine bots) may miss it.
- Forgetting to update the sitemap when content changes. Especially common on hand-written sitemaps; auto-generation via the CMS / framework solves it.
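Most of these mistakes come down to one missing filter in the generation step. As a sketch, assuming a hypothetical crawl or CMS export with `url`, `status`, `noindex`, and `canonical` fields (the field names are assumptions, not a real API):

```python
def sitemap_eligible(record):
    """True if a crawl record belongs in the sitemap:
    returns 200, is indexable, and is its own canonical."""
    return (
        record["status"] == 200
        and not record["noindex"]
        and record["canonical"] == record["url"]  # canonical self-reference
    )

records = [
    {"url": "https://www.example.com/a/", "status": 200,
     "noindex": False, "canonical": "https://www.example.com/a/"},
    {"url": "https://www.example.com/a/?sort=price", "status": 200,
     "noindex": False, "canonical": "https://www.example.com/a/"},   # parameter variant
    {"url": "https://www.example.com/old/", "status": 301,
     "noindex": False, "canonical": "https://www.example.com/old/"}, # redirect
    {"url": "https://www.example.com/account/", "status": 200,
     "noindex": True, "canonical": "https://www.example.com/account/"},  # noindex
]
eligible = [r["url"] for r in records if sitemap_eligible(r)]
print(eligible)  # ['https://www.example.com/a/']
```

Running the exclude rules as code at generation time, rather than as a manual checklist, is what keeps the sitemap canonical as the site changes.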
The bottom line
An XML sitemap is a curated discovery hint. It tells search engines which URLs you consider canonical and worth crawling, and it gives you a Search Console coverage report you can audit systematically. It doesn’t guarantee indexing, doesn’t boost rankings, and doesn’t replace internal linking. Build it from canonical, indexable, valuable URLs only; split by content type once you exceed the limits; reference it from robots.txt and submit it via Search Console. Then watch the coverage report — it’s where most genuine crawl issues surface first.
Common questions
What does an XML sitemap actually do?
It tells search engines which URLs you consider important enough to crawl. It's a discovery hint — not a ranking signal, not a guarantee of indexing. URLs in your sitemap can still be ignored by Google if they fail quality, duplication, or technical checks. URLs not in your sitemap can still be discovered and indexed via internal links. Treat the sitemap as one of three discovery channels alongside internal linking and external links — important, but not the whole story.