Technical SEO
Chapter 05 / 09
XML sitemaps
What sitemaps actually do (and don't), what to put in them, what to leave out, and the structure that scales from a 50-page site to a 5-million-URL marketplace.

XML sitemaps are simple in concept and consistently misunderstood in practice. They’re a hint to search engines about which URLs matter — nothing more, nothing less. They don’t boost rankings, don’t guarantee indexing, and don’t solve content quality problems. What they do is give search engines a curated list of canonical URLs you want crawled, and give you a clean Search Console coverage report you can actually act on.
“A sitemap is not a request that Google index your URLs. It’s a hint that you think these URLs are important. Quality, duplication, and crawl budget still decide whether the index accepts them.”
What sitemaps do — and what they don’t
| What sitemaps DO | What sitemaps DON'T do |
|---|---|
| Help search engines discover URLs they might otherwise miss | Force indexing — quality and duplication checks still apply |
| Speed up crawl of new URLs added to the sitemap | Boost rankings — they're a discovery hint, not a ranking signal |
| Surface coverage data in Search Console for systematic auditing | Replace internal linking — orphan pages still rank weakly even if in the sitemap |
| Allow last-modified hints (lastmod) for content-refresh detection | Override noindex tags or robots.txt blocks |
| Scale via sitemap-index files for large catalogs | Excuse poor architecture — they're a complement to internal linking, not a substitute |
Sitemap structure — the basics
A minimal sitemap is an XML file listing each URL with optional metadata. A valid file needs the XML declaration and the `<urlset>` wrapper around the `<url>` entries:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-05-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
In practice:
- `loc` — required. The full canonical URL.
- `lastmod` — strongly recommended. The date the content meaningfully changed. Google uses this for refresh detection. Don't lie about it; updating `lastmod` on every deploy without changing content trains Google to ignore it.
- `changefreq` — Google has publicly said this is mostly ignored; safe to omit.
- `priority` — also mostly ignored; safe to omit.
Modern sitemaps usually include just `loc` and `lastmod`. Anything more is theatre.
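As a minimal sketch of that "just `loc` and `lastmod`" approach, the following generates a sitemap with Python's standard library. The URLs and dates are placeholders, not taken from any real site:

```python
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a sitemap string from (url, last_modified_date) pairs.

    Only <loc> and <lastmod> are emitted; changefreq and priority
    are omitted because Google mostly ignores them.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = last_modified.isoformat()
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(
        urlset, encoding="unicode"
    )

# Placeholder URLs for illustration:
xml = build_sitemap([
    ("https://www.example.com/", date(2026, 5, 4)),
    ("https://www.example.com/blog/technical-seo/", date(2026, 4, 12)),
])
print(xml)
```

Generating the file from your CMS's canonical URL list (rather than hand-writing it) is what keeps `lastmod` honest.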
What to include — and what to exclude
The mental model: the sitemap is the curated list of every URL you want indexed and would be proud to rank. If you wouldn’t want a URL ranking on Google, it doesn’t belong in your sitemap.
Include
- Canonical homepage and all canonical landing pages.
- Cluster pages, hub pages, category pages.
- All article / blog post / academy URLs.
- Product detail pages (canonical only).
- Service pages, location pages, comparison pages.
- Any user-generated content you’ve decided to make crawlable (review pages, profile pages with substantive content).
Exclude
- Pages with a `noindex` meta tag.
- URLs blocked by robots.txt.
- Non-canonical duplicates — only the canonical belongs.
- Paginated archive pages (`/page/2/`, `/page/3/`, etc.).
- Parameter variants (sort, filter, tracking) — only the clean canonical URL.
- Internal search result pages.
- Login, signup, thank-you, confirmation, account pages.
- Redirect URLs — point at the destination, not the redirect.
- Soft-404 candidates and pages returning 4xx/5xx.
- Print versions, AMP pages (if separate URLs), m. mobile subdomain pages.
Sitemap-index files for larger sites
Google’s limits per sitemap file:
- Maximum 50,000 URLs per sitemap.
- Maximum 50 MB uncompressed file size.
Above either limit, split into multiple sitemap files referenced from a sitemap-index. The index pattern most teams use:
- `/sitemap.xml` — the index, references all sub-sitemaps
- `/sitemap-articles.xml` — all academy / blog articles
- `/sitemap-products.xml` — all canonical product detail pages
- `/sitemap-categories.xml` — all category and hub pages
- `/sitemap-locations.xml` — for multi-location businesses
Splitting by content type makes the Search Console coverage report directly actionable — you can see at a glance whether the issue is in articles, products, or categories. A single monolithic sitemap forces you to filter manually.
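A rough sketch of how the splitting works under the 50,000-URL limit, assuming a hypothetical product catalog and a `sitemap-products-N.xml` naming scheme (the names are illustrative, not a standard):

```python
import xml.etree.ElementTree as ET

MAX_URLS_PER_SITEMAP = 50_000  # Google's per-file URL limit

def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into sitemap-sized chunks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(base, n_files):
    """Build a sitemap-index string referencing n_files sub-sitemaps."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    index = ET.Element("sitemapindex", xmlns=ns)
    for i in range(1, n_files + 1):
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = f"{base}/sitemap-products-{i}.xml"
    return ET.tostring(index, encoding="unicode")

# 120,000 placeholder product URLs -> 3 sitemap files + 1 index
urls = [f"https://www.example.com/product/{i}/" for i in range(120_000)]
parts = chunk(urls)
index_xml = build_index("https://www.example.com", len(parts))
print(len(parts))  # 3
```

Each chunk would be written out as its own sitemap file; only the index is submitted to Search Console, and Google follows it to the sub-sitemaps.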
Submission and validation
1. Reference the sitemap in robots.txt — add `Sitemap: https://www.example.com/sitemap.xml` at the bottom. This tells crawlers where to look.
2. Submit via Search Console — Sitemaps section. Google reads it and reports on coverage.
3. Validate the format — Search Console flags syntax errors. Lighthouse and online validators (`xml-sitemaps.com`, `sitemaps.org/protocol.html`) check well-formedness.
4. Monitor Search Console > Sitemaps — compare submitted vs indexed counts and investigate gaps.
5. Resubmit on major content changes — Google polls the sitemap automatically; manual resubmission speeds up discovery for urgent changes.
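The cheapest validation can happen before submission, in your build pipeline. This sketch checks the two failures that are trivial to catch locally — malformed XML and `<url>` entries missing a `<loc>` — using only the standard library (it does not replace Search Console's checks):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_text):
    """Return (url_count, problems) for a sitemap string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return 0, [f"not well-formed: {exc}"]
    problems = []
    urls = root.findall("sm:url", NS)
    for i, node in enumerate(urls):
        loc = node.find("sm:loc", NS)
        if loc is None or not (loc.text or "").strip():
            problems.append(f"entry {i} has no <loc>")
    return len(urls), problems

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><lastmod>2026-05-04</lastmod></url>
</urlset>"""

count, problems = check_sitemap(sample)
print(count, problems)  # 2 ['entry 1 has no <loc>']
```

Failing the build when `problems` is non-empty prevents the "sitemap returning 4xx/5xx or broken XML" class of incident entirely.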
Common sitemap mistakes
- Including non-canonical duplicate URLs. The sitemap should be the canonical list; duplicates dilute the signal.
- Including
noindexURLs. Confuses Google — you’re saying both “index this” and “don’t index this”. - Stale
lastmodvalues. Either updating on every deploy (trains Google to ignore the field) or never updating at all (Google never re-crawls fresh content). - Sitemap returning 4xx/5xx. Search Console flags it; the bot can’t read it; nothing gets discovered.
- Sitemap not referenced in robots.txt. Discovery still works via Search Console, but other crawlers (Bing, AI engine bots) may miss it.
- Forgetting to update the sitemap when content changes. Especially common on hand-written sitemaps; auto-generation via the CMS / framework solves it.
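Most of these mistakes come down to one missing filter in the generation step. As a sketch, assuming a hypothetical crawl or CMS export with `url`, `status`, `noindex`, and `canonical` fields (the field names are assumptions, not a real API):

```python
def sitemap_eligible(record):
    """True if a crawl record belongs in the sitemap:
    returns 200, is indexable, and is its own canonical."""
    return (
        record["status"] == 200
        and not record["noindex"]
        and record["canonical"] == record["url"]  # canonical self-reference
    )

records = [
    {"url": "https://www.example.com/a/", "status": 200,
     "noindex": False, "canonical": "https://www.example.com/a/"},
    {"url": "https://www.example.com/a/?sort=price", "status": 200,
     "noindex": False, "canonical": "https://www.example.com/a/"},   # parameter variant
    {"url": "https://www.example.com/old/", "status": 301,
     "noindex": False, "canonical": "https://www.example.com/old/"}, # redirect
    {"url": "https://www.example.com/account/", "status": 200,
     "noindex": True, "canonical": "https://www.example.com/account/"},  # noindex
]
eligible = [r["url"] for r in records if sitemap_eligible(r)]
print(eligible)  # ['https://www.example.com/a/']
```

Running the exclude rules as code at generation time, rather than as a manual checklist, is what keeps the sitemap canonical as the site changes.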
The bottom line
An XML sitemap is a curated discovery hint. It tells search engines which URLs you consider canonical and worth crawling, and it gives you a Search Console coverage report you can audit systematically. It doesn’t guarantee indexing, doesn’t boost rankings, and doesn’t replace internal linking. Build it from canonical, indexable, valuable URLs only; split by content type once you exceed the limits; reference it from robots.txt and submit it via Search Console. Then watch the coverage report — it’s where most genuine crawl issues surface first.
Common questions
What does an XML sitemap actually do?
It tells search engines which URLs you consider important enough to crawl. It's a discovery hint — not a ranking signal, not a guarantee of indexing. URLs in your sitemap can still be ignored by Google if they fail quality, duplication, or technical checks. URLs not in your sitemap can still be discovered and indexed via internal links. Treat the sitemap as one of three discovery channels alongside internal linking and external links — important, but not the whole story.