
Technical SEO

Chapter 04 / 09

Crawling and indexing

Two distinct stages, two distinct failure modes. What stops Googlebot from crawling, what stops it from indexing, and how to diagnose either one from Search Console without guessing.

9 min read · Published May 4, 2026

Crawling and indexing are the two foundational stages every page has to clear before it can rank. They’re distinct: crawling is discovery and rendering, indexing is decision and storage. They have different failure modes, different diagnostic surfaces in Search Console, and different fixes. Conflating them is the most common reason “why isn’t my page ranking” investigations go in circles.

Crawled but not indexed is a content problem. Discovered but not crawled is a crawl-priority problem. Not even discovered is an internal-linking problem. Three different failures, three different fixes — and Search Console tells you which one you’re looking at.

The full pipeline — discover → crawl → render → index → rank

  • Discovery. What happens: Googlebot finds the URL via internal link, sitemap, external link, or manual submission. What can fail: no internal link + no sitemap = orphan page.
  • Crawl. What happens: Googlebot fetches the HTML at the URL. What can fail: robots.txt block, 4xx/5xx server error, slow response.
  • Render. What happens: Google runs JavaScript, builds the final DOM, extracts content + signals. What can fail: JS errors, blocked resources, dynamic content not rendering.
  • Index decision. What happens: Google decides whether the rendered page goes into the index. What can fail: low quality, duplication, noindex, canonical pointing elsewhere.
  • Rank. What happens: indexed pages compete in retrieval for a query. What can fail: out of scope for this article; see the Google algorithm cluster.

Each stage has a Search Console signal. Discovery and crawl issues show in the Crawl Stats report and the Pages report (“Discovered — currently not indexed”). Render issues show in the Inspect URL tool when you compare HTML to rendered HTML. Index decisions show in the Pages report (“Crawled — currently not indexed”).

Stage 1 — Discovery

Googlebot finds new URLs through three primary channels:

  • Internal links from already-crawled pages on your domain.
  • XML sitemaps submitted via Search Console.
  • External links from other domains, plus URL submission via Search Console’s URL Inspection tool.

A page that doesn’t appear in any of those is an “orphan page” — Google doesn’t know it exists. The fix is the simplest in this whole article: add an internal link from somewhere reachable, or add the URL to your sitemap, or both.
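
If the sitemap route is the gap, the file itself is tiny. A minimal sketch of a sitemap entry, with a placeholder URL:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> element per page you want Google to discover -->
  <url>
    <loc>https://www.example.com/guides/crawling-and-indexing</loc>
    <lastmod>2026-05-04</lastmod>
  </url>
</urlset>
```

Submit the sitemap URL once in Search Console > Sitemaps; after that, keeping the file up to date is enough.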

Stage 2 — Crawl

Once Googlebot has the URL, it tries to fetch it. The fetch can fail for several reasons:

  • Robots.txt block. Common after launches when staging robots.txt rules are accidentally promoted to production (see the example after this list).
  • 4xx errors. 404s and 410s are correct for deleted pages but a problem when valid pages return them by mistake.
  • 5xx errors. Server-side issues — overload, application crashes, misconfigured CDN. Googlebot backs off and retries; persistent 5xx demotes the URL.
  • Slow response. If the server takes more than 10–15 seconds to respond, Googlebot may abandon the fetch.
  • Crawl budget caps on large sites — Googlebot won’t fetch every URL on every visit.
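
The robots.txt failure deserves a concrete picture, because one leftover line is enough to block the whole site. A hypothetical before/after, not a template to copy:

```
# Staging rules accidentally shipped to production: blocks all crawling
User-agent: *
Disallow: /

# What production usually wants instead: crawl everything except private paths
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
```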

Search Console’s Crawl Stats report shows the volume Googlebot is fetching, the response codes it’s seeing, and the average response time. Anomalies there usually predict ranking trouble before it shows in traffic.

Stage 3 — Render

Modern Google renders pages with a headless Chromium that executes JavaScript before extracting content. Two pages can return identical HTML and very different rendered DOMs depending on what their JS does. Render failures show as missing content in the indexed version even when the URL was crawled successfully.

Use Search Console > URL Inspection > Test live URL > View tested page > Screenshot + HTML. If the rendered HTML is missing content that users see in the browser, Google can’t see that content. Common causes:

  • Render-blocking JavaScript that times out before the bot finishes rendering.
  • Content loaded after user interaction (click-to-reveal, infinite scroll without IntersectionObserver-based prerender).
  • Resources blocked by robots.txt — JS files, CSS files, API endpoints critical to the rendered output.
  • API failures during render — content fetched from a backend that the bot can’t reach.
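
Before reaching for Search Console, a quick spot check tells you whether a piece of content exists in the raw HTML at all or only appears after JavaScript runs. A minimal sketch; the URL and phrase are placeholders:

```python
import urllib.request

# Placeholders: use your own URL and a phrase you can see in the browser.
URL = "https://www.example.com/some-page"
PHRASE = "a sentence visible on the rendered page"

req = urllib.request.Request(URL, headers={"User-Agent": "raw-html-spot-check"})
raw_html = urllib.request.urlopen(req, timeout=30).read().decode("utf-8", "replace")

if PHRASE in raw_html:
    print("Phrase found in raw HTML: the content does not depend on client-side JS.")
else:
    print("Phrase missing from raw HTML: it is injected by JavaScript, so confirm "
          "Google renders it via URL Inspection > Test live URL.")
```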

See the dedicated JavaScript SEO article for the deeper render fix list.

Stage 4 — Index decision

Once rendered, Google decides whether the page is worth keeping in the index. The most common exclusion states in Search Console:

  • Crawled — currently not indexed. What it means: Google fetched and rendered the page and rejected it, for quality, duplication, or thin-content reasons. Typical fix: improve content quality, add unique value, consolidate duplicate URLs, refresh outdated pages.
  • Discovered — currently not indexed. What it means: Google knows the URL exists but didn’t fetch it, a crawl-priority or budget reason. Typical fix: increase internal linking from authoritative pages; reduce low-value URLs in the crawl path; check site speed.
  • Duplicate without user-selected canonical. What it means: Google decided this page is a duplicate of another one and no canonical is set. Typical fix: set an explicit canonical, consolidate duplicates, or improve uniqueness.
  • Page with redirect. What it means: the URL redirects to another URL; the destination is what gets indexed. Typical fix: usually correct; verify the destination is the intended canonical.
  • Soft 404. What it means: the page returns HTTP 200 but Google treats it as “not found”. Typical fix: return a proper 404/410, restore the content, or 301 redirect.
  • Blocked by robots.txt. What it means: robots.txt prevents crawling. Typical fix: adjust robots.txt if the block was unintentional.
  • Excluded by ‘noindex’ tag. What it means: the page has a noindex meta tag or X-Robots-Tag header. Typical fix: remove the noindex if the exclusion was unintentional (see the snippet after this list).
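
Two of those states are pure configuration, so they are worth seeing literally. A noindex can arrive as a meta tag or as an HTTP response header, and a canonical is a single link element; the URL below is a placeholder:

```html
<!-- In the <head>: keep this page out of the index -->
<meta name="robots" content="noindex">

<!-- The same instruction as a response header (useful for PDFs and other non-HTML files):
     X-Robots-Tag: noindex -->

<!-- Canonical: tell Google which URL is the preferred version of this content -->
<link rel="canonical" href="https://www.example.com/preferred-url">
```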

Crawl budget — when it matters

For sites with fewer than ~10,000 URLs, crawl budget rarely matters; Google can crawl your entire site frequently. For larger sites — e-commerce with deep faceted catalogs, marketplaces, programmatic SEO at scale — crawl budget becomes a real constraint.

Symptoms of crawl budget pressure:

  • New URLs taking weeks to be crawled and indexed.
  • Updated content not refreshing in the index for a long time.
  • Large numbers of URLs in “Discovered — currently not indexed”.
  • Crawl Stats showing the bot spending most of its quota on low-value URLs (faceted nav permutations, sort variants, filter combinations).
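
Crawl Stats aggregates this picture, but raw access logs show exactly which URLs are absorbing the budget. A rough sketch, assuming a combined-format log whose path will differ on your stack:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

LOG_PATH = "/var/log/nginx/access.log"  # Placeholder: depends on your server setup

# In the combined log format the request is the quoted "GET /path HTTP/1.1" field.
request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # User-agent strings can be spoofed; verify with reverse DNS for a strict audit.
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        # Bucket by path plus a marker for query strings, so faceted and sort
        # parameter variants stand out as a group.
        key = url.path + ("?<params>" if url.query else "")
        hits[key] += 1

for key, count in hits.most_common(20):
    print(f"{count:6d}  {key}")
```

If the top of that list is dominated by parameterized variants of a handful of templates, the mitigations below are where the budget comes back.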

Mitigations:

  • Disallow low-value URL parameters in robots.txt (example after this list). Noindex alone doesn’t reduce crawling, because Googlebot still has to fetch the page to see the tag.
  • Use canonical tags to consolidate duplicates instead of letting all variants get crawled.
  • Prune dead-weight URLs (long-tail product pages with no traffic, archive listings nobody reads).
  • Improve site speed — faster responses = more URLs crawled per session.
  • Use XML sitemaps to signal priority URLs.
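
For the robots.txt mitigation, the rules are short. A hypothetical example for a faceted catalog; the parameter names are placeholders for whatever your own filters and sorts use:

```
# Keep Googlebot out of filter and sort permutations
User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?color=
Disallow: /*&color=
```

URLs blocked this way can still show up as “Blocked by robots.txt” rows in the Pages report; that is expected and usually fine.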

The Search Console diagnostic workflow

When a page isn’t ranking and you suspect crawl/index issues, work this sequence:

  1. URL Inspection. Paste the URL and check the “URL is on Google” status. If it’s not indexed, the inspection tool tells you why.
  2. Pages report > filter to the relevant URL pattern. See which bucket the URL falls into (indexed, crawled-not-indexed, discovered-not-crawled, etc.).
  3. Crawl Stats report. Confirm Googlebot is reaching the site successfully, response codes are sane, and average response time is under a few seconds.
  4. Coverage trends. Sudden drops in indexed-page count are usually a robots.txt regression, a noindex tag rolled out site-wide, or a canonical pointing elsewhere.
  5. URL Inspection > Test live URL. Confirms the rendered HTML matches what you expect and checks whether the bot can render the content.
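
Steps 1 and 5 go faster if the self-inflicted blockers are ruled out first. A small pre-check sketch; the URL is a placeholder, and it only inspects headers and static HTML, not the rendered page:

```python
import re
import urllib.request
from urllib.error import HTTPError

URL = "https://www.example.com/page-that-should-rank"  # Placeholder

req = urllib.request.Request(URL, headers={"User-Agent": "index-precheck"})
try:
    resp = urllib.request.urlopen(req, timeout=30)
except HTTPError as err:  # 4xx/5xx: Googlebot sees the same thing
    print("Status code:", err.code)
    raise SystemExit(1)

html = resp.read().decode("utf-8", "replace")
print("Status code: ", resp.status)
print("Final URL:   ", resp.url)  # Differs from URL if a redirect happened
print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag", "(none)"))

meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I)
print("Meta robots: ", meta.group(0) if meta else "(none)")

canon = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*>', html, re.I)
print("Canonical:   ", canon.group(0) if canon else "(none)")
```

A clean status, no stray noindex, and a canonical pointing at the URL itself narrow the remaining suspects to rendering and the index decision.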

The bottom line

Crawling and indexing are two stages, not one. A page can fail at discovery (no link, no sitemap), at crawl (robots block, 4xx/5xx), at render (JS issues), or at the index decision (quality, duplication, canonical). Each failure shows in a different Search Console surface and demands a different fix. Don’t guess — diagnose. The tooling is there; most teams just don’t use it systematically.

Common questions


Crawling is the discovery stage — Googlebot follows links, fetches HTML, and decides what to render. Indexing is the storage stage — after rendering, Google decides whether the page is worth keeping in the index, and stores it with extracted signals (content, schema, canonical, links). A page can be crawled but not indexed (Google saw it but rejected it) and a page can fail to be crawled at all (no internal link, blocked by robots, server returned an error). Different failures require different fixes.