
Fundamentals

Chapter 02 / 09

How search engines work

Crawling, indexing, ranking — and the fourth stage that didn't exist in 2018 but decides who gets cited inside ChatGPT and Google AI Overviews in 2026.

10 min read · Published May 4, 2026

Search engines work by repeating four stages on a continuous loop. Three of them — crawling, indexing, ranking — have been the model since 1998. The fourth — synthesis — didn’t exist when most SEO advice was written, and it now decides who gets cited inside ChatGPT, Google AI Overviews, Gemini, Perplexity, and Claude. Skipping it is why a lot of teams ship technically perfect content that nobody finds.

This article walks through each stage and points out where modern SEO actually moves the needle versus where the textbooks tell you to look.

Most SEO problems don’t live where teams optimise. The bottleneck is usually one stage upstream of where they’re looking.

The four stages

Stage 1. Crawling
What happens: A bot fetches the URL, follows links, returns to the queue
Where SEO leverage lives: Internal links, sitemap, robots.txt, render performance

Stage 2. Indexing
What happens: The engine decides whether the page is worth storing
Where SEO leverage lives: Content quality, duplicate detection, canonical, schema

Stage 3. Ranking
What happens: When a query arrives, candidates are scored and ordered
Where SEO leverage lives: Search intent match, authority signals, freshness, E-E-A-T

Stage 4. Synthesis
What happens: AI engines compose an answer citing multiple indexed sources
Where SEO leverage lives: Passage structure, entity clarity, sameAs, citation-readiness

1. Crawling — how the bot finds your page

Googlebot, Bingbot, OpenAI’s GPTBot and OAI-SearchBot, Anthropic’s ClaudeBot, and Perplexity’s PerplexityBot all do the same thing: fetch a URL, parse the HTML, follow the links inside, and add new URLs to a queue. The queue gets enormous fast — billions of URLs across the open web — so engines prioritise. The mistake is assuming “published” means “crawled.”
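
That fetch-parse-enqueue loop is simple enough to sketch. The toy crawler below uses only Python’s standard library; the seed URL and user-agent are placeholders, and real bots layer robots.txt rules, large-scale deduplication, and per-site budgets on top of this skeleton.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        # The frontier is the queue described above; real engines
        # reorder it instead of fetching strictly first-in, first-out.
        frontier, seen, fetched = deque([seed]), {seed}, 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            fetched += 1
            try:
                req = Request(url, headers={"User-Agent": "toy-crawler/0.1"})
                html = urlopen(req, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable URLs simply drop out of the queue
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return seen

    # crawl("https://example.com")  # hypothetical seed URL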

Three things determine whether a bot reaches your page in days versus weeks:

  • Internal links from authoritative pages. A new URL with three internal links from ranking pages gets crawled before the same URL with zero internal links. Internal linking is the single most underused crawl-priority lever in 2026: teams publish a new article and never add it to the homepage, the cluster page, or sibling articles.
  • Sitemap freshness. An XML sitemap with accurate lastmod timestamps tells the engine which URLs are new or changed since the last crawl. A sitemap that was generated once at launch and never regenerated is invisible to the engine’s prioritisation logic (a regeneration sketch follows this list).
  • Render performance. Bots have a budget per site (informally called the crawl budget). Pages that take 8 seconds to render burn budget that the engine could have spent on other URLs. Core Web Vitals matter here too — not just for ranking but for how many of your pages get crawled in a given window.
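
Regenerating the sitemap is mechanical, which is why shipping a stale one is avoidable. A minimal sketch of the lastmod regeneration described above, assuming you can iterate over (URL, last-modified date) pairs from your CMS; the URLs are placeholders.

    import xml.etree.ElementTree as ET
    from datetime import date

    def build_sitemap(pages):
        """pages: iterable of (url, last_modified_date) pairs."""
        ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
        urlset = ET.Element("urlset", xmlns=ns)
        for url, last_modified in pages:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
            # lastmod is the field the engine's prioritisation logic reads;
            # it has to change whenever the page actually changes.
            ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
        return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

    print(build_sitemap([
        ("https://example.com/guide", date(2026, 5, 4)),    # hypothetical URLs
        ("https://example.com/pricing", date(2026, 4, 12)),
    ]))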

2. Indexing — the decision to store

Crawled is not indexed. After fetching the page the engine asks one question: is this worth keeping? Pages that fail this check get crawled but discarded — they’ll never rank, no matter how good the on-page optimisation looks.

The usual reasons a page fails the indexing decision in 2026:

  • Thin or duplicate content. If the page repeats what already exists in the index — same boilerplate, same product description, same FAQ — the engine has no reason to add another copy. Programmatic SEO done badly fails here.
  • Confused canonical signals. When two URLs serve essentially the same content (e.g., a query-string variant) and don’t agree on which is canonical, the engine often indexes neither. Canonical chains and missing self-referencing canonicals trip this constantly.
  • Missing or invalid schema. Schema doesn’t guarantee indexation, but an Article + FAQPage + BreadcrumbList graph signals to the engine that you’ve thought about what the page is (a sketch of such a graph follows this list). Pages without it look generic and lose tie-breakers.
  • Soft 404 patterns. Pages that load but say “no results” or “this product is unavailable” in their main content get classified as soft 404s by Google and skipped.
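
For the schema point above, here is a minimal sketch of an Article + FAQPage + BreadcrumbList graph emitted as JSON-LD; every name, URL, and question is a placeholder, and the property set is trimmed to show the shape of the graph rather than complete markup.

    import json

    # Hypothetical page; every value below is a placeholder.
    page_url = "https://example.com/how-search-engines-work"

    graph = {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Article",
                "@id": page_url + "#article",
                "headline": "How search engines work",
                "datePublished": "2026-05-04",
                "author": {
                    "@type": "Person",
                    "name": "Jane Doe",  # hypothetical author
                    # sameAs is what lets the engine disambiguate the entity
                    "sameAs": ["https://www.linkedin.com/in/janedoe"],
                },
            },
            {
                "@type": "FAQPage",
                "@id": page_url + "#faq",
                "mainEntity": [{
                    "@type": "Question",
                    "name": "What is the difference between crawling and indexing?",
                    "acceptedAnswer": {
                        "@type": "Answer",
                        "text": "Crawling fetches the page; indexing decides whether to store it.",
                    },
                }],
            },
            {
                "@type": "BreadcrumbList",
                "@id": page_url + "#breadcrumbs",
                "itemListElement": [
                    {"@type": "ListItem", "position": 1, "name": "Fundamentals",
                     "item": "https://example.com/fundamentals"},
                    {"@type": "ListItem", "position": 2, "name": "How search engines work",
                     "item": page_url},
                ],
            },
        ],
    }

    # Paste the output into a <script type="application/ld+json"> tag.
    print(json.dumps(graph, indent=2))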

3. Ranking — the score on every query

When a user types a query, the engine pulls candidate pages from the index, scores them against hundreds of factors, and orders them. The score is not a single number computed once; it’s recalculated for each query because the same page can be a great fit for one query and a poor fit for another.

Most SEO advice fixates on the ranking stage because it’s the visible one. The reality is that ranking factors only matter for pages that already cleared the indexing decision — so optimising on-page elements before the indexation problem is solved is wasted effort.

That said, the factor categories that move ranking in 2026 are well established (a toy scoring sketch follows the list):

  • Search intent match. Does the page answer the actual user need behind the query? Informational queries want guides; transactional queries want products. Mismatched intent loses to a weaker page that nailed the intent.
  • Authority signals. Backlinks, brand mentions, sameAs identity, citation count in adjacent media. The engine’s shorthand for “is this site trustworthy in this category?”
  • Freshness. Different query types demand different freshness. “What year is it” needs daily updates; “how to write a will” doesn’t. Stale pages on time-sensitive queries lose; recently-updated pages on evergreen queries don’t automatically win.
  • E-E-A-T. Experience, Expertise, Authoritativeness, Trust — Google’s framework for who the engine should believe. Encoded through Person + Author schema, sameAs identity, and editorial signals from third parties.
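
To make the per-query recalculation concrete, here is a toy scorer over the four categories above; the weights and signal stand-ins are invented for the sketch and bear no relation to any engine’s real ones.

    from dataclasses import dataclass

    @dataclass
    class Page:
        intent: str            # "informational" or "transactional"
        authority: float       # 0..1, stand-in for link and brand signals
        days_since_update: int

    def score(page, query_intent, freshness_sensitive):
        # Intent mismatch is close to disqualifying, so it gates everything.
        intent_fit = 1.0 if page.intent == query_intent else 0.2
        # Freshness only matters when the query type demands it.
        freshness = 1.0
        if freshness_sensitive:
            freshness = max(0.0, 1.0 - page.days_since_update / 365)
        # Invented weights; real engines blend hundreds of signals.
        return intent_fit * (0.6 * page.authority + 0.4 * freshness)

    guide = Page("informational", authority=0.7, days_since_update=400)
    # The same page scores differently on different queries:
    print(score(guide, "informational", freshness_sensitive=False))  # strong fit
    print(score(guide, "transactional", freshness_sensitive=True))   # weak fit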

4. Synthesis — the stage SEO advice keeps missing

AI engines do not show ten blue links. They compose an answer by pulling passages from multiple indexed sources, weaving them into a single response, and citing each source. The decision of which sources to pull and quote is the synthesis stage — and it scores pages on signals that classic ranking does not weigh as heavily.

What synthesis looks for that ranking under-weights:

  • Self-contained passages. Two to three sentences that answer a sub-question without needing the rest of the article for context. Synthesis tends to pull paragraphs, not pages, so paragraphs that stand alone get pulled more often (a heuristic check is sketched after this list).
  • Entity clarity. The page mentions the entity (your brand, product, person) in a way that the engine can disambiguate. Inconsistent entity descriptions across the site, vague company descriptions, missing sameAs links — all hurt synthesis pickup even if classic ranking is fine.
  • Citation-ready facts. Numbers, dates, attribution to sources. AI engines prefer to quote pages where the facts are clearly stated and clearly sourced. Vague writing (“studies show”) loses to specific writing (“a 2026 Ahrefs study of 4M URLs found”).
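
One rough way to audit self-containment is to flag paragraphs that open with a reference the reader cannot resolve without scrolling up. The opener list and the rule below are illustrative heuristics, not anything an engine publishes.

    # Openers that force a reader (or an AI engine) to look upstream for context.
    DANGLING_OPENERS = {"this", "that", "these", "those", "it", "they", "such"}

    def is_self_contained(paragraph):
        """Crude check: a standalone passage should not open with a dangling reference."""
        words = paragraph.strip().lower().split()
        return bool(words) and words[0] not in DANGLING_OPENERS

    paragraphs = [
        "Crawling is when a bot fetches a URL and reads its content.",
        "This makes it much faster.",  # needs the previous paragraph for context
    ]
    for p in paragraphs:
        print(is_self_contained(p), "-", p)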

Synthesis is why a page can rank position 8 in Google but be the source most quoted in ChatGPT for the same query — and vice versa. The two stages reward different things.

What this means for SEO practice

Map every SEO problem you’re trying to solve to one of the four stages first. The fix lives at the stage where the problem starts, not the stage where the symptom appears.

Symptom: Page not in Google index
Likely stage: Crawling or Indexing
Likely fix: Add internal links + check Search Console for 'Crawled — currently not indexed' or duplicate/canonical errors

Symptom: Indexed but ranks position 30+
Likely stage: Ranking
Likely fix: Search intent mismatch or thin authority for this category — usually content-side, not technical

Symptom: Ranks well in Google but never appears in ChatGPT
Likely stage: Synthesis
Likely fix: Restructure paragraphs to be self-contained; tighten entity description; add sameAs

Symptom: Loses ranking after a Google core update
Likely stage: Ranking
Likely fix: Factor weights shifted — usually means the site was relying on something the algorithm now devalues
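
Because the table is effectively a lookup, it can be encoded as a triage helper; the symptom keys below are shorthand inventions for the sketch.

    # Symptom -> (likely stage, likely fix), mirroring the table above.
    TRIAGE = {
        "not in index": ("Crawling or Indexing",
                         "Add internal links; check Search Console for "
                         "'Crawled — currently not indexed' or canonical errors"),
        "ranks 30+": ("Ranking",
                      "Fix search intent mismatch or thin authority; content-side work"),
        "absent from ChatGPT": ("Synthesis",
                                "Make paragraphs self-contained; tighten entity "
                                "description; add sameAs"),
        "dropped after core update": ("Ranking",
                                      "A devalued signal was carrying the site"),
    }

    def triage(symptom):
        stage, fix = TRIAGE[symptom]
        return "Stage: " + stage + ". Fix: " + fix

    print(triage("absent from ChatGPT"))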

Common questions

Quick answers to what we get asked before every trial signup.

What’s the difference between crawling and indexing?

Crawling is when a bot fetches a URL and reads its content. Indexing is when the engine decides that page is worth storing in its database to rank later. Crawled is not indexed — Search Console reports the gap as 'Crawled — currently not indexed', and it’s the most common reason a page exists but never ranks.