Crawling & Indexing: How the Google Index Works [Guide 2026]

Crawling & Indexing: How Google Finds Your Website

Key Takeaways:

Before your website can rank, Google must find it (crawling) and understand it (indexing). This two-stage process is the invisible foundation of every SEO strategy – and the most common cause of missing rankings.

  • Googlebot systematically explores the web via links, sitemaps, and Search Console. The crawl budget determines how often and how many pages are visited.
  • Google’s Caffeine system analyzes crawled pages in near real time and stores them in different index tiers (Base, Zeppelins, Landfills). The tier-to-storage mapping comes from Mike King’s analysis of the API Leaks (iPullRank, 2024).
  • Canonical tags, robots.txt, meta robots, hreflang, and structured data are essential – errors here mean Google can’t find or understand your content.
  • Since October 2023, Google crawls and indexes primarily the mobile version of your website. Desktop-only content may not be captured at all.

Before your website can appear in search results, Google first needs to find and understand it. This process – consisting of crawling and indexing – forms the invisible foundation of every SEO strategy. Without it, even the best content is worthless.

In my technical SEO audits, I see it time and again: teams invest months in keyword research and content production – only to discover in Search Console that hundreds of their most important URLs carry the status “Crawled – currently not indexed.” They’re optimizing for rankings that can never happen because the technical foundation is missing. It’s like opening a shop and not registering the address on Google Maps – your offering may be excellent, but nobody can find you.

This article is part of my comprehensive guide to the Google Search Algorithm: From Crawling to Ranking. Here we dive deep into the technical foundations – with insights from the Google API Leaks 2024, which revealed internal system names like index tiers and crawl prioritization for the first time. Once your pages are indexed, the next phases take over: Query Processing, Ranking, and Re-Ranking with Twiddlers.

Why Crawling & Indexing Are the Foundation of SEO

Key Takeaway: The path to search results follows a clear hierarchy: Crawling → Indexing → Ranking → Visibility. If any of these stages fails, your content never reaches users. Most SEO problems are not ranking problems but crawling or indexing problems.

A web page’s journey to search results follows a clear hierarchy: crawling, then indexing, then ranking, and finally visibility. If any of these stages fails, your content never reaches users. The brutal truth is: most SEO problems are not ranking problems but crawling or indexing problems.

The Google API Leaks confirmed what many SEOs had long suspected: Google doesn’t manage a single index but multiple levels (tiers) with different priorities. A page in the so-called “Landfills” tier is practically never shown for important search queries – even if the content is excellent. The goal must therefore be not just to get indexed, but to end up in the right index tier.

A Typical Scenario from Practice

Consider a hypothetical but realistic example: a mid-sized online shop with thousands of product pages – yet only a few hundred rank for relevant keywords. Search Console shows “Crawled – currently not indexed” for a large portion of URLs. The typical cause? Faceted navigation generating tens of thousands of filter URLs without unique content, devouring Google’s crawl budget. This is exactly the pattern I regularly see in my technical audits.

The solution typically involves three steps: First, non-canonical filter combinations are either blocked from crawling via robots.txt or consolidated onto the main URL with canonical tags. The two don’t combine for the same URL, because Google can only read a canonical tag on a page it is allowed to crawl. Second, important category and product pages receive improved internal linking. Third, the sitemap is cleaned up to include only genuinely indexable URLs. In my experience, indexing rates typically increase significantly after such measures – organic traffic often doubles within a few months.

Note: According to Google’s John Mueller, Google indexes on average only between 30 and 60 percent of a website’s pages. The non-indexed remainder consists mostly of duplicates, low-quality pages, undiscoverable URLs, or content actively excluded from indexing. Your goal: ensure your important pages belong to the indexed portion.

What Is Crawling? Googlebot Explained

Key Takeaway: Crawling is the process by which Google’s automated crawlers explore the web to discover new and updated pages. Googlebot jumps from link to link, processes HTTP requests, and queues JavaScript-heavy pages into a separate rendering queue.

Crawling is the process by which Google’s automated programs – crawlers or spiders – explore the web to discover new and updated pages. The most well-known of these crawlers is Googlebot, but Google actually operates an entire family of specialized crawlers for different content types.

How Googlebot Works

Googlebot works like a tireless reader jumping from link to link. It starts with a list of known URLs from previous crawls, submitted sitemaps, or Search Console. For each URL, it sends an HTTP request to the server and downloads the HTML code, extracting all links on the page and adding them to its queue.
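The loop described above (take a URL from the queue, fetch it, extract its links, queue the new ones) can be sketched in a few lines of Python. This is only an illustration of the principle, not Googlebot's actual implementation: the `fetch` callable and the in-memory `FAKE_SITE` dictionary are assumptions that stand in for real HTTP requests, and prioritization, politeness, and rendering are left out entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Breadth-first crawl: visit each URL once and queue newly found links.

    `fetch` is a hypothetical callable that returns HTML for a URL, or None.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Tiny in-memory "website" standing in for real HTTP requests (made-up data).
FAKE_SITE = {
    "https://example.com/": '<a href="/shop/">Shop</a> <a href="/blog/">Blog</a>',
    "https://example.com/shop/": '<a href="/shop/shoe">Shoe</a> <a href="/">Home</a>',
    "https://example.com/blog/": '<a href="/shop/">Shop</a>',
    "https://example.com/shop/shoe": "<p>No links here</p>",
}

crawled = crawl(["https://example.com/"], FAKE_SITE.get)
print(crawled)
```

The `seen` set is what keeps the crawler from requesting the same URL twice. It also shows why a page without any inbound link never enters the queue unless it is listed as a seed, for example through a sitemap.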

For modern JavaScript-heavy websites, the process is more complex. Googlebot initially downloads only the initial HTML code and then queues the page into a separate rendering queue. There, JavaScript is executed and the fully rendered DOM is analyzed. This two-stage process can cause delays, which is why server-side rendering is so important for SEO-critical pages.

Google deploys specialized crawlers for different tasks:

Crawler | User-Agent | Task
Googlebot Smartphone | Googlebot/2.1 (Mobile) | Mobile-First crawling (primary since Oct. 2023)
Googlebot Desktop | Googlebot/2.1 | Desktop pages (secondary only)
Googlebot Images | Googlebot-Image/1.0 | Images for Google Image Search
Googlebot Video | Googlebot-Video/1.0 | Videos for Video Search
Googlebot News | Googlebot-News | News content for Google News
AdsBot | AdsBot-Google | Landing page quality for Google Ads

Important since October 2023: Googlebot Smartphone is the primary crawler. Google crawls and indexes the mobile version of your website by default. Desktop-only content risks being completely overlooked.

Crawl Budget: How Google Sets Priorities

Key Takeaway: Crawl budget consists of Crawl Rate Limit (server capacity) and Crawl Demand (Google’s interest). Duplicate content, soft errors, and endless parameter URLs waste it massively.

Google can’t crawl every URL on the internet simultaneously – even with massive infrastructure. That’s why Google assigns each website a crawl budget: the number of pages Googlebot can and wants to crawl within a given timeframe.

The crawl budget consists of two components. The first is the Crawl Rate Limit – the maximum crawl frequency without overloading your server. Google automatically adjusts this limit based on your server response times. The second component is Crawl Demand – how much does Google actually “want” to crawl your pages? This demand is based on page popularity, freshness, and perceived importance. Frequently linked and often updated pages have significantly higher priority.

What Wastes Your Crawl Budget

Certain technical problems can massively waste your crawl budget. Duplicate content is one of the most common culprits: when the same content is accessible under multiple URLs, Google crawls each separately. Faceted navigation in online shops often creates thousands of filter combinations without unique content. Session IDs in URLs generate infinite URL variants for identical content.
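How quickly faceted navigation gets out of hand is easy to underestimate, so here is the arithmetic as a small sketch. It assumes that each filter is either unset or set to exactly one value and that parameters always appear in the same order; the facet names and counts are invented for illustration.

```python
from math import prod


def count_filter_urls(facets):
    """Count distinct filter URLs when each facet is unset or set to one value.

    `facets` maps a facet name to its number of selectable values.
    The unfiltered category page itself is excluded from the count.
    """
    return prod(n + 1 for n in facets.values()) - 1


# Hypothetical shop category with four filters.
facets = {"color": 12, "size": 8, "brand": 25, "price_range": 5}
print(count_filter_urls(facets))  # 13 * 9 * 26 * 6 - 1 = 18251
```

Four modest filters already produce 18,251 crawlable URLs for a single category. If the parameter order is not fixed or multiple values per filter are allowed, the number grows further.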

Particularly insidious are so-called soft 404s (soft-error pages): pages that show users an error message but return a 200 status to Googlebot. Because the status code signals success, the bot keeps requesting them and spends crawl budget on worthless URLs. Similarly problematic are “infinite spaces” like endlessly paginated archives or calendars that can theoretically generate unlimited URLs. And if your website has been hacked, spam pages can consume your entire crawl budget – this is where Google’s SpamBrain system steps in to detect manipulative content.
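For your own audits, pages that answer with status 200 but read like an error page can be flagged with a simple heuristic. The sketch below is an assumption-laden starting point, not how Google detects such pages: the phrase list and the length threshold are illustrative and need tuning per site.

```python
# Illustrative phrases; adapt them to the wording your own templates use.
ERROR_PHRASES = ("page not found", "no longer available", "0 results")


def looks_like_soft_404(status_code, body_text, min_length=200):
    """Flag responses that return 200 but read like an error page.

    The phrases and the length threshold are assumptions for illustration.
    """
    if status_code != 200:
        return False  # a real error code is not a *soft* error
    text = body_text.lower()
    if any(phrase in text for phrase in ERROR_PHRASES):
        return True
    # Nearly empty pages are a second typical symptom.
    return len(text.strip()) < min_length


print(looks_like_soft_404(200, "Sorry, page not found."))
print(looks_like_soft_404(404, "Sorry, page not found."))
```

The fix for every flagged URL is the same: return a real 404 or 410 status, or redirect to a relevant replacement.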

For a detailed optimization guide, read my specialized guide: Optimize Crawl Budget: Get Your Content Indexed Faster.

Controlling Crawling: robots.txt, Sitemaps & Search Console

Key Takeaway: robots.txt only blocks crawling, not indexing. For true indexing control, you need the noindex tag. XML sitemaps and Search Console complement your control toolset.

You have several tools to influence how Google crawls your website. The most important is the robots.txt file, which sits in your domain’s root directory and gives instructions to crawlers.

robots.txt – Access Control

With robots.txt, you can exclude certain areas of your website from crawling. This makes sense for admin areas, checkout processes, or internal search pages that don’t belong in the index. You can also define different rules for different crawlers – treating AdsBot differently from regular Googlebot, for example.

# Example robots.txt
# Rules for all crawlers that have no more specific group below
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /search?

# AdsBot ignores the * group and must be addressed by name
User-agent: AdsBot-Google
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Warning: robots.txt only prevents crawling, not indexing! If other pages link to a URL blocked by robots.txt, Google can still index it – just without knowing the content. The result is a search result without a snippet. For true indexing control, you need the noindex tag.
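Before deploying robots.txt changes, it helps to test which crawler may fetch which URL. The sketch below uses Python's standard-library `urllib.robotparser` with a shortened rule set. Note the assumptions: this parser only approximates Google's own matching (it does not implement wildcard patterns, for example), and the tested paths are invented.

```python
from urllib.robotparser import RobotFileParser

# Shortened example rules; the AdsBot group is deliberately different.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/

User-agent: AdsBot-Google
Disallow: /checkout/
"""


def load_rules(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for agent in ("Googlebot", "AdsBot-Google"):
    for path in ("/admin/users", "/checkout/cart", "/products/shoe"):
        allowed = rules.can_fetch(agent, "https://example.com" + path)
        print(f"{agent:14} {path:16} {'allowed' if allowed else 'blocked'}")
```

The output makes one pitfall visible: a crawler that matches a named group follows only that group. In this rule set, AdsBot-Google may fetch /admin/users because its own group does not repeat that Disallow line.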

XML Sitemaps – Your Website’s Map

An XML sitemap lists all important URLs on your website and helps Google with discovery. It’s especially valuable for new websites without many backlinks, large websites with complex structures, and pages that aren’t well internally linked.

There are several best practices for sitemap maintenance. Only include indexable, canonical URLs – no pages with noindex, no duplicates, no URLs that redirect. Only update the lastmod date when the content actually changes: Google learns whether your lastmod values are reliable and ignores them if they are not, as John Mueller has explicitly confirmed. Each sitemap may contain a maximum of 50,000 URLs and 50 MB uncompressed; for larger websites, use a sitemap index. Always submit the sitemap in Google Search Console. For Google Discover traffic, you should also set max-image-preview:large in meta robots.
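The filtering and the 50,000-URL limit described above can be automated. The following is a minimal sketch that assumes you already have a crawl export; the dictionary keys (`url`, `canonical`, `noindex`, `status`, `lastmod`) and the sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def build_sitemaps(pages):
    """Return sitemap XML strings containing only indexable, canonical URLs.

    Each page is a dict with the hypothetical keys url, canonical, noindex,
    status, and lastmod (date of the last real content change).
    """
    eligible = [
        p for p in pages
        if p["status"] == 200 and not p["noindex"] and p["canonical"] == p["url"]
    ]
    sitemaps = []
    # Split into chunks so that no file exceeds the per-sitemap URL limit.
    for start in range(0, len(eligible), MAX_URLS_PER_SITEMAP):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for page in eligible[start:start + MAX_URLS_PER_SITEMAP]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = page["url"]
            ET.SubElement(url, "lastmod").text = page["lastmod"]
        sitemaps.append(ET.tostring(urlset, encoding="unicode"))
    return sitemaps


# Made-up crawl export: only the first page qualifies.
pages = [
    {"url": "https://example.com/shoes/", "canonical": "https://example.com/shoes/",
     "noindex": False, "status": 200, "lastmod": "2026-03-02"},
    {"url": "https://example.com/shoes/?sort=price", "canonical": "https://example.com/shoes/",
     "noindex": False, "status": 200, "lastmod": "2026-03-02"},
    {"url": "https://example.com/old/", "canonical": "https://example.com/old/",
     "noindex": False, "status": 301, "lastmod": "2025-11-20"},
    {"url": "https://example.com/cart/", "canonical": "https://example.com/cart/",
     "noindex": True, "status": 200, "lastmod": "2026-01-15"},
]

sitemaps = build_sitemaps(pages)
print(sitemaps[0])
```

If the function returns more than one string, each one becomes its own file and a sitemap index references them all.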

Search Console – The Direct Connection to Google

Search Console gives you a direct communication line to Google. With URL Inspection, you can check the exact status of every single URL. You see when it was last crawled, whether it’s indexed, which canonical Google selected, and which enhancements such as structured data were detected. For important new pages, you can directly request indexing – though this is no guarantee, just a signal to Google.

The Page Indexing report under “Indexing → Pages” shows you all crawling and indexing problems at a glance. The current view distinguishes between “Indexed” (green) and “Not Indexed” (gray) with detailed sub-reasons. Here you’ll find pages blocked by robots.txt, accidentally set to noindex, or recognized as duplicates. The Removals tool lets you temporarily remove URLs from search results – useful for sensitive content or critical errors.

What Is Indexing? From Caffeine to Index Tiers

Key Takeaway: Indexing is the process by which Google analyzes, understands, and stores crawled pages in its database. Since 2010, the Caffeine system works incrementally and in near real-time. Only indexed pages can rank.

Indexing is the process by which Google analyzes, understands, and stores crawled pages in its database. Only indexed pages can appear in search results. Crawling alone isn’t enough – a page can be crawled but still not indexed if Google deems it low-quality or redundant.

The Caffeine System

Since 2010, Google has used the Caffeine system for indexing. Unlike the old system, which updated the entire index in large batches, Caffeine works incrementally and in near real-time. This means new content can appear in the index much faster – provided it meets quality criteria.

Once Googlebot has downloaded a page, the actual analysis begins. First, the HTML code is converted into a DOM structure – a process called parsing. Then Google extracts all relevant content: text, images, videos, and structured data are captured and categorized.

Structured data deserves special attention. These are machine-readable pieces of information in JSON-LD, Microdata, or RDFa format that help Google unambiguously understand a page’s content. A recipe is recognized as a recipe (with ingredients, cooking time, ratings), a product as a product (with price, availability, reviews), an FAQ as an FAQ. This data enables rich snippets in search results – eye-catching additional information like star ratings, prices, or availability that can significantly boost your click-through rate. FAQ rich results are the exception: since August 2023, Google shows them almost exclusively for well-known government and health websites.
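To make the product example concrete, here is a sketch that builds a schema.org Product object and wraps it in the script tag that belongs in the page's HTML. The product name and price are invented, and only a handful of properties are shown; which properties a given rich result requires is defined in Google's structured data documentation, not in this sketch.

```python
import json


def product_json_ld(name, price, currency, in_stock):
    """Build a schema.org Product object with a nested Offer."""
    availability = "InStock" if in_stock else "OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }


def as_script_tag(data):
    """Wrap the object in the JSON-LD script tag for the page's HTML."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


# Hypothetical product for illustration.
snippet = as_script_tag(product_json_ld("Trail Running Shoe", 89.9, "EUR", True))
print(snippet)
```

Generating the markup from the same data source as the visible page helps keep price and availability in the structured data consistent with what users actually see.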

Particularly important is the linguistic analysis. Here, Google identifies the document’s language, recognizes topics covered, and connects entities like people, places, and concepts to the Knowledge Graph. This semantic analysis goes far beyond simple keyword matching – more on this in my article on Semantic Search & the Knowledge Graph.

Simultaneously, Google checks whether the content is a duplicate or variation of existing pages. Similar content is grouped, and Google selects a canonical version as the “original.” Initial quality signals are also captured – E-E-A-T factors already play a role here. At the end of the process, the page is stored in the appropriate index tier.

What Signals Influence Indexing Priority

Google decides based on various factors whether and how quickly a page gets indexed. Internal linking plays a central role: how prominent is the page in your site structure? Pages reachable from the homepage in few clicks are rated as more important. External backlinks amplify this effect – when trustworthy websites link to your page, it signals to Google that the content is relevant.

Content quality itself is naturally decisive. Is the content unique and does it offer real value? Or is it yet another generic page on an oversaturated topic? Technical signals like loading speed and mobile usability also factor in. And finally, the authority of the entire domain plays a role – established websites with good reputation get a trust bonus.

The Three Index Levels: Base, Zeppelins & Landfills

Key Takeaway: The 2024 Google API Leaks revealed that Google manages multiple index levels. According to Mike King’s analysis (iPullRank), Base, Zeppelins, and Landfills correspond to different storage hierarchies – from fast flash memory to standard hard drives.

The 2024 Google API Leaks delivered one of the most exciting revelations: Google doesn’t manage a unified index but multiple levels with different priorities and update frequencies. A deep analysis of these leak findings can be found in: Google Leak: Why User Signals Matter More Than Google Admitted.

The leaked API documents contain the attribute scaledSelectionTierRank with the reference “over the serving tier (Base, Zeppelins, Landfills).” The exact tier names are thus confirmed – but the interpretation of which tier corresponds to which storage type comes from Mike King’s analysis at iPullRank (May 2024). King inferred from contextual hints in the documentation that the tiers follow a physical storage hierarchy:

Index Tier | Storage Type (per King) | Update Frequency | Typical Content
Base Index | Flash Memory | Frequent (hours to days) | High-quality main pages, news, authoritative domains
Zeppelins | Solid State Drives (SSD) | Occasional (weeks) | Archive pages, deep hierarchies, lower authority
Landfills | Standard Hard Drives (HDD) | Rare (months or never) | Old content, low-quality pages, rarely linked URLs

Note on sources: The tier names Base, Zeppelins, and Landfills come directly from the leaked Google API documents. The mapping to specific storage types (Flash/SSD/HDD) is Mike King’s interpretive conclusion – not confirmed by Google. King bases this on contextual hints like “For TeraGoogle, this data resides in very limited serving memory (Flash storage)” from the internal documentation.

Pages in the Base Index are considered for competitive keywords. Pages in Zeppelins can rank but have worse chances for competitive queries. Content in Landfills has practically no chance of rankings for relevant queries – regardless of how good the content theoretically is.

What Influences Tier Assignment

The leaks point to several factors that determine which tier a page lands in. The siteAuthority signal evaluates the entire domain – yes, a type of domain authority actually exists, even though Google denied it for a long time. Good old PageRank still plays a role, albeit in modified form. Content freshness and user engagement also factor in. Finally, change frequency – how often the page’s content changes and how often Google therefore recrawls it – also influences tier assignment.

Practical tip: Pages in the Landfills tier have virtually no chance of ranking for competitive keywords. If important pages end up there, you need to improve their quality, linking, and freshness to “promote” them to higher tiers. After Google Core Updates, tier assignment can change – both positively and negatively.

Controlling Indexing: Canonical, noindex & hreflang

Key Takeaway: Canonical tags, noindex, and hreflang are essential control instruments. The canonical is a hint, not a directive – Google can decide otherwise. hreflang must be implemented bidirectionally.

You have several ways to actively influence indexing. These tags and signals aren’t optional extras but essential tools for any professional SEO strategy.

Canonical Tags – Defining the Preferred Version

For duplicates or very similar pages, the canonical tag points to the “original” version. This is especially important when the same content is accessible under multiple URLs – such as a product linked through different categories or parameter URLs for sorting and filtering.

<link rel="canonical" href="https://example.com/original-page/" />

The canonical tag is a hint, not a directive. Google can choose to treat a different URL as canonical if signals are contradictory. That’s why consistency matters: internal links, sitemap entries, and canonical should all point to the same URL.
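Whether the three signals agree can be checked automatically. The sketch below extracts the canonical tag from a page's HTML and compares it with the sitemap entry and the internal link targets. The function name and the input format are assumptions; in practice the inputs would come from your crawler export.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_conflicts(html, sitemap_url, internal_link_targets):
    """List signals that disagree with the page's declared canonical URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return ["no canonical tag found"]
    conflicts = []
    if sitemap_url != finder.canonical:
        conflicts.append(f"sitemap lists {sitemap_url}")
    for target in internal_link_targets:
        if target != finder.canonical:
            conflicts.append(f"internal link points to {target}")
    return conflicts


page_html = '<head><link rel="canonical" href="https://example.com/original-page/" /></head>'
conflicts = canonical_conflicts(
    page_html,
    sitemap_url="https://example.com/original-page/",
    internal_link_targets=[
        "https://example.com/original-page/",
        "https://example.com/original-page/?ref=nav",  # contradicts the canonical
    ],
)
print(conflicts)
```

An empty list means all signals point to the same URL. Every entry in the list is a contradiction that gives Google a reason to pick a different canonical than the one you declared.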

Meta Robots – Selectively Preventing Indexing

<meta name="robots" content="noindex, follow" />

The “noindex, follow” combination is particularly useful: the page itself isn’t indexed, but Google still follows its links. This is ideal for overview pages serving only navigation, or login areas whose content shouldn’t appear in search but link to indexable content.
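Accidental noindex directives are easy to catch with a small check. The sketch below reads the meta robots tag and, optionally, the value of an X-Robots-Tag response header, which can carry the same directives. It is deliberately simplified: crawler-specific variants such as a `googlebot` meta tag or a user-agent prefix in the header are not handled.

```python
from html.parser import HTMLParser


def split_directives(value):
    """Turn 'noindex, follow' into {'noindex', 'follow'}."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def indexing_directives(html, x_robots_tag=""):
    """Merge directives from the HTML and an optional X-Robots-Tag header value."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives | split_directives(x_robots_tag)


def is_indexable(html, x_robots_tag=""):
    """A page is excluded if 'noindex' or 'none' appears in either place."""
    return not ({"noindex", "none"} & indexing_directives(html, x_robots_tag))


print(is_indexable('<meta name="robots" content="noindex, follow" />'))
print(is_indexable('<meta name="robots" content="max-image-preview:large">'))
```

Checking the header as well matters because a noindex set at server level is invisible in the page source, which is exactly why it tends to be overlooked.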

hreflang – Linking International Versions

<link rel="alternate" hreflang="de" href="https://example.com/de/page/" />
<link rel="alternate" hreflang="en" href="https://example.com/en/page/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/" />

Crucially, hreflang must be bidirectional: if page A references page B, page B must also reference page A. Faulty hreflang implementations are one of the most common technical SEO problems with international websites.
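The bidirectional requirement can be verified mechanically once you know which alternates each page declares. A minimal sketch, assuming the hreflang annotations have already been extracted into a dictionary (the structure and URLs are illustrative):

```python
def missing_return_links(hreflang_map):
    """Find hreflang references that are not reciprocated.

    `hreflang_map` maps each page URL to the set of alternate URLs it declares.
    Returns sorted (source, target) pairs where target does not link back.
    """
    missing = []
    for source, targets in hreflang_map.items():
        for target in targets:
            if target == source:
                continue  # self-references are fine
            if source not in hreflang_map.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


pages = {
    "https://example.com/de/page/": {
        "https://example.com/de/page/",
        "https://example.com/en/page/",
    },
    # The English page forgot to reference the German alternate.
    "https://example.com/en/page/": {"https://example.com/en/page/"},
}
print(missing_return_links(pages))
```

Each reported pair names the page that declares an alternate and the page that fails to link back. Targets that were never crawled at all are reported too, since they cannot confirm the relationship.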

JavaScript & Rendering: The Hidden Hurdle

Key Takeaway: Google crawls and renders in two separate steps. JavaScript content is only processed by the Web Rendering Service (WRS) – with potential delays. Server-Side Rendering is the most robust solution.

Modern websites often use JavaScript to dynamically load content. What creates a smooth experience for users poses significant challenges for Google. Googlebot can execute JavaScript, but the process is complex and delayed.

Google’s Two-Stage Rendering Process

Google crawls and renders in two separate steps. On the first crawl, only the initial HTML code is captured. The page is then queued for execution by the Web Rendering Service (WRS). Only there is JavaScript executed and the fully rendered DOM extracted. A second indexing with the rendered content follows.

The problem is the time delay between first crawl and rendering. Based on current analysis, this can take hours, days, or in individual cases even weeks – depending on Google’s current capacity and your website’s priority.

Solutions for JavaScript SEO

Server-Side Rendering (SSR) is the most robust solution. Dynamic Rendering is a middle ground, although Google now describes it as a workaround rather than a long-term solution. At minimum, practice Progressive Enhancement: core content must be available without JavaScript, and internal links must exist as real HTML links in the initial HTML.

Warning: Single Page Applications (SPAs) that only load content after user interaction are often not fully captured by Google. The bot doesn’t click buttons, fill out forms, or scroll. Everything requiring user action remains invisible to Google.

Mobile-First Indexing: The New Standard

Key Takeaway: Google officially completed Mobile-First Indexing for all websites in October 2023. Googlebot Smartphone is the primary crawler – the mobile version of your website determines indexing and ranking.

Google officially completed the transition to Mobile-First Indexing in October 2023, as confirmed by John Mueller on the Google blog. Googlebot Smartphone has been the primary crawler since then, and the mobile version of your website is the basis for indexing and ranking.

The transition had a long history: Google first announced Mobile-First Indexing in November 2016, began the rollout in March 2018, set multiple deadlines (September 2020, then March 2021) which were each postponed, and carried out the last batch in May 2023. The official confirmation of completion came on October 31, 2023.

More than 60% of all Google searches now come from mobile devices. For more on mobile optimization, see Core Web Vitals & Page Experience: The Complete Optimization Guide.

Mobile-First Checklist

  • Content parity: All important texts, images, and videos are identical on mobile and desktop
  • Meta tags: Title, description, and robots tags are present on mobile
  • Structured data: JSON-LD is also embedded in the mobile version
  • Internal links: Mobile navigation contains all important links
  • Touch targets: Buttons and links are at least 48×48 pixels
  • Readability: Font size at least 16px, no horizontal scrolling needed

Diagnosing Crawling & Indexing Problems

Key Takeaway: The Search Console Page Indexing report is your most important diagnostic tool. The current view distinguishes between “Indexed” and “Not Indexed” with detailed sub-reasons. Log file analysis reveals Google’s actual crawling behavior.

Google Search Console is your most important tool for identifying crawling and indexing problems. The Page Indexing report under “Indexing → Pages” shows all URLs and their current status.

Understanding the Two Main Categories

Status | Color | Meaning | Action
Indexed | Green | Page is in Google’s index and can appear in search results | All OK ✓
Not Indexed | Gray | Page is not in the index for a specific reason | Check the detailed sub-reason – intentional or error?

Common Exclusion Reasons and Solutions

  • Blocked by robots.txt: Crawling is prevented. Check if this is intentional.
  • Noindex tag detected: You or a plugin explicitly deactivated indexing.
  • Duplicate without canonical: Google chose a canonical itself because you didn’t define one.
  • Discovered – currently not indexed: Google knows the URL but hasn’t crawled it yet.
  • Crawled – currently not indexed: Google doesn’t consider the page worth indexing – often a quality problem.

Diagnostic tip: With URL Inspection, you can examine the exact status of every single URL. You see the last crawl date, indexing status, the canonical Google selected, and detected enhancements such as structured data, and you can even perform a live test to see how Google currently perceives the page.

Log File Analysis – Behind the Scenes

Search Console shows what Google has indexed. But it doesn’t show what Google actually does on your website. For that, you need a log file analysis. Server logs record every single request – including Googlebot’s. You can use specialized tools like Screaming Frog Log File Analyser or JetOctopus.
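If you want a first impression before reaching for a dedicated tool, a few lines of Python are enough. The sketch below assumes the common "combined" access log format used by Apache and nginx; the sample lines are made up, and your server's format may differ.

```python
import re
from collections import Counter

# Matches the common "combined" access log format (an assumption about your server).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per (path, status) whose user agent claims to be Googlebot.

    User agents can be spoofed; a production audit should also verify the
    requesting IP address, which this sketch omits.
    """
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), int(match.group("status")))] += 1
    return hits


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [  # made-up sample lines
    f'66.249.66.1 - - [10/Mar/2026:06:25:14 +0000] "GET /shoes/ HTTP/1.1" 200 5123 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2026:06:25:15 +0000] "GET /shoes/?color=red&size=42 HTTP/1.1" 200 4980 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [10/Mar/2026:06:25:16 +0000] "GET /shoes/ HTTP/1.1" 200 5123 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    f'66.249.66.1 - - [10/Mar/2026:06:25:17 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "{GOOGLEBOT_UA}"',
]

hits = googlebot_hits(SAMPLE_LOG)
for (path, status), count in hits.most_common():
    print(status, count, path)
```

Sorted by count, such a table quickly shows whether Googlebot spends its requests on your important pages or on parameter URLs and error pages.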

Best practice: Perform a complete technical audit with Screaming Frog or Sitebulb at least once per quarter. More frequently for large websites or after relaunches. Search Console alone isn’t enough to find all problems.

Infographic: The Journey of a URL into Google’s Index

[Infographic: the journey of a URL through crawling and indexing into Google’s index, from discovery through rendering to tier assignment. It shows the complete process from URL discovery to index tier, including the storage hierarchy derived from the 2024 API Leaks (interpretation: Mike King / iPullRank). Source: seo-kreativ.de – Christian Ott]

Conclusion: The Invisible Foundation of Your Rankings

Key Takeaway: Technical SEO is not an optional add-on but the foundation. Crawling and indexing determine whether your content even gets a chance at rankings. The Google API Leaks confirmed: quality starts with technical infrastructure.

Crawling and indexing are the often-overlooked fundamentals of every successful SEO strategy. The best keyword research and the most valuable content are useless if Google can’t find or understand your pages.

The four pillars of technical SEO: Discoverability through clean site structure, current sitemaps, and thoughtful internal linking. Crawlability through fast servers, no technical blockades, and efficient crawl budget usage. Indexability through correct canonical tags, no accidental noindex, and unique content. And Renderability through JavaScript-friendly architecture and Mobile-First optimization.

“What Google can’t crawl will never rank. What Google doesn’t index will never be found.”

Invest time in the technical foundations. For the big picture of how these phases work together in Google’s overall system, read the main article: How Does the Google Search Algorithm Work? And to understand what happens after indexing – how Google processes and interprets search queries – continue to the next chapter: Query Processing: How Google Understands Your Search Query.

Checklist: Regularly check your website for crawling and indexing problems. Use Search Console as a mandatory tool, supplemented by quarterly technical audits with Screaming Frog or Sitebulb. And remember: every page you want indexed must be crawlable, indexable, and renderable.

Frequently Asked Questions (FAQ)

How long does it take for Google to index my new page?

This varies widely – from a few hours to several weeks. Key factors are your website’s authority, crawl frequency, content quality, and how the page is discovered. For established websites with good reputation, a new page can be indexed within hours. For new websites without authority or backlinks, it often takes weeks.

Why isn’t my page being indexed?

The most common reasons are technical: an accidentally set noindex tag, a robots.txt block, a canonical tag pointing to another page, or too few internal links. It can also be a quality issue – if Google considers the content thin, duplicate, or low-quality, it won’t be indexed. Use URL Inspection in Search Console to identify the exact status.

What’s the difference between crawling and indexing?

Crawling means Google finds and downloads your page. Indexing means Google analyzes, understands, and stores the page in its database. A page can be crawled but not indexed – for example, if Google recognizes it as a duplicate or considers it low quality. Only indexed pages can appear in search results.

How do I know which index tier my pages are in?

Google doesn’t publicly communicate tier assignment. But there are indicators: if a page is indexed but never ranks for relevant keywords, it’s likely in a lower tier. Pages that get re-crawled quickly after content updates are probably in higher tiers. Regular updates, good linking, and user engagement help with “promotion” to higher tiers.

What’s better: robots.txt Disallow or noindex?

It depends on your goal. robots.txt prevents crawling, saving crawl budget, but doesn’t reliably prevent indexing. noindex allows crawling but definitively prevents indexing. Don’t combine the two for the same URL: if robots.txt blocks the page, Google never gets to see the noindex tag. For content that must absolutely not appear in search results, noindex is the safer choice.

Can I speed up indexing of my pages?

Partially. Helpful measures include: submitting the URL for indexing in Search Console, setting strong internal links from already indexed pages, including the page in the sitemap, creating high-quality and unique content, and building external links from trustworthy websites. There’s no guarantee of fast indexing – Google ultimately decides based on relevance, quality, and available resources.

Last updated: April 24, 2026 – Content refresh: Source attributions clarified, outdated facts and dates corrected, TL;DR restructured, new infographic.
Christian Ott - Founder of www.seo-kreativ.de

Christian Ott – Creative SEO Thinking & Knowledge Sharing

As the founder of SEO-Kreativ, I live out my passion for SEO, which I discovered in 2014. My journey from hobby blogger to SEO expert and product developer has shaped my approach: I share knowledge in a clear, practical way, without jargon.