Key Takeaways:
- Before your website can rank, Google must find it (crawling) and understand it (indexing). This two-stage process is the invisible foundation of every SEO strategy – and the most common cause of missing rankings.
- Googlebot systematically explores the web via links, sitemaps, and Search Console. The crawl budget determines how often and how many pages are visited.
- Google’s Caffeine system analyzes crawled pages in near real time and stores them in different index tiers (Base, Zeppelins, Landfills). The tier-to-storage mapping comes from Mike King’s analysis of the API Leaks (iPullRank, 2024).
- Canonical tags, robots.txt, meta robots, hreflang, and structured data are essential – errors here mean Google can’t find or understand your content.
- Since October 2023, Google has crawled and indexed primarily the mobile version of your website. Desktop-only content may not be captured at all.
Before your website can appear in search results, Google first needs to find and understand it. This process – consisting of crawling and indexing – forms the invisible foundation of every SEO strategy. Without it, even the best content is worthless.
In my technical SEO audits, I see it time and again: teams invest months in keyword research and content production – only to discover in Search Console that hundreds of their most important URLs carry the status “Crawled – currently not indexed.” They’re optimizing for rankings that can never happen because the technical foundation is missing. It’s like opening a shop and not registering the address on Google Maps – your offering may be excellent, but nobody can find you.
This article is part of my comprehensive guide to the Google Search Algorithm: From Crawling to Ranking. Here we dive deep into the technical foundations – with insights from the Google API Leaks 2024, which revealed internal system names like index tiers and crawl prioritization for the first time. Once your pages are indexed, the next phases take over: Query Processing, Ranking, and Re-Ranking with Twiddlers.
Why Crawling & Indexing Are the Foundation of SEO
A web page’s journey to search results follows a clear hierarchy: crawling, then indexing, then ranking, and finally visibility. If any of these stages fails, your content never reaches users. The brutal truth is: most SEO problems are not ranking problems but crawling or indexing problems.
The Google API Leaks confirmed what many SEOs had long suspected: Google doesn’t manage a single index but multiple levels (tiers) with different priorities. A page in the so-called “Landfills” tier is practically never shown for important search queries – even if the content is excellent. The goal must therefore be not just to get indexed, but to end up in the right index tier.
A Typical Scenario from Practice
Consider a hypothetical but realistic example: a mid-sized online shop with thousands of product pages – yet only a few hundred rank for relevant keywords. Search Console shows “Crawled – currently not indexed” for a large portion of URLs. The typical cause? Faceted navigation generating tens of thousands of filter URLs without unique content, devouring Google’s crawl budget. This is exactly the pattern I regularly see in my technical audits.
The solution typically involves three steps: First, non-canonical filter combinations are blocked from crawling via robots.txt or consolidated onto the main URL via canonical tags. Second, important category and product pages receive improved internal linking. Third, the sitemap is cleaned up to include only genuinely indexable URLs. In my experience, indexing rates typically increase significantly after such measures – organic traffic often doubles within a few months.
What Is Crawling? Googlebot Explained
Crawling is the process by which Google’s automated programs – crawlers or spiders – explore the web to discover new and updated pages. The most well-known of these crawlers is Googlebot, but Google actually operates an entire family of specialized crawlers for different content types.
How Googlebot Works
Googlebot works like a tireless reader jumping from link to link. It starts with a list of known URLs from previous crawls, submitted sitemaps, or Search Console. For each URL, it sends an HTTP request to the server and downloads the HTML code, extracting all links on the page and adding them to its queue.
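The queue-based discovery described above can be sketched in a few lines of Python. This is a simplified illustration, not Googlebot's actual implementation: the `crawl` function, the `fetch` callback, and the in-memory `PAGES` dictionary are all assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seed_urls, fetch):
    """Breadth-first crawl; `fetch(url)` returns HTML or None (hypothetical)."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

crawl_order = crawl(["https://example.com/"], PAGES.get)
print(crawl_order)
```

A page that no other page links to never enters the queue in this model, which is why orphaned URLs depend on sitemaps or manual submission to be discovered.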
For modern JavaScript-heavy websites, the process is more complex. Googlebot first downloads only the initial HTML and then places the page in a separate rendering queue. There, JavaScript is executed and the fully rendered DOM is analyzed. This two-stage process can cause delays, which is why server-side rendering is so important for SEO-critical pages.
Google deploys specialized crawlers for different tasks:
| Crawler | User-Agent | Task |
|---|---|---|
| Googlebot Smartphone | Googlebot/2.1 (Mobile) | Mobile-First crawling (primary since Oct. 2023) |
| Googlebot Desktop | Googlebot/2.1 | Desktop pages (secondary only) |
| Googlebot Images | Googlebot-Image/1.0 | Images for Google Image Search |
| Googlebot Video | Googlebot-Video/1.0 | Videos for Video Search |
| Googlebot News | Googlebot-News | News content for Google News |
| AdsBot | AdsBot-Google | Landing page quality for Google Ads |
Crawl Budget: How Google Sets Priorities
Google can’t crawl every URL on the internet simultaneously – even with massive infrastructure. That’s why Google assigns each website a crawl budget: the number of pages Googlebot can and wants to crawl within a given timeframe.
The crawl budget consists of two components. The first is the Crawl Rate Limit – the maximum crawl frequency without overloading your server. Google automatically adjusts this limit based on your server response times. The second component is Crawl Demand – how much does Google actually “want” to crawl your pages? This demand is based on page popularity, freshness, and perceived importance. Frequently linked and often updated pages have significantly higher priority.
What Wastes Your Crawl Budget
Certain technical problems can massively waste your crawl budget. Duplicate content is one of the most common culprits: when the same content is accessible under multiple URLs, Google crawls each separately. Faceted navigation in online shops often creates thousands of filter combinations without unique content. Session IDs in URLs generate infinite URL variants for identical content.
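To estimate how many crawled URLs are really variants of the same page, you can normalize them and group the results. The following is a minimal sketch under stated assumptions: the parameter names in `IGNORED_PARAMS` are hypothetical, and which parameters are safe to ignore depends entirely on your own site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; the right list depends on your own site.
IGNORED_PARAMS = {"sessionid", "sid", "sort", "color", "size"}


def normalize(url):
    """Drop ignored query parameters and sort the rest for a stable key."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the crawled variants that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


crawled = [
    "https://example.com/shoes?color=red&sort=price",
    "https://example.com/shoes?sessionid=abc123",
    "https://example.com/shoes",
    "https://example.com/shoes?page=2",
]
duplicates = group_duplicates(crawled)
for canonical_url, variants in duplicates.items():
    print(canonical_url, "<-", len(variants), "variants")
```

Run against the URL list from a crawl or from your server logs, the size of each group gives a rough picture of how much crawl activity goes to filter and session variants.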
Particularly insidious are so-called soft 404 pages: pages that show users an error message but return a 200 status to Googlebot. The bot crawls them repeatedly without recognizing they’re worthless. Similarly problematic are “infinite spaces” like endlessly paginated archives or calendars that can theoretically generate unlimited URLs. And if your website has been hacked, spam pages can consume your entire crawl budget – this is where Google’s SpamBrain system steps in to detect manipulative content.
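A rough way to spot error pages that answer with a 200 status is a simple heuristic over status code and body text. This is only a sketch: the phrases in `ERROR_PHRASES` and the `min_length` threshold are assumptions, and the check says nothing about how Google itself classifies such pages.

```python
# Hypothetical phrases; adapt them to the wording of your own error templates.
ERROR_PHRASES = ("page not found", "no longer available", "0 results")


def is_soft_404(status_code, body_text, min_length=200):
    """Heuristic: a 200 response that reads like an error or is nearly empty."""
    if status_code != 200:
        return False  # Real error codes are not "soft" errors.
    text = body_text.lower()
    if any(phrase in text for phrase in ERROR_PHRASES):
        return True
    return len(text.strip()) < min_length


print(is_soft_404(200, "Sorry, page not found"))
print(is_soft_404(404, "Page not found"))
```

Pages flagged this way should return a real 404 or 410 status so crawlers stop revisiting them.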
For a detailed optimization guide, read my specialized guide: Optimize Crawl Budget: Get Your Content Indexed Faster.
Controlling Crawling: robots.txt, Sitemaps & Search Console
You have several tools to influence how Google crawls your website. The most important is the robots.txt file, which sits in your domain’s root directory and gives instructions to crawlers.
robots.txt – Access Control
With robots.txt, you can exclude certain areas of your website from crawling. This makes sense for admin areas, checkout processes, or internal search pages that don’t belong in the index. You can also define different rules for different crawlers – treating AdsBot differently from regular Googlebot, for example.
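# Example robots.txt
# Rules for all crawlers that have no more specific group below
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /search?

# AdsBot ignores the * group and must be addressed by name.
# Note: Googlebot does not support Crawl-delay, so it is omitted here.
User-agent: AdsBot-Google
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml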
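Before deploying robots.txt rules, it helps to test them against a few real URLs. The sketch below uses Python's standard `urllib.robotparser`; the rules and paths are illustrative. Keep in mind that this parser is not Google's own: it does simple prefix matching and does not understand wildcards, so treat the result as a first sanity check only.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/admin/users", "/checkout/cart", "/products/red-shoe"):
    url = "https://example.com" + path
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```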
XML Sitemaps – Your Website’s Map
An XML sitemap lists all important URLs on your website and helps Google with discovery. It’s especially valuable for new websites without many backlinks, large websites with complex structures, and pages that aren’t well internally linked.
There are several best practices for sitemap maintenance. Only include indexable, canonical URLs – no pages with noindex, no duplicates, no redirecting URLs. Only update the lastmod date when actual content changes are made, because Google learns whether you’re “lying” and then completely ignores your lastmod data – John Mueller has explicitly confirmed this. Each sitemap allows a maximum of 50,000 URLs; for larger websites, use a sitemap index. Always submit the sitemap in Google Search Console. For Google Discover traffic, you should also set max-image-preview:large in meta robots.
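These rules are easy to enforce when the sitemap is generated from page data rather than maintained by hand. The following sketch assumes a hypothetical list of page records with the keys `url`, `canonical`, `noindex`, and `lastmod`; your CMS or crawler export will likely use different names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def build_sitemaps(pages, chunk_size=MAX_URLS_PER_SITEMAP):
    """Return a list of sitemap XML strings containing only indexable pages.

    `pages` is a list of dicts with hypothetical keys: url, canonical,
    noindex, and lastmod (date of the last real content change).
    """
    indexable = [
        page for page in pages
        if not page["noindex"] and page["canonical"] == page["url"]
    ]
    sitemaps = []
    for start in range(0, len(indexable), chunk_size):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for page in indexable[start:start + chunk_size]:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = page["url"]
            ET.SubElement(entry, "lastmod").text = page["lastmod"]
        sitemaps.append(ET.tostring(urlset, encoding="unicode"))
    return sitemaps


pages = [
    {"url": "https://example.com/shoes", "canonical": "https://example.com/shoes",
     "noindex": False, "lastmod": "2024-05-01"},
    {"url": "https://example.com/shoes?sort=price", "canonical": "https://example.com/shoes",
     "noindex": False, "lastmod": "2024-05-01"},
    {"url": "https://example.com/login", "canonical": "https://example.com/login",
     "noindex": True, "lastmod": "2024-01-15"},
]
sitemaps = build_sitemaps(pages)
print(sitemaps[0])
```

When the function returns more than one sitemap, each file would be referenced from a sitemap index, which this sketch leaves out.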
Search Console – The Direct Connection to Google
Search Console gives you a direct communication line to Google. With URL Inspection, you can check the exact status of every single URL. You see when it was last crawled, whether it’s indexed, which canonical Google recognized, and whether there are mobile usability issues. For important new pages, you can directly request indexing – though this is no guarantee, just a signal to Google.
The Page Indexing report under “Indexing → Pages” shows you all crawling and indexing problems at a glance. The current view distinguishes between “Indexed” (green) and “Not Indexed” (gray) with detailed sub-reasons. Here you’ll find pages blocked by robots.txt, accidentally set to noindex, or recognized as duplicates. The Removals tool lets you temporarily remove URLs from search results – useful for sensitive content or critical errors.
What Is Indexing? From Caffeine to Index Tiers
Indexing is the process by which Google analyzes, understands, and stores crawled pages in its database. Only indexed pages can appear in search results. Crawling alone isn’t enough – a page can be crawled but still not indexed if Google deems it low-quality or redundant.
The Caffeine System
Since 2010, Google has used the Caffeine system for indexing. Unlike the old system, which updated the entire index in large batches, Caffeine works incrementally and in near real-time. This means new content can appear in the index much faster – provided it meets quality criteria.
Once Googlebot has downloaded a page, the actual analysis begins. First, the HTML code is converted into a DOM structure – a process called parsing. Then Google extracts all relevant content: text, images, videos, and structured data are captured and categorized.
Structured data deserves special attention. These are machine-readable pieces of information in JSON-LD, Microdata, or RDFa format that help Google unambiguously understand a page’s content. A recipe is recognized as a recipe (with ingredients, cooking time, ratings), a product as a product (with price, availability, reviews), an FAQ as an FAQ. This data enables rich snippets in search results – the eye-catching additional information like star ratings or prices that can significantly boost your click-through rate. Note that since August 2023, Google shows FAQ rich results only for well-known, authoritative government and health websites.
Particularly important is the linguistic analysis. Here, Google identifies the document’s language, recognizes topics covered, and connects entities like people, places, and concepts to the Knowledge Graph. This semantic analysis goes far beyond simple keyword matching – more on this in my article on Semantic Search & the Knowledge Graph.
Simultaneously, Google checks whether the content is a duplicate or variation of existing pages. Similar content is grouped, and Google selects a canonical version as the “original.” Initial quality signals are also captured – E-E-A-T factors already play a role here. At the end of the process, the page is stored in the appropriate index tier.
What Signals Influence Indexing Priority
Google decides based on various factors whether and how quickly a page gets indexed. Internal linking plays a central role: how prominent is the page in your site structure? Pages reachable from the homepage in few clicks are rated as more important. External backlinks amplify this effect – when trustworthy websites link to your page, it signals to Google that the content is relevant.
Content quality itself is naturally decisive. Is the content unique and does it offer real value? Or is it yet another generic page on an oversaturated topic? Technical signals like loading speed and mobile usability also factor in. And finally, the authority of the entire domain plays a role – established websites with good reputation get a trust bonus.
The Three Index Levels: Base, Zeppelins & Landfills
The 2024 Google API Leaks delivered one of the most exciting revelations: Google doesn’t manage a unified index but multiple levels with different priorities and update frequencies. A deep analysis of these leak findings can be found in: Google Leak: Why User Signals Matter More Than Google Admitted.
The leaked API documents contain the attribute scaledSelectionTierRank with the reference “over the serving tier (Base, Zeppelins, Landfills).” The exact tier names are thus confirmed – but the interpretation of which tier corresponds to which storage type comes from Mike King’s analysis at iPullRank (May 2024). King inferred from contextual signals in the documentation that the tiers follow a physical storage hierarchy:
| Index Tier | Storage Type (per King) | Update Frequency | Typical Content |
|---|---|---|---|
| Base Index | Flash Memory | Frequent (hours to days) | High-quality main pages, news, authoritative domains |
| Zeppelins | Solid State Drives (SSD) | Occasional (weeks) | Archive pages, deep hierarchies, lower authority |
| Landfills | Standard Hard Drives (HDD) | Rare (months or never) | Old content, low-quality pages, rarely linked URLs |
Pages in the Base Index are considered for competitive keywords. Pages in Zeppelins can rank but have worse chances for competitive queries. Content in Landfills has practically no chance of rankings for relevant queries – regardless of how good the content theoretically is.
What Influences Tier Assignment
The leaks point to several factors that appear to determine which tier a page lands in. The siteAuthority signal evaluates the entire domain – yes, a type of domain authority actually exists, even though Google denied it for a long time. Good old PageRank still plays a role, albeit in modified form. Content freshness and user engagement also factor in. Finally, update frequency – how often the page changes and therefore gets recrawled – also influences tier assignment.
Controlling Indexing: Canonical, noindex & hreflang
You have several ways to actively influence indexing. These tags and signals aren’t optional extras but essential tools for any professional SEO strategy.
Canonical Tags – Defining the Preferred Version
For duplicates or very similar pages, the canonical tag points to the “original” version. This is especially important when the same content is accessible under multiple URLs – such as a product linked through different categories or parameter URLs for sorting and filtering.
<link rel="canonical" href="https://example.com/original-page/" />
The canonical tag is a hint, not a directive. Google can choose to treat a different URL as canonical if signals are contradictory. That’s why consistency matters: internal links, sitemap entries, and canonical should all point to the same URL.
Meta Robots – Selectively Preventing Indexing
<meta name="robots" content="noindex, follow" />
The “noindex, follow” combination is particularly useful: the page itself isn’t indexed, but Google still follows its links. This is ideal for overview pages serving only navigation, or login areas whose content shouldn’t appear in search but link to indexable content.
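To audit these tags across many pages, you can read the canonical URL and the meta robots directives directly from the HTML. This is a minimal sketch using Python's standard `html.parser`; the `IndexingSignals` class and `inspect` function are illustrative names, and the sketch ignores the `X-Robots-Tag` HTTP header, which can carry the same directives.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Reads the canonical URL and meta robots directives from HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Split "noindex, follow" into individual lowercase directives.
            self.robots.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def inspect(html):
    """Return (canonical URL or None, set of robots directives)."""
    parser = IndexingSignals()
    parser.feed(html)
    return parser.canonical, parser.robots


html = """
<head>
  <link rel="canonical" href="https://example.com/original-page/" />
  <meta name="robots" content="noindex, follow" />
</head>
"""
canonical, robots = inspect(html)
print(canonical, "noindex" in robots)
```

A page that carries noindex and also appears in the sitemap, or whose canonical points somewhere other than its own URL, is worth a closer look.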
hreflang – Linking International Versions
<link rel="alternate" hreflang="de" href="https://example.com/de/page/" />
<link rel="alternate" hreflang="en" href="https://example.com/en/page/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/" />
Crucially, hreflang must be bidirectional: if page A references page B, page B must also reference page A. Faulty hreflang implementations are one of the most common technical SEO problems with international websites.
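The return-link requirement is straightforward to verify once you have collected the hreflang annotations of each page. The sketch below assumes a hypothetical `hreflang_map` that maps every page URL to the set of alternate URLs it declares, for example built from a crawl export.

```python
def find_missing_return_links(hreflang_map):
    """Return (source, target) pairs where target does not link back to source.

    `hreflang_map` maps each page URL to the set of alternate URLs it
    declares via hreflang (a hypothetical structure built from a crawl).
    """
    missing = []
    for source, alternates in hreflang_map.items():
        for target in alternates:
            if target == source:
                continue  # Self-references are fine and need no return link.
            if source not in hreflang_map.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


hreflang_map = {
    "https://example.com/de/page/": {
        "https://example.com/de/page/",
        "https://example.com/en/page/",
    },
    "https://example.com/en/page/": {
        "https://example.com/en/page/",
    },
}
missing = find_missing_return_links(hreflang_map)
for source, target in missing:
    print(f"{target} does not link back to {source}")
```

In this example the German page references the English one, but the English page only references itself, so the pair is reported.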
JavaScript & Rendering: The Hidden Hurdle
Modern websites often use JavaScript to dynamically load content. What creates a smooth experience for users poses significant challenges for Google. Googlebot can execute JavaScript, but the process is complex and delayed.
Google’s Two-Stage Rendering Process
Google crawls and renders in two separate steps. On the first crawl, only the initial HTML code is captured. The page is then queued for execution by the Web Rendering Service (WRS). Only there is JavaScript executed and the fully rendered DOM extracted. A second indexing with the rendered content follows.
The problem is the time delay between first crawl and rendering. Based on current analysis, this can take hours, days, or in individual cases even weeks – depending on Google’s current capacity and your website’s priority.
Solutions for JavaScript SEO
Server-Side Rendering (SSR) is the most robust solution. Dynamic Rendering is a middle ground, though Google now describes it as a workaround rather than a long-term solution. At minimum, practice Progressive Enhancement: core content must be available without JavaScript, and internal links must exist as real HTML links in the initial HTML.
Mobile-First Indexing: The New Standard
Google officially completed the transition to Mobile-First Indexing in October 2023, as confirmed by John Mueller on the Google blog. Googlebot Smartphone has been the primary crawler since then, and the mobile version of your website is the basis for indexing and ranking.
The transition had a long history: Google first announced Mobile-First Indexing in November 2016, began the rollout in March 2018, set multiple deadlines (September 2020, then March 2021) which were each postponed, and carried out the last batch in May 2023. The official confirmation of completion came on October 31, 2023.
More than 60% of all Google searches now come from mobile devices. For more on mobile optimization, see Core Web Vitals & Page Experience: The Complete Optimization Guide.
Mobile-First Checklist
- Content parity: All important texts, images, and videos are identical on mobile and desktop
- Meta tags: Title, description, and robots tags are present on mobile
- Structured data: JSON-LD is also embedded in the mobile version
- Internal links: Mobile navigation contains all important links
- Touch targets: Buttons and links are at least 48×48 pixels
- Readability: Font size at least 16px, no horizontal scrolling needed
Diagnosing Crawling & Indexing Problems
Google Search Console is your most important tool for identifying crawling and indexing problems. The Page Indexing report under “Indexing → Pages” shows all URLs and their current status.
Understanding the Two Main Categories
| Status | Color | Meaning | Action |
|---|---|---|---|
| Indexed | Green | Page is in Google’s index and can appear in search results | All OK ✓ |
| Not Indexed | Gray | Page is not in the index for a specific reason | Check the detailed sub-reason – intentional or error? |
Common Exclusion Reasons and Solutions
- Blocked by robots.txt: Crawling is prevented. Check if this is intentional.
- Excluded by “noindex” tag: You or a plugin explicitly deactivated indexing.
- Duplicate without user-selected canonical: Google chose a canonical itself because you didn’t define one.
- Discovered – currently not indexed: Google knows the URL but hasn’t crawled it yet.
- Crawled – currently not indexed: Google crawled the page but decided not to index it for now – often a quality problem.
Log File Analysis – Behind the Scenes
Search Console shows what Google has indexed. But it doesn’t show what Google actually does on your website. For that, you need a log file analysis. Server logs record every single request – including Googlebot’s. You can use specialized tools like Screaming Frog Log File Analyser or JetOctopus.
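If you want a first look before reaching for a dedicated tool, a few lines of Python can summarize Googlebot activity from an access log. This sketch assumes the common "combined" log format used by Apache and nginx; the sample lines are made up for illustration, and the regular expression will need adjusting if your server logs in a different layout.

```python
import re
from collections import Counter

# Matches the request path, status code, and user agent of a combined-format line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count (path, status) pairs for requests whose user agent names Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), int(match.group("status")))] += 1
    return hits


# Made-up sample lines in combined log format.
sample_log = [
    '66.249.66.1 - - [10/May/2024:06:25:14 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/May/2024:06:25:20 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5090 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/May/2024:06:26:01 +0000] "GET /shoes HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
hits = googlebot_hits(sample_log)
for (path, status), count in hits.most_common():
    print(status, count, path)
```

Because the user agent string can be spoofed, a serious analysis should also verify that the requesting IP really belongs to Google, for example via a reverse DNS lookup.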
Infographic: The Journey of a URL into Google’s Index

Conclusion: The Invisible Foundation of Your Rankings
Crawling and indexing are the often-overlooked fundamentals of every successful SEO strategy. The best keyword research and the most valuable content are useless if Google can’t find or understand your pages.
The four pillars of technical SEO: Discoverability through clean site structure, current sitemaps, and thoughtful internal linking. Crawlability through fast servers, no technical blockades, and efficient crawl budget usage. Indexability through correct canonical tags, no accidental noindex, and unique content. And Renderability through JavaScript-friendly architecture and Mobile-First optimization.
“What Google can’t crawl will never rank. What Google doesn’t index will never be found.”
Invest time in the technical foundations. For the big picture of how these phases work together in Google’s overall system, read the main article: How Does the Google Search Algorithm Work? And to understand what happens after indexing – how Google processes and interprets search queries – continue to the next chapter: Query Processing: How Google Understands Your Search Query.
Frequently Asked Questions (FAQ)
How long does it take for Google to index my new page?
This varies widely – from a few hours to several weeks. Key factors are your website’s authority, crawl frequency, content quality, and how the page is discovered. For established websites with good reputation, a new page can be indexed within hours. For new websites without authority or backlinks, it often takes weeks.
Why isn’t my page being indexed?
The most common reasons are technical: an accidentally set noindex tag, a robots.txt block, a canonical tag pointing to another page, or too few internal links. It can also be a quality issue – if Google considers the content thin, duplicate, or low-quality, it won’t be indexed. Use URL Inspection in Search Console to identify the exact status.
What’s the difference between crawling and indexing?
Crawling means Google finds and downloads your page. Indexing means Google analyzes, understands, and stores the page in its database. A page can be crawled but not indexed – for example, if Google recognizes it as a duplicate or considers it low quality. Only indexed pages can appear in search results.
How do I know which index tier my pages are in?
Google doesn’t publicly communicate tier assignment. But there are indicators: if a page is indexed but never ranks for relevant keywords, it’s likely in a lower tier. Pages that get re-crawled quickly after content updates are probably in higher tiers. Regular updates, good linking, and user engagement help with “promotion” to higher tiers.
What’s better: robots.txt Disallow or noindex?
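It depends on your goal. robots.txt prevents crawling, saving crawl budget, but doesn’t reliably prevent indexing: a blocked URL can still be indexed without its content if other pages link to it. noindex allows crawling but reliably prevents indexing. Don’t combine the two for the same URL, because Google has to crawl a page to see its noindex tag. For content that must absolutely not appear in search results, noindex is the safer choice.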
Can I speed up indexing of my pages?
Partially. Helpful measures include: submitting the URL for indexing in Search Console, setting strong internal links from already indexed pages, including the page in the sitemap, creating high-quality and unique content, and building external links from trustworthy websites. There’s no guarantee of fast indexing – Google ultimately decides based on relevance, quality, and available resources.


