Crawling ≠ Indexing: robots.txt controls whether a crawler visits a URL – not whether it ends up in the index. A URL blocked via robots.txt can still be indexed and appear in search results – often without a snippet, because Google doesn’t know the content but has discovered the URL through external links. If you don’t want indexing, you need noindex – and the page must remain crawlable for that.
One is enough for 90%: Most websites get by with three directives: User-agent, Disallow, and Sitemap. But a single Disallow: / in production prevents all crawling – already known URLs can still remain in the index until Google deindexes them via other means. The most common and expensive mistake from my audits.
AI Crawlers 2026: GPTBot, OAI-SearchBot, ClaudeBot, and Google-Extended can be controlled precisely – each for a different purpose. robots.txt and llms.txt operate on different levels: robots.txt controls crawling (standardized via RFC 9309), llms.txt is a new, not yet standardized convention intended to give AI systems guidance for concrete queries – not an IETF standard, but already actively used in practice.
Tuesday morning, technical audit for a new client. First action, always: open domain.de/robots.txt in the browser. And I see this:
```
User-agent: *
Disallow: /
```

Complete crawl block. Since the relaunch four months ago. The client had asked why organic traffic had dropped to zero after the launch. Now he knew. The developers had copied the staging configuration 1:1 into production. A classic. An expensive classic.
The robots.txt is one of the most inconspicuous files in technical SEO – two lines are enough for the worst-case scenario. Used correctly, it directs crawlers to what matters, protects your Crawl Budget, and in 2026 even gives you control over which AI systems can use your content for their training.
This guide is what I send my clients before we go through the first technical audit – straight from my work as a Product Developer at iGaming.com and from a good 50 technical SEO projects at SEO Kreativ. Not theory you’ve already read elsewhere. Practice that I apply myself.
What is robots.txt – and what is it not?
If you don’t want a page in the index, you need noindex. But: The page must remain crawlable for that, otherwise Google cannot read the noindex signal.

The robots.txt is a simple text file in the root directory of your domain – so under domain.de/robots.txt. It follows the Robots Exclusion Protocol (RFC 9309), which has been documented as an official IETF standard since September 2022 and is described by Google in the Search Central documentation. It contains instructions for web crawlers: which paths they are allowed to visit, and which not.
First things first – and this gets confused in every second audit:
| robots.txt can | robots.txt cannot |
|---|---|
| Prevent crawling of a URL | Prevent indexing of a URL |
| Direct crawl resources to important areas | Block access for real users |
| Give specific rules to specific crawlers | Remove pages from the index that are already there |
| Communicate the sitemap path | Protect sensitive data (no substitute for authentication) |
| Exclude AI crawlers from training | Ensure that crawlers play by the rules |
robots.txt is also a recommendation, not an obligation. Reputable crawlers – Googlebot, Bingbot, GPTBot – adhere to it. Malware bots and intentional content scrapers adhere to nothing at all. If you really want to protect something, you need authentication, rate limiting, and firewall rules.
Disallow: /geheim/ does not protect confidential content. First, not all crawlers respect the directive. Second, the robots.txt is public – anyone can access it and see exactly which paths you mark as sensitive. Attackers actively use this as a reconnaissance tool.

Syntax and directives: what you need to know
User-agent, Disallow, Sitemap. Everything else is fine-tuning. But an empty Disallow: (no value) means “everything allowed” – the opposite of Disallow: /. You should know this difference by heart.

A robots.txt consists of one or more blocks. Each block begins with User-agent and then contains the directives for exactly that crawler. There must be a blank line between blocks – if it’s missing, some crawlers interpret everything as a single block.
The six directives at a glance
| Directive | Meaning | Support |
|---|---|---|
| User-agent | Which crawler the block applies to. * = all. | Universal |
| Disallow | Block a path. Left empty = everything allowed. / = everything blocked. | Universal |
| Allow | Explicitly allow a path – overrides Disallow. The most specific path wins. | All RFC-9309-compliant crawlers (historically: Google, Bing) |
| Sitemap | Absolute URL of the XML sitemap. Can be used multiple times, outside the blocks. | Google, Bing, Yandex |
| Crawl-delay | Wait time in seconds between crawler requests. | Bing, Yandex – not Google |
| # | Comment line, is ignored. | Universal |
Google ignores Crawl-delay – this has been documented behavior for years. The crawl rate for Google can only be influenced via the Search Console under “Settings → Crawl stats”. Anyone who has a Crawl-delay for Google in their robots.txt can delete the entry without replacement.

Syntax rules that cause errors in practice
An important point for everyone writing robots.txt rules for Google: The token Googlebot applies to both crawler variants – smartphone and desktop. It is not possible to distinguish between them in the robots.txt because both use the same User-agent token. Google explicitly confirms this: “you cannot selectively target either Googlebot Smartphone or Googlebot Desktop using robots.txt.” Anyone trying to build separate Mobile/Desktop rules is building on a foundation that doesn’t exist.
Wildcards work for Google, but not exactly the same for all crawlers. Disallow: /admin* blocks everything starting with /admin. Disallow: /*.pdf$ blocks all PDFs – the $ stands for “end of URL”. Case sensitivity matters: Disallow: /Admin does not block /admin. In case of conflicts between Allow and Disallow, the more specific rule always wins with Google.
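To make this precedence logic tangible, here is a minimal Python sketch of the longest-match rule – my own simplified approximation of RFC 9309 matching (no percent-encoding handling, no group selection), not Google’s actual implementation; the example rules and paths are invented:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the
    start of the path: '*' matches any characters, '$' marks the URL end."""
    regex = ""
    for char in pattern:
        if char == "*":
            regex += ".*"
        elif char == "$":
            regex += "$"
        else:
            regex += re.escape(char)
    return re.compile(regex)

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs, e.g. ("Disallow", "/produkte/").
    The longest matching pattern wins; on a tie, Allow beats Disallow."""
    best_len, best_directive = -1, None
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow: means "no restriction"
        if pattern_to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and directive == "Allow"):
                best_len, best_directive = len(pattern), directive
    return best_directive != "Disallow"  # no matching rule at all = allowed

rules = [("Disallow", "/produkte/"),
         ("Allow", "/produkte/sale/"),
         ("Disallow", "/*?filter=")]
print(is_allowed("/produkte/sale/schuhe", rules))  # True  – longer Allow wins
print(is_allowed("/produkte/herren", rules))       # False – Disallow applies
print(is_allowed("/kategorie?filter=rot", rules))  # False – wildcard matches
```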
The minimum – the cleanest robots.txt for small websites
```
# robots.txt for domain.de
# As of: 2026
User-agent: *
Disallow:

Sitemap: https://www.domain.de/sitemap.xml
```
Disallow: without a value: all crawlers are allowed everything. Sitemap path communicated. Many small websites simply don’t need more than this.
Infographic: Anatomy of a robots.txt
Before we get to the practical scenarios – here is the visual overview of the structure, directives, and the three most important rules of thumb:

5 practical scenarios with copy-paste code
Scenario 1: WordPress Standard
My starting point for WordPress sites without special requirements – distilled from hundreds of projects:
```
# robots.txt WordPress Standard
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /search/

Sitemap: https://www.domain.de/sitemap_index.xml
```

Admin area blocked, AJAX endpoint explicitly allowed (some frontend components need it), theme and plugin assets blocked (zero SEO value, saves crawl budget), search result pages blocked (duplicate content).
Scenario 2: E-Commerce with Faceted Navigation
Faceted navigation is the classic use case. Filter parameters create thousands of URL combinations that are crawled but have no SEO value and eat up the crawl budget:
```
# robots.txt E-Commerce
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /*?filter=
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /search?
Allow: /

Sitemap: https://www.shop.de/sitemap.xml
```

Disallow: /*?filter= blocks all URLs containing the parameter ?filter= – regardless of the path and other parameters. That sounds precise, but it can apply more broadly than intended. Always check against five URL variants using the Google robots.txt Tester before going live.

Scenario 3: Staging environment – and why it must never go to production
```
# STAGING ONLY – never in production!
User-agent: *
Disallow: /
```

These two lines belong on every staging system. And in every deployment pipeline I set up, there is a dedicated robots.txt check as a mandatory gate before go-live. That’s not paranoia – that’s the lesson from the case this article started with.
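As an illustration of such a gate, here is a minimal Python sketch of the kind of check I mean – the production URL is a placeholder, the heuristics are deliberately crude, and a real pipeline would wire this into its own tooling:

```python
import sys
import urllib.request

PRODUCTION_ROBOTS = "https://www.domain.de/robots.txt"  # placeholder URL

def check_robots(url: str) -> list[str]:
    """Return a list of problems; an empty list means the gate passes."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status, body = resp.status, resp.read().decode("utf-8", "replace")
    except Exception as exc:
        return [f"robots.txt not reachable: {exc}"]
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Crude heuristic: any bare "Disallow: /" is treated as the staging block
    # from Scenario 3. Note that this would also flag intentional
    # "Disallow: /" rules for AI crawlers – adapt it to your own policy.
    for line in body.splitlines():
        if line.split("#")[0].strip().lower() == "disallow: /":
            problems.append("found 'Disallow: /' – staging configuration in production?")
            break
    if "sitemap:" not in body.lower():
        problems.append("no Sitemap entry found")
    return problems

if __name__ == "__main__":
    issues = check_robots(PRODUCTION_ROBOTS)
    for issue in issues:
        print("ROBOTS GATE:", issue)
    sys.exit(1 if issues else 0)
```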
Scenario 4: Multilingual website
With language subfolders, the robots.txt is usually straightforward – let them be crawled, communicate the sitemaps. If you use language subdomains instead, each subdomain needs its own robots.txt:
```
# robots.txt multilingual
User-agent: *
Disallow: /intern/
Disallow: /preview/

Sitemap: https://www.domain.de/sitemap-de.xml
Sitemap: https://www.domain.de/sitemap-en.xml
Sitemap: https://www.domain.de/sitemap-fr.xml
```

Multiple sitemap entries are valid and make sense for large projects.
Scenario 5: Targeted media crawler block
If you want to keep images from specific areas out of Google Image Search without blocking the regular Googlebot:
```
# Limit only image crawlers
User-agent: Googlebot-Image
Disallow: /intern/
Disallow: /produkte/intern/

User-agent: *
Disallow: /intern/

Sitemap: https://www.domain.de/sitemap.xml
```

robots.txt and Crawl Budget
Google ignores Crawl-delay. If you really want to throttle the crawl rate for Google, do it exclusively in the Search Console.

Crawl budget describes how many pages Googlebot visits on your website within a specific timeframe. For small to medium websites with under 10,000 indexed URLs, this is rarely a limiting factor. For large shops, publishers, or platforms with faceted navigation, tag archives, and URL parameters, it can become critical.
From my work at iGaming.com, I know how quickly this escalates: A shop with 80,000 product pages and unrestrained faceted navigation easily produces two million crawler-accessible URLs. If Googlebot visits most of those, the 80,000 relevant pages get less budget – and are crawled and updated less frequently. This is not a theoretical problem, it is a common reason for indexing issues.
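Before touching the robots.txt, it is worth quantifying the problem. A rough Python sketch over the server access log – it assumes a common combined log format, treats every ?filter=/?sort=/?page= URL as a facet URL, and takes the logged User-agent at face value (for anything critical, verify Googlebot via reverse DNS); the file path and parameter names are placeholders:

```python
import re
from collections import Counter

LOG_FILE = "access.log"                        # placeholder path
FACET_PARAMS = ("filter=", "sort=", "page=")   # assumed parameter names

# Captures the request path and the final quoted field (the User-agent)
# of a combined-log-format line.
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+".*"(?P<ua>[^"]*)"$')

counts = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = LINE_RE.search(line)
        if not match or "Googlebot" not in match.group("ua"):
            continue
        is_facet = any(param in match.group("path") for param in FACET_PARAMS)
        counts["facet URLs" if is_facet else "clean URLs"] += 1

total = sum(counts.values()) or 1
for bucket, hits in counts.most_common():
    print(f"{bucket}: {hits} Googlebot requests ({hits / total:.0%})")
```

If the facet share dominates, the Disallow rules from Scenario 2 are usually the fastest lever.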
| Measure | Impact on crawl budget |
|---|---|
| Block non-indexed URLs in robots.txt | Medium to high |
| Crawl-delay (non-Google) | Reduces the crawl rate for Bing/Yandex |
| Keep the sitemap up to date | Directs crawlers to relevant URLs |
| Clean up the internal link structure | High – direct impact |
| Reduce redirect chains | Medium |
| Throttle the crawl rate in Search Console | Direct impact on Googlebot |
Controlling AI crawlers: GPTBot, ClaudeBot, Google-Extended & Co.
Google-Extended only blocks Gemini training – not search. And robots.txt and llms.txt are not an either-or choice: robots.txt (RFC 9309 standard) controls crawling. llms.txt is a new, not yet standardized convention for AI systems. Both operate on different levels – robots.txt is mandatory, llms.txt is optional and not yet an IETF standard.

This wasn’t a question that popped up in technical audits back in 2022. Today, clients ask me about it in every second project. The crawlers of AI companies have differentiated enormously over the last two years – and if you treat them generically, you lose control.
The most important AI crawlers in 2026 with verified User-agent names, according to the official documentation of the providers – including OpenAI Platform Docs:
| User-agent | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | AI training (ChatGPT, GPT models) |
| OAI-SearchBot | OpenAI | ChatGPT search / Atlas index |
| ChatGPT-User | OpenAI | User-triggered browsing in ChatGPT |
| ClaudeBot | Anthropic | Training & retrieval |
| anthropic-ai | Anthropic | Older User-agent (still active) |
| PerplexityBot | Perplexity AI | AI search index |
| Google-Extended | Google | Gemini & Vertex AI training/grounding – no ranking signal, no impact on Google Search |
| Bytespider | ByteDance | Training (TikTok ecosystem) |
A block that stops AI training without affecting Googlebot for search:
```
# Block AI training (search untouched)
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Only Gemini training – NOT Google Search
User-agent: Google-Extended
Disallow: /
```

Disallow: / for GPTBot stops the training crawl. The llms.txt, on the other hand, controls what an AI prioritizes for a specific user query – these are two different levels. If you want to stop AI training but still appear in AI answers, you combine both.

Important limitation: robots.txt is a guideline, not a technical barrier. Reputable providers like OpenAI and Anthropic have publicly stated that they respect the directives. If you want absolute protection, you need server-side measures – firewall rules based on IP ranges or User-agent strings.
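If you want to see at a glance how your current file treats these bots, a small parser is enough. A rough Python sketch under simplifying assumptions – it only checks whether the group that applies to a bot contains a blanket Disallow: /, it ignores Allow exceptions, and the bot list simply mirrors the table above; the URL is a placeholder:

```python
import urllib.request

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "anthropic-ai", "PerplexityBot", "Google-Extended", "Bytespider"]

def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Map each User-agent token (lowercased) to the rule lines of its group(s)."""
    groups, current_agents, collecting_agents = {}, [], True
    for raw in robots_txt.splitlines():
        line = raw.split("#")[0].strip()
        if not line or ":" not in line:
            continue
        field, value = [part.strip() for part in line.split(":", 1)]
        if field.lower() == "user-agent":
            if not collecting_agents:            # a new group starts here
                current_agents, collecting_agents = [], True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append(f"{field.lower()}: {value}")
    return groups

def audit(url: str) -> None:
    with urllib.request.urlopen(url, timeout=10) as resp:
        groups = parse_groups(resp.read().decode("utf-8", "replace"))
    for bot in AI_BOTS:
        rules = groups.get(bot.lower(), groups.get("*", []))
        blocked = "disallow: /" in rules
        print(f"{bot:16} {'fully blocked' if blocked else 'allowed – check rules manually'}")

audit("https://www.domain.de/robots.txt")  # placeholder URL
```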
The most common mistakes from real audits
The most common mistake: Disallow: / carried over from staging to production. Second most common: CSS and JavaScript blocked. Both can be fixed in three minutes – if you notice them in time.

Mistake 1: Disallow: / in production
I already described the most expensive mistake from my audits in this article. It almost always happens during a relaunch. Important: Disallow: / prevents crawling, but doesn’t immediately remove already known URLs from the index – they can still appear as entries without a snippet for weeks. My antidote: Every deployment process has an explicit robots.txt check as a mandatory gate – regardless of who executes the deployment.
Mistake 2: CSS and JavaScript blocked
```
# Bad – prevents correct rendering
User-agent: *
Disallow: /wp-content/
Disallow: /assets/js/
Disallow: /assets/css/
```

Google renders pages like a browser. If Googlebot cannot load CSS and JavaScript, it sees a broken page and evaluates it accordingly. CSS and JS files do not belong on the blocklist.
Mistake 3: Combining noindex and robots.txt
If a URL is blocked in the robots.txt, Google cannot read the noindex tag – because Googlebot doesn’t visit the page. The URL can still be indexed (through external links), but then without content and without a snippet. Rule of thumb: Either robots.txt block (prevent crawling) or noindex (prevent indexing, allow crawling) – never both at the same time. Google explicitly confirms this in the noindex documentation.
Mistake 3b: PDFs and non-HTML files have no HTML head for noindex
A mistake I regularly see in audits for publisher and e-commerce websites: PDFs shouldn’t be indexed – but only a noindex meta tag was set, which would have worked for an HTML page. PDFs don’t have a <head> section. The only way to tell Google that a PDF file should not be indexed is the X-Robots-Tag in the HTTP response header – confirmed by Google in the official documentation.
For Apache in the .htaccess:
```
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```

For Nginx:
```
location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}
```

Don’t additionally block the PDFs in the robots.txt, otherwise Google never gets to see the X-Robots-Tag – the pattern is identical to the noindex problem above.
Mistake 4: Relative instead of absolute sitemap URL

```
# Wrong
Sitemap: /sitemap.xml

# Right
Sitemap: https://www.domain.de/sitemap.xml
```

The Sitemap directive expects a complete, absolute URL. Relative paths are not interpreted correctly by many crawlers.
Mistake 5: Crawl-delay for Googlebot
I still see this regularly in audits: Crawl-delay: 10 for Googlebot. Google ignores this completely. The entry is not harmful, but useless – and it suggests that someone thinks they achieved something with it.
Mistake 6: No testing before go-live
Pushing the robots.txt live without a prior test is like a deployment without a review. Just because it usually goes well doesn’t make it a good practice.
Testing robots.txt: my 5-step protocol
| Tool | What it can do | Where |
|---|---|---|
| Google robots.txt Tester | Official Google implementation, tests URL against User-agent | Google Search Console → Settings |
| Screaming Frog | Mass test: which crawled URLs are blocked? | Desktop tool |
| Search Console → Page indexing | “Blocked by robots.txt” URLs after go-live | Search Console → Indexing |
My protocol for every robots.txt change:
1. Make the change locally, add a comment with the date.
2. Test against three representative URLs in the Google robots.txt Tester: one that should be blocked, one that should pass through, and a borderline URL (e.g., the Allow exception).
3. Test at least five URL variants for wildcard rules.
4. Deploy.
5. Check the Search Console 48 hours later: no unexpected “blocked by robots.txt” warnings in the Page indexing report.
Frequently Asked Questions (FAQ)
Do I even need a robots.txt?
Technically no – without a robots.txt, Googlebot crawls everything. Practically, I always recommend one, at least to communicate the sitemap and block obvious crawler traps. The three minutes of effort are worth it.
What happens if my robots.txt is unreachable?
A 5xx server error makes Googlebot temporarily pause the crawling of the domain. A 404 (not found) is interpreted by Google as “no restrictions” and it crawls everything. The robots.txt must always respond with HTTP 200.
Can I stop AI training but still appear in AI answers?
Yes – and that is a relevant strategic decision in 2026. Disallow: / for GPTBot stops the training crawl. The llms.txt independently controls what an AI prioritizes for a specific query. Both levels function independently of each other.
Does robots.txt help against scrapers?
No. Intentional scrapers ignore robots.txt. The file is aimed at cooperative crawlers. Rate limiting, IP blocking, and – if necessary – legal action are what help against real scraping.
What happens in case of conflicts between Allow and Disallow?
With Google, the more specific rule wins. Allow: /produkte/sale/ overrides Disallow: /produkte/ because the Allow path is longer and therefore more specific. The logic can differ with other crawlers – when in doubt, test it.
Can I have multiple robots.txt files for one domain?
No. There is always exactly one robots.txt per domain origin, in the root directory. Subdomains (blog.domain.de) each need their own file under blog.domain.de/robots.txt.
Conclusion: build simply, control purposefully, always test
Start with the minimum. Block obvious crawler traps. Communicate the sitemap. Make a conscious decision on which AI crawlers you control and how – that is no longer a theoretical question in 2026. And observe whether the llms.txt makes sense for your setup – it’s not a formal standard yet, but is already actively used by several AI systems.
If you want to dive deeper into the topic: The Crawling & Indexing Guide explains how Googlebot embeds the robots.txt into the overall process from crawling to indexing – including index tiers and the Caffeine system. The robots.txt is the entrance. What happens behind it is the really interesting story.
No Disallow: / in production? ✓ — CSS and JS not blocked? ✓ — Sitemap as absolute URL? ✓ — Google robots.txt Tester passed? ✓ — AI crawler strategy defined (GPTBot, ClaudeBot, Google-Extended)? ✓



