Robots.txt Patterns Every Founder Should Know
Five robots.txt patterns that solve crawl issues. Step-by-step guide for founders shipping SEO-ready sites without agency help.
Why Your Robots.txt Matters (Even If You Think It Doesn't)
You shipped. Your product works. Users love it. But Google's crawlers? They're probably wasting time on pages you don't want indexed, missing pages you do want indexed, or getting throttled by your server because you never told them where to go.
Robots.txt is a plain text file, usually under 20 lines, that controls what search engines crawl on your site. Most founders skip it entirely or misconfigure it in ways that tank organic visibility. The brutal truth: a broken robots.txt won't kill your site overnight, but it will silently kill your SEO momentum over months.
This guide covers five robots.txt patterns that solve the most common crawl issues. You'll learn exactly what each pattern does, why it matters, and how to implement it in under five minutes. No agency required.
Prerequisites: What You Need Before You Start
Before you write your first robots.txt directive, make sure you have these in place:
Access to your domain's root directory. Robots.txt must live at yoursite.com/robots.txt. If you're on Webflow, Shopify, Squarespace, or WordPress, the platform exposes robots.txt through its settings or an SEO plugin. If you're on a custom stack, you need file access to your server root.
Google Search Console verified. You can't diagnose crawl issues or test robots.txt changes without GSC. If you haven't set it up yet, follow this step-by-step guide to verify your domain in Google Search Console in under 10 minutes. GSC lets you see what Google is trying to crawl and what it's blocked from crawling.
A basic understanding of your site structure. Know which pages you want indexed, which you want to hide, and which sections get hammered by bot traffic. If you're unsure, check your server logs or your analytics platform.
Familiarity with robots.txt syntax. Robots.txt uses simple directives: User-agent (which bot this applies to), Disallow (what to block), and Allow (what to permit). The official Google Search Central documentation on robots.txt is the source of truth, but we'll cover the practical patterns you actually need.
If you're starting from scratch, read this foundational guide on writing your first robots.txt file. It'll take 10 minutes and give you a template to customize.
Pattern 1: Block Low-Value Pages to Preserve Crawl Budget
The Problem: Google crawls your site on a budget. It allocates a finite number of crawl requests per day based on your domain's authority and server speed. If you have thousands of low-value pages—admin panels, duplicate content, staging environments, thank-you pages—Google wastes crawl budget on them instead of indexing your high-value content.
The Solution:
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /thank-you/
Disallow: /api/
Disallow: /dashboard/
Disallow: /*.pdf$
Disallow: /*?sort=
Disallow: /*?filter=
What This Does:
- User-agent: * applies the rules to all bots (Google, Bing, etc.).
- Disallow: /admin/ blocks the entire /admin/ folder and everything inside it.
- Disallow: /staging/ prevents crawling of your staging environment.
- Disallow: /thank-you/ hides thank-you pages that don't need organic traffic.
- Disallow: /api/ stops bots from crawling API endpoints (which confuses search engines and wastes crawl budget).
- Disallow: /dashboard/ blocks user dashboards and account pages.
- Disallow: /*.pdf$ blocks all PDF files (the $ means "end of URL").
- Disallow: /*?sort= and Disallow: /*?filter= prevent crawling of filtered/sorted versions of the same content (a duplicate-content killer).
Why It Works: By blocking low-value pages, you force Google to spend its crawl budget on pages that actually matter—your blog posts, product pages, and core content. This is especially critical for founders with limited domain authority. Every crawl request counts.
When to Use This: If you're seeing coverage issues in Google Search Console (errors, warnings, excluded pages), or if your site has a lot of dynamic parameters, filters, or admin sections, this pattern is your first move.
For deeper context on how robots.txt interacts with other SEO files, check out this guide on robots.txt, sitemaps, and canonicals—most founders misconfigure all three.
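Before you deploy a pattern like this, it helps to sanity-check which URLs it actually catches. The sketch below is a rough, standalone approximation of the matching behavior Google documents (prefix matching, with * as a wildcard and $ anchoring the end of the URL). It is not Google's parser, and the sample paths are made up for illustration.

import re

# Path rules from the Disallow lines above.
DISALLOW_RULES = [
    "/admin/", "/staging/", "/thank-you/", "/api/", "/dashboard/",
    "/*.pdf$", "/*?sort=", "/*?filter=",
]

def rule_to_regex(rule):
    """Translate a robots.txt path rule to a regex: '*' matches anything, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def is_blocked(path_and_query):
    """True if any Disallow rule matches; a simple any() works because this group has no Allow overrides."""
    return any(rule_to_regex(rule).search(path_and_query) for rule in DISALLOW_RULES)

for sample in ["/blog/launch-post", "/admin/settings", "/pricing?sort=asc",
               "/docs/guide.pdf", "/docs/guide.pdf?download=1"]:
    print(f"{sample:32} -> {'blocked' if is_blocked(sample) else 'crawlable'}")

Note that /docs/guide.pdf?download=1 comes back crawlable: the $ in /*.pdf$ only matches URLs that end in .pdf. That is exactly the kind of edge case worth catching before you ship the file.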
Pattern 2: Allow Specific Bots While Blocking Others (Strategic Bot Control)
The Problem: Not all bots are equal. Some bots are useful (Google, Bing, social media crawlers). Others are parasites (scrapers, bad actors, aggressive crawlers that hammer your server). If you don't control which bots can crawl, you'll waste server resources and bandwidth on bots that don't help your SEO.
The Solution:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: facebookexternalhit
Allow: /
User-agent: Twitterbot
Allow: /
What This Does:
- User-agent: * with Disallow: / blocks all bots by default.
- Then you explicitly Allow: / for specific bots: Googlebot (Google's crawler), Bingbot (Bing's crawler), and the social media crawlers (facebookexternalhit for Facebook, Twitterbot for Twitter/X).
- Any bot not listed is blocked.
Why It Works: This is a whitelist approach. It's aggressive, but it keeps every compliant crawler you haven't explicitly allowed off your server, which saves bandwidth on shared hosting or resource-limited plans. Keep in mind that truly malicious scrapers ignore robots.txt, so pair this with firewall or rate-limit rules if they're the real problem.
When to Use This: If your server logs show traffic from aggressive crawlers (like AhrefsBot, SemrushBot, or random user agents), or if your bandwidth is getting crushed by bot traffic, implement this pattern. You can always add more bots to the whitelist if needed.
Pro Tip: Check your server logs to see which bots are actually crawling your site. You might be surprised. Tools like Moz's beginner's guide to robots.txt have a list of common bot user agents you can reference.
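If you want the answer straight from your own logs, a rough tally of requests per user agent is enough to spot the heavy hitters. A minimal sketch, assuming the common nginx/Apache "combined" log format and a placeholder log path you'd swap for your own:

from collections import Counter
from pathlib import Path

LOG_FILE = Path("/var/log/nginx/access.log")  # placeholder: point this at your own access log

def user_agent(line):
    """In the common 'combined' log format the user agent is the last double-quoted field."""
    parts = line.split('"')
    return parts[-2].strip() if len(parts) >= 3 else "unknown"

counts = Counter()
with LOG_FILE.open(errors="replace") as handle:
    for line in handle:
        counts[user_agent(line)] += 1

# Crawlers usually dominate the top of this list.
for agent, hits in counts.most_common(15):
    print(f"{hits:>8}  {agent}")

Anything near the top that isn't Googlebot, Bingbot, or a social crawler you care about is a candidate for the block list.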
Pattern 3: Protect Sensitive Directories Without Using robots.txt Alone
The Problem: Founders often try to hide sensitive pages (like /admin/, /private/, or /user-data/) using robots.txt alone. This is a critical mistake. Robots.txt is a suggestion, not a lock. A determined attacker or a misconfigured bot can ignore robots.txt and crawl these pages anyway. Sensitive data should never be protected by robots.txt alone.
The Solution:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /internal/
Disallow: /user-data/
But also implement these alongside robots.txt:
- Use HTTP authentication (password protect directories at the server level).
- Use a firewall rule to block access to sensitive paths from non-admin IPs.
- Use noindex meta tags on sensitive pages as a second layer.
Why It Works: Robots.txt tells well-behaved bots to stay out. But passwords, firewalls, and noindex tags provide real security. This is defense in depth.
When to Use This: If your site handles user data, payments, or admin functions, use all three layers. Robots.txt is just the first line of defense.
For a detailed decision tree on when to use robots.txt vs. noindex, read this guide on noindex vs. robots.txt. It covers the exact scenarios where each tool is appropriate.
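It's also worth verifying that the extra layers are actually in place. The sketch below is a minimal stdlib check: it requests a few sensitive URLs (placeholders, substitute your own) and reports the status code, any X-Robots-Tag header, and whether a robots meta tag shows up in the HTML. A 401 or 403 is the result you want to see on anything truly private.

import urllib.request
import urllib.error

# Placeholder URLs: substitute paths that should be locked down on your own site.
SENSITIVE_URLS = [
    "https://yoursite.com/admin/",
    "https://yoursite.com/private/",
    "https://yoursite.com/user-data/export",
]

for url in SENSITIVE_URLS:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            status = response.status
            x_robots = response.headers.get("X-Robots-Tag", "")
            body = response.read(50_000).decode("utf-8", errors="replace").lower()
            has_meta_noindex = '<meta name="robots"' in body and "noindex" in body
    except urllib.error.HTTPError as err:
        # 401/403 here is good news: the server-level lock is doing its job.
        status, x_robots, has_meta_noindex = err.code, err.headers.get("X-Robots-Tag", ""), False
    except urllib.error.URLError as err:
        print(f"{url}: unreachable ({err.reason})")
        continue
    print(f"{url}: status={status} x-robots-tag={x_robots!r} meta-noindex={has_meta_noindex}")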
Pattern 4: Block Duplicate Content and Parameter-Based Pages
The Problem: Many sites generate multiple URLs for the same content. E-commerce sites have filter parameters (?color=red&size=large). Blogs have sorting options (?sort=date). Tracking parameters get appended (?utm_source=email). Each variation is a separate URL, so Google sees them as different pages. This fragments your authority and wastes crawl budget on duplicates.
The Solution:
User-agent: *
Disallow: /*?*
Allow: /*?product_id=
Allow: /*?page=
Disallow: /*&
Disallow: /search?
Disallow: /filter?
Disallow: /*utm_
Disallow: /*fbclid=
Disallow: /*gclid=
What This Does:
- Disallow: /*?* blocks all URLs with query parameters (the ? symbol).
- Allow: /*?product_id= and Allow: /*?page= whitelist the specific parameters you do need crawled (product IDs and pagination).
- Disallow: /*& blocks URLs with multiple parameters (the & symbol).
- Disallow: /search?, Disallow: /filter?, and the tracking-parameter rules block on-site search, filter, and analytics URLs that create duplicates.
Why It Works: By blocking parameter-based duplicates, you consolidate authority on your canonical URLs. Google crawls fewer pages, but the pages it crawls matter more. This is especially powerful for e-commerce and content sites.
When to Use This: If your site has any dynamic parameters, filters, or sorting options, use this pattern. Check Yoast's ultimate guide to robots.txt for more examples of parameter blocking.
Pro Tip: Use canonical tags alongside this pattern. Robots.txt tells bots not to crawl duplicates. Canonical tags tell bots which version is the original. Together, they're unstoppable.
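On the canonical side, the core move is collapsing every parameter-laden variant down to one clean URL. Here is a minimal sketch of that normalization using only the standard library; which parameters count as tracking noise is an assumption you should tune for your own site.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that never change page content (assumed list; adjust for your site).
TRACKING_PREFIXES = ("utm_", "fbclid", "gclid")

def canonical_url(url):
    """Strip tracking parameters and fragments so duplicate variants collapse to one URL."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.startswith(TRACKING_PREFIXES)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://yoursite.com/pricing?utm_source=email&plan=pro&gclid=abc123"))
# -> https://yoursite.com/pricing?plan=pro

Emit the result as the href of your canonical tag: robots.txt keeps bots off the stripped variants, and the canonical tag points any that slip through back to the original.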
Pattern 5: Allow Crawling While Controlling Crawl Rate (Speed Control)
The Problem: Google crawls your site based on its crawl budget. If your server is slow or overloaded, Google might crawl fewer pages per day, or it might crawl so aggressively that it slows down your site for real users. This creates a catch-22: you want Google to crawl everything, but you don't want Google to break your site.
The Solution:
User-agent: Googlebot
Crawl-delay: 2
Request-rate: 1/5s
User-agent: Bingbot
Crawl-delay: 1
User-agent: *
Crawl-delay: 5
Request-rate: 1/10s
What This Does:
- Crawl-delay: 2 tells Googlebot to wait 2 seconds between crawl requests.
- Request-rate: 1/5s asks for a rate of 1 request per 5 seconds.
- Bingbot gets a shorter delay (1 second) because Bing crawls less aggressively than Google.
- All other bots get a longer delay (5 seconds) to protect your server.
Why It Works: This pattern prevents your server from being overwhelmed while still allowing Google to crawl your content. It's especially useful if you're on shared hosting or if your server struggles under load.
When to Use This: If your server logs show high CPU usage or response times spike during Google's crawl, implement crawl delays. Also use this if you're on a limited bandwidth plan.
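To confirm it's actually Googlebot causing the spikes, bucket its requests per minute from your access log before adding delays. A quick sketch, again assuming the combined log format and a placeholder log path; note that strictly verifying Googlebot also requires a reverse DNS check, which this skips.

import re
from collections import Counter
from pathlib import Path

LOG_FILE = Path("/var/log/nginx/access.log")  # placeholder: your access log
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})")  # captures day/month/year:HH:MM

per_minute = Counter()
with LOG_FILE.open(errors="replace") as handle:
    for line in handle:
        if "Googlebot" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            per_minute[match.group(1)] += 1  # bucket hits by minute

print("Busiest minutes for Googlebot:")
for minute, hits in per_minute.most_common(5):
    print(f"  {minute}  {hits} requests")

If the busiest minutes line up with your slow responses, throttling is justified; if not, a crawl delay only slows down your own indexing.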
Important Note: Google ignores the Crawl-delay directive, and Request-rate is a nonstandard extension that most crawlers skip (Bing does honor Crawl-delay). For Google, manage crawl rate through server performance and Google Search Console's crawl stats and settings. Use robots.txt crawl delays as a backup for other bots, not your primary tool.
For a comprehensive look at how crawl budget affects your entire SEO strategy, check out the 100-day SEO roadmap for founders. It covers crawl optimization as part of a larger technical SEO foundation.
Step-by-Step Implementation Guide
Step 1: Audit Your Current Robots.txt
First, check if you have a robots.txt file at all. Visit yoursite.com/robots.txt in your browser. If you see a 404, you don't have one yet. If you see a file, review it for errors.
In Google Search Console, open the robots.txt report under Settings. It shows the version of the file Google last fetched and flags any syntax errors. To check whether a specific URL is allowed or blocked, run it through the URL Inspection tool.
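If you'd rather script that first check than eyeball it in a browser, here is a minimal stdlib version (the domain is a placeholder):

import urllib.request
import urllib.error

ROBOTS_URL = "https://yoursite.com/robots.txt"  # placeholder: swap in your own domain

try:
    with urllib.request.urlopen(ROBOTS_URL, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
        print(f"HTTP {response.status}: {len(body.splitlines())} lines")
        print(body[:500])  # preview the first few hundred characters
except urllib.error.HTTPError as err:
    if err.code == 404:
        print("HTTP 404: no robots.txt is being served at this URL")
    else:
        print(f"HTTP {err.code}: unexpected response while fetching robots.txt")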
Step 2: Identify Pages to Block
Make a list of pages you want to block from crawling:
- Admin panels and dashboards
- Staging or development environments
- Duplicate content (filtered, sorted, or parameterized versions)
- Private user pages (though protect these with passwords too)
- API endpoints
- PDFs or files you don't want indexed
- Thank you pages, confirmation pages
- Search results pages on your own site
For each page, note the URL pattern. For example, /admin/*, /staging/*, /*?filter=*.
Step 3: Write Your Robots.txt
Start with one of the patterns above. Customize it for your site. Here's a template:
# Allow all bots by default
User-agent: *
Allow: /
# Block low-value pages
Disallow: /admin/
Disallow: /staging/
Disallow: /thank-you/
Disallow: /api/
Disallow: /*?filter=
Disallow: /*?sort=
Disallow: /*utm_
# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml
Start conservative. Block only the pages you're absolutely sure about. You can always be more aggressive later.
Step 4: Upload Your Robots.txt
For WordPress: Use a plugin like Yoast SEO or Rank Math. Both have built-in robots.txt editors.
For Webflow: Go to Project Settings → SEO → Robots.txt. Paste your rules and save.
For Shopify: robots.txt is generated automatically; to customize it, add a robots.txt.liquid template to your theme (Online Store → Themes → Edit code) and put your rules there.
For Next.js or custom stacks: Create a public/robots.txt file in your project root. Deploy.
For other platforms: Check your hosting provider's documentation. Most have a file manager or control panel where you can upload files to the root directory.
Step 5: Test Your Robots.txt
Go back to Google Search Console. Check the robots.txt report (Settings → robots.txt) for fetch or syntax errors, then run a handful of URLs that should be allowed and a handful that should be blocked through the URL Inspection tool. Verify that your rules work as expected; Google's own verdict is the source of truth.
Step 6: Submit Your Sitemap
Add this line to your robots.txt:
Sitemap: https://yoursite.com/sitemap.xml
Replace yoursite.com/sitemap.xml with your actual sitemap URL. If you don't have a sitemap yet, follow this guide to generate one for your stack. It covers Next.js, Webflow, Shopify, Lovable, WordPress, and Framer.
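Before you point robots.txt at it, it's worth confirming the sitemap URL resolves and contains what you expect. Here is a small sketch using only the standard library, with a placeholder URL; it counts the <loc> entries in a standard sitemaps.org file.

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yoursite.com/sitemap.xml"  # placeholder: your real sitemap URL

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as response:
    tree = ET.fromstring(response.read())

# Sitemap files use the sitemaps.org namespace; count the <loc> entries.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]
print(f"{len(urls)} URLs listed; first few:")
for url in urls[:5]:
    print(" ", url)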
Step 7: Monitor in Google Search Console
After you deploy your robots.txt, check Google Search Console's Coverage report in 24-48 hours. You should see:
- Blocked pages moving into the "Blocked by robots.txt" exclusion bucket (and eventually dropping out of the report) instead of generating errors
- More "Indexed" pages (because Google is focusing crawl budget on pages that matter)
- No new errors or warnings
If you see unexpected errors, check the Coverage report's detailed breakdown. This plain-English guide to coverage issues explains what each error means and how to fix it.
Pro Tips and Common Mistakes
Mistake 1: Blocking Pages You Actually Want Indexed
Robots.txt is easy to mess up. A single / in the wrong place can block your entire site. Always test before deploying.
Mistake 2: Using robots.txt to Hide Sensitive Data
Robots.txt is a suggestion. Malicious actors ignore it. Use passwords, firewalls, and noindex tags to protect sensitive pages.
Mistake 3: Blocking Your Sitemap
Never disallow your sitemap. Sitemaps should always be crawlable.
Mistake 4: Over-Blocking Your Site
Startups often block too much too fast. Block low-value pages, yes. But don't block entire sections of your site unless you're absolutely sure. You can always tighten up later.
Mistake 5: Forgetting to Update robots.txt When Your Site Changes
If you add new sections, delete old pages, or reorganize your site, update robots.txt accordingly. Set a reminder to review it quarterly.
How Robots.txt Fits Into Your Broader SEO Strategy
Robots.txt is one piece of a larger technical SEO foundation. It works alongside:
Sitemaps: While robots.txt tells bots where NOT to go, sitemaps tell bots where TO go. Read this guide on robots.txt, sitemaps, and canonicals to understand how they work together.
Canonical tags: These tell search engines which version of a page is the "original." They work with robots.txt to consolidate authority on duplicate content.
Noindex meta tags: These tell search engines not to index a page while still allowing crawling. Use this decision tree to understand when to use noindex vs. robots.txt. One caveat to remember: if a page is blocked in robots.txt, Google can't crawl it to see the noindex tag, so pick one mechanism per page.
Google Search Console: This is where you monitor crawl issues, test robots.txt, and see which pages Google is trying to crawl. Set it up in 10 minutes if you haven't already.
For a complete technical SEO foundation, check out the free SEO tool stack every founder should set up. It covers GSC, GA4, Bing Webmaster Tools, Lighthouse, and keyword research tools—all free, all essential.
Real-World Examples
Example 1: E-Commerce Site
You sell products with filters (color, size, price). Each filter combination creates a new URL. Your robots.txt should look like:
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?price=
Disallow: /*&
Allow: /
This blocks filtered pages while allowing crawling of your main product pages and categories.
Example 2: SaaS with Admin Panel
You have a public marketing site and a private app. Your robots.txt should look like:
User-agent: *
Disallow: /app/
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Allow: /
This blocks everything private while allowing crawling of your marketing content.
Example 3: Blog with Dynamic Parameters
Your blog has sorting and filtering options. Your robots.txt should look like:
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=   # optional: only if you want pagination blocked (see the note below)
Allow: /
Wait—should you block pagination? That depends. If you have a large blog, pagination creates separate URLs for each page. Google can handle this, but if your pagination parameters are ugly, block them and use a sitemap instead.
Tools to Help You Build and Test
Google Search Central: The official robots.txt documentation is your source of truth. Bookmark it.
Google Search Console: Test your robots.txt rules and monitor coverage issues.
Ahrefs: Their robots.txt guide includes real-world examples from major sites.
Moz: The beginner's guide is clear and comprehensive.
Semrush: Their robots.txt guide includes optimization tips.
Neil Patel: Beginner's guide with a free generator tool if you want a starting point.
Search Engine Journal: Complete guide with generator and examples.
Yoast: The ultimate guide covers advanced patterns and edge cases.
Quarterly Review Checklist
Robots.txt isn't set-it-and-forget-it. Review it quarterly. Here's a checklist:
- Check Google Search Console's Coverage report for unexpected excluded pages.
- Review your server logs for bots you're blocking (are they bots you actually want to block?).
- Check if you've added new pages or sections that need robots.txt rules.
- Test 5-10 URLs (via URL Inspection in GSC or a quick script) to ensure rules are working.
- Review your crawl stats in GSC. Is Google crawling the right pages?
- Check if your sitemap is still being crawled and indexed.
For a deeper quarterly review process, check out this founder's guide to quarterly SEO reviews. It's a 90-minute template you can repeat every quarter.
Summary: Five Patterns, One Outcome
You now know five robots.txt patterns that solve the most common crawl issues:
- Block low-value pages to preserve crawl budget for content that matters.
- Block specific bots to protect your server from aggressive crawlers.
- Protect sensitive directories with robots.txt, passwords, and noindex tags (defense in depth).
- Block duplicate content from filters, sorting, and parameters to consolidate authority.
- Control crawl rate to prevent your server from being overwhelmed.
Implement the patterns that apply to your site. Start conservative. Test before deploying. Monitor in Google Search Console. Update quarterly.
Robots.txt is a small file with outsized impact. Get it right, and you'll save crawl budget, improve indexing, and accelerate organic growth. Get it wrong, and you'll silently tank your SEO for months.
You shipped. Now make sure Google can crawl what matters.
Next Steps
Your robots.txt is only one part of your technical SEO foundation. After you deploy it:
- Generate a sitemap if you don't have one. This guide covers every stack.
- Verify your domain in Google Search Console. Step-by-step guide here.
- Review your canonicals and www/non-www settings. This guide explains the right setup.
- Run a full technical SEO audit. 14-day bootcamp for busy founders gives you one win per day.
Or, if you want everything done in under 60 seconds, Seoable delivers a domain audit, brand positioning, keyword roadmap, and 100 AI-generated blog posts for a one-time $99 fee. No monthly subscriptions. No agency overhead. Ship faster, rank higher.