Imagine spending six months building a high-quality website, publishing weekly blog content, and earning backlinks, only to realize that none of your pages are showing up in Google search results. For roughly 80% of the site owners we work with who face this problem, the culprit is a misconfigured robots.txt file. The robots.txt file is one of the most misunderstood yet impactful components of technical SEO. It sits in the root directory of your website and acts as a set of instructions for the search engine crawlers that visit your site. Set up correctly, it ensures crawlers spend their time on your most valuable pages, supporting your rankings and organic traffic. Set up incorrectly, it can cause your entire site to drop out of the index within days. This guide walks you through everything you need to know, from basic syntax to advanced rules for large e-commerce sites. You’ll learn how to create, test, and audit your robots.txt file, avoid common mistakes, and use it to optimize your crawl budget for better SEO results.
What Is a Robots.txt File? Core Definition and Purpose
If you want the robots.txt file explained in plain English, start here: it is a plain text file named robots.txt that lives in the root directory of your website (e.g., example.com/robots.txt). It follows the Robots Exclusion Standard, a protocol first introduced in 1994 to help website owners control how automated crawlers interact with their sites. Every major search engine, including Google, Bing, and Yahoo, follows the rules set in this file. The file does not use any special code or formatting, only simple text directives that tell crawlers which pages or directories they may access and which they should skip. It is not a security tool: malicious bots and scrapers will ignore its rules entirely. Its primary purpose is to manage crawl budget, the number of pages a search engine will crawl on your site within a given timeframe. Example: if you visit google.com/robots.txt, you can see Google’s own rules, which block crawlers from certain internal paths such as /search to reduce server load. Actionable tip: open your own site’s robots.txt file right now by typing yourdomain.com/robots.txt into your browser to see whether you already have one. Common mistake: assuming robots.txt can block all bots, which leads owners to store sensitive files in directories blocked by robots.txt, thinking they are private.
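To make that concrete, here is a minimal sketch of what a typical robots.txt file looks like; the blocked paths are illustrative placeholders, not recommendations for your site:

```
# Applies to every crawler that honors the Robots Exclusion Standard
User-agent: *
# Keep crawlers out of low-value areas (example paths only)
Disallow: /cart/
Disallow: /checkout/
# Point crawlers at the XML sitemap so they find important pages faster
Sitemap: https://www.example.com/sitemap.xml
```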
How the Robots.txt File Works: Crawl Directives 101
Search engine crawlers (also called spiders or bots) visit your site to discover and index new content. The first thing most crawlers do when they land on your site is check for a robots.txt file in the root directory. If the file exists, they read its rules before crawling any other pages. The file works on a per-crawler basis, using a user-agent directive to specify which crawler the rules apply to; the * wildcard covers every crawler that follows the Robots Exclusion Standard. After the user-agent line, you add disallow or allow directives to block or permit access to specific paths. Finally, you can add a sitemap directive to tell crawlers where your XML sitemap is located, which helps them find all your important pages faster. Example: if you have a private directory for internal team files at /internal-team-docs/, you would add Disallow: /internal-team-docs/ to stop compliant crawlers from accessing it. Actionable tip: keep the file lean and group rules into clearly separated user-agent blocks; crawlers also enforce size limits (Google, for example, only processes roughly the first 500 KB of the file). Common mistake: assuming that allow rules override disallow rules everywhere, when in fact an allow rule only overrides a disallow for the more specific path it names. For example, with Disallow: /private/ and Allow: /private/public/, crawlers can access /private/public/ but not the other subfolders of /private/.
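Here is a small sketch showing how those pieces fit together in one file, including the allow-overrides-disallow case described above (all paths are placeholders):

```
User-agent: *
# Block the whole /private/ directory...
Disallow: /private/
# ...except the one subfolder that should stay crawlable;
# the more specific Allow rule wins for this path
Allow: /private/public/
# Keep internal team files out of the crawl
Disallow: /internal-team-docs/
# Help crawlers find every important page
Sitemap: https://www.example.com/sitemap.xml
```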
Robots.txt File Syntax: Every Rule You Need to Know
The robots.txt file uses only four core directives, all written in plain text with no special formatting. The first is the user-agent directive, which specifies which crawler the following rules apply to. For example, User-agent: Googlebot applies rules only to Google’s main crawler, while User-agent: * applies to all crawlers. The second is the disallow directive, which tells the crawler not to crawl a specific path. Disallow: / means block all pages, while Disallow: /cart/ blocks only the /cart/ directory. The third is the allow directive, which permits crawling a specific path even if a broader disallow rule applies. The fourth is the sitemap directive, which lists the URL of your XML sitemap.
Key Syntax Rules
Path values are case-sensitive, so Disallow: /Private/ and Disallow: /private/ match different directories (the directive names themselves are not case-sensitive). Every directive must be on its own line, with no spaces at the start of the line. You can use the * wildcard to match any sequence of characters, and the $ symbol to match the end of a URL. Example: Disallow: /search* blocks every URL whose path starts with /search, while Disallow: /product.html$ blocks only the exact /product.html URL, not /product.html?color=red. Actionable tip: always start disallow and allow paths with / so they are resolved from the root directory. Common mistake: adding spaces or special characters to directive names, e.g., Dis allow: /private/, which crawlers will ignore entirely.
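The sketch below puts those pattern-matching rules side by side; the paths are hypothetical:

```
User-agent: *
# Prefix match: blocks /search, /search?q=shoes, /search/results, and so on
Disallow: /search
# Wildcard plus query string: blocks /product?color=red and similar URLs
Disallow: /product?*
# End-of-URL anchor: blocks /thank-you exactly, but not /thank-you/new
Disallow: /thank-you$
```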
Step-by-Step Guide: How to Create and Upload a Robots.txt File
This Robots.txt file explained walkthrough will take you through creating your first file in under 10 minutes, even if you have no technical experience.
Step 1: Check for an existing file
Visit yourdomain.com/robots.txt in your browser. If you see a 404 error, you don’t have a file yet. If you see text, you already have one, so download it to edit instead of starting from scratch.
Step 2: Create the file in a plain text editor
Use Notepad (Windows) or TextEdit (Mac, set to plain text mode). Do not use Microsoft Word or Google Docs, as they add hidden formatting that will break your robots.txt file.
Step 3: Add your rules
Start with User-agent: * to apply rules to all crawlers. Add disallow rules for pages you want to block, e.g., Disallow: /cart/, Disallow: /checkout/. Add your sitemap at the end: Sitemap: https://yourdomain.com/sitemap.xml.
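Putting those rules together, a starter file for a typical small site might look like this; swap in your own paths and domain:

```
User-agent: *
# Block transactional pages that have no value in search results
Disallow: /cart/
Disallow: /checkout/
# Tell crawlers where to find the sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```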
Step 4: Save the file correctly
Save the file as robots.txt. Make sure your computer isn’t adding an extra .txt extension automatically (e.g., robots.txt.txt). Turn off “hide file extensions” in your system settings to check this.
Step 5: Upload to the root directory
Use FTP (FileZilla is a free tool) to upload the file to the top level of your site, not a subfolder. If you use WordPress, you can use a plugin like Yoast SEO to edit your robots.txt without FTP.
Step 6: Test the file
Use the Google Search Console Robots.txt Tester to confirm your rules work as intended.
Step 7: Submit your sitemap
If you haven’t already, submit your sitemap in Google Search Console to speed up indexing. Actionable tip: keep a backup of your old robots.txt file before making changes, so you can revert if something breaks. Common mistake: uploading the file to a subfolder like /blog/robots.txt, which crawlers will not find.
5 Most Common Robots.txt Mistakes (and How to Fix Them)
As we’ve covered in this Robots.txt file explained guide, accidental misconfigurations are the leading cause of sudden organic traffic drops. These are the five most frequent errors we see, and how to fix them.

1. Blocking the entire site with Disallow: / – This rule tells all crawlers to skip every page on your site, and your pages will start dropping out of search results within days. Fix: Never add this rule unless you are taking your site offline permanently.
2. Blocking CSS and JS files – Google needs to crawl your CSS and JS files to render your pages correctly and understand your site’s layout, so blocking them can lead to lower rankings. Fix: Add Allow: /wp-content/plugins/ and Allow: /wp-content/themes/ for WordPress sites, or the equivalent paths for your CMS (see the sketch after this list).
3. Using relative paths instead of root-relative paths – For example, Disallow: private/ instead of Disallow: /private/. A path that does not start with / is invalid, and most crawlers will simply ignore the rule. Fix: Always start every path with a / so it is resolved from the root directory.
4. Forgetting the sitemap directive – Crawlers will still find your sitemap eventually, but adding the Sitemap: directive speeds up discovery of your pages. Fix: Add Sitemap: https://yourdomain.com/sitemap.xml at the very end of your file.
5. Blocking high-value pages by accident – For example, adding Disallow: /product* to block duplicate product pages and accidentally blocking every URL that starts with /product, including your main product listings. Fix: Test every rule in Google Search Console before uploading.

Actionable tip: audit your robots.txt file once per quarter to catch errors early.
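For mistake #2, here is a hedged sketch of what the WordPress-flavored fix might look like; the exact paths depend on your theme, plugins, and any existing disallow rules:

```
User-agent: *
# Keep the admin area out of the crawl, but leave the AJAX endpoint reachable
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Make sure the assets Google needs for rendering stay crawlable
Allow: /wp-content/themes/
Allow: /wp-content/plugins/
```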
Robots.txt vs. Meta Robots Tag: Key Differences You Need to Know
Many site owners confuse robots.txt rules with meta robots tags, but the two serve completely different purposes. Robots.txt controls crawling (whether a crawler can access a page), while meta robots tags control indexing (whether a page can appear in search results). The table below breaks down the key differences:
| Feature | Robots.txt | Meta Robots Tag |
|---|---|---|
| Purpose | Control which pages crawlers can access | Control how pages are indexed and served in search results |
| Scope | Site-wide or path-specific for all/a specific crawler | Page-specific |
| Impact on Crawling | Stops crawlers from fetching the page entirely | Allows crawling, but restricts indexing/serving |
| Crawl Budget Impact | Reduces waste by blocking low-value pages from being crawled | No impact on crawl budget (page is still crawled) |
| Ability to Block Resources | Can block CSS, JS, image files from being crawled | Cannot block resources, only HTML pages |
| Compliance | Followed by most major search engines, ignored by malicious bots | Followed by all major search engines |
Example: If you want to stop crawl budget waste on your site’s internal search results pages, use Disallow: /search* in robots.txt. If you want to stop a specific blog post from appearing in search results, add a meta robots noindex tag to that post’s HTML head. Actionable tip: use robots.txt for crawl management and meta robots for index management. Never use robots.txt to try to noindex a page; Google may still index the URL if it has external links. Common mistake: using Disallow: /private-page/ to stop a page from being indexed, only to have its bare URL show up in search results with a “No information is available for this page” message because Google found links to it from other sites.
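A quick side-by-side of the two mechanisms, using placeholder paths:

```
# robots.txt – crawl control: compliant crawlers never fetch matching URLs
User-agent: *
Disallow: /search*
```

For index control, the meta robots equivalent is a single tag in the page's HTML head: <meta name="robots" content="noindex">. Google has to be able to crawl the page to see that tag, which is exactly why the two tools are not interchangeable.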
How Robots.txt Impacts Crawl Budget (and Why It Matters for SEO)
Crawl budget is the number of pages Google will crawl on your site within a given timeframe, determined by how much demand there is for your content (crawl demand) and how much crawling your server can handle (crawl capacity). For small sites with under 1,000 pages, crawl budget is rarely an issue. For large e-commerce sites with 100,000+ pages, mismanaging crawl budget can mean Google takes months to index new product pages. Robots.txt is the most effective tool for optimizing crawl budget, because it lets you block crawlers from wasting time on low-value pages that don’t need to be indexed. Low-value pages include internal search results, faceted navigation filters (e.g., /product?color=red), tag pages, and author pages. Example: A home goods e-commerce client had 150,000 total pages, but 80,000 were duplicate faceted navigation pages. We added Disallow: /product?* to their robots.txt file, which freed up crawl budget for core product and category pages. Within 6 weeks, 90% of their new product pages were indexed within 48 hours of publishing, and organic traffic to product pages increased 18%. Actionable tip: audit your site for low-value pages using this crawl budget guide, then block them in robots.txt. Common mistake: blocking high-value pages like category pages by accident, which cuts off crawl budget to your most important content.
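For reference, the single rule from this example looks like the sketch below; your faceted URLs may live under different paths, so verify the pattern against your own URL structure before deploying it:

```
User-agent: *
# Blocks faceted-navigation URLs such as /product?color=red&size=10
Disallow: /product?*
# Clean URLs like /product/blue-ceramic-vase or /category/kitchen are unaffected,
# because the rule only matches paths beginning with "/product?"
```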
Real-World Case Study: Fixing a Robots.txt Error That Dropped Organic Traffic by 80%
In early 2023, we worked with a SaaS company that launched a redesigned website after 18 months of development. The launch went smoothly, but within 3 days, their organic traffic dropped from 12,000 monthly visits to 2,400 – an 80% decline. No technical SEO changes had been made except the site redesign. Our first step was to check their robots.txt file, which read: User-agent: * Disallow: /. This rule was copied from their staging site, where it was added to prevent staging content from being indexed. The development team forgot to remove the rule before pushing the new site live.
Solution
We took five steps:
1. Removed the Disallow: / rule immediately.
2. Added Disallow: /staging/ and Disallow: /dev/ rules to keep future staging content out of the crawl.
3. Added the sitemap directive for their new XML sitemap (a sketch of the corrected file follows this list).
4. Tested the updated file in Google Search Console to confirm no errors.
5. Submitted the updated robots.txt and sitemap via GSC, and requested re-indexing of their top 50 priority pages.
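The client’s exact file isn’t reproduced here, but the corrected version followed roughly this shape (domain and paths are placeholders):

```
User-agent: *
# The site-wide "Disallow: /" carried over from staging has been removed
Disallow: /staging/
Disallow: /dev/
Sitemap: https://www.example.com/sitemap.xml
```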
Result
70% of lost traffic recovered within 10 days, and 100% recovery was reached 21 days after the fix. Because crawl budget was no longer wasted on blocked pages, their new blog content started ranking 3x faster than before the redesign. Monthly organic leads increased 22% over their pre-redesign baseline. Actionable tip: always check the robots.txt file on staging sites, and ensure staging-specific rules are not pushed to live environments. Common mistake: assuming that because a site is accessible in a browser, it’s accessible to crawlers – browsers don’t follow robots.txt rules, but crawlers do.
Top Tools to Test and Audit Your Robots.txt File
You don’t need to guess whether your robots.txt file is working correctly – these four free and paid tools will help you test and audit your rules.

1. Google Search Console Robots.txt Tester – Free tool built into GSC that lets you paste a robots.txt file or test your live file, then enter a URL to see if it’s allowed or blocked for a specific crawler. Use case: Verify new rules work before uploading them live.
2. Ahrefs Site Audit – Paid tool that crawls your site and flags robots.txt errors, including blocked high-value pages, blocked resources, and invalid syntax. Use case: Monthly site audits for mid-sized and large sites.
3. SEMrush Site Audit – Paid tool that identifies crawlability issues caused by robots.txt, including wasted crawl budget on disallowed pages and missing sitemap directives. Use case: Large e-commerce sites managing 10k+ URLs.
4. Moz Pro Crawl Test – Paid tool that simulates how search crawlers interact with your robots.txt file and site structure, including advanced wildcard rules. Use case: Testing complex rules for large sites.

Example: Use the GSC tool to test your homepage, 5 core product pages, and your sitemap URL every time you update your robots.txt file. Actionable tip: never skip testing, even for small changes – one misplaced / can block your entire site. Common mistake: relying on text editors to check syntax, which won’t catch path errors or invalid directives.
Advanced Robots.txt Rules for Large Sites and E-Commerce
Small sites can get away with basic robots.txt rules, but large sites with 10,000+ pages need advanced rules to manage crawl budget and avoid duplicate content issues. The * wildcard is your best friend here: it matches any sequence of characters, so you can block entire categories of low-value URLs with one rule. Example: Disallow: /search* blocks all URLs whose path starts with /search, including /search?q=shoes and /search?q=boots. Disallow: /product?* blocks all product URLs with query parameters, like /product?color=red&size=10. You can also use the $ symbol to match the end of a URL: Disallow: /thank-you$ blocks only the /thank-you page, not /thank-you/new or /thank-you/123. Beyond wildcards, you can set rules for specific crawlers: User-agent: Googlebot-Image followed by Disallow: / blocks Google’s image crawler from your entire site if you don’t want your images indexed. Example: A fashion e-commerce client with 200,000 product pages used these rules to block faceted navigation, internal search, and duplicate tag pages: Disallow: /filter*, Disallow: /search*, Disallow: /tag*, Disallow: /author*. This reduced their crawled pages by 65%, and new product pages were indexed 4x faster. Actionable tip: only use wildcards if you are 100% sure they won’t block unintended pages – test every wildcard rule in GSC first. Common mistake: overusing wildcards, e.g., Disallow: /p* to block /product pages, but accidentally blocking /privacy-policy or /press-releases.
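Pulled together, an advanced file along the lines described in this section might look like the sketch below; the paths mirror the examples above and would need to be adapted (and tested) against your own URL structure:

```
# Rules for all compliant crawlers
User-agent: *
# Faceted navigation, internal search, and thin archive pages
# (rules are prefix matches, so the trailing * is optional)
Disallow: /filter*
Disallow: /search*
Disallow: /tag*
Disallow: /author*
# Parameterized product URLs; clean /product/ URLs stay crawlable
Disallow: /product?*
# Exact-match block using the end-of-URL anchor
Disallow: /thank-you$

# Crawler-specific block: keep images out of Google Images entirely
User-agent: Googlebot-Image
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
```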
Quick AEO Answers: Common Robots.txt Questions
These short answers are optimized to appear in featured snippets and AI search results, answering the most common user questions directly.
Does robots.txt block all search engines? No, only crawlers that follow the robots exclusion standard. Malicious bots, scrapers, and some private crawlers will ignore robots.txt rules entirely.
Can I use robots.txt to noindex a page? Not reliably. Robots.txt stops crawlers from accessing a page, but Google may still index the URL if it finds links to it from other sites. Use a meta robots noindex tag instead for index control.
Where do I upload my robots.txt file? Always upload it to the root directory of your domain (e.g., example.com/robots.txt). Uploading it to a subfolder like example.com/blog/robots.txt will make it invisible to most crawlers.
How often should I update my robots.txt file? Audit and update your file whenever you launch new site sections, remove old content, or change crawl priorities. For most sites, a quarterly check is sufficient unless you make frequent major changes.
How to Check If a Page Is Blocked by Robots.txt
Even if you’ve tested your robots.txt file, it’s important to periodically check high-priority pages to ensure they aren’t blocked. The easiest way to do this is with the Google Search Console Robots.txt Tester: enter the URL of the page you want to check, select the crawler (e.g., Googlebot), and click test. If the result says “Allowed”, the page is accessible; if it says “Blocked”, the page is blocked by your robots.txt rules. You can also check manually by reading your robots.txt file and seeing whether any disallow rules apply to the page’s path. Example: if your page is at example.com/blog/how-to-use-robots-txt, check whether any disallow rules cover /blog/, /blog/how-to-use-robots-txt, or wildcard patterns that would match the path. Actionable tip: test your homepage, top 10 traffic-driving pages, and all sitemap URLs once per month; if you find a blocked high-value page, remove the conflicting rule immediately and test again. Common mistake: assuming that because a page isn’t named explicitly in robots.txt, nothing blocks it – broader directory rules and wildcard patterns can still match its URL, so always test the full URL rather than scanning the file for an exact match.
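If you prefer to script these spot checks, Python’s standard library includes a robots.txt parser. The sketch below assumes hypothetical URLs; note that urllib.robotparser implements the original 1994 standard and does not interpret * or $ wildcards the way Googlebot does, so treat it as a quick sanity check rather than a replacement for testing in Search Console:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file (hypothetical domain)
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetches and parses the file

# Placeholder URLs; substitute your own high-priority pages
urls_to_check = [
    "https://www.example.com/",
    "https://www.example.com/blog/how-to-use-robots-txt",
    "https://www.example.com/cart/",
]

for url in urls_to_check:
    allowed = parser.can_fetch("Googlebot", url)
    print(("Allowed" if allowed else "Blocked") + "  " + url)
```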
Frequently Asked Questions
What happens if I don’t have a robots.txt file?
Nothing negative – crawlers will assume they can access all public pages on your site. You only need a robots.txt file if you want to block specific crawlers or pages from being crawled.
Can I have multiple user-agent rules in one robots.txt file?
Yes, you can add separate user-agent blocks for different crawlers (e.g., one for Googlebot, one for Bingbot) to set custom rules for each. Each block should start with a User-agent: line, followed by the relevant disallow/allow rules.
Does robots.txt affect paid search ads?
Mostly no. Google Ads’ landing page crawler (AdsBot) ignores rules placed under the generic User-agent: * block, so a blanket disallow will not stop your ads. It does, however, obey rules that name it explicitly (e.g., User-agent: AdsBot-Google), and blocking it can cause landing pages to be disapproved as uncrawlable, so avoid targeting AdsBot unless you have a specific reason.
How do I block a specific crawler with robots.txt?
Add a user-agent line for that crawler (e.g., User-agent: Bingbot) then add Disallow: / to block all pages, or Disallow: /private/ to block a specific path. You can find a list of crawler user-agent strings on Moz’s guide to crawler user-agents.
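As a quick illustration of that pattern (Bingbot is just the example crawler; substitute any user-agent string):

```
# Block one specific crawler from the entire site
User-agent: Bingbot
Disallow: /

# Every other crawler keeps full access (an empty Disallow allows everything)
User-agent: *
Disallow:
```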
Can I use wildcards in robots.txt?
Yes, the * wildcard matches any sequence of characters, and $ matches the end of a URL. For example, Disallow: /search* blocks all URLs starting with /search, and Disallow: /product.html$ blocks only the exact /product.html URL.
Will a 404 error for robots.txt hurt my SEO?
No, as mentioned earlier, crawlers will treat a missing robots.txt as permission to crawl all pages. You only need to fix a 404 for robots.txt if you are trying to block specific content, in which case you need to create the file.