Robots.txt: What You Need to Know

Your Guide to Robots.txt

Robots.txt is a simple, but powerful, file that webmasters use in presenting a website to Google. However, even a small error in your Robots.txt file can wreak havoc with how your website is being crawled and indexed. In this article, we’ll discuss what is a Robots.txt file, why it’s so important, and how you can create and optimize a Robots.txt file for your website.

What is a Robots.txt File?

Robots.txt file, also called robots exclusion protocol (REP) is a text file that webmasters use to tell robots which pages on their site can be crawled and which can’t be.

The first thing a crawler does when it visits a site is to check its Robots.txt for instructions on how to crawl it.

Robots.txt files have different types. The most common command on a Robots.txt file looks like this:

Robots.txt file with a disallow directive

The asterisk after user-agent signifies that this instruction is for all web robots that will visit the site. The forward slash after disallow represents that none of the pages on the website can be visited.

Why is a Robots.txt File Important?

Now you may look at the directives above and wonder, “Why aren’t we allowing a crawler to visit our website? Doesn’t everybody want their website to be crawled and indexed?”

Well, to use the favorite words of SEOs: it depends.

In most cases, you don’t want Googlebots to crawl and index certain pages of your website. That’s why a Robots.txt file, and its Disallow and Noindex directives are so important.

One reason could be that your website may have lots of pages. And a crawler’s job is to crawl every single one of them. Having lots of pages makes it longer for the crawler to go through your entire website. This trickles down negatively on your ranking because there is a crawl budget that a crawler needs to adhere to. And there are two key considerations to it: crawl limit and crawl demand.

According to Google, crawl limit is “the number of simultaneous parallel connections GoogleBot may use to crawl the site, as well as the time it has to wait between the fetches”. The crawl limit can change based on crawl health which is how quickly a site responds to the crawler and the limit set in the search console.

Crawl demand on the other hand refers to the demand from indexing. This means popular URLs will be crawled more often to maintain freshness in the index.

To this end, we can define crawl budget as the number of URLs a crawler can and wants to crawl. With an optimized Robots.txt file, you tell the crawler which pages are most important for your website and can therefore be included in the crawl budget. That’s why Robots.txt files are super important.

Where to Find the Robots.txt File?

Here’s how you can check if you have a Robots.txt file for your website (and you can do this exercise to check for any website for that matter). Open up your browser and in the address bar type in your website’s domain and add “/robots.txt” at the end. So for example “domain.com” will look like “domain.com/robots.txt”.

If you see something like this, that means you have a Robots.txt file:

Check if you have Robot.txt

If it returns with nothing to show for that means your file is empty and if the results show a 404 error, you’d want to have that look into and get it fixed right away.

With this exercise, you’d know if your website has a Robots.txt file or not. You will have to create one from scratch if it doesn’t exist. For that, make sure you’re using a plain text editor like a Notepad for Windows or a TextEdit for Mac.

To look for your Robots.txt file, you need to go to your website’s root directory. The root directory can be found in your hosting website’s account and then going to the file management or FTP section. You will find something like this:

File Management Section

Look for the Robots.txt file and open it up to start editing. If there is any text, delete it and keep the file.

For WordPress users, the Robots.txt file may appear at yourwebsite.com/robots.txt but it may not show in the files. This can be attributed to the fact that WordPress makes virtual Robots.txt file if there are none in your root directory. If this happens with you, you’d need to create one from scratch. You will learn this in our next section.

How to Create a Robots.txt File?

Before you get going with creating the Robots.txt file, try to get yourself familiarized with the syntax used in it. Here is a link from Google with all the basic terminologies.

Then open up your plain text editor and the first thing you’d want to do is set up your user-agent term. Since we want it to apply to all web robots, we’ll use an asterisk like this:

Defining user agents in Robots.txt

Then type disallow but don’t write anything after that. We want all of our pages on the website to be visited by the crawler. Here’s how it would look:

Command to crawl all pages

To link your XML sitemap, this is what you type:

Link your XML Sitemap with a robots.txt file

This is how a basic Robots.txt file looks like. But it’s time to give it a spin in terms of optimizing it.

As mentioned earlier, the more smartly your crawl budget is used, the better it is for your SEO. So for example, there is no use of including a page in the crawl budget which is used to login to the backend of a website. Hence you can go ahead and remove it from the crawl budget like this:

Remove backend login page from crawling using a robots.txt file

A similar directive can be used to tell the crawler not to crawl specific pages. For example, if you don’t want the page “www.domain.com/abc/” to be included in the crawl budget, you can take the part “/abc”, add a slash at the end and write it after Disallow like this:

Remove a specific page from crawling using a robots.txt file

You can also use this directive when you have a print-friendly version of a page or when you are doing AB testing and include any one version in the Disallow directive. And since your Thank You pages should only be shown to qualified leads, you can use this directive for them as well.

Whereas the Disallow directive blocks the crawler from visiting the page, it doesn’t necessarily mean that the page won’t be indexed. So to prevent indexing, you can use the Noindex directive in unison with the Disallow directive.

Testing Everything Out!

Once you’ve created your Robots.txt file or customized it, you then need to make sure that everything is working smoothly, especially your Robots.txt file. Using the Google Search Console, you can validate your crawl directives. You can also sign in to your webmasters’ account and make use of Google’s free Robots.txt tester tool.

Once in the dashboard, select your website, and in the sidebar on the left, click “crawl”. Then click on the Robots.txt tester in the drop-down and replace any old code with your new one. Hit the “Test” button and if it turns to “Allowed”, congratulations, your Robots.txt is all good to go.

Robot.txt tester

While optimizing your Robots.txt is often overlooked in the SEO process, a little effort on that front can’t do any harm. So go ahead and implement your learnings from this blog on your website. We hope you will be pleasantly surprised with the results.

Leave a Reply

Your email address will not be published. Required fields are marked *