Where is the Robots.txt file and what does it do?

Robots.txt files are a standard used by websites to communicate with web crawlers and other web robots. Deciding whether you need a robots.txt file for your new web hosting can be difficult. This article explains how robots.txt files work and whether you need one for website optimisation.

What Is A Robots.txt File?

Before web crawlers such as Googlebot scan your website's content, they look for the robots.txt file. This file contains specific instructions about which files and pages the crawler can and cannot access. Search engines such as Google use it to map out your website's content and decide how your website will be ranked.
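
As a simple sketch (the folder name here is only a placeholder, not something your site needs to contain), a robots.txt file is made up of one or more sections that name a crawler and then list what it may or may not visit:

User-agent: *
Disallow: /example-folder/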

How Can I Use A Robots.txt File?

Prevent Server Throttling: When a web crawler scans a website without a robots.txt file, it will go through every page, script and image. While this happens, your web server is busy dealing with the crawler's requests, which can reduce performance and lead to slower-loading pages for your users. Prevent this by blocking web crawlers from accessing scripts and images that do not need to be indexed for website optimisation, so the crawler only scans the pages you want indexed.
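
As a sketch of this idea, assuming your scripts and images live in folders called /scripts/ and /images/ (adjust the paths to match your own site), the following rules keep crawlers out of both folders:

User-agent: *
Disallow: /scripts/
Disallow: /images/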

Improve Your Search Engine Rankings: Search engines rank your website based on what their crawlers can access, and your robots.txt file controls that access. Optimising your robots.txt file is good SEO practice and increases your chance of ranking well.

Block Images or Webpages From Appearing In Search Results: You may specialise in selling photos on your website. If a search engine indexes your images in an image search, people may steal your content for their own use without paying you royalties. To prevent this, you can block search engines from accessing your images, which helps prevent unauthorised use of your work.
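
One possible way to do this, sketched below, is to block Google's image crawler, Googlebot-Image, from the whole site; regular Googlebot can still crawl your pages, but your photos should stay out of Google's image search:

User-agent: Googlebot-Image
Disallow: /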

Do I Need A Robots.txt File?

Most websites use robots.txt files; however, not every website requires one. Here are a few guidelines to follow when deciding whether you need a robots.txt file.

When Do I Need To Use Robots.txt?

  • Your website may contain content you don’t want search engines to rank. A robots.txt file allows you to block this content from being indexed.
  • Blocking crawlers too broadly can make advertising difficult. You do not want to block advertising crawlers, as this can prevent your ads from being served; a robots.txt file lets you restrict other crawlers while still letting advertising crawlers through (see the sketch after this list).
  • You may still be working on your website and not want it ranked in search engines until it is finished. You can block web crawlers completely within your robots.txt file.
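
As an illustration of the advertising point above, the following sketch blocks general crawlers from a work-in-progress area while still giving Google's AdSense crawler, Mediapartners-Google, full access; the /drafts/ folder is only a placeholder:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /drafts/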

When Do I Not Require Robots.txt?

  • If you do not need to block any pages from appearing in search results, you do not require a robots.txt file.
  • You want all of your pages indexed by search engines.

Examples Of Robots.txt Files

To show how robots.txt files work, here are a few examples.

1. Allow Full Access

When required, you can indicate to web crawlers that they have full access, and they will scan all of your folders and pages. Leaving the robots.txt file out entirely has the same effect.

User-agent: *
Allow: /

2. Allow Access To Certain Folders

If you want to indicate to a web crawler that it may only access certain folders, you can do so by allowing those folders and blocking everything else.

User-agent: *
Allow: /Directory/
Disallow: /

3. Block All Access

Use this to block web crawlers from accessing all files on your server. This will have a negative effect on your search engine rankings, as search engines cannot scan your website and therefore won't index any pages.

User-agent: *
Disallow: /

4. Block Access To Folders

Use this to block web crawlers from accessing certain folders. This is useful for blocking access to sensitive folders containing personal information.

User-agent: *
Disallow: /folder-name/

5. Block Access To Files

Use this to block web crawlers from accessing certain files or pages within your website. This is useful for pages you don’t want ranked.

User-agent: *
Disallow: /filename.html

6. Block Access For Certain Crawlers

This will block access for the named crawlers; crawlers that are not named will still have access.

User-agent: CrawlerName
Disallow: /

7. Allow Access To Certain Crawlers

This will indicate to a specific crawler that it has full access. Only the named crawler will follow this section.

User-agent: CrawlerName
Disallow:

“User-agent: *” indicates that the section applies to all robots. Using “User-agent: Googlebot” ensures the section only applies to Google's crawler, Googlebot.

The “Allow:” directive indicates to web crawlers which pages or folders they are allowed to access and index. This is useful because it lets you single out the pages you want indexed, ensuring crawlers focus on them.

The “Disallow:” directive indicates to robots which pages or folders they are not allowed to access. This can be used to prevent crawlers from scanning pages or folders you do not want to appear in search results.
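
To illustrate how the two directives can work together (the folder and file names below are placeholders), major crawlers such as Googlebot apply the most specific matching rule, so you can block a folder while still allowing one page inside it:

User-agent: *
Allow: /private/public-page.html
Disallow: /private/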

How Do I Make A robots.txt File?

Creating a robots.txt file for your new web hosting is easy: it is a plain text file that provides instructions for crawlers such as Googlebot.

This can be done by opening a text editor such as Notepad. Add a section for each user-agent you want to address, listing which files or folders it can or cannot access.
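
Putting the earlier examples together, a complete robots.txt file might look like the sketch below; the folder names and sitemap URL are placeholders, and the Sitemap line is an optional extra that points crawlers at your sitemap:

User-agent: *
Disallow: /scripts/
Disallow: /drafts/

Sitemap: https://www.yourwebsite.com/sitemap.xml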

Where Should I Put My robots.txt file?

When a web crawler scans your website, it looks for the robots.txt file first. It does this by taking your website's URL (www.yourwebsite.com) and adding /robots.txt to the end of it (for example, www.monsterhost.com/robots.txt). When adding your robots.txt file, it is important to place it in your website's root directory, the same directory as your index.html file. Remember that the file must be called “robots.txt”, not “Robots.txt” or “robot.txt”.

Is robots.txt A Security Feature?

No, a robots.txt file is not a security feature. Regardless of its content, anyone can still access folders that are not set up with proper security. A robots.txt file is a simple text document that well-behaved web crawlers choose to follow; it does not actually prevent crawlers, or anyone else, from accessing restricted directories.