What is a robots.txt file

Prevent Search Engines and Bots
From Crawling or Indexing Your Site

How does a robots.txt file work?

Updated: September 12, 2022
By: RSH Web Editorial Staff

robots.txt file

A robots.txt file is a text file that tells web robots (also known as spiders or crawlers) which pages on your website to crawl and which to ignore.

When a robot crawls a website, it reads the robots.txt file to check for instructions on which pages it should crawl and which it should ignore.

General Information

When a crawler accesses a website, it first requests a file named "robots.txt". If the file is found, the crawler checks it for instructions on what it may crawl.

The file is located in the website's root directory and tells bots which pages and files you do or do not want them to crawl or index.
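As a rough illustration of that check, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the bot name "ExampleBot" and the URLs are made up for this example; a real crawler would download the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A page outside the disallowed folder may be crawled.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://yourwebsite.com/index.html"))
# A page inside the disallowed folder may not.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://yourwebsite.com/private/report.html"))
```

The first call prints True and the second prints False.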

Website owners normally want to be noticed by the search engines, but there are cases when this is not wanted. For example, you may store data that you do not want to appear in search results, or you may need to save bandwidth by keeping crawlers away from pages with a multitude of images.

You can typically view the file by typing the full URL of the homepage and then adding /robots.txt, for example:
https://rshweb.com/robots.txt
Nothing on the site links to the file, so visitors will not stumble upon it, but most web crawler bots look for it before crawling the rest of the site.
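If you want to work out that address from any page on a site, a small sketch like the following can help. The function name `robots_url` and the sample page path are our own and only serve the example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The page path below is hypothetical.
print(robots_url("https://rshweb.com/blog/some-article?page=2"))
# https://rshweb.com/robots.txt
```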

Uses of a "robots.txt" file

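The most important use of a robots.txt file is to keep crawlers away from parts of a website. Keep in mind that the file is a request, not an access control: well-behaved bots follow it, but it does not stop anyone from opening a page.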

Not everything on a website should be shown to the public or the search engines.

NOTE: There can be only one robots.txt file per website. A robots.txt file for an add-on domain or subdomain needs to be placed in the corresponding document root.


How to create a "robots.txt" file

The robots.txt file is created in your website's root folder, for example "/yourwebsite.com/robots.txt". Note the spelling: the file must be named "robots.txt", not "robot.txt".

A "robots.txt" file is just a plain text file. You can make it with any text editor, such as Notepad++, and then upload it to your website with an FTP program.
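If you would rather generate the file with a script, the sketch below shows one way to do it in Python. The rules and the helper name `write_robots_txt` are placeholders of ours; the demo writes to a temporary folder so it is safe to run anywhere, and you would still need to upload the result to your document root.

```python
import tempfile
from pathlib import Path

# Hypothetical rules; replace them with the ones your site needs.
RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

def write_robots_txt(directory: str, rules: list[str]) -> Path:
    """Write the rules to robots.txt in directory and return the file path."""
    target = Path(directory) / "robots.txt"
    # One directive per line, plain UTF-8 text, newline at the end of the file.
    target.write_text("\n".join(rules) + "\n", encoding="utf-8")
    return target

# Demo: write the file into a temporary folder and show its contents.
with tempfile.TemporaryDirectory() as folder:
    path = write_robots_txt(folder, RULES)
    print(path.read_text(encoding="utf-8"))
```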

You can also use the cPanel File Manager to create this file right on your website
Also see Setting up and using the FTP Interface in cPanel

WordPress also has plugins designed just to make robots.txt files:
Virtual Robots.txt
Robots.txt Editor
Robots.txt Quick Editor
Booter - Bots and Crawlers Manager
Multisite Robots.txt Manager

The basic syntax for the robots.txt file

• User-agent: [The name of the robot for which you are writing these rules]

• Disallow: [The page, folder or path that you want to hide]

• Allow: [The page, folder or path that you want to unhide]

• Sitemap: [The location of any XML sitemap(s) associated with this site]. Note that this directive is only supported by Google, Ask, Bing, and Yahoo.

• Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
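To see how a parser reads these directives, here is a short sketch with Python's standard `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The rules, the delay value and the bot name "ExampleBot" are assumptions made for this example only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that uses most of the directives listed above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
Sitemap: https://yourwebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the named bot is asked to wait between requests.
print(parser.crawl_delay("ExampleBot"))
# Sitemap URLs listed anywhere in the file.
print(parser.site_maps())
# Whether the named bot may fetch a page under the disallowed folder.
print(parser.can_fetch("ExampleBot", "https://yourwebsite.com/private/a.html"))
```

This prints 10, then a list holding the sitemap URL, then False.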

Example 1

To allow everything to be crawled (all search engines)
User-agent: *
Disallow:

Example 2

To disallow crawling of everything (all search engines)
User-agent: *
Disallow: /

Example 3

To disallow a specific folder (all search engines)
User-agent: *
Disallow: /foldername/

Example 4

To disallow a specific file (all search engines)
User-agent: *
Disallow: /filename.html

Example 5

To disallow a folder but allow crawling of one file in that folder (all search engines)
User-agent: *
Disallow: /folderxyz/
Allow: /folderxyz/anyfile.html
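Example 5 depends on how a crawler resolves a conflict between "Disallow" and "Allow". The sketch below assumes the "most specific rule wins" approach described in RFC 9309 and used by Google, where the longest matching path decides. It is a simplified illustration of ours (no wildcard support), and some older parsers apply the first matching rule instead, so results can differ between bots.

```python
def is_path_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, path_prefix) rules."""
    best_len = -1
    allowed = True  # No matching rule means the path may be crawled.
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # An empty value or a non-matching prefix matches nothing.
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, Allow wins over Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# The two rules from Example 5.
RULES = [("Disallow", "/folderxyz/"), ("Allow", "/folderxyz/anyfile.html")]
print(is_path_allowed(RULES, "/folderxyz/anyfile.html"))  # True
print(is_path_allowed(RULES, "/folderxyz/other.html"))    # False
print(is_path_allowed(RULES, "/index.html"))              # True
```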

Example 6

To allow only one specific robot to access the website
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
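You can check the rules from Example 6 before uploading them. This sketch feeds them to Python's standard `urllib.robotparser`; the page URL and the name "OtherBot" are made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The rules from Example 6.
EXAMPLE_6 = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_6.splitlines())

URL = "https://yourwebsite.com/page.html"  # hypothetical page
print(parser.can_fetch("Googlebot", URL))  # True
print(parser.can_fetch("OtherBot", URL))   # False
```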

Example 7

To exclude a single robot (replace BadBotName with the robot's user agent name)
User-agent: BadBotName
Disallow: /

Example 8

To tell crawlers where your sitemap file is
User-agent: *
Disallow:
Sitemap: http://www.yourwebsite.com/sitemap.xml

Example 9

A WordPress robots.txt file
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /readme.html
Disallow: /xmlrpc.php
Allow: /wp-content/uploads/
Sitemap: https://yourwebsite.com/sitemap.xml


Common search engine bot user agent names

Google:
Googlebot
Googlebot-Image (for images)
Googlebot-News (for news)
Googlebot-Video (for video)

Bing
Bingbot
MSNBot-Media (for images and video)

DuckDuckGo
DuckDuckBot
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

Baidu
Baiduspider
Baidu Web Search
Baidu Image Search

For a more complete list of search engine bot user agent names, see perishablepress.com

Tip: Do not disallow files that you want bots to crawl, and do not list individual files that you especially want to hide. Anyone can read the robots.txt file, so listing those files tells everyone about them. We recommend putting them inside a folder and disallowing that folder instead.

Other common mistakes are typos, misspelled directories or user-agent names, and missing colons after "User-agent" and "Disallow".

When your robots.txt file gets complicated, it is easy for an error to slip in.
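A quick script can catch some of these slips before you upload the file. The checker below is only a rough sketch of ours, not an official validator: it knows just the five directives covered in this article and looks for missing colons, misspelled directive names, and paths that do not start with "/".

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # Drop comments and whitespace.
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing colon")
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        if name.lower() not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{name}'")
        elif (name.lower() in {"allow", "disallow"}
              and value and not value.startswith(("/", "*"))):
            problems.append(f"line {number}: path should start with '/'")
    return problems

# A deliberately broken sample with one mistake per line.
SAMPLE = """\
User-agent *
Dissallow: /private/
Disallow: private/
"""
for problem in lint_robots_txt(SAMPLE):
    print(problem)
```

For the sample above it reports a missing colon on line 1, an unknown directive on line 2, and a bad path on line 3. A file with no detected problems returns an empty list.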
