- Basics of robots.txt syntax
- Examples of usage
Robots.txt and SEO
- Removing exclusions of images
- Adding reference to your sitemap.xml file
- Miscellaneous remarks
Robots.txt is a text file located in the site's root directory that tells search engine crawlers and spiders which pages and files of the website you want or don't want them to visit. Site owners usually strive to be noticed by search engines, but there are cases when this is not needed: for instance, if you store sensitive data or want to save bandwidth by excluding heavy pages with images from indexing.
When a crawler accesses a site, it first requests a file named "/robots.txt". If such a file is found, the crawler checks it for the website's indexing instructions.
NOTE: there can be only one robots.txt file per website. The robots.txt file for an addon domain needs to be placed in the corresponding document root.
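Since a crawler always looks for the file at the root of the host, the location of robots.txt can be derived from any page address. The following is a minimal Python sketch of that lookup rule; the function name robots_url and the example.com addresses are illustrative assumptions, not part of any crawler's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address where a crawler would look for robots.txt.

    Hypothetical helper: keeps the scheme and host of the page URL and
    replaces the path, query and fragment with "/robots.txt".
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Each host (for example an addon domain) gets its own file.
    print(robots_url("https://example.com/blog/post.html?id=1"))
    print(robots_url("https://addon.example.org/shop/"))
```

Because the host is part of the result, an addon domain served from a different document root resolves to its own robots.txt file.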
Google's official stance on the robots.txt file
A robots.txt file consists of records that contain two fields: a line with a user-agent name (the search engine crawler) and one or several lines starting with the Disallow: directive.
Robots.txt has to be created in UNIX text format.
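If the file was written in an editor that uses Windows or classic Mac line endings, it can be converted before uploading. The sketch below assumes that "UNIX text format" means plain text with LF ("\n") line endings; the helper name to_unix_newlines is hypothetical.

```python
def to_unix_newlines(text: str) -> str:
    """Convert CRLF (Windows) and lone CR line endings to LF (UNIX)."""
    # Replace CRLF first so that it does not turn into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    windows_text = "User-agent: *\r\nDisallow: /tmp/\r\n"
    print(repr(to_unix_newlines(windows_text)))
```

When saving the result from Python, opening the file with open(path, "w", newline="\n") keeps the LF endings on every platform.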
Basics of robots.txt syntax
A robots.txt file usually contains something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example, three directories ("/cgi-bin/", "/tmp/" and "/~different/") are excluded from indexing.
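The effect of these rules can be checked locally with the robots.txt parser from Python's standard library (urllib.robotparser). This is only a sketch: the crawler name "ExampleBot" and the helper is_allowed are assumptions made for illustration, and real crawlers may interpret some rules differently.

```python
from urllib.robotparser import RobotFileParser

# The three excluded directories, one Disallow line per directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given crawler may fetch the given path."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/script.cgi", "/tmp/cache.html"):
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", path))
```

Paths under the three listed directories are reported as not allowed, while everything else on the site stays open to crawlers.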
NOTE: every directory is written on a separate line. You can't write «Disallow: /cgi-bin/ /tmp/» on one line, nor can you split a single Disallow or User-agent directive across several lines. Use a new line to separate directives from each other.
«Star» (*) in the User-agent field means "any web crawler". It is not a general-purpose wildcard, so directives such as «Disallow: *.gif» or «User-agent: Mozilla*» are not supported by the original robots.txt standard. Pay attention to such logical mistakes, as they are among the most common ones.
Other common mistakes are typos: misspelled directories or user-agents, missing colons after User-agent and Disallow, etc. As your robots.txt file gets more complicated, it becomes easier for an error to slip in, and validation tools come in handy: http://tool.motoricerca.info/robots-checker.phtml
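Some of the mistakes listed above can also be caught with a few lines of code before the file is uploaded. The following Python sketch is not a replacement for a full validator: the function lint_robots is hypothetical, and the set of accepted field names (including Allow, Sitemap and Crawl-delay) is an assumption about which directives you want to permit.

```python
# Field names this sketch accepts; an assumption, adjust to your needs.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for common robots.txt mistakes."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field not in KNOWN_FIELDS:
            # Usually a typo such as "Disalow".
            problems.append((number, f"unknown field {field!r}"))
        elif field in ("disallow", "allow") and len(value.split()) > 1:
            # Every directory has to be written on its own line.
            problems.append((number, "more than one path on a line"))
    return problems


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disalow: /tmp/\n"
        "Disallow /cgi-bin/\n"
        "Disallow: /cgi-bin/ /tmp/\n"
    )
    for number, problem in lint_robots(sample):
        print(f"line {number}: {problem}")
```

The check is deliberately conservative: it only reports lines that are clearly malformed and does not verify that the listed directories exist on the site.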