Shield Privacy from Bots & Put the ‘Wall Up’ with Robots.txt

What do you do when you want to shoo the search engine spiders away from parts of your website?
Put up a board that reads, 'Please do not sneak in.' You might have a few pages you want to keep away from the prying eyes of search engines, or you might want to stop them from crawling the images, style sheets or JavaScript on your site and save some bandwidth. This is where robots.txt comes in. It puts up a kind of wall that tells search engines to march right past those pages.

Robots.txt is a text file on a site that tells search engines not to crawl, and therefore not index, certain pages. As a rule, search engines do not disappoint: they play by the book and obey it. However, you cannot rely on robots.txt alone to hide sensitive data. Even so, robots.txt earns its keep in plenty of ways. Let's find out how.

Where is Robots.txt Required?

A robots.txt file lays down the rules that search traffic is expected to follow on the web. Crawlers look for a robots.txt file and obey its signals, which keeps web crawling organized by keeping search engines away from restricted pages.
I will say it again: although search engines are expected to follow what a robots.txt file says, some rogue crawlers may simply refuse to comply.

For that very reason, a robots.txt file is also one of the first places a hacker will look, so truly sensitive areas should sit behind a firewall or password protection instead. With that caveat out of the way, here is what a robots.txt file does best:

  • It keeps reserved areas of a site from being crawled.
  • It keeps search engines away from scripts, utilities and other kinds of code you do not want indexed.
  • It avoids indexing of duplicate content, such as print versions of HTML pages.
  • It supports auto-discovery of XML Sitemaps.

A robots.txt file must be located in the root of the domain, and the file name must always be in lower case. If the file is placed in a subdirectory or anywhere else, it will go unnoticed, because search engines only look for it in the root of the domain.
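For example, assuming myexample.com is your domain, the first location below is the only one a search engine will check:

http://www.myexample.com/robots.txt          (found and obeyed)
http://www.myexample.com/files/robots.txt    (ignored)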

Syntax for Robots.txt

You can create a robots.txt file in any text editor, as long as you save it as a regular, ASCII-encoded text file rather than HTML.
After naming it robots.txt, add the instructions you want the search engines to obey.
A standard robots.txt file consists of a list of user agents, followed by the files and directories those agents are not allowed to crawl.
The syntax revolves around two fields, User-agent and Disallow; a complete example follows the list below.

  • User-agent: names the search engine crawler the following rules apply to, for example Googlebot. An asterisk (*) applies the rules to all crawlers.
  • Disallow: lists the pages or directories that crawler is not allowed to visit. You can list as many as you need, one per line.
  • Noindex: asks the engine to drop the listed pages from its index entirely. Although unofficial, it is supported by Google; Yahoo and Live Search do not support it.
  • Each User-agent/Disallow group is separated from the next by a blank line, but there should be no blank line within a group, i.e. between the User-agent line and its last Disallow line.
  • The hash (#) symbol marks comments in a robots.txt file. A comment can take up a whole line or sit at the end of one; anything written after the # is ignored by crawlers.
  • File and directory names are case-sensitive. For example, despite containing the same letters, "confidential", "Confidential" and "CONFIDENTIAL" are three different paths to a search engine.
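Putting these rules together, a minimal sketch of a complete file might look like the one below (the crawler name is real, but the directory names are only placeholders):

# Keep Googlebot out of the drafts area; keep everyone else out of /tmp/
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/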

Examples of a Robots.txt Format

A robots.txt file defines two kinds of authorization: (1) allow crawling and (2) disallow crawling.
There is a specific format for each. For example, to allow everything to be crawled and indexed, the format is:

User-agent: *
Disallow:

The format to disallow crawling of the entire site is as follows:

User-agent: *
Disallow: /

Format to disallow indexing of a specific folder is:

User-agent: *
Disallow: /folder_name/

The format to stop Googlebot from crawling a folder, while still allowing it to fetch one file inside that folder, is given below:

User-agent: Googlebot
Disallow: /folder1_name/
Allow: /folder1_name/file_name.html

Looking at the examples above, you can see how easy it is to make a serious mistake with a single misplaced character, so be careful when writing these instructions.
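One way to be careful is to test the rules before relying on them. Below is a minimal sketch using Python's standard urllib.robotparser module; the folder and file names are the placeholders from the example above. Note that Python's parser applies rules in file order (first match wins), while Google picks the most specific rule, so the Allow line is placed before the Disallow here:

from urllib.robotparser import RobotFileParser

# Rules mirroring the Googlebot example above (placeholder paths)
rules = """
User-agent: Googlebot
Allow: /folder1_name/file_name.html
Disallow: /folder1_name/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The single allowed file passes; everything else in the folder is blocked
print(parser.can_fetch("Googlebot", "/folder1_name/file_name.html"))   # True
print(parser.can_fetch("Googlebot", "/folder1_name/other_page.html"))  # False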

(Image: robots.txt. Photo credit: tvol)

Setting up Crawl Rate

When search engines crawl too heavily, a website may want to rein them in or slow them down. Fortunately, the engines let you set individual crawl priorities, and crawl delay is the mechanism used to slow down crawling on a website.

Crawl delay is used when repeated visits by search engine spiders put too much load on a site. For instance, if spiders hit many pages on your site over and over in a short time, your server slows down, and that is exactly when you want to throttle them.

Some crawlers support a crawl-delay parameter, which sets the number of seconds the crawler should wait between requests. The crawl-delay entry in a robots.txt file looks like this:

User-agent: *
Crawl-delay: 15

Note: the value 15 is in seconds.

Google manages its own crawling priorities and does not obey the crawl-delay directive. For most sites there is no need to set it unless the site is very large or has a great many pages; sites that are updated constantly, such as Twitter, Facebook or LinkedIn, are the kind that might employ it.

Block URLs in Search Engines with Robots.txt

Have you ever needed to block access to URLs that match a certain pattern? If not, you can learn how here. The Disallow line lists the restricted paths, and each entry must start with a forward slash (/). For example, to block every URL that contains a query string (a "?"), use:

User-agent: *
Disallow: /*?

You can also use the $ character to match the end of a URL. For example, to block URLs ending in .php (here for Yahoo's Slurp crawler), use an entry like this:

User-agent: yahoo-slurp
Disallow: /*.php$

Advanced Techniques for Robots.txt

A robots.txt file also offers more advanced control over crawling, and the search engines have worked together to extend its functionality. Some of these extensions are discussed below:

Allow Directive: Only Google, Ask and Yahoo support the Allow directive. It works roughly as the reverse of Disallow, letting you single out pages that may be crawled. This is especially useful after the entire site, or large sections of it, have been disallowed, as shown in the sketch below.
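For example, a minimal sketch that blocks the whole site but re-opens one folder (the /public/ folder name is only a placeholder) could look like this:

User-agent: *
Disallow: /
Allow: /public/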

Noindex Directive: Confined to Google only, the Noindex directive helps remove unwanted listings from the search results. Its syntax is an exact copy of Disallow.
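For example, a hypothetical rule asking Google to drop an internal search results folder from its index (the path is only a placeholder, and bear in mind the directive is unofficial):

User-agent: Googlebot
Noindex: /internal-search/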

Sitemap: An XML Sitemap tells search engines about the important pages of your website, and spiders can find it through auto-discovery. Add the line (Sitemap: location_of_sitemap) to your robots.txt file to point Google and the others at your Sitemap file.
The location_of_sitemap must be the full URL, like http://www.myexample.com/sitemap.xml. You may place the line anywhere in the file, as it is not tied to any user-agent line. The Sitemap directive is supported by the major search engines, including Ask, Google, Live Search and Yahoo. In addition to auto-discovery, you can verify and submit Sitemaps directly through webmaster consoles like Bing Webmaster Center and Google Webmaster Central.
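In practice it is just one extra line in the file; here is a minimal sketch using the example URL above:

User-agent: *
Disallow:

Sitemap: http://www.myexample.com/sitemap.xml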

What are the Issues with robots.txt?

Sometimes pages you have restricted still end up in Google's index and become visible in the search results. This happens when another site links to them: the URL and any other available information show up in the results, even though none of the page's content is indexed. To prevent this, use a noindex robots meta tag on the page, such as <meta name="robots" content="noindex">, and do not disallow that page in robots.txt. When bots crawl the page, they see the noindex tag and drop the URL from the index.

Conclusion
Alright, what's stopping you? A robots.txt file has plenty going for it; you just need to handle it with care.

First, always use a correctly written robots.txt file.
Second, a robots.txt file can be read by anyone, so do not use it to hide files or directories on your site. If you need to keep search engines away from files or folders such as "password_list.txt" or "hidden_folder," avoid listing their full names in robots.txt and protect them by other means instead.

I hope I have been able to explain how to use robots.txt to its full extent. If I have missed anything, your valuable feedback is appreciated.

Alan Smith is an avid tech blogger with vast experience in various IT domains, currently associated with SPINX Inc., a Los Angeles based Web Design and development company. Follow Alan on Google + and Twitter.


