Friday, July 20, 2012
What Is A robots.txt File? How To Create A robots.txt File For A Website? Can A Blogger Have Or Create And Add A robots.txt File To The Server?

A robots.txt file is a set of instructions on a website that tells web crawlers (also called "spiders" or "robots", such as those used by Google, Yahoo! Search, and Bing) how they should behave on the website. It's essentially used to restrict search engines' access to parts of a site as they crawl the web. However, some crawlers (people call them "bad crawlers") may ignore the robots.txt file and access exactly what it says not to. So don't publish secret content on your website thinking it'll be safe just because of a robots.txt file.

By default, every search engine will check whether a robots.txt file exists on a website before accessing any of its pages. In the case of Google Search, although it won't crawl or index the content of pages blocked by the robots.txt file, it may still index the URLs if it finds them on other pages on the web, and may show them in search results. What's shown could be the URL of the page or other publicly available information, such as the anchor text of links pointing to the page.

To use a robots.txt file, you'll need access to the root of your domain (if you're not sure, check with your web host). If you don't have that access, you'll have to use the robots meta tag to restrict crawler access instead of a robots.txt file. I'm not going to discuss the robots meta tag here; you can search Google if you want to know more about it.

Now let's see how simple it is to create a robots.txt file.

The simplest robots.txt file uses two rules:

*User-agent: the name of the robot that the rule applies to.
*Disallow: the URL of the page or directory that you don't want that robot to crawl.

Below is an example robots.txt with two different entries:

User-agent: Googlebot
Disallow: /folder1/

User-agent: *
Disallow: /folder2/

If you create a robots.txt file as described above (two entries), Googlebot (Google's web search crawler) will obey the first entry and ignore the second, while any other bot will obey the second. But if your robots.txt contains only the second entry, Googlebot will obey it along with every other bot, because the asterisk "*" means all bots. You can include as many entries as you want, and an entry can contain multiple Disallow lines and multiple user-agents, as shown in the example below. Each entry in the robots.txt file is separate and does not build upon previous entries.
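
For instance, here is a sketch of a single entry that names more than one user-agent and lists more than one Disallow line (the bot names and paths are just illustrative placeholders):

User-agent: Googlebot
User-agent: Bingbot
Disallow: /folder1/
Disallow: /folder2/
Disallow: /private_file.html

Both Googlebot and Bingbot would skip all three of those paths, while bots not named in any entry would remain unrestricted.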

Explanation

*User-agent: a specific search engine robot. There are lots of bots on the internet, far more than anyone could reasonably remember. In the case of Google, it uses several different bots (user-agents). The one it uses for web search is called Googlebot. Other Google bots such as Googlebot-Mobile and Googlebot-Image follow the rules you set up for Googlebot, but you can also set up specific rules for each bot, since each has its own name.

*Disallow: lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

Here are the common patterns, along with an explanation of each one. You can use them to build your own robots.txt file.

1. To block the entire site, you can use a forward slash:
User-agent: *
Disallow: /


2. To block a directory and everything in it, follow the directory name with a forward slash:
User-agent: *
Disallow: /junk-directory/


3. To block a page, list the page:
User-agent: *
Disallow: /private_file.html


4. To remove a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/john_smith.jpg


5. To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /


6. To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$


7. To prevent all pages on your site from being crawled while still displaying AdSense ads, disallow all bots other than Mediapartners-Google:
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /


The rules above will prevent your website from being crawled by any bot, including Googlebot (web search), but will still allow the Google AdSense crawler (Mediapartners-Google).

The Patterns Below Apply Only To Googlebot

*To match a sequence of characters, you can use an asterisk (*). For instance, to block access to all subdirectories that begin with "private":
User-agent: Googlebot
Disallow: /private*/


*To block access to all URLs that include a question mark (?), that is, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string:
User-agent: Googlebot
Disallow: /*?


*To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$


You can use this pattern matching in combination with the Allow directive. For instance, if a "?" indicates a session ID, you may want to exclude all URLs that contain one to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a "?" may be the version of the page that you do want included. For this situation, you can set up your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?


The Disallow: /*? directive will block any URL that includes a question mark; more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string. The Allow: /*?$ directive will allow any URL that ends in a question mark; more specifically, it will allow any URL that begins with your domain name, followed by any string, followed by a question mark, with no characters after the question mark.

Rules For Saving Your robots.txt File

*Save the file as robots.txt.
*The URLs in robots.txt directives are case-sensitive. For instance, Disallow: /the_great_john_smith.html would block http://j-smith-site.blogspot.com/the_great_john_smith.html, but would allow http://j-smith-site.blogspot.com/The_great_john_smith.html.
*Googlebot ignores white space (in particular, empty lines) and unknown directives in the robots.txt file.
*Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain. A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance:

http://j-smith-site.blogspot.com/robots.txt is a valid location, but http://j-smith-site.blogspot.com/2012/07/robots.txt is not.

*In addition, Googlebot supports the submission of Sitemap files through the robots.txt file if you want to use that; see the example just below this list.
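
Here is a minimal sketch of a Sitemap line in robots.txt. The sitemap URL is just an assumed example; replace it with the actual location of your sitemap:

Sitemap: http://j-smith-site.blogspot.com/sitemap.xml

The Sitemap line can appear anywhere in the file and doesn't belong to any particular User-agent entry.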

You can test your robots.txt using Google Webmaster Tools to make sure it's working the way you intend. Here's how to do it:

1. Go to Google Webmaster Tools Home Page and click the site you want.
2. In the left-side menu, click Health.
3. Click Blocked URLs.
4. Click the Test robots.txt tab if it's not already selected.
5. Copy the content of your robots.txt file and paste it into the first box, and fill the second box with your website URL if it isn't filled in already.
6. In the User-agents list, select the user-agents you want.

Any changes you make in this tool will not be saved to your website; it's just a testing tool.

This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard robots.txt protocol. It understands Allow: directives as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.
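
If you want to sanity-check a robots.txt file yourself, below is a minimal sketch using Python's standard-library urllib.robotparser, fed the two-entry example from earlier in this post. Keep in mind that this parser only implements the basic robots.txt standard, so it won't understand Google's wildcard (*) and end-of-URL ($) extensions; the example.com domain and the SomeOtherBot name are just placeholders for illustration.

from urllib.robotparser import RobotFileParser

# The example robots.txt from earlier in this post.
robots_txt = """\
User-agent: Googlebot
Disallow: /folder1/

User-agent: *
Disallow: /folder2/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own entry: blocked from /folder1/, allowed in /folder2/.
print(rp.can_fetch("Googlebot", "http://example.com/folder1/page.html"))    # False
print(rp.can_fetch("Googlebot", "http://example.com/folder2/page.html"))    # True

# Any other bot falls back to the "*" entry and is blocked from /folder2/.
print(rp.can_fetch("SomeOtherBot", "http://example.com/folder2/page.html"))  # False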

Lastly, if you want to see a live example, you can take a look at my robots.txt file here: John Smith's Blog's robots.txt File. And to learn more about how to add a robots.txt file to Blogger, you can take a look at my previous post: Adding robots.txt File To Blogger.
