Robots.txt
What is a robots.txt file?
Robots.txt is a file that a webmaster creates to tell search engine crawlers how to crawl and index pages on their website. Creating a robots.txt file and submitting it through the search engines' webmaster tools is an important part of good SEO. Robots.txt is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content to users.
Not all robots follow these rules: spammers, for example, use web robots to scan pages for email addresses, among other abuses.
Example robots.txt:
Here are some examples of robots.txt in action for a specific site, say www.example.com. The file itself always lives at the root of the site, at www.example.com/robots.txt.
Blocking all web crawlers from all content
User-agent: *
Disallow: /
Using this syntax in a robots.txt file tells every web crawler not to crawl any page of the site, including the homepage (www.example.com).
Allowing all web crawlers access to all content
User-agent: *
Disallow:
Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage; an empty Disallow line means nothing is off limits.
Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
This syntax tells only Google's crawler (user agent Googlebot) not to crawl any page whose URL contains the string www.example.com/example-subfolder/.
Blocking a specific web crawler from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
This syntax tells only Bing's crawler (user agent Bingbot) not to crawl the specific page at www.example.com/example-subfolder/blocked-page.html.
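Allowing an exception inside a blocked folder
Major crawlers such as Googlebot and Bingbot also understand an Allow directive, which opens up an exception inside an otherwise blocked folder. A small sketch (the page name here is just a placeholder):
User-agent: *
Disallow: /example-subfolder/
Allow: /example-subfolder/public-page.html
This blocks the whole folder but still lets crawlers reach the one listed page.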
There are two important considerations to keep in mind when using robots.txt:
1. Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
2. The robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use.
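For instance, this minimal Python sketch (using the example domain www.example.com from above) shows how easily anyone can read the file:
import urllib.request

# robots.txt is always served from the site root, so anyone can fetch it
with urllib.request.urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))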
How does robots.txt work?
Search engines have two main jobs:
1. Crawling the web to discover new content. For example, when a new website is published, nobody will find it through a search engine until its crawlers have crawled the site and can show it in search results.
2. Indexing that content so the relevant pages can be shown when someone searches with keywords. For example, once Googlebot indexes this post on robots.txt, the post can appear as a search result when someone searches for the keyword "robots.txt" on Google. The same goes for other search engines such as Bing, Yahoo, and Yandex.
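To see this from the crawler's side, here is a minimal Python sketch, using the standard library's urllib.robotparser and assuming the www.example.com rules from the examples above, of how a well-behaved crawler consults robots.txt before fetching a page:
from urllib.robotparser import RobotFileParser

# A polite crawler downloads the site's robots.txt first...
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# ...then checks each URL against the rules before crawling it
print(rp.can_fetch("Googlebot", "https://www.example.com/example-subfolder/"))
print(rp.can_fetch("*", "https://www.example.com/"))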
Why do you need robots.txt?
Robots.txt controls crawler access to areas of your site. If you do not want crawlers to reach specific pages of your website, you can disallow those pages in robots.txt. But be careful: accidentally disallowing Googlebot from your entire site means your pages stop being crawled, and your search traffic will suffer.
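For example, if you wanted only Googlebot to crawl your site and every other crawler to stay out, the rules would look like this (an illustrative sketch, not a recommendation):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Each crawler obeys the most specific group that matches its user agent, so Googlebot follows the empty Disallow (crawl everything) while all other crawlers are blocked.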
Some common uses of robots.txt include:
1. Preventing duplicate content from appearing in search results (though meta robots tags often do this job better).
2. Keeping entire sections of a website private (see the combined example after this list).
3. Letting your important pages be crawled and indexed so they bring in more search traffic, which is a useful SEO step.
4. Preventing search engines from indexing specific pages when you need to.
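A single robots.txt file can combine several of these uses. Here is a sketch (the folder and file names are just placeholders) that keeps one section private, blocks one specific page, and points crawlers to the site's sitemap:
User-agent: *
Disallow: /private-section/
Disallow: /example-subfolder/blocked-page.html
Sitemap: https://www.example.com/sitemap.xml
The Sitemap line is an extension supported by the major search engines; it helps them discover the pages you do want crawled.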