• Robots.txt

    Robots.txt file: importance and details

     

    What is a robots.txt file?

    Robots.txt is a file that a webmaster creates to tell search engine crawlers how to crawl the pages on their website. Creating a robots.txt file and submitting it through the search engines' webmaster tools is an important part of SEO. Robots.txt is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content to users.
    Web robots are not used only by search engines. Spammers, for example, use them to scan pages for email addresses, and they have many other uses.

     

    Example robots.txt:

    Here are some examples of robots.txt in action for a sample site, www.example.com.
    Robots.txt file URL for the website: www.example.com/robots.txt

    Example of multiple sitemaps in a website's robots.txt file
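
    A robots.txt file can list more than one sitemap by repeating the Sitemap line. The sketch below is only an illustration: the two sitemap file names, the https:// scheme, and the helper list_sitemaps are assumptions, not taken from a real site. It reads the entries with Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the sitemap URLs are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap-pages.xml
Sitemap: https://www.example.com/sitemap-posts.xml
"""


def list_sitemaps(robots_txt):
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for sitemap_url in list_sitemaps(ROBOTS_TXT):
        print(sitemap_url)
```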

    Blocking all web crawlers from all content

    User-agent: *
    Disallow: / 
    Using this syntax in a robots.txt file tells all web crawlers not to crawl any page of the website (www.example.com), including the homepage.

    Allowing all web crawlers access to all content

    User-agent: *
    Disallow:
    Using this syntax in a robots.txt file tells all web crawlers that they may crawl every page on www.example.com, including the homepage.
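
    One way to check how a rule-following crawler would read the two rule sets above is Python's standard urllib.robotparser module. This is a minimal sketch, not how any particular search engine works: the helper name allowed, the crawler name ExampleBot, and the https:// URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def allowed(robots_txt, agent, url):
    """Return True if the rules in robots_txt let `agent` crawl `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    homepage = "https://www.example.com/"
    # "ExampleBot" is a hypothetical crawler name.
    print(allowed(BLOCK_ALL, "ExampleBot", homepage))
    print(allowed(ALLOW_ALL, "ExampleBot", homepage))
```

    Under the first rule set the homepage and every other page are refused; under the second, with its empty Disallow line, they are all permitted.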

    Blocking a specific web crawler from a specific folder

    User-agent: Googlebot
    Disallow: /example-subfolder/
    This syntax tells only Google's crawler (Googlebot) not to crawl any page whose URL starts with www.example.com/example-subfolder/

    Blocking a specific web crawler from a specific web page

    User-agent: Bingbot
    Disallow: /example-subfolder/blocked-page.html
    This syntax tells only Bing's crawler (Bingbot) not to crawl the specific page at the URL www.example.com/example-subfolder/blocked-page.html
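
    The per-crawler rules above can be tried out together with Python's standard urllib.robotparser module. This is a hedged sketch: combining both groups into one file, the helper can_crawl, the page name other-page.html, the crawler name OtherBot, and the https:// scheme are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Both example groups from the text, combined into one hypothetical file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
"""

SITE = "https://www.example.com"  # scheme is an assumption


def can_crawl(agent, path):
    """Return True if `agent` may crawl `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "OtherBot"):
        for path in ("/example-subfolder/other-page.html",
                     "/example-subfolder/blocked-page.html"):
            print(agent, path, can_crawl(agent, path))
```

    Googlebot is refused for everything under the folder, Bingbot only for the single page, and a crawler with no matching group is not restricted at all.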

    There are two important considerations to keep in mind when using robots.txt:
    1. Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
    2. The robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use.
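
    To make the second point concrete, the sketch below shows how little effort it takes to read the disallowed paths out of a robots.txt file. The sample file contents and the helper name disallowed_paths are hypothetical, and the type hints assume Python 3.9 or later.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow path found in robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file: the folder names are made up for illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin/   # anyone reading the file learns this folder exists
Disallow: /backups/
Disallow:
"""

if __name__ == "__main__":
    print(disallowed_paths(SAMPLE))
```

    Because anyone can run something like this against a public robots.txt, listing a folder there hides nothing; content that must stay private needs real access control.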

    How does robots.txt work?

    Search engines have two main jobs:
    1. Crawling the web to discover new content. For example, when a new website is published, people cannot find it through search until search engine crawlers have crawled it and it shows up in search results.
    2. Indexing that content so that relevant pages can be shown when someone searches by keyword. For example, I am writing a post on robots.txt; once Googlebot indexes this post, it can appear as a search result when someone uses the keyword "robots.txt" in Google search. The same applies to other search engines, for example Bing, Yahoo and Yandex.
    Before crawling a site, a well-behaved crawler looks for the robots.txt file and follows the rules it finds there.
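
    The two jobs described above, and the place robots.txt takes between them, can be sketched as a toy example. Everything here is an assumption for illustration: the in-memory PAGES dictionary stands in for a real website, ExampleBot is a made-up crawler name, and the keyword index is far simpler than what a real search engine builds.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

# Hypothetical site: URL -> page text.
PAGES = {
    "https://www.example.com/robots-guide.html": "a guide to robots.txt rules",
    "https://www.example.com/private/notes.html": "internal notes on robots.txt",
}


def crawl_and_index(pages, robots_txt, agent="ExampleBot"):
    """Index the words of every page the robots.txt rules allow `agent` to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    index = {}
    for url, text in pages.items():
        if not parser.can_fetch(agent, url):
            continue  # a rule-following crawler skips disallowed pages
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, keyword):
    """Return the indexed URLs that contain `keyword`."""
    return sorted(index.get(keyword.lower(), set()))


if __name__ == "__main__":
    print(search(crawl_and_index(PAGES, ROBOTS_TXT), "robots.txt"))
```

    Only the guide page is returned for the keyword, because the page under /private/ was never crawled and so never reached the index.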

    Why do you need robots.txt?

    Robots.txt controls crawler access to areas of your site. If you want crawlers to visit only specific pages of your website, you can write specific rules in your robots.txt file. Be careful, though: accidentally disallowing Googlebot from crawling your entire site can be dangerous for your visibility in Google search.
    Some common uses of robots.txt include:
    1. Preventing duplicate content from being crawled (this job is done better by meta robots tags).
    2. Keeping entire sections of a website away from crawlers. Because the file itself is public, this is not a security measure.
    3. Letting crawlers reach the pages you want indexed, which can bring more traffic and is a useful SEO step.
    4. Preventing search engines from indexing specific pages, if you need to.
