
Disallowing indexing in robots.txt. How to keep specific pages out of the index

Whenever a site is accessed, search robots first look for and read the robots.txt file. It contains special directives that control the behavior of the robot. A hidden danger for any site can come from both the absence of this file and its incorrect configuration. I propose to study in more detail the issue of setting up robots.txt in general and for the WordPress CMS in particular, and also pay attention to common errors.

Robots.txt file and robot exception standard

All search engines understand instructions written according to the robots exclusion standard in a special file. For this purpose a plain text file named robots.txt is used, located in the root directory of the site. If it is placed correctly, its contents can be viewed on any website simply by appending /robots.txt to the domain address.

Instructions for robots allow you to prohibit scanning files/directories/pages, limit the frequency of access to the site, specify a mirror and an XML map. Each instruction is written on a new line in the following format:

[directive]: [value]

The entire list of directives is divided into sections (entries), separated by one or more empty lines. A new section begins with one or more User-agent instructions. The entry must contain at least one User-agent and one Disallow directive.

Text after the # (hash) symbol is considered a comment and is ignored by search robots.

User-agent directive

User-agent is the first directive in a section; it names the robots for which the rules that follow are intended. An asterisk as the value stands for any name; only one section with instructions for all robots is allowed. Example:

# instructions for all robots
User-agent: *
...

# instructions for Yandex robots
User-agent: Yandex
...

# instructions for Google robots
User-agent: Googlebot
...

Disallow directive

Disallow— a basic directive that prohibits scanning URLs/files/directories whose names fully or partially match those specified after the colon.

Advanced search robots such as Yandex and Google understand the special character * (asterisk), which stands for any sequence of characters. It is not advisable to use this wildcard in the section intended for all robots.

Examples of the Disallow directive:

# an empty value would allow everything to be crawled;
# this prohibits crawling of all files and/or directories whose names start with "wp-"
User-agent: *
Disallow: /wp-

# prohibits crawling of the files page-1.php, page-vasya.php, page-news-345.php
# the * stands for any sequence of characters
User-agent: *
Disallow: /page-*.php

Allow directive (unofficial)

Allow permits crawling of the specified resources. Officially, this directive is not part of the robots exclusion standard, so it is not advisable to use it in the section for all robots (User-agent: *). A classic use case is allowing resources from a directory that was previously blocked by the Disallow directive to be crawled:

# prohibits crawling of resources starting with /catalog
# but allows crawling of the page /catalog/page.html
User-agent: Yandex
Disallow: /catalog
Allow: /catalog/page.html

Sitemap (unofficial)

Sitemap is a directive specifying the address of the sitemap in XML format. This directive is also not described in the exclusion standard and is not supported by all robots (it works for Yandex, Google, Ask, Bing and Yahoo). You can specify one or more sitemaps; all of them will be taken into account. It can be used without a User-agent, after an empty line. Example:

# one or more sitemaps in XML format; the full URL is specified
Sitemap: http://sitename.com/sitemap.xml
Sitemap: http://sitename.com/sitemap-1.xml

Host directive (Yandex only)

Host— a directive for the Yandex robot, indicating the main mirror of the site. The issue of mirrors can be studied in more detail in the Yandex help. This instruction can be indicated either in the section for Yandex robots or as a separate entry without a User-agent (the instruction is cross-sectional and in any case will be taken into account by Yandex, and other robots will ignore it). If Host is specified several times in one file, only the first one will be taken into account. Examples:

# specify the main mirror in the section for Yandex
User-agent: Yandex
Disallow:
Host: sitename.com

# main mirror for a site with an SSL certificate
User-agent: Yandex
Disallow:
Host: https://sitename.com

# or separately, without User-agent, after an empty line
Host: sitename.com

Other directives

Yandex robots also understand the Crawl-delay and Clean-param directives. Read more about their use in the help documentation.

Robots, robots.txt directives and search engine index

Previously, search robots followed the robots.txt directives and did not add resources “prohibited” there to the index.

Today things are different. While Yandex obediently excludes from its index the addresses disallowed in the robots file, Google acts quite differently. It will still add them to its index, but the search results will show the note "The web page description is not available due to restrictions in the robots.txt file".

Why does Google add pages that are prohibited in robots.txt to the index?

The answer lies in a little Google trick. If you carefully read the webmaster help, everything becomes more than clear:

Google shamelessly reports that directives in robots.txt are recommendations, not direct commands to action.

This means that the robot takes the directives into account but still acts in its own way. It can add a page that is blocked in robots.txt to the index if it encounters a link to it.

Adding an address to robots.txt does not guarantee that it will be excluded from Google's search engine index.

Google index + incorrect robots.txt = DUPLICATES

Almost every guide on the Internet says that closing pages in robots.txt prevents them from being indexed.

This was the case before. But we already know that such a scheme does not work for Google today. And what’s even worse is that everyone who follows such recommendations makes a huge mistake - closed URLs end up in the index and are marked as duplicates, the percentage of duplicate content is constantly growing and sooner or later the site is punished by the Panda filter.

Google offers two really workable options for excluding a website from its resource index:

  1. closing with a password(applies to files like .doc, .pdf, .xls and others)
  2. adding a robots meta tag with the noindex attribute to the page (applies to web pages):
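The tag itself, placed inside the page's head section, looks like this:

<meta name="robots" content="noindex">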

The main thing to consider:

If you add the above noindex meta tag to a web page and at the same time block crawling of that same page in robots.txt, the Google robot will never be able to read the prohibiting meta tag and will add the page to the index anyway!
(that is why it says in the search results that the description is restricted by robots.txt)

You can read more about this problem in Google Help. And there is only one solution here - open access to robots.txt and configure a ban on indexing pages using a meta tag (or password, if we are talking about files).
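To make the contrast concrete, here is a sketch with a hypothetical page /private-page.html (the path is only an illustration):

# Wrong: the Disallow stops Google from ever fetching the page,
# so a noindex meta tag placed on it will never be seen
User-agent: *
Disallow: /private-page.html

# Right: leave the page open in robots.txt (no Disallow for it)
# and put <meta name="robots" content="noindex"> in its head section instead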

Robots.txt examples for WordPress

If you read the previous section carefully, it becomes clear that today you should not practice excessive blocking of addresses in robots.txt, at least as far as Google is concerned. It is better to manage page indexing through the robots meta tag.

Here is the most banal and yet completely correct robots.txt for WordPress:

User-agent: *
Disallow:
Host: sitename.com

Surprised? Still would! Everything ingenious is simple 🙂 On Western resources, where there is no Yandex, recommendations for compiling robots.txt for WordPress come down to the first two lines, as shown by the authors of WordPress SEO by Yoast.

A properly configured SEO plugin will take care of canonical links and the robots meta tag with the value noindex, and the admin pages are password-protected and do not need to be blocked from indexing (the only exception can be the login and registration pages on the site - make sure that they have a robots meta tag with the value noindex). It is better to add a sitemap manually in the search engine webmaster and at the same time make sure that it is read correctly. The only thing left and important for RuNet is to indicate the main mirror for Yandex.

Another option, suitable for the less daring:

User-agent: *
Disallow: /wp-admin
Host: sitename.com
Sitemap: http://sitename.com/sitemap.xml

The first section prohibits all robots from indexing the wp-admin directory and its contents. The last two lines specify the site mirror for the Yandex robot and the sitemap.

Before changing your robots.txt...

If you decide to change the directives in robots.txt, then first take care of three things:

  1. Make sure that there are no additional files or directories in the root of your site whose contents should be hidden from scanning (these could be personal files or media resources);
  2. Turn on canonical links in your SEO plugin (this will exclude URLs with query parameters like http://sitename.com/index.php?s=word)
  3. Set up output of the robots meta tag with the noindex value on pages that you want to hide from indexing (for WordPress these are archives by date, tag and author, and pagination pages). For some pages this can be done in the SEO plugin settings (All In One SEO has incomplete settings for this). Or output it yourself with a special code:
     /* Outputs a robots noindex meta tag on the listed page types */
     function my_meta_noindex() {
       if (
         // is_archive() OR          // any archive pages: by month, year, category, author
         // is_category() OR         // category archives
         is_author() OR              // author archives
         is_time() OR                // time-based archives
         is_date() OR                // archives for any dates
         is_day() OR                 // daily archives
         is_month() OR               // monthly archives
         is_year() OR                // yearly archives
         is_tag() OR                 // tag archives
         is_tax() OR                 // custom taxonomy archives
         is_post_type_archive() OR   // custom post type archives
         // is_front_page() OR       // static home page
         // is_home() OR             // main blog page with the latest posts
         // is_singular() OR         // any single post type: posts, pages, attachments, etc.
         // is_single() OR           // any single post of any post type (except attachments and Pages)
         // is_page() OR             // any single Page ("Pages" in the admin panel)
         is_attachment() OR          // any attachment page
         is_paged() OR               // any and all pagination pages
         is_search()                 // site search results pages
       ) {
         echo '<meta name="robots" content="noindex,follow" />' . "\n";
       }
     }
     add_action('wp_head', 'my_meta_noindex', 3);

    For the lines starting with //, the meta tag will not be output (each line describes which pages the rule applies to). By adding or removing the two slashes at the beginning of a line, you control whether the robots meta tag is output for a particular group of pages.

In a nutshell what to close in robots.txt

When setting up the robots file and indexing pages, you need to remember two important points that put everything in its place:

Use the robots.txt file to control access to server files and directories. The robots.txt file plays the role of an electronic sign “No entry: private territory”

Use the robots meta tag to prevent content from appearing in search results. If a page has a robots meta tag with the noindex attribute, most robots will exclude the entire page from search results, even if other pages link to it.

The technical aspects of a site play no less important a role in its promotion in search engines than its content. One of the most important technical aspects is site indexing, i.e. defining the areas of the site (files and directories) that may or may not be indexed by search engine robots. For this purpose robots.txt is used, a special file that contains commands for search engine robots. A correct robots.txt file for Yandex and Google will help you avoid many unpleasant consequences associated with site indexing.

2. The concept of the robots.txt file and the requirements for it

The /robots.txt file is intended to instruct search robots (spiders) to index the site as defined in this file, i.e. only those directories and files that are not listed in /robots.txt. The file contains zero or more records, each of which relates to a particular robot (as determined by the value of its agent_id field) and specifies, for that robot or for all robots at once, exactly what they must not index.

The file syntax allows you to set restricted indexing areas, both for all and for specific robots.

The robots.txt file has specific requirements; failure to comply with them may result in the search robot misreading the file or ignoring it altogether.

Primary requirements:

  • all letters in the file name must be lowercase:
  • robots.txt is correct,
  • Robots.txt or ROBOTS.TXT is incorrect;
  • the robots.txt file must be created in Unix text format. When copying this file to a website, the ftp client must be configured for text file exchange mode;
  • the robots.txt file must be placed in the root directory of the site.

3. Contents of the robots.txt file

The robots.txt file includes two entries: "User-agent" and "Disallow". The names of these entries are not case sensitive.

Some search engines also support additional entries. So, for example, the Yandex search engine uses the “Host” record to determine the main mirror of a site (the main mirror of a site is a site that is in the search engine index).

Each entry has its own purpose and can appear several times, depending on the number of pages and/or directories being blocked from indexing and the number of robots you contact.

The expected line format in the robots.txt file is as follows:

record_name[optional spaces]:[optional spaces]value[optional spaces]

For a robots.txt file to be considered valid, there must be at least one "Disallow" directive present after each "User-agent" entry.

A completely empty robots.txt file is equivalent to no robots.txt file, which implies permission to index the entire site.

User-agent entry

The “User-agent” entry must contain the name of the search robot. In this entry, you can tell each specific robot which pages of the site to index and which not.

An example of a "User-agent" record that addresses all search engines without exception, using the "*" symbol:

User-agent: *

An example of a “User-agent” record, where only the Rambler search engine robot is contacted:

User-agent: StackRambler

Each search engine robot has its own name. There are two main ways to find it out:

on the websites of many search engines there is a specialized “webmaster help” section, in which the name of the search robot is often indicated;

When viewing web server logs, namely when viewing calls to the robots.txt file, you can see many names that contain the names of search engines or part of them. Therefore, all you have to do is select the desired name and enter it into the robots.txt file.

"Disallow" entry

The “Disallow” record must contain instructions that indicate to the search robot from the “User-agent” record which files and/or directories are prohibited from indexing.

Let's look at various examples of the “Disallow” recording.

Example of an entry in robots.txt (allow everything for indexing):

Disallow:

Example (the entire site is prohibited from indexing; the "/" symbol is used for this):

Disallow: /

Example (the file “page.htm” located in the root directory and the file “page2.htm” located in the directory “dir” are prohibited for indexing):

Disallow: /page.htm

Disallow: /dir/page2.htm

Example (the directories "cgi-bin" and "forum", and therefore all of their contents, are prohibited from indexing):

Disallow: /cgi-bin/

Disallow: /forum/

It is possible to block a number of documents and (or) directories starting with the same characters from indexing using only one “Disallow” entry. To do this, you need to write the initial identical characters without a closing slash.

Example (the directory "dir" is prohibited from indexing, as well as all files and directories whose names start with the letters "dir", i.e. the files "dir.htm", "direct.htm" and the directories "dir", "directory1", "directory2", etc.):

Disallow: /dir

"Allow" entry

The "Allow" option is used to denote exceptions from non-indexable directories and pages that are specified by the "Disallow" entry.

For example, there is a record like this:

Disallow: /forum/

But in this case, it is necessary that the page page1 be indexed in the /forum/ directory. Then the following lines will be required in the robots.txt file:

Disallow: /forum/

Allow: /forum/page1

Sitemap entry

This entry indicates the location of the sitemap in xml format, which is used by search robots. This entry specifies the path to this file.

Sitemap: http://site.ru/sitemap.xml

"Host" entry

The "Host" record is used by the Yandex search engine to determine the main mirror of the site. A mirror is a partial or complete copy of the site; duplicates of a resource are sometimes needed by owners of highly visited sites to increase the reliability and availability of their service. If the site has mirrors, the "Host" directive lets you choose the name under which you want to be indexed. Otherwise, Yandex will select the main mirror on its own, and the other names will be prohibited from indexing.

For compatibility with search robots, which do not accept the Host directive when processing the robots.txt file, it is necessary to add a “Host” entry immediately after the Disallow entries.

Example: www.site.ru – main mirror:

Host: www.site.ru
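A fuller sketch showing the recommended placement right after the Disallow entries (the /cgi-bin/ path is only an illustration):

User-agent: Yandex
Disallow: /cgi-bin/
Host: www.site.ru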

"Crawl-delay" entry

This entry is recognized by Yandex. It instructs the robot to wait a specified amount of time (in seconds) between page downloads. This is sometimes necessary to protect the site from overload.

Thus, the following entry means that the Yandex robot needs to move from one page to another no earlier than after 3 seconds:
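For example (the value of 3 follows from the sentence above):

User-agent: Yandex
Crawl-delay: 3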

Comments

Any line in robots.txt that begins with the "#" character is considered a comment. Comments are allowed at the end of directive lines, but some robots may not recognize the line correctly.

Example (the comment is on the same line as the directive):

Disallow: /cgi-bin/ #comment

It is advisable to place the comment on a separate line. A space at the beginning of a line is allowed, but not recommended.

4. Examples of robots.txt files

Example (comment is on a separate line):

# comment
Disallow: /cgi-bin/

An example of a robots.txt file that allows all robots to index the entire site:

User-agent: *
Disallow:
Host: www.site.ru

An example of a robots.txt file that prohibits all robots from indexing a site:

User-agent: *
Disallow: /
Host: www.site.ru

An example of a robots.txt file that prohibits all robots from indexing the directory “abc”, as well as all directories and files starting with the characters “abc”.

User-agent: *
Disallow: /abc
Host: www.site.ru

An example of a robots.txt file that prevents the “page.htm” page located in the root directory of the site from being indexed by the Googlebot search robot:

User-agent: googlebot

Disallow: /page.htm

Host: www.site.ru

An example of a robots.txt file that prohibits indexing:

– to the “googlebot” robot – the page “page1.htm” located in the “directory” directory;

– to the “Yandex” robot – all directories and pages starting with the symbols “dir” (/dir/, /direct/, dir.htm, direction.htm, etc.) and located in the root directory of the site.

User-agent: googlebot

Disallow: /directory/page1.htm

User-agent: Yandex

Disallow: /dir

5. Errors related to the robots.txt file

One of the most common mistakes is inverted syntax.

Wrong:

User-agent: /
Disallow: Yandex

Right:

User-agent: Yandex
Disallow: /

Wrong:

Disallow: /dir/ /cgi-bin/ /forum/

Right:

Disallow: /dir/

Disallow: /cgi-bin/

Disallow: /forum/

If the web server serves a special page for error 404 (document not found) and the robots.txt file is missing, then a search robot requesting robots.txt may be given that special page, which is not a valid indexing control file at all.

Another error is incorrect letter case inside the robots.txt file. For example, if you need to block the "cgi-bin" directory, you cannot write the directory name in upper case as "CGI-BIN" in the "Disallow" entry.

Wrong:

Disallow: /CGI-BIN/

Right:

Disallow: /cgi-bin/

Error related to missing opening slash when closing a directory from indexing.

Wrong:

Disallow: page.HTML

Right:

Disallow: /page.HTML

To avoid the most common mistakes, the robots.txt file can be checked using Yandex.Webmaster or Google Webmaster Tools. The check is carried out after downloading the file.

6. Conclusion

Thus, the presence of a robots.txt file, and how it is put together, can affect a website's promotion in search engines. Without knowing the syntax of robots.txt you can accidentally block pages that should be promoted, or even the entire site, from being indexed. Conversely, competent compilation of this file can greatly help in promoting a resource; for example, you can block from indexing documents that interfere with the promotion of the pages you need.

I was faced with the task of excluding pages containing a certain query string (unique reports for the user, each of which has its own address) from indexing by search engines. I solved this problem for myself, and also decided to fully understand the issues of allowing and prohibiting site indexing. This material is dedicated to this. It covers not only advanced use cases for robots.txt, but also other, lesser-known ways to control site indexing.

There are many examples on the Internet of how to exclude certain folders from indexing by search engines. But a situation may arise when you need to exclude pages, and not all, but containing only the specified parameters.

Example page with parameters: site.ru/?act=report&id=7a98c5

Here act is the name of a variable whose value is report, and id is also a variable, with the value 7a98c5. That is, the query string (the parameters) comes after the question mark.

There are several ways to block pages with parameters from indexing:

  • using the robots.txt file
  • using rules in the .htaccess file
  • using the robots meta tag

Controlling indexing in the robots.txt file

Robots.txt file

File robots.txt is a simple text file that is placed in the root directory (folder) of the site and contains one or more entries. Typical example of file content:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this file, three directories are excluded from indexing.

Remember that a "Disallow" line must be written separately for each URL prefix you want to exclude; you cannot put "/cgi-bin/ /tmp/" into a single "Disallow" line. Also remember the special meaning of empty lines: they separate blocks of records.

Regular expressions are not supported either in the User-agent line or in Disallow.

The robots.txt file should be located in the root folder of your site. Its syntax is as follows:

User-agent: *
Disallow: /folder or page prohibited from indexing
Disallow: /other folder

The value of User-agent here is * (asterisk), which matches any name, i.e. the rules are intended for all search engines. Instead of an asterisk you can specify the name of the particular search engine the rule is meant for.

More than one directive can be specified Disallow.

You can use wildcard characters in your robots.txt file:

  • * denotes 0 or more instances of any valid character. Those. this is any string, including an empty one.
  • $ marks the end of the URL.

Other characters, including &, ?, =, etc. are taken literally.
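As a quick illustration of the $ anchor (the path is hypothetical, not from the article):

# blocks /print.pdf and /docs/print.pdf, but not /print.pdf.html
User-agent: *
Disallow: /*.pdf$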

Prohibiting indexing of a page with certain parameters using robots.txt

So I want to block addresses like this (where VALUE can be any string): site.ru/?act=report&id=VALUE

The rule for this is:

User-agent: *
Disallow: /*?*act=report&id=*

In it, the / (slash) stands for the root folder of the site; it is followed by * (asterisk), which means "anything". That is, this can be any relative address, for example:

  • /page.php
  • /order/new/id

Then follows ? (question mark), which is interpreted literally, i.e. like a question mark. Therefore, what follows is the query line.

Second * means anything can be in the query string.

Then comes a sequence of characters act=report&id=*, in it act=report&id= is interpreted literally as is, and the last asterisk again means any line.

Prohibition of indexing by search engines, but permission for crawlers of advertising networks

If you have closed your site from indexing for search engines, or have closed certain sections of it, then AdSense advertising will not be shown on them! Placing advertisements on pages that are closed from indexing may be considered a violation in other affiliate networks.

To fix this, add to the very beginning of the file robots.txt the following lines:

User-agent: Mediapartners-Google
Disallow:

User-agent: AdsBot-Google*
Disallow:

User-Agent: YandexDirect
Disallow:

With these lines we allow the Mediapartners-Google, AdsBot-Google* and YandexDirect bots to crawl the site.

Those. the robots.txt file for my case looks like this:

User-agent: Mediapartners-Google
Disallow:

User-agent: AdsBot-Google*
Disallow:

User-Agent: YandexDirect
Disallow:

User-agent: *
Disallow: /*?*act=report&id=*

Prevent all pages with a query string from being indexed

This can be done as follows:

User-agent: *
Disallow: /*?*

This example blocks all pages containing in the URL ? (question mark).

Remember: a question mark immediately after the domain name, e.g. site.ru/? is equivalent to an index page, so be careful with this rule.

Prohibiting indexing of pages with a certain parameter passed by the GET method

For example, if you need to block URLs that contain the order parameter in the query string, the following rule will do:

User-agent: *
Disallow: /*?*order=

Prevent indexing of pages with any of several parameters

Let's say we want to prevent indexing of pages whose query string contains the dir parameter, or the order parameter, or the p parameter. To do this, list each blocking option in a separate rule, something like this:

User-agent: *
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=

How to prevent search engines from indexing pages that have several specific parameters in their URLs

For example, you need to exclude from indexing pages whose query string contains the dir parameter, the order parameter and the p parameter all at once. A page with this URL should be excluded from indexing: mydomain.com/new-printers?dir=asc&order=price&p=3

This can be achieved using the directive:

User-agent: *
Disallow: /*?dir=*&order=*&p=*

Instead of parameter values ​​that may change constantly, use asterisks. If a parameter always has the same value, then use its literal spelling.
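For instance, a sketch for the hypothetical case where dir always equals asc while order and p keep changing:

User-agent: *
Disallow: /*?dir=asc&order=*&p=*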

How to block a site from indexing

To prevent all robots from indexing the entire site:

User-agent: *
Disallow: /

Allow all robots full access

To give all robots full access to index the site:

User-agent: *
Disallow:

Either just create an empty /robots.txt file, or don't use it at all - by default, everything that is not prohibited for indexing is considered open. Therefore, an empty file or its absence means permission for full indexing.

Prohibiting all search engines from indexing part of the site

To close some sections of the site from all robots, use directives of the following type, in which replace the values ​​with your own:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

Blocking individual robots

To block access to individual robots and search engines, use the robot's name in the line User-agent. In this example, access is denied to BadBot:

User-agent: BadBot
Disallow: /

Remember: many robots ignore the robots.txt file, so this is not a reliable means of stopping a site or part of it from being indexed.

Allow the site to be indexed by one search engine

Let's say we want to allow only Google to index the site, and deny access to other search engines, then do this:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

The first two lines give permission to the Google robot to index the site, and the last two lines prohibit all other robots from doing so.

Ban on indexing all files except one

Directive Allow defines paths that should be accessible to specified search robots. If the path is not specified, it is ignored.

Usage:

Allow: [path]

Important: Allow must follow before Disallow.

Note: Allow is not part of the standard, but many popular search engines support it.

Alternatively, you can combine Disallow with Allow to deny access to all folders except one file or one folder, as in the sketch below.
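A minimal sketch, assuming a hypothetical /public/ folder that should stay crawlable while everything else is blocked:

User-agent: *
Allow: /public/
Disallow: /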

How to check the operation of robots.txt

In Yandex.Webmaster there is a tool for checking whether specific addresses are allowed or denied for indexing according to your site's robots.txt file.

To do this, go to the tab Tools, select Robots.txt analysis. This file should download automatically, if there is an old version, then click the button Check:

Then into the field Are URLs allowed? enter the addresses you want to check. You can enter many addresses at once, each of them must be placed on a new line. When everything is ready, press the button Check.

In column Result if the URL is closed for indexing by search robots, it will be marked with a red light; if it is open, it will be marked with a green light.

In Search Console there is a similar tool. It lives in the Crawl tab and is called the robots.txt file inspection tool.

If you have updated the robots.txt file, then click on the button Send, and then in the window that opens, click the button again Send:

After this, reload the page (F5 key):

Enter the address to verify, select the bot and click the button Check:

Prohibiting page indexing using the robots meta tag

If you want to close a page from indexing, add the following to its head section:

<meta name="robots" content="noindex">

In the .htaccess file, the X-Robots-Tag header can be used together with a FilesMatch block to indicate which types of files are prohibited from indexing.

For example, a ban on indexing all files with the .PDF extension:

Header set X-Robots-Tag "noindex, nofollow"

Prohibition for indexing all image files (.png, .jpeg, .jpg, .gif):

Header set X-Robots-Tag "noindex"

Blocking access to search engines using mod_rewrite

In fact, everything that was described above DOES NOT GUARANTEE that search engines and prohibited robots will not access and index your site. There are robots that “respect” the robots.txt file, and there are those that simply ignore it.

Using mod_rewrite you can block access for specific bots:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google [OR]
RewriteCond %{HTTP_USER_AGENT} Yandex
RewriteRule ^ - [F]

The above directives will block access to Google and Yandex robots for the entire site.

If, for example, you need to block access only to the report/ folder, then the following directives will deny access to it (returning a 403 Access Denied response code) for the Google and Yandex crawlers:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google [OR]
RewriteCond %{HTTP_USER_AGENT} Yandex
RewriteRule ^report/ - [F]

If you are interested in blocking access for search engines to individual pages and sections of a site using mod_rewrite, then write in the comments and ask your questions - I will prepare more examples.


When visiting a site, a search robot uses a limited amount of resources for indexing: in one visit it can download only a certain number of pages. Depending on the update frequency, the volume and number of documents, and many other factors, robots may come more often and download more pages.

The more and more often pages are downloaded, the faster information from your site gets into search results. In addition to the fact that pages will appear in searches faster, changes to the content of documents will also take effect faster.

Fast site indexing

Fast indexing of site pages helps combat the theft of unique content, since fresh pages get into the index sooner. But most importantly, faster indexing lets you track how particular changes affect the site's position in search results.

Poor, slow site indexing

Why is the site poorly indexed? There can be many reasons, and here are the main reasons for slow site indexing.

  • Site pages load slowly. This may cause the site to be completely excluded from the index.
  • The site is rarely updated. Why would a robot often come to a site where new pages appear once a month?
  • Non-unique content. If the site contains non-unique material (articles, photographs), the search engine lowers its trust in your site and reduces the resources it spends on indexing it.
  • Large number of pages. If the site has a lot of pages, indexing or re-indexing all of them can take a very long time.
  • Complex site structure. A confusing structure and a large number of nesting levels make indexing the site's pages much harder.
  • Lots of extra pages. Every site has landing pages, whose content is static, unique and useful for users, and side pages such as login or filter pages. When such pages exist there are usually a lot of them, yet not all of them get indexed, and those that do compete with the landing pages. All these pages are regularly re-indexed, using up the already limited resource allocated to indexing your site.
  • Dynamic pages. If the site has pages whose content does not depend on dynamic parameters (example: site.ru/page.html?lol=1&wow=2&bom=3), many duplicates of the landing page site.ru/page.html may appear.

There are other reasons for poor site indexing as well, but the most common mistake is discussed below.

Remove everything unnecessary from indexing

There are many opportunities to rationally use the resources that search engines allocate for site indexing. And it is robots.txt that opens up wide possibilities for managing site indexing.

Using the Allow, Disallow, Clean-param and others directives, you can effectively distribute not only the attention of the search robot, but also significantly reduce the load on the site.

First, you need to exclude everything unnecessary from indexing using the Disallow directive.

For example, let's disable the login and registration pages:

Disallow: /login
Disallow: /register

Let's disable indexing of tags:

Disallow: /tag

Some dynamic pages:

Disallow: /*?lol=1

Or all dynamic pages:

Disallow: /*?*

Or let's eliminate pages with dynamic parameters:

Clean-param: lol&wow&bom /

On many sites, the number of pages found by the robot can differ from the number of pages in the search by a factor of 3 or more. That is, more than 60% of the site's pages do not participate in the search and are ballast that must either be brought into the search or removed from crawling. By excluding non-target pages and bringing the share of pages in the search closer to 100%, you will see a significant increase in indexing speed, higher positions in the search results and more traffic.

Read more about site indexing, the impact of indexing on search results, ways to speed up site indexing and the reasons for poor indexing in the following posts. In the meantime...

Throw away unnecessary ballast and quickly get to the top.

Most robots are well designed and cause no problems for website owners. But if a bot was written by an amateur, or "something went wrong", it can create a significant load on the site it crawls. By the way, spiders do not penetrate the server the way viruses do - they simply request the pages they need remotely (essentially they are analogues of browsers, just without the page-viewing part).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a very simple syntax, which is described in great detail in, for example, the Yandex help and the Google help. It usually specifies which search bot the directives that follow are intended for (the "User-agent" line), along with allowing ("Allow") and prohibiting ("Disallow") directives; "Sitemap" is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created quite a long time ago and something was added later. There are directives and design rules that will only be understood by robots of certain search engines. In RuNet, only Yandex and Google are of interest, which means that you should familiarize yourself with their help on compiling robots.txt in particular detail (I provided the links in the previous paragraph).

For example, previously it was useful for the Yandex search engine to indicate that your web project is the main one in a special “Host” directive, which only this search engine understands (well, also Mail.ru, because their search is from Yandex). True, at the beginning of 2018 Yandex still canceled Host and now its functions, like those of other search engines, are performed by a 301 redirect.

Even if your resource has no mirrors, it is useful to indicate which spelling of its address is the main one.
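Since Host was retired, the main variant is usually enforced with a 301 redirect. A sketch for Apache, assuming a hypothetical site.ru where the non-www https variant is the main one:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.site\.ru$ [NC]
RewriteRule ^(.*)$ https://site.ru/$1 [R=301,L]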

Now let's talk a little about the syntax of this file. Directives in robots.txt look like this:

<field>:<space><value><space>

The correct code should contain at least one “Disallow” directive after each “User-agent” entry. An empty file assumes permission to index the entire site.

User-agent

"User-agent" directive must contain the name of the search bot. Using it, you can set up rules of behavior for each specific search engine (for example, create a ban on indexing a separate folder only for Yandex). An example of writing “User-agent” addressed to all bots visiting your resource looks like this:

User-agent: *

If you want to set certain conditions in the “User-agent” only for one bot, for example, Yandex, then you need to write this:

User-agent: Yandex

Name of search engine robots and their role in the robots.txt file

Every search engine's bot has its own name (for example, for Rambler it is StackRambler). Here I will list the best-known ones:

Google (http://www.google.com): Googlebot
Yandex (http://www.ya.ru): Yandex
Bing (http://www.bing.com/): bingbot

Major search engines often have, in addition to the main bot, separate bots for indexing blogs, news, images and so on. A lot of information about bot types can be found in the help sections (for Yandex) and (for Google).

What should you do in this case? If you need to write a no-indexing rule that all types of Google robots must follow, use the name Googlebot and all the other spiders of this search engine will obey it too. However, you can also block, for example, only the indexing of pictures by specifying the Googlebot-Image bot as the User-agent. This may not be very clear right now, but with examples it will be easier.
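For instance, a minimal sketch that keeps the site open to the main Googlebot but tells Google's image crawler to stay away entirely:

User-agent: Googlebot-Image
Disallow: /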

Examples of using the Disallow and Allow directives in robots.txt

I will give a few simple examples of using the directives, with an explanation of what each one does.

  1. The code below allows all bots (indicated by the asterisk in User-agent) to index all content without any exceptions. This is set by an empty Disallow directive.
     User-agent: *
     Disallow:
  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index. Disallow with "/" in the value field does this.
     User-agent: *
     Disallow: /
  3. In this case, all bots will be prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory).
     User-agent: *
     Disallow: /image/
  4. To block a single file, it is enough to specify its absolute path:
     User-agent: *
     Disallow: /katalog1/katalog2/private_file.html

    Looking ahead a little, I’ll say that it’s easier to use the asterisk (*) symbol so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the directory "image" will be blocked, as well as all files and directories whose names start with the characters "image", i.e. the files "image.htm", "images.htm" and the directories "image", "images1", "image34", etc.
     User-agent: *
     Disallow: /image
     The point is that, by default, an asterisk is implied at the end of each entry, standing in for any characters, including none at all. Read more about this below.
  6. The Allow directive grants access and complements Disallow well. For example, with this condition we prohibit the Yandex search robot from downloading (indexing) everything except web pages whose address begins with /cgi-bin:
     User-agent: Yandex
     Allow: /cgi-bin
     Disallow: /

    Well, or this obvious example of using the Allow and Disallow combination:

    User-agent: *
    Disallow: /catalog
    Allow: /catalog/auto

  7. When describing paths for Allow-Disallow directives, you can use the symbols "*" and "$", thus defining certain logical expressions.
    1. Symbol "*"(star) means any (including empty) sequence of characters. The following example prohibits all search engines from indexing files with the “.php” extension: User-agent: * Disallow: *.php$
    2. Why is the $ sign needed at the end? The thing is that, by the logic of robots.txt, a default asterisk is added at the end of every directive (it is not written, but it is effectively there). For example, we write: Disallow: /images

      Implying that this is the same as:

      Disallow: /images*

      That is, this rule prohibits indexing of all files (web pages, pictures and other file types) whose address begins with /images followed by anything at all (see the example above). The $ symbol simply cancels that implied trailing asterisk. For example:

      Disallow: /images$

      Only prevents indexing of the /images file, but not /images.html or /images/primer.html. Well, in the first example, we prohibited indexing only files ending in .php (having such an extension), so as not to catch anything unnecessary:

      Disallow: *.php$

  • In many engines users get human-readable URLs, while system-generated URLs contain a question mark "?" in the address. You can take advantage of this and write the following rule in robots.txt:
    User-agent: *
    Disallow: /*?

    The asterisk after the question mark suggests itself, but, as we found out just above, it is already implied at the end. Thus, we will prohibit the indexing of search pages and other service pages created by the engine, which the search robot can reach. It won’t be superfluous, because the question mark is most often used by CMS as a session identifier, which can lead to duplicate pages being included in the index.

  • Sitemap and Host directives (for Yandex) in Robots.txt

    To avoid unpleasant problems with site mirrors, it was previously recommended to add a Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - indicates the main mirror of the site for Yandex

For example, earlier, if you had not yet switched to a secure protocol, you had to specify in Host not the full URL but just the domain name (without http://); if you had already switched to https, you needed to specify the full URL (such as https://myhost.ru).

The canonical link is a wonderful tool for combating duplicate content: the search engine simply will not index a page if a different URL is specified in Canonical. For example, for a paginated page of my blog, Canonical points to https://site and there should be no problems with duplicated titles.

    But I digress...

If your project is built on any engine, duplicate content is highly likely to occur, which means you need to fight it, including with a ban in robots.txt, and especially with the meta tag, because in the first case Google may ignore the ban, but it cannot disregard the meta tag (it was brought up that way).

For example, in WordPress, pages with very similar content can get into the search engine index if indexing of category content, tag archive content and temporary archive content is all allowed. But if, using the Robots meta tag described above, you block the tag archives and the temporary archives (or, conversely, keep the tags and block indexing of the category content), then content duplication will not occur. How to do this is described at the link given just above (to the All in One SEO Pack plugin).

    To summarize, I will say that the Robots file is intended for setting global rules for denying access to entire site directories, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions just above.

    Now let's look at specific examples of robots designed for different engines - Joomla, WordPress and SMF. Naturally, all three options created for different CMS will differ significantly (if not radically) from each other. True, they will all have one thing in common, and this moment is connected with the Yandex search engine.

Because Yandex carries a lot of weight in RuNet, we need to take into account all the nuances of its operation, and here the Host directive will help us. It explicitly tells this search engine the main mirror of your site.

For this, it is recommended to use a separate User-agent block intended only for Yandex (User-agent: Yandex). This is because other search engines may not understand Host, and its inclusion in the User-agent record intended for all search engines (User-agent: *) could lead to negative consequences and incorrect indexing.

    It’s hard to say what the situation really is, because search algorithms are a thing in themselves, so it’s better to do as advised. But in this case, we will have to duplicate in the User-agent: Yandex directive all the rules that we set User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, then in this way you will allow Yandex to go anywhere and drag everything into the index.
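A small sketch of that duplication (the /cgi-bin/ path is just an illustration): the general block and the Yandex block carry the same rules, and Host lives only in the Yandex block.

User-agent: *
Disallow: /cgi-bin/

User-agent: Yandex
Disallow: /cgi-bin/
Host: site.ru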

    Robots for WordPress

    I will not give an example of a file that the developers recommend. You can watch it yourself. Many bloggers do not at all limit Yandex and Google bots in their walks through the content of the WordPress engine. Most often on blogs you can find robots automatically filled with a plugin.

    But, in my opinion, we should still help the search in the difficult task of sifting the wheat from the chaff. Firstly, it will take a lot of time for Yandex and Google bots to index this garbage, and there may not be any time left to add web pages with your new articles to the index. Secondly, bots crawling through garbage engine files will create additional load on your host’s server, which is not good.

    You can see my version of this file for yourself. It’s old and hasn’t been changed for a long time, but I try to follow the principle “don’t fix what isn’t broken,” and it’s up to you to decide: use it, make your own, or steal from someone else. I also had a ban on indexing pages with pagination until recently (Disallow: */page/), but recently I removed it, relying on Canonical, which I wrote about above.

    But in general, the only correct file for WordPress probably doesn't exist. You can, of course, implement any prerequisites in it, but who said that they will be correct. There are many options for ideal robots.txt on the Internet.

    I will give two extremes:

    1. you can find a megafile with detailed explanations (the # symbol separates comments that would be better deleted in a real file): User-agent: * # general rules for robots, except Yandex and Google, # because for them the rules are below Disallow: /cgi-bin # folder on hosting Disallow: /? # all request parameters on the main page Disallow: /wp- # all WP files: /wp-json/, /wp-includes, /wp-content/plugins Disallow: /wp/ # if there is a subdirectory /wp/ where the CMS is installed ( if not, # the rule can be deleted) Disallow: *?s= # search Disallow: *&s= # search Disallow: /search/ # search Disallow: /author/ # author archive Disallow: /users/ # author archive Disallow: */ trackback # trackbacks, notifications in comments about the appearance of an open # link to an article Disallow: */feed # all feeds Disallow: */rss # rss feed Disallow: */embed # all embeddings Disallow: */wlwmanifest.xml # manifest xml file Windows Live Writer (if you don't use it, # the rule can be deleted) Disallow: /xmlrpc.php # WordPress API file Disallow: *utm= # links with utm tags Disallow: *openstat= # links with openstat tags Allow: */uploads # open the folder with the files uploads User-agent: GoogleBot # rules for Google (I do not duplicate comments) Disallow: /cgi-bin Disallow: /? Disallow: /wp- Disallow: /wp/ Disallow: *?s= Disallow: *&s= Disallow: /search/ Disallow: /author/ Disallow: /users/ Disallow: */trackback Disallow: */feed Disallow: */ rss Disallow: */embed Disallow: */wlwmanifest.xml Disallow: /xmlrpc.php Disallow: *utm= Disallow: *openstat= Allow: */uploads Allow: /*/*.js # open js scripts inside /wp - (/*/ - for priority) Allow: /*/*.css # open css files inside /wp- (/*/ - for priority) Allow: /wp-*.png # images in plugins, cache folder and etc. Allow: /wp-*.jpg # images in plugins, cache folder, etc. Allow: /wp-*.jpeg # images in plugins, cache folder, etc. Allow: /wp-*.gif # images in plugins, cache folder, etc. Allow: /wp-admin/admin-ajax.php # used by plugins so as not to block JS and CSS User-agent: Yandex # rules for Yandex (I do not duplicate comments) Disallow: /cgi-bin Disallow: /? Disallow: /wp- Disallow: /wp/ Disallow: *?s= Disallow: *&s= Disallow: /search/ Disallow: /author/ Disallow: /users/ Disallow: */trackback Disallow: */feed Disallow: */ rss Disallow: */embed Disallow: */wlwmanifest.xml Disallow: /xmlrpc.php Allow: */uploads Allow: /*/*.js Allow: /*/*.css Allow: /wp-*.png Allow: /wp-*.jpg Allow: /wp-*.jpeg Allow: /wp-*.gif Allow: /wp-admin/admin-ajax.php Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not blocking # from indexing, but deleting tag parameters, # Google does not support such rules Clean-Param: openstat # similar # Specify one or more Sitemap files (no need to duplicate for each User-agent #). Google XML Sitemap creates 2 sitemaps like the example below. Sitemap: http://site.ru/sitemap.xml Sitemap: http://site.ru/sitemap.xml.gz # Specify the main mirror of the site, as in the example below (with WWW / without WWW, if HTTPS # then write protocol, if you need to specify a port, indicate it). The Host command is understood by # Yandex and Mail.RU, Google does not take it into account. Host: www.site.ru
    2. But you can also use an example of minimalism:
       User-agent: *
       Disallow: /wp-admin/
       Allow: /wp-admin/admin-ajax.php
       Host: https://site.ru
       Sitemap: https://site.ru/sitemap.xml

    The truth probably lies somewhere in the middle. Also, don’t forget to add the Robots meta tag for “extra” pages, for example, using the wonderful plugin - . It will also help you set up Canonical.

    Correct robots.txt for Joomla

    User-agent: *
    Disallow: /administrator/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/

    In principle, almost everything is taken into account here and it works well. The only thing is that you should add a separate User-agent: Yandex rule to insert the Host directive, which defines the main mirror for Yandex, and also specify the path to the Sitemap file.

    Therefore, in its final form, the correct robots for Joomla, in my opinion, should look like this:

    User-agent: Yandex Disallow: /administrator/ Disallow: /cache/ Disallow: /includes/ Disallow: /installation/ Disallow: /language/ Disallow: /libraries/ Disallow: /modules/ Disallow: /plugins/ Disallow: /tmp/ Disallow: /layouts/ Disallow: /cli/ Disallow: /bin/ Disallow: /logs/ Disallow: /components/ Disallow: /component/ Disallow: /component/tags* Disallow: /*mailto/ Disallow: /*.pdf Disallow : /*% Disallow: /index.php Host: vash_sait.ru (or www.vash_sait.ru) User-agent: * Allow: /*.css?*$ Allow: /*.js?*$ Allow: /* .jpg?*$ Allow: /*.png?*$ Disallow: /administrator/ Disallow: /cache/ Disallow: /includes/ Disallow: /installation/ Disallow: /language/ Disallow: /libraries/ Disallow: /modules/ Disallow : /plugins/ Disallow: /tmp/ Disallow: /layouts/ Disallow: /cli/ Disallow: /bin/ Disallow: /logs/ Disallow: /components/ Disallow: /component/ Disallow: /*mailto/ Disallow: /*. pdf Disallow: /*% Disallow: /index.php Sitemap: http://path to your XML format map

    Yes, also note that the second option contains Allow directives permitting the indexing of styles, scripts and images. This is written specifically for Google, because its Googlebot sometimes complains that indexing of these files, for example in the folder of the active theme, is prohibited in robots.txt. It even threatens to lower rankings because of this.

    Therefore, we allow this whole thing to be indexed in advance using Allow. By the way, the same thing happened in the example file for WordPress.

    Good luck to you! See you soon on the pages of the blog site
