Thursday, February 10, 2022

What is a robots.txt file? 3 simple ways to create robots.txt for WordPress

Have you ever wondered how a search engine robot analyzes a website's data in order to index it?

Sometimes you want Googlebot to index your website quickly, and sometimes you want it to skip a particular page.

So what can you do?

The short answer: create a robots.txt file for WordPress. This article explains what the robots.txt file is and how to create one.

This article covers:

  • What a robots.txt file is

  • The basic structure of a robots.txt file

  • What to watch out for when creating a WordPress robots.txt file

  • Why your website needs a robots.txt file

  • How to create a complete robots.txt file for your website

Let's get started!

What is a robots.txt file?

The robots.txt file is a simple plain-text (.txt) file. It is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how web robots (search engine robots) crawl the web, access and index content, and serve that content to users.

REP also includes directives such as meta robots, along with page-level, subdirectory-level, and site-wide instructions for how search engines should treat links (for example, follow or nofollow).

In practice, creating a WordPress robots.txt file gives webmasters more flexible, proactive control over which parts of their site search engine bots such as Googlebot may or may not crawl.

Syntax of the robots.txt file

The syntax can be thought of as the language of robots.txt files. There are five common terms you will come across in a robots.txt file:

  • User-agent: The name of the web crawler the rules apply to (for example, Googlebot or Bingbot).

  • Disallow: Tells the user-agent not to crawl a particular URL. Only one Disallow line is allowed per URL.

  • Allow: Tells a crawler that it may access a page or subdirectory even though its parent directory is disallowed. It was long documented mainly for Googlebot, but other major crawlers such as Bingbot honor it too.

  • Crawl-delay: Tells the crawler how many seconds to wait before loading and crawling page content. Note that Googlebot does not recognize this directive; at the time of writing, Google's crawl rate was set in Google Search Console instead.

  • Sitemap: Gives the location of any XML sitemaps associated with the site. Note that this directive is only supported by Google, Ask, Bing, and Yahoo.

Pattern matching

In practice, WordPress robots.txt rules for blocking or allowing bots can get fairly complex, because pattern matching lets a single rule cover a whole range of URLs.

Google and Bing both honor two special characters (a small subset of what regular expressions offer) for identifying the pages or subdirectories an SEO wants to exclude: the asterisk (*) and the dollar sign ($).

  • * is a wildcard that matches any sequence of characters. (In a User-agent line, * means the rules apply to all bots.)

  • $ matches the end of the URL.
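To make the two characters concrete, here is a minimal Python sketch that translates a robots.txt path pattern into a regular expression. The function names robots_pattern_to_regex and path_matches are my own, and the sample patterns are made up for illustration; real crawlers may differ in details such as percent-encoding.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Only the two special characters are handled: '*' (any run of
    characters) and a trailing '$' (end of the URL).
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' match anything.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without '$' the rule is a prefix match; with it, the URL must end here.
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, as robots.txt rules do.
    return robots_pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*.pdf$", "/files/report.pdf"))
print(path_matches("/*.pdf$", "/files/report.pdf?download=1"))
print(path_matches("/private*/", "/private-area/page.html"))
```

The first and third checks should match, while the second should not, because the $ requires the URL to end in .pdf.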

Basic format of the robots.txt file

The robots.txt file has the following basic format:

User-agent:
Disallow:
Allow:
Crawl-delay:
Sitemap:

You can omit the Crawl-delay and Sitemap lines. This is the basic format of a complete WordPress robots.txt file. In practice, a robots.txt file often contains several User-agent lines and more directives.

For example, the Disallow, Allow, and Crawl-delay lines in a robots.txt file can be specified for many different bots. Each bot's set of directives is written as its own group, separated from the next group by a blank line.

Within a group, you write the directives for that bot on consecutive lines with no blank lines between them. If a robots.txt file contains several groups that could apply to the same bot, the bot follows only the most specific group that matches it.

Standard robots.txt file

To block all web crawlers from crawling any content on the website, including the home page, use the following syntax:

User-agent: *
Disallow: /

To allow all crawlers to access all content on the website, including the home page, use the following syntax:

User-agent: *
Disallow:

To stop Google's crawler (User-agent: Googlebot) from crawling any page whose URL begins with www.example.com/example-subfolder/, use the following syntax:

User-agent: Googlebot
Disallow: /example-subfolder/

To stop Bing's crawler (User-agent: Bingbot) from crawling the specific page at www.example.com/example-subfolder/blocked-page.html, use the following syntax:

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
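If you want to sanity-check rules like these before publishing them, Python's standard urllib.robotparser module can evaluate them locally. This is only a rough sketch: the helper name can_crawl and the constant names are my own, and Python's parser is simpler than Googlebot's (for example, it does not understand the * and $ wildcards inside paths).

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_FOLDER_FOR_GOOGLE = "User-agent: Googlebot\nDisallow: /example-subfolder/"

page = "https://www.example.com/example-subfolder/page.html"

print(can_crawl(BLOCK_ALL, "Googlebot", page))
print(can_crawl(ALLOW_ALL, "Googlebot", page))
print(can_crawl(BLOCK_FOLDER_FOR_GOOGLE, "Googlebot", page))
print(can_crawl(BLOCK_FOLDER_FOR_GOOGLE, "Bingbot", page))
```

Only the second and fourth checks should report that crawling is allowed: an empty Disallow blocks nothing, and the Googlebot group does not apply to Bingbot.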

Example of a standard robots.txt file

Here is an example of a robots.txt file for the website www.example.com:

User-agent: *
Disallow: /wp-admin/
Allow: /
Sitemap: https://www.example.com/sitemap_index.xml

What does this file mean? It allows all crawlers to crawl and index everything on the site except the www.example.com/wp-admin/ directory, and it points them to the sitemap at https://www.example.com/sitemap_index.xml so they can discover the site's pages.
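As a quick cross-check of that reading, the same file can be fed to Python's standard urllib.robotparser. Treat this as a sketch rather than a statement of how Googlebot itself behaves; the two page URLs are made up for illustration, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /
Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, used only to exercise the rules.
admin_page = "https://www.example.com/wp-admin/options.php"
blog_post = "https://www.example.com/blog/hello-world/"

print(parser.can_fetch("Googlebot", admin_page))
print(parser.can_fetch("Googlebot", blog_post))
print(parser.site_maps())
```

The admin page should come back as blocked, the blog post as allowed, and site_maps() should list the single sitemap URL from the file.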

Why do you need to create a robots.txt file?

Creating a robots.txt file lets you control which areas of your website bots can access. This can be dangerous: a few accidental mistakes can leave Googlebot unable to index your website. Even so, a robots.txt file is useful for many reasons:

  • Prevent duplicate content from appearing on your website (note that meta robots is usually a better choice for this)

  • Keep some parts of the site private

  • Keep internal search results pages from showing up on SERPs

  • Specify the location of the sitemap

  • Prevent search engines from indexing certain files on your site (images, PDFs, ...)

  • Use the Crawl-delay directive to set a wait time, which keeps your server from being overloaded when crawlers load a lot of content at once

If there is nothing on your website that you want to keep web crawlers away from, you don't need to create a robots.txt file at all.

How does the robots.txt file work?

Search engines have two main tasks:

  1. Crawl the data on web pages to discover content

  2. Index that content so it can be served to users who search for it

To crawl a website's data, search engines follow links from one page to another, ultimately crawling billions of different web pages. This crawling process is also known as “spidering”.

After arriving at a website and before spidering it, a search engine bot looks for a robots.txt file. If it finds one, it reads that file first before proceeding.

The robots.txt file contains information about how search engines should crawl the website, giving the bots more specific guidance for the process.

If the robots.txt file contains no directives for the bot's user-agent, or if the website has no robots.txt file at all, the bot proceeds to crawl the rest of the site.
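The flow described above can be sketched in a few lines of Python. This is a simplified illustration, not how any real search engine is implemented: the function name urls_to_crawl is hypothetical, and passing None stands in for a site that has no robots.txt file.

```python
from typing import Iterable, List, Optional
from urllib.robotparser import RobotFileParser


def urls_to_crawl(robots_txt: Optional[str], user_agent: str,
                  discovered_urls: Iterable[str]) -> List[str]:
    """Return the discovered URLs that user_agent is allowed to crawl.

    robots_txt is the file's text, or None when the site has no robots.txt.
    """
    if robots_txt is None:
        # No robots.txt: nothing is restricted, so crawl everything found.
        return list(discovered_urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # read the rules before crawling
    return [url for url in discovered_urls if parser.can_fetch(user_agent, url)]


found = ["https://www.example.com/", "https://www.example.com/wp-admin/"]
rules = "User-agent: *\nDisallow: /wp-admin/"

print(urls_to_crawl(rules, "Googlebot", found))
print(urls_to_crawl(None, "Googlebot", found))
```

With the rules in place only the home page should remain in the list; without a robots.txt file, both URLs are kept.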

Where is the robots.txt file located on a website?

When you create a WordPress website, WordPress automatically provides a robots.txt file at the root of the site.

For example, if your site is installed at the root of gtvseo.com, you can access the robots.txt file at gtvseo.com/robots.txt. The initial output looks something like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

As mentioned above, User-agent: * means the rules apply to every type of bot. In this case, the file tells bots they are not allowed into the wp-admin and wp-includes directories. That is reasonable, because these two folders contain many sensitive files.

Keep in mind that this is a virtual file that WordPress generates by default on installation; it does not exist on disk, so you cannot edit it directly (though it still works). The standard location for a physical WordPress robots.txt file is the root directory, often called public_html or www (or named after the website). To use your own robots.txt file, create a new file in that root directory; it replaces the virtual one.

In the sections below, I will show you several easy ways to create a new robots.txt file for WordPress. But first, research the rules you should use in this file.

How to check if the website has a robots.txt file?

To find out whether your website has a robots.txt file, enter your root domain and add /robots.txt to the end of the URL. If no .txt page shows up, your website probably does not have a robots.txt file yet. It's that simple! For example, you can check whether my website gtvseo.com has a robots.txt file like this:

Type the root domain (gtvseo.com) > add /robots.txt at the end (giving gtvseo.com/robots.txt) > press Enter, and you'll see the result right away.
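The same check can be scripted. Below is a small Python sketch using only the standard library; the function names are my own, and it assumes the site is served over HTTPS and that an HTTP 200 response means the file exists.

```python
import urllib.request


def robots_txt_url(root_domain: str) -> str:
    """Build the robots.txt URL for a root domain such as 'gtvseo.com'."""
    # Assumption: the site is reachable over HTTPS.
    return f"https://{root_domain.strip('/')}/robots.txt"


def has_robots_txt(root_domain: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers the robots.txt URL with HTTP 200."""
    try:
        with urllib.request.urlopen(robots_txt_url(root_domain),
                                    timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 404, connection failures, and timeouts.
        return False


print(robots_txt_url("gtvseo.com"))
```

Calling has_robots_txt("gtvseo.com") performs the actual network request, so its result depends on the site at the time you run it.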

What rules should be added in the WordPress robots.txt file?

So far, each example has applied its rules to one group of bots at a time. But what if you want to apply different rules to different bots?

You just need to add each set of rules under a User-agent declaration for each bot.

For example, if you want one rule that applies to all bots and another that applies only to Bingbot, you can write it like this:

User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /

Here, all bots are blocked from accessing /wp-admin/, but Bingbot is blocked from accessing your entire site.
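To see how the two groups play out per bot, here is a short Python sketch using the standard urllib.robotparser module. The helper name access_table and the /blog/ path are made up for illustration, and the result reflects Python's parser rather than any search engine's own implementation.

```python
from typing import Dict, Tuple
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
"""


def access_table(robots_txt: str) -> Dict[Tuple[str, str], bool]:
    """Map (bot, path) pairs to whether the bot may crawl that path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in ("Googlebot", "Bingbot")
        for path in ("/wp-admin/", "/blog/")
    }


for (agent, path), allowed in access_table(ROBOTS_TXT).items():
    print(agent, path, "allowed" if allowed else "blocked")
```

Googlebot should be blocked only from /wp-admin/, while Bingbot, which matches its own more specific group, should be blocked from both paths.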

3 ways to create a simple WordPress robots.txt file

If, after checking, you find that your website does not have a robots.txt file, or you simply want to change the one you have, try one of the three ways below.

1. Use Yoast SEO

You can create or edit a WordPress robots.txt file from the WordPress Dashboard in a few simple steps. Log in to your website to reach the Dashboard.

On the left side of the screen, click SEO > Tools > File editor.

The File editor option will not appear if file editing is disabled in your WordPress installation. In that case, enable it via FTP (File Transfer Protocol).

You will now see sections for the robots.txt and .htaccess files. This is where you can create the robots.txt file.

2. Use the All in One SEO plugin

Alternatively, you can use the All in One SEO plugin to create a WordPress robots.txt file quickly. It is another simple, easy-to-use plugin for WordPress.

To create a WordPress robots.txt file, go to the main interface of the All in One SEO Pack plugin. Select All in One SEO > Features Manager, then click Activate for robots.txt.

At this point, the interface shows a number of additional features.

The robots.txt section then appears as a new tab in the All in One SEO menu. You can create and modify the WordPress robots.txt file there.

However, this plugin works a bit differently from Yoast SEO, mentioned above.

All in One SEO grays out the contents of the robots.txt file instead of letting you edit the file directly as Yoast SEO does. This can leave you with less control when editing the WordPress robots.txt file. On the positive side, it limits the damage you can do to your website by accident, and it helps guard against malware bots that could harm your site unexpectedly.

3. Create and upload a robots.txt file via FTP

If you don't want to use a plugin to create your WordPress robots.txt file, there is another way: create the robots.txt file manually.

Creating a WordPress robots.txt file manually takes only a few minutes. Use Notepad or TextEdit to write the file, following the rules introduced at the beginning of this article. Then upload it via FTP, with no plugin required. The process is simple and does not take much time.
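If you prefer to script these two steps, here is a rough Python sketch using the standard pathlib and ftplib modules. Everything specific in it is an assumption for illustration: the rules in RULES, the function names, the public_html remote directory, and the host and credentials you would pass in. Plain FTP sends passwords unencrypted, so check whether your host offers a secure alternative.

```python
from ftplib import FTP
from pathlib import Path

# Example rules only; replace them with the rules your site needs.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.example.com/sitemap_index.xml
"""


def write_robots_txt(directory: Path, rules: str = RULES) -> Path:
    """Write the rules to a file named exactly 'robots.txt' (lowercase)."""
    target = directory / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target


def upload_robots_txt(local_file: Path, host: str, user: str, password: str,
                      remote_dir: str = "public_html") -> None:
    """Upload the file to the site's root directory over FTP."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)  # assumed name of the site's root directory
        with local_file.open("rb") as handle:
            ftp.storbinary("STOR robots.txt", handle)
```

A typical use would be to call write_robots_txt(Path(".")) first and then pass the returned path to upload_robots_txt along with your own FTP host, username, and password.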

Some rules when creating a robots.txt file

  • To be found by bots, the WordPress robots.txt file must be placed in the top-level directory of the site.

  • The file name is case sensitive, so the file must be named robots.txt (not Robots.txt, robots.TXT, ...).

  • Do not put /wp-content/themes/ or /wp-content/plugins/ in a Disallow line. That prevents search engines from seeing correctly how your blog or website looks.

  • Some user-agents choose to ignore your robots.txt file. This is quite common with nefarious user-agents such as:

    • Malware robots (bots carrying malicious code)

    • Email address scrapers

  • Robots.txt files are publicly available on the web. You only need to add /robots.txt to the end of any root domain to see that site's directives. This means anyone can see which pages you do or don't want crawled, so don't use this file to hide users' personal information.

  • Each subdomain on a root domain uses its own robots.txt file. This means blog.example.com and example.com should each have a separate robots.txt file (blog.example.com/robots.txt and example.com/robots.txt).

  • It is generally considered best practice to list the location of any sitemaps associated with the domain at the bottom of the robots.txt file.

Some notes when using a robots.txt file

Make sure you're not blocking any content or parts of your site that you want Google to index.

Links on pages blocked by robots.txt will not be followed by bots. Unless the linked resources are also linked from other pages that search engines can access (pages not blocked by robots.txt, meta robots, etc.), they will not be crawled and may not be indexed.

Link juice will not be passed from a blocked page to the pages it links to. If you want link juice to flow through these pages, use a method other than robots.txt.

The robots.txt file should not be used to keep sensitive data (such as private user information) out of SERP results. Other websites may link directly to the page containing the private information, which bypasses the robots.txt directives on your root domain or home page, so the page can still be indexed.

If you want to keep a page out of search results, use a different method instead of robots.txt, such as password protection or the noindex meta directive.

Some search engines have multiple user-agents. For example, Google uses Googlebot for organic search and Googlebot-Image for image search.

Most user-agents from the same search engine follow the same rules, so you do not need to specify directives for each one. Doing so, however, lets you fine-tune how your site's content is crawled.

Search engines cache the contents of the robots.txt file, but they usually refresh the cached copy at least once a day. If you change the file and want it picked up faster, use the Submit function of Google's robots.txt Tester.

Frequently asked questions about robots.txt

Here are some frequently asked questions about robots.txt that you may be wondering about too:

What is the maximum size of a robots.txt file?

About 500 kilobytes (Google's documented limit is 500 KiB).

Where is the WordPress robots.txt file located on the website?

At the location: domain.com/robots.txt.

How do I edit the WordPress robots.txt file?

You can do it manually or use one of the many WordPress SEO plugins, such as Yoast, that let you edit robots.txt from the WordPress backend.

What happens if I disallow noindex content in robots.txt?

Google will never see the noindex directive, because it cannot crawl the page.

I use the same robots.txt file for multiple sites. Can I use a full URL instead of a relative path?

No. The directives in the robots.txt file (except for Sitemap:) apply only to relative paths.

How can I suspend all of my site's crawling?

You can suspend all crawling by returning an HTTP 503 result code for every URL, including the robots.txt file. You should not change the robots.txt file to block crawling.

How do I block all web crawlers?

In the WordPress Dashboard, go to Settings > Reading and check the box next to the Search Engine Visibility option.

Tick “Discourage search engines from indexing this site” to block all web crawlers from indexing your site.

Once selected, WordPress adds this line to the header of your site:

<meta name='robots' content='noindex,follow' />

WordPress also changes your site's robots.txt file and adds these lines:

User-agent: *
Disallow: /

These lines tell robots (web crawlers) not to index your pages. However, it is entirely up to the search engines to accept this request or ignore it.

Block Google's crawler:

To stop Google's crawler (User-agent: Googlebot) from crawling any page whose URL begins with www.example.com/example-subfolder/, use the following syntax:

User-agent: Googlebot
Disallow: /example-subfolder/

Block Bing's crawler:

To stop Bingbot from crawling a specific page, use the following syntax:

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

How are robots.txt, meta robots, and X-Robots-Tag different?

First, robots.txt is a text file, while meta robots and X-Robots-Tag are meta directives. Beyond that, the three serve quite different functions.

Meta robots are snippets of code that tell crawlers how to crawl or index a web page's content.

It is placed in the <head> section of the web page and looks like:

<meta name="robots" content="noindex" />

The X-Robots-Tag is part of the HTTP header sent by the web server. Unlike the robots meta tag, it is not placed in the HTML of a page (that is, in the <head> section).

The X-Robots-Tag is used to keep search engines from indexing specific file types such as images or PDFs, since it works even for non-HTML files.

Any directive available in the robots meta tag can also be specified as an X-Robots-Tag.

By letting you control how specific file types are indexed, the X-Robots-Tag offers more flexibility than the meta robots tag and the robots.txt file.

A robots.txt file dictates crawling for an entire site or directory, whereas meta robots and the X-Robots-Tag can dictate indexing at the level of individual pages.
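As an illustration of the per-file-type control described here, the Python sketch below decides which extra response header a server might attach to a request. The function name, the list of extensions, and the idea of wiring this into your own server code are all assumptions; on a typical WordPress host you would more likely set the header in the web server configuration. The path is assumed to carry no query string.

```python
from typing import Dict

# Hypothetical choice of file types to keep out of the index.
NOINDEX_EXTENSIONS = (".pdf", ".png", ".jpg")


def extra_headers_for(path: str) -> Dict[str, str]:
    """Return the response headers to add for a request path."""
    if path.lower().endswith(NOINDEX_EXTENSIONS):
        # Same directive as <meta name="robots" content="noindex" />,
        # delivered as an HTTP header so it also works for non-HTML files.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(extra_headers_for("/downloads/price-list.pdf"))
print(extra_headers_for("/blog/hello-world/"))
```

The PDF path should receive the noindex header, while the blog URL should receive no extra header.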

Conclusion

Now it's your turn! Do you know what a robots.txt file is? Have you checked whether your website has one? Create and edit your own WordPress robots.txt file to help search engine bots crawl and index your site quickly.

If this detailed article still leaves you with questions, consider enrolling in an SEO training course or program at GTV!

Good luck!

Contact information:

GTV SEO

Address: Số 91, Đường số 6, Khu dân cư Cityland Park Hills, Phường 10, Quận Gò Vấp, TP. HCM

Phone: 0919-009-319

Email: info@gtvseo.com
