
Common Crawling & Indexing Issues: Using Robots.txt, Robots Meta, and Canonical Tags Correctly

Published on: February 17, 2021
Updated on: September 15, 2022
by Tory Gray

One of the most frequent technical SEO issues we get from clients goes something like this: “Google is indexing pages I blocked in my Robots.txt file! What the heck?!”

To which we have to reply:

Technically, that’s working as expected. Crawling and indexing are entirely different processes, and disallowing crawling does not prohibit indexing. But here’s what you can do…

For clarity, here we'll break down what these mechanisms are, the core tools to resolve index and crawl errors, and the common pitfalls to avoid when implementing them.

It's important to understand that a) crawling is not the same thing as indexing, and b) these processes are run by entirely different teams within Google. As such, controlling one does not inherently control the other.

"Crawling":

Crawling is defined as a bot, script, or program that visits a webpage and grabs (or "crawls") content and links from it. In other words, crawling is the fundamental first step for search engines as they explore your website.

[Image: Googlebot robot figure. Doesn't Googlebot look friendly?]

When it comes to crawling and SEO, the major question at hand is: do you give search engines permission to access the site or a specific page, and the content they find there?

In Google’s case, the crawler itself is generally referred to as “Googlebot”. There’s much to learn about how Googlebot prioritizes what pages to crawl and what pages to skip (read 5 things you didn’t know about Googlebot for more of a deep dive.) 

Simply put, if a page is not crawled, it won’t be indexed and won’t be displayed in the search results.

Here's a list of the most frequently used web crawlers and their respective user agents as of August 2020 (with an admittedly US-based perspective):

Google Search:

Google (Desktop): Googlebot
Google (Mobile/Smartphone): Googlebot (it's a mobile-first world, y'all!)
Google Videos: Googlebot, Googlebot-Video
Google Images: Googlebot, Googlebot-Image
Google News: Googlebot-News
Mobile Apps Android: AdsBot-Google-Mobile-Apps
AdsBot: AdsBot-Google
AdsBot Mobile: AdsBot-Google-Mobile
AdSense: Mediapartners-Google

Bing Search:

Bingbot: Bingbot

DuckDuckGo Search:

DuckDuckGo's bot: DuckDuckBot

Baidu Search:

Baidu's bot: Baiduspider

Yandex Search:

Yandex's bot: YandexBot

Yahoo Search:

Yahoo's bot: Slurp
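
These user agent tokens are what you target in a robots.txt file when you want rules to apply to one crawler but not another. As a quick, hedged sketch (the /example-private-section/ path is purely hypothetical):

  # Applies to every crawler that doesn't have a more specific group below
  User-agent: *
  Disallow: /example-private-section/

  # Googlebot-Image follows only its own group, so this lets it crawl everything
  User-agent: Googlebot-Image
  Disallow: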

"Indexing":

Do you give search engines permission to include a website, or a specific URL on that site, in their search index (i.e. the set of pages they can show in search results)?

In Google’s case, the indexing engine is called Caffeine. If/when the search engine decides to include your webpage in their search results, it becomes “indexed.”

Indexing typically follows crawling, though you should note that crawling (of your site/page) isn’t technically required. Google and other search engines may opt to include a page/site in their index after crawling links to your site/page, either from your site or from external websites. These links can make your pages appear valuable to Google, and therefore worthy of indexation.

If you are struggling with assets you don’t want to be indexed, check out How to Deindex "Stuff" from Google Quickly & Effectively.

What about “Rendering”?

Rendering is, well, another thing altogether. To “render” a web page, your browser (e.g. Chrome, Safari, IE, Firefox, etc.) will process all the HTML, JavaScript, and CSS (e.g. the code) in order to construct the layout and visible page you view when browsing a website (here’s a better, more in-depth explanation for those who may be interested.)

It’s a little out of scope for our conversation here, but the important thing to keep in mind is that a) it’s yet another important SEO step, and b) rendering can and does impact indexation.

The more "work" it takes to render something (time, crawl budget, and resources for search engines), the less likely it is to rank, and/or the higher the quality and value it has to provide in order to qualify to rank, and to rank well. The popularity of your brand and of the page in question (read as: lots of backlinks!) matters a LOT more in this situation. Put simply, the harder you make Google work to rank your website, the less likely they are to do it.

That’s quite literally the antithesis of technical SEO, in which we do work to make it easier for search engines to crawl your website. This is why SPA SEO, JavaScript SEO, and Rendering SEO are a thing. It’s just more work to rank those sites despite their relative popularity.

Learn more about the different options for rendering.

With that in mind, here’s a guide to the most common tools & means of controlling search engine crawling and indexing, so you can set the correct instructions for your use case—and avoid common indexing issues.

What SEO tags and tools are available to control crawling and indexing?

“If you are looking for a “robots.txt noindex” option, it no longer exists. It was never an “approved” method, but it was highly effective. Unfortunately, it’s been officially retired.”

Despite continued interest in robots.txt noindex as a means of controlling indexing, this rule has been retired in favor of other supported crawling and indexing directives. Learn more from Google’s note on retiring unsupported rules in robots.txt, and read on to know which methods best apply to your case.

Robots.txt File

[Image: Example robots.txt file and syntax]

Definition: Robots.txt is a text file that sets crawling instructions for bots/spiders (like Googlebot) for a specific website, generally to set boundaries around what is and is not permissible to crawl. Think of it as the set of instructions for how search engines should use your website, and what they can and can NOT access.

Basic Rules for the Robots.txt file:

  • It MUST exist at /robots.txt (at the root of the domain.)
  • It's case sensitive. For The Gray Dot Company, that's https://thegray.company/robots.txt. /Robots.txt (with an uppercase R) is not a valid location.
  • If you don't have a robots.txt file at this exact URL path, crawlers requesting it will get a 404 page by definition (and will generally assume there are no crawl restrictions). 
  • Bots can *choose* to follow these instructions... or not. Malicious bots don't care, whereas Googlebot and Bingbot will generally respect your instructions.
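
To make that concrete, here's a minimal sketch of what a robots.txt file might contain; the blocked directories are hypothetical and purely illustrative:

  # Block all crawlers from hypothetical checkout and internal search pages
  User-agent: *
  Disallow: /checkout/
  Disallow: /search

  # Point crawlers at the XML sitemap (an absolute URL is required here)
  Sitemap: https://example.com/sitemap.xml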

Robots.txt Pros:

  • Site-level and directory (or folder)-level control of crawling
  • It's a great crawl budget lever (by blocking non-relevant, duplicate, or low-quality content, such as duplicate facets on e-commerce websites)
  • The simple pattern-matching syntax it supports is pretty user friendly
  • It's easy to edit in most cases

Robots.txt Cons:

  • It can only control crawling... NOT indexation
  • Page-level control was never intended, and if/when you try to do this via the robots.txt file, it gets messy pretty quickly
  • It's publicly accessible, and therefore should not be used to hide sensitive or confidential information
  • It's relatively easy to accidentally block crawling of sections of your site that you did not intend to block


Meta Robots Tag

[Image: Meta robots noindex code snippet]

Definition: Meta Robots is a snippet of code you can add to the <head> section of your website's HTML that gives you page-level control of bot/spider crawling AND indexing. Meta robots is not a required tag. If/when it's not set, the default value is "index, follow." In other words, everything is accessible until you say otherwise, using one of these tools. This is sometimes referred to as a "noindex tag."

  • Be careful with implementing the meta robots tag via JavaScript—Googlebot may or may not see it or respect it. (Read more about common JavaScript SEO issues.)
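
For reference, here's a hedged sketch of what the tag looks like in a page's <head> (the surrounding markup is just placeholder HTML):

  <!DOCTYPE html>
  <html>
    <head>
      <title>Example page</title>
      <!-- Tell all crawlers not to index this page or follow its links -->
      <meta name="robots" content="noindex, nofollow">
      <!-- Or target a single crawler instead of all of them -->
      <!-- <meta name="googlebot" content="noindex"> -->
    </head>
    <body>…</body>
  </html>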

Meta Robots Pros:

  • Page-level control means granular control on an individual URL basis
  • It's a highly effective means of controlling page indexation
  • It’s considered a “directive”, meaning that it’s very likely search engines will comply with your instructions. 

Meta Robots Cons:

  • This tag can't be read or found if it's blocked via robots.txt. In other words, if a URL is blocked via robots.txt, Googlebot won't even crawl the page to see what the meta robots tag is set to. This isn't a "con" so much as it's important to be aware of. 
  • It's inefficient at controlling crawl budget at scale; Googlebot has to fetch each page first to see whether it can crawl/index it, and at that point crawl budget has potentially already been spent. 
  • Note that this doesn't matter for small or even most medium-sized sites. Crawl budget concerns are pretty much only a large and very large site issue (e.g. many thousands of pages+) 
  • Much like the robots.txt file, bots don't have to follow your instructions. "Good citizen" bots—like those from most major search engines—will respect it, however.

While other instructions are available, the most common ones control a) indexation of the page itself (index and noindex), and b) crawling of the links on that page (follow or nofollow.) 

These values can be, and usually are, combined, though that's not required. The combinations below are in rough order of frequency of use:

  • “noindex, nofollow”: generally the most common reason you'd add a robots meta tag to your site in the first place ("index, follow" behavior is more common overall, but since it's the default, you rarely need to declare it). It's commonly used when you don't want search engines to pay attention to the page at all. 
  • “index, follow”: the page in question is good and valid, and therefore should be included in search engine results, and the pages you link to from it should be crawled as well. This fits the use case for most pages on the internet: any normal, publicly accessible, valid page.
  • “noindex, follow”: when you want the page itself excluded from Google's index (or other search engines' indexes), but you want, or don't mind, bots crawling & passing PageRank to the pages linked from the page this command is set on. (NOTE that over time, Google will ignore your "follow" command because the noindex instruction is so strong. In other words, a "noindex, follow" command will eventually be read as "noindex, nofollow.")
  • “index, nofollow”: the page in question is good and valid, but you don't want any of the links on the page to be crawled. Perhaps a good use case for sponsored content. In any case, this is pretty uncommon.
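
Expressed as tags, those combinations look like this (remember that omitting the tag entirely behaves like "index, follow"):

  <meta name="robots" content="noindex, nofollow">  <!-- keep the page and its links out entirely -->
  <meta name="robots" content="index, follow">      <!-- the default; rarely needs to be declared -->
  <meta name="robots" content="noindex, follow">    <!-- drop the page, but still crawl its links (for a while) -->
  <meta name="robots" content="index, nofollow">    <!-- keep the page, ignore its links -->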

           

Canonical Tag

[Image: Canonical tag syntax example]

Definition: The canonical instruction is a snippet of code included in the <head> section of a website with which you can give page-level hints to search engines about original, near-duplicate, and duplicate content, and which pages are the source of that content. There are other methods of implementing this, but they are MUCH less common.

  • The original source of the content should self-canonicalize (refer to itself as the original source of content); a duplicate or near-duplicate page should generally point to the other, original content page as the canonical source.
  • Important pages don't technically have to self-canonicalize, but it's still a good idea—especially if you have parameter variations of those pages. 

NOTE that canonicalized pages (e.g. pages that point to other pages and not themselves) are both crawlable and indexable.

  • The tag is a "hint" about how you want Google and other search engines to handle that page, not a "directive" about what they must do. In practice, that means that some canonicalized pages are indexed, and some aren't. The search engine gets to make this decision. 
  • In most cases, the canonical tag takes the SEO equity that a page has acquired and passes it to the page referenced in the tag (i.e. itself, or another page), giving that page a stronger ability to rank. 
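
As a hedged sketch (example.com and the paths are hypothetical), the original page self-canonicalizes, and a near-duplicate variation points back to it:

  <!-- On https://example.com/products/tables (the original page): -->
  <link rel="canonical" href="https://example.com/products/tables">

  <!-- On https://example.com/products/tables?sort=price (a parameter variation): -->
  <link rel="canonical" href="https://example.com/products/tables">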

Common Use Cases for Utilizing Canonical Tags: 

  • Syndicated content that's published on multiple websites
  • Sites with technical issues or parameter variations
  • Example 1: the difference between /Robots.txt and /robots.txt (yes, URLs are in fact case sensitive!)
  • Example 2: the difference between /products and /products?category=1 (some parameter variations of URLs are valid & different pages, but many are not. The canonical tag can help you send the right signals about which is which.)
  • Example 3: Which URL do you want ranking: /products?category=tables, or /products/tables? eCommerce facets and filters can, in many cases, be properly handled by canonicals. 
  • Sites with structural issues, or similar but slightly different target audiences: Sometimes it's desirable to repeat the same page in a different section of the site—perhaps for a different audience with small copy tweaks to speak to the audience in question.
  • Brands that utilize a subdomain strategy to promote certain products as stand-alone products; they often have a page on the core site for that product in addition to a microsite. 

Canonical Implementation Rules: 

  • Full, absolute URL references (including the protocol, e.g. https://) are required (see the sketch after this list) 
  • Don't canonicalize to pages that aren't valid, indexable pages themselves (this sends conflicting signals and can create canonical chains or loops)
  • Don't canonicalize paginated pages to page 1. 
  • Pagination is valid and non-duplicative by nature (Page 2 is not Page 1!); if you want search engines to crawl the pages linked via pagination (e.g. the products listed on page 2), this canonical error will result in less crawling and reduced SEO equity getting passed to those pages.
  • Be careful with implementing the canonical tag via JavaScript—Googlebot may or may not see it or respect it. 
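
Here's a quick sketch of those rules in practice, again with hypothetical URLs: the canonical uses a full absolute URL, and page 2 of a paginated series canonicalizes to itself rather than to page 1:

  <!-- On https://example.com/blog?page=2 -->
  <link rel="canonical" href="https://example.com/blog?page=2">
  <!-- Not href="/blog": that's a relative path, and it points pagination back to page 1 -->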

Canonical Pros

  • For the most part, it does what it's meant to do - give credit to the original content
  • Canonical tags help consolidate link equity from duplicate or overlapping pages
  • They’re relatively simple to implement with popular CMS plugins, like Yoast 

Canonical Cons

  • The tag can't be read/found if it's blocked via robots.txt; it’s the same problem as the meta robots - if you disallow the page via Robots.txt, Googlebot won't crawl the page to even see the canonical tag.
  • It's relatively easy to make mistakes
  • SEO QA is a critical step to ensure you've set this up correctly. Common mistakes with canonical tags are outlined below.

Google Search Console (GSC) Remove URL Tool

[Image: Google Search Console's Remove URL tool]

Definition: The Remove URL tool is available in GSC; it allows for explicit site-level, directory-level, AND page-level control of indexation... temporarily. 

  • We generally recommend using this tool for *quickly* deindexing content that you've also blocked via another means; it helps the transition process go much more quickly.
  • URLs you block via this tool should also be blocked via another means—robots.txt OR the meta robots tag. 

Remove URL Tool Pros

  • It's the quickest, most effective means of deindexing URLs and therefore resolving key indexing problems (in terms of the speed with which it removes the URL from Google’s index.)
  • It's private (to you, Google, and anyone with access to your GSC account), meaning you can block the indexation of sensitive information in this tool. No competitors will be able to tell what you've blocked with this tool.

Remove URL Tool Cons

  • If you don't know what you are doing, you can accidentally block critical web pages or sections of your website. Or the whole site… (yes, I've seen this happen.)
  • After 6 months you may need to re-up your URL block, since it's, by definition, a temporary request. However, it's not that common for pages to return to the index without specific work to counteract that. 

Why do they return sometimes? 

  • Too many internal or external links point to that page (basically—there are strong signals that people care about this page.)
  • The page isn't also blocked via robots.txt or meta robots

Bottom line, the Remove URL Tool is a supplementary tool for controlling indexation more quickly and effectively. It shouldn't be your only tool. Here is your guide on permanently deindexing stuff from Google.

Common Issues and Mistakes when using these SEO tools together:

When correct quality assurance measures for SEO are not taken, it's easy to break stuff, or to send mixed messages to crawlers. Here are some of the most common mistakes we've seen.

Robots.txt and Robots Meta Tags

  • If you disallow a URL, bots can't read the robots meta tag in order to follow those instructions. 
  • This can result in pages that are indexed with no context.
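
For example (with a hypothetical path), Googlebot will never fetch this page, so the noindex tag inside it is never read, and the URL can still get indexed from external links:

  # In robots.txt (crawling blocked):
  User-agent: *
  Disallow: /old-landing-page/

  <!-- In the <head> of /old-landing-page/ (never seen, because the page is never crawled): -->
  <meta name="robots" content="noindex">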

Robots.txt and Canonical Tags

  • If you disallow a URL, bots can't read the canonical tag in order to follow those instructions. 
  • This means that any links that the page has acquired no longer pass SEO equity to the source material. 

Canonical Tags and Meta Robots Tags

  • Whenever possible, we don't recommend using the robots meta tag and the canonical tag on the same page, as they can send conflicting signals. 
  • Note that Google has given the SEO Community conflicting signals on this through the years; most recently they've indicated that it's okay to use them together.
  • We've had personal, painful lessons to the contrary 😔. Therefore, if you choose to mix these tools, we recommend proceeding with caution and monitoring the results closely for issues.

More Pro Tips:   

[Image: Inspect any URL with GSC's URL Inspection tool]
[Image: Request indexing of a URL in GSC]
  • Don't include disallowed, noindexed, or non-canonical pages in your XML sitemap (/sitemap.xml). Only indexable pages with 200 (okay) status codes should be included (see the sketch after this list). 
  • If your indexing problem lies in getting existing or new content into Google search, that's a different problem.
  • If it's a new page, you should start with submitting the page in GSC via the Inspect URL tool. Make sure to keep a close eye on GSC errors and fix them as soon as possible.
  • If it's an existing page you've already submitted and Google doesn't deem it worthy of indexation (e.g. you find valid URLs in the "Submitted, not Indexed" Coverage report in GSC), you may have a site/page quality issue or site performance (AKA speed/rendering) issues. 
  • You can explore what Google crawls in detail via your log files, and in aggregate in GSC's Crawl Stats report.
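
On the sitemap tip above: a minimal sketch of a clean /sitemap.xml, with hypothetical URLs and only indexable, 200-status pages listed:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://example.com/products/tables</loc>
    </url>
    <url>
      <loc>https://example.com/blog/an-indexable-post</loc>
    </url>
  </urlset>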

Conclusion 

Don't be surprised when Disallowed pages get indexed—especially when those pages are linked somewhere else on the web, outside of your website. Instead, understand your toolset and resolve the problem the correct way, thus ensuring that you see the right pages in the SERPs.

And if crawling and indexing issues continue to disrupt your site’s SEO harmony, reach out to us!
