
FAQs & Common Issues for Crawling & Indexing Websites: Using Robots.txt, Robots Meta, and Canonical Tags Correctly & Effectively

February 17, 2021
by Tory Gray

One of the most frequent technical SEO issues we get from clients goes something like this: “Google is indexing pages I blocked in my Robots.txt file! What the heck?!”

To which we have to reply, “Technically, that’s working as expected. Crawling is an entirely different process. Disallowing crawling does not prohibit indexing. But here’s what you can do…”

For clarity, here we'll break down what these mechanisms are, the core tools for resolving the crawl and index errors you discover, and the common pitfalls to avoid when implementing them.

What is the difference between Crawling and Indexing in SEO?

"Crawling": 


Do you give search engines permission to access, render, and check the links you’ve referenced on the URL in question (crawling being defined as “a bot, script or program that visits a webpage and grabs content and links from it”)? In Google’s case, the crawler is generally called Googlebot, with some exceptions.

If a page is not crawled, it won’t be indexed and won’t be displayed in the search results. In other words: SEO Step 1.

Here's a list of frequently used web crawlers and their respective user agents as of August 2020 (with an admittedly US-based perspective):

  • Google Search (Desktop): Googlebot
  • Google Search (Mobile/Smartphone): Googlebot (it's a mobile-first world, y'all!)
  • Google Videos: Googlebot, Googlebot-Video
  • Google Images: Googlebot, Googlebot-Image
  • Google News: Googlebot-News
  • Google Mobile Apps (Android): AdsBot-Google-Mobile-Apps
  • Google AdsBot: AdsBot-Google
  • Google AdsBot (Mobile): AdsBot-Google-Mobile
  • Google AdSense: Mediapartners-Google
  • Bing Search: Bingbot
  • DuckDuckGo Search: DuckDuckBot
  • Baidu Search: Baiduspider
  • Yandex Search: YandexBot
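Note that the names above are the crawler "tokens" you'd reference in a robots.txt file; in your server logs you'll see a longer User-Agent header. For example, desktop Googlebot has historically identified itself roughly like this (the exact string varies over time):

```
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```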

"Indexing":  

Do you give search engines permission to include a website or specific URL in the search engine index (i.e. the pages shown in search results)? In Google’s case, the indexing engine is called Caffeine. If/when the search engine decides to include your webpage, it becomes “indexed.”

AKA: SEO Step 2.

Pro Tip: It's important to understand that a) crawling is not the same thing as indexing, and b) these processes are run by entirely different teams within Google. As such, controlling for one does not inherently control for the other.

With that in mind, here’s a guide to the most common tools and means of controlling search engine crawling and indexing, so you can set the correct instructions for your use case - and avoid common indexing issues. They include: the robots.txt file, the meta robots tag, the canonical tag, and the Google Search Console (GSC) Remove URL tool.

What about “Rendering”?

Rendering is, well, another thing altogether. To “render” a web page, your browser (e.g. Chrome, Safari, IE, Firefox, etc.) will process all the HTML, JavaScript and CSS (e.g. the code) in order to construct the layout and visible page you view when browsing a website (here’s a better, more in-depth explanation for those who may be interested.)

It’s a little out of scope for our conversation here, but the important thing to keep in mind is that a) it’s yet another important step, and b) rendering very much impacts indexation.

The more work it takes to render something, the less likely it is to rank - and/or the higher the quality and value it has to provide in order to qualify to rank, and rank well. (The popularity of your brand and of the page in question - read as: lots of backlinks! - matters a LOT more in this situation.) Put simply, the harder you make Google work to rank your website, the less likely they are to do it.

That’s literally the antithesis of technical SEO, in which we do work to make it easier for search engines to crawl your website. This is why SPA SEO, JavaScript SEO, and Rendering SEO are a thing. It’s just harder to rank those sites despite their relative popularity.

What SEO tags or tools are available to control crawling and indexing?

 “If you are looking for a “robots.txt noindex” option, that no longer exists. It was never an “approved” method, but it was highly effective. Unfortunately, it’s been officially retired.”    

Learn more from Google’s note on retiring unsupported rules in robots.txt.

Robots.txt File


Definition: Robots.txt is a text file that sets instructions for bots/spiders (like Googlebot) crawling a specific website, generally to set boundaries around what is and is not permissible to crawl. Think of it as the set of instructions for how search engines should use your website, and what they can and can NOT access.

Basic Rules for the Robots.txt file:

  • It MUST exist at /robots.txt off the root domain - and note that the path is case sensitive. For The Gray Dot Company, that's https://thegray.company/robots.txt. 
  • If you don't have a robots file at this exact URL path, the request will return a 404 - which search engines treat as "no crawl restrictions." 
  • Bots can *choose* to follow these instructions... or not. Malicious bots don't care, whereas Googlebot and Bingbot will generally respect your instructions (see the sketch below). 
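Putting those rules together, here's a minimal robots.txt sketch. The paths are hypothetical - they illustrate the syntax, not what any particular site should block:

```
# Rules for all bots that choose to honor robots.txt
User-agent: *
Disallow: /admin/            # block crawling of an entire (hypothetical) directory
Disallow: /internal-search   # block crawling of internal search result URLs
Allow: /admin/help           # carve out an exception inside a blocked directory

# Rules that apply only to Googlebot
User-agent: Googlebot
Disallow: /drafts/

# Point crawlers at your XML sitemap (optional, but common)
Sitemap: https://www.example.com/sitemap.xml
```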

Robots.txt Pros:

  • Site-level and directory (or folder)-level control of crawling
  • It's a great crawl budget lever (via blocking non-relevant, duplicate content or low-quality content)
  • Simple pattern matching (wildcards) is supported, and the syntax is pretty user friendly
  • It's easy to edit in most cases

Robots.txt Cons:

  • It can only control crawling... NOT indexation
  • Page-level control was never intended, and if/when you try to do this via the robots file, it gets messy pretty quickly
  • It's publicly accessible, and therefore should not be used to hide sensitive or confidential information
  • It's relatively easy to accidentally block crawling of sections of your site that you did not intend to block

Robots.txt Pro Tips: 

Meta Robots Tag


Definition: Meta Robots is a snippet of code you can add to the <head> section of your website's HTML that gives you page-level control of bot/spider crawling AND indexing. Meta robots is not a required tag; if/when it's not set, the default value is "index, follow." In other words, everything is accessible until you say otherwise using one of these tools. This is sometimes referred to as a "noindex tag."

  • Be careful with implementing the meta robots tag via JavaScript - Googlebot may or may not see it or respect it.

Meta Robots Pros:

  • Page-level control means granular control
  • It's a highly effective means of controlling page indexation
  • It’s considered a “directive”, meaning that it’s very likely search engines will comply with your instructions. 

Meta Robots Cons:

  • This tag can't be read or found if the page is blocked via robots.txt. In other words, if a URL is disallowed in robots.txt, Googlebot won't even crawl the page to see what the meta robots tag is set to. This isn't a "con" so much as something important to be aware of. 
  • It's inefficient at controlling crawl budget at scale; Googlebot has to fetch each page first to see whether it can crawl/index it - at which point crawl budget has potentially already been wasted. 
  • Note that this doesn't matter for small or even most medium-sized sites. Crawl budget concerns are pretty much only a large and very large site issue (e.g. many thousands of pages+). 
  • Again, much like with the robots.txt file, bots don't have to follow your instructions. "Good citizen" bots - like those from most major search engines - will respect it, however.

While other instructions are available, the most common ones control a) indexation of the page itself (index or noindex), and b) crawling of the links on that page (follow or nofollow). 

They can - and usually are - combined, though that's not required. Here they are in rough order of frequency of use (HTML examples follow the list):

  • “noindex, nofollow” - Generally the most common reason you'd add a robots meta tag to your site in the first place (that is, "index, follow" is more common overall, BUT it's the default that the robots meta tag lets you override). It's commonly used when you don't want search engines to pay attention to the page at all. 
  • “index, follow” - the page in question is good and valid, and therefore should be included in search engine results. Pages you link to from this page should be crawled as well. 
  • This fits the use case for most pages on the internet - any normal, publicly accessible, valid page.
  • “noindex, follow” - when you don't want the page itself included in Google's index (or other search engines' indexes), but you want - or don't mind - bots crawling and passing PageRank to the pages linked from it. (NOTE that over time, Google will ignore your "follow" instruction because the noindex signal is so strong. In other words, a "noindex, follow" tag will eventually be read as "noindex, nofollow.")
  • “index, nofollow” - the page in question is good and valid, but you don't want any of the links on the page to be crawled. Perhaps a good use case for sponsored content. In any case, this is pretty uncommon.
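For reference, here's what each combination looks like as a snippet in the page's <head> (illustrative only - a page should carry just one robots meta tag):

```html
<!-- Keep this page out of the index and don't crawl/credit its links -->
<meta name="robots" content="noindex, nofollow">

<!-- The default behavior; stating it explicitly is optional -->
<meta name="robots" content="index, follow">

<!-- Keep this page out of the index, but (at least initially) follow its links -->
<meta name="robots" content="noindex, follow">

<!-- Index this page, but don't follow its links -->
<meta name="robots" content="index, nofollow">
```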

           


         

Canonical Tag

Definition: The canonical tag is a snippet of code included in the <head> section of a website that gives search engines page-level insight about original, near-duplicate, and duplicate content - and which page is the source of that content. There are other methods of implementing this, but they are MUCH less common.

  • The original source of content should self-canonicalize (refer to itself as the original source of content); a duplicate or near-duplicate page should generally point to the other, original content page as the canonical source.
  • Important pages don't technically have to self-canonicalize, but it's still a good idea - especially if you have parameter variations of those pages. 

NOTE that canonicalized pages (e.g. pages that point to other pages and not themselves) are both crawlable and indexable.

  • The tag is a "hint" about how you want Google and other search engines to handle that tag, not a "directive" about what they must do. In practice, that means that some canonicalized pages are indexed, and some aren't. The search engine gets to make this decision. 
  • In most cases, the canonical tag takes the SEO equity a page has acquired and passes it, instead, to the page referenced in the tag (either itself or another page), giving that page a stronger ability to rank. 
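As a concrete sketch (with hypothetical URLs): a parameter variation points at the clean URL as its canonical, while the clean URL points at itself:

```html
<!-- In the <head> of the duplicate/variant page, e.g. https://www.example.com/products?sort=price -->
<link rel="canonical" href="https://www.example.com/products">

<!-- In the <head> of the original page, https://www.example.com/products (self-canonical) -->
<link rel="canonical" href="https://www.example.com/products">
```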

Common Use Cases for Utilizing Canonical Tags: 

  • Syndicated content that's published on multiple websites
  • Sites with technical issues or parameter variations
  • Example 1: the difference between /Robots.txt and /robots.txt (yes, URLs are in fact case sensitive!)
  • Example 2: the difference between /products and /products?category=1 (some parameter variations of URLs are valid & different pages, but many are not. The canonical tag can help you send the right signals about which is which.)
  • Example 3: Which URL do you want ranking: /products?category=tables, or /products/tables? eCommerce facets and filters can - in many cases - be properly handled with canonicals. 
  • Sites with structural issues, or similar but slightly different target audiences: Sometimes it's desirable to repeat the same page in a different section of the site - perhaps for a different audience with small copy tweaks to speak to the audience in question.
  • Brands utilizing a subdomain strategy to promote certain products as stand-alone products; they often have a page on the core site for that product in addition to a microsite. 

Canonical Implementation Rules: 

  • Absolute URLs (including the https:// or http:// protocol) are required 
  • Don't canonicalize to pages that aren't valid, indexable pages themselves (this creates broken canonical chains or loops that search engines will likely ignore)
  • Don't canonicalize paginated pages to page 1. Pagination is valid and non-duplicative by nature (Page 2 is not Page 1!); if you want search engines crawling the pages linked via pagination (e.g. the products listed on page 2), this canonical error will result in less crawling and less SEO equity being passed to those pages. (See the example after this list.)
  • Be careful with implementing the canonical tag via JavaScript - Googlebot may or may not see it or respect it. 
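For example, under the pagination rule above, page 2 of a (hypothetical) product listing should self-canonicalize rather than point at page 1:

```html
<!-- In the <head> of https://www.example.com/products?page=2 -->

<!-- Do this (self-canonical): -->
<link rel="canonical" href="https://www.example.com/products?page=2">

<!-- Not this (canonicalizing page 2 to page 1): -->
<link rel="canonical" href="https://www.example.com/products">
```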

Canonical Pros

  • For the most part, it does what it's meant to do - give credit to the original content 

Canonical Cons

  • The tag can't be read/found if it's blocked via robots.txt (same problem as the meta robots - if you disallow the page via Robots.txt, Googlebot won't crawl the page to even see the canonical tag.)
  • It's relatively easy to make mistakes
  • SEO QA is a critical step to ensure you've set this up correctly. Common mistakes with canonical tags are outlined below.

Google Search Console (GSC) Remove URL Tool

Definition: The Remove URL tool is available in GSC; it allows for explicit site-level, directory-level AND page-level control of indexation... temporarily. 

  • We generally recommend using this tool for *quickly* deindexing content that you've also blocked via another means; it helps the transition process go much more quickly.
  • URLs you block via this tool should *also* be blocked via another means - robots.txt OR the meta robots tag. 

Remove URL Tool Pros

  • It's the quickest, most effective means of deindexing URLs - and therefore of resolving key indexing problems fast
  • It's private (visible only to Google and to anyone with access to your GSC account), meaning you can block indexation of sensitive information with this tool. Competitors won't be able to tell what you've blocked.

Remove URL Tool Cons

  • If you don't know what you are doing, you can accidentally block critical web pages or sections of your website. Or the whole site… (yes, I've seen this happen.)
  • After 6 months you may need to re-up your URL block, since it's, by definition, a temporary request. However, it's not that common for pages to return to the index without specific signals pulling them back in. 

Why do they return sometimes? 

  • Too many internal or external links point to that page (basically - there are strong signals that people care about this page.)
  • The page isn't also blocked via robots.txt or meta robots

Bottom line: the Remove URL tool is a supplementary way to control indexation more quickly and effectively - it shouldn't be your only tool.

Common Issues & Mistakes when these SEO tools are used together: 

Robots.txt and Robots Meta Tags

  • If you disallow a URL, bots can't read the robots meta tag in order to follow those instructions. 
  • This can result in pages being indexed with no context - the exact "Google is indexing pages I blocked!" scenario from the intro (sketched below).
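Here's a sketch of that conflict with a hypothetical URL - because the Disallow rule stops Googlebot from ever fetching the page, the noindex tag on it is never read:

```
# robots.txt at https://www.example.com/robots.txt
User-agent: *
Disallow: /private-page

<!-- In the <head> of https://www.example.com/private-page -->
<!-- Never seen by Googlebot, because the page can't be crawled -->
<meta name="robots" content="noindex">
```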

Robots.txt and Canonical Tags

  • If you disallow a URL, bots can't read the canonical tag in order to follow those instructions. 
  • This means that any links that the page has acquired no longer pass SEO equity to the source material. 

Canonical Tags and Meta Robots Tags

  • Whenever possible, DON'T use the robots meta tag and the canonical tag on the same page, as they can send conflicting signals. 
  • That said, "index, follow" in meta robots and a self-canonical tag aren't the worrisome use cases. Here are the situations you want to avoid: 
  • Noindexing a page that self-canonicalizes. If it's a valid, original page, why are you noindexing it? It's a conflicting signal that will confuse search engines about what action you want to happen, and may cause them to ignore the instructions in both tags. There's no telling what they'll do in this scenario. 
  • Noindexing a page that canonicalizes elsewhere. In this case your canonical instruction is generally invalidated and can no longer pass credit to the original source of content (see the snippet after this list).
  • We recommend you pick the right tag and use just that one. Pick the right tool for the job!
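For illustration, this is the second combination to avoid (hypothetical URLs) - the noindex undermines the canonical hint instead of consolidating equity to the referenced page:

```html
<!-- In the <head> of https://www.example.com/duplicate-page: avoid pairing these -->
<meta name="robots" content="noindex">
<link rel="canonical" href="https://www.example.com/original-page">
```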

More Pro Tips:   

  • Don't include disallowed, noindexed, or non-canonical pages in your XML sitemap (typically found at /sitemap.xml). Only indexable pages that return a 200 (OK) status code should be included - a minimal sketch follows this list. 
  • If your indexing problem lies in getting existing or new content into Google search, that's a different problem.
  • If it's a new page, you should start with submitting the page in GSC via the Inspect URL tool. 
  • If it's an existing page you've already submitted and Google doesn't deem it worthy of indexation (e.g. you find valid URLs in the "Submitted, not Indexed" Coverage report in GSC), you may have a site/page quality issue or site performance (aka: speed/rendering) issues. 
  • You can explore what Google crawls in detail via your log files, and in aggregate in GSC's Crawl Stats report.
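To tie the first tip back to the tags above, here's a minimal XML sitemap sketch (hypothetical URLs) listing only live, indexable, self-canonical pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only 200-status, indexable, self-canonical URLs belong here -->
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <url>
    <loc>https://www.example.com/products/tables</loc>
  </url>
  <!-- Disallowed, noindexed, or canonicalized-away URLs should NOT be listed -->
</urlset>
```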

Conclusion 

Don't be surprised when Disallowed pages get indexed - especially when those pages are linked somewhere else on the web, outside of your website. Instead, understand your toolset and resolve the problem the correct way, thus ensuring that you see the right pages in the SERPs.
