What is the difference between Crawling, Indexing, and Rendering in SEO?
It's important to understand that a) crawling is not the same thing as indexing, and b) these processes are run by entirely different teams within Google. As such, controlling one does not inherently control the other.
Crawling is defined as a bot, script, or program that visits a webpage and grabs (or "crawls") content and links from it. In other words, crawling is the fundamental first step for search engines as they explore your website.
When it comes to crawling and SEO, the major question at hand is: do you give search engines permission to access the site or a specific page, and to crawl the content they find there?
In Google’s case, the crawler itself is generally referred to as “Googlebot”. There’s much to learn about how Googlebot prioritizes which pages to crawl and which to skip (read 5 things you didn’t know about Googlebot for a deeper dive).
Simply put, if a page is not crawled, it won’t be indexed and won’t be displayed in the search results.
Here's a list of frequently used web crawlers and their respective User Agents as of August 2020 (with an admittedly US-based perspective):
Google (Desktop): Googlebot
Google (Mobile/Smartphone): Googlebot (it's a mobile-first world, y'all!)
Google Videos: Googlebot, Googlebot-Video
Google Images: Googlebot, Googlebot-Image
Google News: Googlebot-News
Mobile Apps (Android): AdsBot-Google-Mobile-Apps
AdsBot Mobile: AdsBot-Google-Mobile
DuckDuckGo's bot: DuckDuckBot
Baidu's bot: Baiduspider
Yandex's bot: YandexBot
Yahoo's bot: Slurp
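To make those User Agent names concrete, here's a hypothetical robots.txt snippet showing per-bot rules (the directory names are invented for illustration):

```txt
# Googlebot may crawl everything (an empty Disallow permits all)
User-agent: Googlebot
Disallow:

# Keep Google Images out of a hypothetical /press-assets/ directory
User-agent: Googlebot-Image
Disallow: /press-assets/

# All other bots: stay out of /admin/
User-agent: *
Disallow: /admin/
```

Each bot matches the most specific User-agent group that names it, and falls back to the * group otherwise.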
When it comes to indexing, the question is: do you give search engines permission to include a website, or a specific URL on that site, in their search engine index (i.e. the set of pages they can show in search results)?
In Google’s case, the indexing engine is called Caffeine. If/when the search engine decides to include your webpage in their search results, it becomes “indexed.”
Indexing typically follows crawling, though you should note that crawling (of your site/page) isn’t technically required. Google and other search engines may opt to include a page/site in their index after crawling links to your site/page, either from your site or from external websites. These links can make your pages appear valuable to Google, and therefore worthy of indexation.
Rendering is a little out of scope for our conversation here, but the important things to keep in mind are that a) it’s yet another important SEO step, and b) rendering can and does impact indexation.
The more "work" it takes to render something (time, budget, and resources for search engines), the less likely it is to rank. Alternatively, the higher quality and value it has to provide in order to qualify to rank, and rank well; in this situation, the popularity of your brand and of the page in question (read as: lots of backlinks!) matters a LOT more. Put simply, the harder you make Google work to rank your website, the less likely they are to do it.
With that in mind, here’s a guide to the most common tools & means of controlling search engine crawling and indexing, so you can set the correct instructions for your use case—and avoid common indexing issues.
What SEO tags and tools are available to control crawling and indexing?
Note: if you are looking for a “robots.txt noindex” option, that no longer exists. It was never an “approved” method, but it was highly effective. Unfortunately, it’s been officially retired.
Definition: Robots.txt is a text file that sets crawling instructions for bots/spiders (like Googlebot) for a specific website, generally to set boundaries around what is and is not permissible to crawl. Think of it as the set of instructions for how search engines should use your website, and what they can and can NOT access.
Basic Rules for the Robots.txt file:
It MUST exist at /robots.txt (off the root domain).
It's case sensitive. For The Gray Dot Company, that's https://thegray.company/robots.txt. /Robots.txt (with an uppercase R) is not a valid location.
If you don't have a robots.txt file at this exact URL path, requests for it will return a 404, and search engines will assume no crawl restrictions apply.
Bots can *choose* to follow these instructions... or not. Malicious bots don't care, whereas Googlebot and Bingbot will generally respect your instructions.
Robots.txt Pros:
Site-level and directory (or folder)-level control of crawling
The simple pattern-matching syntax is supported & pretty user-friendly
It's easy to edit in most cases
Robots.txt Cons:
It can only control crawling... NOT indexation
Page-level control was never intended, and if/when you try to do this via the robots.txt file, it gets messy pretty quickly
It's publicly accessible, and therefore should not be used to hide sensitive or confidential information
It's relatively easy to accidentally block crawling sections of your site that you did not intend to block
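To illustrate the pattern-matching syntax mentioned above, here's a sketch of common wildcard rules (the paths are hypothetical; adapt them to your own site):

```txt
User-agent: *
# * matches any sequence of characters: block all URLs with a query string
Disallow: /*?
# $ anchors the match to the end of the URL: block PDFs only
Disallow: /*.pdf$
# Directory-level block
Disallow: /internal-search/
```

Wildcard rules are exactly where accidental over-blocking happens, so test them before deploying.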
Robots.txt Pro Tips:
Google provides a tool for testing your robots.txt file to ensure it's working as expected. You can and should test pages that should be accessible and SHOULD NOT be accessible to confirm you set it up correctly.
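In addition to Google's tester, you can sanity-check robots.txt rules locally. Here's a minimal sketch using Python's standard-library urllib.robotparser; example.com and the rules are hypothetical. One caveat: Python's parser applies rules in file order (first match wins), which can differ from Googlebot's longest-match behavior, so treat this as a rough check and confirm in Google's own tool.

```python
# Sketch: testing robots.txt rules locally with Python's standard library.
# The rules and URLs below are hypothetical; substitute your own.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /search/help
Disallow: /admin/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Pages that SHOULD be crawlable:
print(rp.can_fetch("Googlebot", "https://example.com/products/tables"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/search/help"))      # True

# Pages that should NOT be crawlable:
print(rp.can_fetch("Googlebot", "https://example.com/admin/settings"))   # False
print(rp.can_fetch("Googlebot", "https://example.com/search?q=tables"))  # False
```

Running checks like these for both allowed and disallowed URLs mirrors the "test both directions" advice above.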
Definition: Meta Robots is a snippet of code you can add to the <head> section of your website's HTML that gives you page-level control of bot/spider crawling AND indexing. Meta robots is not a required tag; if/when not set, the default value is "index, follow." In other words, everything is accessible until you say otherwise using one of these tools. This is sometimes referred to as a "noindex tag."
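For illustration, here's what the tag looks like in a page's <head> (the "noindex, nofollow" values shown are just one possible combination):

```html
<head>
  <!-- Page-level instruction: don't index this page, don't follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```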
Meta Robots Pros:
Page-level control means granular control on an individual URL basis
It's a highly effective means of controlling page indexation
It’s considered a “directive”, meaning that it’s very likely search engines will comply with your instructions.
Meta Robots Cons:
This tag can't be read or found if it's blocked via robots.txt. In other words, if a URL is blocked via robots.txt, Googlebot won't even crawl the page to see what the meta robots tag is set to. This isn't a "con" so much as it's important to be aware of.
It's inefficient at controlling crawl budget at scale; Googlebot still has to crawl each page first to see whether it can index it, at which point crawl budget has potentially already been wasted.
Note that this doesn't matter for small or even most medium-sized sites. Crawl budget concerns are pretty much only a large and very large site issue (e.g. many thousands of pages+)
Much like the robots.txt file, bots don't have to follow your instructions. "Good citizen" bots—like those from most major search engines—will respect it, however.
While other instructions are available, the most common ones control a) indexation (index and noindex) of the page itself, and b) crawling to pages linked on the origin page (follow or nofollow.)
These values can be, and usually are, combined, though that's not required. The combinations below are in rough order of frequency of use:
“noindex, nofollow”: generally the most common reason you'd add a robots meta tag to your site in the first place ("index, follow" is more common overall, but since it's the default, there's rarely a need to set it explicitly). It's commonly used when you don't want search engines to pay attention to the page at all.
“index, follow”: the page in question is good and valid, and therefore should be included in search engine results. Pages you link to from this page should be crawled as well. This fits the use case for most pages on the internet—any normal publicly accessible & valid page.
“noindex, follow": when you want the page itself kept out of Google's index (and other search engines' indexes), but you want, or don't mind, bots crawling and passing PageRank to the pages linked from the page this command is set on. (NOTE that over time, Google will treat your "follow" value as "nofollow" because the noindex instruction is so strong. In other words, a "noindex, follow" command will eventually be read as "noindex, nofollow.")
“index, nofollow”: the page in question is good and valid, but you don't want any of the links on the page to be crawled. Perhaps a good use case for sponsored content. In any case, this is pretty uncommon.
Definition: the canonical instruction is a snippet of code included in the <head> section of a website in which you can give page-level insight to search engines about original, near-duplicate, and duplicate content—and which pages are the source of that content. There are other methods of implementing this, but they are MUCH less common.
The original source of the content should self-canonicalize (refer to itself as the original source of content); a duplicate or near-duplicate page should generally point to the other, original content page as the canonical source.
Important pages don't technically have to self-canonicalize, but it's still a good idea—especially if you have parameter variations of those pages.
NOTE that canonicalized pages (i.e. pages that point to other pages and not themselves) are both crawlable and indexable.
The tag is a "hint" about how you want Google and other search engines to handle that tag, not a "directive" about what they must do. In practice, that means that some canonicalized pages are indexed, and some aren't. The search engine gets to make this decision.
In most cases, the canonical tag takes the SEO equity a page has acquired and passes it to the page referenced in the tag (i.e. itself, or another page), giving that page a stronger ability to rank.
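As a sketch (example.com and the paths are hypothetical), the tag looks like this in the <head> of each page:

```html
<!-- On the original page, https://example.com/products/tables, self-canonicalize: -->
<link rel="canonical" href="https://example.com/products/tables">

<!-- On a near-duplicate parameter variation, https://example.com/products?category=tables,
     point to the original instead: -->
<link rel="canonical" href="https://example.com/products/tables">
```

Note that both pages carry the same href: the original declares itself the source, and the variation defers to it.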
Common Use Cases for Utilizing Canonical Tags:
Syndicated content that's published on multiple websites
Sites with technical issues or parameter variations
Example 1: the difference between /Robots.txt and /robots.txt (yes, URLs are in fact case sensitive!)
Example 2: Which URL do you want ranking: /products?category=tables or /products/tables? eCommerce facets and filters can, in many cases, be properly handled via canonicals.
Sites with structural issues, or similar but slightly different target audiences: Sometimes it's desirable to repeat the same page in a different section of the site—perhaps for a different audience with small copy tweaks to speak to the audience in question.
Brands that utilize a subdomain strategy to promote certain products as stand-alone products; they often have a page on the core site for that product in addition to a microsite.
Canonical Implementation Rules:
Full path URL references (including the https:// or http:// protocol) are required
Don't canonicalize to pages that aren't valid, indexable pages themselves (this sends conflicting signals, and search engines will likely ignore the canonical)
Don't canonicalize paginated pages to page 1.
Pagination is valid and non-duplicative by nature (Page 2 is not Page 1!); if you want search engines crawling to pages linked via pagination (e.g. the products listed on page 2) this canonical error will result in less crawling and reduced SEO equity getting passed to those pages.
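Per the pagination rule above, each paginated page should reference itself. A hypothetical sketch:

```html
<!-- Correct: page 2 self-canonicalizes -->
<link rel="canonical" href="https://example.com/products/tables?page=2">

<!-- Incorrect: canonicalizing page 2 back to page 1 suppresses crawling
     of the links on page 2 -->
<!-- <link rel="canonical" href="https://example.com/products/tables"> -->
```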
Canonical Tag Pros:
For the most part, it does what it's meant to do: give credit to the original content
Canonical tags help consolidate link equity from duplicate or overlapping pages
They’re relatively simple to implement with popular CMS plugins, like Yoast
Canonical Tag Cons:
The tag can't be read/found if it's blocked via robots.txt; it's the same problem as with meta robots. If you disallow the page via robots.txt, Googlebot won't crawl the page to even see the canonical tag.
It's relatively easy to make mistakes
SEO QA is a critical step to ensure you've set this up correctly. Common mistakes with canonical tags are outlined below.
Google Search Console (GSC) Remove URL Tool
Definition: The Remove URL tool is available in GSC; it allows for explicit site-level, directory-level, AND page-level control of indexation... temporarily.
We generally recommend using this tool for *quickly* deindexing content that you've also blocked via another means; it helps the transition process go much more quickly.
URLs you block via this tool should also be blocked via another means—robots.txt OR the meta robots tag.
Remove URL Tool Pros
It's the quickest, most effective means of deindexing URLs and therefore resolving key indexing problems (in terms of the speed with which URLs are removed from Google's index).
It's private (to you, Google, and anyone with access to your GSC account), meaning you can block the indexation of sensitive information in this tool. No competitors will be able to tell what you've blocked with this tool.
Remove URL Tool Cons
If you don't know what you are doing, you can accidentally block critical web pages or sections of your website. Or the whole site… (yes, I've seen this happen.)
After 6 months you may need to re-up your URL block, since it's by definition a temporary request. However, pages don't commonly return to the index unless something specific drives them back.
Why do they return sometimes?
Too many internal or external links point to that page (basically—there are strong signals that people care about this page.)
The page isn't also blocked via robots.txt or meta robots
Bottom line, the Remove URL Tool is a supplementary tool to control indexation, more quickly or more effectively. It shouldn't be your only tool. Here is your guide on permanently deindexing stuff from Google.
Common Issues and Mistakes when using these SEO tools together:
These tools can technically be mixed, but we've had personal, painful lessons showing it often backfires 😔. If you choose to mix these tools, we recommend proceeding with caution and monitoring the results closely for issues.
More Pro Tips:
Don't include disallowed, noindexed, or non-canonical pages in your XML sitemap (/sitemap.xml). Only indexable pages with 200 (OK) status codes should be included.
If your indexing problem lies in getting existing or new content into Google search, that's a different problem.
If it's a new page, you should start with submitting the page in GSC via the Inspect URL tool. Make sure to keep a close eye on GSC errors and fix them as soon as possible.
If it's an existing page you've already submitted and Google doesn't deem it worthy of indexation (e.g. you find valid URLs in the "Submitted, not Indexed" Coverage report in GSC), you may have a site/page quality issue or site performance (AKA speed/rendering) issues.
You can explore what Google crawls in detail via your log files, and in aggregate in GSC's Crawl Stats report.
Don't be surprised when Disallowed pages get indexed—especially when those pages are linked somewhere else on the web, outside of your website. Instead, understand your toolset and resolve the problem the correct way, thus ensuring that you see the right pages in the SERPs.
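Tying the sitemap tip above together, a minimal XML sitemap containing only valid, indexable URLs might look like this (example.com and the paths are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only valid, indexable, 200-status, self-canonical URLs belong here -->
  <url>
    <loc>https://example.com/products/tables</loc>
  </url>
  <url>
    <loc>https://example.com/products/desks</loc>
  </url>
</urlset>
```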
And if crawling and indexing issues continue to disrupt your site’s SEO harmony, reach out to us!
Work With Us
We’ll help craft, teach, and carry out SEO roadmaps that check all the boxes.