What is the difference between Crawling, Indexing, and Rendering in SEO?
It's important to understand that a) crawling is not the same thing as indexing, and b) these processes are run by entirely different teams within Google. As such, controlling one does not inherently control the other.
Crawling is defined as a bot, script, or program that visits a webpage and grabs (or "crawls") content and links from it. In other words, crawling is the fundamental first step for search engines as they explore your website.
When it comes to crawling and SEO, the major question at hand is: do you give search engines permission to access the site or a specific page, and to crawl the content they find there?
In Google’s case, the crawler itself is generally referred to as “Googlebot”. There’s much to learn about how Googlebot prioritizes which pages to crawl and which to skip (read 5 things you didn’t know about Googlebot for a deeper dive).
Simply put, if a page is not crawled, it won’t be indexed and won’t be displayed in the search results.
Here's a list of frequently used web crawlers and their respective User Agents as of August 2020 (with an admittedly US-based perspective):
Google (Desktop): Googlebot
Google (Mobile/Smartphone): Googlebot (it's a mobile-first world, y'all!)
Google Videos: Googlebot, Googlebot-Video
Google Images: Googlebot, Googlebot-Image
Google News: Googlebot-News
Mobile Apps (Android): AdsBot-Google-Mobile-Apps
AdsBot Mobile: AdsBot-Google-Mobile
DuckDuckGo's bot: DuckDuckBot
Baidu's bot: Baiduspider
Yandex's bot: YandexBot
Yahoo's bot: Slurp
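To make those User Agent names concrete, here's a hypothetical robots.txt snippet showing per-bot rules (the directory names are invented for illustration):

```txt
# Googlebot may crawl everything (an empty Disallow permits all)
User-agent: Googlebot
Disallow:

# Keep Google Images out of a hypothetical /press-assets/ directory
User-agent: Googlebot-Image
Disallow: /press-assets/

# All other bots: stay out of /admin/
User-agent: *
Disallow: /admin/
```

Each bot matches the most specific User-agent group that names it, and falls back to the * group otherwise.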
When it comes to indexing, the question is: do you give search engines permission to include a website, or a specific URL on that site, in their search engine index (i.e. the set of pages they can show in search results)?
In Google’s case, the indexing engine is called Caffeine. If/when the search engine decides to include your webpage in their search results, it becomes “indexed.”
Indexing typically follows crawling, though you should note that crawling (of your site/page) isn’t technically required. Google and other search engines may opt to include a page/site in their index after crawling links to your site/page, either from your site or from external websites. These links can make your pages appear valuable to Google, and therefore worthy of indexation.
Rendering is a little out of scope for our conversation here, but the important things to keep in mind are that a) it’s yet another important SEO step, and b) rendering can and does impact indexation.
The more "work" it takes to render something (time, budget, and resources for search engines), the less likely it is to rank. Alternatively, the higher quality and value it has to provide in order to qualify to rank, and rank well; in this situation, the popularity of your brand and of the page in question (read as: lots of backlinks!) matters a LOT more. Put simply, the harder you make Google work to rank your website, the less likely they are to do it.
With that in mind, here’s a guide to the most common tools & means of controlling search engine crawling and indexing, so you can set the correct instructions for your use case—and avoid common indexing issues.
What SEO tags and tools are available to control crawling and indexing?
Note: if you are looking for a “robots.txt noindex” option, that no longer exists. It was never an “approved” method, but it was highly effective. Unfortunately, it’s been officially retired.
Definition: Robots.txt is a text file that sets crawling instructions for bots/spiders (like Googlebot) for a specific website, generally to set boundaries around what is and is not permissible to crawl. Think of it as the set of instructions for how search engines should use your website, and what they can and can NOT access.
Basic Rules for the Robots.txt file:
It MUST exist at /robots.txt (off the root domain).
It's case sensitive. For The Gray Dot Company, that's https://thegray.company/robots.txt. /Robots.txt (with an uppercase R) is not a valid location.
If you don't have a robots.txt file at this exact URL path, requests for it will return a 404, and search engines will assume no crawl restrictions apply.
Bots can *choose* to follow these instructions... or not. Malicious bots don't care, whereas Googlebot and Bingbot will generally respect your instructions.
Robots.txt Pros:
Site-level and directory (or folder)-level control of crawling
The simple pattern-matching syntax is supported & pretty user-friendly
It's easy to edit in most cases
Robots.txt Cons:
It can only control crawling... NOT indexation
Page-level control was never intended, and if/when you try to do this via the robots.txt file, it gets messy pretty quickly
It's publicly accessible, and therefore should not be used to hide sensitive or confidential information
It's relatively easy to accidentally block crawling sections of your site that you did not intend to block
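To illustrate the pattern-matching syntax mentioned above, here's a sketch of common wildcard rules (the paths are hypothetical; adapt them to your own site):

```txt
User-agent: *
# * matches any sequence of characters: block all URLs with a query string
Disallow: /*?
# $ anchors the match to the end of the URL: block PDFs only
Disallow: /*.pdf$
# Directory-level block
Disallow: /internal-search/
```

Wildcard rules are exactly where accidental over-blocking happens, so test them before deploying.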
Robots.txt Pro Tips:
Google provides a tool for testing your robots.txt file to ensure it's working as expected. You can and should test pages that should be accessible and SHOULD NOT be accessible to confirm you set it up correctly.
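In addition to Google's tester, you can sanity-check robots.txt rules locally. Here's a minimal sketch using Python's standard-library urllib.robotparser; example.com and the rules are hypothetical. One caveat: Python's parser applies rules in file order (first match wins), which can differ from Googlebot's longest-match behavior, so treat this as a rough check and confirm in Google's own tool.

```python
# Sketch: testing robots.txt rules locally with Python's standard library.
# The rules and URLs below are hypothetical; substitute your own.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /search/help
Disallow: /admin/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Pages that SHOULD be crawlable:
print(rp.can_fetch("Googlebot", "https://example.com/products/tables"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/search/help"))      # True

# Pages that should NOT be crawlable:
print(rp.can_fetch("Googlebot", "https://example.com/admin/settings"))   # False
print(rp.can_fetch("Googlebot", "https://example.com/search?q=tables"))  # False
```

Running checks like these for both allowed and disallowed URLs mirrors the "test both directions" advice above.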
Definition: Meta Robots is a snippet of code you can add to the <head> section of your website's HTML that gives you page-level control of bot/spider crawling AND indexing. Meta robots is not a required tag; if/when not set, the default value is "index, follow." In other words, everything is accessible until you say otherwise using one of these tools. This is sometimes referred to as a "noindex tag."
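For illustration, here's what the tag looks like in a page's <head> (the "noindex, nofollow" values shown are just one possible combination):

```html
<head>
  <!-- Page-level instruction: don't index this page, don't follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```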
Meta Robots Pros:
Page-level control means granular control on an individual URL basis
It's a highly effective means of controlling page indexation
It’s considered a “directive”, meaning that it’s very likely search engines will comply with your instructions.
Meta Robots Cons:
This tag can't be read or found if it's blocked via robots.txt. In other words, if a URL is blocked via robots.txt, Googlebot won't even crawl the page to see what the meta robots tag is set to. This isn't a "con" so much as it's important to be aware of.
It's inefficient at controlling crawl budget at scale; Googlebot still has to crawl each page first to see whether it can index it, at which point crawl budget has potentially already been wasted.
Note that this doesn't matter for small or even most medium-sized sites. Crawl budget concerns are pretty much only a large and very large site issue (e.g. many thousands of pages+)
Much like the robots.txt file, bots don't have to follow your instructions. "Good citizen" bots—like those from most major search engines—will respect it, however.
While other instructions are available, the most common ones control a) indexation (index and noindex) of the page itself, and b) crawling to pages linked on the origin page (follow or nofollow.)
These values can be, and usually are, combined, though that's not required. The combinations below are in rough order of frequency of use:
“noindex, nofollow”: generally the most common reason you'd add a robots meta tag to your site in the first place ("index, follow" is more common overall, but since it's the default, there's rarely a need to set it explicitly). It's commonly used when you don't want search engines to pay attention to the page at all.
“index, follow”: the page in question is good and valid, and therefore should be included in search engine results. Pages you link to from this page should be crawled as well. This fits the use case for most pages on the internet—any normal publicly accessible & valid page.
“noindex, follow": when you want the page itself kept out of Google's index (and other search engines' indexes), but you want, or don't mind, bots crawling and passing PageRank to the pages linked from the page this command is set on. (NOTE that over time, Google will treat your "follow" value as "nofollow" because the noindex instruction is so strong. In other words, a "noindex, follow" command will eventually be read as "noindex, nofollow.")
“index, nofollow”: the page in question is good and valid, but you don't want any of the links on the page to be crawled. Perhaps a good use case for sponsored content. In any case, this is pretty uncommon.
Definition: the canonical instruction is a snippet of code included in the <head> section of a website in which you can give page-level insight to search engines about original, near-duplicate, and duplicate content—and which pages are the source of that content. There are other methods of implementing this, but they are MUCH less common.
The original source of the content should self-canonicalize (refer to itself as the original source of content); a duplicate or near-duplicate page should generally point to the other, original content page as the canonical source.
Important pages don't technically have to self-canonicalize, but it's still a good idea—especially if you have parameter variations of those pages.
NOTE that canonicalized pages (i.e. pages that point to other pages and not themselves) are both crawlable and indexable.
The tag is a "hint" about how you want Google and other search engines to handle that tag, not a "directive" about what they must do. In practice, that means that some canonicalized pages are indexed, and some aren't. The search engine gets to make this decision.
In most cases, the canonical tag takes the SEO equity a page has acquired and passes it to the page referenced in the tag (i.e. itself, or another page), giving that page a stronger ability to rank.
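As a sketch (example.com and the paths are hypothetical), the tag looks like this in the <head> of each page:

```html
<!-- On the original page, https://example.com/products/tables, self-canonicalize: -->
<link rel="canonical" href="https://example.com/products/tables">

<!-- On a near-duplicate parameter variation, https://example.com/products?category=tables,
     point to the original instead: -->
<link rel="canonical" href="https://example.com/products/tables">
```

Note that both pages carry the same href: the original declares itself the source, and the variation defers to it.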
Common Use Cases for Utilizing Canonical Tags:
Syndicated content that's published on multiple websites
Sites with technical issues or parameter variations
Example 1: the difference between /Robots.txt and /robots.txt (yes, URLs are in fact case sensitive!)
Example 2: Which URL do you want ranking: /products?category=tables or /products/tables? eCommerce facets and filters can, in many cases, be properly handled via canonicals.
Sites with structural issues, or similar but slightly different target audiences: Sometimes it's desirable to repeat the same page in a different section of the site—perhaps for a different audience with small copy tweaks to speak to the audience in question.
Brands that utilize a subdomain strategy to promote certain products as stand-alone products; they often have a page on the core site for that product in addition to a microsite.
Canonical Implementation Rules:
Full path URL references (including the https:// or http:// protocol) are required
Don't canonicalize to pages that aren't valid, indexable pages themselves (this sends conflicting signals, and search engines will likely ignore the canonical)
Don't canonicalize paginated pages to page 1.
Pagination is valid and non-duplicative by nature (Page 2 is not Page 1!); if you want search engines crawling to pages linked via pagination (e.g. the products listed on page 2) this canonical error will result in less crawling and reduced SEO equity getting passed to those pages.
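Per the pagination rule above, each paginated page should reference itself. A hypothetical sketch:

```html
<!-- Correct: page 2 self-canonicalizes -->
<link rel="canonical" href="https://example.com/products/tables?page=2">

<!-- Incorrect: canonicalizing page 2 back to page 1 suppresses crawling
     of the links on page 2 -->
<!-- <link rel="canonical" href="https://example.com/products/tables"> -->
```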
Canonical Tag Pros:
For the most part, it does what it's meant to do: give credit to the original content
Canonical tags help consolidate link equity from duplicate or overlapping pages
They’re relatively simple to implement with popular CMS plugins, like Yoast
Canonical Tag Cons:
The tag can't be read/found if it's blocked via robots.txt; it's the same problem as with meta robots. If you disallow the page via robots.txt, Googlebot won't crawl the page to even see the canonical tag.
It's relatively easy to make mistakes
SEO QA is a critical step to ensure you've set this up correctly. Common mistakes with canonical tags are outlined below.
Google Search Console (GSC) Remove URL Tool
Definition: The Remove URL tool is available in GSC; it allows for explicit site-level, directory-level, AND page-level control of indexation... temporarily.
We generally recommend using this tool for *quickly* deindexing content that you've also blocked via another means; it helps the transition process go much more quickly.
URLs you block via this tool should also be blocked via another means—robots.txt OR the meta robots tag.
Remove URL Tool Pros
It's the quickest, most effective means of deindexing URLs and therefore resolving key indexing problems (in terms of the speed with which URLs are removed from Google's index).
It's private (to you, Google, and anyone with access to your GSC account), meaning you can block the indexation of sensitive information in this tool. No competitors will be able to tell what you've blocked with this tool.
Remove URL Tool Cons
If you don't know what you are doing, you can accidentally block critical web pages or sections of your website. Or the whole site… (yes, I've seen this happen.)
After 6 months you may need to re-up your URL block, since it's by definition a temporary request. However, pages don't commonly return to the index unless something specific drives them back.
Why do they return sometimes?
Too many internal or external links point to that page (basically—there are strong signals that people care about this page.)
The page isn't also blocked via robots.txt or meta robots
Bottom line, the Remove URL Tool is a supplementary tool to control indexation, more quickly or more effectively. It shouldn't be your only tool. Here is your guide on permanently deindexing stuff from Google.
Common Issues and Mistakes when using these SEO tools together:
These tools can technically be mixed, but we've had personal, painful lessons showing it often backfires 😔. If you choose to mix these tools, we recommend proceeding with caution and monitoring the results closely for issues.
More Pro Tips:
Don't include disallowed, noindexed, or non-canonical pages in your XML sitemap (/sitemap.xml). Only indexable pages with 200 (OK) status codes should be included.
If your indexing problem lies in getting existing or new content into Google search, that's a different problem.
If it's a new page, you should start with submitting the page in GSC via the Inspect URL tool. Make sure to keep a close eye on GSC errors and fix them as soon as possible.
If it's an existing page you've already submitted and Google doesn't deem it worthy of indexation (e.g. you find valid URLs in the "Submitted, not Indexed" Coverage report in GSC), you may have a site/page quality issue or site performance (AKA speed/rendering) issues.
You can explore what Google crawls in detail via your log files, and in aggregate in GSC's Crawl Stats report.
Don't be surprised when Disallowed pages get indexed—especially when those pages are linked somewhere else on the web, outside of your website. Instead, understand your toolset and resolve the problem the correct way, thus ensuring that you see the right pages in the SERPs.
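Tying the sitemap tip above together, a minimal XML sitemap containing only valid, indexable URLs might look like this (example.com and the paths are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only valid, indexable, 200-status, self-canonical URLs belong here -->
  <url>
    <loc>https://example.com/products/tables</loc>
  </url>
  <url>
    <loc>https://example.com/products/desks</loc>
  </url>
</urlset>
```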
And if crawling and indexing issues continue to disrupt your site’s SEO harmony, reach out to us!
Work With Us
We’ll help craft, teach, and carry out SEO roadmaps that check all the boxes.