Why 100% indexing isn't possible, and why that's okay

While 100% indexing may be “technically” possible, it most likely won’t be possible in reality. When it comes to topics like crawl budget, the historical line has always been that it’s a problem reserved for large websites (classed by Google as having more than 1 million pages) and medium-sized websites whose content changes frequently. In recent months, however, crawling and indexing have become more common topics on SEO forums and in questions put to Googlers on Twitter.

A number of the major coverage changes I’ve seen have also coincided with unconfirmed Google updates and high volatility in SERP sensors and trackers. Given that none of the affected websites have much in common in terms of stack, niche, or even technical issues, does this mean that 100% indexing (for most websites) is not currently possible, and that’s okay?

The web is expanding faster than Google’s capabilities and tools

Google states in its documentation that the web is expanding at a rate far beyond its own capacity and tools to crawl (and index) every URL. In the same documentation, Google outlines a number of factors that affect its crawl capacity as well as its crawl demand, including:

The popularity of your URLs (and your content).
How stale the content is.
How quickly the site responds.
Google’s knowledge (perceived inventory) of the URLs on your site.

From conversations with Google’s John Mueller on Twitter, the popularity of your URL isn’t necessarily affected by the popularity of your brand name and/or domain.

I have first-hand experience of a major publisher whose content was not indexed because it was not unique enough compared with similar content already published online, as if it fell below the quality threshold and did not have a high enough SERP inclusion value.

That’s why when working with all websites of a certain size or type (such as e-commerce), I’ve pointed out from day one that 100% indexing isn’t always a measure of success.

Tiered indexing and shards

Google has been quite open about explaining how its indexing works. It uses tiered indexing (some content sits on better servers for faster access), and it has a serving index stored across a number of data centers that essentially holds the data served in a SERP. Oversimplifying this:

The contents of the web page (the HTML document) are tokenized and stored across shards, and the shards themselves are indexed (like a glossary) so that they can be queried more quickly and easily for specific keywords when a user searches.

A lot of the time, indexing issues are blamed on technical SEO. If you have noindex tags or issues and inconsistencies that prevent Google from indexing your content, then it is technical, but more often than not it’s a value proposition issue.
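To make the storage description above more concrete (page contents split into terms, spread across partitions, and each partition indexed like a glossary), here is a minimal Python sketch of a partitioned inverted index. It is a toy illustration, not how Google implements it: the shard count, the choice to partition by a hash of the term, and the names `tokenize`, `shard_for`, `build_index`, and `search` are all assumptions made for this example.

```python
import re
from collections import defaultdict
from zlib import crc32

NUM_SHARDS = 4  # hypothetical value; real systems use far more partitions


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shard_for(term):
    """Pick a shard deterministically from the term (an assumption of this sketch)."""
    return crc32(term.encode("utf-8")) % NUM_SHARDS


def build_index(documents):
    """Return one inverted index (term -> set of URLs) per shard."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for url, text in documents.items():
        for term in tokenize(text):
            shards[shard_for(term)][term].add(url)
    return shards


def search(shards, query):
    """Return the URLs that contain every term in the query."""
    results = None
    for term in tokenize(query):
        # Only the shard that owns this term needs to be consulted.
        postings = shards[shard_for(term)].get(term, set())
        results = set(postings) if results is None else results & postings
    return results or set()
```

Calling `search(build_index(pages), "red running")` on a small dictionary of URL-to-text pairs returns only the URLs containing both words. Each keyword lookup touches a single shard, which is the glossary-style speed-up the paragraph describes.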

Beneficial purpose and SERP inclusion value

When I talk about value proposition, I’m referring to two concepts from Google’s Quality Rater Guidelines (QRGs):

Beneficial purpose
Page quality
Taken together, these create what I refer to as the SERP inclusion value.

This is commonly why web pages end up in a “currently not indexed” category in the Google Search Console coverage report.

In the QRGs, Google states:

“Remember that if a page lacks a beneficial purpose, it should always be rated Lowest Page Quality regardless of the page’s Needs Met rating or how well-designed the page may be.”

What does this mean? A page can target the right keywords and tick the right boxes. But if it largely repeats other content and lacks additional value, Google may choose not to index it. This is where we run into Google’s quality threshold, the concept of whether or not a page has the “quality” necessary to be indexed.

An important part of how this quality threshold works is that it is near real-time and fluid. Google’s Gary Illyes confirmed this on Twitter: a URL can be indexed initially and then dropped when new (better) URLs are found, or even be given a temporary “freshness” boost from manual submission in GSC.

The first thing to determine is whether the number of pages in the Google Search Console coverage report is moving from included to excluded.

Define your collective data

A shift like this, on its own and out of context, is enough to make most marketing stakeholders anxious. But how many of these pages do you care about? How many of these pages drive value? You can determine this from your collective data. You will see in your analytics platform whether traffic and revenue/leads are going down, and you will notice in third-party tools whether you are losing overall market visibility and rankings. Once you’ve determined whether valuable pages are dropping out of Google’s index, the next step is to understand why. Search Console breaks excluded pages down into further categories.

The main categories you need to be aware of and understand are:

Crawled – currently not indexed

This is something I encounter more in e-commerce and real estate than in any other sector. In 2021, the number of new business applications in the US broke previous records, and as more businesses compete for users, there is plenty of new content being published, but probably not as much new, unique information or insight.

Discovered – currently not indexed

When debugging indexing, I see this issue a lot on e-commerce sites or websites that have taken a heavily programmatic approach to content creation and published many pages at once. The main reason pages fall into this category can be crawl budget: you have just published a great deal of content and new URLs, which greatly increased the number of crawlable and indexable pages on your site, and the crawl budget that Google has determined for your site is not geared to this number of pages.

There is not much you can do to influence this. However, you can use XML sitemaps, HTML sitemaps, and good internal linking to help pass PageRank from important (indexed) pages to these new pages.

The second reason content falls into this category is its quality. This is common on programmatic content or e-commerce sites with a large number of products and PDPs where the products are similar or are variants of one another.

These URLs are identified by Google

Google can detect patterns in URLs, and if it visits a percentage of those pages and doesn’t find any value, it can (and sometimes will) assume that HTML documents with similar URLs will be of the same (low) quality, and choose not to crawl them.

Many of these pages are created intentionally with a customer acquisition objective, such as programmatic location pages or comparison pages aimed at niche users. But these queries are searched at a low frequency, the pages are unlikely to get many eyes, and the content may not be unique enough compared with other programmatic pages, so Google won’t index the low-value content when other options are available.

In this case, you need to evaluate whether the objectives can be achieved within the project’s resources and parameters without the excess pages that clog up crawling and are not seen as valuable.
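One way to see, from your own side, which URL patterns may be affected is to group URLs by a rough template and compare how many in each group are indexed. The Python sketch below is only an illustration under stated assumptions: the input is a hypothetical mapping of URL to indexed status (for example, assembled from a Search Console export), and `url_pattern` uses a deliberately crude rule (keep the first path segment, generalise the rest). It does not reproduce how Google itself matches patterns.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to a rough template, e.g. /locations/{slug}."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return "/"
    # Keep the first segment as the site section; generalise everything after it.
    generalised = [segments[0]]
    for segment in segments[1:]:
        generalised.append("{id}" if segment.isdigit() else "{slug}")
    return "/" + "/".join(generalised)


def indexed_share_by_pattern(url_status):
    """Map each URL pattern to (indexed count, total count, share indexed)."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [indexed, total]
    for url, is_indexed in url_status.items():
        pattern = url_pattern(url)
        counts[pattern][1] += 1
        if is_indexed:
            counts[pattern][0] += 1
    return {
        pattern: (indexed, total, indexed / total)
        for pattern, (indexed, total) in counts.items()
    }
```

A pattern with many URLs and a low indexed share is a candidate for the evaluation described above: either the pages need a stronger value proposition, or there are more of them than the objective requires.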

Duplicate content

Duplicate content is one of the more straightforward categories and is common in e-commerce, publishing, and programmatic content. If the main content of the page, which holds the value proposition, is duplicated on other websites or internal pages, Google will not invest the resources to index it. This also ties into the value proposition and the concept of beneficial purpose. I’ve come across many examples where large, authoritative websites have had content left unindexed because it is the same as other content already available and doesn’t offer unique perspectives or a unique value proposition.
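To get a rough idea of how much of your own main content overlaps between pages, you can compare pages by word shingles and Jaccard similarity. The following Python sketch is one common, generic technique for near-duplicate detection, not a description of Google’s method. The shingle size, the 0.8 threshold, and the function names are assumptions chosen for illustration, and the input is a hypothetical mapping of URL to extracted main content.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.8):
    """Return (url_a, url_b, similarity) for page pairs at or above the threshold."""
    fingerprints = {url: shingles(text) for url, text in pages.items()}
    urls = sorted(fingerprints)
    pairs = []
    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            score = jaccard(fingerprints[first], fingerprints[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

The pairwise comparison is quadratic in the number of pages, so this is suited to spot-checking a template or product category rather than a whole large site.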

Taking action

For most large websites and decent-sized medium websites, achieving 100% indexing is only going to get harder, because Google has to process all the new and existing content on the web. What should you do if you find that valuable content is falling below the quality threshold?

Improve internal linking from “high value” pages: This doesn’t necessarily mean the pages with the most backlinks, but the pages that rank for the most keywords.
Prune low-quality, low-value content: If the pages excluded from the index are low quality and drive no value (for example, pageviews and conversions), they should be pruned. Keeping them live wastes Google’s crawl resources whenever it chooses to crawl them, and can affect its assumptions of quality based on URL pattern matching and perceived inventory.
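The two actions above can be turned into a simple first-pass triage of excluded URLs. The Python sketch below assumes a hypothetical list of records joining each excluded URL with its pageviews and conversions from your analytics platform. The field names and default thresholds are illustrative only, and the output is a starting point for manual review rather than a final decision.

```python
def triage_excluded_pages(pages, min_pageviews=1, min_conversions=1):
    """Split excluded URLs into ones worth improving and pruning candidates."""
    improve, prune = [], []
    for page in pages:
        # A page counts as valuable if it meets either threshold (an assumption).
        has_value = (
            page["pageviews"] >= min_pageviews
            or page["conversions"] >= min_conversions
        )
        (improve if has_value else prune).append(page["url"])
    return {"improve": improve, "prune": prune}
```

URLs in the `improve` list are the ones to support with better internal links and stronger content, while those in the `prune` list drive no measurable value and are candidates for removal.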
