Posted by rjonesx.
It’s all wrong
It always was. Most of us knew it. But with limited resources, we just couldn’t really compare the quality, size, and speed of link indexes very well. Frankly, most backlink index comparisons would barely pass for a high school science fair project, much less a rigorous peer review.
My most earnest attempt at determining the quality of a link index was back in 2015, before I joined Moz as Principal Search Scientist. But I knew at the time that I was missing a huge key to any study of this sort that hopes to call itself scientific, authoritative or, frankly, true: a random, uniform sample of the web.
But let me start with a quick request. Please take the time to read this through. If you can’t today, schedule some time later. Your businesses depend on the data you bring in, and this article will allow you to stop taking data quality on faith alone. If you have questions about some of the technical aspects, I will respond in the comments, or you can reach me on Twitter at @rjonesx. I desperately want our industry to finally get this right and to hold ourselves as data providers to rigorous quality standards.
Getting it right
One of the greatest things Moz offers is a leadership team that has given me the freedom to do what it takes to “get things right.” I first encountered this when Moz agreed to spend an enormous amount of money on clickstream data so we could make our keyword tool search volume better (a huge, multi-year financial risk with the hope of improving literally one metric in our industry). Two years later, Ahrefs and SEMRush now use the same methodology because it’s just the right way to do it.
About 6 months into this multi-year project to replace our link index with the huge Link Explorer, I was tasked with the open-ended question of “how do we know if our link index is good?” I had been thinking about this question ever since that article was published in 2015, and I knew I wasn’t going to go forward with anything other than a system that begins with a truly “random sample of the web.” Once again, Moz asked me to do what it takes to “get this right,” and they let me run with it.
What’s the big deal with random?
It’s really hard to overstate how important a good random sample is. Let me digress for a second. Let’s say you look at a survey that says 90% of Americans believe that the Earth is flat. That would be a terrifying statistic. But later you find out the survey was taken at a Flat-Earther convention and the 10% who disagreed were employees of the convention center. This would make total sense. The problem is that the sample of people surveyed wasn’t of random Americans. Instead, it was biased because it was taken at a Flat-Earther convention.
Now, imagine the same thing for the web. Let’s say an agency wants to run a test to determine which link index is better, so they look at a few hundred sites for comparison. Where did they get the sites? Past clients? Then they are probably biased towards SEO-friendly sites and not reflective of the web as a whole. Clickstream data? Then they would be biased towards popular sites and pages — once again, not reflective of the web as a whole!
Starting with a bad sample guarantees bad results.
It gets even worse, though. Index providers like Moz report total statistics (number of links or number of domains in the index). However, this can be terribly misleading. Imagine a restaurant which claimed to have the largest wine selection in the world with over 1,000,000 bottles. They could make that claim, but it wouldn’t be useful if they actually had 1,000,000 of the same type, or only Cabernet, or half-bottles. It’s easy to mislead when you just throw out big numbers. Instead, it would be much better to take a random selection of wines from the world and measure whether that restaurant has each one in stock, and how many. Only then would you have a good measure of their inventory. The same is true for measuring link indexes, and this is the theory behind my methodology.
Unfortunately, it turns out getting a random sample of the web is really hard. The first intuition most of us at Moz had was to just take a random sample of the URLs in our own index. Of course we couldn’t — that would bias the sample towards our own index, so we scrapped that idea. The next thought was: “We know all these URLs from the SERPs we collect — perhaps we could use those.” But we knew they’d be biased towards higher-quality pages. Most URLs don’t rank for anything — scratch that idea. It was time to take a deeper look.
I fired up Google Scholar to see if any other organizations had attempted this process and found literally one paper, which Google produced back in June of 2000, called “On Near-Uniform URL Sampling.” I hastily whipped out my credit card to buy the paper after reading just the first sentence of the abstract: “We consider the problem of sampling URLs uniformly at random from the Web.” This was exactly what I needed.
Why not Common Crawl?
Many of the more technical SEOs reading this might ask why we didn’t simply select random URLs from a third-party index of the web like the fantastic Common Crawl data set. There were several reasons why we considered this methodology but chose to pass on it (despite it being far easier to implement).
We can’t be certain of Common Crawl’s long-term availability. Top million lists (which we used as part of the seeding process) are available from multiple sources, which means if Quantcast goes away we can use other providers.
We have contributed crawl sets in the past to Common Crawl and want to be certain there is no implicit or explicit bias in favor of Moz’s index, no matter how marginal.
The Common Crawl data set is quite large and would be harder to work with for many who are attempting to create their own random lists of URLs. We wanted our process to be reproducible.
How to get a random sample of the web
The process of getting to a “random sample of the web” is fairly tedious, but the general gist of it is this. First, we start with a well-understood biased set of URLs. We then attempt to remove or balance this bias out, making the best pseudo-random URL list we can. Finally, we use a random crawl of the web starting with those pseudo-random URLs to produce a final list of URLs that approaches truly random. Here are the complete details.
The first big problem with getting a random sample of the web is that there is no true random starting point. Think about it. Unlike a bag of marbles where you could just reach in and blindly grab one at random, if you don’t already know about a URL, you can’t pick it at random. You could try to just brute-force create random URLs by shoving letters and slashes after each other, but we know language doesn’t work that way, so the URLs would be very different from what we tend to find on the web. Unfortunately, everyone is forced to start with some pseudo-random process.
We had to make a choice. It was a tough one. Do we start with a known strong bias that doesn’t favor Moz, or do we start with a known weaker bias that does? We could use a random selection from our own index for the starting point of this process, which would be pseudo-random but could potentially favor Moz, or we could start with a smaller, public index like the Quantcast Top Million which would be strongly biased towards good sites.
We decided to go with the latter as the starting point because Quantcast data is:
Reproducible: We weren’t going to make “random URL selection” part of the Moz API, so we needed something others in the industry could start with as well. Quantcast Top Million is free to everyone.
Not biased towards Moz: We would prefer to err on the side of caution, even if it meant more work removing bias.
Well-known bias: The bias inherent in the Quantcast Top 1,000,000 was easily understood. These are important sites, and we need to remove that bias.
Quantcast bias is natural: Any link graph itself already shares some of the Quantcast bias (powerful sites are more likely to be well-linked).
With that in mind, we randomly selected 10,000 domains from the Quantcast Top Million and began the process of removing bias.
Since we knew the Quantcast Top Million was ranked by traffic and we wanted to mitigate that bias, we introduced a new bias based on the size of the site. For each of the 10,000 sites, we identified the number of pages on the site according to Google using the “site:” command and also grabbed the top 100 pages from the domain. Now we could balance the “importance bias” against a “size bias,” which is more reflective of the number of URLs on the web. This was the first step in mitigating the known bias of only high-quality sites in the Quantcast Top Million.
The next step was randomly selecting domains from that 10,000 with a bias towards larger sites. When the system selects a site, it then randomly selects from the top 100 pages we gathered from that site via Google. This helps mitigate the importance bias a little more. We aren’t always starting with the homepage. While these pages do tend to be important pages on the site, we know they aren’t always the MOST important page, which tends to be the homepage. This was the second step in mitigating the known bias. Lower-quality pages on larger sites were balancing out the bias intrinsic to the Quantcast data.
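For anyone trying to reproduce this seeding step, here is a minimal Python sketch of size-biased selection. It is not the code we ran. The function name `pick_seed_url`, the shape of the `sites` dictionary, and the example domains are all hypothetical, and weighting each domain in direct proportion to its page count is an assumption; the description above only says the selection is biased towards larger sites.

```python
import random


def pick_seed_url(sites, rng):
    """Pick one pseudo-random starting URL.

    `sites` maps a domain to (page_count, top_pages): the page-count
    estimate for the site and the list of up to 100 pages gathered for it.
    """
    domains = list(sites)
    # Assumption: selection weight is proportional to the page count.
    weights = [sites[domain][0] for domain in domains]
    domain = rng.choices(domains, weights=weights, k=1)[0]
    # Choose uniformly among the gathered pages, so the walk does not
    # always start at the homepage.
    return rng.choice(sites[domain][1])


# Made-up example data, for illustration only.
example_sites = {
    "big.example": (9000, ["https://big.example/a", "https://big.example/b"]),
    "small.example": (1000, ["https://small.example/"]),
}
print(pick_seed_url(example_sites, random.Random(42)))
```

With these made-up numbers, roughly nine out of ten seeds should come from the larger site, which is the size bias working against the importance bias of the original list.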
And here is where we make our biggest change. We actually crawl the web starting with this set of pseudo-random URLs to produce the actual set of random URLs. The idea here is to take all the randomization we have built into the pseudo-random URL set and let the crawlers randomly click on links to produce the truly random URL set. The crawler will select a random link from our pseudo-random crawlset and then start a process of randomly clicking links, each time with a 10% chance of stopping and a 90% chance of continuing. Wherever the crawler ends, the final URL is dropped into our list of random URLs. It is this final set of URLs that we use to run our metrics. We generate around 140,000 unique URLs through this process monthly to produce our test data set.
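The random walk described above can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions and not our production crawler: `random_walk` and `get_links` are hypothetical names, `get_links` stands in for whatever fetches and parses a page’s outgoing links, the stop check is assumed to happen before every click (including the first), and ending the walk at a page with no links is also an assumption.

```python
import random


def random_walk(start_url, get_links, rng, stop_probability=0.10):
    """Follow random links from start_url and return the URL where the walk ends.

    get_links(url) must return a list of outgoing link URLs for that page.
    """
    url = start_url
    while True:
        # 10% chance of stopping, 90% chance of continuing, at every step.
        if rng.random() < stop_probability:
            return url
        links = get_links(url)
        if not links:
            # Assumption: a dead end finishes the walk on the current page.
            return url
        url = rng.choice(links)


# A tiny in-memory link graph standing in for the real web.
toy_graph = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://c.example/"],
    "https://c.example/": ["https://a.example/"],
}
final_url = random_walk(
    "https://a.example/", lambda u: toy_graph.get(u, []), random.Random(7)
)
print(final_url)
```

Under these assumptions, and ignoring dead ends, the number of clicks before stopping follows a geometric distribution with a mean of 0.9 / 0.1 = 9, so a typical walk ends several links away from its pseudo-random starting point.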
Phew, now what? Defining metrics
Once we have the random set of URLs, we can start really comparing link indexes and measuring their quality, quantity, and speed. Luckily, in their quest to “get this right,” Moz gave me generous paid access to competitor APIs. We began by testing Moz, Majestic, Ahrefs, and SEMRush, but eventually dropped SEMRush after their partnership with Majestic.
So, what questions can we answer now that we have a random sample of the web? This is the exact wishlist I sent out in an email to leaders on the link project at Moz:
Size:
What is the likelihood a randomly selected URL is in our index vs. competitors?
What is the likelihood a randomly selected domain is in our index vs. competitors?
What is the likelihood an index reports the highest number of backlinks for a URL?
What is the likelihood an index reports the highest number of root linking domains for a URL?
What is the likelihood an index reports the highest number of backlinks for a domain?
What is the likelihood an index reports the highest number of root linking domains for a domain?
Speed:
What is the likelihood that the latest article from a randomly selected feed is in our index vs. our competitors?
What is the average age of a randomly selected URL in our index vs. competitors?
What is the likelihood that the best backlink for a randomly selected URL is still present on the web?
What is the likelihood that the best backlink for a randomly selected domain is still present on the web?
Quality:
What is the likelihood that a randomly selected page’s index status (included or not included in index) in Google is the same as ours vs. competitors?
What is the likelihood that a randomly selected page’s index status in Google SERPs is the same as ours vs. competitors?
What is the likelihood that a randomly selected domain’s index status in Google is the same as ours vs. competitors?
What is the likelihood that a randomly selected domain’s index status in Google SERPs is the same as ours vs. competitors?
How closely does our index compare with Google’s expressed as “a proportional ratio of pages per domain vs our competitors”?
How well do our URL metrics correlate with US Google rankings vs. our competitors?
Reality vs. theory
Unfortunately, like all things in life, I had to make some cutbacks. It turns out that the APIs provided by Moz, Majestic, Ahrefs, and SEMRush differ in some important ways: in cost structure, feature sets, and optimizations. For the sake of politeness, I am only going to mention the name of the provider when it is Moz that was lacking. Let’s look at each of the proposed metrics and see which ones we could keep and which we had to put aside…
Size:
We were able to monitor all 6 of the size metrics!
Speed:
What is the likelihood that the latest article from a randomly selected feed is in our index vs. our competitors? We were able to include this Fast Crawl metric.
What is the average age of a randomly selected URL in our index vs. competitors? Getting the age of a URL or domain is not possible in all APIs, so we had to drop this metric.
What is the likelihood that the best backlink for a randomly selected URL is still present on the web? Unfortunately, doing this at scale was not possible because one API is cost prohibitive for top link sorts and another was extremely slow for large sites. We hope to run a set of live-link metrics independently from our daily metrics collection in the next few months.
What is the likelihood that the best backlink for a randomly selected domain is still present on the web? Once again, doing this at scale was not possible because one API is cost prohibitive for top link sorts and another was extremely slow for large sites. We hope to run a set of live-link metrics independently from our daily metrics collection in the next few months.
Quality:
What is the likelihood that a randomly selected page’s index status (included or not included in index) in Google is the same as ours vs. competitors? We were able to keep this metric.
What is the likelihood that a randomly selected page’s index status in Google SERPs is the same as ours vs. competitors? Chose not to pursue due to internal API needs, looking to add soon.
What is the likelihood that a randomly selected domain’s index status in Google is the same as ours vs. competitors? We were able to keep this metric.
What is the likelihood that a randomly selected domain’s index status in Google SERPs is the same as ours vs. competitors? Chose not to pursue due to internal API needs at the beginning of project, looking to add soon.
How closely does our index compare with Google’s expressed as a proportional ratio of pages per domain vs our competitors? Chose not to pursue due to internal API needs. Looking to add soon.
How well do our URL metrics correlate with US Google rankings vs. our competitors? Chose not to pursue due to known fluctuations in DA/PA as we radically change the link graph. The metric would be meaningless until the index became stable.
Ultimately, I wasn’t able to get everything I wanted, but I was left with 9 solid, well-defined metrics.
On the subject of live links:
In the interest of being TAGFEE, I will openly admit that I think our index has more deleted links than others like the Ahrefs Live Index. As of writing, we have about 30 trillion links in our index, 25 trillion of which we believe to be live, though we know that some proportion of those likely are not. While I believe we have the most live links, I don’t believe we have the highest proportion of live links in an index. That honor probably does not go to Moz. I can’t be certain because we can’t test it fully and regularly, but in the interest of transparency and fairness, I felt obligated to mention this. I might, however, devote a later post to just testing this one metric for a month and describe the proper methodology to do this fairly, as it is a deceptively tricky metric to measure. For example, if a link is retrieved from a chain of redirects, it is hard to tell if that link is still live unless you know the original link target. We weren’t going to track any metric if we couldn’t “get it right,” so we had to put live links as a metric on hold for now.
Caveats
Don’t read any more before reading this section. If you ask a question in the comments that shows you didn’t read the Caveats section, I’m just going to say “read the Caveats section.” So here goes…
This is a comparison of data that comes back via APIs, not within the tools themselves. Many competitors offer live, fresh, historical, etc. types of indexes which can differ in important ways. This is just a comparison of API data using default settings.
We set the API flags to remove any and all known Deleted Links from Moz metrics but not competitors. This actually might bias the results in favor of competitors, but we thought it would be the most honest way to represent our data set against more conservative data sets like Ahrefs Live.
Some metrics are hard to estimate, especially ones like “whether a link is in the index,” because no API, not even Moz’s, has a call that just tells you whether they have seen the link before. We do our best, but any errors here are on the API provider. I think we (Moz, Majestic, and Ahrefs) should all consider adding an endpoint like this.
Links are counted differently. Whether duplicate links on a page are counted, whether redirects are counted, whether canonicals are counted (which Ahrefs just changed recently), etc. all affect these metrics. Because of this, we can’t be certain that everything is apples-to-apples. We just report the data at face value.
Consequently, the most important takeaway in all of these graphs and metrics is direction. How are the indexes moving relative to one another? Is one catching up, is another falling behind? These are the questions best answered.
The metrics are adversarial. For each random URL or domain, a link index (Moz, Majestic, or Ahrefs) gets 1 point for being the biggest, for tying with the biggest, or for being “correct.” They get 0 points if they aren’t the winner. This means that the graphs won’t add up to 100 and it also tends to exaggerate the differences between the indexes.
Finally, I’m going to show everything, warts and all, even when it was my fault. I’ll point out why some things look weird on graphs and what we fixed.
This was a huge learning experience and I am grateful for the help I received from the support teams at Majestic and Ahrefs, who responded honestly and openly to the questions I asked as a customer.
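To make the adversarial scoring concrete, here is a small Python sketch of how a “highest count” metric could be tallied. The function names, the placeholder index names, and the numbers are all made up for illustration, and this is not the dashboard’s actual code. How a sample where every index reports zero is handled is not covered above; this sketch simply counts it as a tie.

```python
def score_highest(counts):
    """Give 1 point to every index reporting the highest count (ties included).

    `counts` maps an index name to the count it reported for one URL or domain.
    Assumption: if every index reports 0, all of them tie and score a point.
    """
    best = max(counts.values())
    return {name: 1 if value == best else 0 for name, value in counts.items()}


def win_rates(samples):
    """Percentage of samples in which each index scored a point."""
    totals = {}
    for counts in samples:
        for name, point in score_highest(counts).items():
            totals[name] = totals.get(name, 0) + point
    return {name: 100.0 * total / len(samples) for name, total in totals.items()}


# Made-up counts for two sampled URLs across three placeholder indexes.
samples = [
    {"index_a": 10, "index_b": 10, "index_c": 3},  # a and b tie for the lead
    {"index_a": 5, "index_b": 7, "index_c": 1},    # b wins outright
]
print(win_rates(samples))
# {'index_a': 50.0, 'index_b': 100.0, 'index_c': 0.0}
```

The percentages here sum to 150, not 100, because the tie awards a point to two indexes at once. The example also shows the exaggeration effect: index_c scores 0 even though it reported links for both URLs.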
The metrics dashboard