Google’s War on Data and the Clickstream Revolution

Posted on: November 7, 2016 by Local SEO Nerd in Local SEO Strategies


Existential threats to SEO

Rand called “Not Provided” the First Existential Threat to SEO in 2013. While 100% Not Provided was certainly one of the largest and most egregious data grabs by Google, it was part of a long and continued history of Google pulling data sources which benefit search engine optimizers.

A brief history

Nov 2010 – Search API deprecated
Oct 2011 – Google begins Not Provided
Feb 2012 – Sampled data in Google Analytics
Aug 2013 – Google Keyword Tool closed
Sep 2013 – Not Provided ramped up
Feb 2015 – Link Operator degraded
Jan 2016 – Search API killed
Mar 2016 – Google ends Toolbar PageRank
Aug 2016 – Keyword Planner restricted to paid

I don’t intend to say that Google made any of these decisions specifically to harm SEOs, but it is inarguable that the decisions did harm SEO. In our industry, like many others, data is power. Without access to SERP, keyword, and analytics data, our industry’s collective judgment is clouded. A recent survey of SEOs showed that data is more important to them than ever, despite these data retractions.

So how do we proceed in a world in which we need data more and more but our access is steadily restricted by the powers that be? Perhaps we have an answer — clickstream data.

What is clickstream data?

First, let’s give a quick definition of clickstream data to those who are not yet familiar. The most straightforward definition I’ve seen is:

“The process of collecting, analyzing, and reporting aggregate data about which pages users visit in what order.”
– (TechTarget: What is Clickstream Analysis)

If you’ve spent any time analyzing your funnel or looking at how users move through your site, you have utilized clickstream data in performing clickstream analysis. However, traditionally, clickstream data is restricted to sites you own. But what if we could see how users behave across the web — not just our own sites? What keywords they search, what pages they visit, and how they navigate the web? With that data, we could begin to fill in the data gaps previously lost to Google.

I think it’s worthwhile to point out the concerns presented by clickstream data. As a webmaster, you must be thoughtful about what you do with user data. You have access to the referrers which brought visitors to your site, you know what they click on, you might even have usernames, emails, and passwords. In the same manner, being vigilant about anonymizing data and excluding personally identifiable information (PII) has to be the first priority in using clickstream data. Moz and our partners remain vigilant, including our latest partner Jumpshot, whose algorithms for removing PII are industry-leading.

What can we do?

So let’s have some fun, shall we? Let’s start to talk about all the great things we can do with clickstream data. Below, I’ll outline a half dozen or so insights we’ve gleaned from clickstream data that are relevant to search marketers and Internet users in general. First, let me give credit where credit is due: the data for these insights came from two excellent partners, Clickstre.am and Jumpshot.

Popping the filter bubble

It isn’t very often that the interests of search engine marketers and social scientists intersect, so this is a rare opportunity for me to blend my career with my formal education. Search engines like Google personalize results in a number of ways. We regularly see personalization of search results in the form of geolocation, previous sites visited, or even SERP features tailored to things Google knows about us as users. One question posed by social scientists is whether this personalization creates a filter bubble, where users only see information relevant to their interests. Of particular concern is whether this filter bubble could influence important informational queries like those related to political candidates. Does Google show uniform results for political candidate queries, or does it show you the results you want to see based on its personalization models?

Well, with clickstream data we can answer this question quite clearly by looking at the number of unique URLs which users click on from a SERP. Personalized keywords should result in a higher number of unique URLs clicked, as users see different URLs from one another. We randomly selected 50 search-click pairs (a searched keyword and the URL the user clicked on) for the following keywords to get an idea of how personalized the SERPs were.

Dropbox – 10
Google – 12
Donald Trump – 14
Hillary Clinton – 14
Facebook – 15
Note 7 – 16
Heart Disease – 16
Banks Near Me – 107
Landscaping Company – 260

As you can see, highly personalized keywords like “banks near me” or “landscaping company,” which depend on location, receive a large number of unique URLs clicked. This is to be expected and validates the model to a degree. However, candidate names like “Hillary Clinton” and “Donald Trump” are personalized no more than major brands like Dropbox, Google, or Facebook and products like the Samsung Note 7. It appears that the hypothetical filter bubble has burst: most users see the exact same results as one another.
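To make the counting step concrete, here is a minimal sketch in Python of how unique clicked URLs per keyword could be tallied from search-click pairs. The function name, the pair format, and the sample rows are all assumptions made up for illustration; this is not Moz's pipeline or real clickstream data.

```python
from collections import defaultdict

def unique_clicked_urls(pairs):
    """Count distinct clicked URLs per keyword.

    `pairs` is an iterable of (keyword, clicked_url) tuples, which is an
    assumed shape for the search-click pairs described above.
    """
    urls_by_keyword = defaultdict(set)
    for keyword, url in pairs:
        # Normalize case so "Dropbox" and "dropbox" count as one keyword.
        urls_by_keyword[keyword.lower()].add(url)
    return {kw: len(urls) for kw, urls in urls_by_keyword.items()}

# Hypothetical sample rows, invented for this example.
sample_pairs = [
    ("dropbox", "https://www.dropbox.com/"),
    ("Dropbox", "https://www.dropbox.com/"),
    ("dropbox", "https://www.dropbox.com/login"),
    ("banks near me", "https://example.com/bank-a"),
    ("banks near me", "https://example.com/bank-b"),
    ("banks near me", "https://example.com/bank-c"),
]

counts = unique_clicked_urls(sample_pairs)
print(counts)
```

A higher count for a keyword suggests that different users were shown, or at least chose, different results for the same query.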

Biased search behavior

But is that all we need to ask? Can we learn more about the political behavior of users online? It turns out we can. One of the truly interesting features of clickstream data is the ability to do “also-searched” analysis. We can look at clickstream data and determine whether or not a person or group of people are more likely to search for one phrase or another after first searching for a particular phrase. We dove into the clickstream data to see if there were any material differences between subsequent searches of individuals who looked for “donald trump” and “hillary clinton,” respectively. While the majority of the searches were quite the same, as you would expect, searching for things like “youtube” or “facebook,” there were some very interesting differences.

For example, individuals who searched for “donald trump” were twice as likely to then go on to search for “Omar Mateen” as individuals who previously searched for “hillary clinton.” Omar Mateen was the Orlando shooter. Individuals who searched for “hillary clinton” were about 60% more likely to search for “Philando Castile,” the victim of a police shooting and, in particular, one of the more egregious examples. So it seems, at least from this early evidence, that people carry their biases to the search engines, rather than search engines pushing bias back upon them.
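An “also-searched” comparison like this can be sketched as a ratio of follow-up rates. The Python below is only an illustration under stated assumptions: each session is assumed to be an ordered list of queries from one anonymized user, and the function names, placeholder queries, and sample sessions are invented for the example rather than taken from the actual analysis.

```python
def follow_up_rate(sessions, first, follow_up):
    """Share of sessions containing `first` in which `follow_up` is searched later."""
    started = 0
    followed = 0
    for queries in sessions:
        if first in queries:
            started += 1
            # Only look at queries issued after the first occurrence of `first`.
            later = queries[queries.index(first) + 1:]
            if follow_up in later:
                followed += 1
    return followed / started if started else 0.0

def relative_likelihood(sessions, first_a, first_b, follow_up):
    """How many times more likely `follow_up` is after `first_a` than after `first_b`."""
    rate_a = follow_up_rate(sessions, first_a, follow_up)
    rate_b = follow_up_rate(sessions, first_b, follow_up)
    return rate_a / rate_b if rate_b else float("inf")

# Hypothetical sessions with placeholder queries, invented for this example.
sample_sessions = [
    ["query a", "topic x"],
    ["query a", "topic x"],
    ["query a", "youtube"],
    ["query a", "facebook"],
    ["query b", "topic x"],
    ["query b", "youtube"],
    ["query b", "facebook"],
    ["query b", "youtube"],
]

ratio = relative_likelihood(sample_sessions, "query a", "query b", "topic x")
print(ratio)
```

In this toy sample, half of the “query a” sessions go on to “topic x” versus a quarter of the “query b” sessions, so the ratio is 2. A real analysis would also need enough sessions per query for the ratio to be meaningful.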

Getting a real click-through rate model

Search marketers have been looking at click-through rate (CTR) models since the beginning of our craft, trying to predict traffic and earnings under a set of assumptions that have all but disappeared since the days of 10 blue links. With the advent of SERP features like answer boxes, the knowledge graph, and Twitter feeds in the search results, it has been hard to gauge exactly what level of traffic we would derive from any given position.

With clickstream data, we have a path to uncovering those mysteries. For starters, the click-through rate curve is dead. Sorry folks, but it has been for quite some time and any allegiance to it should be categorized as willful neglect.

We have to begin building somewhere, so at Moz we start with opportunity metrics (like the one introduced by Dr. Pete, which can be found in Keyword Explorer) which discount the potential search traffic available from a keyword based on the presence of SERP features. We can use clickstream data to learn the non-linear relationship between SERP features and CTR, which is often counter-intuitive.

Let’s take a quick quiz.

Which SERP has the highest organic click-through rate?

A SERP with just news
A SERP with just top ads
A SERP with sitelinks, knowledge panel, tweets, and ads at the top

Strangely enough, it’s the last that has the highest click-through rate to organic. Why? It turns out that the only queries that get that bizarre combination of SERP features are for important brands, like Louis Vuitton or BMW. As a result, nearly 100% of the click traffic goes to the #1 sitelink, which is the brand website.

Perhaps even more strangely, pages with top ads deliver more organic clicks than those with just news. News tends to entice users more than advertisements.

It would be nearly impossible to come to these revelations without clickstream data, but now we can use the data to find the unique relationships between SERP features and click-through rates.
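One simple way to start exploring that relationship is to group searches by the exact set of SERP features shown and compute the organic click-through rate for each group. The sketch below assumes each observation is a pair of (features shown, whether the user clicked an organic result); the feature labels, function name, and sample numbers are hypothetical and are not the figures behind the quiz above.

```python
from collections import defaultdict

def organic_ctr_by_features(searches):
    """Organic CTR for each distinct combination of SERP features.

    `searches` is an iterable of (features, clicked_organic) pairs, where
    `features` is any iterable of feature labels and `clicked_organic` is a bool.
    """
    totals = defaultdict(lambda: [0, 0])  # feature set -> [searches, organic clicks]
    for features, clicked_organic in searches:
        key = frozenset(features)  # order-independent, hashable key
        totals[key][0] += 1
        totals[key][1] += int(clicked_organic)
    return {key: clicks / n for key, (n, clicks) in totals.items()}

# Hypothetical observations, invented for this example.
brand_serp = {"sitelinks", "knowledge_panel", "tweets", "top_ads"}
sample_searches = (
    [({"news"}, True)] + [({"news"}, False)] * 3
    + [({"top_ads"}, True)] * 2 + [({"top_ads"}, False)] * 2
    + [(brand_serp, True)] * 3 + [(brand_serp, False)]
)

ctr = organic_ctr_by_features(sample_searches)
for features, rate in sorted(ctr.items(), key=lambda item: item[1]):
    print(sorted(features), rate)
```

Grouping by the full feature set, rather than by each feature on its own, is what lets combinations like the brand SERP stand out; a per-feature average would hide them.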

In production: Better volume data

Perhaps Moz’s most well-known usage of clickstream data is our volume metric in Keyword Explorer. There has been a long history of search marketers using Google’s keyword volume as a metric to predict traffic and prioritize keywords. While (not provided) hit SEOs the hardest, it seems like the recent Google Keyword Planner ranges are taking a toll as well.

So how do we address this with clickstream data? Unfortunately, it isn’t as cut-and-dried as simply replacing Google’s data with Jumpshot or another third-party provider. There are several steps involved; here are just a few:

Data ingestion and clean-up
Bias removal
Modeling against Google Volume
Disambiguation corrections

I can’t stress enough how much attention to detail needs to go into these steps in order to make sure you’re adding value with clickstream data rather than simply muddling things further. But I can say with confidence that our complex solutions have had a profoundly positive impact on the data we provide. Let me give you some disambiguation examples that were recently uncovered by our model.

Keyword	Google Volume	Disambiguated Volume
cars part	135000	2900
chopsuey	74000	4400
treatment for mononucleosis	4400	720
lorton va	9900	8100
definition of customer service	2400	1300
marion county detention center	5400	4400
smoke again lyrics	1900	880
should i get a phd	480	320
oakley crosshair 2.0	1000	480
barter 6 download	4400	590
how to build a shoe rack	880	720

Look at the huge discrepancy here for the keyword “cars part.” Most people search for “car parts” or “car part,” but Google groups “cars part” together with those variants, giving it a ridiculously high search volume. We were able to use clickstream data to dramatically lower that number.

The same is true for “chopsuey.” Most people search for it, correctly, as two separate words: “chop suey.”

These corrections to Google search volume data are essential to make accurate, informed decisions about what content to create and how to properly optimize it. Without clickstream data on our side, we would be grossly misled, especially in aggregate data.

How much does this actually impact Google search volume? Roughly 25% of all keywords we process from Google data are corrected by clickstream data. This means tens of millions of keywords monthly.
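To illustrate the general idea behind a disambiguation correction, the following Python sketch splits a grouped Google volume across keyword variants in proportion to how often each variant appears in clickstream data. This is a simplified assumption about how such a correction could work, not Moz's actual model; the function name and the clickstream counts are made up, and only the 135,000 grouped volume comes from the table above.

```python
def disambiguate_volume(grouped_volume, clickstream_counts):
    """Split one grouped volume across variants by their clickstream share.

    `clickstream_counts` maps each keyword variant to the number of times it
    was observed in clickstream data (hypothetical input for this sketch).
    """
    total = sum(clickstream_counts.values())
    if total == 0:
        # No clickstream evidence, so there is nothing to apportion.
        return {keyword: 0 for keyword in clickstream_counts}
    return {
        keyword: round(grouped_volume * count / total)
        for keyword, count in clickstream_counts.items()
    }

# Hypothetical clickstream observations for the variants Google groups together.
observed = {"car parts": 900, "car part": 80, "cars part": 20}

corrected = disambiguate_volume(135000, observed)
print(corrected)
```

With these invented counts, “cars part” would be credited with only a small fraction of the grouped volume, which is the direction of the correction shown in the table, though the exact figures there come from a more involved model.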

Moving forward

The big question for marketers is now not only how do we respond to losses in data, but how do we prepare for future losses? A quick survey of SEOs revealed some of their future concerns…

Luckily, a blended model of crawled and clickstream data allows Moz to uniquely manage these types of losses. SERP and suggest data are both available through clickstream sources, which piggyback on real searches rather than issuing automated ones. Link data is already available through third-party indexes like MozScape, but can be improved even further with clickstream data that reveals the true popularity of individual links. All that being said, the future looks bright for this new blended data model, and we look forward to delivering upon its promises in the months and years to come.

And finally, a question for you…

As Moz continues to improve upon Keyword Explorer, we want to make that data more easily accessible to you. We hope to soon offer you an API, which will bring this data directly to you and your apps so that you can do more research than ever before. But we need your help in tailoring this API to your needs. If you have a moment, please answer this survey so we can piece together something that provides just what you need.

