Since launching my course on Udemy, I’ve received a lot of inquiries about the best ways to prevent referral spam from clogging up your metrics and skewing your data. There’s already plenty of content on this topic circulating the web, so while I’m surprised to hear it remains a top issue, I also understand that spambots are getting smarter and that keeping your data accurate takes a certain level of ongoing maintenance. The good news: while it’s best to implement these solutions as early as possible for a new Google Analytics instance, it’s technically never too late to put them in place and segment data retroactively for a more accurate historical picture of your site’s performance.
I’ve formatted this article to walk through each type of referral spam: what it is, how to identify it, and how to prevent it, ending with a guide to updating your data retroactively.
What is ghost spam?
Ghost spam is spam where the bot or crawler never actually accesses your site. That’s good news for you, because there’s no script being run against your web server; it’s the least invasive form of spam we’ll encounter. Rather than visiting your site directly, ghost spam bots either randomly generate Google Analytics tracking IDs (you know, the ones prefixed with UA-xxx) or use a web crawler to scrape IDs from your site’s source code, then send fake referrer data to Google using the analytics Measurement Protocol. The Measurement Protocol was built primarily to help developers tie offline events to online data and to support the intake of data from third-party sources. The downside of that flexibility is that a group of people will inevitably take advantage of it.
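To make the mechanism concrete, here’s a minimal sketch of what a Measurement Protocol pageview hit looks like, using only Python’s standard library. The tracking ID and referrer below are placeholders, not real values; ghost spammers construct payloads like this and send them straight to Google’s collection endpoint without ever touching your server.

```python
from urllib.parse import urlencode

# A Measurement Protocol "pageview" hit is just a set of URL-encoded
# parameters sent to Google's collection endpoint. Ghost spammers send
# hits like this directly; no visit to your site ever occurs.
payload = {
    "v": "1",                             # protocol version
    "tid": "UA-00000000-1",               # tracking ID (placeholder; spammers guess or scrape these)
    "cid": "555",                         # anonymous client ID
    "t": "pageview",                      # hit type
    "dr": "http://spam-domain.example/",  # referrer: what shows up in your reports
    "dp": "/",                            # document path
}
# Note what's missing: "dh" (document hostname) and "sr" (screen
# resolution). Lazy spammers omit them, which is exactly why those
# dimensions show up as (not set) in your reports.

endpoint = "https://www.google-analytics.com/collect"
body = urlencode(payload)
print(body)
```

Because the spammer controls every field, the referrer can be anything they want you to see, and Google has no way to tie the hit back to a real pageview.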
How to identify ghost spam
Fortunately for us, identifying ghost spam is fairly straightforward. When ghost spam appears in your analytics reports, at least one of two conditions will usually be true: the hit will have no hostname value set (or a hostname that looks like spam, e.g. traffic2cash, copyrightclaims.org), or the bot will have forgotten to set a value for screen resolution.
You can see what I mean by the presence of (not set) in each of the examples above for hostname and screen resolution. You may have noticed that in the hostname example, the first entry in the list shows a hostname other than (not set). This is an instance where a hostname value was set, just not a valid one. We’ll clean these examples up as well. Now that we know how to identify ghost spam, let’s look at what we can do to prevent it.
How to prevent ghost spam
Preventing ghost spam takes just a few steps and requires us to implement a new filter. Before continuing, make sure you have an unmodified view with no filters applied. Always keep a backup view with no modifications that serves as a catchall for all site data; this is our safeguard against data loss of any sort. Only move forward once you have a separate view you’re ready to modify.
Create the filter per the screenshot below. Rather than individually excluding each spammy domain that shows up in the “Source” field from the examples above, or excluding hits where “Hostname” or “Screen Resolution” equals (not set), we’re going to write an include filter where the hostname matches any of our valid domains. This lets the solution work without any ongoing maintenance, since the likelihood that our domain names will change is slim. Keep in mind that if you choose to add subdomains as your site grows, you’ll want to update the filter below to capture those new hostnames.
In my example, I only have one root domain to worry about. However, larger sites may have additional subdomains to account for. In these scenarios, change your Filter Type to Custom and use a regular expression in the Hostname field as follows.
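As a sketch of what such an expression might look like, here’s a pattern covering my root domain plus a hypothetical blog subdomain (swap in your own hostnames), tested with Python’s re module:

```python
import re

# Hypothetical include-filter pattern: matches the bare root domain
# plus the www and blog subdomains, and nothing else. GA applies the
# pattern to the Hostname dimension of each hit.
valid_hostname = re.compile(r"^(www\.|blog\.)?chrisboulas\.com$")

for hostname in ["chrisboulas.com", "www.chrisboulas.com",
                 "blog.chrisboulas.com", "traffic2cash.xyz", "(not set)"]:
    verdict = "include" if valid_hostname.match(hostname) else "exclude"
    print(f"{hostname}: {verdict}")
```

The anchors (^ and $) matter: without them, a spammer could set a hostname like chrisboulas.com.spam-domain.example and slip past the filter.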
What is crawler spam?
By this point you’ve hopefully learned something new and have already blocked a good portion of referral spam to your site. Now we need to account for the rest, which shows up in our reports under a different type of referral spam called crawler spam. The key difference is that crawler spam actually does access your site, whereas ghost spam doesn’t. This type of spam is less common because building crawlers takes far more resources, but the intent is the same: to get their domains indexed, lure you in, and eventually sell you something. Crawler spam bots also go through this effort hoping to earn a link back to their site, pushing their spam domains higher in Google’s index.
How to identify crawler spam
Since crawler spam actually accesses your site, the hostname looks valid. That is, it won’t appear as (not set) the way ghost spam does, so there’s no easy giveaway. On my site, crawler spam shows a valid hostname of chrisboulas.com, further disguising itself as legitimate traffic, but we know better.
Row #13 in the image above illustrates this. Notice that top1-seo-service.com appears to be a spammy domain, yet carries a valid chrisboulas.com hostname. In this case we can easily tell that top1-seo-service is spam based on the name, but oftentimes it’s hard to distinguish by name alone.
How to prevent crawler spam
Step 1: This is the easiest step. In your view settings, enable “exclude all hits from known bots and spiders.”
Step 2: Create a new exclude filter per the screenshot below. The filter pattern I’m using is (best|100dollars|success|top1)\-seo|(videos|buttons)\-for|anticrawler|^scripted\.|\-gratis|semalt|forum69|7makemoney|sharebutton|ranksonic|sitevaluation|dailyrank|vitaly|video\-|profit\.xyz|rankings\-|dbutton|\-crew|uptime(bot|check)
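If you want to sanity-check the pattern before saving the filter, you can exercise it against a few known-spam campaign sources with Python’s re module (GA filter patterns are case-insensitive, so we compile with re.IGNORECASE):

```python
import re

# The campaign-source exclude pattern from Step 2, split across
# string literals purely for readability.
spam_pattern = re.compile(
    r"(best|100dollars|success|top1)\-seo|(videos|buttons)\-for|"
    r"anticrawler|^scripted\.|\-gratis|semalt|forum69|7makemoney|"
    r"sharebutton|ranksonic|sitevaluation|dailyrank|vitaly|video\-|"
    r"profit\.xyz|rankings\-|dbutton|\-crew|uptime(bot|check)",
    re.IGNORECASE,
)

for source in ["top1-seo-service.com", "semalt.com", "uptimebot.net", "google"]:
    verdict = "spam" if spam_pattern.search(source) else "ok"
    print(f"{source}: {verdict}")
```

A legitimate source like google sails through untouched, while the known crawler domains are caught by one alternative or another.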
In the filter verification table, note the before and after comparison. My “before” data is low because I had been previously filtering out crawler spam.
Step 3: No action is required here, but this is technically an ongoing step. As you monitor and report on your data, keep an eye out for new spammy referral sources appearing in your reports; you’ll need to modify the regular expression from the previous step accordingly.
Updating data retroactively
By this point, we’ve successfully blocked 99% of future referral spam, but what about the spam that already appears in our reports? Fortunately, we can clean up bad historical data using Google’s advanced segments functionality.
In the above example, we’re using one segment to exclude both ghost and crawler spam types. You’ll notice the conditions used in the segment almost directly mirror those of the individual filters we created. Keep in mind that by default, advanced segments you create will be available on all views under the property, so don’t be scared if you navigate back to your unmodified “all data” profile and see that your traffic dropped substantially. Just check to make sure this segment isn’t turned on for that view.
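The segment’s logic can be sketched in a few lines: a hit is kept only if its hostname passes the include filter and its source doesn’t match the crawler-spam pattern. The domain below is mine and the spam pattern is abridged to a few alternatives for brevity; both stand in for whatever your actual filters use.

```python
import re

# Abridged versions of the two filter conditions the segment mirrors.
VALID_HOSTNAME = re.compile(r"^(www\.)?chrisboulas\.com$")
SPAM_SOURCE = re.compile(r"semalt|top1\-seo|uptime(bot|check)", re.IGNORECASE)

def keep_hit(hostname: str, source: str) -> bool:
    """Keep a hit only if it passes the hostname include filter
    AND the campaign-source exclude filter."""
    return bool(VALID_HOSTNAME.match(hostname)) and not SPAM_SOURCE.search(source)

print(keep_hit("chrisboulas.com", "google"))      # legitimate traffic
print(keep_hit("(not set)", "google"))            # ghost spam: bad hostname
print(keep_hit("chrisboulas.com", "semalt.com"))  # crawler spam: bad source
```

This is the same AND-of-conditions shape you build in the segment UI, which is why the segment catches both spam types with a single definition.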
The solutions I’ve covered here will stop 99% of all ghost and crawler spam, but if you’re a perfectionist like me, there are some additional preventative measures we can take. They’re more technical in nature and require access to your web server (or a developer who has it), so for those of you who want to block 99.9% of referral spam, I’ll do a part 2 on the topic and walk through those methods.