You’ve probably heard the phrase “duplicate content” being thrown around from time to time, and like any savvy webmaster, you’d never dare to publish the same content twice — but have you?
Duplicate content is the equivalent of overdrawing your checking account, except instead of paying costly fees each month you’ll be sacrificing your precious crawl budget. Manifesting itself in several forms, duplicate content may be one of the most elusive and widely overlooked problems that can affect your site’s ability to rank. It often stems from a site’s information architecture or CMS limitations, which likely means it wasn’t deliberate.
Unfortunately, there is no simple check in Google Search Console that will flag this issue for you. Even the most advanced third-party tools don’t always do a good job of finding duplicate content — especially when the source is internal.
Here are eight potential sources of duplicate content that could be affecting your site:
1. HTTP and HTTPS URLs
One of the quickest ways to check if your site has two live versions being indexed is to try to visit the site using both the HTTP and HTTPS protocols. If both load without one redirecting to the other, don’t be alarmed just yet. It’s likely your developer switched the site over to HTTPS and neglected to 301 redirect the HTTP version.
Similarly, before Google incentivized webmasters to make their sites fully HTTPS, many sites chose to implement HTTPS only on select pages that needed the added security – such as login and checkout pages. If the developer chose to use a relative linking structure, every relative link on a secure page inherited the HTTPS protocol, so any crawler that landed on one of those pages would follow the rest of the site’s links over HTTPS – ultimately creating two versions of the site.
Likewise, ensure your site doesn’t have both a www and non-www version. You can fix this problem by implementing 301 redirects and specifying your preferred domain in Google Search Console.
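To make the check concrete, here is a minimal sketch that lists the four protocol and www combinations of a page and works out where a 301 should send each one. It assumes the HTTPS www version is the preferred one (the `PREFERRED` value and both function names are illustrative, not part of any tool), and it only builds URLs; it does not request them.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the HTTPS + www version is the one you want indexed.
PREFERRED = "https://www.bestrecipes.com"


def site_variants(url):
    """Return the four scheme/host combinations that could serve this URL."""
    parts = urlsplit(url)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("http", "https"):
        for candidate in (bare, "www." + bare):
            variants.append(
                urlunsplit((scheme, candidate, parts.path, parts.query, ""))
            )
    return variants


def redirect_target(url, preferred=PREFERRED):
    """Return the URL a 301 should point to, or None if url is already preferred."""
    pref = urlsplit(preferred)
    parts = urlsplit(url)
    target = urlunsplit((pref.scheme, pref.netloc, parts.path, parts.query, ""))
    return None if target == url else target


for variant in site_variants("https://www.bestrecipes.com/chocolate-cakes"):
    print(variant, "->", redirect_target(variant))
```

Three of the four variants should report a redirect target; if any of them loads a page in your browser instead, that version is a candidate for a 301.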
2. Sneaky scraper sites
While there are no internet police to help you reclaim stolen property, there are ways you can code your site that make life harder for scrapers trying to pawn off your content as their own. As mentioned above, always use absolute URLs instead of relative URLs:
- Absolute URL: https://www.bestrecipes.com/chocolate-cakes
- Relative URL: /chocolate-cakes
Why is this so important? When you use relative URLs, your browser assumes that the link is pointing to a page that’s on the same site you’re already on. If a scraper copies your markup wholesale, those relative links now resolve to the scraper’s domain, while absolute links keep pointing back to you. As you might know, it’s never a good idea to let Google assume (think those terrible sitelinks that make no sense). Some developers favor relative URLs because they simplify the coding process.
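You can see the difference with a few lines of Python. This is only an illustration of how link resolution works: `urljoin` is the standard-library function for resolving a link against the page it appears on, and the scraper address below is a made-up placeholder.

```python
from urllib.parse import urljoin

relative_link = "/chocolate-cakes"
absolute_link = "https://www.bestrecipes.com/chocolate-cakes"

original_page = "https://www.bestrecipes.com/desserts"
scraped_page = "https://scraper.example/stolen/desserts"  # hypothetical scraper copy


def resolve(page_url, href):
    """Resolve a link the way a browser or crawler would on the given page."""
    return urljoin(page_url, href)


# On your own site both links land on the same page.
print(resolve(original_page, relative_link))
print(resolve(original_page, absolute_link))

# On the scraped copy, only the absolute link still points back to you.
print(resolve(scraped_page, relative_link))
print(resolve(scraped_page, absolute_link))
```

The same mechanism explains the HTTP and HTTPS problem from the first section: a relative link inherits the protocol of whatever page it sits on.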
If your developer isn’t willing to re-code the entire site, implement self-referencing canonical tags. When a scraper pastes your content on their new site, the canonical tags will sometimes stay in place, allowing Google to know your site is the content’s original source.
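A self-referencing canonical tag is just a link element in the page head that names the page’s own preferred URL. As a rough sketch, a template helper might build it like this (the function name is an assumption; your CMS probably has its own setting or plugin for this):

```python
from html import escape


def canonical_tag(page_url):
    """Build a canonical link element pointing at the page's preferred URL."""
    return '<link rel="canonical" href="{}">'.format(escape(page_url, quote=True))


print(canonical_tag("https://www.bestrecipes.com/chocolate-cakes"))
```

Note that the URL passed in should be absolute, for the same reason absolute links are preferred in the body of the page.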
In order to tell if you’ve been scraped, try using free tools such as Siteliner or Copyscape.
3. Long lost subdomains
So you abandoned your subdomain and chose to use a subdirectory instead. Or maybe you created an entirely new site. Either way, your old abandoned content could still be alive and well – and will likely come back to haunt you. It’s best to 301 redirect a discontinued subdomain to your new site. This is especially important if your old site has a large number of backlinks.
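When the old subdomain maps cleanly onto a new subdirectory, the redirect rule is mostly a rewrite of the host and path. The sketch below assumes a hypothetical blog.bestrecipes.com subdomain that moved to www.bestrecipes.com/blog; the hostnames, prefix and function name are placeholders for whatever your own migration looks like.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only.
OLD_HOST = "blog.bestrecipes.com"
NEW_HOST = "www.bestrecipes.com"
NEW_PREFIX = "/blog"


def new_location(old_url):
    """Return the 301 target for a URL on the retired subdomain, else None."""
    parts = urlsplit(old_url)
    if parts.netloc.lower() != OLD_HOST:
        return None  # not on the old subdomain, nothing to redirect
    path = NEW_PREFIX + (parts.path or "/")
    return urlunsplit(("https", NEW_HOST, path, parts.query, ""))


print(new_location("http://blog.bestrecipes.com/chocolate-cake"))
```

Redirecting each old page to its closest new equivalent, rather than sending everything to the homepage, keeps those backlinks pointing at relevant content.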
4. The “secret” staging environment
Coding a new site design? Preparing your site for the big reveal? If you haven’t blocked Google’s crawlers from doing so, Google may have decided to take a sneak peek.
It’s a common misconception that since no one would ever type staging.yoursite.com, it’s off limits. Wrong! Google is constantly crawling and indexing the web, including your staging environment. This can muddy up your search results and cause confusion for users.
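Not only is this a huge no-no in terms of site privacy and security, but allowing Google to crawl unnecessarily can also take a serious toll on your crawl budget. Keep it simple: apply a noindex tag to the entire staging environment, and block staging in the robots.txt file. Keep in mind that a crawler blocked by robots.txt can’t read the noindex tag, so if staging pages are already indexed, let the noindex do its work before adding the block. No peeking.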
Remember, though — when you move from the staging environment to the live site, DO NOT forget to remove these blocking commands from the code!
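One way to make the launch-day cleanup harder to forget is to decide by hostname rather than hard-coding the block into templates. The following is a sketch under that assumption; the function names are hypothetical and the hostname is the one from the example above.

```python
# Assumption: staging is identified purely by its hostname.
STAGING_HOSTS = {"staging.yoursite.com"}


def robots_meta(host):
    """Return the robots meta tag for this host, or an empty string on live."""
    if host.lower() in STAGING_HOSTS:
        return '<meta name="robots" content="noindex">'
    return ""


def robots_txt(host):
    """Return the robots.txt body to serve for this host."""
    if host.lower() in STAGING_HOSTS:
        return "User-agent: *\nDisallow: /\n"  # block everything on staging
    return "User-agent: *\nDisallow:\n"  # an empty Disallow allows everything


print(robots_meta("staging.yoursite.com"))
print(robots_txt("www.yoursite.com"))
```

Because the live hostname never matches, the same code can ship to production without carrying the staging block along with it.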
5. Dynamically generated parameters
Most often generated by a faceted navigation setup that allows you to “stack” modifiers, this is one of those issues that may stem from your site’s architecture. So what exactly do dynamically generated parameters look like?
- URL 1: www.bestrecipes.com/chocolate-recipes/cake/custom_vanilla
- URL 2: www.bestrecipes.com/chocolate-recipes/cake/custom_vanilla%8in
- URL 3: www.bestrecipes.com/chocolate-recipes/cake/custom_vanilla%8in=marble
This is a simplified example; however, your CMS may be appending multiple parameters and generating unnecessarily long URL strings, all of which are fair game for Google to crawl.
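To get a feel for how many of those URLs are really the same page, you can normalize them and see what they collapse to. The sketch below assumes the modifiers arrive as ordinary query-string parameters (unlike the simplified URLs above) and that only a hypothetical `page` parameter changes the content; both are assumptions to adjust for your own CMS.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only these parameters produce genuinely different content.
KEEP = {"page"}


def canonical_url(url, keep=KEEP):
    """Drop parameters that don't change the content and sort the rest."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    params.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


print(canonical_url(
    "https://www.bestrecipes.com/chocolate-recipes/cake?size=8in&flavor=marble"
))
```

Running a crawl export through a function like this and counting how many URLs share each result gives a rough measure of how much duplication the parameters create.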
On the flipside, Google can take it upon itself to crawl through the faceted navigation to create and index endless URL combinations that no user has requested.
In either scenario, apply a canonical tag to the preferred URL and set up parameter controls in Google Search Console. You can take this one step further and block certain URLs in robots.txt using a wildcard (*) to keep crawlers out of anything that comes after a specified subdirectory.
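A wildcard rule of that kind might look like the hypothetical one embedded below. The surrounding Python is a simplified matcher so you can test which paths a rule would catch before you publish it; it handles only `*` and a trailing `$`, and it ignores user-agent groups and Allow rules, so treat it as a rough check rather than a faithful copy of how any search engine parses robots.txt.

```python
import re

# Hypothetical rule: block everything beneath the cake subdirectory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /chocolate-recipes/cake/*
"""


def disallow_patterns(robots_txt):
    """Collect every non-empty Disallow value, ignoring user-agent groups."""
    patterns = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            patterns.append(value.strip())
    return patterns


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, robots_txt=ROBOTS_TXT):
    """Return True if any Disallow pattern matches the start of the path."""
    return any(
        pattern_to_regex(pattern).match(path)
        for pattern in disallow_patterns(robots_txt)
    )


print(is_blocked("/chocolate-recipes/cake/custom_vanilla"))
print(is_blocked("/chocolate-recipes/cake"))
```

Checking a handful of real paths this way helps you avoid accidentally blocking the subdirectory page itself along with its stacked variations.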
6. Mirrored subdirectories
Does your business operate in two or more geographic locations? Some businesses prefer to have a main landing page which allows users to click the location most applicable to them and then directs them to the appropriate subdirectory. For example:
- URL 1: www.wonderfullywhisked.com/fr
- URL 2: www.wonderfullywhisked.com/de
This might seem logical, but evaluate whether there is truly a need for this setup. You may be targeting different audiences, but if both subdirectories mimic each other in terms of product selection and content, the lines start to blur. To tackle this issue, head to Google Search Console and set up location targeting.
7. Syndicated content
Syndication is a great way to get your content in front of a fresh audience; however, it’s important to set guidelines for those who want to publish your content.
Ideally, you would request that the publisher use the rel=canonical tag on the article page to indicate to search engines that your website is the original source of the content. They could also noindex the syndicated content, which would solve potential issues with duplicate content in search results.
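If you syndicate regularly, it can be worth spot-checking that partners actually followed through. Here is a small sketch that inspects a page’s HTML for either a canonical tag pointing at your original or a noindex robots meta tag. The class and function names are made up for illustration, and fetching the partner’s page is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the canonical URL and noindex status found in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.noindex = "noindex" in (attrs.get("content") or "").lower()


def syndication_ok(html, original_url):
    """True if the page canonicalizes to the original or is set to noindex."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical == original_url or finder.noindex


sample = (
    '<html><head><link rel="canonical" '
    'href="https://www.bestrecipes.com/chocolate-cakes"></head></html>'
)
print(syndication_ok(sample, "https://www.bestrecipes.com/chocolate-cakes"))
```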
At the very least, publishers should be linking back to the original article on your website as a means of attribution.
8. Similar content
This may be the least of your worries when it comes to true duplicate content – however, Google’s definition of duplicate content does include content that is “appreciably similar.”
Though two pieces of content may vary in syntax, a general rule of thumb is that if you can gather the same information from both articles, there is no real reason for both to exist on your website. A canonical tag is a great option here, or consider consolidating your content pieces.
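There is no public formula for what counts as “appreciably similar,” but a quick overlap score can help you shortlist pages worth a manual look. The sketch below compares three-word sequences (often called shingles) between two texts using Jaccard similarity. This is a common text-comparison technique chosen for illustration, not Google’s method, and any cutoff you pick is your own judgment call.

```python
import re


def shingles(text, size=3):
    """Return the set of consecutive word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a or not set_b:
        return 0.0  # too short to compare
    return len(set_a & set_b) / len(set_a | set_b)


first = "Bake the chocolate cake for thirty minutes"
second = "Bake the chocolate cake for forty minutes"
print(round(similarity(first, second), 2))
```

Pairs of articles that score high are good candidates for the canonical tag or consolidation options mentioned above.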
It’s important to get a handle on duplicate content issues in order to avoid depleting your crawl budget, which could prevent new pages from getting crawled and indexed. Some of the best tools in your arsenal include canonical tags, 301 redirects, nofollow/noindex tags and parameter controls. Work to reduce duplicate content by adding these quick checks to your monthly SEO maintenance routine.
Some opinions expressed in this article may be those of a guest author and not necessarily Search Engine Land.