Three Canonicalization Problems Fixed with .htaccess

Given the detrimental effect duplicate content can have on search engine rankings, it is surprising how common canonicalization issues are, even on major websites. Fortunately, many of these problems are easily solved by using redirects to enforce a single URL for each piece of unique content. In this article, we look at three common canonicalization issues and how they can be fixed using .htaccess and 301 redirects.

For any of this to work you’ll need an Apache server with mod_rewrite enabled. To get started, add the following to the top of your .htaccess file. If this results in a 500 server error, your host probably doesn’t support mod_rewrite or doesn’t allow the Options directive to be overridden in .htaccess, and there is likely not much you can do about it short of asking the host to change the server configuration.

Options +FollowSymLinks
RewriteEngine On
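
If you would rather the site keep working when mod_rewrite is missing, one option is to wrap the rules in an IfModule block. This is only a sketch of that approach: the block is skipped when the module is not loaded, which avoids the 500 error but also means the redirects silently do nothing.

```apache
# Sketch: apply the rewrite setup only if mod_rewrite is loaded.
# Without the module the whole block is ignored, so no redirects are issued.
<IfModule mod_rewrite.c>
    Options +FollowSymLinks
    RewriteEngine On
    # canonicalization rules from the sections below would go here
</IfModule>
```

The trade-off is visibility: a 500 error tells you immediately that something is wrong, whereas a skipped block leaves the duplicate URLs in place without any warning.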

Domain Canonicalization

While many web users view www.example.com and example.com as functionally equivalent addresses, search engines cannot make such an assumption since, technically, the URLs are different and a webserver could return different content for each. In reality, hosting packages often automatically create a ‘www’ subdomain to serve the same content as the main domain, leading to many pages appearing twice in search results – both with and without the ‘www’ prefix.

The simple fix for this is to pick one hostname to use as the canonical domain and redirect any requests for other hostnames to that one:

RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
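
The same approach works if you prefer the bare domain as the canonical version. The following is a minimal sketch of that variant, with example.com standing in for your own domain:

```apache
# Sketch: treat example.com (no www) as canonical.
# Any request whose Host header does not start with example.com is redirected.
RewriteCond %{HTTP_HOST} !^example\.com [NC]
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]
```

Whichever version you choose, use only one of the two rule sets. Enabling both would redirect each hostname to the other in an endless loop.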

HTTP/HTTPS Canonicalization

Websites handling sensitive information (e.g. ecommerce) will often need to make use of secure connections via the https protocol. Typically, this is only required for certain parts of the site, such as those accessed via a login, but links to the https versions of normal pages often creep into search results.

The following example ensures that any requests for paths beginning with ‘/checkout/’ will be redirected to https if they are not using the secure protocol, while attempts to access any other part of the site via https will be redirected to their non-https equivalents.

# redirect non-https requests for /checkout to https
RewriteCond %{HTTPS} off
RewriteRule ^checkout/ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
# redirect all other https requests to http
RewriteCond %{HTTPS} on
RewriteCond $1 !^checkout/
RewriteRule ^(.*) http://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
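
If more than one area of the site needs to stay on https, the path patterns can be extended with an alternation. The sketch below assumes a second secure area under /account/, which is a hypothetical path used purely for illustration:

```apache
# Sketch: keep both /checkout/ and the hypothetical /account/ on https
RewriteCond %{HTTPS} off
RewriteRule ^(checkout|account)/ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
# redirect all other https requests to http
RewriteCond %{HTTPS} on
RewriteCond $1 !^(checkout|account)/
RewriteRule ^(.*) http://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```

The list of paths must be identical in both rules. A path that appears only in the first rule would be sent to https and then straight back to http.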

Directory Index Canonicalization

Apache allows us to specify a list of files that will be returned when a directory is requested (i.e. a URL ending with a slash), such as http://www.example.com/. Typically, index.html or index.php is used. However, direct access to these files is not prevented, and careless creation of links can easily lead to http://www.example.com/ and http://www.example.com/index.php appearing in search results separately.

The following rule ensures that any requests ending in /index.php will be redirected to the parent directory.

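# match against the original request line rather than the rewritten URL
# (avoids an infinite loop when Apache internally maps / to index.php)
# spaces and dots are escaped; (.*/)? stops files such as myindex.php matching
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*/)?index\.php\ HTTP/
RewriteRule ^(.*/)?index\.php$ http://%{HTTP_HOST}/$1 [R=301,L]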
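
Sites that use a mix of index files can cover them with a single rule. The following sketch assumes the directory index may be index.php, index.html or index.htm, and should be adjusted to match the files your server is actually configured to use:

```apache
# Sketch: redirect direct requests for index.php, index.html or index.htm
# to the parent directory; html? matches both "html" and "htm"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*/)?index\.(php|html?)\ HTTP/
RewriteRule ^(.*/)?index\.(php|html?)$ http://%{HTTP_HOST}/$1 [R=301,L]
```

As with the single-file version, requests that carry a query string (for example /index.php?page=2) do not match the condition and are left alone.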