X-Robots-Tag: Control Google Indexing via HTTP Headers

We’ve all got used to being able to control how the major search engines index our sites using a combination of robots.txt and the robots meta tag to add attributes like ‘noindex’ to individual pages. While this works great for the pages themselves, it’s not so good for non-HTML, indexable content such as PDFs or embedded media, as we have no HTML <meta> tag in which to insert the meta-information. In this article we take a look at a potential solution to this problem: the X-Robots-Tag HTTP Header.

The X-Robots-Tag Header

The idea behind X-Robots-Tag is that it allows robots directives normally found in a meta element to be sent as part of the server’s HTTP response headers. In other words, the instructions are sent with the file rather than within the file, with the main advantage that this works for any type of content. An HTTP response might look like this (as captured with the Live HTTP Headers plugin for Firefox):

HTTP/1.x 200 OK
Date: Thu, 27 Aug 2009 09:21:23 GMT
Server: Apache/2.0.52 (Red Hat)
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
X-Robots-Tag: noindex
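
To check which directives a response actually carries, the header lines can be scanned with a few lines of PHP. This is a minimal sketch: robotsDirectives() is a hypothetical helper, not part of the article or of PHP itself, and it deliberately ignores user-agent prefixes and date-valued directives.

```php
<?php
// Hypothetical helper (an assumption for illustration): collect the
// X-Robots-Tag directives from a list of raw response header lines.
function robotsDirectives(array $headerLines): array
{
    $directives = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match with stripos.
        if (stripos($line, 'X-Robots-Tag:') !== 0) {
            continue;
        }
        $value = substr($line, strlen('X-Robots-Tag:'));
        foreach (explode(',', $value) as $directive) {
            $directive = strtolower(trim($directive));
            if ($directive !== '') {
                $directives[] = $directive;
            }
        }
    }
    return array_values(array_unique($directives));
}

// Header lines taken from the example response above.
$lines = [
    'HTTP/1.x 200 OK',
    'Content-Type: text/html',
    'X-Robots-Tag: noindex',
];
var_dump(in_array('noindex', robotsDirectives($lines), true)); // bool(true)
```

For a live URL, PHP's built-in get_headers($url) returns header lines in this same list form, so its result could be passed straight to the helper.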

A ‘noindex’ for Images

As an example, consider the case of a webmaster wishing to prevent images being indexed and appearing in search results. One approach might be to add the entire ‘images’ directory to the robots.txt file (i.e. Disallow: /images). The problem with this is that the robots.txt file provides crawling directives rather than indexing directives: the search engine is instructed not to visit the ‘images’ directory when crawling your site, but could still end up at one of your images if it is embedded in someone else’s site. Furthermore, if we only want to block crawlers from certain images, the robots.txt file would quickly become large and unmanageable.

X-Robots-Tag in Apache (htaccess)

A combination of ‘Header set’ (provided by mod_headers, which must be enabled) and the FilesMatch directive allows us to add the robots tag in Apache. Here are a couple of examples which could be placed in Apache’s httpd.conf or .htaccess files.

Add ‘noindex’ header to all image files:


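# Extension list is illustrative; adjust it to the image types you serve
<FilesMatch "\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>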

Add noindex to image files matching a particular pattern, in this case those with ‘thumbnail’ in the filename (e.g. /images/product41-thumbnail.jpg would be served with a noindex header while /images/product41-large.jpg would not):


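# Pattern is illustrative: image files with 'thumbnail' in the filename
<FilesMatch "thumbnail.*\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>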
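
Before deploying a filename pattern it can be worth trying it against a few sample paths. Apache and PHP both use PCRE-style regular expressions, so a quick check in PHP is a reasonable approximation. The pattern below is an assumption chosen to match the ‘thumbnail’ example, not necessarily the one your server uses.

```php
<?php
// Assumed pattern: image files with "thumbnail" in the filename.
$pattern = '/thumbnail.*\.(png|jpe?g|gif)$/i';

$paths = [
    '/images/product41-thumbnail.jpg',
    '/images/product41-large.jpg',
];
foreach ($paths as $path) {
    // FilesMatch tests the filename only, so strip the directory first.
    $matches = preg_match($pattern, basename($path)) === 1;
    echo $path, ' => ', $matches ? 'noindex' : 'no header', PHP_EOL;
}
// Prints:
// /images/product41-thumbnail.jpg => noindex
// /images/product41-large.jpg => no header
```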

X-Robots-Tag in PHP

PHP’s header() function allows us to send any HTTP header, provided it is called before any output has been sent:

header('X-Robots-Tag: noindex,nofollow');

This supports more complex scenarios than using Apache alone. Rather than applying the same rule to all images, we could use some custom logic such as a database lookup to determine whether to add the X-Robots-Tag header. The first step would be to route requests for images to a PHP script using Apache’s .htaccess file (e.g. requests for .jpg files within the images directory will be handled by ‘image-handler.php’).

RewriteEngine On
RewriteRule ^images/.*\.jpg$ image-handler.php

Then in image-handler.php, we can perform our custom logic (defined in the allowImageIndexing() function) and set the appropriate header:

// extract the image filename, ignoring any query string
$filename = basename((string) parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
header('Content-Type: image/jpeg');            // set content type (otherwise it
                                               // will be the default text/html)
if (!allowImageIndexing($filename)) {          // perform lookup
    header('X-Robots-Tag: noindex');           // set the x-robots-tag accordingly
}
readfile('images/' . $filename);               // stream image file in response
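
The article does not show allowImageIndexing() itself, so the following is only a hypothetical stand-in to make the handler concrete. It assumes a hard-coded block list in place of the database lookup, and the filenames in it are invented for illustration.

```php
<?php
// Hypothetical stand-in for allowImageIndexing(); the block list is an
// assumption that replaces a real database lookup.
function allowImageIndexing(string $filename): bool
{
    // Filenames that should be served with "X-Robots-Tag: noindex".
    $blocked = ['product41-thumbnail.jpg', 'internal-banner.jpg'];
    return !in_array(strtolower($filename), $blocked, true);
}

var_dump(allowImageIndexing('product41-large.jpg'));     // bool(true)
var_dump(allowImageIndexing('product41-thumbnail.jpg')); // bool(false)
```

In a real deployment the array would probably be replaced by a query against whatever store records per-image indexing preferences.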

X-Robots-Tag Search Engine Support

Fortunately the three major search engines all now support the X-Robots-Tag