PDF Ranking test results

On 26 October I started a test to monitor how search engines crawl and index PDF documents.

The aim of the test was to answer the following questions about PDF documents:

  • Do Microsoft Word® headers (normally used in a professional document) make a difference?
  • How much does the keyword density (KD) of the document affect its ranking?
  • How much do the document properties (Title, Author, Comments and Keywords) influence the indexing?

After just one week, Google started to show some tangible results. The other search engines are still looking around; only Ask returns a couple of results, but nothing worthwhile to report.

The purpose of this post is to highlight how the rankings of the different PDF documents (13 in total) fluctuated day by day.

These are the first, and most interesting, results I collected over the past couple of weeks: the main documents returned by the SERP for the URK “seiunamicone”.

google-serp

These are accompanied by the hidden results (in a drop-down menu):

rest-of-the-serp

I’ve been monitoring the SERPs for a while, and apart from the first couple of days, when there were constant fluctuations, the results seem to have stabilized.

Showing a lot of screenshots would probably have been confusing; however, I can assure you that many changes took place, and I presume that even more will occur over the next couple of days.

Just to give you an example not shown in the picture above: on 3 November the third result was a document called PDF-test-without-headers-KD43.pdf (my test no. 11). It is quite difficult to determine why this document was ranking at that specific time, which is why I included a graph tracking the detailed changes in the SERPs.

This is the full graph:

full-chart

And this is a graph of only the documents that appeared in the SERP during the period in which I monitored it:

PDF-ranked

Let’s analyze it together, but first let me remind you of a few things about the documents I generated. I assumed a KWD of the URK “seiunamicone” split between the page (42%) and the document properties (56%), and I call headers “fake” when the H1 and H2 were created using plain emphasis instead of Word styles.

The first PDF to be indexed was a document called Test 7 (PDF-test-without-header2-KD100.pdf). This document contains an H1 made using Word styles and a fake H2 (just emphasized text), with a KD of 100%. After just a few days, this document was dropped completely from the Google SERP. Today it is in the index but ranks nowhere.

A comfortable result for test number 5 (28% KD and one header), whilst tests 3, 10 and 13, for example, were not indexed at all.

If we analyze only the first three results (the first one and its grouped results) plus the first result shown when expanding the hidden results, we get the following picture.

Positive results were collected for test number 12, which has always been present in the SERP and has now been stable in position 1 for about one week; for test number 1.1, which fluctuated somewhat but is now stable in position 2; and finally for test 11, which, apart from disappearing on some days, has always been in position 3.
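One way to make a statement like "stable in position 1 for about one week" measurable is to count how many consecutive days a document has held its latest position. The sketch below assumes the daily positions are recorded as a list, oldest first, with None for days on which the document did not appear. The function name `stable_streak` and the sample values in the usage note are made up for illustration; they are not the data from this test.

```python
def stable_streak(positions):
    """Count the trailing days on which the position equals the latest one.

    `positions` is a list of daily SERP positions, oldest first;
    None marks a day on which the document was not listed.
    """
    if not positions or positions[-1] is None:
        # No history, or the document is currently missing from the SERP.
        return 0
    latest = positions[-1]
    streak = 0
    # Walk backwards from the most recent day until the position changes.
    for position in reversed(positions):
        if position != latest:
            break
        streak += 1
    return streak
```

For example, `stable_streak([3, None, 2, 1, 1, 1])` returns 3: the hypothetical document has held position 1 for the last three days.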

So what makes the difference for these documents?

I am almost sure Google is able to interpret the RTF code contained in a PDF document (most probably by doing a sort of reverse engineering). This sounds like a strong assertion (and maybe it is, so please take it just as my personal opinion), but it is the only explanation I was able to find when I asked myself the question “Why these?”

Analyzing the SERPs, I saw that, after the KD factor, the headers have their own importance. So, if I had to answer the question

What are the predominant factors that influence the indexing of a PDF in Google?

based on the test results I collected over the past weeks, today I would probably answer with the following points:

  1. Document properties usage. Add the keyword(s) to the document properties (Title, Subject and Keywords; Comments are ignored. We can even use the Author field, but it looks like it is used for different purposes, doesn’t it?)
  2. Keyword density. A sufficient number of keywords in strategic parts of the document, as with HTML pages, results in a better-optimized document, especially when headers are used. But we should remember another important aspect: document length and size. Beyond a certain size (100k), the text is not crawled.
  3. Header usage. Inserting keywords into header 1 (made with Word styles, not by emphasizing the text) boosts the document and helps it get better indexing. Also using an H2 sounds good, but during the tests I noticed that using both of them brings no extra advantage.
  4. Keyword proximity. Whenever headers are not used, keyword proximity plays an important role.
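
The post does not say how keyword density or proximity were measured, so the following is only a minimal sketch under common assumptions: density is the percentage of words in a text that match the keyword, and proximity is the smallest distance, in words, between two given terms. The function names and the tokenization rule are illustrative choices, not part of the original test.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, keyword):
    """Return the percentage of words in `text` equal to `keyword` (assumed definition)."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return 100.0 * hits / len(words)


def keyword_proximity(text, first, second):
    """Return the smallest distance in words between `first` and `second`.

    Returns None when either term does not occur in the text.
    """
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)
```

For instance, `keyword_density("seiunamicone test seiunamicone document", "seiunamicone")` gives 50.0, because two of the four words match. A real measurement would first need the text extracted from the PDF, which this sketch leaves out.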