What are the best practices for PDF optimization?

MATT CUTTS: Today's question comes from San Francisco, California.

The question is, what are the best practices around PDF and other document optimization, and when does Google choose to show these files over other web pages?

That's a really fun question for a couple of reasons. So you can think about PDFs specifically.

And there's not that much to do in terms of optimization. For one thing, I'd make sure
that it's actually text, because you can have a PDF that's primarily composed of images.

And we might be able to OCR over time.

But really, if you have text in that document, it's a lot easier for us to index.

You want to make sure that you choose good titles.

You probably don't want to just have massive numbers of PDFs, if it's all like shovelware, like you're just auto-generating content.

And you're just throwing it up there.

But there's a more interesting question underneath this question to me, which is, how do you rank web pages versus PDF documents?

PDF documents tend to be longer.

People are a little less likely to link to them, maybe because it can be a jarring experience to click on a link, and you immediately get thrown into a PDF reader piece of software.

And so you really do have apples and oranges.

And Google's philosophy is to try to determine, as best we can, what's the utility of the next result?

Is the user better served by returning a PDF?

Or are they better served by returning a web document?

And it's a really hard problem.

Fundamentally, these are different data types.

One might be a book in PDF, and one might be a 400-word web page.

And trying to figure out what's the relative utility of those is really, really difficult.

Different people will disagree.

Different search engines will have different philosophies.

We essentially try to say, given what we know about the user, given everything else, given all the relevant signals that we have, try to make our best guess about, OK, the next most useful thing will be a PDF versus a web page.

It's an imperfect science.

It's much more of an art than a science, because different people will have different philosophies.

Some people don't like to get PDFs.

And then some PDFs have nothing more than a few matches, and it's a book-length thing.

And so having a few matches in a PDF might not be as helpful as having the same number of matches in a web document.
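
To make the point about a few matches in a long document concrete, here is a minimal Python sketch. It is only an illustration of the arithmetic, not a description of how Google scores documents. The function name `match_density`, the match counts, and the 80,000-word book length are all assumptions; only the 400-word web page comes from the example above.

```python
def match_density(match_count: int, word_count: int) -> float:
    """Return query matches per 1,000 words (a hypothetical, illustrative measure)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000 * match_count / word_count


# Assumed figures: the same three matches in a short page and in a long book.
web_page = match_density(match_count=3, word_count=400)
book_pdf = match_density(match_count=3, word_count=80_000)

# The short page is far denser in matches than the book-length PDF.
print(f"web page: {web_page} matches per 1,000 words")
print(f"book PDF: {book_pdf} matches per 1,000 words")
```

Under these assumed numbers, the same three matches are 200 times more concentrated in the web page than in the book-length PDF, which is one way to see why identical match counts can mean very different things.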

It's fundamentally a hard problem.

But we do our best to try to say, given these different types of media, given these different types of documents, what's the best match for the user?

What's going to give them the best value and help them out the most in terms of their information need?