A Peek Inside Google’s Algorithm

Ever wish you could have a peek inside Google’s algorithm? Even just a little peek? Saul Hansell of the New York Times gave us all that peek in his article yesterday, Google Keeps Tweaking Its Search Engine. The entire article is worth a read or three, but here are some quotes I pulled along with a running commentary.

Matt Cutts has mentioned on several occasions that both major and minor changes are being made to the algorithm, but did anyone expect that

the search-quality team makes about a half-dozen major and minor changes a week to the vast nest of mathematical formulas that power the search engine.

Of course those changes don’t always make everyone happy.

As always, tweaking and quality control involve a balancing act. “You make a change, and it affects some queries positively and others negatively,” Mr. (Udi) Manber says. “You can’t only launch things that are 100 percent positive.”

Every day it seems like Google is finding a new way to collect user data and measure traffic patterns in an effort to better understand your intent when you type a query. All the data seems to be helping them do just that.

These formulas have grown better at reading the minds of users to interpret a very short query. Are the users looking for a job, a purchase or a fact? The formulas can tell that people who type “apples” are likely to be thinking about fruit, while those who type “Apple” are mulling computers or iPods. They can even compensate for vaguely worded queries or outright mistakes.

“Search over the last few years has moved from ‘Give me what I typed’ to ‘Give me what I want,’ ” says Mr. (Amit) Singhal

The article gives an interesting look into how some of the algorithm tweaks come about. All Google employees can use its “buganizer” system to report an issue with a search. In one particular case,

Bill Brougher, a Google product manager, complained that typing the phrase “teak patio Palo Alto” didn’t return a local store called the Teak Patio.

So Mr. Singhal fired up one of Google’s prized and closely guarded internal programs, called Debug, which shows how its computers evaluate each query and each Web page. He discovered that Theteakpatio.com did not show up because Google’s formulas were not giving enough importance to links from other sites about Palo Alto.

It was also a clue to a bigger problem. Finding local businesses is important to users, but Google often has to rely on only a handful of sites for clues about which businesses are best. Within two months of Mr. Brougher’s complaint, Mr. Singhal’s group had written a new mathematical formula to handle queries for hometown shops.

Ok, who wants a copy of “Debug” or at least an hour or two with it? I know I do.

I think it’s good to see that Google is responding to things human beings see as potential problems with search results. They clearly want the algorithm to do the work, but I think Google understands it’s not perfect, and when something isn’t working quite like it should, they’re willing to make a fix. It certainly makes sense not to rush to make changes for every complaint, though.

But Mr. Singhal often doesn’t rush to fix everything he hears about, because each change can affect the rankings of many sites. “You can’t just react on the first complaint,” he says. “You let things simmer.”

So he monitors complaints on his white board, prioritizing them if they keep coming back. For much of the second half of last year, one of the recurring items was “freshness.”

The freshness comment is interesting. ‘Sandbox’ anyone? Is Google rethinking some of the factors it considers? Could this be a sign that they’re dialing back age as a ranking factor? I doubt it, but it does seem to indicate they’re aware that sometimes fresher content is more relevant to the query. It looks like the solution to the problem is to understand the query and determine when users are more likely to want the latest content and when users are more likely to want authority content. Enter QDF (Query Deserves Freshness).

Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.

But last year, Mr. Singhal started to worry that Google’s balance was off. When the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them.

Mr. Singhal introduced the freshness problem, explaining that simply changing formulas to display more new pages results in lower-quality searches much of the time. He then unveiled his team’s solution: a mathematical model that tries to determine when users want new information and when they don’t. (And yes, like all Google initiatives, it had a name: QDF, for “query deserves freshness.”)

The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject.

I’m sure this won’t be the last any of us hears about QDF. Blog search engines usually have a freshness component to them, as do news search engines, which only makes sense. QDF seems to be an indication that writing about ‘hot’ content can give a newer site more of a chance to rank, though keep in mind that ‘hot’ topics tend to get written about a lot, so you would expect more competition for terms around them. What it might mean is that results for whatever Google and QDF determine to be ‘hot’ will change frequently. You could see a boost for your content, but that boost will be temporary as fresher stories are written.
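To make the idea concrete, here is a rough Python sketch of how a “hot topic” check might feed into ranking. None of this comes from Google or the Times article: the function names, the thresholds, and the blending weights are all made up for illustration. It simply treats a topic as hot when recent mentions spike well above their earlier average, and then leans more heavily on freshness for hot topics.

```python
def is_hot(daily_counts, recent_days=2, ratio_threshold=3.0, min_recent=10):
    """Guess whether a topic is 'hot' from daily mention or query counts.

    daily_counts is ordered oldest to newest. All thresholds are
    hypothetical values chosen for this example only.
    """
    if len(daily_counts) <= recent_days:
        return False  # not enough history to form a baseline
    recent = daily_counts[-recent_days:]
    baseline = daily_counts[:-recent_days]
    recent_avg = sum(recent) / len(recent)
    baseline_avg = sum(baseline) / len(baseline)
    if recent_avg < min_recent:
        return False  # too little activity to call it a trend
    # Hot means recent activity is several times the earlier average.
    return recent_avg >= ratio_threshold * max(baseline_avg, 1.0)


def blend_score(authority, freshness, hot, fresh_weight=0.6, default_weight=0.1):
    """Mix an authority score and a freshness score (both 0..1).

    Freshness gets a larger (assumed) weight when the topic is hot.
    """
    w = fresh_weight if hot else default_weight
    return (1 - w) * authority + w * freshness


# Hypothetical pages: an established one and a brand-new one.
old_page = {"authority": 0.9, "freshness": 0.1}
new_page = {"authority": 0.4, "freshness": 0.95}

spiking = is_hot([5, 4, 6, 5, 40, 60])     # sudden burst of mentions
steady = is_hot([50, 48, 52, 50, 49, 51])  # popular, but not trending
```

With these made-up numbers the new page outscores the old one only while the topic is hot, which matches the temporary boost described above.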

The Times article gives a look at how the whole process might work from the moment someone types a query to the moment results are displayed. Note that the system for ranking pages involves more than 200 factors of which PageRank is just one. Also note the push toward personalization in search results.

As Google compiles its index, it calculates a number it calls PageRank for each page it finds. This was the key invention of Google’s founders, Mr. Page and Sergey Brin. PageRank tallies how many times other sites link to a given page. Sites that are more popular, especially with sites that have high PageRanks themselves, are considered likely to be of higher quality.
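For anyone curious how “links from high-PageRank pages count for more” works in practice, below is a small Python sketch of the textbook iterative version of PageRank. It is a simplified illustration, not Google’s production code: the tiny link graph is invented, and the 0.85 damping factor is the value commonly cited from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
```

In this toy graph, page c ranks highest because three pages link to it, and page a beats b and d because its single link comes from the high-ranking c.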

Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years.

Increasingly, Google is using signals that come from its history of what individual users have searched for in the past, in order to offer results that reflect each person’s interests. For example, a search for “dolphins” will return different results for a user who is a Miami football fan than for a user who is a marine biologist. This works only for users who sign into one of Google’s services, like Gmail.

Further, there’s an attempt to understand query intent through the use of classifiers. All the signals and classifiers are then used to determine a page’s relevance, or ‘topicality,’ to the query.

Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names.

These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. A page about President Bush’s speech about Darfur last week at the White House, for example, would rank high in topicality for “Darfur,” less so for “George Bush” and even less for “White House.” Google combines all these measures into a final relevancy score.

The example using the speech about Darfur is interesting. That page might very well mention George Bush often, but in truth it’s less relevant to the man and more relevant to the subject of the speech. That’s not a stretch for a human being to see and hopefully Google’s signals and classifiers can see it too. Relevant results are the goal after all.

The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. “If you have a lot of different perspectives on one page, often that is more helpful than if the page is dominated by one perspective,” Mr. Cutts says. “If someone types a product, for example, maybe you want a blog review of it, a manufacturer’s page, a place to buy it or a comparison shopping site.”

Does “diversity” mean duplicate content? Or are we to take this to mean that Google really is offering a variety of different types of content? Would they need to add extra diversity if they truly understood query intent? If you’ve shown a clear indication you’re looking for information, are the comparison shopping sites necessary? Could more diversity in results be a sign of less understanding of the intent?
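One way to picture the “final check” for diversity, purely as a guess, is a re-ranking pass that caps how many results of any one type can appear on the first page. The Python sketch below assumes each result already carries a relevancy score and a type label; the labels, the cap of three per type, and the function name are all hypothetical, since the article doesn’t say how Google actually does this.

```python
def diversify(results, top_n=10, max_per_type=3):
    """Pick the top_n results by score, limiting each content type.

    results is a list of dicts with "url", "score" and "type" keys.
    The per-type cap is an assumed rule for illustration only.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    picked, skipped, counts = [], [], {}
    for r in ranked:
        if len(picked) == top_n:
            break
        if counts.get(r["type"], 0) < max_per_type:
            picked.append(r)
            counts[r["type"]] = counts.get(r["type"], 0) + 1
        else:
            skipped.append(r)  # over the cap; keep as a fallback
    # If the caps left empty slots, backfill with the best skipped results.
    picked.extend(skipped[: top_n - len(picked)])
    return picked


# Hypothetical results for a product query, dominated by shops.
sample = [
    {"url": "shop1.example", "score": 0.9, "type": "shop"},
    {"url": "shop2.example", "score": 0.8, "type": "shop"},
    {"url": "shop3.example", "score": 0.7, "type": "shop"},
    {"url": "shop4.example", "score": 0.6, "type": "shop"},
    {"url": "shop5.example", "score": 0.5, "type": "shop"},
    {"url": "blog.example", "score": 0.4, "type": "review"},
    {"url": "maker.example", "score": 0.3, "type": "manufacturer"},
]
first_page = diversify(sample, top_n=4, max_per_type=2)
```

Here the review and the manufacturer page make the first page ahead of higher-scoring shops, which is exactly the trade-off the questions above are poking at.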

The peek the article gives may not reveal as much as we’d all like to know, but when you’re talking about a ‘black box’ any glimpse is welcome and sure to lead to discussion. Here are a few other places you can find the article being talked about.

I think there’s confirmation of some things we’ve all known, some confirmation of things we’ve all thought we knew, and maybe even a few new doors that were opened for us all to speculate about. What do you take from the article? Did it clarify some things for you? Give you some new things to think about? What ways do you see the information helping you when optimizing your pages and sites?
