Google's challenge: searching the live web

I'm looking forward to the Google "Behind the scenes" presentation put on by the Vancouver HPC Users Group (note: not a permalink; added to It's being given by Narayanan Shivakumar ('Shiva'), a Google Distinguished Entrepreneur and the founding Director of the Google Seattle-Kirkland R&D center. The abstract is as follows:

Google deals with large amounts of data and millions of users. We'll take a behind-the-scenes look at some of the distributed systems and computing platform that power Google's various products, and make the products scalable and reliable.

The bio says that Shiva is currently "excited about a variety of search and webcrawling technologies (including Google Sitemaps)".

I see the challenge for Google and all search engines to be "how to search the live web". One of the things I often explain is that I firmly believe that all static web pages will eventually be replaced by dynamic web pages. Another way to say this is that much of the content on the web, especially the content that is updated often, is actually being created by web apps.

For web apps, URLs are nothing more than keys to content. Type in a URL ending in "about", and the underlying web application will look up the content that is keyed to that URL. In fact, that "about" string is nothing more than a query to the underlying content "engine" of a website.
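To make the "URL as key" idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `CONTENT` dictionary stands in for whatever database a real web app would query, and `lookup` is a hypothetical helper, not part of any framework.

```python
from urllib.parse import urlparse

# Hypothetical content store: URL paths are the keys, page bodies the values.
CONTENT = {
    "/about": "About this site",
    "/contact": "How to reach us",
}

def lookup(url: str) -> str:
    """Treat the path portion of a URL as a key into the content store."""
    # Drop a trailing slash so "/about" and "/about/" hit the same key.
    path = urlparse(url).path.rstrip("/") or "/"
    return CONTENT.get(path, "404 Not Found")
```

Calling `lookup("http://example.com/about")` returns the stored "about" page, while an unknown path falls through to the 404 string. A crawler sees only the keys it happens to discover, never the store behind them.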

What are Google and other search engines? They are centralized aggregators of all the unique queries of all the web apps that run websites in the world. Increasingly, they are having trouble keeping up.

So, the Google Sitemaps (Search Engine Watch Q&A with Shiva from June of 2005) XML file is an attempt to have a central place per site to indicate to Google how to get at that information. Google Co-op (with Google Subscribed Links powering it) is another, albeit user-driven, example. Both are still designed with static (or relatively unchanging) content in mind. Of course, the purpose of Google Sitemaps is to have individual websites use the file as a hint, to inform the GoogleBot what has been updated and what the site would like indexed. I'm not sure that a new format is needed here -- the idea was that it could be an industry standard, but of course it's still being called Google Sitemaps, so it doesn't seem to have done much in that regard.
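For a feel of what such a hint file looks like, here is a rough Python sketch that builds one. I'm assuming the element names (`urlset`, `url`, `loc`, `lastmod`) and the namespace of the sitemaps.org 0.9 schema; the `build_sitemap` function and the example URL are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace; other schema versions differ.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a sitemap string from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # where the page lives
        ET.SubElement(url, "lastmod").text = lastmod  # when it last changed
    return ET.tostring(urlset, encoding="unicode")
```

Something like `build_sitemap([("http://example.com/about", "2006-06-01")])` yields a single `url` entry. Note that `lastmod` is only a hint: the crawler still has to come back and fetch the file to learn that anything changed.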

What other XML format do we know that describes changes to web pages? Yep, RSS and Atom are definite features of the live web, and they have well-defined mechanisms for being polled for updates (and even for publish-and-subscribe, a much more efficient mechanism than polling).
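Here is a small sketch of the polling side, assuming an Atom feed whose entries carry `updated` timestamps. The sample feed, the `updated_since` helper, and the cutoff date are all invented for the example, and a real poller would fetch the feed over HTTP rather than read it from a string.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up feed; timestamps use an explicit +00:00 offset so that
# datetime.fromisoformat can parse them on older Python versions too.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>Old post</title><updated>2006-06-01T10:00:00+00:00</updated></entry>
  <entry><title>New post</title><updated>2006-06-03T09:30:00+00:00</updated></entry>
</feed>"""

def updated_since(feed_xml, last_seen):
    """Return (title, updated) pairs for entries changed after last_seen."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(ATOM + "entry"):
        updated = datetime.fromisoformat(entry.findtext(ATOM + "updated"))
        if updated > last_seen:
            fresh.append((entry.findtext(ATOM + "title"), updated))
    return fresh
```

With a cutoff of `datetime(2006, 6, 2, tzinfo=timezone.utc)`, only the entry titled "New post" comes back. The point is that the feed itself says what changed and when, so the consumer never has to re-crawl and diff whole pages.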

Social search, cross-language search (search for "monkey" in English, get relevant matches in all languages you understand and optionally machine translations of ones you don't), subscribed searches, levels of notification (your mobile phone, your RSS reader, your work email account) -- all of these are combining into something more complex than "type words into a box", in the interest of becoming simpler: getting us the information we need, when we need it.

If you hadn't already guessed, yes, I am going to try and ask some questions at the presentation. Of course, this is a presentation at the High Performance Computing Users Group, so maybe I won't get to ask as many search questions as I'd like.

And yes, I do find it ironic that the number one hit for Shiva's full name is a link that leads to a 403 Forbidden error :P