Wednesday, June 9, 2010

Google Caffeine Is Live

Illustration showing difference between old and new Google index.

Late yesterday Google announced the launch of its new web indexing system, Caffeine. According to Google, Caffeine provides 50 percent "fresher" results for web searches than the previous index did, and it is the largest collection of web content the company has ever offered.

The big push for this new indexing technology is the rise (explosion, really) of real-time data on the web. With Twitter acting as a de facto news source for so many, and with other services such as location-based social media, rapid-fire blogs, mailing list archives and the like pushing data live at an ever-increasing rate, Google's old model of updating once a week (or every two weeks) just wasn't cutting it. We've seen Twitter integrated into Google search results already, but that feed was bolted on to results that were otherwise out of date.

Google provided the image you see at the top of this post as an example of how the old index worked compared to how Caffeine works. Unless you speak blocks and the Bohr model, it really doesn't say much (except that I worry about the little guy in the cloud taking a camera to the head). The gist is, the old index was built in layers, with some content being refreshed more regularly than other content. Sites like CNN might live at the top of the index, being refreshed regularly. Sites like my personal pages might live at the very bottom, only being refreshed when someone goes in with a shovel to scrape the cruft off the edges of the database.

With Caffeine, they've flipped all that over. The crawler is always out, always indexing, and always updating the index. Whether it reaches your site today or tomorrow is unknown, but your site is now at least more likely to be picked up sooner rather than far, far later. This is how Google describes it:

Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles.
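For the curious, the iPod comparison holds up to a quick back-of-the-envelope check. The sketch below is mine, not Google's, and it rests on two assumptions that are not in the announcement: that the "largest iPod" is the 160 GB iPod classic, and that one of those is roughly 103.5 mm tall.

```python
# Sanity check of the iPod comparison in Google's announcement.
# Assumptions (not from Google's post): the "largest iPod" is the
# 160 GB iPod classic, which is roughly 103.5 mm tall.
IPOD_CAPACITY_GB = 160
IPOD_HEIGHT_MM = 103.5
IPOD_COUNT = 625_000        # figure quoted by Google

MM_PER_MILE = 1_609_344     # 1 mile = 1,609.344 m

# Total storage across all the iPods, in gigabytes.
total_storage_gb = IPOD_COUNT * IPOD_CAPACITY_GB

# Length of the stack if the iPods are laid end-to-end, in miles.
stack_length_miles = IPOD_COUNT * IPOD_HEIGHT_MM / MM_PER_MILE

print(f"Storage: {total_storage_gb:,} GB")
print(f"Stack length: {stack_length_miles:.1f} miles")
```

Under those assumptions the storage comes out to 100 million gigabytes and the stack to a little over 40 miles, which lines up with the "nearly 100 million gigabytes" and "more than 40 miles" in the quote.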

As a searcher, you may find that you are getting more timely results more regularly. The effect may not be immediate, however. I have already been trying it out, but my tweets aren't really a good example.
