Geeks 'R' Us

Saturday, March 05, 2005

Google's secret of success? Dealing with failure

Google Technology: Linux, Shards, MapReduce, and the GFS

CNet reports on Google’s Urs Hölzle, who nicely wraps up the Google server farm specs. According to the article, Google wrote its own file system, GFS (the Google File System). It is optimized for handling large 64-megabyte data blocks. The whole system assumes that failure can and will happen at any time, and it automates the work of recovering when a machine goes down.

The Google data is replicated in three different places, with one master machine to locate individual copies of pieces of data (such as a keyword index).
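
To make the replication idea concrete, here is a minimal sketch in Python. It is not Google's code: the `Master` class, the `place_chunk` and `locate` methods, the server names, and the random placement policy are all assumptions made up for illustration. Only the 64-megabyte block size and the three copies come from the article.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as reported in the article
REPLICAS = 3                    # each piece of data is kept in three places


class Master:
    """Hypothetical master: it only remembers where the copies live."""

    def __init__(self, servers, seed=0):
        self.servers = list(servers)
        self.locations = {}          # (file name, chunk index) -> server names
        self.rng = random.Random(seed)

    def place_chunk(self, name, index):
        # Pick three distinct servers for this chunk (placement policy is assumed).
        chosen = self.rng.sample(self.servers, REPLICAS)
        self.locations[(name, index)] = chosen
        return chosen

    def locate(self, name, byte_offset, failed=()):
        # Translate a byte offset into a chunk index, then skip dead servers.
        index = byte_offset // CHUNK_SIZE
        return [s for s in self.locations[(name, index)] if s not in failed]


def chunk_count(file_size):
    # Number of 64 MB chunks needed to hold file_size bytes (ceiling division).
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
```

In this sketch a 200-megabyte keyword index would occupy four chunks, and a reader asking the master for any byte offset gets back the servers that still hold a copy, so losing one machine leaves two others to read from.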

Google’s servers (according to some figures, there are 100,000 of them) run a stripped-down Red Hat Linux variant. According to CNet, this is really just the kernel of the original system, modified by Google.

And then there are “shards” and “MapReduce”:

“[Google Inc] has also devised a system for handling massive amounts of data and returning rapid responses to queries. Google splits the Web into millions of pieces, or “shards” in Google tech speak, which are replicated in case of failure.

Not surprisingly, the company creates an index of words that appear on the Web, which it stores as an array of large files. But it also has document servers, which hold copies of Web pages that Google crawls and downloads.

Another important engineering feat done by Google is to make writing programs that run across thousands of servers very straightforward, according to Hoelzle. (...)

Google’s programming tool, called MapReduce, which automates the task of recovering a program in case of a failure, is critical to keeping the company’s costs down.”
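
For readers who have not seen the pattern, here is a toy, single-machine sketch of the map and reduce idea in Python, counting the words that appear on a few pages. It is an illustration only and not Google's MapReduce: the function names (`map_words`, `reduce_counts`, `run_task`, `map_reduce`), the retry limit, and the use of `RuntimeError` to stand in for a machine failure are all assumptions.

```python
from collections import defaultdict


def map_words(page):
    # Map step: emit a (word, 1) pair for every word on one page.
    return [(word.lower(), 1) for word in page.split()]


def reduce_counts(word, counts):
    # Reduce step: add up all the counts collected for one word.
    return word, sum(counts)


def run_task(task, arg, max_attempts=3):
    # Re-run a failed task instead of aborting the whole job.
    # RuntimeError is a stand-in for "the machine running this task died".
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def map_reduce(pages, mapper=map_words, reducer=reduce_counts):
    grouped = defaultdict(list)
    for page in pages:               # on a real cluster these would run in parallel
        for key, value in run_task(mapper, page):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

The programmer writes only the two small functions at the top. Splitting the work, grouping the results, and retrying whatever failed are handled by the surrounding framework, which is the part the article credits with keeping costs down.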


[Source]