Friday, December 07, 2007 - Super Crunching

In today's excerpt--"super crunching," the increasingly common practice of analyzing very large sets of data on a particular subject to improve service, make diagnoses, set policy, or make decisions--such as the stocking and pricing decisions at Wal-Mart or Amazon:

"When I say that 'super crunchers' are using large datasets, I mean really large. Increasingly business and government datasets are being measured not in mega- or gigabytes but in tera- or even petabytes (1,000 terabytes). A terabyte is the equivalent of 1,000 gigabytes. The prefix tera comes from the Greek word for monster. A terabyte is truly a monstrously large quantity. The entire Library of Congress is about twenty terabytes of text. Part of the point of this book is that we need to start getting used to this prefix. Wal-Mart's data warehouse, for example, stores more than 570 terabytes. Google has about [five] petabytes of storage which it is constantly crunching. ...

"Tera mining of customer records, airline prices, and inventories is peanuts compared to Google's goal of organizing all the world's information. ... Google has developed a Personalized Search feature that uses your past search history to further refine what you really have in mind. If Bill Gates and Martha Stewart both Google 'blackberry,' Gates is more likely to see web pages about the email device at the top of his results list, while Stewart is more likely to see web pages about the fruit. Google is pushing this personalized data mining into almost every one of its features. Its new web accelerator dramatically speeds up access to the Internet--not by some breakthrough in hardware or software technology--but by predicting what you are going to want to read next. Google's web accelerator is continually pre-picking web pages from the net. So while you're reading the first page of an article, it's already downloading pages two and three. And even before you fire up your browser tomorrow morning, simple data mining helps Google predict what sites you're going to want to look at (hint: it's probably the same sites you look at most days).

"The granddaddy of all of Google's super crunching is its vaunted 'PageRank.' Among all the web pages that include the word [you are searching], Google will rank a page higher if it has more web pages that are linking to it. To Google, every link to a page is a kind of vote for that web page. And not all votes are equal. Votes cast by web pages that are themselves important are weighted more heavily than links from web pages that have low PageRanks (because no one else links to them). Google found that web pages with higher PageRanks were more likely to contain the information that users are actually seeking. And it's very hard for users to manipulate their own PageRank. Merely creating a bunch of new web pages that link to your homepage won't work because only links from web pages that themselves have reasonably high PageRanks will have an impact. And it's not easy to create web pages that other sites will actually link to."

Ian Ayres, Super Crunchers, Bantam Dell, Copyright 2007 by Ian Ayres, pp. 10-11, 39-41.
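
For readers who want to see the "links as weighted votes" idea from the PageRank passage in action, here is a minimal sketch in Python. It is not Google's actual implementation: the excerpt gives no formula, so the damping factor of 0.85, the even redistribution of rank from pages with no outgoing links, and the toy web of page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate a PageRank-style score by repeated vote passing.

    links maps each page name to the list of pages it links to.
    damping and iterations are assumed values, not from the excerpt.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a high-ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "home" and "spamhome" each receive two links,
# but the links to "spamhome" come from pages nobody else links to.
toy_web = {
    "hub": ["home"],
    "news": ["hub", "home"],
    "blog": ["hub"],
    "home": ["hub"],
    "spam1": ["spamhome"],
    "spam2": ["spamhome"],
    "spamhome": [],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

In this toy web, "home" ends up ranked far above "spamhome" even though both receive two links, which matches the excerpt's point that freshly created pages linking to your homepage carry little weight.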


