According to an article at Computerworld, Yahoo is running a 2 PB (not GB, not TB, PB - Petabyte) database that processes 24 billion events a day. Let's put that in perspective. 24 billion events is 24,000 million events; 24,000,000,000 events. 1 petabyte is 1,000,000,000,000 bytes. Yahoo has two of those. Actually, I should be basing this on 1k which is 1024 but when you're dealing with petabytes, I don't think we need to be picky. We're talking really, really big.
Yahoo uses this database to analyze the browsing habits of it half a billion monthly visitors. How would you like to tune those queries? Do you think they allow ad-hoc access?
And the data, all of it constantly accessed and all of it stored in a structured, ready-to-crunch form, is expected to grow into the multiple tens of petabytes by next year.
That means that it is not archived and is sitting in tables, ready to be queried.
By comparison, large enterprise databases typically grow no larger than the tens of terabytes. Large databases about which much is publicly known include the Internal Revenue Service's data warehouse, which weighs in at a svelte 150TB.
Even one TB is still a bug database. Today's 10TB database is last decade's 10GB database. I remember trying to get acceptable performance on a multi-gig database in the early 90s. That was painful. Today, I regularly have indexes bigger than that.
So the real questions are how did they do it and can just anyone do it? Don't rush out to create your own PB database with Postgres just yet.
According to the story, they used Postgres but modified it heavily. Yahoo purchased a company that wrote software to convert the postgres data store to a columnar format (think Vertica or Sybase IQ). That means they also had extensive engineering support to pull this off. They left the interface mostly alone though so that Postgres tools still work. Of course, the whole purpose of using Postgres was that it was a free SQL database. That means that they are accessing it via SQL.
The database is running on "less than 1000" PCs hosted at multiple data centers. Yahoo does not plan to sell or license the technology right now but I would be surprised if that doesn't come at some point. I wonder if they will release that code to the Postgres community? I wonder if the Postgres community would accept it if they did?