From Threads, Fall 2010 issue
The Starfish team (from left): Fatma Bilgen Cetin, Herodotos Herodotou, Fei Dong, Nedyalko Borisov, Gang Luo, Harold Lim, Liang Dong, and Prof. Shivnath Babu
At first, Professor Shivnath Babu was fascinated. Hadoop looked like an exciting new way to process massive amounts of data -- an open-source version of Google's famous MapReduce, a software framework that allows users to analyze large data sets across hundreds, even thousands, of computers. But in papers and at conferences, database experts dismissed the system as inferior to traditional databases, so Babu held his enthusiasm in check.


Starfish automatically finds good settings for MapReduce jobs
But beginning in 2007, Babu, an Assistant Professor in the Department of Computer Science specializing in database systems, watched as companies like Yahoo!, Facebook, eBay, and AOL began using Hadoop for everything from structured data storage to search optimization. Finally, while on sabbatical in the spring of 2010, Babu let his curiosity win out and took Hadoop for a test drive.
"What I saw was very different from what you hear and read in papers. There was much more under the hood," says Babu. "Boy, were those people wrong." Hadoop had many benefits, Babu realized: It allowed a user to input massive amounts of data easily, to change and adapt workloads based on the quantity of data, and to perform sophisticated data analysis, all without requiring expert knowledge of databases. Exploring the system, Babu began to think about all the people that it could benefit.
We live in an era of massive amounts of data. "Big data" is no longer a term just for analysts, but for computational biologists, economists, physicists, and more. Each year, scientists gather petabytes of data from diverse sources, "but they're struggling to process it," says Babu. To use traditional databases, a person needs to be skilled in Structured Query Language, a query language most scientists have never even heard of. Hadoop is more user-friendly, but to meet their performance needs, users may still have to adjust many tuning knobs in the system. That's not a problem for companies like Facebook and Yahoo!, which have the resources to do so. But what about everyone else?
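
For a sense of what those knobs look like, here is a minimal sketch using the Hadoop Java API of the day. The parameter names are real Hadoop settings of that era; the values are illustrative guesses, since good choices depend on the cluster, the data, and the job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HandTunedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Four of Hadoop's many tuning knobs. The values are guesses;
            // good settings depend on the cluster, the data, and the job.
            conf.setInt("mapred.reduce.tasks", 20);              // reduce-phase parallelism
            conf.setInt("io.sort.mb", 200);                      // buffer for sorting map output
            conf.setBoolean("mapred.compress.map.output", true); // trade CPU for network I/O
            conf.set("mapred.child.java.opts", "-Xmx512m");      // heap size for each task

            // A pass-through job (identity map and reduce by default),
            // just to show where the knobs attach; a real job would set
            // its own Mapper and Reducer classes here.
            Job job = new Job(conf, "hand-tuned job");
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Multiply this by the dozens of other settings that interact with one another, and the burden on a scientist without a systems background becomes clear.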
Visualizing Hadoop execution
This fall, Babu and his students have embarked on an effort to bring Hadoop to the masses, a project they fondly call "Starfish." The goal is a system in which a user can write a big data application in whatever programming language is most comfortable for them. Even more importantly, the system will assess what the user is trying to do -- whether it's cataloging thousands of PDFs or mining data from hundreds of experiments -- and streamline the process. "The more a system understands what the program is trying to do, the better it can optimize it," says Babu. "This opens up a whole bunch of interesting challenges."
After a year of research and initial tests, this fall Babu and his students wrote and submitted their first paper on Starfish. They've designed Starfish to act as a layer on top of Hadoop that makes the platform self-tuning -- able to optimize itself to a user's needs on the fly -- as well as more robust, more elastic (able to shrink or grow easily as the size of a dataset changes), and better at managing numerous versions of the same data.
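
What would a self-tuning layer mean in practice? Here is a purely hypothetical sketch -- not Starfish's actual interface -- of the idea: instead of asking the user to guess settings, the layer derives them from what it can observe about the input.

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical illustration only. A real self-tuning system would
    // base such decisions on profiles and cost models built from
    // observed runs, not on the simple rules of thumb shown here.
    public class SelfTuningSketch {
        public static Configuration recommend(long inputBytes) {
            Configuration conf = new Configuration();
            // Scale reduce-phase parallelism with data volume:
            // roughly one reducer per gigabyte, capped at 100.
            int reducers = (int) Math.min(100, Math.max(1, inputBytes >> 30));
            conf.setInt("mapred.reduce.tasks", reducers);
            // Compress map output only when the shuffle is large enough
            // for the network savings to outweigh the CPU cost.
            conf.setBoolean("mapred.compress.map.output", inputBytes > (4L << 30));
            return conf;
        }
    }

The point is the shape of the interaction: the user supplies the program and the data, and the layer supplies the configuration.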
"We have a good idea of where we are going," says Babu. "Hadoop has shown promise, but its usage has grown beyond big companies. Hopefully Starfish users will be spared the hassle of setting up and tuning Hadoop," he adds. "Instead, they can write programs in whatever language they want, and Starfish will do the rest."