Oct 18, 2011 4:01 PM

How Yahoo Spawned Hadoop, the Future of Big Data

If you listen to the pundits, Yahoo isn't a technology company. And yet it spawned one of the most important software technologies of the last five years: Hadoop, an open source platform designed to crunch epic amounts of data using an army of dirt-cheap servers.

The email went to Eric14. His real name is Eric Baldeschwieler, but no one calls him that. At fourteen letters, Baldeschwieler is a mouthful, and he works in a world where a name takes a backseat to an online handle.

The sender was Rob Bearden, a serial entrepreneur from Atlanta, Georgia, famous for actually making money from open source software. He e-mailed Baldeschwieler because he was looking to build a new company around what is widely regarded as The Next Big Thing in corporate computing. The irony is that Baldeschwieler worked for an outfit few would associate with enterprise technology. And if you listen to the pundits, it wasn’t a technology company at all. He worked for Yahoo.

The two met for dinner at a Vietnamese restaurant in Palo Alto, California, just down the road from Yahoo’s Sunnyvale headquarters. By the end of the meal, they agreed that Yahoo -- yes, Yahoo -- held the seeds of a company that could reshape the way big businesses operate, and within six months, they convinced the Yahoo board to spin off Baldeschwieler and about 25 other engineers.

Dubbed Hortonworks, the new venture is by no means guaranteed success, but it certainly has its hands on the right technology. Much as the world can’t quite grasp Eric Baldeschwieler’s last name, the pundits are still struggling to wrap their heads around the fact that Yahoo bootstrapped one of the most influential software technologies of the last five years: Hadoop, an open source platform designed to crunch epic amounts of data using an army of dirt-cheap servers.

Today, Hadoop underpins not only Yahoo, but Facebook, Twitter, eBay, and dozens of other high-profile web outfits. It analyzes the vast amounts of data generated by these online operations, but it also pumps data into live public applications, including Facebook’s new messaging services and the "Today" module that serves up news stories on the Yahoo homepage.

Last year, eBay erected a Hadoop cluster spanning 530 servers. Now it’s five times that large, and it helps with everything from analyzing inventory data to building customer profiles using real live online behavior. "We got tremendous value -- tremendous value -- out of it, so we’ve expanded to 2,500 nodes," says Bob Page, vice president of analytics at eBay. "Hadoop is an amazing technology stack. We now depend on it to run eBay."

Thanks to its success on the web, the platform is primed for use in the business world -- leading a wave of technologies spilling out of the net’s biggest names and into the corporate data center. "Hadoop is not just a laboratory curiosity," says Jim Kobelius, an analyst with research outfit Forrester, who spent the past few months interviewing companies about their use of the technology. "It’s actually in use in operational environments today."

Giants such as IBM, EMC, Oracle, and even Microsoft are pitching Hadoop tools at corporate customers. An all-star startup dubbed Cloudera has sprung up around the technology, counting among its ranks Hadoop's original developer, Doug Cutting, who once worked for Baldeschwieler at Yahoo. And now, Cloudera and Cutting have big competition from his former boss. Hortonworks officially opened for business in July with Rob Bearden as chief operating officer and Baldeschwieler as CEO.

In today's internet-driven world, more and more data is hitting big businesses, and it's hitting them faster. Hadoop is a way of dealing with that data, and Hortonworks aims to take the open source project mainstream. "There's a change happening, driven by unprecedented volumes and velocities of unstructured data," Rob Bearden says. "Traditional relational databases and business intelligence software can't handle this. The thesis is that Hadoop can."

Open Source Déjà Vu

If you had stumbled onto Rob Bearden and Eric Baldeschwieler as they sat down for dinner that night in Palo Alto, you might have wondered what on earth brought them together. Born and raised in Georgia before graduating with a degree in marketing from Jacksonville State University in Alabama, Bearden is a man who knows how to get a point across with his lilting Southern accent, whereas Baldeschwieler, aka Eric14, is very much the laconic software engineer. He gets his point across with code.

Peter Fenton -- a partner with Hortonworks’ chief backer, Silicon Valley VC firm Benchmark Capital -- sees the contrast. He describes Bearden as a "true Southern gentleman", before referring to Baldeschwieler as an "introverted architect." But for a venture like Hortonworks, Fenton says, Bearden and Baldeschwieler are ideally suited.

"Eric is the 'editor in chief', whereas Rob is the 'publisher,'" Fenton explains. "There's the person who's the conceptual authority for the company, who knows where to go next, and then there's the person who actually builds the business, who builds in rigor and general management and processes. They have to complement each other in such a way that they have to be different."

Based in Sunnyvale, Hortonworks aims to expand the scope of the open source Hadoop project, adding all the tools the average corporate operation would need, before eventually settling on a way to make money through service and support. Baldeschwieler -- who ran the Hadoop team at Yahoo from the very beginning -- will oversee the coding. Bearden will hunt down the revenue.

Bearden has played the "publisher" role before. Nearly a decade ago, Fenton hired him as president and chief operating officer at JBoss -- an Atlanta-based outfit that grew up around the JBoss open source Java application server -- and the project’s founder, Marc Fleury, served as CEO. Bearden then moved to another Benchmark-backed, open source outfit, SpringSource, where he worked as COO alongside chief exec and project founder Rod Johnson.

Fleury declined to discuss his time with Bearden, but another JBoss colleague, Joe McGonnell, says that underneath Bearden’s smooth manner is the sort of ruthlessness that makes things happen in the business world. "On the surface, he has a laid-back personality and a Southern charm if you will, but at the same time, he’s very intense," McGonnell says. "He’s a very passionate guy. He really locks in on the opportunity he’s working on."

At both JBoss and SpringSource, Bearden got results. In just two years, JBoss reached a $60 million revenue run rate, before it was purchased by Red Hat for at least $350 million in 2006, and SpringSource was scooped up by VMware for approximately $362 million in 2009. He didn’t have the same success with his venture in between those two big sales -- the Atlanta-based OpenSpan -- but two out of three ain’t bad.

In many ways, Hortonworks is déjà vu all over again. Open source project. Funding from Peter Fenton. Engineer as chief exec. Bearden as the businessman. But his latest venture is also a bit different. The immediate aim is to expand the open source project, not sell subscriptions to a distro based on the existing open source code. The open source code, Bearden says, isn't yet ready for the enterprise, and Hortonworks aims to change that.

"We believe that for this market to grow and grow fast, everything has to happen in open source," says Bearden. "We need to make the open source project enterprise viable. Right now, there are less than 50 production sites of Hadoop in the world, but nearly every Fortune 500 company is evaluating it. We have to make sure it's ready for them to use."

Bearden is coy about how the company intends to actually make its money, and on the surface, his let's-just-expand-the-open-source-project pitch seems naively idealistic. But it's hard to argue with his track record, and those who know him say he has a knack for seeing where a market will eventually go. "There’s a lot of really talented people that have worked with Rob that keep following him around from company to company, and that says a lot about him," says McGonnell. "They know he’s got a knack for finding companies that have an opportunity to change a market, not companies that can grow at a respectable rate."

Hortonworks is his most ambitious play yet. The ultimate goal is not to sell the company to a big-name tech outfit. Hortonworks just spun out of a big-name tech outfit. Bearden believes the market for Hadoop is so large that Hortonworks will eventually grow into a public company to rival the likes of Red Hat and VMware. "This is clearly one we build to go public," Bearden says. "The whole data layer for the enterprise is shifting. That's the market opportunity for this company."

Origins of the Elephant

The Hortonworks name is a nod to the eponymous elephant in Dr. Seuss’s classic children’s book Horton Hears a Who. Hadoop, you see, has a long history with pachyderms. It was dubbed Hadoop after the yellow stuffed elephant that once belonged to the son of its founder, Doug Cutting. It’s unclear whether the Seuss estate is looking for royalties. But Cutting’s son certainly is. "I've told him I'll pay his college tuition," Cutting says.

In 2004, Doug Cutting and fellow coder Mike Cafarella were building an open source alternative to web search engines like Google and Yahoo. The project was called Nutch, and they needed an inexpensive way of building the massive index of web pages the project required. As luck would have it, Google had just released a pair of research papers describing two proprietary software platforms that underpinned its own search engine: the Google File System (GFS), a means of storing data across distributed machines, and Google MapReduce, a distributed number-crunching platform that runs atop GFS. These served as the basis for Hadoop.

"We knew from the beginning we wanted to do Nutch in a distributed manner. We knew we couldn't do it on one machine," Cutting says. "We had it up and running on four or five machines, and we had a lot of manual steps to keep everything running on those machines. The papers provided a way to automate all those steps and give us reliability and really give a nice framework for what we were doing."

Cutting was a free agent at the time, but he was looking for full-time employment, and somewhere along the way, he discussed his new project with Raymie Stata, then the chief architect of search and advertising at Yahoo. Cutting had also interviewed with IBM, but whereas Big Blue was interested in an older open source project of his -- Lucene, an information retrieval software library -- Stata latched on to Cutting's latest obsession.

Stata soon put Cutting on contract to continue his work on Hadoop, and in early 2006, after a working prototype was in place, Yahoo committed to moving its search infrastructure to the platform. "Every search company at that time had built something very MapReduce-like ... [but] these were overly complex systems," says Stata, who later ascended to the CTO post at Yahoo. "The genius of MapReduce is that it was such a simple abstraction." Cutting joined a brand-new Hadoop team run by Eric Baldeschwieler, who had come to Yahoo after the company acquired his previous employer, search outfit Inktomi.

In those days, Cutting never imagined the platform would be used for anything other than a search engine. But Baldeschwieler says that Yahoo embraced the project -- and kept it open source -- because the company realized it could become a "general purpose" technology. "From the beginning, we thought that the world needed this," Baldeschwieler says, "though it exceeded our wildest expectations."

Hadoop Maps the Web

Two years later, in January 2008, Yahoo flipped the switch on its first major Hadoop application, using the platform to build its search "webmap" -- an index of all known webpages and all the meta-data needed to search them. At the time, Baldeschwieler boasted that Yahoo’s 10,000-processor-core Hadoop cluster could generate this webmap 33 times faster than the company’s previous system on the same hardware, but speed was only a small part of the story.

When running on all cylinders, the old system was actually faster than the new Hadoop application. The difference is that Hadoop -- which spreads tasks evenly across a sea of low-cost machines -- is designed to keep running when individual servers fail.

"When it’s running perfectly, the old system does outperform the new one. But of course, hardware fails, and there are all sorts of scenarios under which the old system doesn’t perform perfectly," Baldeschwieler said. "Hadoop gives us much more flexibility. It’s built around the idea of running commodity hardware that runs all the time."

Like Google MapReduce, Hadoop "maps" tasks across a cluster of machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation. If a system fails, another node can pick up the slack. It’s an old "grid computing" technique given new life in the age of "cloud computing."
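To make the idea concrete, here is a minimal, single-process sketch of the map-and-reduce pattern in Python. It is not Hadoop code and uses none of Hadoop's APIs. The "nodes" are ordinary functions, and every name in it (`map_words`, `run_on_cluster`, `broken_node`, and so on) is a hypothetical stand-in chosen for illustration. It counts words across three chunks of text and shows a second node picking up the slack when the first one fails.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def healthy_node(task, chunk):
    """Hypothetical worker that runs the sub-task normally."""
    return task(chunk)


def broken_node(task, chunk):
    """Hypothetical worker that always fails, standing in for dead hardware."""
    raise RuntimeError("simulated hardware failure")


def run_on_cluster(task, chunk, nodes):
    """Try each node in turn; if one fails, the next one re-runs the sub-task."""
    for node in nodes:
        try:
            return node(task, chunk)
        except RuntimeError:
            continue  # this node is down, so hand the chunk to the next one
    raise RuntimeError("all nodes failed")


def reduce_counts(pairs):
    """Reduce step: fold all (word, count) pairs into one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(chunks, nodes):
    """Split the job into per-chunk sub-tasks, then reduce the results."""
    mapped = []
    for chunk in chunks:
        mapped.extend(run_on_cluster(map_words, chunk, nodes))
    return reduce_counts(mapped)


chunks = [
    "yahoo spawned hadoop",
    "hadoop maps the web",
    "the web runs on hadoop",
]
result = word_count(chunks, [broken_node, healthy_node])
print(result["hadoop"])  # 3
```

A real cluster runs the map sub-tasks in parallel on separate machines and moves the intermediate pairs over the network. The sketch runs them one after another, which keeps the two steps and the retry logic easy to see.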

That spring, Yahoo held its first Hadoop developer summit in Santa Clara, California, and though it expected fewer than 100 attendees, more than 350 showed up. Amazon was running Hadoop atop its Elastic Compute Cloud (EC2), and both Yahoo and IBM Research had built SQL-like query languages for the platform.

A year later, in spring of 2009, the second annual summit attracted twice as many developers. Facebook and eBay were using Hadoop. At Yahoo, it was driving services beyond the search engine. And even Microsoft was using the open source platform, after acquiring San Francisco startup Powerset, which had built its semantic search engine atop the platform.

Famously allergic to open source software, Microsoft eventually gutted Powerset’s Hadoop cluster and replaced it with some sort of proprietary technology. But it has now reversed course and embraced the platform once again, and Hadoop is a mainstay with the rest of the web’s biggest names -- including Google, which had adopted Hadoop as a way of teaching the Mountain View way to up-and-coming engineers. The first enterprise play wasn’t far behind. Cloudera launched in March 2009.

The Hadoop All-Stars

Cloudera grew straight out of the big web players. Yahoo vice president of engineering Amr Awadallah and Facebook "data manager" Jeff Hammerbacher joined forces with Christophe Bisciglia, who had installed Hadoop as the cornerstone of the Google "Big Data" course he taught at the University of Washington. The idea was to do for Hadoop what Red Hat had done for Linux: create a company that would provide services and support and additional software around the open source platform.

Overseen by ex-Oracle man and longtime open source guru Mike Olson, Cloudera now boasts dozens of customers across both the web and the enterprise, including eBay, Samsung, Rackspace, Groupon, and comScore. And at the end of 2009, the Burlingame, California-based startup recruited another Hadoop all-star: Doug Cutting.

Cutting’s departure from Yahoo came little more than a week after the company agreed to shift its underlying search infrastructure onto Microsoft Bing. Though he said this played no role in his decision, his move was symbolic of Yahoo’s struggles to retain its talent -- and its mojo -- after offloading such a high-profile part of its business to Redmond.

The Microsoft deal meant that Yahoo’s webmap -- the world’s largest Hadoop application -- would soon vanish, and many wondered whether the company would retain its place at the hub of the Hadoop world. As it turned out, Yahoo continued to fund Hadoop development in a big way -- with the platform playing a major role across the rest of its web services -- but the company was still struggling to cope with the spectre of the Microsoft deal -- and other demons -- when Rob Bearden emailed Eric Baldeschwieler.

The Spinoff

After selling SpringSource to VMware, Fenton and Bearden, who had become a partner at Benchmark Capital, started looking for the "biggest opportunity" in today’s business market, and they soon settled on Hadoop. "We looked at a lot of things, around social media and things like that. But it was very obvious, very quickly that being able to manage 'Big Data' is the biggest problem that CIOs have to solve, and they are looking for a new platform to do that with, as opposed to their existing relational [database] and [business intelligence] technologies," Bearden said when Hortonworks was first announced. "It was clear that Hadoop was the way they wanted to solve the problem."

Fenton and Bearden considered an investment in Cloudera, but they eventually decided they didn’t like the company's business model -- even though that model is nearly identical to what they had so much success with at JBoss. Cloudera uses what’s called an "open core" model, offering an open source Hadoop distro as well as a for-pay enterprise version of the platform that includes proprietary tools.

Rob Bearden is adamant that with Hadoop, the primary aim should be to expand the core open source platform -- that this will ultimately bring the platform to a much wider audience. And the only way to do this, he says, is to control the project’s core "committers".

The majority of the core committers were still at Yahoo, he decided, and those committers worked for Baldeschwieler. Baldeschwieler says that when they met, he and Bearden immediately agreed that the committers should be spun off into their own company. The idea had already been discussed by some at Yahoo. But the pair still had to convince Jerry Yang and the rest of the Yahoo board.

Fenton and Bearden told Jerry Yang and company that they should spin a company off, rather than lose all their Hadoop engineers to Cloudera and other web competitors. "If we didn't form a company soon, those engineers would have been recruited one by one out of Yahoo," Fenton says. "Hadoop had tipped. It had gone from being in the backwater of consumer internet companies to the mainstream consciousness of enterprise IT. As Rob [Bearden] said, that was good news for Yahoo but it was also bad news for Yahoo. The only way to keep those engineers together was to give them the autonomy to go and pursue their mission as their own company."

But Yahoo's Raymie Stata looks back on the Hortonworks decision very differently. When Doug Cutting joined Yahoo and insisted that Hadoop remain open source, Stata agreed, and he says the same thinking carries over to the new venture. "[When Yahoo started work on Hadoop], we believed that Yahoo was living in the future, and that we were doing this for the general interest," he says. "Hortonworks is just a natural extension of what we started all those years ago. If we wanted [Hadoop] to become the standard of the industry, we needed a Hortonworks to exist."

Rumors of a spinoff first turned up in the press in April of this year, and by June, Yahoo, Baldeschwieler, and Bearden formally unveiled Hortonworks.

Elephants Everywhere

The initial result is an amusingly heated rivalry between Cloudera and Hortonworks -- the kind of rivalry you only see in the open source world. In September, Hortonworks unleashed a blog post boasting that Yahoo has contributed 84 percent of the lines of code still in the Apache Hadoop "trunk," and this sparked a public back-and-forth between Cloudera’s Mike Olson and Eric Baldeschwieler over the accuracy of that figure. But ultimately, this Hadoop civil war shows just how vibrant the platform is.

"Additional investment in the platform and more people concentrating on the open source distro is good for community and good for Cloudera," Olson says. It’s the sort of thing you always hear from a competitor when a new company enters a market. But in this case, there’s a truth to it. Bearden and Baldeschwieler’s efforts to expand the open source project can only help Cloudera -- and the rest of the market.

As it stands, the open source code is a long way from meeting enterprise needs without some serious customization. "It still has problems with security, management, and efficiency -- enterprise-class problems," says eBay’s Bob Page. Though the platform is designed to keep running when "slave" servers go down, for instance, it collapses if the one "master" server goes down. And you can't really do random reads and writes as you would with a standard file system.

MapR, another Hadoop startup, has released a version of the platform that solves such problems, and this is the basis of the Hadoop appliance offered by EMC. But the open source platform, which can potentially reach far more businesses, is well behind.

If the platform is refined, eBay's Page says, it can fit right into the enterprise, as a complement to existing data analytics and data warehousing tools from traditional IT giants like Oracle, HP, and SAP. Unlike these tools, Hadoop is designed to handle mountains of unstructured data -- just the sort of data facing the enterprise in the modern Internet age.

With traditional database and data analytics tools, information is stored in neat rows and columns, and there are limits to how much data you can juggle and how quickly. With Hadoop, you can distribute raw data across a vast cluster of low-cost machines, and you can process that data in the same place you store it.

The result, says MapR boss John Schroeder, is that you can store all your data and analyze it as needed. "It's a George Gilder thing. If I can get a terabyte drive for $100 -- or less if I buy in bulk -- and I can get cheap processing power and network bandwidth to get to that drive, why wouldn't I just keep everything?" he says. "Hadoop lets you keep all your raw data and ask questions of it in the future.

"It's kinda like the way you use email today versus 15 years ago. Fifteen years ago, you had a very limited amount of email storage, and you had to go in and delete stuff every so often, and if you wanted to be able to find anything, you had to carefully organize it into neat folders. Now, you can just keep everything, and you can search for anything you like. This is like Hadoop and Big Data."

Welcome Idealism

Schroeder says the market is even bigger than he expected, boasting that just months after launching its own proprietary Hadoop distro, the company has $60 million in the "sales pipeline" and $6.5 million in sales opportunities that could close in the third and fourth quarter. Most of these potential customers, he says, have Hadoop up and running in some form. According to Cloudera's Mike Olson, his company is seeing particular interest among internet service providers and big financial institutions.

The potential market for Hadoop is so large -- it's such a general purpose technology -- there is sure to be plenty of support dollars to go around -- for Cloudera and Hortonworks and beyond. Rob Bearden and Eric Baldeschwieler aim not just to grab a small piece of those dollars, but to stretch the entire market to its limits -- at least, that’s their pitch.

The pitch has some scratching their heads, including Hadoop’s founder. "It's not clear to me what their business really is," says Doug Cutting. But Hortonworks already has a customer in Yahoo, and it's now consulting with Microsoft as well. And whether the company makes any money in the long term or not, its short-term aim is certainly welcome. Hortonworks has now expanded to about 30 employees, and all but a few of those are engineers contributing full time to Apache Hadoop. This benefits not only Hortonworks' competitors, but any potential users.

Longtime open source pundit Matt Asay says that Cloudera’s Mike Olson is one of the few people in the industry he would never bet against. But another is Rob Bearden. Many have painted this as a clash of the open source titans, arguing over who will come out ahead, but perhaps the larger point is that so many big names have put their weight behind Hadoop, and that not one but two outfits have committed to refining the open source code with the average business in mind.

Insiders also argue over whether Google or Yahoo deserves the credit for Hadoop. After all, the project was based on research papers published by Mountain View. But this too misses the point. The story here is that unlike Google's platforms, Hadoop was open source, that Yahoo kept it open source, and that so many others, across the web and elsewhere, are committed to it. Some vendors will succeed with Hadoop, others will fail. But the technology is a juggernaut.

Photos: Jim Merithew/Wired.com