Enterprise-Ready Apache Hadoop

When our company was acquired by EMC in July of 2010, we could easily have been scooped up and monetized as a pretty nice data warehousing business for our parent company. They decided to do the opposite. EMC's leadership believed in our team and our vision for leading the Big Data analytics industry and decided to double down on their investment. The largest and most ambitious piece of this "double down" was the decision to enter the Hadoop market and develop our own Apache-based Hadoop distribution, Greenplum HD. This was undoubtedly going to be a challenge and would require building a great team, a freaking amazing team at that! So in this post, I'd like to shed some light on the incredibly talented and diverse group of technologists we've assembled here at Greenplum and, in the process, point out some of the groundbreaking work they're doing.

I've grown to call this group Greenplum's "Team of Rivals", referencing the title of the Doris Kearns Goodwin book in which she tells the story of how Abraham Lincoln recruited a talented yet divisive group of politicians to form his cabinet. As somewhat of an American history nerd, I found "Team of Rivals" fascinating because Goodwin brings to light the brilliance Lincoln displayed in filling his cabinet with the best people, period. Instead of weighing the particular background or political leaning of these individuals, Lincoln concentrated on recruiting leaders, a cabinet he was able to motivate to work together for the sake of the greater good. Throughout the book, Goodwin shows that the collective knowledge and experience this group brought to Lincoln's presidency was critical to his success as the nation's leader and in keeping the country united while on the path to victory in the Civil War.

Image of Abraham Lincoln's Cabinet via Longwood University.
This story especially resonated with me as I watched Luke Lonergan, Greenplum's co-founder, build our own "Team of Rivals" in his recruitment of the Greenplum HD team. By bringing together an accomplished group of technologists who are leaders in their respective fields, Greenplum has assembled a group similar to Lincoln's in terms of diversity and talent (Big Data talent in this case). Instead of rivalries based on political leanings, the rivalries within our team stem from the somewhat conflicting concepts that apply to each member's area of expertise. Probably the largest and most obvious rivalry within this group is the tension involved in integrating Greenplum's MPP database technology with Hadoop. Yet this is just one of many concepts the team is working on; others include integrating Hadoop with external storage instances, virtualizing Hadoop, and marrying the properties of High-Performance Computing with Hadoop.

As Lincoln's team worked toward the common goal of winning the Civil War, the Greenplum HD team is united around a single objective: creating the industry's leading Hadoop distribution. And given the open, collaborative nature of this team, we've already succeeded in avoiding the egotism and mutual disdain Lincoln had to overcome in forming his cabinet. Let's be clear: in no way am I saying that building a Hadoop distribution is as important or difficult as navigating a Civil War, but I think anyone who has dealt with this technology can agree it's definitely not trivial.

Greenplum's Team of Rivals

The core of the Greenplum HD team ties back to the original Hadoop team at Yahoo!, responsible for the very first Hadoop cluster brought into production at the Internet giant. Former Yahoo! architect Sameer Tiwari is leading the overall strategy of the Greenplum HD platform and is especially focused on integrating Hadoop within a hybrid enterprise data environment with multiple storage systems.
Sameer has been building platform products for large deployments since the Application Server days at Sun Microsystems, and he was building scale-out analytics architectures before the term "Big Data" even existed. He's also an avid blogger: be sure to check out his recent posts, Hadoop and Disparate Data Stores and Managing Hot and Cold Data, for more details on the problems he's focused on addressing.

Sameer is joined by Greenplum's head of Hadoop engineering, Apurva Desai, who managed what was then the world's largest Hadoop implementation, with over 50 Petabytes of data distributed across 45,000 nodes within Yahoo!'s private cloud. Since joining Greenplum, Apurva has not only quickly built a rockstar engineering team, but has also successfully launched Greenplum's 1,000-node Big Data research and development environment, The Analytics Workbench. More recently, Apurva's team released Greenplum HD 1.2, which is based on Apache Hadoop 1.0.3 and includes a Hadoop administration application called Greenplum Command Center. GPHD 1.2 also features an automated install and configuration utility, code extensions to enhance the reliability and performance of Hadoop in virtual environments (co-developed with VMware), and the leading data loading utility for Hadoop, GP Data Loader. In recent testing of GP Data Loader, Apurva's team benchmarked loading data onto 420 nodes of Greenplum HD at nearly 1 Terabyte per second. We will have more concrete loading numbers to release in January, so stay tuned.

Apurva and Sameer work closely with two of Greenplum's resident MPP database experts, Gavin Sherry and Dr. Lei Chang. The group has been collaborating on a project to bring Greenplum's MPP SQL engine to the Hadoop platform. With the development of Apache Hadoop 2.0 and the introduction of YARN as the new resource manager, Hadoop has been re-architected to support computing paradigms other than MapReduce.
This has given our team the appropriate framework to expand upon this project and allow customers to compute directly against Hadoop data using Greenplum's award-winning MPP database engine. The success of this project relies heavily on the impressive database backgrounds of both Gavin and Lei. Gavin is the architect and leader of the Greenplum Database (GPDB) engineering team and one of the leading contributors to the Postgres project, having originally joined it in 1999. His experience as a Greenplum Database kernel engineer over the past six years makes him an invaluable member of this team. Based out of Greenplum's Research and Development office in Beijing, Lei leads the team architecting GPDB to run atop HDFS. He has a PhD in Data Warehousing and Data Mining from Peking University, and his work at Greenplum generally focuses on blending parallel data warehousing and Hadoop technologies. If you missed Lei's presentation at this year's Hadoop Summit, Greenplum Database on HDFS, you might want to check out the video recording here.

One of the other points of emphasis in the development of Greenplum HD was incorporating the Message Passing Interface (MPI) and other properties of High-Performance Computing (HPC) into Hadoop. Although the advent of Hadoop has somewhat influenced the way the HPC community thinks about computing, the two worlds are still very much apart. To kick off this effort of blending the best of the traditional high-performance computing realm with all of the innovation and development occurring in the Hadoop ecosystem, Luke brought in one of the best, Dr. Milind Bhandarkar. Milind is a veteran developer of HPC software platforms (see his PhD thesis, Charisma: A Component Architecture for Parallel Programming) who was also one of the founding members of the Hadoop development team at Yahoo!
In his role as Chief Scientist of Greenplum's Machine Learning Platform, Milind is involved in quite a bit around here, but more recently has centered his attention on a project referred to as MR+, which is focused on bringing MPI and Hadoop closer together.

Milind's journey toward the convergence of Hadoop and HPC has led to a number of discoveries, many of which have exhibited the conflicting nature of the two paradigms. The most notable has been YARN's inability to scale from a performance perspective. Typically, when technology vendors or users talk about scaling, they mean the ability of software to run over arbitrarily large clusters. In HPC environments, however, performance is what defines scale. Milind's team found that launch times for MPI jobs using YARN were adequate with a small number of processors, but once they ran with over 100 processors, the system simply wouldn't perform. As a result of these findings, the group has separated its MR+ efforts into two parts.
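To see why launch overhead matters so much at HPC scale, consider a toy model of a parallel job whose processes are launched one after another before the (perfectly parallel) work begins. The numbers below are illustrative assumptions for the sketch, not measured Greenplum or YARN results, but they show the general pattern the team observed: speedup grows nicely up to a point and then collapses once per-process launch cost dominates.

```python
# Toy scaling model (illustrative only; work_seconds and launch_per_proc
# are assumed numbers, not measurements from YARN or MR+).
def job_time(nprocs, work_seconds=3600.0, launch_per_proc=0.5):
    """Total wall time: serialized per-process launch cost plus parallel work."""
    launch = launch_per_proc * nprocs   # launcher starts each process in turn
    compute = work_seconds / nprocs     # work divides evenly across processes
    return launch + compute

def speedup(nprocs, **kw):
    """Speedup relative to running the same job on a single process."""
    return job_time(1, **kw) / job_time(nprocs, **kw)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n:>5} procs: speedup {speedup(n):5.1f}x")
```

Under these assumptions, 100 processes still yield a healthy speedup, but at 1,000 processes the serialized launch cost swamps the parallel work and the job is slower than at 100: in this regime, shaving launch time (the focus of MR+) is worth more than adding nodes.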
We'll release more details of this project in a future blog post, but we have begun working with HPC leaders such as the Sequoia team at Lawrence Livermore National Laboratory and have already seen dramatic improvements in MapReduce job launch times using the MR+ framework.

To round out the team leading Hadoop development at Greenplum, we've also added industry leaders who have worked with two of the most sophisticated Hadoop organizations in the world: the U.S. Government and Hortonworks. Having earned a PhD in Machine Learning & Multi-Agent Systems, Dr. Donald Miner brings not only an impressive educational background to his role as Lead Hadoop Solutions Architect at Greenplum, but also a wide range of experience from his past life as an analytics consultant to the federal government. One of the most striking differences between the government's use of Hadoop and that of other early adopters is, not surprisingly, security. Don is actively applying his background in making Hadoop more secure, along with his experience with security-focused NoSQL database technologies such as Accumulo, to the development of Greenplum HD. Since joining Greenplum, Don has taken on a customer-facing role, actively working with customers on architecting and deploying Hadoop environments. He just released a book, MapReduce Design Patterns, at the O'Reilly Strata and Hadoop World conference in New York, which has garnered a tremendous amount of attention within the Big Data community. If you're ever in the DC/Baltimore area, you'll probably be able to catch him at the local Hadoop User Group (HUG) Meetup.

Vitthal (Suhas) Gogate is the latest member to join the Greenplum HD team, having served most recently as Architect and Technology Lead at Hortonworks. He is a founder and PMC member of the Apache Ambari project, an open source project focused on building software for the management and monitoring of Hadoop clusters.
Suhas previously worked at Netflix as a Hadoop expert, developing their Hadoop-based data analytics platform on the Amazon EC2 cloud. He was also involved in the early days of Hadoop at Yahoo!, serving as a Solutions Architect focused on consulting and training thousands of users. While at Yahoo!, Suhas led the development of many user-facing tools for Hadoop, including Vaidya, a performance diagnostic tool for MapReduce that he later open sourced as an Apache Hadoop project. At Greenplum, Suhas is focused on the overall architecture and design of Greenplum HD (GPHD), including the development of Greenplum's own web-based Hadoop management tool, Greenplum Command Center.

For anyone who has watched this space over the past 18 months, it's no secret that the Hadoop race is on. We're confident that with this diverse team of leading Hadoop talent building off of each other's unique expertise, Greenplum HD is primed for success. Stay tuned in the coming months: our "Team of Rivals" has some groundbreaking technology to release that we fully expect will change the face of Big Data.
