Hadoop Weekly Issue #71


25 May 2014

Articles in this week’s newsletter cover a couple of themes that have been emerging recently in the Hadoop ecosystem. First, Apache Storm continues to see adoption for production workloads (whereas I’ve yet to see many serious deployments of newer tools like Spark Streaming). Second, Hadoop in the cloud is starting to gain traction (and will likely accelerate as lightweight virtualization and the cloud price wars take off). There are a lot of good articles covering these topics and more in this week’s issue.


kafka-storm-starter is a repository containing an example integration between Kafka and Storm for stream processing. It uses Avro for serialization, and the code base contains both Kafka and Storm standalone code examples, example unit tests, and example integration tests. The README has a lot of details on the implementation, on setting up a development environment, and much more.


“Compiled Python UDFs for Impala” explores the details of the LLVM-based Python UDFs for Cloudera Impala that can be built with the recently released impyla project. The talk details the LLVM intermediate output format, shows some example impyla code for building LLVM UDFs, and gives a performance comparison between impyla and PySpark.


Informit has posted the preface to the recently-released “Apache Hadoop YARN” by Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, and Jeff Markham. The preface briefly tells the story of Hadoop, the Hadoop community, and the reasons for and goals of YARN.


Big Data & Brews has a “How It Works” episode that includes architecture overviews by folks from Concurrent, MapR, and Pivotal.


Flickr has recently migrated an image classification system from an OpenMPI-based system to a hybrid batch + online system using Apache Hadoop and Storm. Hadoop is used to build a classifier from a training data set, and it can also process the entire Flickr corpus using 10,000 mappers in about a week. For real-time updates, they use a 10-node Storm cluster pulling from Redis. The slides from the presentation have much more detail, and there’s a link to the video of the talk in the description.


HBase added support for a new DataType API as part of HBase 0.95.2. Prior to that, all interaction with HBase was via byte arrays. This talk goes through the motivation for and design of the new DataType API. It also has some examples of using the TypedTuple, Structs, and Protobuf serialization APIs.
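One idea underlying typed keys in HBase is order-preserving encoding: bytes must sort the same way the values do. This pure-Python sketch (an analogy, not the HBase API) shows why flipping the sign bit of a big-endian integer makes byte-wise comparison match numeric order:

```python
import struct

def encode_ordered_int(value: int) -> bytes:
    """Encode a signed 32-bit int so that unsigned byte-wise
    comparison of encodings matches numeric order: adding 2^31
    maps the signed range onto an ascending unsigned range, and
    big-endian packing keeps the most significant byte first."""
    return struct.pack(">I", (value + (1 << 31)) & 0xFFFFFFFF)

def decode_ordered_int(data: bytes) -> int:
    """Invert the encoding."""
    (unsigned,) = struct.unpack(">I", data)
    return unsigned - (1 << 31)

values = [-5, -1, 0, 1, 42]
encoded = [encode_ordered_int(v) for v in values]
assert encoded == sorted(encoded)                       # byte order == numeric order
assert [decode_ordered_int(b) for b in encoded] == values
```

This property is what lets a store that only compares raw bytes, like HBase, do meaningful range scans over typed row keys.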


GigaOm has an interview with Cloudera’s VP of Engineering Peter Cooper-Ellis about Cloudera’s datacenter footprint and deployment setup. They deploy on bare-metal (for benchmarking, PoC, etc), internal clouds (for core products like building and packaging), and public clouds like Amazon (for product development, testing, and debugging). Cloudera has to test out lots of different configurations in order to certify and support third-party integrations. The article also notes that optimizing Cloudera products for cloud environments is one of the engineering team’s top initiatives for 2014.


Folks tend not to run Hadoop in cloud environments for performance and cost reasons, but I’m starting to see more evidence of companies doing so, particularly when it comes to HBase. This talk explains the various details of Pinterest’s HBase deployment in the Amazon cloud, which serves terabytes of data. It talks about schema design, load testing, tools, monitoring, alerting, and more.


The Cloudera blog has a post showing how to convert data from Avro to Parquet. It includes two examples: the first uses the Java APIs to write out data in a single thread, and the second shows how to write a MapReduce job to do the conversion. It also has some details on tweaking compression and other configuration parameters.



Hortonworks and BMC announced a partnership to bring BMC’s Control-M for Hadoop to the Hortonworks Data Platform. Control-M is a workflow automation tool which supports HDFS, Pig, Sqoop, Hive, and MapReduce.


Computing has an interview with Cloudera Chief Strategy Officer Mike Olson on the recent Cloudera-Intel deal. He defends the deal, which has been criticized as positioning Cloudera to compete with established DW vendors like Teradata and Oracle, and accepts blame for not explaining their positioning and strategy better. He also defends Cloudera’s stance on open-source (their distribution contains proprietary components), and speaks more about the benefits of the Intel deal.


The Parquet project was accepted into the Apache Incubator. Co-founded by Twitter and Cloudera, Parquet is a columnar storage format built for Hadoop. The format is supported by a number of projects and commercial distributions. The Cloudera blog has more details and a list of posts written about Parquet.


Databricks and Pivotal announced a partnership to support Apache Spark on Pivotal HD 2.0.1. The news was announced on the Databricks blog in a post by Pivotal. The post ends with links to downloads and documentation for getting started with Spark on Pivotal HD.


MongoDB and Hortonworks announced that MongoDB is now a Hortonworks Certified Technology Partner. In technology terms, the MongoDB Hadoop Connector has been certified for HDP 2.1. According to the announcement, extensive review and testing was done as part of the certification, and there is detailed documentation linked to from the post.


InformationWeek has an article on the recently-announced public beta of Splice Machine’s SQL-on-Hadoop product. The article has some additional details on the implementation, which marries Apache Derby and HBase (from the 0.95 series). There’s a testimonial from Harte Hanks about their plan to replace an Oracle RAC system with Splice Machine.


The GigaOm Structure podcast hosted Altiscale co-founder and CEO Raymie Stata, who was formerly CTO of Yahoo when Hadoop was just getting started. The GigaOm website has a summary of the podcast, which covers Hadoop as a Service, some thoughts on the commercial Hadoop industry, Spark, search, and more.


The Optiq project was accepted into the Apache Incubator. Optiq is a system that allows SQL access to data stored in heterogeneous data stores and includes an advanced query optimizer. It’s in use by several Hadoop ecosystem projects, including Apache Drill, Hive, and Cascading Lingual.



Apache Flume 1.5.0 was released. This release adds a number of new features including a SpillableChannel (for spilling to disk when the in-memory buffer fills up) and an Elasticsearch Sink. The release also contains a large number of documentation improvements, bug fixes, and general codebase improvements.
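The spill-to-disk idea is easy to sketch: a bounded in-memory queue that overflows to a disk-backed file, with takes draining memory first. This is a toy Python illustration of the concept, not Flume’s actual implementation or API:

```python
import collections
import os
import pickle
import tempfile

class ToySpillableChannel:
    """Keep up to `capacity` events in memory; spill the rest to a
    disk-backed overflow file. Events are taken in FIFO order,
    memory first, then disk. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = collections.deque()
        self.overflow = tempfile.TemporaryFile()  # stands in for a file channel
        self.read_pos = 0
        self.spilled = 0

    def put(self, event):
        if len(self.memory) < self.capacity:
            self.memory.append(event)
        else:
            pickle.dump(event, self.overflow)     # append at current end of file
            self.spilled += 1

    def take(self):
        if self.memory:
            return self.memory.popleft()
        if self.spilled:
            self.overflow.seek(self.read_pos)     # resume reading where we left off
            event = pickle.load(self.overflow)
            self.read_pos = self.overflow.tell()
            self.overflow.seek(0, os.SEEK_END)    # restore the append position
            self.spilled -= 1
            return event
        raise IndexError("channel is empty")

channel = ToySpillableChannel(capacity=2)
for i in range(5):
    channel.put(i)
assert channel.spilled == 3                        # events 2-4 went to disk
assert [channel.take() for _ in range(5)] == [0, 1, 2, 3, 4]
```

The real channel adds transactions, durability guarantees, and back-pressure; the sketch only captures the memory-then-disk overflow behavior.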


Version 0.14.1 of the Kite SDK was released this week. The release contains a number of bug fixes, including dataset example fixes and a fix to the Kite Maven plugin.


Amazon Elastic MapReduce released a new version of their Hadoop 2 AMI with support for Hadoop 2.4.0, Pig 0.12, Impala 1.2.4, HBase 0.94.18, and Mahout 0.9.


A new version of the Microsoft .NET API for Hadoop was released. The software uses the Hadoop HTTP APIs to communicate with Hadoop clusters, and it is available via the NuGet Package Manager Console.


Google has open-sourced the Google Cloud Storage Connector as part of the “bigdata-interop” Hadoop tools for the Google Cloud Platform. The code is Apache licensed.


Parquet 1.5.0 was released. The release adds statistics to Parquet pages and row groups, adds fixes for column pruning, bumps the protobuf dependence to version 2.5.0, and more.
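The point of page and row-group statistics is that a reader can skip whole chunks whose min/max range cannot satisfy a predicate, without reading them. A small illustrative sketch in Python (not Parquet’s actual reader API; names are made up):

```python
def build_row_groups(values, group_size):
    """Split a column into row groups, recording min/max stats per
    group, analogous to the statistics added in Parquet 1.5.0."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_equals(groups, target):
    """Read only groups whose [min, max] range can contain target;
    the stats let the reader prune the rest without any I/O."""
    scanned = 0
    hits = []
    for g in groups:
        if g["min"] <= target <= g["max"]:
            scanned += 1
            hits.extend(v for v in g["rows"] if v == target)
    return hits, scanned

groups = build_row_groups([1, 3, 5, 20, 22, 25, 90, 95, 99], 3)
hits, scanned = scan_equals(groups, 22)
assert hits == [22]
assert scanned == 1   # two of the three row groups were pruned via stats
```

This is the same mechanism, at a caricatured scale, that lets query engines push filters down into the storage format.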


FullContact has open-sourced their SSTable InputFormat for Hadoop. The library provides offline access to Cassandra data for MapReduce. The SSTable InputFormat can split input data for better parallelism.



Curated by Mortar Data ( http://www.mortardata.com )



A Deep Dive on Amazon Kinesis for Real-time Stream Processing (Irvine) – Wednesday, May 28

Pepperdata Meetup: Best Practices for Large-Scale Distributed Systems (Sunnyvale) – Wednesday, May 28

Leveraging Vertica for Analytics in Hadoop/NoSQL environment (San Ramon) – Wednesday, May 28

Beyond MapReduce [Clash of The Titans Series] (Sunnyvale) – Thursday, May 29

How LinkedIn Uses Scalding for Data Driven Product Development (Mountain View) – Thursday, May 29

Washington State

Fun Things You Can Do With Spark 1.0 with Paco Nathan [Special Event] (Seattle) – Tuesday, May 27

Spring 2014 Seminar Series: Big Data Infrastructure (Tacoma) – Wednesday, May 28


This Ain’t Your Father’s Search Engine (Saint Paul) – Thursday, May 29

North Carolina

May CHUG: Nikhil Kumar (SyncSort) on Converting SQL to MapReduce (Charlotte) – Wednesday, May 28

New York

Design Framework for Big Data Computing (New York) – Wednesday, May 28

Big Data Topic: Applying Testing Techniques to Hadoop Development (New York) – Thursday, May 29

Inside Look: How Next Big Sound Tamed the Hadoop Ecosystem for Music & Publishing (New York) – Thursday, May 29


Monthly Meetup – Open Sessions (Toronto) – Thursday, May 29


Hadoop vs Spark (Tel Aviv-Yafo) – Tuesday, May 27


23rd Brussels Datascience Meetup (Ghent) – Tuesday, May 27


Simplifying Application Development on Hadoop (Berlin) – Monday, May 26

Meetup #14 with Ted Dunning and Sebastian Schelter (Berlin) – Thursday, May 29


Big Data Conference (Mumbai) – Friday, May 30

How YARN Made Hadoop Better (Hyderabad) – Saturday, May 31


Read More…

Yahoo at Hadoop Summit, San Jose 2014

By Sumeet Singh, Sr. Director, Product Management, Hadoop

Yahoo and Hortonworks are pleased to host the 7th Annual Hadoop Summit – the leading conference for the Apache Hadoop community – on June 3-5, 2014 in San Jose, California.


Yahoo is a major open source contributor to and one of the largest users of Apache Hadoop.  The Hadoop project is at the heart of many of Yahoo’s important business processes and we continue to make the Hadoop ecosystem stronger by working closely with key collaborators in the community to drive more users and projects to the Hadoop ecosystem.

Join us at one of the following sessions or stop by Kiosk P9 at the Hadoop Summit to get an in-depth look at Yahoo’s Hadoop culture.


Hadoop Intelligence – Scalable Machine Learning

Amotz Maimon (@AmotzM) – Chief Architect

“This talk will cover how Yahoo is leveraging Hadoop to solve complex computational problems with a large, cross-product feature set that needs to be computed in a fast manner.  We will share challenges we face, the approaches that we’re taking to address them, and how Hadoop can be used to support these types of operations at massive scale.”

Track: Hadoop Driven Business

Day 1 (12:05 PM). Data Discovery on Hadoop – Realizing the Full Potential of Your Data

Thiruvel Thirumoolan (@thiruvel) – Principal Engineer

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

“The talk describes an approach to manage data (location, schema knowledge and evolution, sharing and adhoc access with business rules based access control, and audit and compliance requirements) with an Apache Hive based solution (Hive, HCatalog, and HiveServer2).”

Day 1 (4:35 PM). Video Transcoding on Hadoop

Shital Mehta (@smcal75) – Architect, Video Platform

Kishore Angani (@kishore_angani) – Principal Engineer, Video Platform

“The talk describes the motivation, design and the challenges faced while building a cloud based transcoding service (that processes all the videos before they go online) and how a batch processing infrastructure has been used in innovative ways to build a transactional system requiring predictable response times.”

Track: Committer

Day 1 (2:35 PM). Multi-tenant Storm Service on Hadoop Grid

Bobby Evans – Principal Engineer, Apache Hadoop PMC, Storm PPMC, Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

“Multi-tenancy and security are foundational to building scalable hosted platforms, and we have done exactly that with Apache Storm.  The talk describes our enhancements to Storm that have allowed us to build one of the largest installations of Storm in the world, offering low-latency big data platform services to all of Yahoo on common Storm clusters while sharing infrastructure components with our Hadoop platform.”

Day 2 (1:45 PM). Pig on Tez – Low Latency ETL with Big Data

Daniel Dai (@daijy) – Member of Technical Staff, Hortonworks, Apache Pig PMC

Rohini Palaniswamy (@rohini_aditya) – Principal Engineer, Apache Pig PMC and Oozie Committer

“Pig on Tez aims to make ETL faster by using Tez as the execution engine, as it is a more natural fit for the query plan produced by Pig.  With optimized and shorter query plan graphs, Pig on Tez delivers huge performance improvements by executing the entire script within one YARN application as a single DAG and avoiding intermediate storage in HDFS. It also employs a lot of other optimizations made feasible by the Tez framework.”
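The difference the abstract describes, a single DAG streaming between stages versus chained MapReduce jobs materializing intermediates, can be caricatured in a few lines of Python (illustrative only; the stage functions are made up):

```python
def mr_style(records):
    """Chained MapReduce jobs: each stage fully materializes its
    output (to HDFS, in the real system) before the next stage
    reads it back."""
    stage1 = [r * 2 for r in records]      # job 1 writes everything out
    stage2 = [r + 1 for r in stage1]       # job 2 re-reads it all
    return stage2

def tez_style(records):
    """A single DAG: stages stream record-by-record, so the
    intermediate dataset is never materialized."""
    doubled = (r * 2 for r in records)     # lazy generator: nothing stored
    return [r + 1 for r in doubled]

# Same answer either way; the win is in what never touches storage.
assert mr_style(range(3)) == tez_style(range(3)) == [1, 3, 5]
```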

Track: Deployment and Operations

Day 1 (3:25 PM). Collection of Small Tips on Further Stabilizing your Hadoop Cluster

Koji Noguchi (@kojinoguchi) – Apache Hadoop and Pig Committer

“For the first time, the maestro shares his pearls of wisdom in a public forum. Call Koji and he will tell you if you have a slow node, misconfigured node, CPU-eating jobs, or HDFS-wasting users even in the middle of the night when he pretends he is sleeping.”

Day 2 (12:05 PM). Hive on Apache Tez: Benchmarked at Yahoo! Scale

Mithun Radhakrishnan (@mithunrk), Apache HCatalog Committer

“At Yahoo, we’d like our low-latency use-cases to be handled within the same framework as our larger queries, if viable.  We’ve spent several months benchmarking various versions of Hive (including 0.13 on Tez), file-formats, and compression and query techniques, at scale.  Here, we present our tests, results and conclusions, alongside suggestions for real-world performance tuning.”

Track: Future of Hadoop

Day 1 (4:35 PM). Pig on Storm

Kapil Gupta – Principal Engineer, Cloud Platforms

Mridul Jain (@mridul_jain) – Senior Principal Engineer, Cloud Platforms

“In this talk, we propose PIG as the primary language for expressing real-time stream processing logic and provide a working prototype on Storm.  We also illustrate how legacy PIG code written for MapReduce can run, with minimal to no changes, on Storm.  We also propose a “Hybrid Mode” in which a single PIG script can express logic for both real-time streaming and batch jobs.”

Day 2 (11:15 AM). Hadoop Rolling Upgrades – Taking Availability to the Next Level

Suresh Srinivas (@suresh_m_s) – Co-founder and Architect, Hortonworks, Apache Hadoop PMC

Jason Lowe – Senior Principal Engineer, Apache Hadoop PMC

“No more maintenance downtimes, coordinating with users, catch-up processing etc. for Hadoop upgrades.  The talk will describe the challenges with getting to transparent rolling upgrades, and discuss how these challenges are being addressed in both YARN and HDFS.”

Day 3 (11:50 AM). Spark-on-YARN – Empower Spark Applications on Hadoop Cluster

Thomas Graves – Principal Engineer, Apache Hadoop PMC and Apache Spark Committer

Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC

“In this talk, we will cover an effort to empower Spark applications via Spark-on-YARN. Spark-on-YARN enables Spark clusters and applications to be deployed onto your existing Hadoop hardware (without creating a separate cluster). Spark applications can then directly access Hadoop datasets on HDFS.”

Track: Data Science

Day 2 (11:15 AM) – Interactive Analytics in Human Time – Lightning Fast Analytics using a Combination of Hadoop and In-memory Computation Engines at Yahoo

Supreeth Rao (@supreeth_) – Technical Yahoo, Ads and Data Team

Sunil Gupta (@_skgupta) – Technical Yahoo, Ads and Data Team

“Providing interactive analytics over all of Yahoo’s advertising data, across the innumerable dimensions and metrics that span advertising, has been a huge challenge. From returning results from a concurrent system in under a second, to computing non-additive cardinality estimations, to audience segmentation analytics, the problem space is computationally expensive and has resulted in large systems in the past. We have attempted to solve this problem in many different ways, with systems built on everything from traditional RDBMSs to NoSQL stores to commercially licensed distributed stores. With our current implementation, we look into how we have evolved a data tech stack that includes Hadoop and in-memory technologies.”

Track: Hadoop for Business Apps

Day 3 (11:00 AM) – Costing Your Big Data Operations

Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management

Amrit Lal (@Amritasshwar) – Product Manager, Hadoop and Big Data

“As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations. Our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. We will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.”
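The amortization-plus-metering arithmetic behind a model like the one described above is straightforward to sketch. All names and figures below are hypothetical illustrations, not Yahoo’s actual methodology:

```python
def monthly_node_cost(capex_per_node, amortization_months, opex_per_node):
    """Amortized monthly cost of one node: capital expense spread
    over its useful life, plus recurring operating expense."""
    return capex_per_node / amortization_months + opex_per_node

def bill_tenants(cluster_nodes, node_cost, metered_usage):
    """Split the cluster's monthly cost across tenants in
    proportion to their metered usage (e.g. container-hours for
    MapReduce, region count for HBase, workers for Storm)."""
    total_cost = cluster_nodes * node_cost
    total_usage = sum(metered_usage.values())
    return {tenant: total_cost * usage / total_usage
            for tenant, usage in metered_usage.items()}

# Hypothetical numbers: $36k nodes amortized over 3 years, $250/month opex.
node = monthly_node_cost(capex_per_node=36_000, amortization_months=36,
                         opex_per_node=250)
bills = bill_tenants(100, node, {"ads": 600, "search": 300, "mail": 100})
assert node == 1250.0
assert bills["ads"] == 75_000.0   # 60% of a $125,000/month cluster
```

Metering each tenant’s share is what turns a lump-sum infrastructure bill into the per-team P&L the talk describes.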

For public inquiries or to learn more about the opportunities with the Hadoop team at Yahoo, reach out to us at bigdata AT yahoo-inc DOT com.


Read More…

Computer vision at scale with Hadoop and Storm


Recently, the team at Flickr has been working to improve photo search. Before our work began, Flickr only knew about photo metadata — information about the photo included in camera-generated EXIF data, plus any labels the photo owner added manually like tags, titles, and descriptions. Ironically, Flickr has never before been able to “see” what’s in the photograph itself… 


Read More…

Hadoop Weekly Issue #70


18 May 2014

Yahoo announced their support for Hive and Tez this week in the widely contested SQL-on-Hadoop market. Meanwhile, there is an interesting overview of a real-world use case at Allstate with Cloudera’s SQL-on-Hadoop system Impala. There are also plenty of interesting technical articles and exciting announcements—including the public availability of Splice Machine’s RDBMS on HBase product and a native implementation of MapReduce that’s been open-sourced by Intel.


The Cloudera blog has an article about a change in the way that Oozie manages its shared library directory in HDFS. The changes add support for multiple versions of the directory, which fixes a race condition. The post explains the changes and the tooling around them.


There have recently been a lot of benchmarks in the SQL-on-Hadoop arms race. Those benchmarks tend to use synthetic data, though, so it’s interesting to hear about systems running on real-world datasets. In this case, data from Allstate Insurance is analyzed using Impala. What starts as a 2.3TB CSV file compresses to 106GB when encoded using Parquet, and the resulting performance improvements on a 35-node cluster are impressive. The article has full details, including some tips for anyone looking to tackle a similar dataset.


I once looked into writing a new converter for Parquet in order to write out a custom data type. I wish that I had had this excellent guide, which walks through the various pieces of the Parquet write API. In addition to the written analysis, the post has excellent diagrams showing the different layers as well as how data is stored on disk.
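The nesting the guide walks through (file → row groups → one column chunk per column → pages) can be mimicked in a few lines. A toy layout in Python, just to make the hierarchy concrete, not the real on-disk format:

```python
def to_parquet_layout(records, row_group_size, page_size):
    """Arrange row-oriented records into Parquet's hierarchy:
    a file holds row groups; each row group holds one column
    chunk per column; each chunk is split into pages."""
    columns = list(records[0])
    file_layout = []
    for i in range(0, len(records), row_group_size):
        group = records[i:i + row_group_size]
        chunks = {}
        for col in columns:
            values = [r[col] for r in group]          # columnar slice of the group
            chunks[col] = [values[j:j + page_size]    # split the chunk into pages
                           for j in range(0, len(values), page_size)]
        file_layout.append(chunks)
    return file_layout

records = [{"sym": s, "px": p} for s, p in
           [("A", 1), ("B", 2), ("C", 3), ("D", 4)]]
layout = to_parquet_layout(records, row_group_size=2, page_size=1)
assert len(layout) == 2                       # two row groups
assert layout[0]["sym"] == [["A"], ["B"]]     # a column chunk of two pages
```

The real writer adds encodings, compression, and footer metadata per chunk, but the row-group/chunk/page shape is the same.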


A tutorial on the Cloudera blog shows how to process stock market data stored in an Apache Avro file using Apache Crunch. It covers how to implement a secondary sort to group by day and sort by stock symbol within the day.
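Outside of Crunch, the secondary-sort pattern itself is compact: sort on a composite key, then group on the primary part, so each group’s values arrive pre-ordered. A pure-Python sketch (field names are made up, not from the tutorial):

```python
from itertools import groupby
from operator import itemgetter

def secondary_sort(trades):
    """Emulate a MapReduce secondary sort: the shuffle sorts on a
    composite (day, symbol) key, then the reducer groups on day
    alone, so each day's values arrive already ordered by symbol."""
    shuffled = sorted(trades, key=itemgetter("day", "symbol"))
    return {day: [t["symbol"] for t in group]
            for day, group in groupby(shuffled, key=itemgetter("day"))}

trades = [
    {"day": "2014-05-19", "symbol": "YHOO"},
    {"day": "2014-05-19", "symbol": "AAPL"},
    {"day": "2014-05-20", "symbol": "IBM"},
]
assert secondary_sort(trades) == {
    "2014-05-19": ["AAPL", "YHOO"],
    "2014-05-20": ["IBM"],
}
```

In a real MapReduce or Crunch job the same effect comes from a composite key plus a grouping comparator, so the framework’s shuffle does the sort instead of an in-memory `sorted`.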


While both aim to crunch massive amounts of data, Hadoop and HPC overlap very little—both in terms of community and technology/deployments. This in-depth article explores why the HPC community doesn’t adopt Hadoop (it’s an invader, it looks funny, it’s a poor reinvention of HPC technologies, and more), and suggests some changes that could be made on both sides to break down the divide. Given the massive amounts of money HPC organizations spend, I wouldn’t be surprised if we start seeing vendors address the problems.


Yahoo’s Hadoop Platforms Team has written a post about Apache Hive, Tez, and YARN at Yahoo. The article explains the rise of Hive at Yahoo (it sounds like it’s gaining ground on Pig, especially in the past 6 months), why Yahoo is betting on Hive over various other SQL-on-Hadoop solutions, explains some of the query performance that they’ve seen at Yahoo (and how Shark wouldn’t work on a 100-node cluster in their tests), and more.


Episode 22 of the All Things Hadoop podcast features an interview with Patrick Hunt discussing Apache Solr. The episode covers backing Solr by HDFS, SolrCloud, Cloudera Morphlines, and more. The article has a great summary of some of the technical details.



IBM will ship a new version of its distribution, InfoSphere BigInsights 3.0, in a few months. One of the most notable features of the upcoming release is Big SQL version 3.0, which aims to be a drop-in replacement for existing RDBMSs. The update will include SQL 2011 compatibility, stored procedures, data federation (i.e. pulling back data from other services like DB2), security enhancements, and better performance.


Concurrent and Databricks have announced a partnership to build an integration between Cascading and Apache Spark. Datanami has an interview with Concurrent founder and CTO Chris Wensel about the push to integrate Cascading with multiple backends (the first of which was Tez). He notes that choosing the right backend is all about trade-offs.


Since the Cloudera-Intel deal, IBM’s is one of the few (the only?) distributions from a hardware maker. While most Hadoop clusters run on standard Linux systems, IBM’s distro is optimized to run on both x64 Intel processors and 64-bit RISC Power architectures. Datanami has a story about IBM’s Power8 and Hadoop. It covers some of the advantages of the chip vs. Intel’s Ivy Bridge (e.g. memory, I/O, and cache), which can ultimately lead to cost savings.


Altiscale offers Hadoop as a Service and aims to be more efficient than other systems like Amazon’s Elastic MapReduce by not relying on virtualization. To do so, Altiscale runs bare-metal servers in its own data center. While performance is improved, you must go through the process of transferring your data out of the cloud. This article has more details on the trade-offs and how Altiscale’s offering sits somewhere between EMR and running your own Hadoop cluster.


Hortonworks announced that they’ve acquired XA Secure, provider of security tools for Hadoop. The XA Secure software includes centralized policy management, fine grained access control, auditing, and encryption. Hortonworks plans to open source the software and incubate it in the Apache Software Foundation. They hope that the process will begin in the second half of the year, and they will offer it in binary form until then.


Zettaset’s flagship project, Zettaset Orchestrator, is management software and tools built for security. An interview with Zettaset’s CEO explains how the company thinks about enterprise-grade security for Hadoop, and it discusses some of the software that the company has built.


MapR has released a Sandbox VM with HP Vertica running atop MapR. A post on the MapR blog introduces the VM and talks about some of the use cases and technical details of the MapR-Vertica integration.



Splice Machine announced public availability of its RDBMS that runs atop Hadoop and HBase. Unlike other solutions, Splice Machine’s product includes support for online transaction processing and is ANSI SQL 99-compliant. An article on Datanami has more details about the product, which is set to hit version 1 later this year.


Version 0.14.0 of the Cloudera Kite SDK was released. This version includes improved documentation, view support for the MR and Crunch libraries, bug fixes, and several other enhancements.


Cloudera Enterprise (CDH and Cloudera Manager) version 5.0.1 was released this week. The 5.0.1 release includes a new version of Impala (1.3.1) and several bug fixes to Cloudera Manager and CDH components. In addition, version 4.8.3 of Cloudera Manager was released with several bug fixes and improvements.


Version 0.10.0 of Scalding, the Scala Hadoop library powered by Cascading, was released. This version upgrades Cascading dependencies and includes a handful of improvements.


A team at Intel has been working on a native implementation of the MapReduce Map Output Collector for several months. It’s called NativeTask and was open-sourced this week. In addition to being a drop-in replacement for many existing Hadoop jobs, the framework also supports native mappers and reducers written in C++. For the classic WordCount example, the framework provides a 2.6x speedup.
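For orientation, here is the map-output-collection path of WordCount in miniature, in pure Python. NativeTask reimplements this hot path (collect, sort, combine) in C++; the sketch below shows only the logic, not its API:

```python
from collections import Counter

def map_word_count(lines):
    """Map side of the classic WordCount: emit (word, 1) pairs into
    an output collector and combine them before the shuffle. The
    sorted, combined pairs are what gets handed to the reducers,
    and this collect/sort/combine loop is the stage NativeTask
    moves into native code."""
    collector = Counter()            # stands in for the map output collector
    for line in lines:
        for word in line.split():
            collector[word] += 1     # collect + in-map combine
    return sorted(collector.items())  # sorted map output for the shuffle

output = map_word_count(["to be or not to be"])
assert output == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```

Because this loop runs once per emitted record, it dominates map-task CPU time, which is why moving it to native code yields the reported speedup.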



Curated by Mortar Data ( http://www.mortardata.com )



What is the big idea with ZooKeeper? by Jan Gelin of Rubicon Project (Los Angeles) – Monday, May 19

Techtalk v3.0 – Analytics in Cassandra and Hadoop + InfiniDB (San Francisco) – Thursday, May 22


Big Data, from technology to business potential (Bellevue) – Wednesday, May 21


Learn what’s new in Mahout with Ted Dunning (Boulder) – Wednesday, May 21


St. Louis Hadoop Users Group Meetup (St. Louis) – Tuesday, May 20


Got data? Join us for a tech talk from Josh Wills, Data Scientist – Cloudera (Southfield) – Thursday, May 22


Abacus presents Hortonworks Hadoop and RedHat (Atlanta) – Tuesday, May 20

New York

Intermediate Workshop II: Writing Spark Applications (New York City) – Thursday, May 22


The Data Operating System for Hadoop 2.0 – YARN Tech Talk (Cambridge) – Tuesday, May 20


Kick-off meetup: Intro to Apache Spark by Databricks (Vancouver, B.C.) – Thursday, May 22


HBase User Group London: Types in HBase & Apache Gora (London) – Monday, May 19


Let’s Get Hands-on with Big Data! (Nairobi) – Monday, May 19


Big data: HBase for the Architect, by Nick Dimiduk (Paris) – Tuesday, May 20


Mind the Stack – Architecture stories by Israeli startups (Tel Aviv) – Tuesday, May 20


Charla/Taller de Big Data Open Source (Madrid) – Thursday, May 22


Read More…

Yahoo Betting on Apache Hive, Tez, and YARN

by The Hadoop Platforms Team

Low-latency SQL queries, Business Intelligence (BI), and Data Discovery on Big Data are some of the hottest topics these days in the industry, with a range of solutions coming to life lately to address them, either as proprietary or open-source implementations on top of Hadoop.  Some of the popular ones talked about in the Big Data communities are Hive, Presto, Impala, Shark, and Drill.

Hive’s Adoption at Yahoo

Yahoo has traditionally used Apache Pig, a technology developed at Yahoo in 2007, as the de facto platform for processing Big Data, accounting for well over half of all Hadoop jobs to date.  One of the primary reasons for Pig’s success at Yahoo has been its ability to express complex processing needs through feature-rich constructs and operators ideal for large-scale ETL pipelines, something that is not easy to express in SQL.  Researchers and engineers working on data systems built on Hadoop at the time found it an order of magnitude better than working with the Java MapReduce APIs directly.  Apache Pig settled in and quickly made a place for itself among developers.

Over time and with increased adoption of the Hadoop platform across Yahoo, a SQL or SQL-like solution over Hadoop became necessary for the ad hoc analytics that Pig was not well suited for.  SQL is the most widely used language for data analysis and manipulation, and Hadoop had started to reach beyond data scientists and engineers to downstream analysts and reporting teams.  Apache Hive, originally developed at Facebook in 2007-2008, was a popular and scalable SQL-like solution available over Hadoop at the time, running in batch mode on Hadoop’s MapReduce engine.  While Yahoo adopted Hive in 2010, its use remained limited.

On the other hand, having MapReduce, Pig, and Hive all running on top of Hadoop raised concerns about sharing data among applications written using these different approaches.  Pig and MapReduce’s tight coupling with the underlying data storage was also an issue in terms of managing schema and format changes.  As a result, Apache HCatalog, a table and storage management layer, was conceived at Yahoo in 2010 to provide a shared schema and data model for MapReduce, Pig, and Hive by wrapping Hive’s metastore.  HCatalog merged with the Hive project in 2013, but it remains central to our effort to register all data on the platform in a common metastore and make it discoverable and shareable with controlled access.

The Need for Interactive SQL on Hadoop

By mid-2012, the need to make SQL over Hadoop more interactive became material as specific use cases and requirements emerged.  At the same time, Yahoo had undertaken a large effort to stabilize Hadoop 0.23 (the pre-2.x branch) and YARN and roll them out at scale on all our production clusters.  YARN’s value propositions were absolutely clear.  To address the interactive SQL use cases, we started exploring our options in parallel, and around the same time, Project Stinger was announced as a community-driven project from Hortonworks to make Hive capable of handling a broad spectrum of SQL queries (from interactive to batch) while extending its analytics functions and standard SQL support.  An early version of HiveServer2 also became available to address the concurrency and security issues in connecting to Hive over the standard ODBC and JDBC interfaces that BI and reporting tools like MicroStrategy and Tableau need.  We decided to stick with Hive and participate in its development and phased (Phases I, II, III) delivery.  Hive also happens to be one of the fastest-growing products in our platform technology stack (Fig 1), confirming that SQL on Hadoop is a hot topic for good reasons.

Fig 1. Growth in Hive jobs relative to overall Hadoop jobs

Why Hive?

So, why did we stick with Hive, or as one may say, bet on Hive?  We evaluated the available solutions and stayed the course with Hive as the best solution for our users, for several key reasons:

  • Hive is the SQL standard for Hadoop: it has been around for seven years, is battle-tested at scale, and is widely used across industries
  • It is a single solution that works across a broad spectrum of data volumes (more on this in the performance section)
  • HCatalog, part of Hive, acts as the central metastore for facilitating interoperability among various Hadoop tools
  • It has a vibrant community from many well-known companies, with top-notch engineers and architects vested in its future
  • It is a Top Level Project (TLP) with the Apache Software Foundation (ASF), which offers several advantages: our deep familiarity with the ASF and the related Hadoop ecosystem projects under Apache, and clarity around making contributions to gain influence in the community, which may allow Yahoo to evolve Hive in a direction that meets our users’ needs
  • It is perhaps one of the few SQL-on-Hadoop solutions that has been widely certified by BI vendors (an important distinction, since Hive is often used directly by data analysts and reporting teams)
  • Performance concerns are being alleviated by relentless phased delivery (Hive 0.11, 0.12, and 0.13) against the initially stated performance goals

Query Performance on Hive 0.13

Since performance was one of users’ biggest concerns with Hive 0.10, the version Yahoo was running, we conducted performance benchmarks on Hive (not to say that the significant facelift in features in later versions of Hive wasn’t important).

In a recent performance benchmark Yahoo's Hive team conducted on the January build of Hive 0.13, we found query execution times dramatically better than Hive 0.10 on a 300-node cluster.  To give you an idea of the magnitude of the difference we observed, Fig 2 shows TPC-H benchmark results on a 100 GB dataset for Hive 0.10 with the RCFile (Row Columnar) format on Hadoop 0.23 (MapReduce on YARN) vs. Hive 0.13 with the ORC File (Optimized Row Columnar) format, Apache Tez on YARN, vectorization, and Hadoop 2.3.  Security was turned off in both cases.  With Hive 0.13, 18 out of 21 queries finished under 60 seconds, with the longest still under 80 seconds.  Hive 0.13 execution times were also comparable to or better than Shark's on a 100-node cluster.

Fig 2. Hive 0.10 vs. Hive 0.13 on 100 GB of data

At higher volumes of data (Fig 3 and 4), Hive 0.13 query execution times were not only significantly better, but the queries also completed successfully without failing.  In our comparisons with Shark, we saw most queries fail on the larger (10 TB) dataset; these same queries ran successfully, and much faster, on Hive 0.13, allowing for better scale.  This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size.  The Hive solution resonates with our users because they do not have to learn multiple technologies or discern which solution to use when.  A common solution also yields cost and operational efficiencies, since we build, deploy, and maintain a single solution.

Fig 3. Hive 0.10 vs. Hive 0.13 on 1 TB of data

Fig 4. Hive 0.10 vs. Hive 0.13 on 10 TB of data

The performance of Hive 0.13 is certainly impressive compared to its predecessors, but one must understand how these performance improvements came about. Several systems rely on caching data in memory to lower latency.  While this works well for some use cases, the approach fails when the data is too large to fit in memory, or in a shared multi-tenant environment where memory resources must be shared among tenants.  Hive 0.13, on the other hand, achieves comparable performance through ORC, Tez, and vectorization (vectorized query processing), which do not suffer from the issues noted above.  On the flip side, building solutions in this manner requires heavy engineering investment (hundreds of person-months in the case of Hive 0.13 since the start of 2013) for robustness and flexibility.
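The intuition behind vectorized query processing can be shown in a few lines of Python. This is a conceptual toy, not Hive's implementation: Hive's vectorized execution engine hands each operator a batch of roughly 1024 column values at a time instead of one row at a time, amortizing per-row interpretation overhead across the batch.

```python
# Toy contrast of row-at-a-time vs. vectorized (batch) execution.
# The batch size of 1024 mirrors Hive's default vector size, but
# everything else here is illustrative only.

BATCH_SIZE = 1024

def filter_row_at_a_time(rows, threshold):
    # One operator invocation (with all its dispatch overhead) per row.
    out = []
    for r in rows:
        if r > threshold:
            out.append(r)
    return out

def filter_vectorized(column, threshold):
    # One operator invocation per batch; the inner loop over the batch
    # is tight and branch-predictable, which is where the win comes from.
    out = []
    for start in range(0, len(column), BATCH_SIZE):
        batch = column[start:start + BATCH_SIZE]
        out.extend(v for v in batch if v > threshold)
    return out

values = list(range(5000))
assert filter_row_at_a_time(values, 4000) == filter_vectorized(values, 4000)
```

Both functions compute the same filter; the vectorized form simply restructures the work so the per-value cost is dominated by the comparison itself rather than by operator dispatch.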

Looking Ahead

We are excited about the work going on in the Hive community to take Hive 0.13 to the next level in subsequent releases in terms of both features and performance, in particular the Cost-based Query Optimizations and the ability to perform inserts, updates, and deletes with full ACID support.


Read More…

Hadoop Weekly Issue #69

Hadoop Weekly Issue #69

11 May 2014

Once again, this week’s newsletter has a number of good articles about Apache Spark, which continues to take the Hadoop ecosystem by storm. This week also has an interesting technical post on a new feature of Apache Ambari, as well as an in-depth interview with Cloudera CTO Amr Awadallah.


These slides from a presentation at Data Science Days give an intro to Apache Spark. The talk covers all of the key components of the Spark stack, including Shark (SQL), Spark Streaming, MLlib, and GraphX.


The Cloudera blog has a post covering the design and deployment of YARN Resource Manager (RM) High Availability (HA). This topic was also recently covered by the Hortonworks blog, but this post covers Cloudera-specific parts (like Cloudera Manager integration), making it another valuable resource for understanding YARN RM HA.


A post on the codecentric blog covers a new feature of Apache Ambari, the Hadoop cluster management software, called blueprints. Blueprints provide a mechanism for describing host configurations in a cluster independent of actual nodes. The post describes the feature in depth, including Ambari’s blueprint HTTP endpoints, an example configuration, and an example deployment using the Ambari API.
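To make the blueprint idea concrete, here is a sketch in Python of a minimal blueprint body. The top-level field names follow Ambari's blueprint API; the specific blueprint name, host-group layout, and component list are illustrative placeholders, not a recommended cluster topology.

```python
import json

# A minimal Ambari blueprint sketched as a Python dict. A blueprint
# describes host groups (roles) independently of any physical hosts;
# hosts are bound to groups later, in a separate cluster-creation call.
blueprint = {
    "Blueprints": {
        "blueprint_name": "single-node-hdfs",   # placeholder name
        "stack_name": "HDP",
        "stack_version": "2.1",
    },
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [
                {"name": "NAMENODE"},
                {"name": "SECONDARY_NAMENODE"},
                {"name": "DATANODE"},
            ],
        }
    ],
}

body = json.dumps(blueprint, indent=2)
# In a real deployment this body is registered with the Ambari server,
# e.g. POST /api/v1/blueprints/single-node-hdfs, and then referenced
# by a cluster-creation request that maps host groups to actual hosts.
print(body)
```

The point of the two-step flow is reuse: the same blueprint can be applied to clusters of different sizes by changing only the host mapping.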


This talk describes how Accumulo might someday get support for SQL queries via the Hive API and other SQL-on-Hadoop systems. It also talks about some of the problems that are unique to Accumulo, such as visibility labels and authorizations.


A talk entitled “Collaborative Filtering with Spark” covers using Spark’s MLlib to compute alternating least squares. After motivating a switch from iterative MapReduce, the talk describes two failed attempts that suffered from bottlenecks, as well as the final solution, which uses a custom partitioner.
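The alternating least squares idea itself fits in a few lines: fix one factor, solve for the other in closed form, and alternate. The toy below does rank-1 ALS on a tiny dense matrix in pure Python; MLlib factors a large sparse ratings matrix at rank k, in parallel, but the alternation is the same.

```python
# Toy rank-1 alternating least squares: approximate a ratings matrix R
# as the outer product u * v^T. At rank 1 each least-squares subproblem
# has a one-line closed-form solution, which keeps the sketch short.

def als_rank1(ratings, iterations=20):
    m, n = len(ratings), len(ratings[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iterations):
        # Fix v, solve for each u[i] (1-D least squares).
        vv = sum(x * x for x in v)
        u = [sum(ratings[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve for each v[j].
        uu = sum(x * x for x in u)
        v = [sum(ratings[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

# A matrix that is exactly rank 1 is recovered (up to scaling of u and v):
R = [[2, 4], [3, 6], [1, 2]]
u, v = als_rank1(R)
approx = [[u[i] * v[j] for j in range(2)] for i in range(3)]
```

The bottlenecks the talk describes come from shuffling the factor vectors between the two half-steps, which is exactly what a custom partitioner helps control.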


The MapR blog has a post with a number of questions and answers about the features and status of Apache Spark. Of particular interest, it covers the status of Spark’s integrations with several ecosystem projects and tools like R, Apache Giraph, Apache Pig, Apache HBase, Apache Mahout, and Apache Mesos.



Databricks and DataStax announced a new partnership to integrate Apache Spark with Apache Cassandra.


MapR released information about their customer growth in Q1 2014. Specifically, they had record bookings, with ~90% of sales coming from subscription licensing. The press release also sheds some light on MapR’s customer base, which they promote as diverse both geographically and across industries.


Forbes.com has an interview with Cloudera CTO Amr Awadallah. The interview covers a lot of ground, including Cloudera’s vision for an Enterprise Data Hub, the Cloudera-Intel deal, vendor lock-in (and how Cloudera thinks about open-source), and competitors’ business models.


Hortonworks and two partners announced that the partner products are certified for HDP 2.1. Specifically, version 5.5 of Rainstor and Syncsort’s DMX-h are now certified.


The Qubole blog has a post of 10 “Hadoop Happenings,” which includes links and one-sentence summaries to several stories from the Hadoop ecosystem. There are both technical and industry-related articles in the post.



Apache HBase 0.94.19 was released, containing a number of bug fixes.


MapR and HP announced general availability of HP Vertica on MapR. The integration, which runs Vertica backed by the MapR FS, was originally announced in February.


Rackspace announced support for a new version of the Hortonworks Data Platform as part of their Cloud Big Data Platform (which is in limited availability). In addition to support for HDP 2.1, Rackspace has created a service to self-provision Data Nodes.


Qubole has announced that their Presto-as-a-Service offering is generally available after 4 months of alpha testing. Unlike many SQL-on-Hadoop implementations, Presto supports queries over data stored in S3 in the AWS cloud. Their blog post notes that Presto shows 2-7.5x speedups over Hive when running on data stored in S3.


Apache Accumulo 1.6 was released this week. The new release includes enhancements across scalability & performance, administration, development flexibility, encryption, and more. The Sqrrl blog has a summary of the release notes, which are pretty epic given that Accumulo 1.6 includes over 650 closed tickets.



Curated by Mortar Data ( http://www.mortardata.com )



Invited Speaker Series: Malware Detection using Spark on MapR (Westlake Village) – Tuesday, May 13

May SF Hadoop Users Meetup (San Francisco) – Wednesday, May 14

Apache Drill: Building Highly Flexible, High Performance Query Engines (Palo Alto) – Wednesday, May 14

Washington State

Spring 2014 Seminar Series: Big Data Infrastructure (Tacoma) – Wednesday, May 14

xPatterns on Spark, Shark, Mesos, & Tachyon (Bellevue) – Wednesday, May 14


Advanced Hadoop Based Machine Learning (Austin) – Wednesday, May 14

Data Curiosity Meetup – Hadoop & High-Performance Computing (Round Rock) – Thursday, May 15


Choosing the Right Data Architecture for Your Big Data Project (Chicago) – Wednesday, May 14


Monthly Meetup – Big Data: Lions, and Tigers and Pig Ohh My (Louisville) – Thursday, May 15


Cleveland Big Data and Hadoop User Group (Cleveland) – Monday, May 12


May HUG Meetup (Pittsburgh) – Wednesday, May 14


Lars George – HBase (Atlanta) – Tuesday, May 13

New York

Workshop for Beginners IV: Getting started with Spark and SparkSQL (Brooklyn) – Friday, May 16


Hadoop presentation with Kim Fowler (Taunton) – Monday, May 12

May Hadoop MeetUp (London) – Tuesday, May 13


Read More…

Hadoop Weekly Issue #68

Hadoop Weekly Issue #68

04 May 2014

There are several articles this week covering deploying Hadoop, including two on integrating Hadoop and Docker. Given how hard it can be to test out Hadoop (let alone deploy to production), it’s always promising to see new tools and systems being used. Videos from Hadoop Summit Amsterdam were posted, and there are several new releases including a Tech Preview of Spark on HDP, and a new version of Impala. Enjoy all of the content to consume and new software to try out!


The Pivotal blog has a post on running the Pivotal HD distribution inside of Docker. By utilizing pre-packaged docker images, it’s very simple to get an environment up and running. The tutorial includes setting up MapReduce as well as HAWQ, the SQL-on-Hadoop system from Pivotal. There are some docker containers for other distributions, so it should be possible to adapt this tutorial to other environments.


Another post on getting a Hadoop cluster going quickly, this time using Puppet to provision virtual machines running in VirtualBox using Vagrant. Specifically, this bootstraps 3 VMs with Apache Ambari, at which point you can use the management software to install and configure the Hadoop daemons. If you want to try out Ambari, this is a good way to do so pretty quickly.


Rather than running Hadoop in Docker, this post discusses some upcoming support for running docker containers inside of YARN. Docker supports pre-baked images that can contain libraries and binaries not found on the host, making it possible to run jobs with vastly different sets of dependencies on the same compute node (akin to virtualization, but with much less overhead). The Register has more details on the integration, including interviews with Altiscale CEO Raymie Stata and Hortonworks’ Arun Murthy.


The Sqrrl blog has a post on recent news related to big data security. It covers HDFS ACLs, Apache Knox, MongoDB 2.6, and Cloudera Search. The post wraps up with details about the security features of Sqrrl Enterprise.


The Cloudera blog has a post on the recently announced python client for Impala, impyla. It contains a walkthrough on the API, including the preview APIs for integrating with scikit-learn and shipping python udfs.
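For flavor, the core of impyla's usage follows the Python DB-API. In the sketch below, the host, port, and table name are placeholders, and the connection code is kept inside a function (not executed) since it needs a running Impala daemon; only the SQL-building helper is exercised standalone.

```python
# Sketch of querying Impala via impyla's DB-API interface.
# The connect/cursor/execute calls follow impyla's DB-API surface;
# host, port, and table names are hypothetical placeholders.

def build_top_words_sql(table, limit):
    # Pure helper so the SQL can be inspected without a cluster.
    return ("SELECT word, COUNT(*) AS n FROM {t} "
            "GROUP BY word ORDER BY n DESC LIMIT {n}").format(t=table, n=limit)

def run_query():
    # Deferred import: requires the impyla package and a live impalad.
    from impala.dbapi import connect
    conn = connect(host="impalad.example.com", port=21050)
    cur = conn.cursor()
    cur.execute(build_top_words_sql("words", 10))
    for row in cur.fetchall():
        print(row)
    conn.close()
```

Because impyla speaks the standard DB-API, the same cursor pattern works with tools that expect a DB-API connection, which is what enables the scikit-learn integration the post previews.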


Apache BigTop is a system for building Hadoop ecosystem components into a cohesive unit, which is used to package most Hadoop distributions. This post walks through how BigTop builds RPM packages for each of the components.


A guest post on the Cloudera blog by WibiData engineer Jonathan Natkins describes how to integrate a custom service into Cloudera Manager. The integration relies on a new feature of Cloudera Manager 5 called custom service descriptors. If you’re using Hadoop ecosystem components not supported by Cloudera with CDH, this offers an opportunity to manage them alongside the Hadoop services.


The DataStax blog has an interesting article explaining how they provision and test Cassandra across multiple data centers and 1000 nodes in the cloud.


The Hortonworks blog is doing a series on resilience/high-availability for the YARN Resource Manager (RM). The first phase of this work is implemented: a mechanism for persisting the state of the RM to a data store (HDFS- and ZooKeeper-backed stores are implemented). Clients must use a new RMProxy library to survive an RM restart.


MortarData has a post about integrating MongoDB and Hadoop. The post includes links to their documentation that describe several strategies for accessing MongoDB data in Hadoop, and there is a video from their CEO describing how to build a recommendation engine with Hadoop and MongoDB.



Videos from Hadoop Summit in Amsterdam in early April have been posted online. The talks cover five tracks, and slides for many of the talks are posted, too.


In a post entitled “Spark on fire,” the DBMS2 blog describes recent Spark news and how companies are deploying Spark. The post notes that Spark 1.0 is expected to be released later this month, and discusses SparkSQL and applications of Spark for machine learning.


Another post on the DBMS2 blog covers Cloudera’s SQL-on-Hadoop positioning. Cloudera supports both Hive and Impala, and it’s not always clear which system should be used for which type of processing (at least in the longer term). It’ll also be interesting to see how Shark and SparkSQL fit into Cloudera’s strategy.


Cloudera and MongoDB have expanded their partnership to include co-marketing and co-selling of each other’s software. There are also plans to support live-snapshotting of MongoDB data to a Hadoop cluster for analysis.


Pepperdata, makers of Hadoop cluster supervision and analysis software, have announced a Series A round of financing totaling $5M. They will use the money to grow their team and further product development.


In the third of three posts this week, the DBMS2 blog enumerates the details (and adds some speculation) on the recent Intel investment in Cloudera. It includes some of the short and medium-term goals of the relationship and specifics on the financial transaction.


ComputerWeekly has an article that explores whether Hadoop should complement or replace a data warehouse. It paints a picture of Hortonworks being in the “complement” camp while Cloudera is in the (eventually) “replace” camp. It also includes quotes from Teradata’s CTO, who doesn’t think that replacing an EDW with Hadoop makes financial sense.


InformationWeek has a story on Datameer’s software, which takes a different approach than other systems. Instead of relying on a SQL-on-Hadoop system to answer queries to power a BI tool, it offers a spreadsheet and visualization tool that operates directly on data in HDFS or another data store.



Hortonworks has announced a Tech Preview of Apache Spark for HDP 2.1. The preview is based on Apache Spark 0.9.1 and Hortonworks has published rpms and debs for installing the software.


Cloudera announced the 1.3.1 release of Impala. The new version includes improvements to memory handling and additional SQL functions.


Apache Tajo 0.8.0 was released. Tajo is a distributed, low-latency SQL engine for Hadoop (as well as additional platforms/data stores). The new release includes a number of new SQL features, improved performance and scalability, added support for new storage systems and formats (including Amazon S3 and Parquet), and much more. The Apache blog has full coverage of the new features.


Apache Kafka was released. This is a bug fix release containing 13 fixes, including a fix for a deadlock.


Radoop 2.0 was released this week from the company of the same name. Radoop integrates predictive analytics tools from RapidMiner with Hadoop.



Curated by Mortar Data ( http://www.mortardata.com )



HBaseConHackathon (San Francisco) – Tuesday, May 6, 2014


An Introduction to Apache HBase, MapR Tables, and Security (Phoenix) – Wednesday, May 7, 2014


Revenue Management and Hadoop, ‘Data Hubs’ & the Data Center Transformation (Boulder) – Thursday, May 8, 2014


Advanced Hadoop Based Machine Learning (Austin) – Wednesday, May 7, 2014


Teradata & The Ohio State University to Present (Dublin) – Tuesday, May 6, 2014

District of Columbia

Big Data Week 2014 Meetup (Washington) – Monday, May 5, 2014


2nd Annual Big Data Breakfast (Columbia) – Tuesday, May 6, 2014

New York

Hadoop Developer Day (New York) – Tuesday, May 6, 2014

Bridging the gap, OLTP and Real-Time Analytics in a Big Data World (New York) – Tuesday, May 6, 2014

Apache Spark – Easier and Faster Big Data + Collaborative Filtering (New York) – Wednesday, May 7, 2014

Intermediate Workshop II: Writing MapReduce Applications (New York) – Friday, May 9, 2014


BigDataCloud Mini Conference 2014 (Bangalore) – Tuesday, May 6, 2014

Introduction to Hadoop (Mumbai) – Wednesday, May 7, 2014

Hadoop by example (Hyderabad) – Saturday, May 10, 2014

Bangalore Baby Hadoop Meetup (Bangalore) – Saturday, May 10, 2014


Special Event: Future of Data – Doug Cutting, Founder of Hadoop (Sydney) – Tuesday, May 6, 2014


SQL for Hadoop (Ontario) – Wednesday, May 7


Read More…