Hadoop Weekly Issue #98


30 November 2014

There’s a lot to cover from the past two weeks, including several new releases: Apache Hadoop 2.6.0, Apache Ambari 1.7.0, and updates from Cloudera, MapR, and Hortonworks. Technical posts from LinkedIn, Pinterest, and Spotify all give a glimpse into the data infrastructure of those companies. There’s also coverage of the Hortonworks IPO and Apache Drill, which graduated from the Apache incubator this week.


LinkedIn has published a technical overview of Gobblin, their data ingestion framework. LinkedIn uses Gobblin to ingest data from Kafka, Databus (database change logs), and external partners in a unified fashion. It’s responsible for things like compaction and privacy compliance. The post promises that in the future Gobblin will be open-sourced.


The Cloudera blog has a post about BigBench, which is a specification-based benchmark for big data systems. Parts of BigBench are derived from TPC-DS, including the data model and ~1/3 of the queries. The post also details the upcoming BigBench 2.0 and includes some experimental results from running BigBench against 3 different server configurations.


The Hortonworks blog has a deeper look at features of the recently-released Hive 0.14. There are a number of notable new features, including ACID transactions, a cost-based optimizer, and temporary tables.


Apache Pig 0.14.0 was recently released with some exciting new features. This post looks at several of those features and improvements: Pig on Tez, OrcStorage, predicate pushdown for load functions, auto-shipping of UDF dependencies, and refactoring of jar artifacts.


The Amazon Web Services blog has a post covering how to build a data ingestion pipeline using Data Pipeline and Elastic MapReduce. The tutorial assumes that web servers publish data to an S3 bucket, after which it describes how to run Pig jobs to clean and filter the data and to generate reports using Hive.


This post covers how to use the Oozie workflow system and the Sqoop data transfer framework to import data from MySQL to Hive. The post has a 3-part walkthrough and a troubleshooting section that describes several common errors.
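For orientation, the Sqoop side of such an import might look roughly like the command assembled below. This is a minimal sketch, not the walkthrough from the post: the host `db.example.com`, database `shop`, table `orders`, user `etl_user`, and password-file path are all hypothetical, and the script only builds and prints the command rather than running it. In an Oozie workflow, the same arguments would typically be passed to a Sqoop action instead of a shell.

```python
import shlex

# All connection details below are hypothetical placeholders.
argv = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/shop",    # JDBC URL of the MySQL source
    "--username", "etl_user",
    "--password-file", "/user/etl/mysql.password",      # keeps the password off the command line
    "--table", "orders",                                # source table in MySQL
    "--hive-import",                                    # load the imported data into Hive
    "--hive-table", "orders",                           # target Hive table
    "--num-mappers", "1",                               # single map task for a small table
]

# Quote each argument so the printed command is safe to paste into a shell.
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
```

Building the argument list explicitly makes it easy to see which options describe the MySQL source and which describe the Hive target.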


The Cloudera Impala team has written a paper describing Impala. It looks at the use cases Impala aims to solve, gives an overview of the system architecture and main components, includes benchmarks with different file formats and compression, describes Impala’s integration with YARN, and provides a comparison to other SQL-on-Hadoop systems as well as a commercial analytics database engine.


This in-depth tutorial looks at using Microsoft Azure’s HDInsight to run a Storm and HBase cluster to perform sensor analysis. The system consumes data from Event Hub and uses ASP.NET SignalR and D3.js to build a dashboard.


Apache Kafka 0.8.2 is currently in beta release with a final release planned for late this year. This post looks at several new features and improvements of the release, which include a new producer API, topic deletion, offset management in Kafka rather than Zookeeper, automated leader rebalancing, controlled shutdown, stronger durability guarantees, and connection quotas.


Spotify has written a post about how they’ve been migrating many of their MapReduce jobs from Hadoop Streaming with Python to Apache Crunch. Apache Crunch is based on Google’s FlumeJava, which has a number of nice properties (type safety, Avro support, simple testing, etc.). The post looks at how Spotify uses Crunch and introduces a new open-source project called crunch-lib, which contains a number of high-level operations.


One of the goals of YARN is to be a general-purpose compute framework for many applications. There have been a few examples so far, but they’re still mostly compute frameworks (e.g. Spark, Storm). This post looks at a new project from LucidWorks for running SolrCloud on YARN. It describes the YARN application lifecycle, how to launch a Solr cluster, and how to shut one down. It’s a great introduction to building a long-running application on YARN.


Among the many SQL-on-Hadoop frameworks is IBM’s Big SQL 3.0. This post compares Big SQL to Cloudera Impala 1.4 and Apache Hive 0.13 in two areas: how much effort is required to port the SQL queries to each engine, and performance. With the caveat that every vendor will try to show their system in the best light, the post claims that Big SQL is 3.6x faster than Impala (5.4x faster than Hive) for a single user at 10TB scale and 2.1x faster than Impala (8.5x faster than Hive) with 4 concurrent streams.


Amazon Web Services has posted some benchmarks for Hadoop workloads using Elastic MapReduce (it looks at 6-node clusters, which is a good start but not a comprehensive overview). They compare the current-generation instances to previous-generation instances as well as running against data stored in S3 and on an instance-backed HDFS. Current-generation instances are faster and cheaper, which makes them more cost-effective in most cases. The new instance types have a fraction of the instance storage, though.


As Hadoop deployments in the cloud become more economical and performant, we can expect to see more folks deploying Hadoop in the cloud in some shape or form. The Hortonworks blog has two posts about a hybrid cloud—the first details common use cases such as backup (to a blobstore like S3 or Azure Blob storage), development, and burst/overflow. The second describes how you can use the Microsoft Azure cloud towards these ends.



Pinterest describes their analytics system which is built on MySQL and HBase with the Flask web framework. For scalability, the system uses HBase with co-processors and secondary indexes. The post gives a high-level overview of how the HBase storage/compute works as well as how they build rolling-window data sets with Cascading jobs.
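To make the rolling-window idea concrete, here is a minimal sketch in plain Python. It is not Pinterest’s Cascading code; the dates, counts, and the function name `rolling_sums` are made up for illustration. It shows only the core computation: turning an ordered series of daily counts into sums over the trailing N days.

```python
from collections import deque

def rolling_sums(daily_counts, window):
    """Yield (day, sum of the last `window` days) for a date-ordered series."""
    recent = deque()   # counts currently inside the window
    total = 0
    for day, count in daily_counts:
        recent.append(count)
        total += count
        if len(recent) > window:
            # The oldest day has fallen out of the window; drop its count.
            total -= recent.popleft()
        yield day, total

# Hypothetical daily event counts.
daily = [
    ("2014-11-01", 5),
    ("2014-11-02", 3),
    ("2014-11-03", 7),
    ("2014-11-04", 2),
]
result = list(rolling_sums(daily, window=3))
print(result)
```

In a batch setting, the same effect is usually achieved by re-aggregating the last N daily partitions on each run rather than keeping running state.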



ZDNet has coverage of Hortonworks’ IPO. The company will trade on Nasdaq under “HDP,” and 6 million shares of stock are expected to be offered at between $12 and $14. The article has some more details on the IPO and the company based on the S-1 filing.


InfoWorld has an analysis of the Hortonworks IPO and the Hadoop industry as a whole. The reporting is based on the data in Hortonworks’ S-1 filing and previous statements about the company’s finances. The article discusses the role that open-source software plays in the revenue and success of any Hadoop vendor.


Apache Drill, the schema-free SQL system which can query data on lots of different systems, has graduated from the Apache incubator. The Apache blog has more details on the project, and GigaOm research has more background on the SQL-on-Hadoop ecosystem and how Drill fits into it.



If you’re looking for some more information on Apache Drill, the MapR blog describes several use-cases that customers are targeting for Drill. These include self-service data (e.g. integration with Tableau or MicroStrategy), data agility (plugging into many data sources), interactive query response time, and ubiquity (integration with several systems such as Spark and MongoDB).


The edX massive online course organization is offering two courses for Apache Spark. “Introduction to Big Data with Apache Spark” will be taught by UC Berkeley professor Anthony Joseph and will start on February 23rd. “Scalable Machine Learning” will be taught by UCLA assistant professor Ameet Talwalkar and will start on April 14th.



Apache Hadoop 2.6.0 was released this week. The new release includes a number of improvements, bug fixes, and new features. These include a (beta) key management server, improved support for heterogeneous storage tiers, (beta) transparent encryption at rest, support for long-running services in YARN, support for rolling upgrades, (beta) support for running applications in Docker containers, and much more. The Hortonworks blog has more details on several of the new features as well as a look forward to the Hadoop 2.7 release.



On the heels of the Apache Hadoop 2.6.0 release, Hortonworks has announced HDP 2.2, which is based on Hadoop 2.6.0 and includes new versions of all core components. This release also adds several systems which were previously in technical preview: Apache Spark, Apache Slider, Apache Kafka, and Apache Ranger. Other major features of the release are support for rolling upgrades, automated cloud backup to Azure and S3, and phase 1 of stinger.next (Hive 0.14.0).


If you’re looking to try out Hadoop 2.6.0, there’s a new Docker image for the release. This post from SequenceIQ shows how to run the container and run an example Hadoop job.


MapR has announced upgrades to several systems included in its distribution. The new versions include Impala 1.4.1, Spark 1.1.0, Pig 0.13, Hive 0.13, and Sqoop 1.4.5.


DataStax has announced DataStax Enterprise 4.6, which provides an enterprise-ready Apache Cassandra and management software. The new version includes Apache Spark streaming analytics and security integration.


Apache Ambari 1.7.0 was released this week. It’s a pretty epic release resolving over 1,600 tickets. The Hortonworks blog has details on improvements and new features, which include improvements across operations, extensibility, and the core platform.


Cloudera announced several patch-level releases this week. Cloudera Enterprise 5.2.1 includes fixes to Oozie, YARN, Impala, Cloudera Manager, and Cloudera Navigator. Several previous releases (Cloudera Enterprise 5.1.4/5.0.5 and some CDH4 releases) were all patched to fix the POODLE vulnerability in SSL. Finally, new ODBC and JDBC drivers for CDH 5.2 were released with support for Hive 0.13, Impala 2.0, and more.




The presto-marathon-docker project provides tools for running Presto (the open-source SQL-on-Hadoop/S3/etc. engine from Facebook) inside of Docker and using Mesos to build an on-demand cluster.



Curated by Mortar Data ( http://www.mortardata.com )



Databricks Spark Meetup (Playa Vista) – Thursday, December 11


December SF Hadoop Users Meetup (San Francisco) – Wednesday, December 10



Apache Drill Intro: Data Exploration and Analytics on Hadoop (Houston) – Wednesday, December 10



Let’s Talk about Hadoop! (Minnetonka) – Wednesday, December 10



Join Us @Chadoopers (Chattanooga) – Thursday, December 11



Hadoop, Pig & Hive (Harrisburg) – Tuesday, December 9


New York

Go Lightning Talks: Apache Kafka; Solving Problems with Bosun (New York) – Tuesday, December 9



Show & Tell: Winter Is Coming Edition (Boston) – Wednesday, December 10


Spark + Cassandra (Waltham) – Wednesday, December 10



What’s the Scoop on Hadoop? (Ottawa) – Wednesday, December 10



Building Data Pipelines (Bristol) – Tuesday, December 9



Mobile Beacons and #IoT with SQL on Hadoop Featuring RSA Analytics (Melbourne) – Tuesday, December 9



Intro to Apache Spark: Paweł Szulc (Warsaw) – Tuesday, December 9


Lightning Talks (Warsaw) – Wednesday, December 10



A Machine Learning Pipeline for Event Detection on Spark (Goteborg) – Wednesday, December 10



How to Think in MapReduce (Cluj-Napoca) – Wednesday, December 10



Apache Flink and Generating Query Suggestions on Hadoop (Hoofddorp) – Thursday, December 11



Apache Spark (Buenos Aires) – Thursday, December 11


Apache Spark and the Current Big Data Landscape (Dublin) – Thursday, December 11



Big Data and Real-time Analytics with Spark (Bangalore) – Friday, December 12


Practical MapReduce Programming Mini-Course (Chennai) – Saturday, December 13


Hadoop Hackathon (Gurgaon) – Saturday, December 13




Hadoop Weekly Issue #97


23 November 2014

Lots of news out of Europe this week as both StrataConf Barcelona and ApacheCon EU took place. In addition to several interesting presentations from those conferences, this week’s issue contains several articles on Spark and YARN. Also, Stripe has open-sourced four Hadoop tools, and there were releases from open-source projects Apache Pig and Apache Chukwa as well as vendor releases by Splice Machine and HP’s Vertica. One quick editorial note—with the short week in the US for Thanksgiving, I’m going to skip next week’s issue and resume with issue #98 on December 7th.


This tutorial describes the steps needed to run an HBase cluster with Microsoft Azure’s HDInsight and how to build a simple application to interact with it. It details how to set up a new Maven application, configure hbase-site.xml, and build an executable jar for running a test against HBase.


This post looks at a number of key configuration parameters relating to YARN memory usage (which is a common area to tweak). It covers configuration for MapReduce, the YARN scheduler, and YARN application managers.
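As a rough illustration of why these parameters interact, the sketch below rounds a container request up to a multiple of the scheduler’s minimum allocation and then computes how many such containers fit on one NodeManager. The property names are real YARN and MapReduce settings, but the values are made-up examples, and the rounding rule is a simplification of the CapacityScheduler’s default behaviour; other schedulers and versions may differ.

```python
import math

# Example values only; real clusters will differ.
conf = {
    "yarn.nodemanager.resource.memory-mb": 8192,    # memory one node offers to YARN
    "yarn.scheduler.minimum-allocation-mb": 1024,   # smallest container the scheduler grants
    "yarn.scheduler.maximum-allocation-mb": 8192,   # largest container the scheduler grants
    "mapreduce.map.memory.mb": 1536,                # memory requested per map task
}

def normalized_request_mb(requested_mb, conf):
    """Round a request up to a multiple of the minimum allocation.

    Simplifying assumption: requests above the maximum are treated as errors.
    """
    step = conf["yarn.scheduler.minimum-allocation-mb"]
    limit = conf["yarn.scheduler.maximum-allocation-mb"]
    rounded = math.ceil(requested_mb / step) * step
    if rounded > limit:
        raise ValueError(f"request of {requested_mb} MB exceeds maximum of {limit} MB")
    return rounded

def containers_per_node(requested_mb, conf):
    """Number of containers of the normalized size that fit on one node."""
    size = normalized_request_mb(requested_mb, conf)
    return conf["yarn.nodemanager.resource.memory-mb"] // size

map_container_mb = normalized_request_mb(conf["mapreduce.map.memory.mb"], conf)
maps_per_node = containers_per_node(conf["mapreduce.map.memory.mb"], conf)
print(map_container_mb, maps_per_node)
```

With these example numbers, a 1536 MB request occupies a 2048 MB container, so a quarter of each container is reserved but never requested. That kind of mismatch is a common reason to tune these settings together.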


This post is a good overview of Spark and how it compares to MapReduce. It describes RDD, describes the computation model (and how it compares to MapReduce), and compares Spark to Cascading/Scalding. The post also explores Spark’s reputation as being in-memory focussed and what it means in terms of performance.
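For readers new to the model being compared against, the MapReduce side can be sketched in a few lines of plain Python. This is a toy, in-process word count that uses neither the Hadoop nor the Spark API; the function names and input lines are made up for illustration. It only shows the three stages (map, group by key, reduce) that a MapReduce job is structured around.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine all values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark and hadoop", "hadoop and yarn"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

A multi-step pipeline in this model is a chain of such jobs, each reading its input from and writing its output to storage.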


Hadoop’s default deployment doesn’t include any sort of authentication – the system trusts that the client is who it says it is. To enforce authentication, Hadoop has support for Kerberos. This gives a quick intro to the key concepts of Kerberos and how they integrate with Hadoop.


This presentation describes HBase’s upcoming support for richer encodings and data types. It looks at the new OrderedBytes API, the DataType API (for structs, unions, etc), and what’s upcoming. There are also examples of using the new APIs.


Twitter has posted about their system for building an index of the entire corpus of tweets. The system relies heavily on Hadoop, both for storage and for aggregating and preparing the data (via Pig jobs).


Apache Samza is a stream processing framework built on YARN. Samza was originally open-sourced by LinkedIn, and they have written about operating Samza (and Apache Kafka) at scale. The post describes their deployment system as well as their metric collection and alerting setup.


This presentation explores the state of Hadoop and RDF. The two projects providing the most complete support are the Apache Jena project, which has experimental modules for RDF in Hadoop, and the Intel Graph Builder, which supports generic graph data via Pig UDFs.


SAMOA: Scalable Advanced Massive Online Analysis is a new system aiming to be the “Mahout of Streaming.” Open-sourced by Yahoo, the project includes support for several backends including Apache Storm, Apache S4, and Apache Samza. The system implements a number of machine learning algorithms, including Distributed Stream Clustering and Vertical Hoeffding Tree Classifier.


A new (alpha) feature of the upcoming Hadoop 2.6 release will be the ability to run Docker containers as part of YARN applications. SequenceIQ has an overview of this feature, and how it can also be used when YARN is already running inside of Docker.


The Altiscale blog has an article about the ubiquity of the Hive metastore. It notes how several new SQL-on-Hadoop systems support it (Impala, Presto, Spark SQL, Drill) and that they do this in order to be compatible with Hive. Both of these turn out to be good things for users—if your data is in HDFS and the Hive metastore, it’s easy to use in many other systems.


The SequenceIQ blog has a post on building a hybrid Hadoop deployment in the cloud by using a permanent cluster and adding ephemeral clusters as needed. They show how to build clusters with their Cloudbreak tool (cloud-agnostic provisioning built on Docker and Ambari) and how to use their other tool, Periscope, to autoscale the ephemeral clusters.



The Call For Abstracts for Hadoop Summit Europe 2015 closes on December 5th. The conference takes place in April in Brussels, Belgium.


MapR, whose distribution is available as an option for Amazon Elastic MapReduce (EMR), has a post with some news on the integration. First, MapR is now supported via EMR on the latest instance families. Second, hourly pricing of the EMR+MapR will be lower for both the M5 and M7 enterprise versions (M3 continues to be free).


MapR and Teradata announced an expanded partnership in which Teradata will provide MapR’s distribution as part of the Teradata Unified Data Architecture.


GigaOm has an article on eHarmony’s infrastructure ambitions around Hadoop and OpenStack. As part of the transition, they’re moving from several instances of a Hadoop appliance to a single YARN cluster as well as bringing up new technologies like Spark and Storm.


Datanami has an article on the rise of Spark. They note that, for the first time, Apache Spark has surpassed Apache Hadoop on Google Trends. The post looks back at the highlights of Spark this year, including the growing number of contributors and the recent sort workload results.


Qubole and Microsoft Azure have announced a new strategic relationship in which Qubole’s Big Data-as-a-Service platform is available on Azure.



Splice Machine has announced general availability of their RDBMS built on Hadoop. Version 1.0 includes support for SQL:2003 (including analytics functions), native backup and recovery, and integration with HCatalog. Splice Machine is positioned as a cheaper alternative to scaling out a traditional RDBMS.


Apache Pig 0.14.0 was released this week. Major highlights of the release include support for Apache Tez as a backend, OrcStorage, and loader predicate pushdown. The release supports Hadoop 0.23.x, 1.x, and 2.x.


Stripe has open-sourced four new projects for Hadoop. The projects are: Timberlake, a dashboard for YARN and MRv2; Brushfire, a system for distributed learning of ensemble tree models inspired by Google’s PLANET; Sequins, a static database backed by SequenceFiles; and Herringbone, a tool for working with Parquet files and Impala/Hive.


HP has announced a new version of their Vertica columnar MPP analytics engine that runs on Hadoop. It supports many major distributions, including MapR, Cloudera, Hortonworks, and Apache Hadoop.


Apache Chukwa version 0.6.0 was released. Chukwa is a system for distributed monitoring and analysis of data in log files. The new release adds support for HBase, deprecates the Chukwa collector, and resolves a number of bugs.



Curated by Mortar Data ( http://www.mortardata.com )



Stream Processing on Hadoop (Saint Paul) – Monday, November 24



All Models Are Wrong, Some Are Useful (Plano) – Monday, November 24



Options for Streaming Analytics on Azure and Azure Batch (London) – Tuesday, November 25



Hadoop Meetup on Cascading/Tez with Concurrent and Hortonworks (Paris) – Tuesday, November 25



November Meetup (Mannheim) – Monday, November 24


Apache Spark (Hamburg) – Wednesday, November 26



What’s New with Apache Spark? An Evening with Paco Nathan (Amsterdam) – Monday, November 24



Offline and Real-time Click Stream Processing (Stockholm) – Wednesday, November 26



Apache Spark: Easier and Faster Big Data (Sydney) – Thursday, November 27



Spark Singapore First Meetup! (Singapore) – Wednesday, November 26




Hadoop Weekly Issue #96


16 November 2014

Big news this week out of Palo Alto as Hortonworks has filed paperwork for an initial public offering. There were also a number of notable releases this week, including Apache Hive 0.14.0. Technical posts cover a large number of ecosystem topics, including Apache Sqoop, Apache Drill, and Apache Pig. There’s a lot of breadth in this issue, so there should be something for everyone!


The Cloudera blog has a guest post from Cerner about integrating Apache Kafka with HBase and Storm for real-time processing. The post describes how adopting Kafka helped reduce load on HBase (which was previously used for queuing) and improve performance. This style of Kafka-based architecture seems to be more and more common, but it’s always interesting to hear how folks are putting together the pieces of the Hadoop ecosystem.


The MapR blog has a post on using the recently-released Apache Drill 0.6.0-incubating to analyze Yelp’s public data set. The data, which is a JSON file, can be queried directly via SQL in Drill without first declaring the data’s schema (Drill auto-detects it). The post has a number of sample queries which you can use to get started analyzing this or any other data set.


The Cloudera blog has a second guest post, this time from Dell, on the new Oracle direct-mode in Sqoop 1.4.5. The post describes several of the implemented optimizations in the Oracle direct mode and includes an analysis of performance improvements the connector provides.


The Hortonworks blog has a post on using Apache Pig with the Python scikit-learn package in order to predict flight delays using logistic regression and random forests. The post is a bit light on details, but there is a linked IPython notebook which has a very detailed overview and description of the entire process. Given that Python is often a data scientist’s top choice for machine learning on small data sets, it’s useful to see how to extend it to larger data sets with Pig.


The ingest.tips blog has a post on Sqoop1 support for Parquet, which leverages the Kite SDK to generate Parquet files during import. The post serves as a good introduction to Sqoop1, which can both import data to HDFS and update the Hive metastore with information about the data. There are examples demonstrating how to use Parquet support.


Tephra is an open-source system that provides globally-consistent transactions for Apache HBase. Cask, the makers of Tephra, have written a blog post describing the requirements and design of Tephra. Tephra is designed in such a way that it can be used with systems other than HBase, and it is even designed to support transactions spanning multiple data stores.


This presentation focusses on Spark streaming, the micro-batch component of Apache Spark. The slides give an introduction to both Spark and Spark streaming, describe several use cases (claiming there are 40+ known production use cases), give an overview of several integrations (Cassandra, Kafka, Elastic Search, and more), and look ahead to some upcoming features and improvements in the development pipeline.



Hortonworks has filed paperwork for their initial public offering this week. The filing includes a number of details on the company, including financial numbers ($33.4M in revenue so far in 2014), an overview of key company milestones, and number of employees (524 at the end of September). GigaOm has an analysis of some of these numbers and an overview of what the IPO means for the rest of the industry.



IBM’s Big Data for Social Good Challenge opened this week. The challenge includes $40k in prizes, which will be awarded by a panel composed of IBM and industry experts. IBM has a curated list of datasets which can be used as part of a challenge entry.



Apache Drill 0.6.0-incubating was recently released. 0.6.0 is the second beta release, primarily containing bug fixes. Notable new features include ANSI SQL support for MongoDB, partition pruning, and (alpha) window function support.


Cubert is a new open-source tool from LinkedIn for writing high-performance MapReduce jobs. It provides a new language on the same level as Pig or Hive (sharing some resemblance to Pig) as well as a novel storage format/layer called blocks. For statistical calculations, graph computations, and OLAP cubes, Cubert offers impressive performance improvements. There’s a lot more information in the introductory blog post.


Apache Hive 0.14.0 was released this week. The release resolves over 1,000 (!) Jira issues. I’m sure we’ll soon hear more details about the release in blog post form but some quick highlights include: support for insert/update/delete with ACID support, a cost-based optimizer, support for data stored in Accumulo, support for HBase snapshots, and many improvements to ORCFile and HiveServer 2.


Pivotal Cloud Foundry (CF) has added support for deploying Cassandra via DataStax Enterprise. The blog post introducing the feature has many more details as well as an example of setting up a cluster.


Version 0.4.1 of the Spark Job Server has been released. The new version supports Spark 1.1.0 and has improvements for deployment/configuration.



Microsoft released version 2.5 of the Azure SDK and a preview of Visual Studio 2015. The releases contain support for HDInsight (the Hadoop as a Service component of Azure) including a Hive query editor and job viewer.



Curated by Mortar Data ( http://www.mortardata.com )



Data Exploration in Spark (San Francisco) – Tuesday, November 18


Getting Started with Spark and Scala, by Paul Snively of Verizon OnCue (El Segundo) – Tuesday, November 18


OCBigData Monthly Meetup #7 (Irvine) – Wednesday, November 19


49th Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) – Wednesday, November 19


HBase Meetup @ WANdisco (San Ramon) – Thursday, November 20



Unlocking Your Hadoop Data with Apache Spark and CDH5 (Seattle) – Wednesday, November 19



MapR Presents Apache Drill: Self-Service Data Exploration (Portland) – Wednesday, November 19


Apache Spark: Setup, Overview, and Comparison (Portland) – Wednesday, November 19



Scalable In-Hadoop ETL Execution: Pentaho’s Visual MapReduce (Overland Park) – Wednesday, November 19



Securing the Hadoop Cluster (Saint Louis) – Tuesday, November 18



Hadoop Like a Champion! (Austin) – Tuesday, November 18


Spark and Cassandra: Building and Deploying an Application (Austin) – Thursday, November 20



Hadoop Lunch at Adobe (Lehi) – Thursday, November 20



Hadoop Tutorial: Map-Reduce on YARN, Part 1 (Sterling) – Saturday, November 22



Understanding the Foundations of Hadoop (Philadelphia) – Tuesday, November 18


North Carolina

Triangle SQL Server UG Meeting (Raleigh) – Tuesday, November 18


Automating Customer Intelligence Management in Hadoop (Charlotte) – Wednesday, November 19


When to Use Pig instead of Hive (Winston Salem) – Thursday, November 20


New Jersey

YARN + Docker Containers: Integration and Privilege Isolation (Hamilton Township) – Wednesday, November 19


New York

Privilege Isolation in Docker Containers (New York) – Thursday, November 20



SQL on Hadoop: Hands-on (Boston) – Wednesday, November 19



November 2014 Hadoop Meetup (London) – Monday, November 17



Analyzing Real-World Data with Drill, Hadoop & MongoDB | Tomer Shiran, MapR (Singapore) – Monday, November 17



Apache Cassandra, Apache Spark, and Hadoop Meetup (Munich) – Tuesday, November 18


Patrick McFadin Talks C* & Spark for Time Series, plus A Leap Forward for SQL on Hadoop (Berlin) – Wednesday, November 19



Patrick McFadin Talks Cassandra, Spark, Tips and Tricks (Amsterdam) – Friday, November 21



Big Data Meetup, ApacheCon Edition (Budapest) – Tuesday, November 18



Drilling in on SQL and Hadoop (Melbourne) – Wednesday, November 19



Databricks Comes to Barcelona (Barcelona) – Thursday, November 20



Big Data Meetup (Bangalore) – Friday, November 21


Hadoop Workshop (Hyderabad) – Saturday, November 22




Hadoop on Google Cloud Platform

Google’s Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google’s Compute Engine provides the compute power and Cloud Storage is used to store the input and output of the MapReduce jobs. https://cloud.google.com/solutions/architecture/hadoop   […]


Hadoop Weekly Issue #95


09 November 2014

This week’s issue has great technical content, including articles about data infrastructure ranging from small companies, Buffer and Asana, to a large company, Facebook (and their big data challenges). There’s also coverage of a diverse set of topics related to YARN: Kafka on YARN, a comparison of YARN and Mesos, and the YARN timeline server. In industry news, Databricks’ recent sort benchmarking results have earned a tie for first place in this year’s Daytona GraySort contest.


The Buffer developer blog has a post on how they’ve evolved their analytics data infrastructure from just Mongo and Amazon SQS to also include Hadoop and Redshift. They use Mortar’s Hadoop-as-a-Service to run Pig scripts which load data from Mongo to S3 to Redshift. Luigi, the open-source Hadoop workflow engine from Spotify, is used for orchestration.


Facebook recently posted about several data problems that the company is facing. The look at big data challenges gives you a flavor for Facebook’s data sizes/volumes and internal systems (several powered by Hadoop). Among the problems are those faced by many folks working with big data infrastructure (e.g. how to sample data, which types of compression to use) and some which are unique to large-scale companies (e.g. distributing a data warehouse across data centers).


The Cloudera blog has a post on using Spark Streaming for doing near-real-time session analysis. The post includes an example job which feeds data into HBase to power BI tools via the Hive adapter. The code for this system is available on GitHub, and the post has a detailed look at what the major parts of the example Spark Streaming job are doing.


This post looks at the relationship between YARN and Mesos. There’s a fairly direct mapping between major components (e.g. YARN ResourceManager ~ mesos-master with meta-scheduler), but resource allocation is different in the two systems (Mesos is push-based, YARN is pull-based).


Hortonworks has posted a video, slides, and a Q&A from a recent webinar on the new features and improvements in Hive as part of HDP 2.2. The new features in this version (which includes the first set of deliverables from stinger.next) include support for insert/update/delete and the cost-based optimizer.


This post shows how to deploy the YARN Timeline Server using Apache Ambari blueprints. The timeline server is still a work in progress, but you can get an idea of what types of information it currently supports with the screenshots linked to in the post.


DataTorrent has blogged about a new project to bring Apache Kafka to YARN. The so-called KOYA (Kafka on YARN) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully-HA application master, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more. The post invites folks in the community to help build KOYA.


O’Reilly Radar has a post on schemas for data. It discusses why it’s tempting to use formats with implicit schemas (e.g. JSON, CSV), the benefits of schema, and why Apache Avro is a good solution. There’s a bit of detail on Avro and its file format, which stores the schema with the data.
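The idea of storing the schema with the data can be illustrated with a small sketch. The schema below is written in Avro’s JSON schema syntax, but the record name, fields, and the `write_file`/`read_file` helpers are hypothetical, and the “file” is a toy text container rather than Avro’s actual binary format. The point is only that a reader can recover the structure of the data from the file itself.

```python
import json

# Hypothetical record schema, written in Avro's JSON schema syntax.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}

def write_file(schema, records):
    """Toy container: the first line holds the schema, the rest hold records."""
    lines = [json.dumps(schema)] + [json.dumps(record) for record in records]
    return "\n".join(lines)

def read_file(text):
    """Recover both schema and records with no external schema definition."""
    header, *body = text.split("\n")
    return json.loads(header), [json.loads(line) for line in body]

data = write_file(schema, [{"user_id": 1, "url": "/home", "referrer": None}])
read_schema, read_records = read_file(data)
field_names = [field["name"] for field in read_schema["fields"]]
print(field_names, read_records)
```

Contrast this with a bare CSV or JSON dump, where field names, types, and optionality have to be inferred or documented somewhere else.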


The Cloudera blog has a post on the role of HBase in the Hadoop ecosystem. It discusses when it’s more appropriate to use Cloudera Impala (or any MPP engine atop HDFS) vs. HBase. Often times folks end up duplicating the data between systems, which leads to overhead and questions about the source of truth.


Mortar Data has posted a video (and slides) of a presentation by Mayur Rustagi of Sigmoid Analytics on the Pig-on-Spark initiative. The presentation is from the NYC Pig User Group meetup that took place during Strata + Hadoop World.


Asana has written about the evolution of their data infrastructure and the tools that they’re using. Like Buffer, Asana is loading data into Redshift and is using Luigi for managing dependencies. They are also using Elastic MapReduce. The post walks through their philosophy for building data infrastructure: mainly, don’t over-engineer things from the beginning.


The Cloudera blog has a post about integrating Flume with Kafka. On the Kafka -> Flume side, the integration allows you to deploy Kafka and serialize data to HDFS, HBase, or any other Flume sink without writing any custom code. The integration also supports Flume -> Kafka, in which case a local agent can buffer data. The post also describes upcoming work on a Kafka Channel for Flume.


Amazon recently announced a new Linux AMI, version 2014.09. While it’s not yet the default AMI for Elastic MapReduce, it offers a lot of compelling features for building a Hadoop (or other big data) cluster in AWS. Those features come via the 3.14.19 Linux kernel, which includes improvements for memory management (zram, zcache, zswap), TCP (fast open enabled by default), and btrfs. This post discusses how those improvements might enhance the performance of different systems in the Hadoop ecosystem.



GridGain, makers of an in-memory “data fabric,” have submitted their code to the Apache Incubator. The new project is known as Apache Ignite (incubating). In the announcement, GridGain touts it as a mature in-memory computing platform that can easily integrate with Hadoop.


In a follow-up to the earlier post on sorting 100TB and 1PB with Apache Spark, Databricks announced that their entry to the 2014 Daytona GraySort contest has tied for first place.


MapR and MongoDB have announced that the MongoDB Connector for Hadoop is certified for the MapR distribution.


Datanami has a report on the state of security for Hadoop. While a number of new projects have cropped up to add authorization, authentication, and encryption to the ecosystem, these are still pretty immature. Commercial add-ons are looking to fill this security gap. Datanami speaks with folks from Dataguise and Zettaset about the state of commercial support.


A trio of LinkedIn veterans who have worked on Apache Kafka and other data infrastructure projects have started a new company called Confluent. They will be focussing on Kafka and realtime data and have publicly committed to continue to work on Kafka (and potentially other tools, too) in open-source. There are more details about the new company in a post on LinkedIn.



Salesforce has introduced the Data Pipelines pilot for running Apache Pig queries against Salesforce data using the Salesforce platform. This post is a brief introduction and tutorial to the system.


Scoobi, the Scala API for MapReduce, has released version 0.9.0. The release includes support for Scala 2.11, improvements to serialization (WireFormats), fixes for EMR/S3, and more.


Plunger is a new open-source tool from Hotels.com for unit testing Cascading pipelines. The GitHub project README has several code examples of the API. The framework provides a number of utilities for testing (such as pretty printing data and testing serializers).


Amazon has announced support for HUE as part of Elastic MapReduce. It includes first-class support for data stored in S3.



Curated by Mortar Data (http://www.mortardata.com)

Spark + Cassandra: Technical Integration (O’Reilly Media Webcast) – Wednesday, November 12




Diving into Spark Internals + Kafka and akka (San Jose) – Monday, November 10


Cascading: A Java Developer’s Companion to the Hadoop World (San Francisco) – Tuesday, November 11


November SF Hadoop Users Meetup (San Francisco) – Thursday, November 13


#SDBigData Monthly Meetup (San Diego) – Wednesday, November 12


Twofer: Mac Moore of Gridgain & Dale Kim of MapR (Santa Monica) – Wednesday, November 12



Trafodion: Transactional SQL-on-HBase, by Rohit Jain (Houston) – Monday, November 10



Lighting a Spark under Cassandra and Elasticsearch (Boulder) – Tuesday, November 11



Securing Hadoop: What Are Your Options? (Chicago) – Wednesday, November 12



Michigan Hadoop User Group Initial Meetup (Southfield) – Monday, November 10



The Scoop about Hadoop. What Is It? How to Begin? (Harrisburg) – Tuesday, November 11



Join Us for the Kick-Off Meeting at Society of Work (Chattanooga) – Thursday, November 13



Hadoop: A Look under the Hood (West Hartford) – Tuesday, November 11



Big Data: Unconference (Toronto) – Friday, November 14



How Secure Is Your Entire Hadoop Cluster? (Manchester) – Tuesday, November 11


Hadoop, R, Spark, and the Reverend Bayes (London) – Tuesday, November 11


5th Spark London Meetup (London) – Tuesday, November 11



PySpark: Real-time Large-scale Data Processing with Python and Spark (Berlin) – Tuesday, November 11



How Apache Spark Fits in the Big Data Landscape (Stockholm) – Thursday, November 13



BigData and Analytics: Why to Learn Hadoop (Hyderabad) – Wednesday, November 12


What Is Big Data? What Is Data Science? What Is Hadoop? (Hyderabad) – Saturday, November 15


Our First Meetup (Pune) – Saturday, November 15


Apache Spark and the Power of In-memory Computation (Bangalore) – Saturday, November 15




MapR Partners with MongoDB, Certifies MongoDB Connector for Hadoop

MapR Technologies, Inc., provider of a leading distribution for Apache™ Hadoop®, has announced a partnership with MongoDB, a next-generation database that helps businesses transform their industries by harnessing the power of data. As part of this agreement, MongoDB has certified its Connector for Hadoop with the MapR Distribution. Certification with the MapR Distribution including Apache […]


Hadoop Weekly Issue #94

02 November 2014

Hadoop in the cloud (both open and public) is a big topic again this week. There are articles on Hortonworks’ HDP in the Microsoft Azure cloud, Cloudera’s new cloud provisioning tool Cloudera Director, OpenShift, and SequenceIQ’s Cloudbreak. Also, there are several articles this week on Hadoop adoption, which seems to be limited by maturity of enterprise features. Finally, Kafka released version 0.8.2-beta this week, and a new project aims to provide higher throughput from Kafka for MapReduce jobs.


As the Hadoop ecosystem of projects grows and folks are using it in many different ways, integration between projects and consistency across projects are both important parts of usability. This article highlights several ways that the Hadoop ecosystem could improve along those lines. It’s just the tip of the iceberg—hopefully these things get better as Hadoop matures.


In the first part of a three-part series on HBase, this post presents an introduction to HBase’s data model and architecture. It also contains instructions on setting up a local HBase and interacting with it using the HBase shell.


The Cloudera blog has a post on integrating the Kite SDK with OpenShift. Specifically, the Kite SDK has tooling for running mini clusters (HDFS, Hive, Flume, HBase, ZooKeeper) either in-process for testing or locally via the command line. The post introduces these tools and describes work to add support for running a mini cluster via OpenShift to the command-line tools.


Hortonworks has posted a recording of and slides from a recent webinar on Apache Knox and Ranger, which are the main enterprise security products in their distribution. In addition, the post includes several questions and answers related to the offering. For anyone interested in enterprise security, this is a good overview of the current state of Hortonworks’ offerings.


While not directly related to Hadoop, this post summarizes a recent paper out of Facebook on their f4 BLOB storage system. The review notes that f4 is built atop HDFS, and it describes how it gets around several HDFS limitations (namely adding cross-data center replication and using erasure coding to decrease replication factors). Definitely one of the more technical posts linked in this newsletter, but it’s quite interesting.
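To make the erasure-coding point concrete, here is a minimal arithmetic sketch of why a coded layout stores fewer bytes than plain replication. The parameters (3-way replication, and a code with 10 data blocks plus 4 parity blocks) are illustrative assumptions rather than figures taken from the linked review, and the function names are hypothetical.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Bytes stored per byte of user data with plain N-way replication."""
    return Fraction(copies)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Bytes stored per byte of user data for a (data, parity) erasure code."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


# Illustrative parameters (assumptions, not taken from the post).
three_way = replication_overhead(3)
coded = erasure_overhead(10, 4)

print(f"3-way replication stores {float(three_way):.1f}x the raw data")
print(f"(10, 4) erasure coding stores {float(coded):.1f}x the raw data")
```

Under these assumed parameters the coded layout stores 1.4 bytes per byte of data instead of 3, which is the sense in which erasure coding lowers the effective replication factor.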


Label-based scheduling is a system for tagging resources in a heterogeneous cluster and supplying boolean rules for scheduling jobs against these resources. The MapR blog has an overview of this feature for MapR’s distribution, including a description of how it integrates with the Capacity Scheduler and Fair Scheduler. The community is looking to add a similar implementation to core Hadoop as part of YARN-796.
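The general idea of matching boolean label rules against tagged nodes can be sketched in a few lines. This is only an illustration of the concept: the rule representation (nested tuples), the node names, and the label names are all hypothetical, and it does not reflect MapR's label expression syntax or the YARN-796 design.

```python
from typing import Dict, List, Set, Tuple, Union

# A rule is either a label name or a tuple:
# ("and", a, b, ...), ("or", a, b, ...), or ("not", a).
Rule = Union[str, Tuple]


def matches(rule: Rule, labels: Set[str]) -> bool:
    """Return True if a node carrying `labels` satisfies `rule`."""
    if isinstance(rule, str):
        return rule in labels
    op, *args = rule
    if op == "and":
        return all(matches(arg, labels) for arg in args)
    if op == "or":
        return any(matches(arg, labels) for arg in args)
    if op == "not":
        (arg,) = args
        return not matches(arg, labels)
    raise ValueError(f"unknown operator: {op!r}")


def eligible_nodes(rule: Rule, nodes: Dict[str, Set[str]]) -> List[str]:
    """Names of the nodes a job with this rule could be scheduled on."""
    return sorted(name for name, labels in nodes.items() if matches(rule, labels))


# Hypothetical cluster: node names and labels are made up for illustration.
nodes = {
    "node1": {"ssd", "highmem"},
    "node2": {"ssd", "gpu"},
    "node3": {"spinning"},
}

rule = ("and", "ssd", ("not", "gpu"))
print(eligible_nodes(rule, nodes))  # ['node1']
```

A real scheduler would combine a filter like this with queue capacity and fairness decisions; the sketch covers only the label-matching step.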


Spark vs. Tez has been a point of contention for a while now. Spark has gained momentum recently with several companies (including MapR and Cloudera) committing to it. Hortonworks, the main proponents of Tez, continue to tout Tez with a prototype implementation of the Spark API using Tez as the backend. In other words, it’s Spark on Tez on YARN (with data in HDFS). There is a discussion of the prototype and some benchmarks (as always, beware of vendor benchmarks—they’re typically not representative of your own workload) on the Hortonworks blog.


This post shows how to launch a HDP cluster on the Microsoft Azure cloud. Azure has a wizard for building both small-scale evaluation clusters and standard clusters (which have up to 45 worker nodes).


In another Hadoop-in-the-cloud post, Cloudera has an introduction to the new Cloudera Director for deploying CDH clusters in the cloud (supporting AWS initially). The post describes the data model, the server API, the user interface, and the client.


This post introduces the Cloudbreak shell. Cloudbreak is a Hadoop as a Service system for deploying Hadoop clusters in the cloud. The post walks through setting up the command line tools and provisioning a Hadoop cluster.


MapR has a video (and transcript) of a whiteboard presentation comparing and contrasting Spark Streaming and Storm Trident. Both systems are micro-batching streaming frameworks. The presentation covers fault tolerance, ease of deployment, compatibility with YARN, and more.



Databricks and Hortonworks have announced an expanded partnership. As part of the expansion, the two companies are working on helping customers, engineering (namely enterprise features like security), and open source. Cross posts on the Hortonworks and Databricks blogs have takes from both companies on the expanded partnership.



EnterpriseTech has a post on the growth and adoption of Hadoop. It cites industry research and surveys as well as interviews with Hadoop vendors. The key takeaway seems to be that enterprise adoption isn’t quite there yet (only about 2,000 production deploys of Hadoop) but is on the verge of hockey stick growth.


SDTimes has an in-depth look at the Hadoop ecosystem. The article explores the various applications of Hadoop, its costs, tooling/support for ad hoc queries, language and library support for data science, security, and more.


SearchDataManagement also has a post about Hadoop adoption. This article interviews consultant and author Joe Caserta, who has been a bit surprised with the lack of adoption of Hadoop. The Q&A strives to explain why—maturity, support for interactive queries, and data governance are among the reasons.



Kangaroo is a new open-source project from Conductor for writing MapReduce jobs consuming data from Kafka. The introductory post explains Conductor’s use case—loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism.
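The benefit of several splits per partition can be illustrated by carving one partition's offset range into fixed-size chunks, each of which could be read by its own consumer. This is a sketch of the idea only, not Kangaroo's API: the `OffsetSplit` type, the `split_partition` function, and the `max_messages` parameter are hypothetical names.

```python
from typing import List, NamedTuple


class OffsetSplit(NamedTuple):
    partition: int
    start: int  # first offset to read (inclusive)
    end: int    # offset to stop before (exclusive)


def split_partition(partition: int, start: int, end: int,
                    max_messages: int) -> List[OffsetSplit]:
    """Divide the offset range [start, end) into splits of at most max_messages."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    return [
        OffsetSplit(partition, lo, min(lo + max_messages, end))
        for lo in range(start, end, max_messages)
    ]


# One partition with 2,500 unread messages becomes three independent splits.
for split in split_partition(partition=0, start=1000, end=3500, max_messages=1000):
    print(split)
```

With one split per partition, parallelism is capped by the partition count; with ranges like these, the number of map tasks can scale with the backlog instead.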


Amazon Web Services has updated their Amazon Kinesis Storm Spout in order to support Storm’s Ack/Fail semantics (the spout can re-emit messages). They’ve also published a white paper with a reference architecture.
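Ack/fail semantics boil down to the spout remembering each emitted message until the topology confirms it, and re-emitting it on failure. The toy class below sketches that bookkeeping in plain Python. It is not the Storm or Kinesis spout API; the class and method names are hypothetical stand-ins.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, Optional, Tuple


class ReplayingSpout:
    """Toy message source that replays records whose processing failed."""

    def __init__(self, records: Iterable[Any]) -> None:
        self._queue: Deque[Tuple[int, Any]] = deque(enumerate(records))
        self._pending: Dict[int, Any] = {}  # emitted but not yet acked

    def next_tuple(self) -> Optional[Tuple[int, Any]]:
        """Emit the next record with its message id, or None if idle."""
        if not self._queue:
            return None
        msg_id, record = self._queue.popleft()
        self._pending[msg_id] = record
        return msg_id, record

    def ack(self, msg_id: int) -> None:
        """Downstream processing succeeded: forget the record."""
        self._pending.pop(msg_id, None)

    def fail(self, msg_id: int) -> None:
        """Downstream processing failed: queue the record for re-emission."""
        if msg_id in self._pending:
            self._queue.append((msg_id, self._pending.pop(msg_id)))


spout = ReplayingSpout(["a", "b"])
first = spout.next_tuple()     # emits (0, "a")
spout.fail(first[0])           # "a" goes back on the queue
second = spout.next_tuple()    # emits (1, "b")
spout.ack(second[0])
replayed = spout.next_tuple()  # emits (0, "a") again
spout.ack(replayed[0])
print(first, second, replayed)
```

Note that replay gives at-least-once delivery: a record can be processed more than once, so downstream steps need to tolerate duplicates.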


Apache Kafka 0.8.2-beta was released. The new version contains a new Java producer, support for deleting topics, Scala 2.11 support, and a new configuration option to prefer consistency over availability.


As the repository name suggests, this is a project for building a Docker image that allows running Hive on Tez. The project README has details on building the image, running it, testing it with some built-in scripts, and more.


Kylin is the recently open-sourced OLAP system from eBay. SequenceIQ has a Docker image for running Kylin, which includes support for Apache Ambari for managing the cluster.


Version 4.0 of Platfora, the analytics platform built with Hadoop and Spark, was released. The new release has new visualization and geo-analytics tools as well as insight delivery for sharing visualizations over email.



Curated by Mortar Data (http://www.mortardata.com)



Introducing Apache Flink: A New Approach to Distributed Data Processing (Palo Alto) – Tuesday, November 4


State of Apache HBase, 1.0 Release, by Nick Dimiduk of Hortonworks (Los Angeles) – Thursday, November 6



Pivotal Business Data Lake (Tempe) – Wednesday, November 5



November Meetup: Clickstream Data Monetization Using Datameer (Fort Worth) – Thursday, November 6



Spark Gotchas and Anti-Patterns, plus Julia Language (Broomfield) – Wednesday, November 5



Unit Testing with Hadoop, plus Spark and Storm (Mayfield Village) – Monday, November 3


North Carolina

ORM for HBase (Durham) – Tuesday, November 4


IBM’s Hadoop Integration with SAS Analytics: Using Hive (Durham) – Thursday, November 6


New Jersey

Fraud Detection = Spark + memSQL (Hamilton Township) – Tuesday, November 4



Data at a SaaS Company (Melbourne) – Wednesday, November 5



Offline and Real-Time Click Stream Processing (Amsterdam) – Thursday, November 6



Introduction to the Hadoop Ecosystem + Forming the HUG (Oslo) – Thursday, November 6



Hadoop Meetup @ IBM EGL (Bangalore) – Friday, November 7


Hadoop Hands-on/Demo, plus Big Data Industry Trends and Opportunities (Chennai) – Saturday, November 8


