Hadoop Weekly Issue #93

26 October 2014

Given the torrent of Strata + Hadoop World news last week, it’s no surprise that this week’s edition is a bit shorter than normal. With that said, the amount and quality of technical content in this edition is above average—posts on Storm, HBase, HDFS metadata, Docker-in-YARN, and much more. Security is also a hot topic this week—in addition to technical posts, Cloudera announced that Cloudera Enterprise has achieved PCI compliance.


The Yahoo Storm team has written about Storm at Yahoo. The post describes the history of Storm’s adoption at Yahoo and some early products powered by it. It then describes a number of improvements that Yahoo made (including a netty-based messaging setup, several security features, multi-tenancy, and Storm-on-YARN). Finally, there are some notes on new features in the works.


This post shows how to use Cascading to run a TopK query with the recently added Tez backend. The post has code examples, which it walks through in detail, and the full code is available on GitHub.
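For readers unfamiliar with the term, a TopK query returns the k most frequent (or highest-scoring) items in a dataset. The sketch below shows that computation in plain Python on a small in-memory list. It is not Cascading or Tez code, and the function name `top_k` and the sample data are made up for illustration.

```python
import heapq
from collections import Counter

def top_k(records, k):
    """Return the k most frequent items as (item, count) pairs, highest count first."""
    counts = Counter(records)
    # heapq.nlargest avoids sorting every distinct item; ties on count are
    # broken by item name so the result is deterministic.
    return heapq.nlargest(k, counts.items(), key=lambda pair: (pair[1], pair[0]))

words = ["tez", "hive", "tez", "pig", "hive", "tez"]
result = top_k(words, 2)
print(result)  # [('tez', 3), ('hive', 2)]
```

A distributed engine does the same thing in two stages: count per key in parallel, then merge the partial results and keep only the top k.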


The Hortonworks blog has a post describing how to configure the HBase REST server, HiveServer, WebHDFS, and Oozie with SSL encryption and certificates. Once these services are configured for SSL, Apache Knox can then be configured to talk to the services over SSL, providing end-to-end encryption. The post has a lot of low-level details (including keytool commands and config file options) for this setup.


A post on the ingest.tips blog has an overview of the major differences between Sqoop 1 and Sqoop 2. Whereas Sqoop 1 is a standalone tool, Sqoop 2 is a client-server architecture with a management UI and command shell. The post also describes the status of Sqoop support in Oozie.


This post describes a system for analyzing trading data in real-time and in batch using several big data tools. The system ingests data in real-time into Kafka, implements a rule engine with Storm, stores data for dashboarding and visualization in Cassandra, and uses Hive to perform batch analysis on data ingested from Kafka to Hadoop using Camus. The source code for the project (called wolf) is available on GitHub.


HTrace is an open-source library from Cloudera for distributed tracing inspired by Google’s Dapper paper. It’s used for finding bottlenecks in RPCs and distributed systems with low overhead. This presentation gives an overview of the tracing model, how to enable it with HBase, and more.


This presentation covers the upcoming Apache HBase 1.0 release. The talk covers the history of HBase, gives a brief introduction to the architecture, describes some major changes for the 1.0 release (co-locating Meta with Master, Region Replicas for improved availability, and more), and describes the upgrade path to 1.0 from previous versions (noting that Hadoop 1.x and Java 6 are not supported).


This post is an in-depth description of the various files that the HDFS NameNode and JournalNodes maintain to store HDFS metadata. A lot of things have changed in the setup with the introduction of HA NameNode, so it’s quite useful if you’re only familiar with the previous implementation. In addition to an overview of all the files, there’s also a description of several commands and settings related to HDFS metadata.


Kubernetes-YARN is a new project (currently in prototype/alpha) to provide a mechanism for running Docker containers (via the Kubernetes container cluster manager) alongside YARN applications. This introductory post describes the architecture and provides a walkthrough on a Vagrant-based single-node cluster in which an nginx Docker container is run on the YARN cluster.


Cloudera has two posts focussing on new features in the recently released CDH 5.2. The first post provides an introduction to Kerberos and LDAP, describes how they’re integrated into Impala, and shows how to set up Impala to run in a secure environment with LDAP and Kerberos enabled. The second post is about several new features in Hue, including a new Security app and improved dashboards for Search and Oozie.



In celebration of the 1-year anniversary of the release of Apache Hadoop 2.2.0, Xplenty has a three-part blog post on YARN. The posts look at some of the challenges in upgrading to YARN from pre-YARN Hadoop, the “renaissance” of Hadoop (i.e. the plethora of new projects, particularly SQL-on-Hadoop), and the rising popularity of Apache Spark.





This post looks at how American Express is using Hadoop for several new products and services. Hadoop applications are used to analyze transaction and social data in both real-time and in batch. The post includes details on their Hadoop deployment, which is built on 2U servers with 24 disk bays and dual-10GbE networking. They have also shared some performance numbers from a TeraSort run on a 255-machine chunk that was added to the cluster, in which they sorted a terabyte in 45s (this was in 2013). Many more details in the article.


The Insight Data Engineer Fellows Program is a six-week program to help engineers gain experience with data engineering technologies and meet folks from industry. Applications for the next session are due tomorrow, October 27th.


Videos of all keynotes from the recent Strata + Hadoop World have been posted on YouTube. There are also interviews with a number of folks in the Hadoop industry.


Xplenty, makers of a data integration platform built on Hadoop, have announced that they’ve raised $3 million in series A financing. Datanami has more details on Xplenty and their product.


Datanami has two posts about last week’s Strata + Hadoop World. The first covers the keynote by Cloudera’s chief strategy officer, Mike Olson, in which he predicted Hadoop’s disappearance (i.e. people won’t spend as much time in the weeds getting the tech to work, they’ll focus on applications). The second post covers several announcements from the conference from the likes of Cray (a new Hadoop appliance), Revolution Analytics, Pentaho, and more.



Cloudera announced that Cloudera Enterprise has been “fully certified as compliant with Payment Card Industry (PCI) Data Security Standards.” The first company using the certified product is MasterCard.


Spark Summit East is taking place in New York City on March 18th and 19th, 2015. The call for presentations is open until December 5th.



Amazon Web Services has supported Spark on Elastic MapReduce (EMR) for over a year by way of a bootstrap action to install the software on the cluster. They’ve recently added support for Spark 1.1.0 on Hadoop 2.4.0 with the Hadoop AMI version 3.1+.



Curated by Mortar Data ( http://www.mortardata.com )



HBase Meetup @ 4 Infinite Loop (Cupertino) – Monday, October 27


Galvanize Data Science Launch (San Francisco) – Wednesday, October 29


RHadoop – Scaling the R Language for Big Data Analysis (San Ramon) – Wednesday, October 29


Introducing Apache Flink, a New Approach to Distributed Data Processing (Pasadena) – Wednesday, October 29


Hadoop as a Service: Is the Market Now? Is Hadoop Ready for the Cloud? (Sunnyvale) – Thursday, October 30



CloudBreak: Hadoop on Docker (Ballwin) – Saturday, November 1



Hadoop Security: Managing Big Data in a Dangerous World (Urbana) – Tuesday, October 28



Indy Big Data Monthly Meetup (Carmel) – Wednesday, October 29



Spark Bake-Off (McLean) – Thursday, October 30


North Carolina

RDBMS on Hadoop? Talk & Hands-on Session from Splice Machine (Charlotte) – Wednesday, October 29



Show and Tell Night (Cambridge) – Tuesday, October 28


Big Data (Part 1): Overview (Plymouth) – Thursday, October 30



October Meetup (Mannheim) – Monday, October 27



Managing Data in a Hybrid Hadoop & RDBMS Environment (Brisbane) – Wednesday, October 29



Bucharest HUG, October Meetup (Bucharest) – Thursday, October 30



Hadoop Weekly Issue #92

19 October 2014

With Strata + Hadoop World this week, there were a number of partnership announcements and software releases. Among them, Cloudera and Hortonworks released new versions of their distributions, MapR is bundling MapR-DB with their community edition, and Pivotal announced plans for the Tachyon project. There are also several good technical posts this week covering Sqoop, Kafka, Presto, Hive, and Scala as a language for data processing. I tried to cover the key news from the week but likely missed some stories given the Strata + Hadoop World tsunami. Please let me know if there’s something you think should be in next week’s newsletter.


ingest.tips is a new blog focussing on data ingestion into Hadoop. I recommend catching up on all the posts published so far. They cover Flume v. Kafka, the design and features of Sqoop2, the Kafka High-level consumer, and a recap of this week’s Kafka Meetup in NYC.





This newsletter has had a lot of coverage of the work done by the folks at SequenceIQ on dockerizing Hadoop. In fact, they’ve been up to so much that it can be hard to see the whole picture through a series of posts. The Hortonworks blog has a guest post in which SequenceIQ summarizes their platform—Cloudbreak for provisioning clusters and Periscope for SLA enforcement and autoscaling.


This post describes some of the roadblocks in setting up the latest version of Apache Sqoop (1.99.3) and how to get past them. It serves as a pre-walkthrough to the Sqoop in 5 minutes tutorial from the official Sqoop documentation.


Apache Phoenix has gained momentum recently as a SQL engine for HBase. The Hortonworks blog has some notes on integrating Phoenix with Hive, which can also do SQL over data stored in HBase (but with an emphasis on batch as opposed to OLTP). Plans include a unified SQL layer which can delegate to either Phoenix or Hive, a shared metadata repository, and a shared transaction manager.


The Scala language has been adding adopters for years, especially as several popular distributed systems are written in Scala (e.g. Kafka and Spark). This post discusses three reasons that Scala should be your go-to language for data engineering / processing at scale.


The Hortonworks blog describes Ozone, an object store that it plans to add to HDFS. An object store (Amazon S3 is probably the best known example) has different requirements than a file system, such as support for large numbers of objects (much more than the number of files HDFS can support), a simple REST API, and cross-datacenter replication.


This post is a brief intro to the Hadoop metrics framework. Specifically, it includes snippets of both the registration of and export of (via FileSink and web services) metrics.


Cloudera introduced Cloudera Director this week for running Hadoop clusters in the cloud. The AWS big data blog has a post describing how to build a cluster in AWS with Cloudera Director and Cloudformation. The post describes two possible topologies in an AWS Virtual Private Cloud, how to configure the cluster, how to deploy it, and how to terminate the cluster.


The Qubole blog has a guest post by MediaMath on their experiences with Presto, the big data SQL framework from Facebook. The post includes a performance comparison of Presto vs Hive on data (presumably real data from MediaMath, not synthetic data) stored in Amazon S3. Results show that Presto is ~3x faster than Hive on average, and 5x faster when caching (a Qubole-only speedup) is enabled.


The Cloudera blog has two posts on new features of CDH 5.2, which was released this week (more on the release below). The first covers Impala, which has gained support for several analytics functions, two new datatypes (VARCHAR and CHAR), support for spilling to disk when the query doesn’t fit in RAM, and more. The second covers Apache Sentry, which adds the GRANT keyword (to allow a user to grant privileges to other users) and the REVOKE keyword to remove those privileges.




The DBMS2 blog has a post about Cloudera’s product offering. It serves as a glossary of all the products and buzz words surrounding Cloudera’s products. The post is pre-Strata + Hadoop World, so it doesn’t include any newly announced products (such as Cloudera Director).


This post discusses the history and architecture of the Apache incubator project, Flink (formerly Stratosphere). The post argues that Flink is in a better position than most big data query engines because it contains a cost-based optimizer for unstructured data and can unify real-time processing with analysis of historic data. In terms of real-time, the post compares Flink with Spark Streaming (which only does micro batch).


MapR announced that they’re planning to integrate Apache Drill, the data exploration platform, with Apache Spark. Given recent news related to Spark (e.g. efforts to get Hive running on Spark), this is another vote for Spark as the successor to MapReduce.


This post opens with an observation that I struggle with every week as I find content for this newsletter: “it’s getting hard to pinpoint what, exactly, Hadoop is.” It points out that all the moving pieces and flexibility of Hadoop can make it difficult to deploy and operate. This in turn is a big opportunity for folks selling to enterprises.


Cloudera has started “Cloudera Labs” for incubation of projects inside of Cloudera Engineering. The initial set of projects includes Kafka, Hive-on-Spark, Impyla (a Python client for Impala), and Oryx (an implementation of the lambda architecture).


The DBMS2 blog has a post-Strata + Hadoop World article on Cloudera’s announcements this week. Key observations include the large number of business partnerships announced by Cloudera this week and that they’re becoming more cloud friendly.


The number of partnerships and announcements this week from Cloudera is a bit overwhelming. Many are covered elsewhere in the newsletter, but the full list is indexed in the Cloudera press center.


Pivotal’s distribution, Pivotal HD, has included support for Spark since May. They’ve announced plans to take their commitment to in-memory computing a step further by partnering with UC Berkeley’s AMPLab to further develop the Tachyon in-memory distributed file system.


Datanami has coverage of recent releases by big data software vendors in which MapReduce is replaced with next generation processing systems. Of the companies profiled, four have moved to Spark while one has moved to Tez. Regardless of whether Spark or Tez is winning, it’s clear that MapReduce is becoming less common.


A new Gartner research note on comparing Hadoop distributions has been published. While the full report is behind a paywall, this post describes the note’s key findings and recommendations. They include: vendor lock-in isn’t a large concern, Gartner expects new Hadoop ecosystem technologies soon, and Hadoop is becoming the de facto system for cluster management.


“Time Series Databases” is a new book written by some folks at MapR and being published by O’Reilly. The book looks at open-source tools for time series data, specifically OpenTSDB and Grafana. It also covers using MapR-DB as a backend to OpenTSDB. MapR is sponsoring a free download (behind an email wall).


This article considers the pros and cons of various ways to build an analytics platform with Hadoop. Options include Hadoop as a source of truth from which a data warehouse is populated, a parallel data warehouse, Hadoop on an appliance, and analytics directly from Hadoop. The post also includes suggestions for successfully using Hadoop as an analytics platform.



Mortar, makers of the Pig-as-a-Service platform, have announced integration with Luigi. Luigi is an open-source workflow management tool originally written at Spotify. Mortar’s introductory blog post explains some of the advantages of Luigi, details the integrations they’ve built for it, and links to a tutorial for getting started.


VMware’s vSphere Big Data Extensions (BDE) 2.1 includes integration with Cloudera Manager and Apache Ambari for provisioning Hadoop clusters. After provisioning VMs, BDE makes API calls to the management software to build and configure the cluster. There’s much more information about the integration in the blog post below.


Protegrity Avatar is a new system for data protection in HDP. It supports encryption at rest and fine-grained access controls for Hive, Pig, HBase, and MapReduce.


Cloudera has released Cloudera Enterprise 5.2. The announcement highlights several improvements in the release—security (including the fruits of joint work with Intel), data management & governance, cloud deployment, and more. The release includes new versions of HBase (0.98.6), Apache Spark (1.1), Impala (2.0), and several other components. Apache Kafka integration is also available via Cloudera Labs.


Hortonworks released HDP 2.2, the next release of their distribution. Release highlights include phase 1 of Stinger.next to improve performance of and add (simple) transactions to Hive, Spark on YARN, the inclusion of Kafka, Apache Ranger (previously Argus) for cluster security, and support for cloud backup. There’s a much more complete overview of the release, which features new versions of every component of the distribution, on the Hortonworks blog.


MapR announced this week that they are including MapR-DB within the MapR Community Edition. MapR-DB implements the HBase API but is built with a different architecture (which leverages the MapR FileSystem).


Microsoft announced this week that Azure HDInsight, a Hadoop-as-a-Service system, is adding support for Apache Storm. The integration is available in preview form starting now. Also, they expect to land support for HDP 2.2 on HDInsight in November.


Actian announced a free community version of their Actian Analytics Platform. Actian’s SQL-in-Hadoop system stores data in HDFS but doesn’t interoperate with the rest of the Hadoop ecosystem. The community version is free for an unlimited number of nodes and up to 500GB of data.


Rackspace has announced the OnMetal Cloud Big Data Platform, which is used to run a bare-metal Hadoop/Spark cluster. This is an interesting product that lies between a dedicated cluster and one running in the cloud on virtualized hardware.


Pivotal announced new versions of GemFire XD and SQLFire. GemFire XD is a distributed database that runs atop Pivotal HD. Both releases include improved integration with HDFS.



Curated by Mortar Data ( http://www.mortardata.com )



Storage Solutions for Big Data with Hadoop Architect Sameer Tiwari (Palo Alto) – Tuesday, October 21


Hadoop Effortlessly: A Data Inventory Is Key to Data Self-Service (Sunnyvale) – Thursday, October 23


Data Science Camp @ Bay Area ACM (San Jose) – Saturday, October 25



Data Science Using Big R for in-Hadoop Analytics (Las Vegas) – Sunday, October 26



Drill Down into Apache Drill! Plus, Pinsight Media + Hadoop and Hive Use Case! (Overland Park) – Thursday, October 23



Resource Management in Modern Hadoop Clusters (Saint Louis) – Tuesday, October 21



Welcome to the Nashville Cloudera User Group (Nashville) – Thursday, October 23



Hands-on MapReduce and Spark Programming by Roger Ding (McLean) – Wednesday, October 22



Escape from Hadoop: Spark Streaming, Cassandra, Scala & Akka, with Helena Edelson (Philadelphia) – Tuesday, October 21



Apache Spark in Four Parts (Annapolis Junction) – Tuesday, October 21



Real-Time Analytics in Hadoop, and Hadoop in 2015 (Saint Petersburg) – Wednesday, October 22


New York

Index-Based SQL-on-Hadoop: An Architectural Comparison of Tools (New York) – Monday, October 20



ADAM, Spark, and Tachyon (Cambridge) – Monday, October 20



HBase, What’s It All About? (Colchester) – Friday, October 24



Hadoop Continued: Hive and Spark + Experiences (Vienna) – Tuesday, October 21



Disruptive Applications and Hadoop… on the Cloud (Vancouver, BC) – Tuesday, October 21


Connecting Visual Analytics Tools to Enterprise Big Data with Spark SQL (Vancouver, BC) – Thursday, October 23



“Apache Spark 101” with Paweł Szulc (Wroclaw) – Tuesday, October 21



Scala.IO (Paris) – Wednesday, October 22 and Thursday, October 23



High-Availability Hadoop and Apache Cassandra (Sydney) – Wednesday, October 22



Big Data/Data Science Meetup (Cluj-Napoca) – Thursday, October 23



Shanghai Spark Meetup, with Jason Dai (Shanghai) – Saturday, October 25


MLlib and Distributed Machine Learning (Beijing) – Sunday, October 26



Hadoop Weekly Issue #91

12 October 2014

With Strata+Hadoop World taking place this week in New York, we can expect to see a lot of announcements. But a number of folks have jumped out ahead of the conference, and there are several partnership and technical announcements in this week’s issue. On the technical side, Databricks posted a benchmark for terasort on Spark, and eBay has open-sourced Kylin, their Hadoop OLAP system. If you’re in NYC for Strata+Hadoop World, be sure to check out some of the 14 meetups happening this week!


This tutorial walks through the steps necessary to configure a Shark (Hive on Spark) thrift server and use it to power Tableau over ODBC.


In an in-depth and interesting post, Netflix has described their use of Presto, an SQL-on-Hadoop (or in this case S3) system open-sourced by Facebook, on AWS. Netflix has over 10 PB of data in S3, runs a Presto cluster consisting of 250 m2.4xlarge instances, and supports around 2500 queries per day. They’ve contributed a number of improvements to Presto, including improving support for the Parquet file format and S3.


A paper presented at the OSDI conference this week focusses on testing in distributed systems. The paper considers reports of real-world failures of several distributed systems from the Hadoop ecosystem: Cassandra, HBase, HDFS, and MapReduce. The authors have a number of interesting findings, including: 98% of failures are guaranteed to manifest on <= 3 nodes, 77% of failures can be reproduced by a unit test, and 92% of catastrophic failures are due to incorrect handling of non-fatal errors. They introduce Aspirator, a system for statically analyzing software to find these types of errors.


The Los Angeles Spark User Group recently hosted a panel of data scientists from Cloudera, MapR, and Pivotal. The panelists discussed Spark’s conception and history, their vision for the future of Spark, and more. Inside Big Data has a video of the panel.


This post covers setting up the Google Cloud Storage Hadoop FileSystem integration with Apache Spark. It covers the installation and configuration steps as well as some simple smoke tests to ensure the system is set up correctly.


The SequenceIQ blog has a post describing a system they’ve built for Hadoop monitoring. The system consumes metric log files generated by the Hadoop metrics system using collect. From there, the metrics are sent via Logstash to an ElasticSearch cluster. Kibana is used for dashboarding and visualization. SequenceIQ has published a development preview of the client and server daemons, which are run as Docker containers.


Databricks has published results on using Apache Spark to sort 100 TB and 1 PB of data. The benchmark used 206 nodes in AWS EC2 and completed the sort of 100TB in 23 minutes, which is just under 3x as fast as the previous record from 2013 on a 2,100 node Hadoop cluster. The post on the Databricks blog has details on the experiment as well as background on several of the recent improvements to Spark that helped them achieve the speedup.
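To put those numbers in perspective, the sketch below turns them into aggregate and per-node sort throughput. It assumes decimal terabytes (10^12 bytes) and a run time of exactly 23 minutes, neither of which the summary above specifies, so treat the output as a rough estimate. The function name `sort_throughput` is made up for this example.

```python
def sort_throughput(total_bytes, seconds, nodes):
    """Return (cluster bytes/sec, per-node bytes/sec) for a sort run."""
    cluster = total_bytes / seconds
    return cluster, cluster / nodes

TB = 10**12  # assumption: decimal terabytes

# 100 TB sorted in 23 minutes on 206 nodes (figures from the summary above)
cluster_bps, per_node_bps = sort_throughput(100 * TB, 23 * 60, 206)
print(f"cluster: {cluster_bps / 1e9:.1f} GB/s, per node: {per_node_bps / 1e6:.0f} MB/s")
# cluster: 72.5 GB/s, per node: 352 MB/s
```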


The Cloudera blog has a guest post from Syncsort on their work to add support for importing data from a mainframe to Hadoop. The post gives a bit of background about mainframes (which expose their data via FTP), the design and implementation, and experiences going through the patch submission and review process.


In another guest blog post, Syncsort writes on the Hortonworks blog about integrating their DMX-h product with Apache Ambari. DMX-h adds a new Ambari Service definition, which is exposed via the REST API.



ZoomData, makers of big data analytics and visualization software, announced $17M in Series B funding. ZoomData’s software supports Hadoop, Spark, and several other connectors.


After last week’s announcement that Cloudera has acquired visual analytics startup DataPad, we’re hearing from DataPad’s CEO and co-founder about the acquisition. This post has some background on the founding of DataPad (including the types of problems the company is trying to solve) and a glimpse into the future of DataPad’s software inside of Cloudera.


In a post celebrating Storm’s graduation from the Apache Incubator, Storm founder Nathan Marz recounts the history of the project. The post covers the creation of Storm, the process of open-sourcing Storm, the marketing and support that went into the early project, Storm’s technical evolution, and Storm at Apache.


Cloudera has written about some of the work they’ve done for Apache Spark and some of their plans for the future of Spark. Examples of completed work include improving Spark-on-YARN, better support for HDFS caching, and integrating Spark Streaming and Apache Flume. Plans for the future include Hive-on-Spark, lossless Spark Streaming, and integrating Spark with the YARN timeline server.


Cloudera and O’Reilly have announced an expanded partnership around conferences. In addition to Strata + Hadoop World in New York, Strata conferences in Barcelona, San Jose, and London have been rebranded “Strata + Hadoop World.”


Cloudera and Teradata announced an extended partnership as they work to optimize the integration between Cloudera’s enterprise data hub and Teradata’s data warehouse through the Teradata Unified Data Architecture.


Businessweek has an article marking Hadoop’s success at permeating industries outside of Silicon Valley. They cite the Detroit Crime Commission, agriculture enterprise Monsanto, and the Indian government’s national identity registry as examples. The article also includes a discussion about the merits of open-source.


The post walks through the “data lake” metaphor… and introduces a few new metaphors along the way. There’s a good discussion of semi-structured data and the importance of generating useful data to put into a lake.


This is a quick post offering some commentary on the Cloudera-Teradata partnership announced this week. It points out that the partnership highlights the fact that Hadoop isn’t replacing the data warehouse, as a lot of folks have predicted.



A few weeks ago, Apache Accumulo 1.5.2 was released. Accumulo is a distributed key/value store based on BigTable built atop HDFS and Zookeeper. The 1.5.2 release contains performance and bug fixes for the 1.5.x branch (version 1.6.1 is the latest).


Accumulo 1.6.1 was also recently released. The release contains several performance improvements including better write-ahead log sync performance (by avoiding multiple syncs). There are also several bug fixes including a fix for upgrading from 1.5.x to 1.6.1 and an updated Guava version dependency (to match Hadoop 2.x).


Sematext announced support for monitoring of Apache Spark jobs as part of their Performance Monitoring (SPM) product. The introductory blog post includes screenshots of the Spark integration, which provides metrics for Spark Workers, Executors, and more. SPM is available both as a SaaS or on premises deployment.


Hadoop SaaS vendor Altiscale announced a new SQL-on-Hadoop offering this week. The system is built on Hive 0.13 and Tez, offers a web-based SQL query tool, and leverages a partnership with Simba Technologies to offer ODBC access to the service.


Fluo is a new project to add a transaction layer atop Accumulo. The first alpha release was made alongside the announcement, and it uses Apache Twill for deploying into YARN.


Apache BigTop 0.8.0 was released. For those not familiar, BigTop is a project for integrating and testing a large number of ecosystem projects. This release is based on Hadoop 2.4.1, HBase 0.98.4, the latest version of Phoenix, and contains upgrades of several other ecosystem projects.


Cloudera Live is a zero-install demo of Hadoop available via a web browser. The demo has been updated to include an interactive tutorial that includes loading data into HDFS using Flume and Sqoop, creating and querying Hive/Impala tables, and indexing data into Cloudera Search.


eBay has open-sourced Kylin, their Hadoop OLAP engine. In addition to being another SQL-on-Hadoop system (supplying ANSI SQL), Kylin supports data cubes, approximate queries (using HyperLogLog), ACLs at the Cube/Project Level, and more. In comparison to other SQL-on-Hadoop systems, Kylin is a Multi-Dimensional OLAP whereas most others are closer to Relational-OLAP. The presentation below has many more details on the system, including information on the architecture and technical pieces.



Cascading 2.6 was released. The new version includes about 20 changes, including a new DecoratorTap and DistCacheTap to wrap existing classes.


Dataguise has announced that its DgSecure data governance software for securing Hadoop deployments now supports several Hadoop-as-a-Service offerings. Those include Altiscale, Qubole, and Amazon Web Services.


Trifacta v2 was released this week. The software, which focusses on data wrangling, includes visual data profiling tools, supports many common formats from JSON to Parquet, and uses both Spark and MapReduce. There are more details about each of these parts in the announcement.


HUE 3.7 was released. The new version includes a new app for Sentry with tools for managing roles and privileges, improvements to the Search app including several new widgets, and improvements to Oozie, HBase, Hive/Impala and more.


Version 0.17.0 of the Kite SDK was released. The new version adds support for namespaces, improved examples, new tools for running against development mini clusters, and more.



Curated by Mortar Data ( http://www.mortardata.com )



48th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) – Wednesday, October 15


DevOps Special: Deploying Hadoop Using Docker Containers (Santa Clara) – Thursday, October 16



3rd Thursday Huddle! Hadoop & NoSQL Joining Forces (Dallas) – Thursday, October 16



A Leap Forward for SQL on Hadoop (Milwaukee) – Tuesday, October 14



HUG Pittsburgh Meetup (Pittsburgh) – Wednesday, October 15



Big Data & Analytics Developer Day (Chattanooga) – Wednesday, October 15


New York

Strata Conference Big Data: Commercialized Hadoop+Spark+R Solution (New York) – Monday, October 13


Practical On-line Approximation Algorithms in Storm with Ted Dunning – Monday, October 13


2-for-1: Resource Management in Modern Hadoop + Hadoop Application Architecture (New York) – Tuesday, October 14


Sandy Ryza: Why Is My Spark Job failing? (New York) – Tuesday, October 14


Becoming a Scalable Data Scientist with GraphLab (New York) – Wednesday, October 15


Cloudera User Group Meetup at Strata + Hadoop World (New York) – Wednesday, October 15


The Past, Present and Future of Apache Kafka (New York) – Wednesday, October 15


Why Pig? + Pig on Spark Update during Strata (New York) – Wednesday, October 15


Going Beyond Hadoop: Faster Big Data (New York) – Wednesday, October 15


Elasticsearch Meetup at Twitter (New York) – Wednesday, October 15


Sqoop Meetup at Strata + Hadoop World (New York) – Wednesday, October 15


HBase Meetup on the Night before Strata/HW (New York) – Wednesday, October 15


Big Cybersecurity Analytics Meetup with Sqrrl (New York) – Thursday, October 16


Informal Hue Meetup at Strata + Hadoop World: Hue 3.7 (New York) – Thursday, October 16



Full-day Hadoop MapReduce Hands-On (Cambridge) – Saturday, October 18



Web-Scale Data Mining and Processing (Warsaw) – Wednesday, October 15



Big Data, Bases de Données Graph, Démo Hadoop et MapReduce (Casablanca) – Wednesday, October 15



Introduction to Apache Flink (Berlin) – Wednesday, October 15



Introduction to Big Data & Hadoop (Bangalore) – Thursday, October 16


Big Data/Hadoop Forum (Chennai) – Saturday, October 18




Hadoop Weekly Issue #90


05 October 2014

It’s a relatively quiet week with only two releases (the calm before the Strata + Hadoop World storm?). In the technical and news areas, two themes are playing out this week. First, there is a lot of great content on stream processing frameworks, namely Storm and Spark Streaming. Second, there are several articles about integrating YARN with other systems and frameworks (OpenStack, Mesos, AWS). There are also pieces on Spark MLlib, RStudio on Amazon EMR, and the cost-based optimizer for Hive, so there is something for everyone.


Getting started with a new distributed system typically requires looking through tutorials, documentation, and even source code. This presentation aims to gather all of that information (and more) into a single training deck for Apache Storm. It covers five key areas: an introduction, Storm’s core concepts, operational considerations, Storm app examples, and Wirbelsturm for local development.


This presentation gives an introduction to Apache Optiq (incubating) and describes how the Optiq cost-based optimizer is being added to Apache Hive 0.14. There are some examples of optimizing the query plan for star schema, left-deep tree, and bushy tree queries. It also explores the importance of having statistics about the data, and there are some impressive benchmarks on TPC-DS queries at the end.


This post walks through five different types of logs that are important for understanding and debugging a Hadoop cluster. Given that YARN is relatively new, this is a good introduction to the new types of logs introduced in recent versions of Hadoop.


Spark’s MLlib contains a decision tree implementation which can be used in data classification problems. Even if you don’t know what a decision tree is, the article contains an introduction before it dives into the technical details. The post has an example in Python (and links to examples for Java and Scala), describes the optimizations in the implementation, and has an overview of scalability (both dataset size and number of features). There were also some impressive speed gains in Spark 1.1 vs. Spark 1.0.
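For readers new to decision trees, the core step is choosing the split that leaves the purest groups of labels. The following is a minimal pure-Python sketch of that idea using Gini impurity on a single numeric feature. It is not MLlib's code or API: the function names `gini` and `best_split` are made up for illustration, and it tries every candidate threshold exhaustively, which a distributed implementation would not do.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini.

    Returns (threshold, weighted_impurity). The threshold is None when no
    split improves on the impurity of the unsplit labels.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        # Only consider thresholds between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

A tree is then built by applying this step recursively to the rows on each side of the chosen threshold.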


DataStax Enterprise 4.5 integrates Apache Cassandra with Apache Spark using the Spark Cassandra Connector. This post includes a walkthrough of using Spark’s MLlib with data stored in Cassandra.


The SequenceIQ blog has an example of implementing a correlation function for Spark. While the implementation duplicates some functionality found in MLlib, the example shows how to write testable Spark code (and has example tests). The code is available in its entirety on GitHub.
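As a rough idea of what a testable correlation function looks like, here is a plain-Python sketch of Pearson correlation kept separate from any cluster code. It is not the SequenceIQ implementation and does not use Spark. The choice of Pearson (rather than another correlation measure) and the name `pearson_correlation` are assumptions made for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of co-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Keeping the math in a pure function like this means it can be unit tested on small lists before it is applied to distributed data.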


Many folks get started with Hadoop in the cloud and end up storing data in object stores like S3 as a result. This post from the Altiscale blog discusses some of the drawbacks of storing data in an object store vs. a true file system.


Datameer has written about how they’ve reengineered the backend to Datameer 5 to be framework agnostic. Previously, the system was tightly coupled with MapReduce, but it can now also use Tez and small job/local execution engines. The post also describes why they use Apache Tez over Spark (although they do say that Spark will eventually be integrated).


While Spark has had integration with Kafka for several releases, this post goes much further than the Spark-bundled KafkaWordCount example. In fact, the post contains everything needed to get started with Kafka and Spark Streaming—including overviews of both systems that describe core concepts. The post culminates with a full example that reads Avro-encoded data from Kafka (in parallel across partitions), does some simple computing, and writes the data back to Kafka. There is also a summary of known issues, testing, and performance testing.


This post shows how to build an Amazon Elastic MapReduce (EMR) cluster that integrates RStudio. After bootstrapping a cluster, it walks through changing security settings to allow access to the RStudio web interface, describes how to use the rmr2 package to run a MapReduce job from R, and shows how to pull in some real-world (global weather measurement) data for analysis.


This tutorial explains how to install Apache Spark in the MapR sandbox (a VM running in VMware or VirtualBox). After that, it has some examples with the spark-shell to run simple queries against a text-based Spark RDD.



In recent years, a number of systems for managing clusters in a general purpose way have emerged. Among them are YARN, Mesos, Kubernetes, and OpenShift. It seems likely that we won’t see one clear winner, but that these systems will learn to coexist. This post on the Hortonworks blog describes plans for integrating OpenShift and Kubernetes with YARN.


Meanwhile, Myriad, a framework for Mesos, is looking to integrate YARN and Mesos, but in the other direction. In short, Myriad is used for scaling YARN clusters in Mesos. This post has some more details on Myriad and its roadmap.


Cloudera announced the addition of Martin Cole (former Group Chief Executive of Technology at Accenture) and Steve Sordello (CFO, LinkedIn) to their board of directors. The new appointees will work on extending Cloudera’s vertical applications and serve as the Audit Committee Chair, respectively. While these appointments are well deserved, they also bring the gender composition of the board members of top Hadoop vendors (Cloudera, Hortonworks, and MapR) to 20:1.


A new book from O’Reilly, “Getting Started with Impala,” is now in early release. A post introducing the book has a Q&A with the book’s author, John Russell.


Cloudera announced this week that they’ve acquired DataPad, makers of collaborative BI/analytics software. In the press release, Cloudera says that DataPad’s co-founders will build data backends for business intelligence tools aimed at “simplifying use of Cloudera’s products.”


This post questions the conventional wisdom of running a real-time database separately from a Hadoop cluster. It discusses a few arguments for running NoSQL solutions on Hadoop (real-time analytics, scalable storage) and several DB-on-Hadoop solutions like MapR-DB, HBase, and Apache Accumulo.


O’Reilly has announced a new book by Jay Kreps on logging in distributed systems. The book is based on several blog posts, and covers a number of concepts at the heart of a big data platform.


SequenceIQ announced that they’ve joined the Hortonworks Technology Partner Program. SequenceIQ is developing Cloudbreak, a cloud agnostic tool for provisioning and autoscaling HDP clusters.


Hortonworks and Oracle have announced that the Oracle Data Integrator (ODI) is certified with HDP 2.1.


Datanami has an overview of the Forrester Wave report on NoSQL databases. The report looked at key-value databases and document-oriented systems. Product offerings from MapR, DataStax, and Amazon Web Services all scored high in the report.


The DBMS2 blog has two posts this week, the first on Streaming for Hadoop. It discusses both stream processing frameworks (Spark Streaming, Storm) and data transfer systems (Flume, Kafka) in the wild. There are some interesting observations, such as that Kafka is being used by internet companies more than enterprises (citing lack of security as a concern). The post also tries to articulate the politics of streaming software tools with respect to vendors.


This post rehashes the argument of whether Spark or Tez is the successor to MapReduce. While many companies seem to be throwing their weight behind Spark, Hortonworks sees a place for both Spark and Tez.



Ferry is a tool for provisioning distributed systems (with a focus on several in the Hadoop ecosystem). It began as a tool for running a local setup in Docker containers, but has recently announced support for OpenStack and Amazon Web Services. With this addition, it’s incredibly easy to build a Hadoop cluster (with whichever components you want) inside of an Amazon VPC.


Red Hat Storage Server 3 was announced. The new version adds a plug-in for the Hadoop FileSystem API and integration with Apache Ambari.



Curated by Mortar Data ( http://www.mortardata.com )



Making Hadoop Enterprise Ready, by Brett Rudenstein of WANDisco (Santa Monica) – Monday, October 6


Self-Service Data Exploration Using Apache Drill, by David Kewley of MapR (El Segundo) – Thursday, October 9


#SDBigData Monthly Meetup (San Diego) – Wednesday, October 8


Washington State

Deep Dive into Spark, Tachyon, and Mesos Internals (Bellevue) – Wednesday, October 8



PDI on Hadoop (Addison) – Monday, October 6


AT&T Foundry Tour and Meetup with AT&T Employees and Big Data in the Big D (Plano) – Thursday, October 9


Kiyu Gabriel: Cassandra and DataStax (Houston) – Wednesday, October 8


Introduction to Hadoop Course, Part 2 (Austin) – Saturday, October 11



AZSSUG Oct Meeting: Big Data Presenters Josh Sivey/Orion Gebremedhin (Tempe) – Wednesday, October 8



Celebrate Data Science in the Cloud (Denver) – Thursday, October 9



R on Hadoop (Saint Petersburg) – Wednesday, October 8


New Jersey

Apache Ambari and Slider: Deployment & Resource Management (Hamilton Township) – Tuesday, October 7


New York

This Ain’t Your Father’s Search Engine (New York) – Thursday, October 9



Hadoop User Group (Paris) – Monday, October 6



NoSQL in a Hadoop World (Manchester) – Tuesday, October 7


October Hadoop Meetup (London) – Tuesday, October 7



Rapture I/O + Apache Spark (Prague) – Tuesday, October 7



Introducing Apache Flink (+) Hadoop Operations Powered By … Hadoop (Stockholm) – Wednesday, October 8



Hadoop Security and Apache Sentry (Hyderabad) – Thursday, October 9


Hadoop Workshop (Hyderabad) – Thursday, October 9


