Hadoop Weekly Issue #89


28 September 2014

This week’s issue has a lot of great content. It includes new open-source projects from Netflix and LinkedIn, several articles about Apache Spark (including details from Hortonworks on their plans for it), and news on Cascading on Tez. There’s also coverage of news in the ecosystem and several additional releases.


This post is aimed at getting started with a non-trivial Spark cluster without any existing infrastructure. It leverages Apache Mesos via the free tier of Mesosphere for Google Cloud Platform. The tutorial explains how to launch a cluster, download VPN credentials to access it, reach the Spark and Mesos consoles, and run the Spark shell to execute a simple distributed computation.


A few months back, Cloudera announced that they plan to adopt Apache Spark as a successor to MapReduce for many systems. In the time since then, a lot of work has gone into making that a reality. This post gives updates on the status of Spark integration for Apache Crunch, the Kite SDK, Apache Solr, Apache Pig, and Apache Hive.


This is a quick walkthrough of setting up Apache Drill with Pentaho Data Integration (PDI). There are instructions for starting an embedded Apache Drill and connecting PDI to Drill in order to execute a simple query against a JSON file.


Two folks from the Hadoop team at Yahoo have shared details on and recommendations for when to use Spark and Storm. Their presentation includes an introduction to both of these technologies (including example applications), and a detailed overview of the strengths and weaknesses of both.


The Databricks blog has details on two performance improvements in Spark 1.1—torrent broadcast and tree aggregation. Both improvements make better use of the network, which led to 1.5-5x speed improvements in MLlib (ML algorithms tend to broadcast and aggregate lots of data across several iterations).
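For readers unfamiliar with tree aggregation, the communication pattern can be sketched in a few lines of plain Python (an illustration of the idea only, not Spark's implementation):

```python
from functools import reduce

def tree_aggregate(partials, combine, fanout=2):
    """Merge partial results in rounds of size `fanout`, so no single
    node has to receive and combine every partition's output at once."""
    level = list(partials)
    while len(level) > 1:
        level = [reduce(combine, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Summing 8 partition results takes three rounds with fanout=2,
# spreading the network and CPU cost across the cluster.
print(tree_aggregate([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # → 36
```

The real implementation pushes these rounds onto the cluster's executors; the sketch only demonstrates why multi-level combining reduces the load on any single node.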


Databricks has created two reference applications—for analyzing logs and classifying the language of a tweet stream. These references aim to show how to build a fuller-featured application than is included in a basic tutorial/walkthrough.


Most developers won’t need to use the Apache Tez API directly—it’s predominantly intended to be used by other frameworks (e.g. Hive, Pig, and Cascading are all built atop Tez). But if you’re interested in what a standalone Tez application looks like, this post describes how to do a top-K calculation using Tez. It includes snippets (with descriptions), and the full code is available on GitHub.
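As a rough sketch of the computation itself (the Tez specifics—DAG, vertices, edges—are in the post; the names below are illustrative, not the post's code):

```python
import heapq
from collections import Counter

def count_vertex(lines):
    """First stage: tokenize and count words. In Tez this output would
    be partitioned by key and shuffled to the downstream vertex."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def top_k_vertex(counts, k):
    """Second stage: keep only the k most frequent words."""
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

lines = ["to be or not to be", "to err is human"]
print(top_k_vertex(count_vertex(lines), 2))  # → [('to', 3), ('be', 2)]
```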


Cloudera has posted new benchmarks for their SQL-on-Hadoop system, Impala (as always with a vendor benchmark—you might find different results with your own data). This time, they’ve compared it to Hive-on-Tez, Spark SQL, and Presto on a 21-node cluster. The results show that Impala has much higher query throughput in a multi-user environment and that it’s faster for three different types of single-user queries, too.


Based on experience from having Apache Spark available as a tech preview for HDP, Hortonworks has put together a two-phase initiative to improve Spark. Phase 1 consists of improved integration with Apache Hive, support for the ORCFile format, and improvements to security and operations (namely integration into Apache Ambari). Phase 2 focuses on improving scale and reliability (mostly around YARN integration), improving debuggability, adding wire encryption and authorization, and integrating with the YARN Application Timeline Server.


The Hortonworks blog has a guest post from Concurrent CEO Gary Nakamura on the state of Cascading on Apache Tez. In a recent milestone, Cascading 3.0 WIP added support for Apache Tez as part of a new pluggable query planner. Future work includes improving scalability and performance and adding support for other Cascading-powered libraries such as Scalding and Cascalog.


This post walks through enabling SSL encryption (including a client key) between Hue and Hive. It has an overview of the network communication between the two services in an encrypted setup, a guide for generating keys with keytool and openssl, and example configuration files.


Hortonworks has put together a few tutorials for Apache Kafka and Apache Storm. The first tutorial uses Kafka as a transport for real-time trucking events, the second shows how to consume data in real-time using Storm, and the third is the old standby, WordCount, in real-time with Storm.



The Qubole blog has a recap of some recent announcements and news related to Hadoop. There’s some overlap with the content of this newsletter, but there are several new articles as well.


Apache Storm recently graduated from the Apache Incubator. A post on the Hortonworks blog has a bit more about Storm, how it’s being used, and background on incubator graduation.


Hadoop startup Continuuity has rebranded itself as Cask. At the same time, they’re open-sourcing/rebranding several products. First, their flagship product, Continuuity Reactor, is now open-sourced as the Cask Data Application Platform. Second, they announced a preview release of a real-time stream processing framework called Tigon that was built in conjunction with AT&T Labs. Third, there is a new name for their cluster management software (formerly Continuuity Loom), Coopr.


Videos of the presentations at the recent Strange Loop conference are now available on YouTube. The talks cover a number of topics ranging from programming languages to distributed systems. From the Hadoop ecosystem, there are talks on Samza and Cassandra.


SequenceIQ, makers of the Hadoop-as-a-Service platform Cloudbreak, announced an investment from Euroventures. The financial details of the deal were not disclosed.



HBase 0.99.0 was released. This is a developer preview, which is not intended for production use. It contains a number of enhancements (over 1,000 tickets were resolved) that will eventually become the basis of the 1.0 release. A couple of highlights include removal of Hadoop 1.x support, support for stripe compaction, and the addition of a Dockerfile to run HBase from source.


Cloudera Enterprise 5.1.3 was released this week. It contains fixes/improvements across Hadoop, HBase, HDFS, Hue, Hive, Impala, Oozie, YARN, Cloudera Manager, and Cloudera Navigator.


Cloudera has announced a new version of their ODBC drivers for Apache Hive and Impala. The release includes bug fixes including better support for DECIMAL data types.


Hortonworks has updated the Spark technical preview to include Spark 1.1.0. Notable fixes include better integration with Hive 0.13 and support for ORCFile.


Inviso is a new open-source tool from Netflix for Hadoop job search and visualization. Job search is powered by Elasticsearch, which indexes job configurations. The visualization portion of the application includes plots of task attempts for a job, which are loaded from job history files. See the post for more details, including screenshots of the interface.


LinkedIn has open-sourced ml-ease, a large-scale machine learning library that includes backends for Hadoop and Spark. The software, which is available under an Apache License, supports Alternating Direction Method of Multipliers (ADMM) logistic regression.


Version 1.0.19 of Luigi, the workflow management system, was released. This release includes centralized resource limits, S3 API improvements, test fixes, and more.



Curated by Mortar Data ( http://www.mortardata.com )



Women in Analytics: Big Data Hadoop and Other Databases Pros/Cons (San Francisco) – Thursday, October 2

Securing Enterprise Data with Hadoop: What Are Your Options? (Santa Clara) – Thursday, October 2

Big Data and Data-Driven Business Security Considerations (Fremont) – Thursday, October 2


Reporting against Hadoop Data Sources Using Jaspersoft (Tempe) – Wednesday, October 1


How Apache Spark Fits into the Big Data Landscape (Westminster) – Thursday, October 2


Big Data Developer Day: A Leap Forward for SQL on Hadoop (Hopkins) – Wednesday, October 1


Big Data Everywhere (Chicago) – Wednesday, October 1


Talend: DI & Hadoop Integration Albert Mayer (Cincinnati) – Friday, October 3


Big Data Developer Kickoff (Calgary, Alberta) – Friday, October 3


HUG Italy: Primo Incontro a Milano (Milan) – Tuesday, September 30


Workshop: How to Think in MapReduce (Cluj-Napoca) – Tuesday, September 30


Architecture Night (Singapore) – Thursday, October 2



Big Data Market worth $46.34 Billion by 2018

The report “Big Data Market By Types (Hardware; Software; Services; BDaaS – HaaS; Analytics; Visualization as Service); By Software (Hadoop, Big Data Analytics and Databases, System Software (IMDB, IMC): Worldwide Forecasts & Analysis (2013 – 2018)” segments the global big data market into various sub-segments with in-depth analysis and forecasting of revenues. It […]


Hadoop Weekly Issue #88


21 September 2014

There are a number of posts covering the recently released Apache Spark 1.1, Apache Drill 0.5.0-incubating, and Apache Tez 0.5.0. In addition, there’s a look at Hadoop in the healthcare industry, a look at ORCFile for non-Hive workloads, instructions for building a Hadoop setup on Mac, and more. The amount of content this week shows that we’re past the summer lull, and I expect to see lots more great content this fall.


The first of several posts on Apache Spark 1.1.0, this one covers Spark Streaming. New features of Spark Streaming in the 1.1.0 release include integration with Amazon Kinesis and high availability for Apache Flume. The post gives an intro to Spark Streaming, the new features, and talks through several example use cases.


Spark 1.1 also includes improvements to PySpark, which provides Python access to Spark. The main improvement is API support for arbitrary Hadoop InputFormats. This post includes an example of using SequenceFiles from PySpark and a walk-through of the new Converter trait, which is used to map custom types to POJOs.


This tutorial has instructions for setting up a single-node Hadoop cluster including HDFS, YARN, HBase, and Flume on a Mac. The instructions are for CDH but most of the content should be useful for any distribution.


This post is an introduction and overview of stream processing frameworks. It introduces use-cases (including a closer look at fraud detection), gives an overview of stream processing architecture, and introduces several different systems (Apache Storm, Apache Spark, IBM InfoSphere Streams, and TIBCO StreamBase). The article also has a section on integrating with Hadoop and other data warehouses.


The Cloudera blog has a post summarizing a research project out of Zurich University to evaluate Cloudera Impala for mixed workloads. The post describes the use case, the report conclusions (namely that Impala scales linearly with more users), and includes a link to the full evaluation.


This post looks at using Apache Spark in two different ways. First, it shows how to compute summary statistics and other aggregations on a data stream. Second, it explores generating parquet files from Spark. After a few false starts (trying to use Scala libraries for Avro, Thrift, and Protobuf), the post shows how to use Java Protobufs as the write support for Parquet.


Apache Samza (incubating) is a stream processing framework that integrates well with Apache Kafka. Samza apps run on Apache YARN. This post on the LinkedIn blog describes how they use Kafka and Samza as a platform for distributed tracing (looking at the interaction between services in a service-oriented architecture). They describe a number of improvements they have made to Samza to scale distributed tracing. The post also includes a description of using Samza to implement some foundational operations such as a cogroup.


These slides are from a presentation walking through the evolution of the data pipeline at Tapad. They first describe Tapad’s data challenges and then walk through the data pipeline, which eventually converged on Avro and Kafka as the core components. The slides include details about how Tapad moves data from Kafka to HDFS (rewriting as Parquet along the way) and uses Scalding to build MapReduce jobs.


ORCFile is a columnar storage format that ships with Hive, but the format can also be used by other systems, such as Cascading and Apache Crunch. This article introduces ORCFile, shows examples of using it with Cascading and Crunch, and includes benchmarks demonstrating impressive performance improvements.


Altiscale, as a Hadoop as a Service provider, has seen customers write and deploy Spark applications. They’ve put together some guidelines for when it’s worth considering Spark instead of MapReduce. Among them, when you need one of Spark’s machine learning or graph algorithm implementations or when existing MapReduce jobs are slow or are implementing iterative algorithms.



A post on SiliconANGLE looks at a number of surveys and reports about Hadoop adoption and usage. It tries to answer questions about who is using Hadoop, how it’s being used (e.g. SQL vs search), and in which industries Hadoop adoption is the strongest.


Apache Spark has gained a lot of momentum over the past year, and a lot of folks see it as an evolutionary replacement of MapReduce. A post on Datanami suggests three areas that Spark needs to improve in order for that to happen. The areas are high-end scalability (thousands or tens of thousands of nodes), publication of successful case studies, and short- and long-term backwards compatibility (learning from Hadoop’s mistakes).


Databricks and O’Reilly have announced a new certification program for Apache Spark developers. The program includes an exam that validates technical expertise in Spark. The first set of certifications will be done around Strata NY + Hadoop World in October.


This article looks at the role that Hadoop plays in the healthcare industry. Notably, while centralizing data in Hadoop has a lot of advantages for analysis and generating insight, it also adds security risk (since all your data is in one place). The article includes a brief look into how the Children’s Hospital Los Angeles is using Hadoop.


The Amplify Partners Data & Analytics Fellowship covers the cost of travel and registration to the upcoming Strata Conference. Applicants should “demonstrate passion and potential to meaningfully contribute to the field.” Amplify is giving extra consideration to individuals who do not typically have access to these types of opportunities. Applications are due on September 30th.


Cloudera and Dell announced that Dell has joined the Cloudera systems integrator program. More details on the program and Dell’s offering are in the press release.



Apache Tez 0.5.0 was recently released. It includes a number of improvements across APIs, documentation, security, and more. More details on the improvements of the release can be found in the original release announcement on the Tez mailing list as well as in a post on the Hortonworks blog.


Adding to recent buzz around Apache Tez, SequenceIQ has announced a new Docker image to run Tez. It builds on the Ambari Docker image, and there’s a script to deploy a multi-node cluster.


A new beta release of Apache Drill, version 0.5.0, was announced. Drill is a SQL-on-Hadoop (and data stored in other places) system. The new version uses Hadoop 2.4.1, has improvements for sorting when data doesn’t fit in memory, and several other improvements. The release resolves over 100 issues, and the Drill team is aiming to do monthly releases as they march towards GA.


After the recent release of Apache Spark 1.1.0, the folks at SequenceIQ have published a new Docker image for that project. Running Spark in a Docker container can be a great way of getting started, and the post has a few examples of running Spark jobs with it.


MapR announced that they’re including the recently released Apache Drill 0.5 in their distribution. An introductory post on the MapR blog provides a number of reasons why you might want to adopt Drill. At the top of the list is Drill’s ANSI SQL support, which eases integration with other systems. It also highlights several features that (as a whole) differentiate Drill, such as query without centralized schema, support for nested data, and out-of-the-box ease of use.


In addition to adding support for Apache Drill, the 4.0.1 release of the MapR Distribution includes updates to several ecosystem projects. Updates include Hadoop core 2.4.1, Spark 1.0.2, and HBase 0.98.4. Also, Apache Storm is certified with MapR 4.0.1 and Tez is in a developer preview.


Amazon Web Services announced a new EMR File System, which replicates metadata about the state of data in S3 to DynamoDB. By using DynamoDB, the EMR File System can provide a consistent view of data in S3, which is itself only eventually consistent (particularly when doing S3 prefix listings). The post has some details on getting started with the EMR File System, which requires an initial sync command to initialize the data in DynamoDB.
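The core idea—cross-checking an eventually consistent listing against strongly consistent metadata—can be sketched as follows (a toy model, with a Python set standing in for the DynamoDB table):

```python
class ConsistentView:
    """Sketch of the EMRFS idea: record every object you write in a
    strongly consistent store (DynamoDB in EMRFS; a set here), then
    check S3 listings against it to detect not-yet-visible keys."""
    def __init__(self):
        self.metadata = set()   # stands in for the DynamoDB table

    def record_write(self, key):
        self.metadata.add(key)

    def check_listing(self, s3_listing):
        """Return keys known to exist that the eventually consistent
        listing did not (yet) return."""
        return self.metadata - set(s3_listing)

view = ConsistentView()
view.record_write("logs/part-0000")
view.record_write("logs/part-0001")
# an eventually consistent LIST might miss a freshly written object:
print(view.check_listing(["logs/part-0000"]))  # → {'logs/part-0001'}
```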


EPIC is a new product from BlueData for provisioning Hadoop. It bundles KVM, RHEL, and cloud management from OpenStack. EPIC is certified with the Hortonworks distribution (HDP) and also supports Cloudera’s CDH. EPIC One, a single node version, is available now. The enterprise edition is still in beta but expected in Q4.


SequenceIQ has also released new Docker images for Apache Hadoop 2.5.1. Their post features some examples of interacting with a container running the image.



Curated by Mortar Data ( http://www.mortardata.com )



Intro to Hadoop: Hype or Reality? You Decide, with Kevin Crocker (Palo Alto) – Tuesday, September 23

Women in Analytics September Event: Hadoop and Other DB Technology (San Francisco) – Thursday, September 25

How to Offload the ELT SQL in Your Data Warehouse into Hadoop Automatically (Mountain View) – Thursday, September 25

Discussion of Hadoop Use Cases vs. Runtime Environments, by Tom Phelan of BlueData (Los Angeles) – Thursday, September 25

Hadoop: Where Did It Come from and What’s Next? by Eric Baldeschwieler (Pasadena) – Thursday, September 25


Getting Started with SQL on Hadoop (Seattle) – Tuesday, September 23

Seattle Scalability Meetup: Google, HWX, Zulily (Seattle) – Wednesday, September 24


EMR, S3, and Hadoop Use Cases (South Jordan) – Thursday, September 25


Large-Scale Analytics with Apache Spark (Saint Paul) – Monday, September 22


Tech Talk with Eddie Garcia, Info Security Architect at Cloudera (Southfield) – Tuesday, September 23


NKU Big Data Series: Hadoop 101 (Cincinnati) – Thursday, September 25


Jeff Holoman Presents on Cloudera Distribution of Apache Hadoop (Huntsville) – Wednesday, September 24


Centralized Logging: Industry First Approach to HBase Fans (Jacksonville) – Thursday, September 25

North Carolina

Making Business Decisions with SAS & Hadoop (Charlotte) – Wednesday, September 24


YARN (Pittsburgh) – Tuesday, September 23

New Jersey

Tableau Deep Dive: Big Data Visualization (Hamilton Township) – Tuesday, September 23


Full-Day Hadoop MapReduce Hands-On Workshop (Cambridge) – Friday, September 26


Hadoop User Group: YARN, Falcon, HBase… (Paris) – Monday, September 22


Third Spark Barcelona Meeting (CSIC) (Barcelona) – Monday, September 22


Real-Time Analytics Using Indexed MapReduce (London) – Thursday, September 25


Intro to Lambda Architectures & Development (Toronto) – Friday, September 26


SolrCloud, Solr + Hadoop 2 & Nutch Integration (Bangalore) – Saturday, September 27


HadoopKitchen (Moscow) – Saturday, September 27


Read More…

Hadoop Weekly Issue #87


14 September 2014

There were several releases in the Hadoop ecosystem this week, including Apache Hadoop 2.5.1 and Apache Spark 1.1.0. There’s a lot of interesting technical content, including testing HBase’s consistency with Jepsen and an in-depth look at an end-to-end big data infrastructure with Hadoop. On that note, there’s an interesting look into the growing demand for Data Engineers to build out Hadoop infrastructure.


A post on The AWS Big Data Blog covers custom configuration of Elastic MapReduce (EMR) clusters using bootstrap actions. One such bootstrap action, which is presented as an example in the post, installs Presto (the SQL-on-Hadoop system open-sourced by Facebook).


The latest post in a series on frameworks for big data analytics looks at Shark, Hive-on-Spark, and Spark SQL. The post describes the design/architecture of Shark and Spark SQL in detail. Spark SQL has the interesting quality of enabling SQL queries over data that Hive doesn’t know about, such as a local JSON file.


The Hortonworks blog has another set of curated Hadoop Summit content, this time focusing on Apache Hive. They highlight slides and video from seven presentations, which cover ACID for Hive, Hive and Tez, the cost-based query optimizer for Hive, and more.


This is a fantastic article about all the data plumbing/infrastructure that’s required to build a production big data system. There are several parts, each covered in depth—cluster planning (which includes reference architectures), data ingestion (batch ingest, event ingest, storage formats, data partitioning, access control), data processing (data transformation, analytics), egress/querying, and productionization. This is one of the best and most complete guides to what a big data platform with Hadoop should strive towards in order to be successful.


Apache Kafka is gaining popularity as a tool for data ingestion into Hadoop clusters. Unlike other systems, such as Apache Flume or Scribe, Kafka is a pull-based system, which allows for multiple consumers/destinations of data (rather than just HDFS or HBase). This article introduces Kafka and includes an example use-case of using Kafka to flag transactions in a massively multiplayer online game. There’s also an in-depth comparison of Kafka and Flume, which explores the advantages of and trade-offs between the two systems.
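The pull model is what makes multiple independent destinations cheap: each consumer tracks its own position in the log. A toy illustration in plain Python (the structure of the idea, not the Kafka API):

```python
class MiniLog:
    """Toy append-only log illustrating Kafka's pull model: the broker
    keeps an ordered log, and each consumer tracks its own offset, so
    many independent consumers (HDFS, HBase, ...) read at their own pace."""
    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)

    def poll(self, offset, max_records=10):
        """Return a batch starting at `offset` plus the next offset."""
        batch = self.entries[offset:offset + max_records]
        return batch, offset + len(batch)

log = MiniLog()
for m in ["evt-1", "evt-2", "evt-3"]:
    log.append(m)

# two destinations consume independently, at different positions
hdfs_offset = 0
batch, hdfs_offset = log.poll(hdfs_offset, max_records=2)
print(batch, hdfs_offset)  # → ['evt-1', 'evt-2'] 2
```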


The MapR blog has a two-part series on OpenTSDB. The first part introduces the notion of time series data and OpenTSDB’s data model/API. The second article covers backfilling a massive amount of time series data into OpenTSDB. For this, they used MapR-DB (which is compatible with the HBase API) and a modified OpenTSDB that supports bulk importing (the code for these changes is available on GitHub). With these changes, they can load about 110 million points/sec.


This post covers the coding standards for Apache Hadoop. It discusses much more than code style—best practices for everything from concurrency to logging are covered in detail. If you’re planning to submit code to the Hadoop codebase, it’ll be useful to get familiar with these (formerly unwritten) policies and rules.


Hue, the web front-end for Hadoop clusters, is a hybrid Python/Java application. Given that those technologies can pose some challenges in setup, this article has a walkthrough of building a Hue development environment on Ubuntu 14.04.


StackIQ makes tools for managing HPC, cloud, and big data clusters. The StackIQ Cluster Manager integrates with Apache Ambari (using the REST api) for provisioning or adding nodes to a Hadoop cluster. This post walks through the manager’s CLI, rocks, and shows how to use it to do several administrative tasks.


This post explores the internals of the YARN Fair Scheduler. Throughout the post, it explores how the Fair Scheduler differs from the Capacity Scheduler—both in features and implementation. The bulk of the post describes what happens during the scheduler event loop (events such as NODE_ADDED or APP_ADDED).
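As a mental model for that event loop, here is a toy sketch (the real scheduler weighs queues, fair shares, and locality rather than a simple container count, and the event handling below is only illustrative):

```python
class FairSchedulerSketch:
    """Toy event-driven scheduler: react to cluster events and hand
    free capacity to whichever app is farthest below its share."""
    def __init__(self):
        self.nodes = set()
        self.apps = {}  # app id -> containers currently held

    def handle(self, event, payload):
        if event == "NODE_ADDED":
            self.nodes.add(payload)
        elif event == "APP_ADDED":
            self.apps[payload] = 0
        elif event == "NODE_UPDATE" and self.apps:
            # a node heartbeat reports free capacity; give it to the
            # app holding the fewest containers (the "most starved")
            neediest = min(self.apps, key=self.apps.get)
            self.apps[neediest] += 1

sched = FairSchedulerSketch()
events = [("NODE_ADDED", "n1"), ("APP_ADDED", "a1"), ("APP_ADDED", "a2"),
          ("NODE_UPDATE", "n1"), ("NODE_UPDATE", "n1")]
for ev, payload in events:
    sched.handle(ev, payload)
print(sched.apps)  # → {'a1': 1, 'a2': 1}
```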


The folks at SequenceIQ are at it again with integrating parts of the Hadoop ecosystem with Docker. This time, they’ve announced a preliminary Docker image for Apache Drill that allows you to query data in a directory shared from the host machine. This post introduces the key parts of Apache Drill, explains how it’s been integrated with Docker, and provides some examples of how to use Drill in Docker.


Jepsen is a tool to test distributed databases by simulating network partitions and quantifying the database’s consistency and availability. The Call Me Maybe series by @aphyr has looked at a number of databases using this tool, and the Yammer blog looks at a new one—Apache HBase. HBase strives to be consistent in the case of a network partition (as a result availability will suffer), and the results of the Jepsen testing agree with that (be sure to check out the addendum for some clarifications of the results).


This post has several details on comScore’s big data infrastructure. They ingest terabytes of data each day to a 400-node MapR cluster. The post describes some of the other tools that comScore uses, such as SyncSort’s DMX to sort data as it is being loaded (which helps compress the data much more efficiently). In addition to the tools from SyncSort, comScore has a 200-node EMC Greenplum cluster.


This post introduces Accumulo’s server-side programming hooks—Filters, Combiners, and Iterators. While Filters and Combiners are quite simple, one must dabble with Iterators to do more complex operations (such as consuming one type of data but producing another). The post walks through code snippets of a few Iterators, and the full source is available on GitHub.
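In Python terms, Filters and Combiners behave roughly like generators over a sorted key-value stream (a conceptual analogy only—Accumulo's real Iterators are Java classes with a richer seek/next interface):

```python
from itertools import groupby

def filter_iter(kv_stream, predicate):
    """A Filter passes through only the entries matching a predicate."""
    return (kv for kv in kv_stream if predicate(kv))

def combiner_iter(kv_stream, combine):
    """A Combiner merges the values of consecutive identical keys
    (the stream arrives sorted, so groupby sees each key once)."""
    for key, group in groupby(kv_stream, key=lambda kv: kv[0]):
        yield key, combine(v for _, v in group)

data = [("a", 1), ("a", 2), ("b", 5), ("c", 3)]
visible = filter_iter(data, lambda kv: kv[1] >= 2)
print(list(combiner_iter(visible, sum)))  # → [('a', 2), ('b', 5), ('c', 3)]
```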



Two chapters of an upcoming book on Hadoop Security are available in the early preview program from O’Reilly. If you’re thinking of preordering, the Cloudera blog has details on the goals and planned content of the complete book.


Linux.com has the story of Hadoop’s move from SVN to Git. The article includes interviews with several folks from the ASF in which they discuss the motivation for switching (tooling, easier feature branches, easier sharing of code) as well as some of the trade-offs (namely the lack of fine-grained access controls). The article also details the set of steps taken to do the SVN to Git migration.


At the Intel Developer Forum this week, Intel and Cloudera spoke about technical collaboration coming out of their business partnership. Specifically, Cloudera’s distribution has been optimized for the new Intel Xeon E5 v3 processor, which Intel says runs Cloudera software more than 2x faster. Intel also said that they expect Hadoop to be the top application on data center servers within the next couple of years.


TechRepublic has an interview with Peter Cnudde, VP of Engineering at Yahoo, on Hadoop at Yahoo. They talk about massive scale of Yahoo’s Hadoop and YARN deployment, some of the interesting challenges & opportunities this presents, the advantages of Hadoop for enterprises and non-web companies, and how Hadoop (and its ecosystem) fit together with non-Hadoop enterprise data warehouse systems.


Datanami has an article on the growing demand for data engineers—the type of engineer that works with Hadoop to build out core infrastructure like data ingestion and data quality. The article notes that data engineers often work in conjunction with data scientists and that data engineers are quite difficult to find.



Apache Hadoop 2.5.1 was released. The release resolves a handful of issues, including a MapReduce bug related to job ACLs.


Aperture Tiles is an open-source project for visualizing massive data sets. It builds on Apache Hadoop, Apache Spark, Apache Avro, and Apache HBase to provide interactive data exploration.


Apache Spark 1.1.0 was released. The new version includes improvements to MLlib, Spark SQL, PySpark, and Spark Streaming. It also includes improved memory management, several improvements for monitoring Spark jobs, and improved integration with Apache Flume. Both the Spark website and the Databricks blog have more details on the new features.


Apache Cassandra 2.1 was released. A post on the Apache blog touts aspects of the new release including performance improvements (over 50% better), production support for Windows, and the CQL3 tuple and user defined type (UDT).


The Metanautix Quest Data Compute Engine is the latest entry into the SQL-on-Hadoop space. Their offering is commercial and aims to support a wide array of data sources—from data in an Oracle database to data in Amazon S3. More details about the product are in an introductory blog post.



Curated by Mortar Data ( http://www.mortardata.com )



Hadoop Ask Me Anything (Palo Alto) – Tuesday, September 16

#OCBigData Monthly Meetup (Irvine) – Wednesday, September 17

Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) – Wednesday, September 17

Deep Learning for MLlib with Sparkling Water (Mountain View) – Thursday, September 18

Big Data Science Meetup Event (Santa Clara) – Friday, September 19


Analyzing SQL on Hadoop / Living with Polyglot Persistence – SQL vs NoSQL (Overland Park) – Thursday, September 18


Apache Drill (Saint Louis) – Tuesday, September 16

Cloudera, Intel and Hadoop Security & Cisco IoT Living Lab Announcement (Kansas City) – Wednesday, September 17


Apache Hadoop Founder Doug Cutting Speaking on The Future of Data (Urbana) – Monday, September 15

Spark and Storm at Yahoo: Why to Choose One over the Other (Chicago) – Wednesday, September 17

North Carolina

Triad Hadoop Users Group (Winston Salem) – Thursday, September 18


Philly Hadoop Meetup: YARN, the Data Operating System for Hadoop 2.0 (Philadelphia) – Wednesday, September 17


Apache Tez: Accelerating Hadoop Data Processing (Cambridge) – Tuesday, September 16

The Role of Hadoop in the Transformation of the Data Center (Boston) – Thursday, September 18

New York

Storm vs Spark Streaming Face-off (New York) – Wednesday, September 17

Starting with Apache HBase to Create an Advanced in-Hadoop NoSQL Database (New York) – Wednesday, September 17


Ottawa Big Data + HUG Meetup (Gloucester, Ontario) – Wednesday, September 17


SQL and NoSQL on Hadoop – A Look at Performance (London) – Tuesday, September 16

Couchbase and Hadoop, plus Sub-millisecond Response Times with Couchbase (London) – Wednesday, September 17

Scoring Models, plus Apache Drill for Querying Structured and Unstructured Data (London) – Thursday, September 18


SolrCloud, Solr + Hadoop 2, plus Nutch Integration (Bangalore) – Saturday, September 20


Apache Spark (Shenzhen) – Sunday, September 21



Hadoop Weekly Issue #86


07 September 2014

While last week’s issue had posts covering a few common themes, this week’s issue has content for a wide number of topics. Those topics include: Spork (Pig on Spark), Hive (specifically the new Stinger.next initiative), and Presto. There is also some interesting news from established enterprise companies—Teradata has acquired Think Big Analytics, and Cisco has released management and monitoring software for Hadoop.


A project called “Spork” has been working to build a Spark-based execution engine for Pig. This post gives an overview of how the implementation was built, describes how to get started (it’s as simple as pig -x spark), and reports the current status (passing 100% of tests, but not yet merged to Pig trunk).


Hortonworks has announced “Stinger.next,” a continuation of the Stinger Initiative (which had a goal of speeding up Hive by 100x). The project has three goals—speed improvements to support sub-second queries, improved scalability, and improved SQL support for transactions and SQL:2011’s analytic functions. There are a few other enhancements planned, too, like materialized views and streaming ingestion of data. The project is split into three phases, which will deliver in 2H 2014, 1H 2015, and 2H 2015.


This post is a good introduction to Spark’s RDD APIs. It looks at how to translate some mapper and reducer functions from a traditional MapReduce implementation. Some concepts translate directly, but you’ll quickly encounter several new methods like reduceByKey, groupByKey, and flatMap. The post also shows that a simple flow becomes much more complex if you need to do some setup or teardown (à la the MapReduce API’s setup() or cleanup()).
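To make the translation concrete, here is word count with the two key operations mimicked in plain Python (the names follow the RDD API, but this is an illustration, not PySpark):

```python
from itertools import chain

def flat_map(f, xs):
    """RDD.flatMap: apply f to each element and flatten the results."""
    return chain.from_iterable(f(x) for x in xs)

def reduce_by_key(f, pairs):
    """RDD.reduceByKey: merge all values sharing a key with f."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

lines = ["spark replaces mapreduce", "spark streaming"]
pairs = ((word, 1) for word in flat_map(str.split, lines))
print(reduce_by_key(lambda a, b: a + b, pairs))
# → {'spark': 2, 'replaces': 1, 'mapreduce': 1, 'streaming': 1}
```

In real Spark these operations run per-partition across the cluster, with reduceByKey shuffling values to co-locate each key; the semantics, though, are the same as above.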


This blog series is an excellent overview and introduction to Spark, Pig, Hive, and Cloudera Impala. For each, it gives a brief introduction to the computation model and features of the framework. The coverage of Hive includes a walkthrough of new features from the Stinger Initiative—ORCFile, query planner improvements, Tez as a backend, and vectorization (there’s quite a good technical overview of each). The coverage of Impala is also quite interesting—it enumerates several reasons that Impala tends to be faster than Hive (e.g. JVM GC overhead, Impala is better at pipelining, pull vs. push of intermediate data, and more).


This post describes integrating recommendations built by Mahout with a search engine to solve the cold-start problem (i.e., how to make recommendations for a brand-new user). Using preferences collected during registration, the system queries item-similarity data stored in the search engine to create recommendations.
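The lookup itself is simple once the item-similarity data exists. A toy sketch (the item names and in-memory dict are illustrative stand-ins for Mahout’s output stored in a search index):

```python
# Hypothetical item-similarity data, as Mahout might have computed it
# and a search engine might store it: item -> similar items.
similar_items = {
    "sci-fi-novel": ["space-opera", "cyberpunk-novel"],
    "cookbook": ["baking-guide", "wine-atlas"],
}

def recommend(registration_preferences):
    """Query the item-similarity data for each preference a new user
    supplied at registration, merging results and dropping duplicates."""
    recommendations = []
    for item in registration_preferences:
        for similar in similar_items.get(item, []):
            if similar not in recommendations:
                recommendations.append(similar)
    return recommendations

print(recommend(["sci-fi-novel"]))  # ['space-opera', 'cyberpunk-novel']
```

In the real system the dict lookup is a search-engine query, which also handles ranking and scale.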


Qubole’s Presto-as-a-Service exposes Presto, which is a SQL-on-Hadoop system from Facebook, as a hosted service. It operates on data stored in S3, which means that data isn’t local to compute nodes. In order to get better speed, Qubole has architected a caching layer for Presto which supports both in-memory and SSD-based caches. This post explains the implementation, which uses consistent hashing and takes advantage of the kernel’s file caching (rather than building their own in-memory store). The post also has some experimental results, which show a 10-15x speedup by enabling caching and switching to ORCFile (from text).
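The consistent-hashing piece is what keeps a given S3 object cached on the same node as the cluster changes. A minimal sketch of the idea (node names, point counts, and the ring structure are illustrative, not Qubole’s implementation):

```python
import hashlib
from bisect import bisect

def _hash(key):
    # Stable hash so the same key always maps to the same ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each cache node owns several points on
    the ring; a key belongs to the first node clockwise from its hash."""
    def __init__(self, nodes, points_per_node=100):
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key):
        idx = bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
node = ring.node_for("s3://bucket/table/part-00000.orc")
```

The payoff is that adding or removing a node remaps only a small fraction of keys, so most of the cache survives cluster resizes.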


With MapReduce on YARN, there’s no longer a long-lived JobTracker. For inspecting job histories, the MR Job History Server was introduced. Recently, a more general-purpose implementation called the Timeline Server was conceived. The Timeline Server supports frameworks other than MapReduce, such as Tez (Spark and MR support are on the way). This post includes an introduction to the Timeline Server, including an overview of the data it exposes (which is much richer than the MR Job History Server’s).


This post serves as a good example of how easy it is to inadvertently write a poor MapReduce job, even when using a higher-level framework like Hive. Specifically, the post describes the steps taken to discover that a Hive query was inadvertently doing a full cross-product. It also mentions how you might identify the cross-product from the EXPLAIN query plan output.


SequenceIQ recently announced Periscope, a system for auto-scaling YARN clusters to meet SLAs. This post introduces some of the terminology for Periscope—alarms and metrics. It describes some of the metrics, e.g. PENDING_CONTAINERS and LOST_NODES, how to build an alarm using a REST API call, and how to set a scaling policy using the REST API.
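As a rough illustration of what an alarm definition looks like, here is a hypothetical payload. The field names and endpoint below are assumptions for the sketch, not Periscope’s documented API—see the post for the real request format:

```python
import json

# Illustrative alarm: trigger when queued containers stay high.
# All field names here are assumed, not taken from Periscope's docs.
alarm = {
    "alarmName": "pending-containers-high",
    "metric": "PENDING_CONTAINERS",
    "comparisonOperator": "GREATER_THAN",
    "threshold": 40,
    "period": 60,
}
body = json.dumps(alarm)

# An HTTP client would then POST `body` to Periscope's REST endpoint,
# e.g. something like http://periscope-host:8080/alarms (hypothetical).
```

A scaling policy would reference the alarm by name and state how many nodes to add or remove when it fires.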


This post gives a high-level overview of the steps needed to deploy Hadoop for ETL. It discusses using Apache Flume and Apache Sqoop to bring data into Hadoop, introduces the concept of “schema-on-read,” provides suggestions for which frameworks to use to do ETL, and gives an intro to workflow management (suggesting that Apache Oozie is often insufficient).



Databricks, the company founded by the team originally behind Apache Spark, has produced an infographic highlighting some of the progress made by Spark in the past year. It highlights how far the project has come in a year, particularly in terms of community growth (e.g. Spark 0.7 had 31 contributors, Spark 1.0 had 117).


This post describes five influential Google projects that have spawned open-source equivalents. The five projects are Google MapReduce (Apache Hadoop MapReduce), Bigtable (HBase), Borg (Mesos), Chubby (ZooKeeper), and Dremel (Drill).


Infobright and MapR announced a partnership for joint deployment of MapR’s distribution and Infobright’s analytics platform.


DataStax, who sells commercial software for Apache Cassandra, made news this week with a giant Series E round of financing. The deal brings in $106 million (bringing the total investment to $190 million) and values the company at $830 million.


Teradata has acquired Think Big Analytics, a big data enterprise services company. Notably, Think Big Analytics helps companies integrate Hadoop and NoSQL with existing technologies. Thus, industry pundits are seeing this acquisition as Teradata embracing the non-enterprise DW big data ecosystem.



Amazon Web Services announced support for Hive 0.13.1 in their Elastic MapReduce (EMR) Hadoop-as-a-Service offering. In a secondary announcement, EMR has removed the 256-step limit from EMR clusters as part of AMI 3.1.1.


Hivemall version 0.2 was released. It’s a stable release of the machine learning library for Hive, which is built using Hive User-Defined Functions. The 0.2 release follows five pre-releases, and a new 0.3 beta is also available. Hivemall supports functions for classification, regression, item similarity, k-nearest neighbors, and feature engineering. It requires Hive 0.11 or later.


Version 1.1.1 of hbase-client, the async HBase client for Node.js, was released. The release supports HBase 0.94.x.


Cisco has announced UCS Director Express for Big Data, an automation and configuration tool for Hadoop services. It also provides monitoring of physical hardware alongside Hadoop services.


The folks at SequenceIQ have published Docker images for running Apache Phoenix (if you’re not familiar, Phoenix is a SQL engine that runs atop Apache HBase). The post describes how to launch a container running Phoenix 4.1 on HBase 0.98.5, create some tables, and connect to the database using both the SQL shell and JDBC.


WANdisco released version 1.9.6 of Non-Stop Hadoop. The new version adds zoning (virtual clusters or sub-clusters) and support for rolling upgrades.


Qubole, who offers Presto (the SQL-on-Hadoop engine) as a service, has added a new auto-scaling feature. By inspecting statistics kept by Presto, the system determines when the cluster is under- or over-provisioned and automatically adds or removes nodes.
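The core decision loop of such a feature is simple to sketch. The thresholds and statistics below are illustrative only—Qubole’s actual policy inspects Presto’s internal query statistics:

```python
def scaling_decision(pending_queries, idle_nodes, cluster_size,
                     min_nodes=2, max_nodes=20):
    """Toy auto-scaling decision: grow when work is queued, shrink when
    nodes sit idle, always staying within fixed cluster-size bounds.
    The inputs stand in for statistics a real system would poll."""
    if pending_queries > 0 and cluster_size < max_nodes:
        return "add_node"
    if idle_nodes > 0 and cluster_size > min_nodes:
        return "remove_node"
    return "no_change"

print(scaling_decision(pending_queries=5, idle_nodes=0, cluster_size=4))
# add_node
```

A production system layers cooldown periods and graceful node draining on top of a decision function like this, so that scale-downs don’t kill in-flight queries.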



Curated by Mortar Data ( http://www.mortardata.com )



SD Big Data Monthly Meetup #3 (San Diego) – Wednesday, September 10

September SF Hadoop Users Meetup (San Francisco) – Wednesday, September 10

Real-Time Analytics with Storm by Ron Bodkin of Think Big Analytics (Los Angeles) – Thursday, September 11


Hunk for Hadoop (Houston) – Wednesday, September 10


Monthly SEMOP meeting (Southfield) – Tuesday, September 9


Cleveland Big Data at the Great Lakes Science Center (Cleveland) – Monday, September 8


Sep 11: Igniting Data Analysis with Apache Spark by Ryan Gimmy (Reston) – Thursday, September 11

North Carolina

Nikhil Kumar (SyncSort) on Converting SQL to MapReduce (Durham) – Tuesday, September 9

New Jersey

RDBMS on Hadoop? Yes, talk and hands-on session from Splice Machine (Flemington) – Tuesday, September 9

New York

Get Hands-on with Big SQL on Hadoop (New York) – Wednesday, September 10

FREE EVENT! Hadoop and Mainframes: Crazy, or Crazy Like a Fox? (New York) – Wednesday, September 10

Machine Learning on the Azure Cloud Platform (New York) – Friday, September 12


A Detailed Look at R on Hadoop (Moscow) – Thursday, September 11


Apache Spark – In Memory Map-Reduce (Hyderabad) – Saturday, September 13


Read More…

HDFS File System Commands

The FileSystem (FS) shell is invoked by “hadoop fs <args>”. All FS shell commands take path URIs as arguments. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional; if not specified, the default scheme specified in the configuration […]
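The scheme/authority defaulting can be sketched in a few lines. The namenode address below is a made-up stand-in for the value the FS shell reads from its configuration (fs.defaultFS):

```python
from urllib.parse import urlparse

# Hypothetical config values; in Hadoop these come from fs.defaultFS.
DEFAULT_SCHEME = "hdfs"
DEFAULT_AUTHORITY = "namenode:8020"

def resolve(path):
    """Fill in the scheme and authority the way the FS shell does when
    they are omitted from a path URI."""
    parsed = urlparse(path)
    if parsed.scheme:
        return path  # already fully qualified, e.g. hdfs:// or file://
    return f"{DEFAULT_SCHEME}://{DEFAULT_AUTHORITY}{parsed.path}"

print(resolve("/user/alice/data"))
# hdfs://namenode:8020/user/alice/data
```

So “hadoop fs -ls /user/alice/data” and “hadoop fs -ls hdfs://namenode:8020/user/alice/data” refer to the same directory when the default filesystem points at that namenode.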

Read More…


Appfluent provides IT organizations with unprecedented visibility into their Big Data systems to reduce costs. Appfluent helps companies put the right workload on the right system, across data warehouses, business intelligence, and Hadoop. Seeing every analytic and Extract, Transform and Load (ETL) that hits Big Data systems, Appfluent can see where organizations are wasting expensive […]

Read More…