Hadoop Weekly Issue #80

27 July 2014

Two large pieces of news this week: HP and Hortonworks announced a $50 million investment in Hortonworks as part of an expanded partnership, and Apache Tez graduated from the Apache Incubator. Additionally, there were a number of interesting technical posts this week on Pig, MapR FS, SQL on Hadoop, HDFS, and more.

Technical

The Hortonworks blog has a post highlighting some of the new features of the recently released Apache Pig 0.13. The 0.13 release adds preliminary support for multiple backends (i.e. execution engines other than MapReduce, such as Tez or Spark). The post talks about several new features, including new optimizations for small jobs, the ability to whitelist/blacklist certain operators, a user-level jar cache, and support for Apache Accumulo.

http://hortonworks.com/blog/announcing-apache-pig-0-13-0-usability-improvements-progress-toward-pig-tez/

A post on the Pythian blog discusses how the small files problem, which is well-understood with HDFS and MapReduce, can also affect MapR FS in certain situations. It gives a brief overview of the MapR FS architecture, describes the problem, and suggests some best practices.

http://www.pythian.com/blog/small-files-on-mapr-fs/

As the number of projects in the Hadoop ecosystem grows, understanding how all the pieces fit together becomes more challenging. This post from the Rackspace blog tries to bucket the various components into six areas, and it gives a good introduction to each, aimed at beginners.

http://www.rackspace.com/blog/new-to-hadoop-heres-a-handy-guide-to-get-you-started-part-1/

This post on the Sonra blog is one of the most comprehensive and up-to-date overviews of the SQL-on-Hadoop space that I’ve seen. It covers all the latest announcements such as Hive on Spark and Spark SQL. The post also goes into details on Hive on Tez, Cloudera Impala, Presto, Apache Drill, and InfiniDB.

http://sonra.io/war-of-the-hadoop-sql-engines-and-the-winner-is/

Testing distributed systems can be very hard, but there are good tools for doing so such as the Jepsen test framework. This post looks at applying a Jepsen test to HDFS High Availability via the Quorum Journal Manager. Results show that HDFS performs consistently under a network partition, although availability can suffer (as is expected).

https://www.growse.com/2014/07/18/partition-tolerance-and-hadoop-part-1-hdfs/

This post serves as an updated guide for running MapReduce jobs that read from and write to Cassandra. It includes sample code for configuring the input and output formats, building the MapReduce job, and generating Cassandra Mutation objects to update the output database.

http://www.orpiske.net/2014/07/using-apache-cassandra-with-apache-hadoop/
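
For a feel for what that configuration involves, here is a minimal sketch of the job setup using the ColumnFamilyInputFormat/ColumnFamilyOutputFormat classes that ship with Cassandra; the keyspace, column family, and host values below are placeholders, and the linked post is the authoritative walkthrough.

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraJobSetup {
        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "cassandra-example");
            job.setJarByClass(CassandraJobSetup.class);

            // Read rows from the input column family.
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                    "demo_keyspace", "input_cf");
            // A SlicePredicate is also normally set here via
            // ConfigHelper.setInputSlicePredicate(...) to choose which columns to read.

            // Write results back as List<Mutation> values keyed by row key (ByteBuffer).
            job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
            job.setOutputKeyClass(ByteBuffer.class);
            job.setOutputValueClass(List.class);
            ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.Murmur3Partitioner");
            ConfigHelper.setOutputColumnFamily(job.getConfiguration(),
                    "demo_keyspace", "output_cf");

            // Mapper and Reducer classes (which emit the Mutation lists) are omitted here.
            return job;
        }
    }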

This presentation gives an overview of structor, which is a tool for building virtual Hadoop clusters with Vagrant. It describes the system architecture, which uses Puppet for provisioning Hadoop components. It also details the various configuration options and instructions for using the tool.

http://www.slideshare.net/oom65/structor-automated-building-of-virtual-hadoop-clusters

Flambo is a recently open-sourced Clojure DSL for Apache Spark. This post serves as a detailed introduction to the API by walking through how to generate TF-IDF for an example dataset.

http://yieldbot.com/index.php?p=blog/tf-idf-using-flambo

The Apache blog has a post detailing the Apache Sentry project, which aims to offer fine-grained access control to data stored in Hadoop. This post looks at the Hive integration in particular, but there are also integrations with Cloudera Impala and Apache Solr. It discusses the authorization primitives such as privileges, roles, and groups as well as the policy engine and policy provider components.

https://blogs.apache.org/sentry/entry/apache_sentry_architecture_overview

Datanami has an article discussing enforcing SLAs on Hadoop clusters. It focuses on Pepperdata’s product offering, which does real-time monitoring of a cluster to do fine-grained enforcement of SLAs. Hadoop systems (like the fair/capacity schedulers) can be a bit coarse in enforcing SLAs, which causes some folks to go to extremes to guarantee SLAs (like building dedicated clusters). If you’re in this situation, you might want to hear more about Pepperdata.

http://www.datanami.com/2014/07/23/enforcing-hadoop-slas-big-yarn-world/

The Pinterest blog has a post about their big data infrastructure that ingests 20 terabytes of new data per day for a total of around 10 petabytes. Pinterest is entirely in AWS and using S3 for storage. They use the Hive metastore as a source of truth, and they migrated from Amazon EMR to Qubole’s service (from which they’ve seen major benefits). The post also details how they provision the instances in a Hadoop cluster.

http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest

The SequenceIQ blog has a post on the YARN Capacity scheduler. It explores the internals of the scheduler, including the configuration and scheduler event loop. It takes a detailed look into each of the types of SchedulerEvents (e.g. node added/removed, app added/removed) that change the state of the scheduler.

http://blog.sequenceiq.com/blog/2014/07/22/schedulers-part-1/
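
As a rough mental model of the event loop the post describes, the scheduler is essentially a dispatcher over event types. The toy sketch below mirrors the names of a few of YARN's SchedulerEventType values, but it is only an illustration of that shape, not the actual CapacityScheduler code.

    public class SchedulerEventLoopSketch {

        // Mirrors a subset of the event types the post walks through.
        enum SchedulerEventType {
            NODE_ADDED, NODE_REMOVED, NODE_UPDATE, APP_ADDED, APP_REMOVED
        }

        static void handle(SchedulerEventType type) {
            switch (type) {
                case NODE_ADDED:
                    System.out.println("register the node and add its capacity to the cluster resource");
                    break;
                case NODE_REMOVED:
                    System.out.println("deduct the node's capacity and handle its running containers");
                    break;
                case NODE_UPDATE:
                    System.out.println("process the heartbeat: completed containers, then new allocations");
                    break;
                case APP_ADDED:
                    System.out.println("submit the application to its queue");
                    break;
                case APP_REMOVED:
                    System.out.println("clean up the application's state in its queue");
                    break;
            }
        }

        public static void main(String[] args) {
            for (SchedulerEventType t : SchedulerEventType.values()) {
                handle(t);
            }
        }
    }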

This post describes document-level security for Cloudera Search, which is a new feature of CDH 5.1. Implemented with Apache Sentry, the feature uses a Solr SearchComponent to add filterQueries based on the roles associated with a particular query.

http://blog.cloudera.com/blog/2014/07/new-in-cdh-5-1-document-level-security-for-cloudera-search/

In the second part in a series summarizing broad concepts from Hadoop Summit, the Hortonworks blog has a post about YARN. It discusses several themes that came out of the Summit regarding YARN, and it highlights seven related presentations.

http://hortonworks.com/blog/apache-hadoop-yarn/

News

Hortonworks and HP announced that they’re deepening their partnership, and HP is investing $50 million in Hortonworks. This investment joins the $100 million round that Hortonworks announced in March.

http://recode.net/2014/07/24/hp-makes-50-million-strategic-investment-in-hortonworks/

Apache Tez was promoted to a top-level project this week by the Apache Software Foundation. Tez entered the incubator in February 2013, and has seen contributions from employees of several companies, including Cloudera, Facebook, Hortonworks, LinkedIn, Microsoft, Twitter, and Yahoo.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces60

MapR and Tata Consultancy Services announced a partnership this week. The two companies are offering joint products based on TCS’s data analytics/management solutions and MapR’s distribution.

http://www.mapr.com/blog/sensing-more-reasons-celebrate-tata-consultancy-services-partners-mapr

GigaOm has a post about the rise of Spark and Tez as evolutionary replacements for MapReduce. It talks about how these frameworks fit in with YARN, Hive, and Pig, and the history of both frameworks.

http://gigaom.com/2014/07/20/spark-and-tez-out-of-phase/

The Gartner blog has a recap of some of the Hadoop-related investments that took place this week. It puts them into the context of the wider DBMS/IT industry and adds some color to the HP investment in Hortonworks. It also discusses the push for global sales/support in many of these moves.

http://blogs.gartner.com/merv-adrian/2014/07/24/hadoop-investments-continue-teradata-hp-jockey-for-position/

Releases

The Cloudera Oryx project is a system for real-time machine learning. This week, a reboot of the project, Oryx 2, was announced. The new version implements the lambda architecture for large-scale machine learning, using Apache Spark for both the batch and speed layers (the latter via Spark Streaming).

https://github.com/OryxProject/oryx

Oink is a gateway server to Apache Pig/Hadoop providing a REST API. Built at eBay, it was open-sourced this week. The main design goals include governance, scalability, and change management.

http://www.ebaytechblog.com/2014/07/22/oink-making-pig-self-service/

Avro 1.7.7 was released. The new version includes a Perl implementation of Avro, support for a DECIMAL type, schema validation utilities for Java, and more. It also contains several bug fixes.

http://mail-archives.apache.org/mod_mbox/avro-user/201407.mbox/%3CCALEq1Z8Ei0GaAAygNbHB_o0ieZ7d55FmNjPU3H9z4iLVrbfLKA%40mail.gmail.com%3E

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Hadoop Talk: Details of Anomaly Detection in Big Data (San Jose) – Monday, July 28
http://www.meetup.com/SF-Bay-ACM/events/183069232/

Big Data, Docker, and Apache Mesos (San Francisco) – Wednesday, July 30
http://www.meetup.com/Bay-Area-Mesos-User-Group/events/196882692/

Spark Machine Learning Bonanza (Sunnyvale) – Wednesday, July 30
http://www.meetup.com/spark-users/events/195574482/

Washington

Seattle Scalability Meetup: Eastside Edition (Seattle) – Wednesday, July 30
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/events/174605462/

Minnesota

Inaugural Elasticsearch Meetup (Minneapolis) – Thursday, July 31
http://www.meetup.com/Elasticsearch-User-Group-Minneapolis/events/195301272/

Wisconsin

An Introduction to Apache Spark and Mesos (Madison) – Tuesday, July 29
http://www.meetup.com/BigDataMadison/events/181286042/

Illinois

A Leap Forward for SQL on Hadoop (Chicago) – Wednesday, July 30
http://www.meetup.com/Big-Data-Developers-in-Chicago/events/189788472/

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS (Chicago) – Wednesday, July 30
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

Maryland

Social Text-Analytics and Visualization Using Hadoop & Streams Computing (Bethesda) – Tuesday, July 29
http://www.meetup.com/Big-Data-Developers-in-DC/events/194902582/

North Carolina

Rethinking SQL for Big data – Don’t Compromise on Flexibility or Performance (Durham) – Tuesday, July 29
http://www.meetup.com/TriHUG/events/185859612/

July CHUG: Matt Jones (CTS) on Protecting PII in the Hadoop/Analytics World (Charlotte) – Wednesday, July 30
http://www.meetup.com/CharlotteHUG/events/167351172/

Georgia

Hadoop Demystified (Alpharetta) – Monday, July 28
http://www.meetup.com/Atlanta-Net-User-Group/events/193980882/

Florida

Centralized Logging – Industry First Approach to HBase Fans (Jacksonville) – Tuesday, July 29
http://www.meetup.com/HUGNOFA/events/184997382/

CANADA

Presentation Corner – Couchbase & Query Engines in Spark (Toronto) – Monday, July 28
http://www.meetup.com/TorontoHUG/events/191410172/

Introduction to Apache Hive (Ottawa) – Thursday, July 31
http://www.meetup.com/HadoopOttawa/events/196236692/

AUSTRALIA

Hadoop 101 – Beginners Only! (Melbourne) – Tuesday, July 29
http://www.meetup.com/Big-Data-in-Practice/events/193112952/

NEW ZEALAND

Spatial and Hadoop Integration with Netezza (Auckland) – Thursday, July 31
http://www.meetup.com/Auckland-Netezza-Meetup/events/195573292/

[…]

Hadoop Interview Questions

1. What is Hadoop framework? Ans: Hadoop is an open source framework written in Java by the Apache Software Foundation. This framework is used to write software applications which need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters which could have 1000 […]

Hadoop Weekly Issue #79

20 July 2014

This week is full of releases and new products—ranging from Oracle’s new Hadoop-SQL product to a new CDH 5.1 release from Cloudera to new tools for transactions on HBase from Continuuity and deploying Hadoop-as-a-Service from SequenceIQ. There are also a number of quality technical articles covering Spark, Kafka, Luigi, and Hive.

Technical

This post covers using the Transformer class to manipulate data as it flows into Sqrrl Enterprise. It details loading the Enron email dataset and using a Transformer to build a graph of users sending email. It includes the code for this Transformer and also some examples of querying the dataset using tools found in Sqrrl Enterprise.

http://blog.sqrrl.com/bulk-loading-in-sqrrl-pt.2-custom-transformers-for-graph-construction

The Databricks blog has the first post in a series on some of the new features of MLlib in Spark 1.0. This post focuses on Spark’s improved support for sparse datasets (both storage and performance improvements). The post has some code examples for PySpark and suggestions for when sparse representations work best.

http://databricks.com/blog/2014/07/16/new-features-in-mllib-in-spark-1-0.html
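
The gist is that a sparse vector stores only its non-zero indices and values. Here is a small illustration using MLlib's Java API (rather than the PySpark shown in the post):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class SparseVectorExample {
        public static void main(String[] args) {
            // A length-10 feature vector with non-zero entries only at indices 1 and 7.
            Vector sparse = Vectors.sparse(10, new int[] {1, 7}, new double[] {2.5, 0.3});

            // The equivalent dense representation stores all ten doubles.
            Vector dense = Vectors.dense(0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0);

            // Labeled training examples can wrap either representation.
            LabeledPoint example = new LabeledPoint(1.0, sparse);
            System.out.println(example + " vs dense: " + dense);
        }
    }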

Jay Kreps (LinkedIn, Kafka architect) recently spoke at Cloudera on Apache Kafka. The Cloudera blog has a summary of his talk, which describes the goals and design of Kafka. The slides for the presentation are also available.

http://blog.cloudera.com/blog/2014/07/jay-kreps-apache-kafka-architect-visits-cloudera/

Luigi, the open-source workflow engine from Spotify, is the dark horse in Hadoop workflow engines. This presentation provides a great introduction and overview of Luigi. If you’re unhappy with your current engine, I suggest you give it a look.

https://speakerdeck.com/rantav/luigi

The Databricks Cloud is a new product announced at the Spark Summit. This post motivates the product (e.g. deploying Hadoop can take a long time) and describes its components. In addition to hosted Spark clusters, the product includes notebooks, dashboards, and a job launcher. There is also a plan for integrating third-party applications.

http://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html

This post describes how to use Apache Spark for Monte Carlo simulations. It uses the simulations to estimate a financial statistic called value at risk (VaR). The post describes VaR, Monte Carlo simulations, and the Spark program to calculate the value. It includes some example code (the Monte Carlo code is being added to Spark’s MLlib, but isn’t yet integrated).

http://blog.cloudera.com/blog/2014/07/estimating-financial-risk-with-apache-spark/
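
The general pattern (distribute trial seeds, simulate returns in parallel, then take a low percentile of the results) can be sketched with the plain Spark Java API. This is an illustration of the approach with a toy return model, not the code from the post:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class MonteCarloVaR {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("monte-carlo-var").setMaster("local[*]"));

            // One seed per simulated market scenario.
            int numTrials = 100000;
            List<Long> seeds = new ArrayList<Long>();
            for (long i = 0; i < numTrials; i++) {
                seeds.add(i);
            }

            // Run the trials in parallel; each returns a simulated portfolio P&L.
            JavaRDD<Double> trialReturns = sc.parallelize(seeds, 100).map(
                new Function<Long, Double>() {
                    public Double call(Long seed) {
                        Random rng = new Random(seed);
                        // Toy model: normally distributed return on a $1M portfolio.
                        return 1000000.0 * 0.02 * rng.nextGaussian();
                    }
                });

            // 95% VaR is roughly the loss at the 5th percentile of simulated returns.
            List<Double> sorted = new ArrayList<Double>(trialReturns.collect());
            Collections.sort(sorted);
            double var95 = -sorted.get((int) (0.05 * sorted.size()));
            System.out.println("Estimated 95% VaR: " + var95);

            sc.stop();
        }
    }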

The Hortonworks blog has a post on supporting incremental updates for data stored in Hive. Rather than doing SQL UPDATE statements (which Hive does not yet support), the post describes using a base table and an incremental table, which contains updates to the base. These two tables are then reconciled with a Hive VIEW. The post has many more details on how to implement this scenario, including how to use Sqoop to load incremental data.

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
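
To make the base-plus-incremental idea concrete, here is a hedged sketch of what such a reconciliation view can look like, issued over Hive's HiveServer2 JDBC driver. The table and column names (id, modified_date) are invented for the example; the post lays out the full four-step strategy.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ReconcileView {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // For each id, keep only the most recently modified row across the
            // base table and the incremental (newly imported) table.
            stmt.execute(
                "CREATE VIEW IF NOT EXISTS reconcile_view AS " +
                "SELECT t.* FROM " +
                "  (SELECT * FROM base_table UNION ALL SELECT * FROM incremental_table) t " +
                "JOIN " +
                "  (SELECT id, MAX(modified_date) AS max_modified FROM " +
                "     (SELECT * FROM base_table UNION ALL SELECT * FROM incremental_table) u " +
                "   GROUP BY id) latest " +
                "ON t.id = latest.id AND t.modified_date = latest.max_modified");

            stmt.close();
            conn.close();
        }
    }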

Another post on the Hortonworks blog covers integrating Kerberos for Hadoop with Active Directory. It details the steps to set up a Kerberos KDC, use Apache Ambari to enable security on the Hadoop cluster, enable the Kerberos domain and trust in Active Directory, and enable security in Hue.

http://hortonworks.com/blog/enabling-kerberos-hdp-active-directory-integration/

News

SQL-on-Hadoop vendor Hadapt was acquired by Teradata. The deal is rumored to have been worth $50M, and Teradata is supposedly increasing the size of their Boston (the location of Hadapt) office.

http://betaboston.com/news/2014/07/16/source-hadapt-acquired-by-teradata-will-add-lead-to-more-employees-in-boston/

Cloudera announced that they’re starting a three-day course called “Cloudera Developer Training for Apache Spark.” The course kicks off in August and costs $2295.

http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/07/16/cloudera-announces-new-apache-spark-training-course-for-big-data.html

A team of Cloudera employees are working together on a new book entitled “Hadoop Application Architectures.” In early release, the first two chapters covering data modeling and data movement are available via O’Reilly.

http://blog.cloudera.com/blog/2014/07/the-new-hadoop-application-architectures-book-is-here/

This post talks about some of the reasons that Spark is all the rage right now. Based on a talk by MapR CTO M.C. Srivas at Spark Summit, it covers some advantages of Spark and several use-cases that MapR is seeing for Spark. It also discusses some of the advantages that Spark has over MapReduce for real-time computation.

http://www.mapr.com/blog/why-spark-hadoop-matters

Videos from the talks at Spark Summit (which took place earlier this month) have been posted on the conference website. Talks cover three tracks—Applications, Developer, and Data Science. There are also a number of keynotes from both days.

http://spark-summit.org/2014/agenda

Releases

Oracle announced Oracle Big Data SQL this week for running queries against data stored across an Oracle Database, a NoSQL data store, and Hadoop. A post on the DBMS2 blog has more details on the implementation (and how it isn’t SQL-on-Hadoop as is commonly understood).

http://www.zdnet.com/oracle-big-data-sql-lines-up-database-with-hadoop-nosql-frameworks-7000031564/
http://www.dbms2.com/2014/07/15/the-point-of-predicate-pushdown/

Another big vendor announced a SQL and Hadoop integration recently. Datanami has coverage of Trafodion, a recently announced ANSI-compatible SQL project from HP. Trafodion runs atop HBase, aims to support OLTP, and is open-source (at trafodion.org).

http://www.datanami.com/2014/07/14/hp-throws-trafodion-hat-oltp-hadoop-ring/

Cloudera Enterprise 5.1 was released. CDH 5.1 includes HBase 0.98.1, Spark 1.0, Sentry 1.3, Impala 1.4.0, HUE 3.6, and more. A post on the Cloudera blog discusses some of the security-related improvements. Among them, Cloudera Manager now has an automated workflow for securing a non-secure cluster with Kerberos, HBase has gained cell-level access control, and HDFS has extended ACLs. The full post has more details on Cloudera’s grand vision on security as well as how they’ve integrated the Gazzang offering into Cloudera Navigator.

http://blog.cloudera.com/blog/2014/07/cloudera-enterprise-5-1-is-now-available/
http://vision.cloudera.com/cloudera-enterprise-5-1-continues-to-improve-security-and-performance-in-hadoop/

spark-cassandra-csv is a command-line tool for loading CSV files into Cassandra using Spark.

https://github.com/RussellSpitzer/spark-cassandra-csv

Version 0.15.0 of the Kite SDK was released this week. The release contains updates to the Datasets API, several updates to the morphlines library, improved documentation, and more.

http://kitesdk.org/docs/0.15.0/release_notes.html

Cloudera announced support for Apache Accumulo 1.6.0. The release is compatible with both CDH 5 (5.1+) and CDH 4 (4.6+).

http://community.cloudera.com/t5/Release-Announcements/Announcing-Apache-Accumulo-1-6-0-support-for-CDH-4-and-CDH-5/m-p/15366

Continuuity announced a new open-source project called Tephra. Tephra is a distributed transaction engine for HBase and Hadoop (and is extensible to support other systems like MongoDB). Transactional secondary indexes for HBase are a key use-case that the introductory post highlights.

http://blog.continuuity.com/post/92085524375/meet-tephra-an-open-source-transaction-engine

The SequenceIQ blog has been quite active discussing Hadoop and Docker. This week, they announced Cloudbreak, which provides a cloud-agnostic Hadoop-as-a-Service API using Docker to provision Hadoop. The system also uses Apache Ambari, Serf, and dnsmasq. Cloudbreak has a UI, API, CLI, and a REST client. Code is available on GitHub, and you can sign up for Cloudbreak on the SequenceIQ website.

http://blog.sequenceiq.com/blog/2014/07/18/announcing-cloudbreak/

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Meetup at Cloudera (Palo Alto) – Tuesday, July 22
http://www.meetup.com/Bay-Area-Bigtop-Meetup/events/195296762/

Enterprise Security for Apache Hadoop: Finding and Filling the Gaps (Sunnyvale) – Wednesday, July 23
http://www.meetup.com/SF-Bay-Areas-Big-Data-Think-Tank/events/192432692/

Accelerate Big Data Application Development with Cascading (San Francisco) – Tuesday, July 22
http://www.meetup.com/DevBrill-Developers-Meetup-SF-Bay-Area/events/190470222/

All-Day Event : “Foundations of Big Data” (San Diego) – Thursday, July 24
http://www.meetup.com/sdbigdata/events/191661622/

Datameer & Cloudera Presents the Big Data Analytics City Tour (San Francisco) – Thursday, July 24
http://www.meetup.com/Datameer/events/193993012/

Introduction to Apache Spark for Enterprise Architects (Mountain View) – Thursday, July 24
http://www.meetup.com/SVForum-SoftwareArchitecture-PlatformSIG/events/194780892/

Oregon

Impala: MPP SQL Engine for Apache Hadoop & Kite SDK: It’s for Developers (Portland) – Wednesday, July 23
http://www.meetup.com/Hadoop-Portland/events/194930422/

Texas

Introduction to Spark Course: Intro to Shark (3 of 7) (Austin) – Wednesday, July 23
http://www.meetup.com/Austin-ACM-SIGKDD/events/187688372/

Minnesota

Hadoop for Newbies (Saint Paul) – Thursday, July 24
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/193800622/

Virginia

Cloudera, Hortonworks, MapR, and Pivotal Come Together to Discuss Apache Spark (Arlington) – Tuesday, July 22
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/190122942/

CANADA

Hands-on Workshop on Distributed Machine Learning and Computing with Spark (Vancouver, B.C.) – Saturday, July 26
http://www.meetup.com/Vancouver-Spark/events/178126142/

ISRAEL

Interactive SQL-on-Hadoop: from Impala to Hive/Tez to Spark SQL to JethroData (Tel Aviv) – Monday, July 21
http://www.meetup.com/Big-Data-Israel/events/189161122/

CHINA

Hadoop Just Got a Lot Sexier – Spark on YARN (Shanghai) – Monday, July 21
http://www.meetup.com/Shanghai-Data-Science/events/192548512/

SPAIN

Spark, the Most Active Apache Project in Big Data (Madrid) – Wednesday, July 23
http://www.meetup.com/Madrid-Apache-Spark-meetup/events/195241462/

GERMANY

Michael Hausenblas: Lambda Architecture with Spark (Berlin) – Thursday, July 24
http://www.meetup.com/Big-Data-Beers/events/189314292/

INDIA

How YARN Made Hadoop Better (Hyderabad) – Saturday, July 26
http://www.meetup.com/hyderabad-scalability/events/181714212/

[…]

Hadoop Weekly Issue #78

13 July 2014

This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.

Technical

The Pivotal blog has a post on setting up Pivotal HD, HAWQ (for data warehousing), and GemFire XD (an in-memory data grid) inside of VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the blog has more info on the configuration and the tools installed as part of the setup.

http://blog.gopivotal.com/pivotal/products/1-command-15-minute-install-hadoop-in-memory-data-grid-sql-analytic-data-warehouse

Datanami has a post about how Concur, who provides expense reporting software, is implementing Hadoop. They’re running a 40-node CDH cluster and currently using it for classification of expense report items and personalized recommendations. The post is full of anecdotes about their Hadoop rollout that will be useful for anyone in a similar situation.

http://www.datanami.com/2014/07/07/hadoop-remaking-travel-expense-reporting-concur/

The Cloudera Kite SDK provides tools and APIs for working with the components of the Hadoop ecosystem. One of these tools is Morphlines, which aims to streamline ETL. This two-part article talks about how to use Morphlines to validate records from a text file and save them into a Hive table. It goes through the Morphlines configuration file options and describes the steps of the process.

http://techidiocy.com/cloudera-kite-morphlines-getting-started-example/
http://techidiocy.com/anatomy-configuration-file-cloudera-kite-morphlines/
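
The articles center on the morphline configuration file itself; for orientation, embedding a compiled morphline in a Java driver looks roughly like the sketch below (the config file name, morphline id, and input path are placeholders).

    import java.io.File;
    import java.io.FileInputStream;

    import org.kitesdk.morphline.api.Command;
    import org.kitesdk.morphline.api.MorphlineContext;
    import org.kitesdk.morphline.api.Record;
    import org.kitesdk.morphline.base.Compiler;
    import org.kitesdk.morphline.base.Fields;
    import org.kitesdk.morphline.base.Notifications;

    public class MorphlineRunner {
        public static void main(String[] args) throws Exception {
            // Compile the morphline named "morphline1" from a local config file.
            MorphlineContext context = new MorphlineContext.Builder().build();
            Command morphline = new Compiler().compile(
                    new File("validate-and-load.conf"), "morphline1", context, null);

            // Feed one input file through the command chain as an attachment.
            Record record = new Record();
            record.put(Fields.ATTACHMENT_BODY, new FileInputStream("input.txt"));

            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            System.out.println("record accepted: " + success);
        }
    }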

The Qubole blog has an article on best practices when working with Apache Hive. It covers how to organize your data on the file system (partitioning and bucketing), choosing serialization formats, configuration parameters to get the most out of Hive (parallel execution and vectorization), and more.

http://www.qubole.com/hive-best-practices/
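
As a flavor of what those practices look like in DDL and session settings, here is a generic sketch over the HiveServer2 JDBC driver (the table layout and bucket count are illustrative, not taken from the post):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveTableSetup {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Partition on a coarse column, bucket on a high-cardinality one,
                // and store the data in a columnar format (ORC here).
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS page_views (" +
                    "  user_id BIGINT, url STRING, ts TIMESTAMP) " +
                    "PARTITIONED BY (dt STRING) " +
                    "CLUSTERED BY (user_id) INTO 32 BUCKETS " +
                    "STORED AS ORC");

                // Two of the settings the article mentions: parallel stage execution
                // and vectorized query execution.
                stmt.execute("SET hive.exec.parallel=true");
                stmt.execute("SET hive.vectorized.execution.enabled=true");
            }
        }
    }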

This post covers PigPen, which is a MapReduce library for Clojure open-sourced by Netflix. It walks through some background on Hadoop, Apache Pig (which serves as the execution engine for PigPen), and Clojure. It also gives a brief introduction to Cascading and related projects (such as Pattern, Lingual, and Driven), and how these compare to the Pig-based stack that Netflix uses. Finally, it goes through some examples of PigPen jobs.

http://bugra.github.io/work/notes/2014-07-09/pigpen-hadoop-pig-clojure-cascading/

In the third part of their series on Apache Oozie, Altiscale has a number of tips for working with the workflow engine. The six tips mostly cover aspects of submitting and running jobs with Oozie.

https://www.altiscale.com/apache-oozie-tips-tricks/

Hortonworks has curated a list of presentations covering Hadoop operations from the recent Hadoop Summit. Slides and videos for each presentation are available via the Summit archive.

http://hortonworks.com/blog/apache-hadoop-operations-scale/

The Cloudera blog has a post on analyzing time-series data with Apache Crunch. The article covers generating Avro-serialized time-series data from Sequence Files (including the event time series Avro schema), doing some simple analysis with the Crunch API (e.g. finding min, max, and counts), and doing a cross-join for multivariate analysis. The code for the post is available on GitHub.

http://blog.cloudera.com/blog/2014/07/how-to-build-advanced-time-series-pipelines-in-apache-crunch/
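
The post's own code is on GitHub; for orientation, a stripped-down Crunch pipeline that computes a per-series maximum might look roughly like this (the CSV layout and paths are invented for the example):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.fn.Aggregators;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class SeriesMax {
        public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(SeriesMax.class);
            PCollection<String> lines = pipeline.readTextFile("/data/timeseries.csv");

            // Parse "seriesId,timestamp,value" lines into (seriesId, value) pairs.
            PTable<String, Double> values = lines.parallelDo(
                new DoFn<String, Pair<String, Double>>() {
                    @Override
                    public void process(String line, Emitter<Pair<String, Double>> emitter) {
                        String[] parts = line.split(",");
                        emitter.emit(Pair.of(parts[0], Double.parseDouble(parts[2])));
                    }
                },
                Writables.tableOf(Writables.strings(), Writables.doubles()));

            // Maximum value per series; MIN_DOUBLES() and counts follow the same pattern.
            PTable<String, Double> maxPerSeries =
                values.groupByKey().combineValues(Aggregators.MAX_DOUBLES());

            pipeline.writeTextFile(maxPerSeries, "/data/series-max");
            pipeline.done();
        }
    }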

The Databricks Cloud was announced at the Spark Summit last week. This post highlights some of the interesting features of the product, including dashboarding and real-time processing. As highlighted in the post, the Databricks Cloud makes it very easy to build products from data.

http://gradientflow.com/2014/07/12/databricks-cloud-makes-it-easier-to-build-data-products/

News

Recordings of presentations from HBaseCon were posted. There are talks from four tracks—operations, features & internals, ecosystem, and case studies.

http://hbasecon.com/archive.html

The Gartner blog has a post analyzing the rise of Apache Spark, which a number of vendors are jumping to support. It talks about how Spark tends to be easy to integrate (if a Hadoop integration was already done), and also how companies don’t want to be slow to adopt Spark (as many were for Hadoop).

http://blogs.gartner.com/nick-heudecker/spark-restarts-the-data-processing-race/

This week, Cloudera announced a partnership with Capgemini and Hortonworks announced a partnership with Accenture. In both agreements, Capgemini and Accenture will help customers deploy their partner’s Hadoop distribution. A post on SiliconAngle talks about how these types of partnerships show that Hadoop is maturing as an enterprise product.

http://siliconangle.com/blog/2014/07/11/tsunami-of-team-ups-reaffirms-accelerating-hadoop-maturity/

Actian, makers of the Actian Analytics Platform for SQL on Hadoop, announced a number of partnerships including one with Hortonworks.

http://www.marketwatch.com/story/industry-leaders-rally-behind-actians-sql-in-hadoop-platform-to-industrialize-hadoop-2014-07-08

Releases

InformationWeek has an article on the recently announced DataStax Enterprise 4.5 release. In addition to Spark support, the release has improved support for joining data between a Cassandra cluster and a Hadoop cluster (DataStax says they don’t aim to solve data warehousing and are happy to leave that to Hadoop).

http://www.informationweek.com/big-data/big-data-analytics/datastax-cassandra-release-packs-more-than-spark/d/d-id/1279086

Jumbune is a profiler and debugger for Hadoop MapReduce. It offers per job, per job flow, and cluster-wide analysis tools. It was recently open-sourced under the LGPLv3 license by Impetus Technologies.

http://www.marketwired.com/press-release/impetus-open-source-solution-jumbune-to-accelerate-hadoop-based-solution-development-1926600.htm

Scoobi, the Scala library for building MapReduce jobs, released version 0.8.5 this week. The maintenance release includes a number of improvements and some bug fixes.

http://notes.implicit.ly/post/91095690499/scoobi-0-8-5

Spring for Apache Hadoop 2.0.1 was released. It bumps versions of several dependencies, including Apache Hadoop to 2.4.1.

http://spring.io/blog/2014/07/08/spring-for-apache-hadoop-2-0-1-released

Version 1.0.0 of Cloudera Oryx, a system for real-time machine learning and predictive analytics, was released. The release contains several new endpoints and bug fixes.

http://community.cloudera.com/t5/Data-Science-and-Machine/Oryx-1-0-0-released/m-p/14822

Cloudera Enterprise 5.0.3 was released. There are a number of fixes to the CDH stack, including Flume, HBase, HDFS, Hue, Oozie, YARN, and Solr.

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-0-3-CDH-5-0-3-and-Cloudera/m-p/14950#U14950

ProtectFile for Hadoop is new enterprise encryption software from SafeNet. ProtectFile offers encryption at rest for HDFS and includes automation tools for deployment.

http://data-protection.safenet-inc.com/2014/07/big-data-encryption-addresses-hadoop-security-concerns/

Pentaho 5.1, which was released in June, added support for Hadoop YARN. It also includes integrations with MongoDB, and has a Data Science Pack which integrates with R and Weka. This post from InformationWeek has many more details on the new release.

http://www.informationweek.com/big-data/big-data-analytics/pentaho-preps-data-on-hadoop-analyzes-on-mongodb/d/d-id/1279187

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Cloudera & Lucidworks: SolrCloud Failover, Testing, and Integration with Hadoop (Palo Alto) – Tuesday, July 15
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/191046852/

46th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) – Wednesday, July 16
http://www.meetup.com/hadoop/events/129795442/

Hadoop Ask Me Anything (Palo Alto) – Wednesday, July 16
http://www.meetup.com/Hadoop-Ask-Me-Anything/events/194173032/

OC Big Data Monthly Meetup #3 (Irvine) – Wednesday, July 16
http://www.meetup.com/OCBigData/events/179381122/

July SF Hadoop Users Meetup (San Francisco) – Wednesday, July 16
http://www.meetup.com/hadoopsf/events/189897052/

Hey Big Data, Meet Apache Spark, by Marco Vasquez of MapR (Santa Monica) – Wednesday, July 16
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/175709772/

Colorado

In-Memory Computing Principles (Denver) – Monday, July 14
http://www.meetup.com/Data-Science-Business-Analytics/events/189837112/

Texas

Extending Apache Ambari (Houston) – Thursday, July 17
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/188066532/

Hadoop and Big R (Irving) – Saturday, July 19
http://www.meetup.com/Dallas-R-Users-Group/events/192928382/

Nebraska

Shawn Hermans Presents Big Data (Omaha) – Thursday, July 17
http://www.meetup.com/Heartland-Big-Data-Meetup/events/191993412/

Missouri

Apache Cassandra (Saint Louis) – Tuesday, July 15
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/189775412/

Illinois

Deep Learning: Theory, Practice and Predictions with H2O (Chicago) – Wednesday, July 16
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

Georgia

Beyond MapReduce: In-Memory Analysis with Spark and Shark (Atlanta) – Tuesday, July 15
http://www.meetup.com/atlcassandra/events/188461182/

North Carolina

Triad Hadoop Users Group (Winston Salem) – Thursday, July 17
http://www.meetup.com/Triad-Hadoop-Users-Group/events/187375842/

New York

Introduction to Apache Mesos (New York) – Monday, July 14
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/184053172/

A Leap Forward for SQL on Hadoop (New York) – Monday, July 14
http://www.meetup.com/Big-Data-Developers-in-NYC/events/189542182/

Massachusetts

Boston Spark User Group July Presentation Night (Cambridge) – Tuesday, July 15
http://www.meetup.com/Boston-Apache-Spark-User-Group/events/184426442/

SINGAPORE

Technical Workshop – Revolution Analytics and Cloudera (Singapore) – Monday, July 14
http://www.meetup.com/R-User-Group-SG/events/193625622/

GERMANY

Couchdoop and Other Consumer Use Cases from the Hadoop Ecosystem (Munich) – Thursday, July 17
http://www.meetup.com/Hadoop-User-Group-Munich/events/188851932/

POLAND

Hadoop 2.0 Processing Framework (Krakow) – Friday, July 18
http://www.meetup.com/datakrk/events/193755742/

INDIA

Hadoop Map-Reduce with Cascading (Hyderabad) – Saturday, July 19
http://www.meetup.com/Hyderabad-Programming-Geeks-Group/events/189970072/

Big Data Meetup (Bangalore) – Saturday, July 19
http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/194094032/

Hadoop Meetup (Bangalore) – Saturday, July 19
http://www.meetup.com/Bangalore-Baby-Hadoop-group/events/189310322/

[…]

Hadoop Weekly Issue #77

06 July 2014

I was expecting a dearth of content to match the short week in the US for July 4th. But with Spark Summit this week in San Francisco, there were a number of partnerships, new tools, and other announcements. Both Databricks and MapR announced influxes of cash this week, and there was a lot of discussion about the future of Hive given a joint announcement by Cloudera, Databricks, IBM, Intel, and MapR to build a new Spark backend for Hive. In addition to that, Apache Hadoop 2.4.1 was released, Apache Pig 0.13.0 was released, and Flambo, a new Clojure DSL for Spark, was unveiled.

Technical

Pivotal HD and HAWQ support Parquet files natively in HDFS. This tutorial shows how to build a Parquet-backed table with HAWQ and then access the data stored in HDFS using Apache Pig.

http://www.pivotalguru.com/?p=727

Spark Summit was this week in San Francisco. Slides from the presentations (there are over 50) have been posted on the summit website. In addition to keynotes, there are three tracks—Applications, Developer, and Data Science.

http://spark-summit.org/2014/agenda

This article proposes an alternative to the Lambda Architecture. For those not familiar, the Lambda Architecture is an idea of combining batch and real-time workloads to build course-correcting streaming applications. The alternative, from Jay Kreps (who builds data infrastructure using Kafka and Samza at LinkedIn), is to use the stream-processing framework to backfill data (thus performing the role of batch in the Lambda Architecture). The article discusses the trade-offs and benefits of using the Lambda Architecture vs. a stream processing framework for everything.

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

The Altiscale blog has a post on event transport for Hadoop. It gives an introduction to the problem that systems like Apache Flume and Apache Kafka are solving—namely moving data from applications to durable storage in Hadoop. The post also talks about the processing models of Flume and Kafka and the different tradeoffs of the two.

https://www.altiscale.com/event-transport-hadoop/

Altiscale has the first two parts of a three-part blog series on Apache Oozie. The first covers how to use wildcards in path expansion for Oozie datasets (there are several gotchas). The second covers using Oozie to run Hadoop streaming jobs (written in Ruby and Python). They show how to dump the environment (useful for debugging), how to configure Oozie to support custom Ruby gems in streaming jobs, and how to build a simple MultipleTextOutputFormat subclass to support multiple outputs from streaming jobs.

https://www.altiscale.com/wildcards-oozie-2/
https://www.altiscale.com/running-streaming-jobs-oozie/
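
That MultipleTextOutputFormat trick boils down to overriding a single method of the old-API output format so the output path is derived from each record's key. A minimal sketch (not the Altiscale code itself) might look like this, with the class jarred up and passed to the streaming job via -libjars and -outputformat:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Object, Object> {
        @Override
        protected String generateFileNameForKeyValue(Object key, Object value, String name) {
            // Write each key's records under a directory named after the key,
            // keeping the default part file name (e.g. part-00000).
            return key.toString() + "/" + name;
        }
    }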

Pivotal has posted benchmark numbers of their HAWQ system for SQL on Hadoop. The analysis used a 10-node cluster running RHEL 6.2. They compared Impala 1.1.1, Presto 0.52, Hive 0.12, and HAWQ 1.1. Pivotal HAWQ shows an average 6x performance improvement over Impala and a 21x speedup over Hive (like most vendor benchmarks, the results should be taken with a grain of salt). The post also touts the SQL compliance of HAWQ, which allows it to support many more TPC-DS queries than other systems.

http://blog.gopivotal.com/pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions

This article contains an overview of YARN and YARN schedulers with a focus on HPC audiences. After an intro to YARN architecture, the post describes 11 types of scheduling options familiar to users of HPC systems, many of which aren’t yet available in YARN. After that, it dives into the details of the YARN capacity and fair schedulers.

http://www.linuxjournal.com/content/how-yarn-changed-hadoop-job-scheduling

This presentation discusses Twitter’s experiences with running Spark at scale. For evaluation, they built a 35-node YARN cluster with Spark 0.8.1 and compared it to Pig and Scalding. They found that Spark produced a 3-4x wall-clock speedup over Pig and a 2-3x speedup vs. Scalding. They mentioned that tuning Spark jobs required a good understanding of the system, and that there were some limitations for productionization inside of YARN (but that more recent versions of Spark are aiming to address these).

http://www.slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter

Cloudera, MapR, Intel, IBM and Databricks announced a partnership to build a new Spark backend for Hive (more about that below). This post discusses the technical details and motivation for the new project. One of the main motivations is to help Spark shops have a single backend in place (rather than also requiring MapReduce or Tez). The article discusses Query Planning, Job Execution, and the main design considerations of the implementation.

http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

The Gartner blog has a post about how Hadoop development tools have been falling behind while the ecosystem concentrates efforts on SQL-on-Hadoop. It mentions four areas—development tools, application deployment, testing and debugging, and integrating with non-HDFS sources. There are some projects working on these areas, but there hasn’t been significant improvement.

http://blogs.gartner.com/nick-heudecker/dontforgetthehadoopdevelopers/

News

MapR announced $110 million in financing this week. Google Capital led the round with $80 million (the other $30 million was debt financing). InfoWorld has more details on the deal, including MapR’s popularity in the enterprise and its expertise in machine learning.

http://www.infoworld.com/t/hadoop/how-much-hadoop-worth-google-80-million-245233

Databricks announced $33 million in series B funding and a new cloud platform. The funding round was led by New Enterprise Associates (NEA). The cloud platform provides an easy way to deploy Spark in Amazon Web Services with expansion to more cloud providers on the roadmap. It provides notebooks, dashboards, and a job launcher.

http://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.html

Pentaho and Databricks announced an integration between Pentaho and Apache Spark. The integration currently includes support for ETL and Reporting, and they’re working on a new backend for their Weka machine learning suite built on Spark.

http://databricks.com/blog/2014/06/30/application-spotlight-pentaho.html

Alteryx and Databricks announced a collaborative effort to work on SparkR. SparkR is a Spark backend to the R analytics system providing distributed computation.

http://www.alteryx.com/press-releases/alteryx-and-databricks-to-lead-development-of-apache-sparkr-for-scalable-hadoop

Fortune has the story of Hadoop’s birth at Yahoo as part of the Nutch project. It features interviews with Hadoop co-founders Doug Cutting and Mike Cafarella, who say they never anticipated the demand for Hadoop, which is driving a $50 billion market. It also discusses the role of open-source in Hadoop’s success, and how Cutting is now working on updating policy for big data.

http://fortune.com/2014/06/30/hadoop-how-open-source-project-dominate-big-data/

DataStax and Hortonworks announced that DataStax completed Hortonworks Certification for HDP.

http://hortonworks.com/blog/datastax-certified-hortonworks-data-platform/

Datanami has coverage of Hortonworks’ certification of Apache Spark on YARN. The article features an interview with Arun C. Murthy and Shaun Connolly of Hortonworks where they discuss the process of evaluating a new system for YARN and new features (such as node labels) they’re adding to YARN for optimizing jobs run on different systems.

http://www.datanami.com/2014/06/26/apache-spark-gets-yarn-approval-hortonworks/

Databricks and SAP announced a partnership this week. As part of the deal, Databricks will certify Spark to run on SAP HANA. The Databricks blog has more details on the partnership.

http://databricks.com/blog/2014/07/01/integrating-spark-and-hana.html

This post summarizes the highlights from this week’s Spark Summit. In addition to big announcements from Datastax, Databricks, and more, the post discusses the growth of the summit (450 -> 1000+ attendees), some of the keynotes, and vendor turnout.

http://thomaswdinsmore.com/2014/07/03/spark-summit-2014-roundup/

MapReduce and Hadoop have been tied together for most of Hadoop’s history. But with the introduction of YARN, MapReduce is just one of many applications. This article points out that Google’s recent revelations about MapReduce don’t mean the end of Hadoop. The author also argues that Google’s new Cloud Dataflow isn’t meant to be a replacement for Hadoop (especially given Google’s investment in MapR this week).

http://www.datacenterknowledge.com/archives/2014/07/03/hadoop-mapreduce-ties-broken-dataflow-not-a-hadoop-killer/

WANdisco, who specializes in uptime for distributed systems, announced that they’ve acquired OhmData, makers of the C5 database. The C5 database is compatible with HBase APIs but provides different trade-offs and features.

http://gigaom.com/2014/06/30/hadoop-specialist-wandisco-acquires-hbase-like-startup-ohmdata/

Cloudera, Databricks, IBM, Intel, and MapR announced at Spark Summit a partnership to build a new Spark backend for Hive. This announcement caused a lot of confusion and speculation around the companies’ product offerings—particularly around Cloudera and Impala. The Register has coverage of the initial announcement including reactions from Hortonworks. The Cloudera blog has a post describing their vision for a future in which Cloudera Impala and Hive on Spark exist concurrently—the former for interactive queries and BI tools and the latter for everything else.

http://www.theregister.co.uk/2014/06/30/cloudera_and_co_spark/
http://vision.cloudera.com/broadening-support-for-apache-spark/

To add confusion to the announcement of Hive on Spark, Databricks announced that they’re no longer planning to support Shark, which is the original project for Hive on Spark (the new project will be a rewrite taking advantage of changes to the Hive APIs introduced in order to support Apache Tez as a backend). On top of that, they believe that Spark SQL, their system for invoking SQL queries from a Spark job, is the future of SQL on Spark. The post also acknowledges the need for Hive on Spark, which adds further complication to the discussion.

http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

A post on the Hortonworks blog tells the tale of Hadoop Then, Now, and Next. It describes traditional Hadoop based on HDFS and MapReduce, the arrival of YARN (and declares that Traditional Hadoop, built on mappers and reducers, is dead) as the basis for Enterprise Hadoop, and discusses how YARN will power the future of Hadoop.

http://hortonworks.com/blog/enterprise-hadoop-whats-next-data-management/

Releases

Apache Hadoop 2.4.1 was released. The new version contains a number of bug fixes, including a security fix for HDFS admin sub-commands.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201406.mbox/%3C95C9844E-FB00-4DFB-BECF-5C49E1A727F4%40hortonworks.com%3E
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/releasenotes.html

Sparkling Water is a new system combining 0xdata’s H2O with Apache Spark. H2O is an open-source machine learning framework for big data. It supports a number of algorithms for data science including k-means, random forest, stochastic gradient descent, and naive Bayes. It previously supported stand-alone clusters or running on Hadoop, and Sparkling Water adds Spark as a runtime.

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

Pydoop 0.12 was released with support for YARN and CDH 4.4/4.5.

http://mail-archives.apache.org/mod_mbox/hadoop-general/201407.mbox/%3C53B40660.5000903%40crs4.it%3E

mapr-sandbox-base is a docker image for running the MapR sandbox in docker.

https://registry.hub.docker.com/u/maprtech/mapr-sandbox-base/

Apache Pig 0.13.0 was released. The release contains a number of new features and performance improvements. Among the most interesting features are a pluggable execution engine and auto-local mode.

http://mail-archives.apache.org/mod_mbox/pig-user/201407.mbox/%3CCAB2zpW9cqbeMbuVFg7WS8%3DoOwHC5Or6Xi7ELkvFB91P8U-7yxA%40mail.gmail.com%3E

Flambo, which was open-sourced this week by Yieldbot, is a new project that provides a Clojure DSL for Apache Spark. Flambo’s README provides examples of using the idiomatic Clojure API.

https://github.com/yieldbot/flambo

MapR announced support for new versions of Hive, Httpfs, Mahout, and Pig. All are available for MapR 3.0.3, 3.1.1, and 4.0.0 FCS.

http://www.mapr.com/blog/apache-open-source-projects-release-update

The cassandra-driver-spark project is a new project from DataStax to integrate Cassandra with Apache Spark. With the driver, it’s possible to store a Spark RDD into Cassandra with a single statement.

https://github.com/datastax/cassandra-driver-spark

Events

Curated by Mortar Data ( http://www.mortardata.com )

UNITED STATES

California

Unlimited Analytics in Hadoop with Actian Vector (San Francisco) – Wednesday, July 9
http://www.meetup.com/SF-Data-Warehouse-Group/events/188742072/

Deep Dive Apache Drill: Building Highly Flexible, High Performance Query Engines (Menlo Park) – Thursday, July 10
http://www.meetup.com/Hadoop-Talks/events/180632322/

Hadoop: Past, Present and Future (Irvine) – Thursday, July 10
http://www.meetup.com/Orange-County-Java-Users-Group-OCJUG/events/192610022/

Texas

Extending Apache Ambari (Houston) – Wednesday, July 9
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/188066532/

Utah

Big Data Utah Meeting @ IHC – Discussion on Architecture and Best Practices (Salt Lake City) – Wednesday, July 9
http://www.meetup.com/BigDataUtah/events/191685552/

Colorado

Graph Processing with Hadoop & HBase by Brandon Vargo, Senior Platform Engineer (Boulder) – Thursday, July 10
http://www.meetup.com/Graph-Nerds-of-Boulder/events/192207712/

Kansas

MapR Talks Apache Spark & Tableau’s Rel.8.2 (Kansas City) – Thursday, July 10
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/182908202/

Georgia

Hey Hadoop, Meet Apache Spark! (Atlanta) – Wednesday, July 9
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/181968972/

Washington, D.C.

MapR: Security and Hadoop Discussion (Followed by Happy Hour and Networking) (Washington) – Thursday, July 10
http://www.meetup.com/Hadoop-DC/events/187536342/

CANADA

Introduction to Apache Spark (Toronto) – Tuesday, July 8
http://www.meetup.com/TorontoHUG/events/191210182/

SQL on Hadoop Party – Downtown Session 1 (Vancouver) – Thursday, July 10
http://www.meetup.com/Big-Data-Developers-in-Vancouver/events/189972172/

SQL on Hadoop Party – Burnaby Session 3 (Burnaby, B.C.) – Friday, July 11
http://www.meetup.com/Big-Data-Developers-in-Vancouver/events/189972822/

INDIA

Hadoop by Use Case and Example (Hyderabad) – Saturday, July 12
http://www.meetup.com/hyderabad-scalability/events/182572882/

[…]
