Hadoop Weekly Issue #76


29 June 2014

Google made news this week by proclaiming that MapReduce is dead at Google—there are two reactions in this week’s issue. And with that in mind, there are several good posts covering non-MapReduce projects in the Hadoop ecosystem—Accumulo, HDFS, Storm, Spark, and more. Apache Storm also released a new version this week, and there were announcements from Hortonworks, IBM, and RainStor about their Hadoop-related products.


Apache Accumulo, the distributed key-value store, supports bulk loading of data in its native format, RFile. Loading data as RFiles, which can be generated via MapReduce jobs, is much more efficient than loading the same data one record at a time. The Sqrrl blog talks about some tools they’ve built to generate RFiles from data stored in JSON and CSV.


A post on the Cloudera blog talks about extended attributes, which are a precursor to encryption at rest and other filesystem features. Extended attributes come in four flavors (user, trusted, system, and security), and they are a mapping of String -> byte[]. The feature is slated for the Hadoop 2.5 release, and there are a number of new HDFS command-line options and API changes to support them.
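Extended attributes can be pictured as a small namespaced map attached to each file. The toy model below (plain Python, not the HDFS API; the attribute name `user.checksum` is made up for illustration) shows the String -> byte[] shape and the namespace-prefix rule:

```python
# Illustrative model of HDFS extended attributes (NOT the HDFS API):
# each xattr maps a String name to a byte[] value, and the name must
# carry one of four namespace prefixes.
NAMESPACES = ("user", "trusted", "system", "security")

def set_xattr(xattrs, name, value):
    """Store an xattr after validating its namespace prefix."""
    namespace, _, attr = name.partition(".")
    if namespace not in NAMESPACES or not attr:
        raise ValueError("xattr name must look like '<namespace>.<name>'")
    xattrs[name] = bytes(value)
    return xattrs

attrs = {}
set_xattr(attrs, "user.checksum", b"3b5d")
print(attrs)  # {'user.checksum': b'3b5d'}
```

The real feature exposes this through new `hdfs dfs` command-line options, but the name/namespace/value shape is the same.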


This post describes how to set up a local core-Hadoop dev environment with IntelliJ. Mostly, the process seems to just work, but there are a few tips to customize the environment and work around an issue with missing projects on the classpath.


Hortonworks has posted (without a registration-wall) the slides and recording of a recent webinar on Apache Storm. This post has answers to some questions asked during the webinar. They cover how Storm fits together with HBase, Flume, Hadoop, Spark, and more.


Hortonworks also posted a video of their webinar on advanced security on HDP. Again, there is a lot of good information in the Webinar Q&A text included in the write-up. It adds some details about XA Secure (which was recently acquired by Hortonworks), Apache Knox (for perimeter security), the role of Active Directory/LDAP, encrypting data at rest, and more.


The GigaOm Structure Show podcast features an interview with Databricks co-founder and CTO Matei Zaharia. The interview, for which some highlights are posted, covers Apache Spark (of which Matei is also one of the creators). It discusses the genesis of Spark as an effort to build a better computing framework, the flexibility and improved programming model of Spark, and more.


This presentation, from the East Bay Java User Group, covers building a Hadoop-based application for clickstream analysis. The talk does a high-level design, which includes things like deduplication and sessionization, data storage with Avro, dataset partitioning in HDFS, and data ingestion with Flume. For each component, there’s a discussion of alternatives (e.g. Flume vs. Kafka) and why a particular alternative was chosen.


Datanami has a case study of T-Mobile, which recently switched from a petabyte Netezza appliance to Hadoop with RainStor. The post covers T-Mobile’s scaling challenges (they see a 2.5x increase in data every 18 months), the security considerations that T-Mobile addressed (including an isolated network), and their choice of RainStor for SQL and encryption/compression.



This post talks about how we’re in the third wave of Hadoop. According to the article, the first wave was the early adopters that had new types/volumes of data, the second wave created a number of new projects/products and companies offering Hadoop support, and the third is a movement to use Hadoop as a database rather than deploying individual MapReduce jobs.


Hortonworks and IBM announced that IBM InfoSphere Guardium is certified with HDP 2.1. Guardium provides real-time monitoring, alerting, and reporting for audit logging and mitigating data breaches.


The Gartner blog has a post on defining Hadoop. A couple of years ago, the definition was limited to six projects, but now as many as fifteen are supported by commercial distributions. And there are more projects likely to be included in that list as time goes on.


Cloudera, Dell, and Intel announced a new Dell In-Memory Appliance for Cloudera Enterprise. The appliances are optimized for Apache Spark, Apache Solr, and other memory-intensive workflows. Cloudera mentions that more memory is necessary given the push to use Hadoop for real-time analytics rather than batch processing.


This post on Big Data as a Service (BDaaS, the more general version of Hadoop as a Service) tries to answer the question “What are the different types of BDaaS available?” It covers Core BDaaS (e.g. Amazon EMR), Performance BDaaS (e.g. Altiscale), Feature BDaaS (e.g. Qubole), and Integrated BDaaS.


Continuuity Founder and CEO Jonathan Gray has written a post about Hadoop Summit. He’s identified a few trends—the push towards enterprise support (notably work on security), the balancing act of Hadoop and the traditional EDW, and the fragmentation of Hadoop (vendors supporting different stacks, competing projects, and more). He also mentions that Hadoop needs to be simplified, which is a theme that seems to be gaining traction in some areas (e.g. newer programming models).


Google’s Urs Hölzle made the news this week when he proclaimed “We don’t really use MapReduce anymore” at Google I/O. While many folks were surprised by the announcement, this post explores why it shouldn’t be surprising at all. Given the research on systems like Dryad (from 2007) and on MPP database products, it’s actually a little surprising that MapReduce is still so prevalent in Hadoop.


In another post triggered by the Google/MapReduce news, the author discusses why MapReduce gained popularity as a processing framework by contrasting it with MPI. With the MapReduce primitives, you can solve a lot of problems. But to solve more complex problems, the Hadoop platform needs something more.
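The reach of those two primitives is easy to see with the classic word count, sketched here as a local plain-Python simulation of the map/shuffle/reduce phases (illustrative only, not Hadoop code):

```python
from collections import defaultdict

# Word count expressed with the two MapReduce primitives:
# map emits (key, value) pairs, reduce folds all values for a key.
def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["a rose is a rose"])))
print(counts)  # {'a': 2, 'rose': 2, 'is': 1}
```

Anything expressible as independent per-record emission plus per-key folding fits this shape; iterative or graph-structured algorithms are where the model starts to strain.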



Hortonworks has officially classified Apache Spark as “YARN Ready.” The project is still available as an HDP 2.1 Tech Preview, and Hortonworks has some suggestions for deployment (e.g. multiple Spark deploys on a single YARN cluster if you have many concurrent Spark users).


SequenceIQ has posted a new Docker image for Apache Hadoop 2.4 to the official Docker registry. Their post describes how to build the image yourself and includes instructions for some simple testing.


Hortonworks announced that HDP Advanced Security, which is based on the XA Secure acquisition, is now available as an add-on download for HDP 2.1. As part of the announcement, Hortonworks also reiterated their commitment to submitting the software to the Apache Incubator.


structor is a project for building Hadoop VMs using Vagrant. While there are several solutions available for doing so, this setup also includes support for building a secure Hadoop cluster using Kerberos.


Apache Storm, the stream processing framework, released version 0.9.2. The new version includes improvements to the Netty transport and the Storm UI, a new Kafka spout, and more.


hRaven, a tool for collecting metadata about MapReduce jobs, released version 0.9.15. The new version includes updates to several components, instrumentation of REST API calls, and other improvements.


RainStor, which provides interactive SQL-on-Hadoop, released version 6 this week. The new release, which is certified for Cloudera 5 and Hortonworks HDP 2.1, includes a new archive application, and integration with Apache Ambari and HCatalog.


IBM released IBM InfoSphere BigInsights V3.0 this week. The new release includes a new Big SQL component for low-latency SQL-on-Hadoop (including SQL 2011 support), and an updated version of Solr. The release is available in three editions—quick start, standard, and enterprise.



Curated by Mortar Data ( http://www.mortardata.com )



Spark Summit 2014 (San Francisco) – Monday, June 30


Introduction to Pig with Live Demonstration (Tempe) – Wednesday, July 2

North Carolina

NC State and IBM Discussion of Hadoop Usage Patterns (Durham) – Monday, June 30

Washington, District of Columbia

Elasticsearch-DC meetup with a Federal Agency Presenting (Washington, D.C.) – Monday, June 30


Big Data June Meetup (Mannheim) – Monday, June 30


BigData/HadoopSG meetup (Singapore) – Tuesday, July 1


Big Data Mining and Graph Processing (Sydney) – Thursday, July 3


SQL for Hadoop (Zurich) – Thursday, July 3


Big Data Beyond Hadoop: Spark and Message Queuing Systems (Bangalore) – Friday, July 4

Hadoop by Use Case and Example (Hyderabad) – Saturday, July 5



Big data world

Understanding Big Data: Big data is going to change the way you do things in the future, how you gain insight, and how you make decisions (the change isn’t going to be a replacement, but rather a synergy and extension). This book helps you get up to speed quickly on this technology and shows you the […]


Hadoop Weekly Issue #75


22 June 2014

Security in Hadoop is a big topic in this week’s issue—there’s coverage from Accumulo Summit, and posts from both Cloudera and Hortonworks on the topic. This week’s issue also covers technical posts on the Kiji Framework, Hadoop + Docker, and Etsy’s predictive modeling software, Conjecture.


Slides from the recent Accumulo Summit have been posted online. There are 17 presentations from folks at Cloudera, Hortonworks, Sqrrl, and more. Topics include Accumulo and YARN, the Accumulo community, and security for Accumulo.


The DataStax Cassandra SF users group hosted a talk on building applications with the Kiji Framework. Kiji used to be tied to HBase but recently added Cassandra as a storage backend. Hakka Labs has a video of the presentation, which covers integrating Cassandra and Kiji.


There have been plenty of articles and presentations on the differences between Hadoop 1 & Hadoop 2, but there are so many new features that it’s good to reiterate them. These slides do a good job of summarizing the major changes without diving into too many of the implementation details, and they also provide a good overview of what’s next for Hadoop.


I’ve been spending a lot of time with Docker recently, and it seems to open up a whole lot of new possibilities. In this case, SequenceIQ is deploying Apache Ambari as a Docker container, which in turn spins up a full Hadoop cluster. They’re doing some interesting things with Serf for discovering the nodes in the cluster and setting up DNS. The ultimate goal is a Docker-based, cloud-independent system for deploying Hadoop clusters.


In the first of two posts by Cloudera on Project Rhino and Hadoop security, the Cloudera blog covers some of the technical details of implementing encryption at rest on HDFS. As a quick recap—Project Rhino is a push started by Intel to bring hardware-accelerated encryption support to Hadoop. As part of the Cloudera/Intel deal, Cloudera has joined Intel in working on Project Rhino. This post covers the role of a key server and hardware acceleration in implementing HDFS encryption at rest.


Etsy has open-sourced Conjecture, their predictive modeling pipeline. The framework uses Scalding for parallelized model training. The post unveiling Conjecture contains more background on the Scalding jobs as well as other parts of the system.


Kickstarter has posted a presentation on their data pipeline. While they’re using Redshift for the majority of the analysis, they make use of Sqoop and Hadoop Streaming on Amazon’s EMR as well. There are some interesting ideas around their workflow for data requests using Trello.


The MSDN Blog has a post on using a JSON SerDe with Hive running on Microsoft Azure HDInsight. The post is fairly Windows-specific (the Hortonworks blog had a similar post a few months ago targeted towards Linux), but it is full of details that should be helpful regardless.


This post covers tuning Cloudera Search (which is built on SolrCloud). It covers the HDFS cache configuration and tuning parameters as well as the SolrCloud settings.



GigaOm has an article about Yahoo’s testing and driving of new features for Hadoop at scale. The post talks about how Hortonworks, which was spun off from the Hadoop team at Yahoo, benefits from the close relationship between the two teams. The post mentions HDP 2 and Storm on YARN as examples of the close partnership.


The reliability of the Hadoop platform has increased by leaps and bounds in recent years via various high-availability and scalability initiatives. But the reliability and ease of debugging of complex jobs running on Hadoop hasn’t really gotten any better. A Forbes contributor article explores this issue from a business/competitor perspective. The article asks whether this is hurting Hadoop adoption, discusses some of the main causes of job failure, and talks about what some of the Hadoop-as-a-Service vendors are doing to help reliability.


The Hortonworks blog has a post with details on the security offering from the recently acquired XA Secure product. XA Secure supplements Hadoop’s current security with centralized administration, authorization across Hive, HDFS, and HBase, and an audit system. Hortonworks has previously said that this software will be open-sourced in the future.


Cloudera has also been talking a lot about security, and this post summarizes the state of security on Hadoop. This post explores in detail all of the goals of Project Rhino as well as how it fits together with Apache Sentry (incubating). The post also mentions that some of the employees Cloudera gained as part of the Gazzang acquisition will be working on Project Rhino.


Radoop, maker of analytics and visualization software for data stored in Hadoop, was acquired this week by RapidMiner. RapidMiner and Radoop were previously partners, and the acquisition will give RapidMiner better support for Hadoop.


The Ovum blog has a post on the Actian Analytics Platform-Hadoop SQL Edition. Yet another entry in the SQL-on-Hadoop space, Actian takes a different approach. Unlike Hive, Impala, and other solutions, Actian doesn’t use the Hive metastore. Instead, it supports full ANSI SQL 2003 and ACID (via version control). Interestingly, Actian is using Hadoop to scale out their existing solution, which was previously limited to a single node (about 10TB).


Datanami has a post about stealth startup BlueData, which aims to be “VMWare for big data.” The software aims to ease the complexity of installing Hadoop as well as offering elasticity in the data center. It’s currently in beta.



Google has added support for automatic provisioning of Apache Spark and Shark for clusters running in the Google Compute Engine. Version 0.34.3 of the bdutil tool for the Google Cloud Platform contains these changes. The new version also contains support for configuring a SOCKS proxy to access the cluster’s web UI.


ElasticSearch announced version 2 of their Hadoop connector, which is certified for CDH 5 (in addition to supporting Hortonworks and MapR distributions).


MapR has announced support for Hive 0.13, Hue 3.5, Sqoop 2, Oozie 4, and HBase 0.94.17 for their distribution. They support multiple versions of Hive, including 0.11 and 0.12, and each of the new versions can run across MapR 3.x releases (3.0.3, 3.1.0, and 3.1.1).



Curated by Mortar Data ( http://www.mortardata.com )



An Intro to Apache Spark (with Hadoop) for Plain Old Java Geeks (Palo Alto) – Tuesday, June 24

Unraveling Hadoop Meltdown Mysteries (Palo Alto) – Tuesday, June 24

YARN Over MR1: An Operational Win, Presented by Karthik Kambatla (Santa Monica) – Tuesday, June 24

Pepperdata Meetup: War Stories from the Hadoop Trenches (Sunnyvale) – Wednesday, June 25

Rubicon’s Hadoop Program (Irvine) – Wednesday, June 25

Apache Sentry: Enterprise-Grade Security for Hadoop (San Jose) – Wednesday, June 25

Application Architectures with Apache Hadoop: Putting the Pieces Together (Oakland) – Wednesday, June 25

Open Space, Including “Test-Driven Hadoop Workshop” (Westlake Village) – Wednesday, June 25

LOPSA SD Monthly Meeting: An Introduction to Hadoop (San Diego) – Thursday, June 26

Cancer Genomics and Data Sciences (Milpitas) – Sunday, June 29


Apache Spark: Resilient Distributed Datasets (Denver) – Thursday, June 26


Apache Spark as Cross-Over Hit for Data Science (Chicago) – Wednesday, June 25

North Carolina

June CHUG: Enterprise-Grade Security on Hadoop, with Johndee Burks of Cloudera (Charlotte) – Wednesday, June 25


Apache Spark (Philadelphia) – Tuesday, June 24

New Jersey

Intro to Big Data and Hadoop with a Cloudera Architect (Upper Montclair) – Tuesday, June 24

New York

Network Design Considerations and Challenges for Hadoop ‘Big Data’ Environments (New York) – Wednesday, June 25



Ottawa Hadoop User Group Kick-Off Meetup, with Guest Speaker from Hortonworks (Ottawa) – Tuesday, June 24

YARN Roadmap, with Guest Speaker Joseph Niemiec (Toronto) – Wednesday, June 25


Bol.com’s Multifactor Hadoop-Based Recommender + Hadoop Warehousing with Impala (Utrecht) – Wednesday, June 25


SHUG 12: Improving Developer Productivity in the Big Data World, with David Whiting (Stockholm) – Wednesday, June 25



Hadoop Weekly Issue #74


15 June 2014

With Hadoop Summit in recent memory, there are several posts from or summarizing the summit in this week’s newsletter. Technical articles cover a wide range of topics from Hive and Pig tips to logging infrastructure at Loggly. SQL-on-Hadoop was also a big topic this week, with discussions about the need for it to drive Hadoop adoption.


The Mortar blog has a post with some tips for using Apache Pig. It features some lesser-known features of Pig such as writing UDFs in JavaScript, data sampling, and casting a relation to a scalar. If you use Pig and are looking to level-up your game, this is a great place to start.


HDFS RAID is a mechanism to use erasure codes instead of replicas in HDFS. Glossing over the technical details (which are covered in this article), you can do 2.2x or 1.4x replication instead of 3x, which makes for huge savings on large clusters. Facebook has posted about their experience deploying HDFS RAID to save petabytes of storage. There are a lot of tips and details on problems they faced on the road to reclaiming lots of storage space.
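Those effective-replication numbers translate directly into raw capacity. A quick back-of-the-envelope calculation (the 100 PB cluster size is hypothetical, chosen only for illustration):

```python
# Back-of-the-envelope storage math for HDFS RAID: replacing 3x
# replication with erasure-coded 2.2x or 1.4x effective replication.
def raw_storage_pb(logical_pb, effective_replication):
    """Raw bytes needed to store logical_pb of data at a given factor."""
    return logical_pb * effective_replication

logical = 100  # hypothetical cluster holding 100 PB of logical data
for factor in (3.0, 2.2, 1.4):
    print(f"{factor}x -> {raw_storage_pb(logical, factor):.0f} PB raw")
# Moving 100 PB of logical data from 3x to 1.4x frees 160 PB of raw capacity.
```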


Loggly, which makes a log management service, has written about their usage of Apache Kafka. Kafka let them simplify their deployment, which processes hundreds of thousands of events per second. They also talk about some of the technical details and operational concerns of their deployment (such as what machines they use on AWS and how they control resource utilization).


Apache Flink (incubating), formerly known as Stratosphere, is a next-generation processing framework with similar goals to other frameworks like Apache Tez and Apache Spark. This post explains the philosophy and design behind Flink, which is heavily influenced by relational database optimizers. Essentially, Flink will try to rearrange or rewrite the pipeline you’ve described in order to improve performance based on statistics and other knowledge of the underlying data.


There are quite a few options for running SQL queries against data stored in Hadoop (HDFS, HBase, or API-compatible filesystems). This post covers a number of them—Apache Hive, Impala, Presto, Shark, Apache Drill, Pivotal HAWQ, IBM BigSQL, Apache Phoenix, and Apache Tajo. For each one, there’s an overview of the tool and a recommendation for when to use it.


The MapR blog has a tutorial on deploying Apache Accumulo 1.5 on MapR 3.1. The tutorial walks through the various MapR FileSystem and Accumulo configuration settings.


The Hortonworks blog has a post on using Cascading to build a flow for parsing log files, grouping by IP, and generating counts per IP. The post has the code and a full walkthrough of how the code works.


The Apache Accumulo summit was this week, and there were a number of great presentations. This one on scaling Accumulo clusters has lots of details on its underpinnings, which help it support large datasets at high throughputs.


The SF Data Mining meetup recently featured a presentation entitled “Mining Big Data for Apache Spark.” Hakka Labs has a video of the presentation, which features the MLlib library from Spark and a live demo of the tools.


This post shows how to use Apache Spark to classify the Reuters 1987 dataset. The code for the tutorial is written in Scala and features XML parsing (using SAX), stemming/tokenization using Lucene, computing TF-IDF, and building a naive Bayes model. The code for the example is on GitHub, and there are instructions for building the example in the post.
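For a sense of what the TF-IDF step computes, here is a minimal plain-Python version (an illustrative sketch with made-up three-word documents, not the Spark/Lucene code from the post):

```python
import math

# Minimal TF-IDF: term frequency within each document, scaled down by
# how many documents the term appears in (log inverse document frequency).
def tf_idf(docs):
    n = len(docs)
    df = {}  # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [["oil", "price", "rise"], ["oil", "output"], ["wheat", "price"]]
weights = tf_idf(docs)
# "output" occurs in only one document, so it outweighs the common "oil".
```

Vectors like these are what the naive Bayes model is then trained on.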


The Cloudera blog has a post on rolling upgrades, which have been a feature of Cloudera Manager since version 4.6. While most native packages like RPMs and debs don’t allow the simultaneous install of multiple versions of a package, Cloudera Manager can distribute binaries as ‘parcels.’ This, along with the highly available NameNode, facilitates rolling restarts.


Hadoop Internals has a number of details on various parts of Hadoop. It covers Hadoop architecture, the anatomy of a MapReduce job, the various daemons in a Hadoop cluster, a list of key configuration parameters (what they affect), and more.


This post has five tips for working with Hive. They cover two important configuration parameters, a tip on writing queries, and two builtin UDFs—percentile_approx() and histogram_numeric(). There are several example queries illustrating the tips.
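As a rough intuition for what percentile_approx() reports, here is a simple nearest-rank percentile in plain Python (an exact, illustrative stand-in; Hive’s UDF computes an approximation so it can run over large columns, and the latency values below are invented):

```python
# Nearest-rank percentile over a small in-memory list, mimicking the
# kind of answer percentile_approx(col, p) gives for a numeric column.
def percentile(values, p):
    ordered = sorted(values)
    rank = max(0, int(round(p * (len(ordered) - 1))))
    return ordered[rank]

latencies = [12, 15, 11, 210, 14, 13, 16, 18, 17, 19]  # hypothetical data
print(percentile(latencies, 0.5))  # the median
print(percentile(latencies, 0.9))  # tail latency, ignoring the 210 outlier
```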



The Hortonworks blog has a post with some key takeaways of Hadoop Summit. They include momentum (highlighted by the number of attendees at the summit), the rise of YARN and all the tooling around it, and enterprise Hadoop.


Datanami has a post on Hadoop-as-a-service and hosted Hadoop, which seem to be gaining steam. The post includes interviews with Qubole and Altiscale. There are also some numbers from these and other companies showing that managed Hadoop adoption is growing quickly.


Big Data and Brews has a conversation with Ovum’s Tony Baer about SQL and Hadoop. The conversation, for which there are both a video and a transcript, contains a lot of interesting points about scaling Hadoop within an organization and across many enterprises. This is where SQL comes in, because many BI tools and applications (which are driving forces for scaling Hadoop) expect to pull back data via SQL.


ScalingData is a new company founded by several Cloudera veterans to build tools using Hadoop to help companies with IT operations. This week, they announced $4.4M in funding to build their platform.


Forbes has a contributor article from SiliconANGLE founder John Furrier on SQL, Open Source, and Security on Hadoop. The piece highlights some of the recent advancements and new tools in the SQL-on-Hadoop market, the oft-discussed spectrum of open-source strategies for Hadoop vendors, and the role of security in enterprise adoption (which recently picked up steam with acquisitions by Hortonworks and Cloudera).



Spring for Apache Hadoop 2.0 reached GA this week. The new release includes support for a number of distributions, including Apache Hadoop 1.x/2.2/2.4, Pivotal HD 2.0, Cloudera CDH 5, and Hortonworks 2.1. Spring for Hadoop has tools for developing YARN applications, abstractions for reading from/writing to HDFS, and POJO support for Hadoop datasets using the Kite SDK.


Cloudera announced Cloudera Enterprise 5.0.2 (which includes CDH 5.0.2 and Cloudera Manager 5.0.2). The new release of CM includes a fix for Impala query monitoring and CDH includes fixes for Hadoop, HBase, HDFS, Hive, Pig, and YARN.


Hortonworks announced HDP Security, which includes some new features as a result of their XA Secure acquisition. The new features include a centralized security tool, fine-grained access control for HBase, Hive, and HDFS, and audit logging.


Continuuity Loom 0.9.7 was released this week. Loom is a cluster provisioning and management suite for private and public clouds. The new release includes a number of changes, including cluster reconfiguration and service addition. There are more details of the release on the Continuuity blog.



Curated by Mortar Data ( http://www.mortardata.com )



Productionizing Spark Streaming, Tableau Spatial Queries, Spark Search Indexing (Mountain View) – Tuesday, June 17

45th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale) – Wednesday, June 18

Washington State

Scalable Analytics with R and Hadoop (Seattle) – Monday, June 18


Big Data Technologies – Apache Spark with MapR (Portland) – Wednesday, June 18


UHUG – Can a Pig Wear Lipstick? (Salt Lake City) – Wednesday, June 18


Leverage what you already know with BigSQL 3.0 on Hadoop (Scottsdale) – Wednesday, June 18


Hortonworks Educational Workshop (Fort Worth) – Thursday, June 19


St. Louis Hadoop Users Group Meetup (Saint Louis) – Tuesday, June 17


Hello Hadoop, meet Apache Spark (Chicago) – Wednesday, June 18

North Carolina

SQL for Hadoop (Durham) – Monday, June 16

This Ain’t Your Father’s Search Engine (Durham) – Thursday, June 19

First meeting of the Triad Hadoop Users Group (Winston Salem) – Thursday, June 19

New Jersey

Princeton Tech Meetup w/ Gilt Groupe (Princeton) – Wednesday, June 18

New York

YARN Tech Talk: The Data Operating System for Hadoop 2.0 (New York) – Tuesday, June 17


June Hadoop Meetup (London) – Tuesday, June 17


R & Hadoop (Singapore) – Wednesday, June 18


Let’s Discuss Hortonworks Bigdata, Its Significance, Future & Training (Bangalore) – Saturday, June 21

Hadoop Ecosystem (Hyderabad) – Saturday, June 21


Bigdataeverywhere Conference – MAPR & VERTICA (Herzeliyya) – Sunday, June 22



Changes in Flickr tables both v1 and v2

We recently announced that the Flickr API is going SSL-only.
To support this move, we have also restricted the Flickr YQL tables to be available over SSL-only.

All developers using the Flickr YQL tables will need to make the following updates to their API settings by June 24, 2014:

Protocol: HTTPS
Port: 443
The domain name query.yahooapis.com will remain the same.

As of June 24, 2014, we will limit all access to Flickr YQL tables to secure SSL connections only. No Flickr API data will be accessible over HTTP from this date onwards. If you don’t switch the access protocol to HTTPS, your users will not be able to access Flickr data via your service.
Thank you for supporting us and our users in making the shift to HTTPS.

Go to the Flickr Developer Guide for more information.



Hadoop Weekly Issue #73


08 June 2014

Hadoop Summit was this week in San Jose, so this week’s newsletter is full of lots of interesting technical content and news. I tried to capture as much as I could, but there is just an overwhelming amount. Enjoy!


Hivemall is a machine learning tool for Apache Hive. Implemented as Hive UDFs, it’s easy to test out (just add the jar to a Hive session), contains a number of machine learning algorithm implementations (including several not found in other Hadoop libraries), and can do iteration without multiple MapReduce jobs. These slides from a talk at Hadoop Summit provide many more details, including on the implementation.


The Apache HBase community has recently implemented a lot of improvements to the mean time to recover (MTTR) for a failed region server. Facebook, a large user of HBase, is looking to take this one step further by hosting each region in multiple region servers. A post on the Facebook blog describes the problem and the solution (called HydraBase) in more detail. Facebook plans to roll out the system internally soon, and hopefully it’ll inspire a similar solution in Apache HBase.


While few companies are running Hadoop clusters the size of Twitter’s (they have 100s of PBs in HDFS), it’s still really interesting to see and learn from their experiences. This post talks about the Hadoop 1 to Hadoop 2 migration at Twitter, including some of the scale and operational issues they ran into and solved as part of the migration. Overall, the migration seems to be a success—they’re seeing much better compute/memory utilization and are starting to see adoption of new frameworks on YARN like Spark.


The Kite SDK is a framework that aims to simplify a number of common Hadoop workflows through abstractions. For instance, there is a concept of a “dataset” which can be backed by HDFS or HBase and stored in various file formats (like Avro and Parquet). This post shows how to use some recently added command-line tools for the Kite SDK to load the contents of a delimited file into HDFS and HBase.


Spotify Data Engineer Adam Kawa tells the story of optimizing Hive in order to (quickly) find the answer to a question related to a bet with Spotify’s CEO (the stakes of which were meeting his favorite recording artist). The post walks through his experience with several of the latest Hive features, such as ORCFile, Hive-on-Tez, and vectorization. He also has details on sampling and writing unit tests for Hive queries.


There’s an interesting post on Quora comparing Apache Spark and Apache Stratosphere (incubating). The two projects aim to support a lot of similar use cases, but there are some major differences in the research and fundamentals of the systems. There are answers from Databricks’ CTO and a Stratosphere project member.


Hortonworks has posted a Hive benchmark that compares Hive version 0.10 to version 0.13. Version 0.10 was the last release before the start of the “Stinger initiative,” which aims to speed up Hive by 100x. One of the most impressive stats from the benchmark is that the total time taken by all queries went from nearly 8 days to 9.3 hours. The post has a lot of detail about the setup and configuration used.


Inserting data into Hive is always done at the partition (or the entire table) level, meaning you can’t do INSERT INTO .. VALUES .. or similar statements. A new project is working to bring ACID to Hive, which will add support for these types of insert, delete, and update statements. The implementation uses new ACID input/output formats that can read and write delta files describing updates to a table. Details on the implementation and on how delta files are compacted are in the linked slides.
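The delta-file idea can be sketched with a toy compaction: a base snapshot plus ordered deltas folded into a new base. This is an illustrative Python model only; the real implementation works on ORC files keyed by transaction and row IDs:

```python
# Toy model of ACID compaction: each delta maps a row key to a new
# value (insert/update) or to None (delete); compaction replays the
# deltas in transaction order over the base to produce a new base.
def compact(base, deltas):
    rows = dict(base)  # copy so the original base file is untouched
    for delta in deltas:
        for key, value in delta.items():
            if value is None:      # None marks a delete
                rows.pop(key, None)
            else:                  # insert or update
                rows[key] = value
    return rows

base = {1: "alice", 2: "bob"}
deltas = [{2: "bobby", 3: "carol"}, {1: None}]
print(compact(base, deltas))  # {2: 'bobby', 3: 'carol'}
```

Readers merge base and deltas on the fly until a compaction rewrites them into a single file, which is why query correctness doesn’t depend on when compaction runs.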


JethroData, makers of an analytics database built on Hadoop, have posted a benchmark in response to the recent benchmarking done by Cloudera on Impala. They’ve recreated the test on a smaller cluster running in Amazon EC2 with a much smaller scale factor (1TB vs. the original 15TB), but the results are quite promising (with the usual caveat about vendor benchmarks). In addition to the results, it’s interesting to see that someone has been able to use Cloudera’s Impala TPC-DS kit to run tests.


This post shows how to integrate Apache Flume and Apache Spark to do near-real-time processing of event data. It uses the Java API for Spark Streaming to compute the top-10 queries over a rolling window.
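The aggregation itself is easy to picture outside of Spark. This plain-Python sketch (a hypothetical class, not the Spark Streaming API) keeps a fixed-size window of query events and reports the current top N:

```python
from collections import Counter, deque

# Rolling-window top-N, the same aggregation the Flume + Spark
# Streaming example computes over incoming query events.
class RollingTopN:
    def __init__(self, window_size):
        # deque with maxlen silently evicts the oldest event when full
        self.window = deque(maxlen=window_size)

    def add(self, query):
        self.window.append(query)

    def top(self, n=10):
        return Counter(self.window).most_common(n)

top = RollingTopN(window_size=4)
for q in ["hadoop", "spark", "hadoop", "hive", "spark", "spark"]:
    top.add(q)
print(top.top(2))  # counts only the last 4 events in the window
```

Spark Streaming does the same thing distributed across a cluster, with the window defined in time (e.g. the last 10 minutes) rather than in event count.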


The Hortonworks blog has a post on the recently added Blueprints feature of Apache Ambari. Blueprints aim to implement best practices for cluster management and to make provisioning easier and repeatable. The post explains how blueprints work and some of the APIs available to inspect a cluster.


There were a lot of interesting talks at Hadoop Summit—way too many to cover in this newsletter. I’ve put together a bundle of all the slides that I’ve found from last week’s conference. There’s certainly a lot of interesting stuff happening in the ecosystem!



In a respite from the “Hadoop Wars,” big names from Cloudera and Hortonworks got on the stage at Hadoop Summit in a positive setting. It’s sometimes hard to remember that folks from these and other companies are collaborating on open-source software, so it’s great to see some positivity in public.


Concurrent, makers of Cascading, Lingual, Driven and other tools for Hadoop, announced a new round of funding totaling $10 million. VentureBeat has some quotes from CEO Gary Nakamura about how they’ll use the money and their business model.


This story has a lot of good observations about the Hadoop business and what it’ll take for Hadoop to gain enterprise adoption. While some companies got into Hadoop in order to offload ETL, SQL might attract even more. The article is also quick to point out that Hadoop has much more to offer than ETL and SQL.


On the heels of Hortonworks’ acquisition of XA Secure, Cloudera announced that they’ve acquired Gazzang. An article on GigaOm has some more details on the deal, Gazzang’s security products, how Cloudera will integrate Gazzang, and more.


Cloudera announced the close of their funding round and disclosed more details of the deal. Details include the financial structure (of the $900M, $530M was ‘primary capital’) and a new Cloudera board member, Kimberly Stevenson (who is the CIO of Intel).


InfoQ has coverage of Hadoop Summit by way of recaps of days one and two. There are details on some of the keynotes as well as various talks throughout the days.


Datanami has a story on Yahoo’s use of Hadoop and its spinout of Hortonworks. There’s mention of several products built on Hadoop at Yahoo, the close relationship between Hortonworks and Yahoo, and more.


AT&T and Continuuity announced that they are collaborating on a new stream processing tool called jetStream that will combine functionality of Continuuity’s BigFlow and AT&T’s streaming analytics tool. It sounds like jetStream, which will be open-sourced in Q3 2014, will have a lot of interesting features.


Databricks announced a new series of Spark workshops. Now through August, they’ll be holding training workshops in New York, San Jose, San Francisco, Austin, and Chicago.


MapR and Syncsort announced a partnership to bring Syncsort’s DMX-h ETL framework to MapR’s distribution.



Microsoft’s Hadoop-as-a-service offering, HDInsight, was updated this week to support Hadoop 2.4. Microsoft notes that the new version includes major speedups (up to 100x) when querying data stored in the Azure Blob store.


A new Kiji Bento Box SDK was released this week. The “Ebi” 2.0.3 SDK includes new versions of all components in the Kiji framework.


Hue 3.6 was released with a new Search application. The tool, which uses Solr Cloud running on Hadoop, provides a number of fancy new interfaces for looking at data. The new release also brings improvements to support for snappy compression and Impala.


DataTorrent announced general availability of their stream-processing framework for Hadoop. There are a number of open-source streaming tools, but DataTorrent seems to be going after the enterprise with their feature set (which includes SLAs, Alerts, and more).


Altiscale announced support for Apache Hive 0.13 as part of their Hadoop-as-a-Service platform.


Hortonworks announced a preview of Apache Slider for HDP2. Slider is a system for deploying applications inside of YARN. The preview includes sample applications for Apache HBase, Accumulo, and Storm, and there is an integration with Apache Ambari.



Curated by Mortar Data ( http://www.mortardata.com )



June SF Hadoop Users Meetup (San Francisco) – Wednesday, June 11

SF: How to Integrate Hadoop with Systems, with Jemish Patel (San Francisco) – Thursday, June 12

AdvancedAWS: June Meetup (San Francisco) – Thursday, June 12

Big Data Camp LA (Los Angeles) – Saturday, June 14


Hey Hadoop, meet Apache Spark! (Salt Lake City) – Wednesday, June 11


Big Data Developer Day (Boulder) – Thursday, June 12


A Leap Forward for SQL on Hadoop (Coppell) – Tuesday, June 10

Houston Hadoop Meetup Series (Houston) – Wednesday, June 11


Doug Cutting, founder of Hadoop, on “The Future of Data” (Kansas City) – Tuesday, June 10


Chicago Apache Lucene/Solr User Group – Leveraging Solr for Big Data Insight (Chicago) – Tuesday, June 10

Introduction to Kiji & Real-Time Personalization with Hadoop, HBase & Cassandra (Chicago) – Thursday, June 12


Big Data Developer Day (Malvern) – Tuesday, June 10

Workshop for SQL on Hadoop (Pittsburgh) – Thursday, June 12

District of Columbia

The 1st Washington DC Area Apache Spark Interactive Meetup (Washington) – Thursday, June 12


Indexing Strategies on Apache Accumulo (Largo) – Wednesday, June 11

Conference: Accumulo Summit (East Hyattsville) – Thursday, June 12


Considerations for a Holistic Approach to Big Data Security and Privacy (Cambridge) – Tuesday, June 10

Advanced Analytics in Hadoop (Cambridge) – Wednesday, June 11


Impala Roadmap, with Eli Collins (Toronto) – Monday, June 9

Hands on Hadoop (Vancouver) – Tuesday, June 10


MapReduce (BigData) et Oracle Database 12c avec Kuassi Mensah (Paris) – Monday, June 9


Gary Short: From Zero to Hadoop (Cambridge) – Tuesday, June 10

Hands-on Clojure: Introducing Cascalog (Cambridge) – Wednesday, June 11


Introduction to Hive, HiveQL, Hive with Hadoop (Mumbai) – Wednesday, June 11


Read More…

Hadoop Weekly Issue #72

Hadoop Weekly Issue #72

01 June 2014

Apache Spark 1.0 was released this week. And there are a number of posts this week about Spark, including a post describing how eBay is starting to use Spark. This week is Hadoop Summit in San Jose, and there’s some anticipation building including two posts on the Hortonworks blog about Discardable Memory and Materialized Queries that will be presented at the summit. I’m sure there will be a lot of great presentations; please forward them my way as I won’t be attending.


The SequenceIQ blog has a post about building a command-line tool for Apache Ambari using Spring Shell and the Ambari REST API. The code for ambari-shell is available on GitHub, there’s a binary available as a Docker image, and ambari-shell is slated for inclusion in Ambari 1.6.1. The post has a tour of the features of the tool.


Discardable Memory and Materialized Queries (DMMQ) are proposed enhancements to HDFS and Hadoop query engines (such as Pig, Hive, and Cascading) that aim to take better advantage of the RAM available in a Hadoop cluster. In brief, my understanding of the feature is that systems can build materialized queries (in the ‘materialized view’ DB sense) that become available to a query optimizer, while another optimization system watches query load and discards materialized queries that are under-used. The article has much more detail, and there is a comment from the author comparing DMMQ with Spark’s RDDs.
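To make the discard-under-used idea concrete, here’s a toy sketch (my own illustration, not the proposed implementation) of a cache of materialized query results that evicts the least-used entry when it fills up:

```python
class MaterializedQueryCache:
    """Toy sketch: materialized query results with use counts; the
    least-used entry is discarded when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.results = {}   # query -> materialized result
        self.uses = {}      # query -> hit count

    def materialize(self, query, result):
        if len(self.results) >= self.capacity:
            # discard the under-used entry to make room
            victim = min(self.uses, key=self.uses.get)
            del self.results[victim]
            del self.uses[victim]
        self.results[query] = result
        self.uses[query] = 0

    def lookup(self, query):
        if query in self.results:
            self.uses[query] += 1
            return self.results[query]
        return None  # the optimizer falls back to recomputing

cache = MaterializedQueryCache(capacity=2)
cache.materialize("SELECT count(*) FROM logs", 42)
cache.materialize("SELECT max(ts) FROM logs", "2014-06-01")
cache.lookup("SELECT count(*) FROM logs")      # bump its use count
cache.materialize("SELECT min(ts) FROM logs", "2014-01-01")  # evicts max(ts)
print(sorted(cache.results))
```

The real proposal is considerably richer (the optimizer rewrites queries to use the materialized data, and discard decisions are driven by observed query load), but the cache-plus-eviction shape is the core of it.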


Related to DMMQ, Discardable Distributed Memory (DDM) is a proposed new feature of HDFS for implementing DMMQs. Data in DDM is discardable and lazily backed by an HDFS file. It might be used to store intermediate outputs from Hive or Spark RDDs. The full post has a lot more detail on the architecture and links to relevant background references.
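A toy sketch of the “discardable, lazily backed” idea (again my own illustration, using a local temp file in place of HDFS): data lives in memory, can be persisted to a backing file, discarded under memory pressure, and faulted back in on the next read.

```python
import os
import tempfile

class DiscardableBuffer:
    """Toy sketch of the DDM idea: in-memory data that is lazily
    persisted to a backing file and can be discarded and re-read."""
    def __init__(self, data):
        self._data = data
        self._path = None

    def persist(self):
        """Lazy write-back to the backing store (a temp file here)."""
        if self._path is None:
            fd, self._path = tempfile.mkstemp()
            with os.fdopen(fd, "wb") as f:
                f.write(self._data)

    def discard(self):
        """Free the in-memory copy under memory pressure."""
        self._data = None

    def read(self):
        if self._data is None:  # fault the data back in from the file
            with open(self._path, "rb") as f:
                self._data = f.read()
        return self._data

buf = DiscardableBuffer(b"intermediate results")
buf.persist()
buf.discard()
print(buf.read())  # b'intermediate results'
os.unlink(buf._path)  # clean up the backing file
```

In the real proposal the backing store is an HDFS file, so a discarded block can be recovered by any node in the cluster, not just the one that wrote it.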


Lots of folks love Spark at first glance, but stories from early adopters describe several gotchas that crop up as you try to do non-trivial tasks with it. This post looks at many of the problems and types of debugging & tweaking that need to be done to make Spark scale. There’s also a good discussion of other tools and probable fixes in newer versions of Spark in the comments.


The Cloudera blog has a post describing the implementation details of Spark-on-YARN. It contains an overview of the various daemons and data flow paths that exist during the lifetime of a Spark application (and contrasting these to that of a MapReduce job running on YARN). It discusses two options for running a Spark job on YARN—yarn-cluster and yarn-client—as well as the tradeoffs of these options. If you’re looking to deploy Spark on a YARN cluster, these details will be very valuable.


Puppet is a configuration management tool for provisioning compute instances. The latest episode of the weekly Puppet podcast covers the CDH module for Puppet, which can be used to deploy the CDH stack.


As I’ve said before, I’m always a bit wary of vendor benchmarks because a vendor typically won’t publish results unless they’re favorable for their own system. Regardless, it’s still interesting to read about how folks benchmark and what kinds of results they’re seeing with various systems. In this experiment, Cloudera tried out Cloudera Impala, Hive-on-Tez, Shark, and Presto.


BOSH is a tool for provisioning distributed systems with support for several cloud providers and IaaS tools like AWS EC2, Google Compute Engine, and OpenStack. A post on the Pivotal Blog details how to use BOSH to provision a Hadoop cluster on Google Compute Engine in less than three minutes.


eBay has started using Apache Spark on their YARN-based Hadoop cluster. A post on the eBay tech blog describes some of the things they’re using Spark for, including k-means clustering and Shark (the Hive-on-Spark implementation). There’s a code snippet detailing how to use the k-means clustering library in Spark, and there’s some background on what Spark is and how it works.
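The clustering itself is done with Spark’s MLlib; as a rough illustration of what the k-means algorithm does, here’s a plain-Python, one-dimensional toy (not eBay’s Spark code):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # recompute each center; keep it in place if its cluster is empty
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

# Two obvious clusters around 1.5 and 9.5; starting centers are arbitrary.
print(kmeans_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0]))  # [1.5, 9.5]
```

MLlib runs the same assign-and-recompute loop, but distributes the assignment step across the cluster’s partitions of the data.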


RHadoop is a system for accessing data in HDFS and HBase from R, as well as for running MapReduce jobs from R. This tutorial details how to set up Hadoop, HBase, R, and RHadoop on a Mac using Homebrew. There’s a fair amount of configuration required on the R side of things once all the pieces are in place, and the tutorial caps off by running WordCount as a MapReduce job using RHadoop.


Java Code Geeks has a post describing how to index a data set in ElasticSearch using Hive. The post is the fourth in a series revolving around end-to-end processing of search click data. The post also shows some examples of using the ElasticSearch functionality of Spring Data to run queries against the ElasticSearch cluster.



Hadoop Summit is this week in San Jose, and the Hortonworks blog has some details on the event. There are a number of meetups taking place in addition to the conference (see the events section below).


Last week there were two articles that touched on running Hadoop in the cloud. Whereas those focused more on performance (particularly HBase) and price, this article takes a more holistic view of why we might start seeing more Hadoop deployments in the cloud. For instance, if all of your data already lives in a blob store like Amazon S3, it might not make sense to copy it into your own data center.


Trifacta announced a series C financing round of $25 million. Trifacta’s system aims to increase analyst and developer productivity by pre-processing raw data. Their new round will help them expand marketing and sales efforts.


Forbes Contributor Dan Woods has written a piece that raises some frank and tough questions about Cloudera’s business model. The two main questions raised in the article are 1) Does Cloudera want to position itself as a replacement for or complement of the enterprise data warehouse? and 2) How will Cloudera differentiate itself (and keep customers)? I find it hard to believe Cloudera would raise so much money without good answers to these questions, but there definitely seems to be some miscommunication in the press.


Videos from Berlin Buzzwords have been posted, and there are a number of talks about components in the Hadoop ecosystem. Speakers include Roman Shaposhnik of Pivotal (on Apache Giraph), Steve Loughran of Hortonworks (on YARN application development), Ted Dunning of MapR (on deep learning on time series), and Mark Miller & Wolfgang Hoschek of Cloudera (on integrating search and Hadoop). Seems like it was a great conference with a number of high-quality talks.



Cloudera released CDH 4.7 and Cloudera Search 1.3. CDH 4.7 contains a number of bug fixes to HBase, HDFS, Hue, Hive, and more. Search 1.3.0, which requires CDH 4.7, is also a bug fix release.


Apache Spark has hit version 1.0. With the new release, the project is guaranteeing API compatibility across the 1.x release line. Version 1.0 also adds a number of new features and improvements, including Spark SQL, better support for secure Hadoop, and updates to MLlib (for machine learning), Spark Streaming, and GraphX (graph processing).


Apache Ambari 1.6.0 was released. This version of the Hadoop cluster management software brings support for “blueprints” which provide reusable templates for configuring clusters. The Hortonworks blog has more details on this feature as well as the other new features of the release, Stacks and PostgreSQL support.


HP announced the next version of their Vertica Analytics Platform, and it includes a number of new features to integrate with Hadoop. In addition to directly supporting data stored on HDFS, the new version can read data stored in Parquet and Avro files.


PivotalR is a new open-source tool from Pivotal to integrate R with a SQL backend running on a Hadoop cluster. It supports MADlib for running machine learning algorithms directly in the DB. The source code is available on GitHub, and it currently supports Greenplum, Pivotal HD / HAWQ, and PostgreSQL backends.


Hive_test is a tool for writing tests against an embedded Hive using an embedded Derby DB. It also provides Maven tools for integrating with Hadoop.



Curated by Mortar Data ( http://www.mortardata.com )




“Mathematical Shape of Big Data” @ Hadoop Summit (San Jose) – Monday, June 2

A Leap Forward for SQL on Hadoop & Lightning Talks @ Hadoop Summit (San Jose) – Monday, June 2

Meet the AWS Elastic MapReduce Team at the Hadoop Summit (San Jose) – Wednesday, June 4

Accelerate Big Data Application Development with Cascading and HDP (San Jose) – Monday, June 2

Mining Big Data with Apache Spark (San Francisco) – Thursday, June 5

Hadoop Summit 2014 Oozie Meetup (San Jose) – Tuesday, June 3

Apache Ambari Birds of Feather Session (San Jose) – Thursday, June 5

Hadoop 2 Changes Everything. What’s Next? (La Jolla) – Thursday, June 5

Big Data Analytics by Alteryx (Irvine) – Friday, June 6


Make the Elephant Fly: Real-Time Data Performance (Denver) – Wednesday, June 4


Introduction to Hadoop, Part 1 (Tempe) – Wednesday, June 4


Big Data, Big Challenges: To Hadoop or Not to Hadoop? (Cambridge) – Thursday, June 5


Big Data Kickoff in Vancouver (Downtown) (Vancouver) – Tuesday, June 3

Big Data Kickoff in Vancouver (Burnaby) (Vancouver) – Wednesday, June 4


A Deep Dive into Apache Drill – Fast Interactive SQL on Hadoop (Munich) – Thursday, June 5


Cluster de Hadoop con BigSQL (Madrid) – Friday, June 6


Bangalore Baby Hadoop Meetup (Bangalore) – Saturday, June 7


Read More…