Hadoop Weekly Issue #85

31 August 2014

This week’s issue features a lot of good technical content covering Apache Storm and Apache Spark. There are also a number of releases—Apache Flink, Apache Phoenix, Cloudera Enterprise, and Luigi. In addition, Hortonworks announced a technical preview of Apache Kafka support for HDP, and SequenceIQ unveiled Periscope, an open-source tool for YARN cluster auto-scaling.


The eBay blog has a post about NameNode Quality of Service. While running a large cluster, they’ve found that certain jobs can cause major issues by overwhelming the NameNode with too many RPCs. To combat that, they’ve worked on the FairCallQueue, which replaces the NameNode RPC handler’s FIFO queue. The post details the status of the implementation and shows how the implementation performs in their tests.
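
To make the idea concrete, here is a toy sketch of a fair call queue in Python. It is not the eBay or Hadoop implementation: the class name, the share thresholds, and the per-level weights are all assumptions chosen for illustration. Each caller's share of previously seen calls decides which priority level a new call lands in, and the levels are drained with weighted round-robin so that one heavy user cannot starve everyone else.

```python
from collections import Counter, deque


class FairCallQueueSketch:
    """Toy multi-level call queue; heavy callers are demoted to lower levels."""

    def __init__(self, thresholds=(0.25, 0.5), weights=(4, 2, 1)):
        # thresholds: call shares above which a user drops one more level (assumed values)
        # weights: how many calls each level may serve per round (assumed values)
        if len(weights) != len(thresholds) + 1:
            raise ValueError("need exactly one more weight than thresholds")
        self.thresholds = thresholds
        self.weights = weights
        self.levels = [deque() for _ in weights]  # level 0 is the highest priority
        self.calls_by_user = Counter()

    def _level_for(self, user):
        # Share of all calls seen so far that came from this user.
        total = sum(self.calls_by_user.values())
        share = self.calls_by_user[user] / total if total else 0.0
        # Each threshold exceeded pushes the call one level down.
        return sum(share > t for t in self.thresholds)

    def put(self, user, call):
        level = self._level_for(user)
        self.calls_by_user[user] += 1
        self.levels[level].append((user, call))

    def drain(self):
        """Yield queued calls using weighted round-robin across the levels."""
        while any(self.levels):
            for level, weight in zip(self.levels, self.weights):
                for _ in range(weight):
                    if not level:
                        break
                    yield level.popleft()


queue = FairCallQueueSketch()
for i in range(8):
    queue.put("heavy", f"h{i}")
for i in range(2):
    queue.put("light", f"l{i}")
order = [call for _, call in queue.drain()]
```

In a plain FIFO queue the two calls from the light user would wait behind all eight calls from the heavy user; in this sketch they are served right after the heavy user's first call.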


A number of organizations are working hard to enhance Apache Storm on several fronts. The fronts include security/multi-tenancy (Kerberos authentication, Hadoop security integration, user isolation), scalability improvements, high availability for the Nimbus service, and enhanced language/tooling support. The Hortonworks blog has an in-depth article discussing these features and more.


On the topic of Storm, the Hortonworks blog has posted more Hadoop Summit curated content covering Storm. It highlights seven presentations, which cover scaling Storm, Pig on Storm, R and Storm, and more.


Apache Storm integrates well with Apache Kafka (more below in an announcement from Hortonworks), and this tutorial builds a local environment using Docker and Fig for testing. It uses that environment to build a system for streaming log data through Kafka, using the Trident API to implement an exponentially weighted moving average, and sending alerts from Storm using XMPP.
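
As a rough illustration of the exponentially weighted moving average mentioned above, here is a minimal Python sketch. It is independent of Storm and Trident; the function name and the smoothing factor `alpha` are assumptions, and the tutorial's own implementation may differ.

```python
def ewma(values, alpha=0.5):
    """Return the running exponentially weighted moving average of values.

    alpha is the weight given to the newest observation (0 < alpha <= 1).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    averages = []
    current = None
    for value in values:
        # The first observation seeds the average; later ones are blended in.
        current = value if current is None else alpha * value + (1 - alpha) * current
        averages.append(current)
    return averages


smoothed = ewma([10, 20, 30], alpha=0.5)
```

A larger `alpha` reacts faster to recent values, while a smaller one smooths out short spikes, which matters when the average is used to trigger alerts.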


Continuing on the Storm theme, the Hortonworks blog has a post on performing micro-batching with Storm. The post focuses on implementing micro-batching with the Storm APIs (the Trident API provides micro-batching, too). This post describes three different ways to implement micro-batching and provides an example implementation using the “tick tuples” approach.
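
The tick-based approach can be sketched outside of Storm as follows. This is a simplified Python stand-in, not Storm API code: `TICK`, `MicroBatcher`, and `max_batch_size` are hypothetical names, and the size-based flush is an added assumption. Regular items are buffered, and a tick (or a full buffer) flushes the batch to a sink.

```python
TICK = object()  # stand-in for a periodic tick signal (hypothetical marker)


class MicroBatcher:
    """Buffer incoming items and hand them to a sink in batches."""

    def __init__(self, sink, max_batch_size=100):
        self.sink = sink                      # callable that receives a list of items
        self.max_batch_size = max_batch_size  # flush early if the buffer grows this large
        self.buffer = []

    def execute(self, item):
        if item is TICK:
            # A tick means "time is up": emit whatever has accumulated.
            self.flush()
            return
        self.buffer.append(item)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []
batcher = MicroBatcher(batches.append, max_batch_size=3)
for item in [1, 2, TICK, 3, 4, 5, 6, TICK, TICK]:
    batcher.execute(item)
```

Flushing on both a timer and a size limit bounds latency and memory at the same time; a tick that arrives with an empty buffer is simply ignored.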


Switching gears to the first of several articles on Apache Spark, this post covers Bayesian Machine Learning on Apache Spark. It discusses integrating the PyMC framework with Apache Spark to implement Markov Chain Monte Carlo (MCMC) methods. There are five parts to the post: an introduction to MCMC methods, an overview of the PyMC Python package and its API, integrating PyMC with Apache Spark, using the integration for topic modeling with MCMC, and performing distributed LDA on Spark with PyMC.


An upcoming release of Apache Spark will contain implementations of several common statistics functions found in many statistical computing packages like R and SciPy.stats. This post describes the new implementations, which cover correlations (Spearman and Pearson), hypothesis testing (chi-squared), stratified sampling, and random data generation.
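
For readers unfamiliar with the two correlation measures, here is a small single-machine Python sketch. It does not use Spark and is not the Spark implementation; the function names are assumptions. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (undefined for constant input)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotonic but non-linear relationship such as `y = x ** 2` on positive inputs, `spearman` returns 1 while `pearson` stays below 1, which is the usual reason to prefer the rank-based measure.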


The Lambda Architecture is a popular idea for building hybrid batch and speed (near-realtime) data processing systems. This tutorial provides an example of implementing this type of system using Apache Spark. In addition to the normal batch operation, Spark also has a micro-batch mode called Spark streaming. The same data processing function can be used by both the normal and streaming operation, as is demonstrated in the post. The accompanying source code (written in Scala) is available on github.
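
The point about sharing one processing function between the batch and streaming paths can be illustrated without Spark. The following Python sketch is only an analogy: the hashtag-counting workload and all function names are hypothetical, and the tutorial's Scala code is not reproduced here.

```python
from collections import Counter


def count_hashtags(events):
    """Shared processing logic: count hashtags in an iterable of text events."""
    counts = Counter()
    for text in events:
        for word in text.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
    return counts


def batch_view(all_events):
    """Batch layer: recompute the view from the full historical dataset."""
    return count_hashtags(all_events)


def speed_view(micro_batches):
    """Speed layer: apply the same function to each small batch of recent events."""
    view = Counter()
    for batch in micro_batches:
        view.update(count_hashtags(batch))
    return view


def query(batch, speed):
    """Serving layer: merge the batch view with the recent increments."""
    return batch + speed


historical = ["#spark rocks", "learning #Spark and #storm"]
recent = [["#storm now"], ["more #spark"]]
merged = query(batch_view(historical), speed_view(recent))
```

Because both layers call `count_hashtags`, a change to the counting logic only has to be made once.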


Cloudera has posted a new roadmap for Impala, its SQL on Hadoop system. The post recaps the features of Impala 1.2, 1.3, and 1.4, and it describes what will be delivered in version 2.0 (by end of 2014) and version 2.1 (in 2015). The highlights for 2.0 include new analytic window functions, spilling of queries to disk, and subqueries. The highlights of version 2.1 include long-anticipated support for nested data, CRUD for HBase, and an exciting feature for folks running in AWS—Amazon S3 integration.


The Hortonworks blog has a post on HTTPS for HDFS. The implementation makes use of client certificates for HTTPS client authentication, which in turn are verified by the HDFS daemons. It has details on the configuration changes required to enable and set up HTTPS as well as a walkthrough of the various SSL certificates that need to be generated (complete with example keytool invocations).



Qubole has written about how they see Hadoop complementing an existing data warehouse (DW) deployment. They suggest a DW is more appropriate for structured data, whereas large amounts of unstructured data are better handled by Hadoop. Workloads that require SLAs/predictable runtimes should use the DW, but Hadoop is good at ad hoc or fluctuating workloads (and this is a key area where Qubole’s cloud offering adds flexibility). It can be hard to find the right place to draw this line, so it’s interesting to hear from a vendor (who is likely working with customers for something like this).


As someone who has run a production Hadoop cluster, I found that a number of the points and anecdotes in this article ring true. The overarching theme is that Hadoop is not particularly good at meeting SLAs, partly because it’s easy to use Hadoop in unpredictable ways. The article has quotes from some Pepperdata folks about how their cluster orchestration software helps solve these types of issues.


This article is focussed on Hadoop for non-engineers, particularly folks in the healthcare industry. After giving a brief intro to the key components of Hadoop, it talks a bit about some of the implications for the healthcare industry. Specifically, there are a number of types of analyses that can be powered by Hadoop which couldn’t be done before. But with that said, there’s an interesting point that rings true in nearly every profession: the industry isn’t being held back by the data processing systems; the barrier is in acting on the data to improve healthcare.


Big news in the development process of Hadoop this week—the Apache Hadoop codebase has migrated from SVN to git.


Once a Hadoop cluster gets to a certain size, there are ultimately conflicts related to required native libraries on the cluster. Docker, which provides a mechanism for running an isolated environment on a Linux host, has great promise for solving this and other types of issues. GigaOm has an article that describes the state of the YARN and Docker integration, which is being led by Altiscale.



Luigi, the batch processing framework for Hadoop, released version 1.0.17 this week. The new release has a number of fixes and improvements, including support for storing data in ElasticSearch, support for loading JSON data into Redshift, an FTP task, and a new luigi command.


Apache Flink (incubating), previously known as Stratosphere, has released version 0.6. Flink is a data processing engine built atop YARN, targeting iterative processing and data streaming. The release includes over 100 resolved tickets, which cover things like support for POJOs and a new AvroOutputFormat.


Cloudera Enterprise 5.1.2 (which includes CDH 5.1.2 and Cloudera Manager 5.1.2) and CDH 5.0.4 were released. The CE 5.1.2 release includes a number of fixes covering nearly every component in the CDH stack. CDH 5.0.4 also includes a number of fixes across the stack.


Apache Phoenix 3.1 (for HBase 0.94.4+) and 4.1 (for 0.98.1+) were released earlier this week. Both releases contain a number of bug fixes, use of nested tables in queries, a Pig loader, and more. On top of that, the 4.1 release supports distributed tracing and local indexes.


Hortonworks has announced a technical preview of Apache Kafka for HDP 2.1. A post on the Hortonworks blog introduces Kafka and explains how it fits well with Apache Storm.


Periscope is a new open-source tool from SequenceIQ for auto-scaling and enforcing SLAs for YARN clusters. For a static cluster, it offers tools to enforce time-based and cluster capacity SLAs. In cloud environments, it can increase cluster capacity by spinning up new nodes. The code is available on github as part of a public beta. It’s used internally at SequenceIQ, but it relies on unreleased features of Apache Hadoop and Apache Ambari.



Curated by Mortar Data ( http://www.mortardata.com )



What is Practical Data Science? Co-hosted with Palo Alto Data Science Foundation (Menlo Park) – Thursday, September 4

Resistance Is Futile: What You Need to Know about Big Data (San Francisco) – Thursday, September 4


Apache Spark Night – Show and Tell (Austin) – Tuesday, September 2

Introduction to Hadoop Course, Part 1: Hadoop and Its Ecosystem (Austin) – Saturday, September 6


Understanding Your Customer’s Buying Journey Using Path Analysis on Hadoop (Phoenix) – Wednesday, September 3


Hadoop Security Deep Dive (Toronto) – Thursday, September 4


Managing Hadoop Workflows in the Enterprise + Jumpstart your Big Data Projects (Stockholm) – Monday, September 1


Large Datasets with WEKA + Big Data Use Cases & Industry Trends (Auckland) – Tuesday, September 2


OCG Meetup: Hadoop (Vienna) – Thursday, September 4


SQL en Hadoop: Un Gran Paso Adelante! (Mexico City) – Friday, September 5


Bangalore Hadoop – Big Data Meetup (Bangalore) – Saturday, September 6



Performance improvements for photo serving | code.flickr.com

We’ve been working to make Flickr faster for our users around the world. Since the primary photo storage locations are in the US, and information on the internet travels at a finite speed, the farther away a Flickr user is located from the US, the slower Flickr’s response time will be. Recently, we looked at opportunities to improve this situation. One of the improvements involves keeping temporary copies of recently viewed photos in locations nearer to users.  The other improvement aims to get a benefit from these caches even when a user views a photo that is not already in the cache.



Hadoop Weekly Issue #84

24 August 2014

This week’s edition has a lot of great technical content from prominent Hadoop vendors Hortonworks and Cloudera as well as newcomer SequenceIQ. There are also a couple of interesting articles based on real-world experience covering an A/B testing platform and Apache Zookeeper. Those types of articles tend to be quite good but more difficult to find—as always, if you have suggestions for the newsletter please send them my way!


Hortonworks has posted a video series on the most recent release of their distribution, HDP 2.1. The videos, which are recordings of several webinars, cover a large number of components including YARN, HDFS, Hive, and Ambari.


A guest post on the Hortonworks blog describes how SAS is working to bring their High-Performance Analytics (HPA) and LASR Analytics Server to YARN. The systems were originally built to run as MPI applications in which SSH was used to launch processes. With YARN, HPA uses the framework for process management, and there are improvements like enforcing CPU and memory limitations.


The Hortonworks blog has a post on an in-progress feature called container delegation. Before diving into container delegation, the post gives an intro to YARN’s resource and workload management. The new feature will be used, among other things, to provide additional per-query resources to a long-running application.


The SequenceIQ blog has a post on the YARN FairScheduler. The post has an introduction to the FairScheduler, the scheduling challenges, and some of its configuration options. Using an example test and an R-based analysis tool (which is open-sourced), the post finds that the FairScheduler is good at maintaining fairness.


The Hortonworks blog has had a number of security-related posts in the past week. This post summarizes the coverage, which includes posts on Apache Argus and Apache Knox. It also discusses posts from some partner vendors: Protegrity, Voltage Security, and Dataguise. Finally, it touches on some new Hadoop features: Transparent Data Encryption for HDFS and a Key Provider API and accompanying Key Management Server.


Apache Spark ships with the spark-submit script for submitting a job to a Spark cluster. Sometimes, it’s useful or necessary to programmatically submit a job. This post describes how to write a Scala program to do so, and how to invoke the resulting binary jar.


This post serves as a good introduction to partitioning of Hive tables. It outlines the motivation and benefits of partitioning and includes several tips and best practices.


The Cloudera blog has a post with several tips and examples for writing powerful Hive queries. It includes example queries with the LAG and LEAD analytic functions as well as using LATERAL VIEW and a UDTF to execute nested SQL queries. It also suggests some ways of organizing data, including the notion of a “supernova schema” which is somewhat akin to a materialized star-schema as a single table.
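
To show what LAG and LEAD compute, here is a small Python sketch over an already ordered list of values (one window partition). It only mimics the semantics of looking at the previous or next row; the function names and the use of plain lists are assumptions, the offset is assumed to be non-negative, and no Hive code from the post is reproduced.

```python
def lag(values, offset=1, default=None):
    """For each row, return the value `offset` rows earlier, or `default` if none."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """For each row, return the value `offset` rows later, or `default` if none."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


daily_totals = [10, 20, 30]
previous_day = lag(daily_totals)
next_day = lead(daily_totals)
```

Pairing each row with `previous_day` makes day-over-day differences a simple element-wise subtraction, which is a typical use of these functions.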


DZone has published a cheat sheet for Apache Hadoop. It includes things like HDFS architecture, HDFS command line examples, an overview of YARN, and an introduction to MapReduce. It also covers Pig and Hive as well as providing links to several ecosystem projects.


Camille Fournier, Zookeeper PMC and Rent the Runway CTO, spoke on using Zookeeper in the wild. Her talk covers a number of systems that use Zookeeper as well as a number that do not. One of her conclusions is that, while Zookeeper has a number of use-cases, it’s not always the best tool for the job.


The Pinterest engineering blog has a post on their A/B analytics platform. The post covers the implementation, which uses Kafka, Storm, MapReduce, HBase, and more. There’s an overview of the MapReduce workflow, the serving of metrics via HBase, and real-time processing via Storm. There’s also a discussion of statistical significance and group validation via chi-square.
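
As a hedged illustration of a chi-square check in an A/B setting, the sketch below computes the chi-square statistic for a 2x2 table of converted versus non-converted users in two groups. The function name, the conversion framing, and the sample numbers are assumptions, and every expected cell count must be positive; the post's own validation logic may test something different, such as group sizes.

```python
CRITICAL_95_DF1 = 3.841  # chi-square critical value for 1 degree of freedom at p = 0.05


def chi_square_2x2(conv_a, total_a, conv_b, total_b):
    """Chi-square statistic for conversions in group A versus group B."""
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


statistic = chi_square_2x2(60, 100, 40, 100)
is_significant = statistic > CRITICAL_95_DF1
```

With these made-up numbers every expected cell is 50, so each cell contributes 2 and the statistic is 8, which exceeds the 0.05 critical value for one degree of freedom.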



A new book on Apache Flume is in early release and available as an eBook from O’Reilly. The book is aimed at developers deploying and customizing Flume.


Allied Market Research recently released a report on the Hadoop-as-a-Service (HaaS) market. It expects that market to grow rapidly to $16.1B by 2020. The report notes that HaaS doubled from 2012 to 2013, and it expects that HaaS will become more and more competitive with on-premises deployments.


TPCx-HS is a new benchmark specification aimed at measuring the Hadoop Runtime, Hadoop Filesystem API implementations, and MapReduce layers. It is claimed to be the first “Industry Standard Big Data Benchmark,” and there are already plans for additional benchmarks. The ODBMS blog has an interview with Francois Raab, the author of the TPC-C Benchmark, and Yanpei Chen of the Performance Engineering Team at Cloudera. In the interview, they discuss some plans for big data benchmarks in more detail.


Using Apache BigTop, CDH5 has been tested in conjunction with GlusterFS 3.3 (specifically its glusterfs-hadoop FileSystem). There are some more details on the implementation in a guest post on the Cloudera blog.


The MapR blog has a transcript and video of a recent presentation by their CEO John Schroeder where he spends 5 minutes talking about several applications of Hadoop. He talks about the Aadhaar project’s biometric database, health care, advertising, music personalization, and MinuteSort.



Version 0.16.0 of the Kite SDK was released. This release adds support for Apache Spark, adds a new command-line ETL tool, fixes generation of Parquet Hive tables on Hive 0.13+, and adds a new parent pom for Kite SDK apps written for CDH5.


The folks at SequenceIQ have released a new Docker image for Apache Hadoop 2.5.0. Like previous versions, there are pseudo-distributed and fully distributed variants of the image. The image uses Apache Ambari to provision a cluster.


Microsoft made some announcements about their Azure cloud services this week. Among them, they announced the general availability of Apache HBase for HDInsight. The service had been in preview since June.


Spindle is a new analytics platform recently open-sourced by Adobe Research. It combines Apache Spark for processing, Apache Parquet for a data storage format, and a Spray-based HTTP server.


Mortar, the Hadoop/Pig as a Service system, has announced support for running jobs in local mode to improve development iteration.



Curated by Mortar Data ( http://www.mortardata.com )



eHarmony’s Hadoop Program (Irvine) – Thursday, August 28

Cybersecurity & Big Data Analytics with Hadoop (Mountain View) – Thursday, August 28

HBase Meetup @ Sift Science (San Francisco) – Thursday, August 28


MongoDB and Hadoop: Driving Business Insights (Austin) – Monday, August 25


Enabling Advanced Analytics & From Sandbox to Production PA (Kansas City) – Monday, August 25


Batch Data Processing at Spotify with Luigi (Madison) – Tuesday, August 26


Data Governance in Big Data – Cloudera/Gazzang (Dublin) – Tuesday, August 26

North Carolina

Tresata on Omnichannel Marketing Analytics in Hadoop (Charlotte) – Wednesday, August 27

RTP – Big Data Developer Day (Durham) – Thursday, August 28


Apache Spark Lessons Learned (McLean) – Tuesday, August 26

New Jersey

Storm: Real-Time Big Data Stream Processing at WebMD (Hamilton Township) – Tuesday, August 26


Hadooping @ Prague (Prague) – Monday, August 25


Database as a Service (CouchDB, MongoDB, Cassandra, DB2, Hadoop) in the Cloud (Zurich) – Tuesday, August 26


PaaS and Big Data Tools (Melbourne) – Wednesday, August 27

HDInsight: MapReduce and Beyond (Melbourne) – Thursday, August 28


3rd Spark London Meetup (London) – Thursday, August 28


Apache Spark: In Memory Map-Reduce (Hyderabad) – Saturday, August 30


Spark Meetup (Hangzhou) – Sunday, August 31



RDBMS vs Hadoop storage

RDBMS vs Hadoop storage is a topic that often comes to mind for newcomers to Hadoop, and it needs to be understood before we dive into other areas of Hadoop. Hadoop storage is useful for storing unstructured data from various systems, whereas an RDBMS is used for structured data storage after the data got […]


Exploring Life Without Compass

Compass is a great thing. At Flickr, we’re actually quite smitten with it. But being conscious of your friends’ friends is important (you never know who they’ll invite to your barbecue), and we’re not so sure about this “Ruby” that Compass is always hanging out with. Then there’s Ruby’s friend Bundler who, every year at the Christmas Party, tells the same stupid story about the time the police confused him with a jewelry thief. Enough is enough! We’ve got history, Compass, but we just feel it might be time to try seeing other people. 



Hadoop Weekly Issue #83

17 August 2014

The big news this week was the Apache Hadoop 2.5.0 release. There are also a number of interesting technical articles covering Apache Hadoop HDFS, Apache Drill, and several other ecosystem projects. Also, there’s an interesting post on profiling MapReduce jobs (which is typically quite challenging) with Riemann.


The Cloudera blog has a post on the motivation and design for HDFS caching, which was implemented as part of the Apache Hadoop 2.3.0 release. Cloudera recommends its use in CDH 5.1 to speed up Impala and other applications. Data is stored in cache by sending a cache directive to the NameNode, which keeps track of which files are cached where. This design allows applications to take advantage of locality of cached data (and enables zero-copy reads).


MapR is one of the biggest proponents of Apache Drill, so it’s interesting to hear their take on the recent 0.4.0 developer preview. This post talks about Drill’s agility (it can run queries directly over datasets without the need for a metastore), flexibility (its internal data model is JSON-like, allowing for nested data types), and familiarity (the query language is SQL). MapR also has pre-configured packages of Drill for their distribution.


IPython notebooks are a popular tool for data scientists, particularly when sharing data exploration tooling. Given that Spark has a Python API, it’s a natural (and powerful) idea to marry the two for data exploration and analysis. The Cloudera blog has a detailed tutorial on setting up IPython, pyspark, and a simple IPython notebook to interact with a Spark cluster. There is some example code on github and the IPython viewer.


Several months back, the Apache Mahout community announced a migration from MapReduce to Spark for the backend of core algorithms. In addition, they’re developing a Scala DSL for representing data transformations. This post looks at the Scala DSL and the rewritten (for Spark) item-based recommendation system. It also describes the command-line tool that can be used to run this system against data stored in text-delimited files.


Profiling distributed systems can be a complicated task. It’s particularly hard for MapReduce jobs where there is often a mix of user-code, library code (e.g. Hive, Cascading), and framework code. This post describes how Factual uses Riemann to profile Hadoop jobs. It describes the system’s profiling strategy and how results are collected at a central location for analysis. The post also describes several performance issues that the system helped to uncover and resolve.


This post on the Hortonworks blog describes how to use Apache Knox as a secure gateway to HiveServer2. It’s a fairly complicated setup (Hive client -> JDBC over HTTPS -> Knox -> HTTP -> HiveServer2), but it can be used to achieve perimeter security for a Hadoop cluster (Knox can authenticate users). The post shows how to configure Hive with Apache Ambari and the required connection strings for Knox and the Hive client (beeline). There’s also a section on configuring another client, Simba, over ODBC.


This presentation, recently given at the Chicago Hadoop User Group, describes the Drill data model/architecture (namely, schema “on-the-fly”), the Drill execution engine (which does runtime byte-code generation/compilation), and a Drill demo. The video of the presentation is available on vimeo at the second link below.


This post describes an end-to-end solution for building a recommendation engine using Apache Spark’s MLlib. The system uses MLlib’s alternating least squares algorithm to build up predictions for each user of the website, which are stored in MongoDB. It features an application built with the Play framework to serve recommendations. The code for the project is on github.


Apache Spark Streaming and Apache Storm are often mentioned as tools solving similar problems. But this presentation makes the point that Spark Streaming is a (micro) batch processing framework while Storm is a stream processing framework. Trident, the abstraction atop Storm, is more comparable to Spark Streaming. The rest of the presentation focusses on comparing Trident and Spark Streaming, including considerations for fault tolerance and reliability.


The Tachyon project is trying to solve a similar problem to the HDFS file caching solution described in an earlier post. It takes a different approach, though, by implementing an in-memory FileSystem that also supports writing through to persistent storage on HDFS (or S3 or anything implementing the FileSystem API). This post has several more details about the project, which is currently in an early release.



The Qubole blog has a post summarizing a number of recent announcements in the Hadoop ecosystem. It focusses on the business and enterprise side of the Hadoop news in more depth than this newsletter typically does.


Hortonworks announced that the code of the Hadoop security offering from XA Secure (which Hortonworks recently acquired) was submitted to the Apache incubator as the Argus podling. The post describes the project charter and invites developers to help build a community around the project.


ScaleOut hServer is a drop-in replacement for the Hadoop MapReduce engine that executes on data stored in-memory. ScaleOut announced this week that they’ve attained Hortonworks Certification.


A lot of marketing and news coverage of Hadoop surrounds tech companies in the Bay Area and New York. This article takes a look at other areas where Hadoop and big data are having major impact: the agriculture, insurance, and automotive industries.


Splice Machine, makers of an RDBMS backed by Apache HBase and Apache Derby, recently announced an $18M round of funding. This article has an interview with their CEO during which he explains more about their business plan and target customers. Rather than competing with existing Hadoop vendors, they’re hoping to grab users of Oracle, IBM, or other enterprise RDBMS products.


Hadoop is a relatively young software project, and it’s lacking a number of important features. This article discusses some of those key features (e.g. security and ease of operation) and points out that folks are using Hadoop anyway. The conclusion seems to be that Hadoop is often used as a supplement to existing systems, so folks are willing to use it even given its warts.



Apache Hadoop 2.5.0 was released. The new version includes updates to HDFS (extended file attributes, an improved web UI) and improvements for YARN (better REST API support and security for the application timeline server). The release also contains a large number of improvements (including to documentation) and bug fixes.


MapR has announced support for new versions of AsyncHBase, HBase, Hive, Flume, and Oozie for their distribution. Flume is seeing the largest update, going from Flume 1.4 to 1.5 (which includes a disk-spillable channel and more).


Apache Sqoop 1.4.5 was released. The new version adds support for Apache Accumulo and a new high-performance Oracle connector. There are also a large number of bug fixes and improvements (covering HBase, Avro, Amazon S3, and MySQL support).


Mortar (full disclosure: they help with this newsletter and syndicate Hadoop Weekly) have open-sourced their StoreFunc for DynamoDB. The so-called DynamoDBStorage UDF allows for efficiently writing data to DynamoDB as part of a Pig job. It is customizable in its write throughput and retry behavior.



Curated by Mortar Data ( http://www.mortardata.com )



Escape From Hadoop: Spark One-Liners for C* Ops (Milpitas) – Tuesday, August 19

OC Big Data Monthly Meetup #4 (Irvine) – Wednesday, August 20

Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) – Wednesday, August 20

Network Design Challenges for Hadoop Environments (San Francisco) – Wednesday, August 20


Boise BI User Group Summer Session (Boise) – Thursday, August 21


Hadoop Lunch at Adobe – Competition Rules/Details (Lehi) – Thursday, August 21


Genomic Sequencing & Hadoop (Scottsdale) – Tuesday, August 19

A Detailed Look at Big R: R + IBM InfoSphere BigInsights (Scottsdale) – Wednesday, August 20


Getting Jiggy with Change Data Capture and Slowly Changing Dimensions (Boulder) – Wednesday, August 20


Apache Drill: Building Highly Flexible, High Performance Query Engines (Omaha) – Thursday, August 21


Apache Samza: LinkedIn’s Real-Time Stream Processing Framework (Austin) – Wednesday, August 20

3rd Thursday Huddle! (Dallas) – Thursday, August 21


Hybrid BI Solutions with Hadoop and Microsoft Toolsets (Oak Brook) – Thursday, August 21

What’s New with Apache Spark? An Evening with Paco Nathan (Chicago) – Thursday, August 21


Building a Fully Functional Hadoop Cluster in 1 Hour for Less Than $1 (Richmond) – Tuesday, August 19

North Carolina

Triad Hadoop Users Group (Winston Salem) – Thursday, August 21


HUG Pittsburgh August Meeting (Pittsburgh) – Wednesday, August 20

SQL on Hadoop (Philadelphia) – Wednesday, August 20


Real-World Hadoop Applications, Built in Bucharest (Bucharest) – Thursday, August 21


Hadoop Meetup (Bangalore) – Saturday, August 23




Pig vs Hive is the question most often asked by newcomers to Hadoop. Pig is a procedural data-flow language; programmers can execute programs step by step as they define them. Optimization can be controlled for each and every step. Hive looks like SQL (Structured Query Language). Hive depends on its own optimizer and […]


Pig to ease Hadoop programming

Pig is a Hadoop extension that eases programming in MapReduce and at other levels through its simple high-level data processing language. Pig automatically optimizes Pig scripts, freeing them from manual tuning. Pig has two main components: a high-level data processing language called Pig Latin, and a compiler that compiles and executes Pig […]


12 Hadoop myths

Hadoop has been touted as one of the newer, and perhaps one of the best, technologies designed to extract value out of “big data”. Hadoop technology has so frequently been linked to the concept of big data that the two often appear in lockstep at conferences, industry briefings and in media reports. But as Hadoop becomes a household name, it […]


Hadoop Weekly Issue #82

10 August 2014

We’re in the midst of a summer lull, so this week’s issue is shorter than usual. The lack of quantity is made up for in great quality, though. Technical posts cover YARN, HBase, Accumulo, and building an EMR-like local dev environment. There is also news on Actian, Adatao, Splice Machine, and the HP-Hortonworks strategic partnership. Hopefully there’s something for everyone!


The Hortonworks blog has a post on the ongoing work to improve the fault-tolerance of YARN’s ResourceManager (RM). This post describes phase two of the RM restart resiliency work, which aims to keep existing YARN applications running during and after an RM reboot. The post covers the architecture of the solution, including which cluster state information is stored where.


Hortonworks has another post in their series on curated Hadoop Summit content. This time, it focusses on Hadoop security. They highlight four sessions covering recent improvements in Hadoop security, security for the Apache Knox Gateway’s REST APIs, using Hadoop for threat detection, and the future of Hadoop security.


The Apache blog has a post with updated performance evaluations of various HBase BlockCache configurations. They find that the on-heap LruBlockCache performs best and the next best configuration is the CombinedBlockCache:OffHeap (a hybrid of an L1 LruBlockCache and an L2 BucketCache which stores data off-heap). The post has details on the experimental setup and a deeper analysis of the results.
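
For readers new to the idea behind an LRU block cache, here is a minimal Python sketch of least-recently-used eviction. It is not HBase code: the class name and the entry-count capacity are assumptions, and the real LruBlockCache is considerably more involved.

```python
from collections import OrderedDict


class LruCacheSketch:
    """Keep at most `capacity` blocks, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, most recent last

    def get(self, key):
        if key not in self.blocks:
            return None  # cache miss
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used entry


cache = LruCacheSketch(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used block
cache.put("c", 3)    # evicts "b", the least recently used block
```

Reading a block refreshes its position, so frequently read blocks survive while blocks that were loaded once and never touched again are evicted first.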


An obstacle to adopting AWS’ Elastic MapReduce (EMR) can be building a local dev environment that matches EMR. While Amazon’s distribution isn’t open-source, this post describes how to set up an approximate local environment on a Mac. It shows you how to make configuration changes for S3 URIs, set up the AWS access keys, and add LZO compression support to Hadoop.


This is a good introduction to Apache Accumulo, the distributed key-value store built on HDFS. It describes the architecture at a high-level, contrasts it to Apache HBase, describes the data model (including column visibility), several use cases, and more.


This post looks at using Hadoop and new libraries for iterative computation, such as k-means clustering. It describes Iterative MapReduce, the Twister Programming Model, the Collective Model (the Harp project), and more. There are some experimental results of various frameworks for PageRank, K-means, and broadcast.



Videos of the 2014 Accumulo Summit, which took place in June, have been posted online. There are presentations from folks at Sqrrl, Cloudera, Hortonworks, and more.


The Hortonworks blog has a post from the HP team on the recent strategic partnership between Hortonworks and HP. It has some specifics on the partnership: Apache Ambari will be integrated with HP Operations Manager i (OMi).


Adatao announced a $13M Series A round of funding. The company makes pInsights, a predictive analytics and business intelligence solution built on Apache Spark. They also make pAnalytics, a system aimed at data scientists.


Splice Machine, a startup building a relational database on HBase, announced that its Series B round was increased by $3M to $18M in total. The latest money comes from Correlation Ventures.


Outspoken Hadoop skeptic and prolific DBMS researcher/creator Michael Stonebraker has written a post with the provocative title “Hadoop at a Crossroads?” He argues that with the death of MapReduce (focussing in particular on next-generation SQL-on-Hadoop systems), Hadoop (and its vendors) are on a collision course with data warehouse systems. The post also questions the future of HDFS, which he predicts might fall victim to specialized storage layers.


Actian recently announced the Actian Vector Hadoop Edition, which is a SQL-on-Hadoop system. This post has more details on the integration, including how Actian uses HDFS (it has a proprietary file format) and YARN.


Datanami has a post on Sinequa, makers of enterprise search software. The most recent version of their software adds support for analyzing data stored in HDFS and a handler for Apache Mahout to perform analysis using its algorithms.


GigaOm has an article exploring some of the recent momentum of Apache HBase. While Cassandra and MongoDB have seen a lot of press coverage and adoption, HBase is gaining steam. Specifically, it has good integration with the Hadoop ecosystem and a number of companies are starting to build applications on top of it (e.g. Continuuity’s Reactor and Splice Machine’s relational database).



Apache Drill 0.4.0 was released this week. Drill is general purpose analytics software that strives to build a more general framework than existing systems (i.e. SQL-on-Hadoop) by supporting a wide variety of storage systems/formats and queries. The 0.4.0 release is a massive step forward with 100,000 lines of new code from a wide variety of contributors. The Apache Blog has the highlights of the new release.


ZooKeeper 3.5.0-alpha was released this week. The release resolves over 500 Jira tickets, which include a large number of bug fixes and improvements. Among the improvements are the ability to dynamically reconfigure the ZooKeeper ensemble, improvements to recovery, better support for jdk7 and openjdk, and more.



Curated by Mortar Data ( http://www.mortardata.com )



August SF Hadoop Users Meetup (San Francisco) – Wednesday, August 13

Apache HBase: Understanding Where to Use It and How to Use It, with Subash DSouza (Los Angeles) – Wednesday, August 13

Apache Solr (Irvine) – Thursday, August 14


Introduction to Spark Course: Spark Streaming (6 of 7) (Austin) – Wednesday, August 13


Using Apache Drill (Chicago) – Wednesday, August 13


Distributed Data Storage: Comparing Cassandra, HBase, ElasticSearch and GridGain (Conshohocken) – Wednesday, August 13

New York

Neo4j Intro Workshop (New York) – Tuesday, August 12


A Leap Forward for SQL on Hadoop (Manchester) – Tuesday, August 12


Workshop: SQL on Hadoop (Moscow) – Friday, August 15

