Saving H2O Models from the R/Python API in a Hadoop Environment

When you are using H2O in a clustered environment, i.e. Hadoop, the machine on which h2o.saveModel() tries to write the model can be a different one, which is why you see the error “No such file or directory”. If you just give a local path, e.g. /tmp, and visit the machine from which the H2O connection was initiated in R, you […]
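
To make the excerpt concrete, here is a minimal Python sketch of the idea (the post’s title covers both the R and Python APIs); the cluster setup, HDFS URIs, and column name are illustrative assumptions, not from the post:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Connect to the running H2O cluster (e.g. one launched on Hadoop).
h2o.init()

# Train a small illustrative model.
frame = h2o.import_file("hdfs://namenode:8020/data/train.csv")
model = H2OGradientBoostingEstimator()
model.train(y="label", training_frame=frame)

# Saving to a plain local path like /tmp writes on whichever node holds
# the model, not necessarily the machine your client connected from,
# hence the "No such file or directory" error. An HDFS URI, by contrast,
# is visible to every node in the cluster.
saved_path = h2o.save_model(model, path="hdfs://namenode:8020/user/me/models", force=True)
print(saved_path)
```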

Continue reading


DRY HiveQL

DRY (don’t repeat yourself) is one of the fundamental principles of software engineering. The main idea is to avoid duplicating business/processing logic throughout the code. However, I rarely see it applied when writing SQL queries, which makes them difficult to understand and maintain. Below are a few tips on making HiveQL DRY. Quick Summary: Use […]

Continue reading


Optimizing Jaccard Similarity Computation for Big Data

Computing Jaccard similarity across all entries is a Herculean task: for n entries it is on the order of O(n²) pairwise comparisons. However, there are many proposed optimization techniques that can significantly reduce the number of pairs one needs to consider. I spent almost a week studying the various techniques and googling useful resources related to this topic. Below is […]
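
The excerpt doesn’t name the techniques, but one standard family is MinHash/LSH; as an illustration (the hashing scheme and signature length here are my own assumptions, not from the post), this tiny pure-Python sketch estimates Jaccard similarity from short signatures instead of comparing full sets:

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(items, seeds):
    """One min-hash per seed; two sets' min-hashes agree with
    probability equal to their Jaccard similarity."""
    # Python's built-in hash is stable within one process run,
    # which is all this sketch needs.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

random.seed(42)
seeds = [random.getrandbits(32) for _ in range(128)]

a = set("the quick brown fox jumps".split())
b = set("the quick brown dog sleeps".split())

sig_a = minhash_signature(a, seeds)
sig_b = minhash_signature(b, seeds)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(seeds)

print(round(jaccard(a, b), 3), round(estimate, 3))  # exact vs. estimate
```

Grouping the signatures into bands (locality-sensitive hashing) then prunes most candidate pairs before any exact comparison is made.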

Continue reading


Writing Hive Custom Aggregate Functions (UDAF): Part II

Now that we have Eclipse configured for UDAF development (see Part I), it’s time to write our first UDAF. Searching for custom UDAFs, most people have probably already come across the following page: GenericUDAFCaseStudy. That page has a good explanation of how to write a good generic UDAF, but as a newbie it can be daunting, […]

Continue reading


Apache Pig: Macro for Splitting Data Into Training and Testing dataset

Introduction: Apache Pig (> 0.7.0) comes with a handy operator, Split, to separate a relation into two or more relations. For instance, let’s say we have a website’s “users” data and, depending on a user’s age, we want to create separate datasets: kids, adults, and seniors. This can be easily achieved with a […]
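
The post’s macro itself is written in Pig Latin; as a language-neutral sketch of the underlying idea (tag each row with a random number and threshold it), here is a small Python version, where the 80/20 ratio, field names, and helper name are my own illustrative assumptions:

```python
import random

def train_test_split(records, train_fraction=0.8, seed=7):
    """Randomly route each record to the training or testing set,
    mirroring the per-row coin flip a Pig Split-based macro performs."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for rec in records:
        (train if rng.random() < train_fraction else test).append(rec)
    return train, test

users = [{"id": i, "age": age} for i, age in enumerate([5, 34, 70, 16, 42])]
train, test = train_test_split(users)
print(len(train), len(test))
```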

Continue reading


Easy-Hadoop, Rapid Hadoop Programming

In today’s digital world, one of the biggest challenges is dealing with large volumes of data. Apache Hadoop provides one solution to this problem: it is open-source software for distributed, scalable computing. Hadoop is usually deployed on clusters with hundreds of nodes. For quite some time I have been using Hadoop and […]

Continue reading