DRY HiveQL

DRY (don’t repeat yourself) is one of the fundamental principles of software engineering. The main idea is to avoid duplicating business or processing logic throughout the code. However, I rarely see it applied to SQL queries, which makes them difficult to understand and maintain. Below are a few tips on making HiveQL DRY. Quick Summary: Use […]



Writing Hive Custom Aggregate Functions (UDAF): Part II

Now that we have Eclipse configured for UDAF development (see Part I), it’s time to write our first UDAF. Anyone searching for custom UDAFs has probably already come across the GenericUDAFCaseStudy page. That page explains well how to write a good generic UDAF, but for a newbie it can be daunting, […]



Writing Hive Custom Aggregate Functions (UDAF): Part I – Setting Eclipse

Writing your first user-defined aggregate function (UDAF) for Hive can be a daunting task. In particular, I ran into three challenges while working on my first UDAF: (1) no instructions on how to set up Eclipse for UDAF development; (2) often complicated instructions on how to write your first UDAF; (3) no clear instructions on how to debug […]



Finding Hadoop specific processes running in a Hadoop Cluster

Recently I was asked to provide information on all Hadoop-specific processes running in a Hadoop cluster. I ran a few commands, shown below, to gather it. Environment: Hadoop 2.0.x on Linux (CentOS 6.3), single-node cluster. First, list all Java processes running in the cluster: [cloudera@localhost usr]$ ps -A | grep java […]



Which one to choose between Pig and Hive?

Technically, both will do the job. If you are approaching this from an “either Hive or Pig” perspective, it means you don’t yet know what you are doing. However, if you first define the data source, the scope, and the result representation, and only then ask which one to choose between Hive and Pig, you will find they are […]
