Optimizing Jaccard Similarity Computation for Big Data

Computing Jaccard similarity across all entries is a Herculean task: for n entries, the number of comparisons is on the order of O(n²). However, many proposed optimization techniques can significantly reduce the number of pairs that one needs to consider. I spent almost a week studying the various techniques and googling useful resources related to this topic. Below is […]
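As a minimal sketch of why the naive approach is quadratic, the snippet below computes the Jaccard coefficient |A ∩ B| / |A ∪ B| for every pair of token sets. The function names `jaccard` and `all_pairs_similarity`, the whitespace tokenization, and the sample entries are illustrative assumptions, not taken from the post.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; treated as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def all_pairs_similarity(entries: list) -> dict:
    """Naive all-pairs comparison: visits n * (n - 1) / 2 pairs."""
    # Assumption: each entry is a string tokenized on whitespace.
    token_sets = [set(entry.lower().split()) for entry in entries]
    return {
        (i, j): jaccard(token_sets[i], token_sets[j])
        for i, j in combinations(range(len(token_sets)), 2)
    }


if __name__ == "__main__":
    sample = ["big data tools", "big data", "hadoop yarn"]
    for pair, score in all_pairs_similarity(sample).items():
        print(pair, round(score, 3))
```

Every pair is visited exactly once, so the work grows quadratically with the number of entries; the optimization techniques mentioned above aim to avoid generating most of these pairs in the first place.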


Efficient Textual Similarity Across Millions of Web Queries

Computing textual similarity (such as the Jaccard similarity coefficient) between millions of search queries can be an arduous task. The main challenge is the number of pairs that one needs to consider; a relatively small dataset containing ten thousand queries leads to more than 49 million possible query pairs (10,000 × 9,999 / 2 = 49,995,000). Based on the paper by Vernica et al., I show […]
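The pair count quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `pair_count` is an assumption made for illustration.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return comb(n, 2)


if __name__ == "__main__":
    print(pair_count(10_000))     # 49995000
    print(pair_count(1_000_000))  # 499999500000
```

Going from ten thousand to one million queries multiplies the number of candidate pairs by roughly ten thousand, which is why pruning candidate pairs matters at that scale.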


Running Python and PySparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin provides cell-based notebooks (similar to Jupyter) for working with various applications, e.g. Spark, Python, Hive, and HBase, through its interpreters. With H2O and Sparkling Water you can use Zeppelin on a Hadoop cluster with YARN, and then use Python or PySparkling to submit jobs. Here are the […]


Accessing Remote Hadoop Server using Hadoop API or Tools from local machine (Example: Hortonworks HDP Sandbox VM)

Sometimes you may need to access the Hadoop runtime from a machine where Hadoop services are not running. In this process you will set up password-less SSH access to the Hadoop machine from your local machine. Once that is ready, you can use the Hadoop API to access the Hadoop cluster, or you can use Hadoop commands directly from the local machine […]


Hadoop 2.4.0 release (helpful links)

Kudos to the Hadoop community: the Hadoop 2.4.0 release is available for everyone to use. A short, non-exhaustive list of improvements in HDFS, MapReduce, and the overall framework follows. Hadoop 2.4.0 Highlights: HDFS: full HTTPS support; ACL support in HDFS, which allows easier access to Apache Sentry-managed data by components using it; Native supported […]


Hadoop HDFS Error: xxxx could only be replicated to 0 nodes, instead of 1

Sometimes when using Hadoop, either through HDFS directly or by running a MapReduce job that accesses HDFS, users get an error such as "XXXX could only be replicated to 0 nodes, instead of 1".
Example (1): Copying a file from the local file system to HDFS
$myhadoop$ ./currenthadoop/bin/hadoop fs -copyFromLocal ./b.txt /
14/02/03 11:59:48 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: […]


Troubleshooting YARN NodeManager – Unable to start NodeManager because mapreduce.shuffle value is invalid

With Hadoop 2.2.x you might find that the NodeManager is not running, and the failure reports the following error message when starting the YARN NodeManager:
2014-01-31 17:13:00,500 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Failed to initialize mapreduce.shuffle
java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:98) […]
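The naming rule quoted in the error message can be illustrated with a small check. This is a sketch under the assumption that the rule is exactly what the message states (only a-zA-Z0-9_, not starting with a digit); the regular expression and the helper `is_valid_aux_service_name` are written for illustration and are not Hadoop's own code.

```python
import re

# Assumed rule, read off the error message: only letters, digits, and
# underscores, and the name must not start with a digit.
SERVICE_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_aux_service_name(name: str) -> bool:
    """Return True if the whole name satisfies the assumed naming rule."""
    return SERVICE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("mapreduce.shuffle", "mapreduce_shuffle"):
        print(candidate, is_valid_aux_service_name(candidate))
```

The dot in `mapreduce.shuffle` is what fails the rule, while the underscore spelling `mapreduce_shuffle` passes it.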


YARN Job Problem: Application application_** failed 1 times due to AM Container for XX exited with exitCode: 127

Running a sample Pi job on YARN (Hadoop 0.23.x or 2.2.x) might fail with the following error message:
[Hadoop_Home] $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.10.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.23.10.jar 16 10000
Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote […]


Troubleshooting Hadoop Problems – FSNamesystem: FSNamesystem initialization failed.

Sometimes, after you start the NameNode, you do not see the NameNode process running, so you can take a look at the logs to understand what is going on. This is the message you get when you start the NameNode:
[exec] Starting namenodes on [localhost]
[exec] localhost: starting namenode, logging to */hadoop-0.23.10/logs/hadoop-avkashchauhan-namenode-Avkashs-MacBook-Pro.local.out
Now you can see the log […]
