Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone. Please visit my GitHub and this specific page for more details. Installation: Note: You must be on a Hadoop machine, because Herringbone needs a Hadoop environment. Prerequisite: Thrift 0.9.1 (MUST have 0.9.1 as 0.9.3 and […]



Handling a YARN ResourceManager issue with decommissioned nodes

If you hit the following exception with your YARN ResourceManager: ERROR/Exception: 17/07/31 15:06:13 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes over rm1. Not retrying because try once and fail. java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl Troubleshooting: Run the following command and you will see the exact same exception: $ yarn node -list […]



Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

What is Kylin? Kylin is an open source Distributed Analytics Engine from eBay, with a SQL interface and multi-dimensional analysis (OLAP), built to support extremely large datasets on Hadoop. Key Features: Extremely Fast OLAP Engine at Scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data. ANSI-SQL Interface on Hadoop: Kylin […]



Accessing Remote Hadoop Server using Hadoop API or Tools from local machine (Example: Hortonworks HDP Sandbox VM)

Sometimes you may need to access the Hadoop runtime from a machine where Hadoop services are not running. In this process you will set up password-less SSH access to the Hadoop machine from your local machine. Once that is ready, you can use the Hadoop API to access the Hadoop cluster, or you can run Hadoop commands directly from the local machine […]



Apache Ambari 1.6.0 support with Blueprints is released

What is Ambari Blueprint? Ambari Blueprint allows an operator to instantiate a Hadoop cluster quickly, and to reuse the blueprint to replicate cluster instances elsewhere, for example, as development and test clusters, staging clusters, performance testing clusters, or co-located clusters. Release URL: http://www.apache.org/dyn/closer.cgi/ambari/ambari-1.6.0 Ambari Blueprint supports PostgreSQL: Ambari now extends database support for […]



Free ebook: Introducing Microsoft Azure HDInsight

New Free eBook by Microsoft Press: Microsoft Press is thrilled to share another new free ebook with you: Introducing Microsoft Azure HDInsight, by Avkash Chauhan, Valentine Fontama, Michele Hart, Wee Hyong Tok, and Buck Woody. Introduction (excerpt): Microsoft Azure HDInsight is Microsoft’s 100 percent compliant distribution of Apache Hadoop on Microsoft […]



Setting up Pivotal Hadoop (PivotalHD 1.1 Community Edition) Cluster in CentOS 6.5

Download the Pivotal HD package: http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/pivotalhd_community_1.1.tar.gz The package consists of three tarballs: PHD-1.1.0.0-76.tar.gz PCC-2.1.0-460.x86_64.tar.gz PHDTools-1.1.0.0-97.tar.gz Untar the package and start with PCC (Pivotal Command Center). Install Pivotal Command Center: $tar -zxvf PCC-2.1.0-460.x86_64.tar.gz $PHDCE1.1/PCC-2.1.0-460/install Log in as the newly created user gpadmin: $ su - gpadmin $ sudo cp /root/.bashrc . $ sudo cp /root/.bash_profile . $ sudo […]



Troubleshooting Cloudera Manager installation and start issues

After Cloudera Manager is installed and running, you can access its web UI at <Your_IP_Address>:7180. If you cannot access it, here are a few ways to troubleshoot the problem. These details are helpful when you are installing Cloudera Hadoop on a remotely located machine and you just have shell access to […]
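A quick first check, before digging into logs, is whether anything is listening on port 7180 at all. The following is only a minimal sketch in Python, not part of the original post: the helper name `is_port_open` and the timeout value are assumptions made for illustration, and a successful TCP connection only shows that the port is reachable, not that Cloudera Manager itself is healthy.

```python
import socket


def is_port_open(host: str, port: int = 7180, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; 7180 is the web UI port
    mentioned in the text above.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("<Your_IP_Address>")` with the real address in place of the placeholder returns `False` when the port is closed, filtered by a firewall, or the service is not running.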



Hadoop HDFS Error: xxxx could only be replicated to 0 nodes, instead of 1

Sometimes when using Hadoop, either using HDFS directly or running a MapReduce job that accesses HDFS, users get an error such as: XXXX could only be replicated to 0 nodes, instead of 1. Example (1): Copying a file from the local file system to HDFS $myhadoop$ ./currenthadoop/bin/hadoop fs -copyFromLocal ./b.txt / 14/02/03 11:59:48 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: […]



Troubleshooting YARN NodeManager – Unable to start NodeManager because mapreduce.shuffle value is invalid

With Hadoop 2.2.x you might find that the NodeManager is not running, and that the following error message is reported when starting the YARN NodeManager: 2014-01-31 17:13:00,500 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Failed to initialize mapreduce.shuffle java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:98) […]
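The naming rule quoted in the error message can be checked mechanically. The sketch below is an illustration in Python, not Hadoop's own code: the function name and the regular expression are assumptions derived only from the wording of the message (characters a-zA-Z0-9_ and no leading digit). It shows why a name containing a dot is rejected, while an underscore form such as `mapreduce_shuffle` passes.

```python
import re

# Assumed pattern, derived from the error text: only a-zA-Z0-9_ is allowed
# and the first character must not be a digit.
_AUX_SERVICE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_aux_service_name(name: str) -> bool:
    """Return True if name satisfies the rule stated in the error message."""
    return _AUX_SERVICE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("mapreduce.shuffle", "mapreduce_shuffle"):
        print(candidate, is_valid_aux_service_name(candidate))
```

Running a check like this over the value of yarn.nodemanager.aux-services before restarting the NodeManager can catch the problem without waiting for the FATAL log entry.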
