Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with parquet files on hdfs, and with impala and hive. https://github.com/stripe/herringbone Please visit my github and this specific page for more details. Installation: Note: You must be using a Hadoop machine, as herringbone needs a Hadoop environment. Prerequisite: Thrift 0.9.1 (MUST have 0.9.1 as 0.9.3 and […]

Continue reading


Handling exception “Argument python_obj should be a …”

Recently I hit the following exception when running python code with H2O functions on a new machine; this exception does not happen on my main machine. The exception was as below: H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable Error in […]

Continue reading


Reading nested parquet file in Scala and exporting to CSV

Recently we were working on a problem where the parquet compressed file had lots of nested tables, and some of the tables had columns with array type. Our objective was to read it and save it to CSV. We wrote a script in Scala which does the following: Handles nested parquet compressed content Look […]

Continue reading
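To illustrate the general idea of turning nested records with array columns into flat CSV rows, here is a minimal sketch in Python using only the standard library. It is not the Scala script from the post: the function names `flatten` and `to_csv`, the dotted column naming, and the `|` separator for array values are all assumptions made for this example.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structures, extending the column prefix.
            flat.update(flatten(value, name))
        elif isinstance(value, (list, tuple)):
            # Assumption: array values are joined into one cell with "|".
            flat[name] = "|".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Flatten every record and return the result as CSV text."""
    rows = [flatten(record) for record in records]
    # Collect the union of column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for rows read from a parquet file.
sample = [
    {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}},
    {"id": 2, "user": {"name": "b", "tags": []}},
]
csv_text = to_csv(sample)
print(csv_text)
```

Joining array elements into one cell keeps one CSV row per input record; exploding arrays into multiple rows is another option, depending on how the CSV will be consumed.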


Installing ipython 5.0 (lower than 6.0) compatible with python 2.6/2.7

It is possible that you may need to install some python library or component with your python 2.6 or 2.7 environment. If those components need IPython, then you need an IPython version lower than 6.0. For example, with python 2.7.x when you try to install jupyter as below: $ pip install jupyter --user You will get the error as below: Using cached […]

Continue reading
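The version constraint in the title can be sketched as a small helper. This is only an illustration, relying on the fact that IPython 6.0 and later support Python 3 only; the function name `ipython_requirement` is hypothetical and not part of pip or IPython.

```python
import sys


def ipython_requirement(version_info=None):
    """Return a pip requirement string for IPython suited to the interpreter."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        # IPython 6.0+ no longer supports Python 2, so stay below 6.0.
        return "ipython<6.0"
    # On Python 3 no upper bound is needed for this purpose.
    return "ipython"


# Print the pip command matching the interpreter running this script.
print("pip install --user '%s'" % ipython_requirement())
```

Installing the pinned IPython first means a later `pip install jupyter --user` should find a compatible IPython already present.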


Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high-performance, distributed machine learning algorithms on Spark, using R. Visit github project: https://github.com/h2oai/rsparkling You must have the following packages installed in your R environment: sparklyr, h2o, rsparkling […]

Continue reading


Installing R on Redhat 7 (EC2 RHEL 7)

Check your machine version: $ cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.3 (Maipo) Now let's update the RPM repo details: $ sudo su -c 'rpm -Uvh http://mirror.sfo12.us.leaseweb.net/epel/7/x86_64/e/epel-release-7-9.noarch.rpm' $ sudo yum update Make sure all dependencies are installed individually: $ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-devel-3.4.2-5.el7.x86_64.rpm $ sudo yum localinstall blas-devel-3.4.2-5.el7.x86_64.rpm $ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-3.4.2-5.el7.x86_64.rpm $ sudo yum localinstall […]

Continue reading


TerminalOMatic: Customizing terminal to be even faster

For most MacBook Pro users, Terminal is an indispensable application. But we can make it even better. Below is a collection of some of the tricks that I found very useful. I generally put these commands in my ~/.bashrc file. You might however want to read this blog […]

Continue reading


Binomial classification example in Scala and GBM with H2O

Here is a sample for a binomial classification problem using the H2O GBM algorithm with the Credit Card data set in Scala. This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4. import org.apache.spark.h2o._ import water.support.SparkContextSupport.addFiles import org.apache.spark.SparkFiles import java.io.File import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport} import water.Key import _root_.hex.glm.GLMModel import _root_.hex.ModelMetricsBinomial val hc […]

Continue reading