Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone Please visit my GitHub and this specific page for more details. Installation: Note: You must be on a Hadoop machine, because Herringbone needs a Hadoop environment. Prerequisite: Thrift 0.9.1 (MUST have 0.9.1 as 0.9.3 and […]

Continue reading


Reading nested parquet file in Scala and exporting to CSV

Recently we were working on a problem where a compressed parquet file had many nested tables, some of which had array-type columns. Our objective was to read the file and save it as CSV. We wrote a script in Scala which does the following: Handles nested parquet compressed content Look […]
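The flattening idea described above can be sketched in a few lines. This is not the Scala script from the post; it is a minimal pure-Python illustration that assumes the parquet rows have already been read into nested dictionaries. The function names, the dot-separated column naming, the "|" separator for array values, and the sample rows are all hypothetical choices made for this sketch.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level; lists are joined into one cell."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested struct: recurse and join the names with a dot.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Array column: keep it in a single cell, separated by "|".
            flat[name] = "|".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Return CSV text for a list of nested records."""
    rows = [flatten(r) for r in records]
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for records read from a parquet file.
sample = [
    {"id": 1, "user": {"name": "a", "address": {"city": "x"}}, "tags": ["p", "q"]},
    {"id": 2, "user": {"name": "b", "address": {"city": "y"}}, "tags": []},
]
csv_text = to_csv(sample)
```

Joining array values into one cell keeps one CSV row per record; exploding arrays into multiple rows is an equally reasonable design, depending on how the CSV will be consumed.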

Continue reading


Setting up jupyter notebook server as service in Ubuntu 16.04

Step 1: Verify the jupyter notebook location:
$ ll /home/avkash/.local/bin/jupyter-notebook
-rwxrwxr-x 1 avkash avkash 222 Jun 4 10:00 /home/avkash/.local/bin/jupyter-notebook*
Step 2: Configure your jupyter notebook with a password and IP address as needed, and note where the configuration file is located. We will use this file as the configuration for jupyter as a service.
jupyter config: /home/avkash/.jupyter/jupyter_notebook_config.py
Step 3: Create […]

Continue reading


Using RESTful API to get POJO and MOJO models in H2O

CURL API for listing models: http://<hostname>:<port>/3/Models/
CURL API for listing a specific POJO model: http://<hostname>:<port>/3/Models/model_name
CURL API for listing a specific MOJO model: http://<hostname>:<port>/3/Models/glm_model/mojo
Here is an example:
curl -X GET "http://localhost:54323/3/Models"
curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> NAME_IT
curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> dl_model.java
curl -X GET "http://localhost:54323/3/Models/glm_model/mojo" > myglm_mojo.zip
That's it, enjoy!!
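The same requests can be issued from a script instead of curl. The sketch below builds the three URL shapes listed in the text and includes a small download helper based on the Python standard library. The function names `model_urls` and `download` are hypothetical, and the host, port, and model name are simply the example values from the text; the download itself is left commented out because it needs a running H2O instance.

```python
from urllib.parse import quote
from urllib.request import urlopen


def model_urls(host, port, model_name):
    """Build the model endpoints from the text for one H2O instance."""
    base = f"http://{host}:{port}/3/Models"
    # Escape the model name so unusual characters stay inside one path segment.
    name = quote(model_name, safe="")
    return {
        "list": base,
        "model": f"{base}/{name}",
        "mojo": f"{base}/{name}/mojo",
    }


def download(url, path):
    """Save the response body of a GET request to a local file."""
    with urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())


urls = model_urls("localhost", 54323, "glm_model")
# With an H2O instance running, the MOJO could then be saved like this:
# download(urls["mojo"], "myglm_mojo.zip")
```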

Continue reading


Installing ipython 5.0 (lower than 6.0) compatible with python 2.6/2.7

You may need to install a python library or component in your python 2.6 or 2.7 environment. If those components need IPython, then you need an IPython version lower than 6.0, because IPython 6.0 and later support only Python 3. For example, with python 2.7.x when you try to install jupyter as below:
$ pip install jupyter --user
You will get the error as below: Using cached […]
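The version rule from the title can be written as a small check. This is only a sketch: the helper name `ipython_requirement` is hypothetical, and it encodes just the one constraint stated in the title (stay below IPython 6.0 on a Python 2 interpreter).

```python
import sys


def ipython_requirement(version_info=sys.version_info):
    """Return a pip requirement string for IPython suited to the interpreter."""
    if version_info[0] < 3:
        # IPython 6.0 dropped Python 2 support, so stay below it.
        return "ipython<6.0"
    return "ipython"


requirement = ipython_requirement()
# The result can be passed to pip, for example:
#   pip install "ipython<6.0" --user
```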

Continue reading


Installing R on Redhat 7 (EC2 RHEL 7)

Check your machine version:
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Now let's update the RPM repo details:
$ sudo su -c 'rpm -Uvh http://mirror.sfo12.us.leaseweb.net/epel/7/x86_64/e/epel-release-7-9.noarch.rpm'
$ sudo yum update
Make sure all dependencies are installed individually:
$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-devel-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall blas-devel-3.4.2-5.el7.x86_64.rpm
$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall […]

Continue reading


Binomial classification example in Scala and GBM with H2O

Here is a sample binomial classification problem using the H2O GBM algorithm with the Credit Card data set in Scala. This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.
import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial
val hc […]

Continue reading


Using H2O with Microsoft R Open on Linux Machine

Installation:
Microsoft R Open page: https://mran.microsoft.com/open/
Ubuntu download link: https://mran.microsoft.com/install/mro/3.3.3/microsoft-r-open-3.3.3.tar.gz
$ wget https://mran.microsoft.com/install/mro/3.3.3/microsoft-r-open-3.3.3.tar.gz
$ tar -xvf microsoft-r-open-3.3.3.tar.gz
$ cd microsoft-r-open
$ sudo bash install.sh
The installation goes into the following folder:
$ ll /usr/lib64/microsoft-r/3.3/lib64/R/bin/
drwxr-xr-x 11 root root 4096 Apr 20 15:28 ./
drwxr-xr-x 4 root root 4096 Apr 20 15:28 ../
drwxr-xr-x 3 root root […]

Continue reading


Plotting scoring history from H2O model in python

Once you build a model with H2O, the scoring history can be seen in the model details or the model metrics table. If validation is enabled, the validation history is visible as well. You can see these metrics in the Flow UI; however, if you are using the python shell then you may want to plot […]
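A plot of this kind can be sketched as follows. The code does not call H2O itself: it assumes the scoring history is already available as a list of row dictionaries, and that the columns follow a `training_<metric>` / `validation_<metric>` naming pattern with a step column such as `number_of_trees`. Those column names, the helper names, and the sample rows are assumptions made for illustration; the actual columns depend on the model type. In the H2O Python client the rows might be obtained from `model.scoring_history()` when it returns a pandas DataFrame, which is also an assumption here.

```python
def metric_series(history_rows, x_column, metric):
    """Collect x values plus training and validation values for one metric."""
    xs, train, valid = [], [], []
    for row in history_rows:
        xs.append(row[x_column])
        # Missing columns yield None, e.g. when no validation frame was used.
        train.append(row.get(f"training_{metric}"))
        valid.append(row.get(f"validation_{metric}"))
    return xs, train, valid


def plot_history(history_rows, x_column, metric):
    """Plot training (and, if fully present, validation) values of a metric."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib.pyplot as plt

    xs, train, valid = metric_series(history_rows, x_column, metric)
    plt.plot(xs, train, label=f"training {metric}")
    if all(value is not None for value in valid):
        plt.plot(xs, valid, label=f"validation {metric}")
    plt.xlabel(x_column)
    plt.ylabel(metric)
    plt.legend()
    plt.show()


# Hypothetical rows standing in for a model's scoring history.
rows = [
    {"number_of_trees": 0, "training_rmse": 0.50, "validation_rmse": 0.52},
    {"number_of_trees": 1, "training_rmse": 0.42, "validation_rmse": 0.47},
    {"number_of_trees": 2, "training_rmse": 0.37, "validation_rmse": 0.45},
]
xs, train, valid = metric_series(rows, "number_of_trees", "rmse")
# plot_history(rows, "number_of_trees", "rmse")  # opens a matplotlib window
```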

Continue reading


Changing categorical value names with the set_levels API in H2O

Sometimes you may need to change the categorical names within a dataset to other values. This can be done using the h2o.set_levels() API as below:
>>> df = h2o.import_file("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/iris/iris.csv")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df
  C1    C2    C3    C4  C5
----  ----  ----  ----  -----------
 5.1   3.5   1.4   0.2  Iris-setosa
 4.9   3     1.4   0.2  Iris-setosa
[…]

Continue reading