Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone. Please visit my GitHub and this specific page for more details. Installation note: Herringbone needs a Hadoop environment, so you must be on a Hadoop machine. Prerequisite: Thrift 0.9.1 (MUST have 0.9.1 as 0.9.3 and […]



Full working example of connecting to Netezza from Java and Python

Before connecting, make sure you can access the Netezza database and table from the machine where you will run the Java or Python samples. Connecting to a Netezza server from Python: check out my IPython/Jupyter notebook with the Python sample. Step 1: Import the Python jaydebeapi library: import jaydebeapi Step 2: Setting Database […]



Reading a nested Parquet file in Scala and exporting to CSV

Recently we worked on a problem where a compressed Parquet file contained many nested tables, some with array-type columns, and our objective was to read it and save it as CSV. We wrote a script in Scala which does the following: Handles nested, compressed Parquet content. Look […]
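The flattening step described above can be illustrated without Spark or Parquet at all. This is not the Scala script from the post; it is a minimal Python sketch of the general idea, assuming the nested rows have already been read into dictionaries. The function names `flatten` and `to_csv`, the dotted column naming, and the choice to join array values with `|` are all assumptions made for illustration.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys.

    Assumption: arrays hold scalar values and are joined with "|"
    so that each array fits in a single CSV cell.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structures, extending the key prefix.
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = "|".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat


def to_csv(records):
    """Flatten every record and render the result as CSV text."""
    rows = [flatten(record) for record in records]
    # Collect the union of columns, preserving first-seen order.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for nested Parquet content.
records = [
    {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}},
    {"id": 2, "user": {"name": "b", "tags": []}},
]
print(to_csv(records))
```

Joining arrays into one cell keeps one output row per input row; exploding each array element into its own row is the other common choice and changes the row count.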



Stacked Ensemble Model in Scala using H2O GBM and Deep Learning Models

In this full Scala sample we will use the H2O Stacked Ensembles algorithm. Stacking first builds models of various types with cross-validation, keeping the fold column for each model. The next step builds the stacked ensemble model using all the CV folds. You can learn more about Stacked Ensembles here. […]
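The two stages described above (cross-validated base models sharing fold assignments, then a combiner trained on their held-out predictions) can be sketched in plain Python. This is not H2O's implementation and uses no H2O API. The base learners (a mean predictor and a one-variable least-squares line), the single blend weight used as a stand-in metalearner, and every function name below are assumptions chosen to keep the example small.

```python
def assign_folds(n, k):
    """Assign each row to one of k folds; all base models share this."""
    return [i % k for i in range(n)]


def fit_mean(xs, ys):
    """Base learner 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base learner 2: one-variable least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def out_of_fold(fit, xs, ys, folds):
    """Predict each row with a model trained on the other folds."""
    preds = [0.0] * len(xs)
    for fold in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i, f in enumerate(folds):
            if f == fold:
                preds[i] = model(xs[i])
    return preds


def fit_blend_weight(preds_a, preds_b, ys):
    """Simplified metalearner: least-squares w in y ~ w*a + (1 - w)*b."""
    num = sum((a - b) * (y - b) for a, b, y in zip(preds_a, preds_b, ys))
    den = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return num / den if den else 0.5


# Toy data following y = 2x + 1 exactly (illustrative only).
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
folds = assign_folds(len(xs), 5)

# Stage 1: held-out predictions from each base learner, same folds.
oof_line = out_of_fold(fit_line, xs, ys, folds)
oof_mean = out_of_fold(fit_mean, xs, ys, folds)

# Stage 2: fit the combiner on the held-out predictions.
weight_line = fit_blend_weight(oof_line, oof_mean, ys)
print(weight_line)
```

Training the combiner on held-out predictions rather than in-sample ones is the reason the fold assignment has to be kept and shared across base models: it stops the combiner from rewarding a base model that merely memorised its training rows.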



Logistic Regression with H2O Deep Learning in Scala

Here is sample code that shows how to use the feed-forward-network-based Deep Learning algorithm from H2O to perform logistic regression. First, let's import the key classes specific to H2O: import org.apache.spark.h2o._ import water.Key import java.io.File Now we will create an H2O context so we can call key H2O functions specific to data ingest and […]



H2O AutoML examples in Python and Scala

AutoML is included in H2O version 3.14.0.1 and above. You can learn more about AutoML in the H2O blog here. H2O’s AutoML can be used to automate a large part of the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. The user can also use a performance […]



Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM models (regression and binomial classification) in Scala. Let's first import all the classes we need for this project: import org.apache.spark.SparkFiles import org.apache.spark.h2o._ import org.apache.spark.examples.h2o._ import org.apache.spark.sql.{DataFrame, SQLContext} import water.Key import java.io.File import water.support.SparkContextSupport.addFiles import water.support.H2OFrameSupport._ // Create SQL support implicit val sqlContext = […]



Scala Example with Grid Search and Hyperparameters for GBM in H2O

Here is the full Scala source code for performing grid search and hyperparameter optimization for GBM using H2O (the code is on GitHub as well): import org.apache.spark.SparkFiles import org.apache.spark.h2o._ import org.apache.spark.examples.h2o._ import org.apache.spark.sql.{DataFrame, SQLContext} import water.Key import java.io.File import water.support.SparkContextSupport.addFiles import water.support.H2OFrameSupport._ // Create SQL support implicit val sqlContext = spark.sqlContext import sqlContext.implicits._ […]



Getting all categorical values for predictors in H2O POJO and MOJO models

Here is a Java/Scala code snippet that shows how to get the categorical values for each enum/factor predictor from H2O POJO and MOJO models. To get the list of all column names in your POJO/MOJO model, you can try the following. Imports: import java.io.*; import hex.genmodel.easy.RowData; import hex.genmodel.easy.EasyPredictModelWrapper; import hex.genmodel.easy.prediction.*; import hex.genmodel.MojoModel; import java.util.Arrays; […]
