Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone. Please visit my GitHub and this specific page for more details. Installation: Note: You must be on a Hadoop machine, as Herringbone needs a Hadoop environment. Prerequisite: Thrift 0.9.1 (you MUST have 0.9.1, as 0.9.3 and […]

Continue reading


Full working example of connecting Netezza from Java and python

Before you start connecting, make sure you can access the Netezza database and table from the machine where you will run the Java and/or Python samples. Connecting to a Netezza server from Python Sample Check out my IPython/Jupyter Notebook with the Python sample Step 1: Import the python jaydebeapi library import jaydebeapi Step 2: Setting Database […]
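The Python steps above can be sketched end to end. This is a minimal sketch, assuming the `jaydebeapi` package and the Netezza JDBC driver jar (`nzjdbc.jar`) are available on the machine; the host, database, and credentials below are placeholders, and 5480 is Netezza's default port.

```python
# Sketch of connecting to Netezza via JDBC from Python with jaydebeapi.
# Host, port, database, and jar path are placeholder assumptions.

def netezza_url(host, port=5480, database="testdb"):
    """Build a Netezza JDBC URL (5480 is the Netezza default port)."""
    return "jdbc:netezza://%s:%d/%s" % (host, port, database)

def connect_netezza(host, user, password, database="testdb", jar="nzjdbc.jar"):
    """Open a JDBC connection; requires a reachable Netezza server."""
    import jaydebeapi  # imported lazily so the URL helper works without it
    return jaydebeapi.connect(
        "org.netezza.Driver",                    # Netezza JDBC driver class
        netezza_url(host, database=database),
        [user, password],
        jar,                                     # path to the driver jar
    )
```

From the returned connection, `conn.cursor()` can then execute SQL as in the full post.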

Continue reading


Visualizing H2O GBM and Random Forest MOJO Models Trees in python

In this example we will first build a tree-based model using the H2O machine learning library and then save that model as a MOJO. Using the GraphViz/Dot library we will extract individual trees (and cross-validated model trees) from the MOJO and visualize them. If you are new to H2O MOJO models, learn here. You can also get full […]
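The extract-and-render pipeline described above can be sketched as two commands driven from Python: H2O's `hex.genmodel.tools.PrintMojo` tool emits a Graphviz `.gv` file for one tree, and Graphviz's `dot` renders it. This is a sketch assuming Java, `h2o.jar`, and Graphviz are installed; the file paths are illustrative.

```python
# Sketch: turn one tree of an H2O MOJO into a PNG via Graphviz.
# Assumes java, h2o.jar, and the `dot` executable are available.
import subprocess

def print_mojo_cmd(mojo_path, tree_index=0, out_gv="tree.gv", h2o_jar="h2o.jar"):
    """Command asking H2O's PrintMojo tool to emit a Graphviz .gv file."""
    return ["java", "-cp", h2o_jar, "hex.genmodel.tools.PrintMojo",
            "--tree", str(tree_index), "-i", mojo_path, "-o", out_gv]

def dot_cmd(gv_path="tree.gv", png_path="tree.png"):
    """Command rendering the .gv file to PNG with Graphviz."""
    return ["dot", "-Tpng", gv_path, "-o", png_path]

def visualize_tree(mojo_path, tree_index=0):
    """Run both steps; requires Java, h2o.jar, and Graphviz installed."""
    subprocess.check_call(print_mojo_cmd(mojo_path, tree_index))
    subprocess.check_call(dot_cmd())
```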

Continue reading


Stacked Ensemble Model in Scala using H2O GBM and Deep Learning Models

In this full Scala sample we will be using the H2O Stacked Ensembles algorithm. Stacked ensembling is a process of first building models of various types with cross-validation, keeping the fold columns for each model, and then building the stacked ensemble model using all the CV folds. You can learn more about Stacked Ensembles here. […]
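The two-step workflow above (cross-validated base models sharing folds, then a stacked ensemble over their CV predictions) can be sketched in Python with H2O's estimator API; the post itself is Scala, so this is a hedged counterpart. It assumes a running H2O cluster and an H2O frame `train` with predictor columns `x` and response `y`.

```python
# Python sketch of the H2O Stacked Ensembles workflow: base models
# must use identical folds and keep their CV predictions.
# Assumes a running H2O cluster; `train`, `x`, `y` are placeholders.

def ensemble_base_params(nfolds=5):
    """CV settings every base model must share for stacking."""
    return {"nfolds": nfolds,
            "fold_assignment": "Modulo",
            "keep_cross_validation_predictions": True}

def build_stacked_ensemble(train, x, y, nfolds=5):
    from h2o.estimators import (H2OGradientBoostingEstimator,
                                H2ODeepLearningEstimator)
    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

    common = ensemble_base_params(nfolds)
    gbm = H2OGradientBoostingEstimator(model_id="gbm_base", **common)
    gbm.train(x=x, y=y, training_frame=train)
    dl = H2ODeepLearningEstimator(model_id="dl_base", **common)
    dl.train(x=x, y=y, training_frame=train)

    # The metalearner combines the base models' CV predictions.
    ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, dl])
    ensemble.train(x=x, y=y, training_frame=train)
    return ensemble
```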

Continue reading


Logistic Regression with H2O Deep Learning in Scala

Here is sample code which shows how to use H2O's feed-forward-network-based Deep Learning algorithm to perform logistic regression. First, let's import key classes specific to H2O: import org.apache.spark.h2o._ import water.Key import java.io.File Now we will create the H2O context so we can call key H2O functions specific to data ingest and […]
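The same idea can be sketched in Python (the post uses Scala/Sparkling Water): a feed-forward Deep Learning model configured as a logistic-regression-style binomial classifier. This is a sketch, not the post's code; the `hidden` topology below is an illustrative assumption, and it assumes a running H2O cluster with a frame `train` whose response `y` is a binary factor.

```python
# Python sketch: H2O Deep Learning set up for a logistic-regression-style
# binary classification. Assumes a running H2O cluster.

def logistic_dl_params(epochs=10):
    """Settings nudging H2O Deep Learning toward logistic-regression-like
    behavior on a binary target (hidden=[1] is an illustrative choice)."""
    return {"distribution": "bernoulli",  # binary target -> logistic loss
            "hidden": [1],                # hypothetical minimal topology
            "epochs": epochs}

def train_logistic_dl(train, x, y):
    from h2o.estimators import H2ODeepLearningEstimator
    model = H2ODeepLearningEstimator(**logistic_dl_params())
    model.train(x=x, y=y, training_frame=train)
    return model
```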

Continue reading


H2O AutoML examples in python and Scala

AutoML is included in H2O version 3.14.0.1 and above. You can learn more about AutoML on the H2O blog here. H2O’s AutoML can be used to automate a large part of the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. The user can also use a performance […]
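The time-limited workflow described above looks like this as a minimal Python sketch, assuming H2O >= 3.14.0.1 and a running cluster; `train`, `x`, and `y` are placeholders for your frame and columns.

```python
# Minimal H2O AutoML sketch: train many models under one time budget.
# Assumes a running H2O cluster; frame/column names are placeholders.

def automl_settings(max_runtime_secs=600):
    """The user-specified time limit for the whole AutoML run."""
    return {"max_runtime_secs": max_runtime_secs}

def run_automl(train, x, y, max_runtime_secs=600):
    from h2o.automl import H2OAutoML
    aml = H2OAutoML(**automl_settings(max_runtime_secs))
    aml.train(x=x, y=y, training_frame=train)
    return aml.leaderboard   # models ranked by the default metric
```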

Continue reading


Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM models (regression and binomial classification) in Scala. Let's first import all the classes we need for this project: import org.apache.spark.SparkFiles import org.apache.spark.h2o._ import org.apache.spark.examples.h2o._ import org.apache.spark.sql.{DataFrame, SQLContext} import water.Key import java.io.File import water.support.SparkContextSupport.addFiles import water.support.H2OFrameSupport._ // Create SQL support implicit val sqlContext = […]
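The regression/classification split the post builds in Scala comes down to one setting on the same estimator; here is a hedged Python sketch of that idea, assuming a running H2O cluster (the `ntrees`/`max_depth` values are illustrative).

```python
# Python sketch: one H2O GBM estimator covers both tasks via the
# `distribution` parameter. Assumes a running H2O cluster.

def gbm_params(task, ntrees=50, max_depth=5):
    """'gaussian' for regression, 'bernoulli' for binomial classification."""
    dist = {"regression": "gaussian", "classification": "bernoulli"}[task]
    return {"distribution": dist, "ntrees": ntrees, "max_depth": max_depth}

def train_gbm(train, x, y, task="regression"):
    from h2o.estimators import H2OGradientBoostingEstimator
    model = H2OGradientBoostingEstimator(**gbm_params(task))
    model.train(x=x, y=y, training_frame=train)
    return model
```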

Continue reading


Scala Example with Grid Search and Hyperparameters for GBM in H2O

Here is the full Scala source code to perform grid search and hyperparameter optimization for GBM using H2O (here is the github code as well): import org.apache.spark.SparkFiles import org.apache.spark.h2o._ import org.apache.spark.examples.h2o._ import org.apache.spark.sql.{DataFrame, SQLContext} import water.Key import java.io.File import water.support.SparkContextSupport.addFiles import water.support.H2OFrameSupport._ // Create SQL support implicit val sqlContext = spark.sqlContext import sqlContext.implicits._ […]
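Grid search over GBM hyperparameters can be sketched in Python (the post is Scala); the grid values below are illustrative assumptions, and training assumes a running H2O cluster.

```python
# Python sketch of GBM grid search / hyperparameter optimization in H2O.
# The hyperparameter values are illustrative, not from the post.

def gbm_hyper_params():
    """An example hyperparameter grid: 3 x 3 x 2 = 18 models."""
    return {"ntrees": [20, 50, 100],
            "max_depth": [3, 5, 7],
            "learn_rate": [0.05, 0.1]}

def grid_size(hyper_params):
    """Number of models a full Cartesian grid will train."""
    n = 1
    for values in hyper_params.values():
        n *= len(values)
    return n

def run_grid(train, x, y):
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid import H2OGridSearch
    grid = H2OGridSearch(H2OGradientBoostingEstimator,
                         hyper_params=gbm_hyper_params())
    grid.train(x=x, y=y, training_frame=train)
    return grid.get_grid(sort_by="auc", decreasing=True)
```

A full Cartesian grid grows multiplicatively, which is why the post pairs the grid with search/stopping criteria.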

Continue reading


H2O backend and API processing through Rapids

An H2O cluster supports various frontends, i.e. Python, R, FLOW, etc., and all the functions at these various frontends are handled by the H2O cluster backend through an API. Frontend actions are translated into API calls, and the H2O backend handles these calls through Rapids expressions. We will look at how these APIs are handled on the backend. Let's start H2O […]
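The frontend-to-Rapids translation described above can be poked at directly from Python via `h2o.rapids`, which sends a Rapids expression straight to the backend. This is a sketch under the assumption of an initialized cluster; the prefix expression shown is an illustrative Rapids AST, not from the post.

```python
# Sketch: sending a Rapids expression to the H2O backend from Python.
# Requires an initialized cluster (h2o.init()) to actually evaluate.

def rapids_expr(op, *args):
    """Build a Rapids-style prefix expression string, e.g. (+ 1 2)."""
    return "(%s %s)" % (op, " ".join(str(a) for a in args))

def eval_rapids(expr):
    import h2o                 # imported lazily; needs a running cluster
    return h2o.rapids(expr)    # backend parses and evaluates the AST
```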

Continue reading