Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone is a suite of tools for working with parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone. Please visit my GitHub and this specific page for more details. Installation note: you must be on a Hadoop machine, as Herringbone needs a Hadoop environment. Prerequisite: Thrift 0.9.1 (you MUST have 0.9.1, as 0.9.3 and […]
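If you want to drive the flatten step from Scala code rather than the shell, here is a rough sketch using sys.process; the herringbone flatten subcommand and the -i/-o flags are assumptions on my part, so verify them against the repository README:

import sys.process._

// Hypothetical HDFS paths and CLI flags: check the exact subcommand
// and options in the Herringbone README before relying on this.
val input  = "/user/me/nested.parquet"
val output = "/user/me/flattened"
val exit   = Seq("herringbone", "flatten", "-i", input, "-o", output).!
if (exit != 0) sys.error(s"herringbone flatten exited with code $exit")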

Continue reading


Reading nested parquet file in Scala and exporting to CSV

Recently we were working on a problem where a parquet compressed file had lots of nested tables, and some of the tables had columns of array type; our objective was to read it and save it to CSV. We wrote a script in Scala which does the following: handles nested parquet compressed content; look […]
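To give a flavor of the approach, here is a minimal Spark sketch; the paths, the "items" array column, and the nested field names are hypothetical, and the actual script in the post handles arbitrary nesting:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Minimal sketch: read nested parquet, explode one array column,
// flatten its struct fields, and write the result as CSV.
val spark = SparkSession.builder.appName("NestedParquetToCsv").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/data/input.parquet")
val flat = df
  .withColumn("item", explode($"items"))                       // one row per array element
  .select($"id", $"item.name".as("name"), $"item.value".as("value"))
flat.write.option("header", "true").csv("/data/output_csv")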

Continue reading


H2O AutoML examples in python and Scala

AutoML is included in H2O version 3.14.0.1 and above. You can learn more about AutoML in the H2O blog here. H2O’s AutoML can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. The user can also use a performance […]
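As a taste of the Scala flow, here is a minimal sketch; the AutoMLBuildSpec field and method names below follow my memory of the 3.14-era Java API and should be treated as assumptions to verify against your H2O version:

import java.util.Date
import water.Key
import ai.h2o.automl.{AutoML, AutoMLBuildSpec}

// Sketch under assumptions: trainFrame is an already-parsed
// water.fvec.Frame, and "response" is a hypothetical target column.
val spec = new AutoMLBuildSpec()
spec.input_spec.training_frame = trainFrame._key
spec.input_spec.response_column = "response"
spec.build_control.stopping_criteria.set_max_runtime_secs(600)  // time budget

val aml = AutoML.makeAutoML(Key.make(), new Date(), spec)
AutoML.startAutoML(aml)
aml.get()              // block until the time budget is exhausted
println(aml.leader())  // best model on the leaderboard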

Continue reading


The state of machine learning in Apache Spark

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark. In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion […]

Continue reading


Spark with H2O using rsparkling and sparklyr in R

You must have installed: sparklyr and rsparkling. Here is the working script:

> library(sparklyr)
> options(rsparkling.sparklingwater.version = "2.1.6")
> Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.6')
> library(rsparkling)
> spark_disconnect(sc)
> sc <- spark_connect(master = "local", version = "2.1.0")

Testing the Spark context:

sc
$master
[1] "local[8]"
$method
[1] "shell"
$app_name
[1] "sparklyr"
$config
$config$sparklyr.cores.local
[1] 8
$config$spark.sql.shuffle.partitions.local
[1] 8
$config$spark.env.SPARK_LOCAL_IP.local
[1] […]

Continue reading


Generating ROC curve in Scala from H2O binary classification models

You can use the following blog to build a binomial classification GLM model: https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/ To collect model metrics for training, use the following:

val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)

Now you can access the model AUC (the _auc object) as below. Note: the _auc object has an array of thresholds, and for each threshold it has fps and tps (use […]
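As a sketch, the threshold-wise tps/fps arrays mentioned above can be turned into ROC points like this; the _auc, _tps, and _fps member names are H2O-3 internals, so verify them on your version:

// trainMetrics comes from the ModelMetricsSupport call above.
// Each threshold contributes one (FPR, TPR) point on the ROC curve.
val auc   = trainMetrics._auc
val maxTp = auc._tps.max
val maxFp = auc._fps.max
val roc   = auc._fps.map(_ / maxFp).zip(auc._tps.map(_ / maxTp))
roc.foreach { case (fpr, tpr) => println(s"$fpr,$tpr") }  // CSV points for plotting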

Continue reading


Using RESTful API to get POJO and MOJO models in H2O

CURL API for listing models: http://<hostname>:<port>/3/Models/
CURL API for listing a specific POJO model: http://<hostname>:<port>/3/Models/model_name
List a specific MOJO model: http://<hostname>:<port>/3/Models/glm_model/mojo

Here is an example:

curl -X GET "http://localhost:54323/3/Models"
curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> NAME_IT
curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> dl_model.java
curl -X GET "http://localhost:54323/3/Models/glm_model/mojo" > myglm_mojo.zip

That's it, enjoy!!
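If you prefer Scala over curl, the MOJO download can be scripted with just the JDK; this sketch assumes the same localhost:54323 endpoint and the glm_model id from the example above:

import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Stream the MOJO zip from H2O's REST endpoint to a local file.
val url = new URL("http://localhost:54323/3/Models/glm_model/mojo")
val in  = url.openStream()
try Files.copy(in, Paths.get("myglm_mojo.zip"), StandardCopyOption.REPLACE_EXISTING)
finally in.close()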

Continue reading


Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high-performance, distributed machine learning algorithms on Spark, using R. Visit the github project: https://github.com/h2oai/rsparkling You must have the following packages installed in your R environment: sparklyr, h2o, rsparkling […]

Continue reading