Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:
https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/

To add cross validation you can do the following:

    def buildGLMModel(train: Frame, valid: Frame, response: String)
                     (implicit h2oContext: H2OContext): GLMModel = {
      import _root_.hex.glm.GLMModel.GLMParameters.Family
      import _root_.hex.glm.GLM
      import _root_.hex.glm.GLMModel.GLMParameters
      val glmParams = new GLMParameters(Family.binomial)
      glmParams._train = train
      glmParams._valid = valid
      glmParams._nfolds = 3 ###### Here is […]



Truncated Bi-Level Optimization

In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”. I vaguely remembered telling the reviewers I would release some code, so I’m finally getting around to it. The idea of bi-level optimization is quite simple. Imagine that you would like to minimize some function. However, that function itself is defined […]



Plotting scoring history from H2O model in python

Once you build a model with H2O, the scoring history can be seen in the model details or the model metrics table. If validation is enabled, the scoring and validation history are also visible. You can see these metrics in the Flow UI; however, if you are using a Python shell, you may want to plot […]



Managing timestamps values with n-fold cross validation

When using n-fold cross-validation, the user can choose a specific column as the fold column. More details on fold columns are available here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_column.html With a fold column, the splits are formed as custom groups based on the numerical or categorical values in that column. This is how the fold column setting is used in […]
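The grouping idea behind a fold column can be sketched in plain Python. This is only an illustration under stated assumptions, not H2O's implementation: the helper `splits_from_fold_column` and the sample `years` column (a year derived from each row's timestamp) are hypothetical. Rows that share a fold value are held out together, and all remaining rows form the training set for that fold.

```python
from collections import defaultdict

def splits_from_fold_column(fold_values):
    """Map each distinct fold value to (train_indices, holdout_indices)."""
    # Group row indices by the value found in the fold column
    groups = defaultdict(list)
    for row_idx, fold_id in enumerate(fold_values):
        groups[fold_id].append(row_idx)

    splits = {}
    for fold_id, holdout_idx in groups.items():
        # Every row whose fold value differs goes into the training set
        train_idx = [i for i, value in enumerate(fold_values) if value != fold_id]
        splits[fold_id] = (train_idx, holdout_idx)
    return splits

# Hypothetical fold column: the year of each row's timestamp
years = [2015, 2015, 2016, 2016, 2016, 2017]
fold_splits = splits_from_fold_column(years)

for fold_id, (train_idx, holdout_idx) in sorted(fold_splits.items()):
    print(fold_id, "train:", train_idx, "holdout:", holdout_idx)
```

Unlike a random split, the number of folds here equals the number of distinct values in the column, so fold sizes can be uneven.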



Cross-validation example with time-series data in R and H2O

What is cross-validation? In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. Learn more at the wiki. When you have time-series data […]
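The k-fold partitioning described above can be sketched in plain Python. This is a minimal illustration, not how H2O builds its folds internally; the helper name `k_fold_splits` and the fixed seed are assumptions made for the example. Each fold serves as the validation set exactly once while the other k − 1 folds are used for training.

```python
import random

def k_fold_splits(n_rows, k, seed=42):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Shuffle so that rows are assigned to folds at random
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin; fold sizes differ by at most one row
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx

splits = list(k_fold_splits(n_rows=10, k=3))
for train_idx, valid_idx in splits:
    print("train rows:", len(train_idx), "validation rows:", len(valid_idx))
```

When the number of rows is not divisible by k, the subsamples are only approximately equal in size. Random shuffling also ignores row order, which is usually not what you want for time-series data.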
