[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Michael Freedman on TimescaleDB and scaling SQL for time-series. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS. In this episode […]

# Cross-validation example with time-series data in R and H2O

What is Cross-validation: In k-fold cross–validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. learn more at wiki.. When you have time-series data […]

# Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Ira Cohen on developing machine learning tools for a broad range of real-time applications. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, […]

# Building self-service tools to monitor high-volume time-series data

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Phil Liu on the evolution of metric monitoring tools and cloud computing. One of the main sources of real-time data processing tools is IT operations. In fact, a previous post I wrote on the re-emergence of real-time, was to a […]

# Redefining power distribution using big data

[A version of this post appears on the O’Reilly Radar blog.] The O’Reilly Data Show Podcast: Erich Nachbar on testing and deploying open source, distributed computing components. When I first hear of a new open source project that might help me solve a problem, the first thing I do is ask around to see if […]

# Graphs, Time-series, Dataviz, and Crowdsourcing at Strata Santa Clara 2014

There are many fantastic talks at Strata and it can be overwhelming to navigate the schedule. I plan to list talks I’m hoping to catch in a series of “time-turner” posts (check this blog on Wed/Thu at 10 a.m.). But for now let me highlight talks from a few categories: Graphs and Network Analysis: Large-scale […]

# How Twitter monitors millions of time-series

[A version of this post appears on the O’Reilly Strata blog.] One of the keys to Twitter’s ability to process 500 millions tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and […]

# Surfacing anomalies and patterns in Machine Data

[A version of this post appears on the O’Reilly Strata blog.] I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data1 problems and I […]

# The re-emergence of Time-series

[A version of this post appeared on the O’Reilly Strata and Radar blogs.] My first job after leaving academia was as a quant1 for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional […]

# Mining Time-series with Trillions of Points: Dynamic Time Warping at scale

Take a similarity measure that’s already well-known to researchers who work with time-series, and devise an algorithm to compute it efficiently at scale. Suddenly intractable problems become tractable, and Big Data mining applications that use the metric are within reach. The classification, clustering, and searching through time series have important applications in many domains. In […]