Architecting and building end-to-end streaming applications

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS. In this episode […]


Structured streaming comes to Apache Spark 2.0

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Michael Armbrust on enabling users to perform streaming analytics, without having to reason about streaming. […]


Stream processing and messaging systems for the IoT age

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show podcast: M.C. Srivas on streaming, enterprise grade systems, the Internet of Things, and data for social good. […]


Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing. [This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.] Many […]


Building self-service tools to monitor high-volume time-series data

[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show Podcast: Phil Liu on the evolution of metric monitoring tools and cloud computing. One of the main sources of real-time data processing tools is IT operations. In fact, a previous post I wrote on the re-emergence of real-time was to a […]


A real-time processing revival

[A version of this post appears on the O’Reilly Radar blog.] Things are moving fast in the stream processing world. There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup), and many conversations with technologists, startup founders, and investors. Certainly, […]


Building Apache Kafka from scratch

[A version of this post originally appeared on the O’Reilly Radar blog.] In this episode of the O’Reilly Data Show Podcast, Jay Kreps talks about data integration, event data, and the Internet of Things. At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, […]


Stream Processing and Mining just got more interesting

[A version of this post appears on the O’Reilly Strata blog.] Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store […]


How Twitter monitors millions of time-series

[A version of this post appears on the O’Reilly Strata blog.] One of the keys to Twitter’s ability to process 500 million tweets daily is a software development process that values monitoring and measurement. A recent post from the company’s Observability team detailed the software stack for monitoring the performance characteristics of software services, and […]


Running batch and long-running, highly available service jobs on the same cluster

[A version of this post appears on the O’Reilly Strata blog.] As organizations increasingly rely on large computing clusters, tools for leveraging and efficiently managing compute resources become critical. Specifically, tools that allow multiple services and frameworks to run on the same cluster can significantly increase utilization and efficiency. Schedulers take into account policies and workloads […]
