Welcome to the May ClickHouse newsletter, which will round up what’s been happening in real-time data warehouses over the last month.
This month, we have recursive CTEs in the 24.4 release, the launch of ClickHouse developer certification, real-time fraud detection at Instacart, and more!
Inside this issue
- Featured community member
- Upcoming events
- 24.4 release
- Become a ClickHouse Certified Developer
- Real-time Fraud Detection at Instacart
- K-Means Clustering with ClickHouse
- Simplified Kubernetes Logging with Fluentbit and ClickHouse
- The New Building Blocks of Observability
- Using ClickHouse for Financial Charts
- Post of the Month
Featured community member
This month's featured community member is Dan Goodman, Co-Founder and CEO of Tangia, a service for hosting interactive live streams.
Dan has been part of the ClickHouse community for at least 18 months and frequently gives the engineering team feedback on both missing features and how existing features can be improved.
Dan writes a blog about distributed systems, where he’s previously written about topics like range partitioning and building a Fly.io scheduler.
A few weeks ago he wrote a blog post titled Finding Trends With Approximate Embedding Clustering. In the post, he explains the importance of approximation techniques when working with big datasets and walks through how to implement the Dynamic K-Means++ algorithm with ClickHouse.
Upcoming events
- Dubai Meetup - May 28th
- AWS Summit Dubai - May 29th
- v24.5 Community Call - May 30th
- San Francisco Meetup - June 4th
- AWS Summit Stockholm - June 4th
- Tokyo Meetup - June 5th
- AWS Summit Madrid - June 5th
- ClickHouse Fundamentals - June 26th & 27th
- AWS Summit D.C. - June 26th
- Amsterdam Meetup - June 27th
- Paris Meetup - July 9th
- New York Meetup - July 9th
24.4 release
The standout feature in the 24.4 release is recursive CTEs (Common Table Expressions), and we made a London Underground-themed example to show you how they work. This release also improves JOIN performance and adds the QUALIFY clause, which filters on the values of window functions.
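For readers new to recursive CTEs, here is a minimal sketch of the idea. It is not the example from the release post: it uses Python's built-in sqlite3 module as a stand-in for ClickHouse, and the connections table, its column names, and the handful of stations are illustrative assumptions rather than a real network map.

```python
import sqlite3

# Toy, hand-entered one-way connections (illustrative only, not a full map).
CONNECTIONS = [
    ("Warren Street", "Oxford Circus"),
    ("Oxford Circus", "Green Park"),
    ("Green Park", "Victoria"),
    ("Victoria", "Pimlico"),
]


def stops_from(start):
    """Return (station, stops_away) pairs reachable from `start`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE connections (from_station TEXT, to_station TEXT)")
    conn.executemany("INSERT INTO connections VALUES (?, ?)", CONNECTIONS)
    rows = conn.execute(
        """
        WITH RECURSIVE reachable(station, stops) AS (
            -- Anchor member: the starting station, zero stops away.
            SELECT ?, 0
            UNION ALL
            -- Recursive member: follow one more connection from each
            -- station found so far.
            SELECT c.to_station, r.stops + 1
            FROM reachable AS r
            JOIN connections AS c ON c.from_station = r.station
        )
        SELECT station, stops FROM reachable ORDER BY stops
        """,
        (start,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for station, stops in stops_from("Oxford Circus"):
        print(f"{station}: {stops} stop(s) away")
```

The toy data has no cycles, so the recursion stops on its own. A real network with two-way connections would need a guard, such as a cap on the number of stops, to avoid looping forever.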
Become a ClickHouse Certified Developer
Rich Raposa recently announced the launch of the official ClickHouse Developer Certification Program, the only certification directly from ClickHouse.
This certification program validates developers’ proficiency in using ClickHouse to build robust, high-performance data solutions. This certification will showcase your mastery of ClickHouse and help you distinguish yourself as a trusted database management and analytics expert.
Learn more about certification
Real-time Fraud Detection at Instacart
Nick Shieh, Shen Zhu, and Xiaobing Xia have written a blog post where they walk us through Yoda, a decision platform service they built at Instacart to detect fraudulent activities and take action quickly. ClickHouse was chosen as the primary real-time datastore for this system because it can both ingest and query large amounts of data in real time. I especially liked the part of the post where they describe how real-time features fed into the service are derived from ClickHouse SQL queries.
K-Means Clustering with ClickHouse
Recently, when helping a user who wanted to compute centroids from vectors held in ClickHouse, we realized that the same solution could be used to implement K-Means clustering. They wanted to do this at scale across potentially billions of data points while ensuring memory could be tightly managed. In this post, we implement K-Means clustering using just SQL and show that it can scale to billions of records while running significantly faster than the same code in scikit-learn.
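As a refresher on what K-Means does, the loop below is a minimal pure-Python sketch of the standard assign-then-update iteration (Lloyd's algorithm). It is not the SQL from the post, and the function names and the choice to pass in the starting centroids explicitly are assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for each point (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(dim) / len(members) for dim in zip(*members))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Repeat assign/update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

Each iteration touches every point once and only needs per-cluster sums and counts for the update, which is the kind of work a GROUP BY expresses naturally.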
Simplified Kubernetes Logging with Fluentbit and ClickHouse
Logging is one of the hot ClickHouse use cases of the moment, so I was excited to come across this blog post by Muthukumaran. Fluentbit is a lightweight logging and metrics processor and forwarder designed for containerized environments. Muthukumaran walks us through the steps to set up a metrics server to monitor resource utilization in Kubernetes and then shows how to configure Fluentbit to get those metrics into ClickHouse.
The New Building Blocks of Observability
This article focuses on what the author calls the three new elements in the observability periodic table: OpenTelemetry, eBPF, and ClickHouse. OpenTelemetry has emerged as the de facto standard for telemetry data, eBPF makes it possible to generate traces and metrics with zero instrumentation, and ClickHouse is used to ingest and query all this data. The article also covers a series of observability startups that are using ClickHouse: Groundcover, SigNoz, and DeepFlow.
Using ClickHouse for Financial Charts
After giving a brief crash course on when (and when not) to use ClickHouse, Adis Nezirović demonstrates how to ingest, query, and visualize financial time-series data. Along the way, he shows how to use the Null table engine to massage data and aggregate states to reduce the amount of data kept around. To conclude, Adis creates a candlestick chart using the Grafana QueryBuilder.
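To illustrate why pre-aggregation shrinks the data behind a candlestick chart, here is a small sketch in plain Python. It is not the article's ClickHouse setup: the to_candles function, the (timestamp, price) tick format, and the one-minute bucket size are all assumptions made for illustration.

```python
def to_candles(ticks, bucket_seconds=60):
    """Reduce time-ordered (unix_ts, price) ticks to one OHLC record per bucket."""
    candles = {}
    for ts, price in ticks:
        # Round the timestamp down to the start of its bucket.
        bucket = ts - ts % bucket_seconds
        candle = candles.get(bucket)
        if candle is None:
            # First tick in the bucket sets all four values.
            candles[bucket] = {
                "open": price,
                "high": price,
                "low": price,
                "close": price,
            }
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # the latest tick seen so far
    return candles


if __name__ == "__main__":
    sample = [(0, 10.0), (20, 12.0), (59, 9.0), (60, 11.0)]
    for start, candle in to_candles(sample).items():
        print(start, candle)
```

However many ticks arrive in a bucket, only four numbers are kept for it, which is the same trade-off as storing aggregate states instead of raw rows.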
Post of the Month
Our favorite post this month was by ludwig, who was impressed by both the speed of ClickHouse queries and the quality of its data compression.