Apr 16, 2025
A Deep Dive Into Ingesting Debezium Events From Kafka With Flink SQL
Over the years, I’ve spoken quite a bit about the use cases for processing Debezium data change events with Apache Flink, such as metadata enrichment, building denormalized data views, and creating data contracts for your CDC streams. One detail I haven’t covered in depth so far is how to actually ingest Debezium change events from a Kafka topic into Flink, in particular via Flink SQL. Several connectors and data formats exist for this, which can make things somewhat confusing at first. So let’s dive into the different options and the considerations around them!
Read More...
Apr 7, 2025
Building a Native Binary for Apache Kafka on macOS
With the help of the GraalVM configuration developed for KIP-974 (Docker Image for GraalVM based Native Kafka Broker), you can easily build a self-contained native binary for Apache Kafka. Read on to learn how you can build a native Kafka executable yourself, one that starts in milliseconds, making it a perfect fit for development and testing purposes.
When I wrote about ahead-of-time class loading and linking in Java 24 recently, I also published the start-up time for Apache Kafka as a native binary for comparison. This was done via Docker, as there’s no pre-built native binary of Kafka available for the operating system I’m running on, macOS. But there is a native Kafka container image, so this is what I chose for the sake of convenience.
Now, running in a container adds a little bit of overhead of course, so it wasn’t a surprise when Thomas Würthinger, lead of the GraalVM project at Oracle, brought up the question of what the value would be when running Kafka natively on macOS. Needless to say, I can’t let this kind of nice nerd snipe pass, so I set out to learn how to build a native Kafka binary on macOS, using GraalVM.
Read More...
Mar 27, 2025
Let's Take a Look at... JEP 483: Ahead-of-Time Class Loading & Linking!
In the "Let’s Take a Look at…!" blog series I am exploring interesting projects, developments and technologies in the data and streaming space. This can be KIPs and FLIPs, open-source projects, services, relevant improvements to Java and the JVM, and more. The idea is to get some hands-on experience, learn about potential use cases and applications, and understand the trade-offs involved. If you think there’s a specific subject I should take a look at, let me know in the comments below.
Update March 28: This post is being discussed on Hacker News 🍊
Java 24 got released last week, and what a meaty release it is: more than twenty JDK Enhancement Proposals (JEPs) have been shipped, including highlights such as compact object headers (JEP 450, I hope to spend some time diving into that one soon), a new class-file API (JEP 484), and more flexible constructor bodies (JEP 492, third preview). One other JEP which might fly a bit under the radar is JEP 483 ("Ahead-of-Time Class Loading & Linking"). It promises to reduce the start-up time of Java applications without requiring any modifications to the application itself. What’s not to like about that? Let’s take a closer look!
Read More...
Mar 18, 2025
The Synchrony Budget
Update March 27: This post is being discussed on Hacker News
For building a system of distributed services, one concept I think is very valuable to keep in mind is what I call the synchrony budget: as much as possible, a service should minimize the number of synchronous requests which it makes to other services.
Read More...
Mar 5, 2025
Let's Take a Look at... KIP-932: Queues for Kafka!
In the "Let’s Take a Look at…!" blog series I am going to explore interesting projects, developments and technologies in the data and streaming space. This can be KIPs and FLIPs, open-source projects, services, and more. The idea is to get some hands-on experience, learn about potential use cases and applications, and understand the trade-offs involved. If you think there’s a specific subject I should take a look at, let me know in the comments below!
That guy above? Yep, that’s me, whenever someone says "Kafka queue". Because, that’s not what Apache Kafka is. At its core, Kafka is a distributed durable event log. Producers write events to a topic, organized in partitions which are distributed amongst the brokers of a Kafka cluster. Consumers, organized in groups, divide the partitions they process amongst themselves, so that each partition of a topic is read by exactly one consumer in the group.
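To make the consumer group behavior concrete, here is a minimal sketch of how partitions could be divided so that each one has exactly one owner within a group. It is not Kafka's actual assignment logic, which is pluggable and more involved; the class name, the `assign` method, and the consumer names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionAssignmentSketch {

    // Distributes partitions 0..partitionCount-1 across the given consumers
    // in round-robin fashion, so every partition has exactly one owner.
    // This is an illustrative assumption, not Kafka's real assignor.
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("A group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers in one group sharing a topic with five partitions
        System.out.println(assign(List.of("consumer-a", "consumer-b"), 5));
        // prints {consumer-a=[0, 2, 4], consumer-b=[1, 3]}
    }
}
```

Note that with more consumers than partitions, some consumers in this sketch end up with an empty list, which mirrors the fact that a partition is never shared within a group.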
Read More...
Nov 27, 2024
Thoughts On Moving Debezium to the Commonhaus Foundation
If you are following the news around Debezium—an open-source platform for Change Data Capture (CDC) for a variety of databases—you may have seen the announcement that the project is in the process of moving to the Commonhaus Foundation. I think this is excellent news for the Debezium project, its community, and open-source CDC at large. In this post I’d like to share some more context on why I am so excited about this development.
Read More...
Nov 16, 2024
Building OpenJDK From Source On macOS
Every now and then, it can come in very handy to build OpenJDK from source yourself, for instance if you want to explore a feature which is under development on a branch for which no builds are published. For some reason I always thought that building OpenJDK is a very complex process, requiring the installation of arcane tool chains etc. But as it turns out, this is actually not true: the project does a great job of documenting what’s needed and only a few steps are necessary to build your very own JDK.
Read More...
Oct 18, 2024
CDC Is a Feature Not a Product
During and after my time as the lead of Debezium, a widely used open-source platform for Change Data Capture (CDC) for a variety of databases, I got repeatedly asked whether I’d be interested in creating a company around CDC. VCs, including well-known household names, did and do reach out to me, pitching this idea.
Read More...
Oct 6, 2024
How I Am Setting Up VMs On Hetzner Cloud
Whenever I need a Linux box for some testing or experimentation, or projects like the One Billion Row Challenge a few months back, my go-to solution is Hetzner Online, a data center operator here in Europe.
Their prices for VMs are unbeatable, starting with 3.92 €/month for two shared vCPUs (either x64 or AArch64), four GB of RAM, and 20 TB of network traffic (these are prices for their German data centers, they vary between regions). Four dedicated cores with 16 GB, e.g. for running a small web server, will cost you 28.55 €/month. Getting a box with similar specs on AWS would set you back a multiple of that, with the (outbound) network cost being the largest chunk. So it’s not a big surprise that more and more people realize the advantages of this offering, most notably Ruby on Rails creator David Heinemeier Hansson, who has been singing the praises of Hetzner’s dedicated servers, but also their VM instances, quite a bit on Twitter lately.
Read More...
Aug 26, 2024
Leader Election With S3 Conditional Writes
Update Aug 30: This article is discussed on Hacker News and lobste.rs.
In distributed systems, for instance when scaling out some workload to multiple compute nodes, it is a common requirement to select a leader for performing a given task: only one of the nodes should process the records from a Kafka topic partition, write to a file system, call a remote API, etc. Otherwise, multiple workers may end up doing the same task twice, overwriting each other’s data, or worse.
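The basic idea of electing a leader through a conditional write can be sketched as follows, assuming a store that rejects a write when the key already exists. The `InMemoryStore` class below is a hypothetical in-memory stand-in, not the S3 API, and the key and node names are made up; the sketch also leaves out lease expiry and failover, which a real implementation would need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConditionalWriteElectionSketch {

    // Hypothetical in-memory stand-in for an object store with
    // "write only if the key does not exist" semantics. NOT the S3 API.
    static class InMemoryStore {
        private final Map<String, String> objects = new ConcurrentHashMap<>();

        // Stores the value only if the key is absent; returns true on success.
        boolean putIfAbsent(String key, String value) {
            return objects.putIfAbsent(key, value) == null;
        }

        String get(String key) {
            return objects.get(key);
        }
    }

    // A node becomes the leader if its conditional write of the lock object
    // succeeds; every later attempt on the same key is rejected.
    static boolean tryBecomeLeader(InMemoryStore store, String lockKey, String nodeId) {
        return store.putIfAbsent(lockKey, nodeId);
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        System.out.println(tryBecomeLeader(store, "leader-lock", "node-1")); // prints true
        System.out.println(tryBecomeLeader(store, "leader-lock", "node-2")); // prints false
        System.out.println(store.get("leader-lock"));                        // prints node-1
    }
}
```

Because the store itself decides which write wins, the nodes do not need to coordinate with each other directly; only the first writer sees a successful result.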
Read More...