Debezium and Friends – Conference Talks 2021
If you love to attend conferences around the world without actually leaving the comfort of your house, 2021 certainly was (and is!) a perfect year for you. Tons of online conferences, many of them available for free, are hosting talks on all kinds of topics, and virtual conference platforms are getting better, too.
As the year is slowly reaching its end, I thought it might be nice to do a quick recap and gather in one place all the talks on Debezium and change data capture (CDC) which I did in 2021. An overarching theme for these talks was to discuss different CDC usage patterns and putting Debezium into the context of solving common data engineering tasks by combining it with other open-source projects such as Infinispan and Apache Pinot. In order to not feel too lonely in front of the screen and make things a bit more exciting, I decided to team up with some amazing friends from the open-source community for the different talks. A big thank you for these phantastic collaborations to Katia Aresti, Kenny Bastani, and Hans-Peter Grahsl!
So without further ado, here are four Debezium talks I had the pleasure to co-present in 2021.
Don’t Fear Outdated Caches – Change Data Capture to the Rescue!
As per an old saying in software engineering, there’s only two hard things: cache invalidation and naming things. Well, turns out the first is solved actually ;)
In this talk at the Bordeaux Java User Group, Katia Aresti from the Infinispan team and I explored how users of an application can benefit from low response times by means of read data models, persisted in distributed caches close to the user. When working with a central database as the authoritative data source — thus receiving all the write requests — these local caches need to be kept up to date, of course. This is where Debezium comes in: any data changes are captured and propagated to the caches via Apache Kafka.
And as if the combination of Kafka, Infinispan and Debezium was not already exciting enough, we also threw some Quarkus and Kafka Streams into the mix, joining the data from multiple Debezium change data topics, allowing to retrieve entire aggregate structures via a single key look-up from the local caches. It’s still on our agenda to describe that archicture in a blog post, so stay tuned for that.
Change Data Streaming Patterns in Distributed Systems
While some folks already might feel something like microservices fatigue, the fact is undeniable that organizing business functionality into multiple, loosely coupled services has been one of the biggest trends in software engineering over the last years.
Of course these services don’t exist in isolation, but they need to exchange data and cooperate; Apache Kafka has become the de-facto standard as the backbone for connecting services, facilitating asynchronous event-driven communication between them. In this joint presentation, my dear friend Hans-Peter Grahsl and I set out to explore what role change data capture can play in such architectures, and which patterns there are for applying CDC to solve common problems related to handling data in microservices architectures. We focused on three patterns in particular, each implemented using log-based CDC via Debezium:
-
The outbox pattern for reliable, eventually consistent data exchange between microservices, without incurring unsafe dual writes or tight coupling
-
The strangler fig pattern for gradually extracting microservices from existing monolithic applications
-
The saga pattern for coordinating long-running business transactions across multiple services, ensuring such activity gets consistently applied or aborted by all participating services
We presented that talk at several conferences, including Kafka Summit Europe, Berlin BuzzWords, and jLove. We also did a variation of the presentation at Flink Forward, discussing how to implement the different CDC patterns using Debezium and Apache Flink. The recording of this session should be published soon, in the meantime you can find the slides here. I also highly recommend to take a look at this blog post by Bilgin Ibryam, in which he discusses these patterns in depth.
Analyzing Real-time Order Deliveries using CDC with Debezium and Pinot
Traditionally, there has been a chasm between operational databases backing enterprise applications (i.e. OLTP systems), and systems meant for ad-hoc analytics use cases, such as queries run by a business analyst in the back office. (OLAP systems). Data would typically be propagated in batches from the former to the latter, resulting in multi-hour delays until the analytics system would be able to run queries on changed production data.
With the current shift to user-facing analytics, we are observing nothing less than a revolution: the ability to serve low-latency analytical queries on large data sets to the end users of an application, based on data that is really fresh (seconds old, rather than hours). Compared to response times and freshness guarantees you’d typically get from earlier generations of data warehouses, this is a game changer.
In this model, Debezium is used to capture all data changes from the operational database and propagate them into the analytics system. Kenny Bastani of StartTree and I spoke about the opportunities and use cases enabled by combining Debezium with Apache Pinot, a realtime distributed OLAP datastore, at the Pinot meet-up. A massive shout-out to Kenny again for putting together an awesome demo, showing how to use Debezium and the outbox pattern for getting the data into Apache Kafka, transform the data and ingest it into Pinot, and do some really cool visualizations via Apache Superset.
Dissecting our Legacy: The Strangler Fig Pattern with Apache Kafka, Debezium and MongoDB
After talking about three different CDC patterns, Hans-Peter and I decided to explore one of the patterns in some more depth and did this talk focusing exclusively on the strangler fig pattern. Existing monolithic applications are a reality in many enterprises, and oftentimes it’s just not feasible to replace them with a microservices architecture all at once in one single migration step.
This is where the strangler fig pattern comes in: it helps you to gradually extract components from a monolith into separate services, relying on CDC for keeping the data stores of the different systems in sync. A routing component, such as Nginx or Envoy Proxy, in front of all the systems sends each incoming request to that system which is in charge of a specific part of the domain at a given point in time during the migration.
This talk (which we presented at MongoDB.Live, Kafka Summit Americas, and VoxxedDays Romania), also contains a demo, we show how to implement the strangler fig pattern using Debezium, gradually moving data from a legacy system’s MySQL database over to the MongoDB instance of a new microservice, which is built using Quarkus.
Bonus: Debezium at the Trino Community Broadcast
This one is not so much a regular conference talk, but more of an informal exchange, so I’m adding it as a bonus here, hoping you may find it interesting too. Brian Olsen and Manfred Moser of Starburst, the company behind Trino, invited Ashhar Hasan, Ayush Chauhan, and me onto their Trino Community Broadcast.
We had a great time talking about Debezium and CDC in the context of Trino and its federated query capabilities, learning a lot from Ashhar and Ayush about their real-world experiences from using these technologies in production.
Learning More
Thanks again to Katia, Kenny, and Hans-Peter for joining the virtual conference stages with me this year! It would not have been half as much fun without you.
If these talks have piqued your interest in open-source change data capture and Debezium, head over to the Debezium website to learn more. You can also find many more examples in the Debezium examples repo on GitHub, and if you look for reports by folks from the community about their experiences using Debezium, take a look at this currated list of blog posts and other resources.