O Kafka, Where Art Thou?
The other day, I came across an interesting thread in the Java subreddit, in which someone asked: "Has anyone attempted to write logs directly to Kafka?". This triggered a number of thoughts and questions for me, in particular: how should an application deal with a failed attempt to send messages to Kafka, for instance due to a network connectivity issue? What do you do when you cannot reach the Kafka broker?
While the Java Kafka producer buffers records internally (primarily for performance reasons) and also supports retries, it cannot do so indefinitely (or can it?), so I asked the Kafka community on Twitter how they would handle this situation:
#Kafka users: how do you deal in producers with brokers not being available? Take a use case like sending logs; you don't want to fail your business process due to Kafka issues here, it's fine do this later on. Large producer buffer and retries? Some extra buffer (e.g. off-heap)?
— Gunnar Morling 🌍 (@gunnarmorling) November 27, 2021
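As a quick refresher on that built-in buffering and retrying, here is a minimal sketch of the relevant producer settings; the bootstrap address, topic name, and all values are purely illustrative assumptions, not recommendations:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // made-up address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for the in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);           // retry transient failures...
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);         // ...but only within this window
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024L);     // in-memory buffer for pending records
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000L);                // how long send() blocks once that buffer is full

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("app-logs", "some log line"), (metadata, exception) -> {
                if (exception != null) {
                    // retries exhausted or a non-retriable error occurred; this is exactly
                    // the point where the strategies discussed below come into play
                }
            });
        }
    }
}
```

Once the delivery timeout has elapsed, or the buffer has been full for longer than max.block.ms, the producer gives up and the application has to decide what to do with the record; that decision is what the replies below revolve around.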
This question spawned a great discussion with tons of insightful replies (thanks a lot to you all!), so I thought I'd try and give an overview of the different comments and arguments. As with everything, the right strategy and solution depend on the specific requirements of the use case at hand; in particular, whether or not you can afford potential inconsistencies between the state of your application's caller, the application's own state, and the state in the Kafka cluster.
As an example, let’s consider an application which exposes a REST API for placing purchase orders. Acknowledging such a request while actually failing to send a Kafka message with the purchase order to some fulfillment system would be pretty bad: the user would believe their order has been received and will be fulfilled eventually, whereas that’s actually not the case.
On the other hand, if the incoming request was safely persisted in a database, and a message is sent to Kafka only for logging purposes, we may be fine accepting this inconsistency between the user's state ("my order has been received"), the application's state (order is stored in the database), and the state in Kafka (log message got lost; not ideal, but not the end of the world either).
Understanding these different semantics helps to put the replies to the question into context. There’s one group of replies along the lines of "buffer indefinitely, block inbound requests until messages are sent", e.g. by Pere Urbón-Bayes:
This would certainly depend on the client used and your app use case. Generally speaking, retry forever and block if the buffer is full, leave time for broker to recover, with backpressure.
— Pere Urbón-Bayes (@purbon) November 28, 2021
if backpressure not possible, cause use case, off-load off-heap for later recovery.
This strategy makes a lot of sense if you cannot afford any inconsistency at all between the states of the different actors: e.g. when you'd rather tell the user that you cannot accept their purchase order right now than risk telling them that you did, whereas actually you didn't.
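Translated into producer settings, that "retry forever and block" stance could look roughly like the following sketch, layered on top of the configuration shown above (again, illustrative values only, not recommendations):

```java
// illustrative only: "props" and "producer" are the ones from the previous sketch
static void configureForBlocking(Properties props) {
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);              // retry "forever"...
    props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);  // ...with no practical delivery deadline
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, Long.MAX_VALUE);            // block send() while the buffer is full
}

// a synchronous send: get() only returns once the broker has acknowledged the record,
// so Kafka unavailability propagates back to the caller as back-pressure
static void sendBlocking(KafkaProducer<String, String> producer, String orderId, String orderJson)
        throws Exception {
    producer.send(new ProducerRecord<>("purchase-orders", orderId, orderJson)).get();
}
```

The flip side is obvious: a prolonged Kafka outage now surfaces as blocked threads and, eventually, failing requests in the calling application.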
What, though, if we don’t want to let the availability of a resource like Apache Kafka — which is used for asynchronous message exchanges to begin with — impact the availability of our own application? Can we somehow buffer requests in a safe way if they cannot be sent to Kafka right away? This would allow the inbound request to complete, while hopefully still avoiding any inconsistencies, at least eventually.
Now, simply buffering requests in memory isn't reliable in any meaningful sense of the word: if the producing application crashes, any unsent messages will be lost, making this approach no different in terms of reliability from working with acks=0, i.e. not waiting for any acknowledgements from the Kafka broker. It may be useful for pure fire-and-forget use cases, where you don't care about delivery guarantees at all, but these tend to be rare.
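For comparison, that fire-and-forget mode boils down to disabling acknowledgements and ignoring the send result (sketch only, building on the producer configuration above):

```java
// no broker acknowledgement at all; whatever still sits in the producer's
// in-memory buffer when the application crashes is simply gone
props.put(ProducerConfig.ACKS_CONFIG, "0");

// no callback and no get(): pure fire-and-forget
producer.send(new ProducerRecord<>("app-logs", "some log line"));
```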
Multiple folks therefore suggested more reliable means of implementing such buffering, e.g. storing unsent messages on disk or using some local, persistent queuing implementation. Some have built solutions using existing open-source components, as Antón Rodriguez and Josh Reagan suggest:
I usually retry forever, specially when reading from another topic because we can apply backpressure. In some cases, discard after some time is ok. Very rarely off-heap with ChronicleQueue or MapsDB. I have considered but never used an external service as DLQ or a Kafka mesh
— Antón (@antonmry) November 27, 2021
Embedded broker and in-vm protocol. Either ActiveMQ or Artemis work great.
— Josh Reagan (@joshdreagan) November 28, 2021
You could even think of having a second Kafka cluster close by (which then may have different availability characteristics than your "primary" cluster, e.g. by running in another availability zone) and keeping everything in sync via tools such as MirrorMaker 2. Others, like Jonathan Santilli, create their own custom solutions by forking existing projects:
I forked Apache Flume and modified it to used a WAL on the disk, so, messages are technically sent, but store on disk, when the Broker is available, the local queue gets flushed, all transparent for the producer.
— Jonathan Santilli (@pachilo) November 27, 2021
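The underlying pattern is roughly the following; note that this is a heavily simplified, hypothetical sketch (file location, record format, and error handling are made up), not how Flume, Greyhound, or any other tool mentioned here actually implements it:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DiskFallbackSender {

    private final KafkaProducer<String, String> producer;
    private final Path walFile = Path.of("/var/lib/myapp/unsent-records.wal"); // hypothetical location

    public DiskFallbackSender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception != null) {
                // delivery failed for good (retries/delivery timeout exhausted); keep the record
                // on local disk instead of losing it -- a real implementation would hand this off
                // to a separate thread rather than doing file I/O in the producer callback
                appendToWal(topic, key, value);
            }
        });
    }

    private synchronized void appendToWal(String topic, String key, String value) {
        try {
            String line = topic + "\t" + key + "\t" + value + System.lineSeparator();
            Files.writeString(walFile, line, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        catch (Exception e) {
            // if even the local disk is unavailable, all that's left is logging and alerting
        }
    }

    // a background task (not shown) would replay the WAL into Kafka once the broker
    // is reachable again and truncate the file afterwards
}
```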
Ready-made wrappers around the producer exist as well, e.g. in Wix's Greyhound Kafka client library, which supports producing via local disk, as per Derek Moore:
I built a proprietary "data refinery" on Kafka for @fanthreesixty and we built ourselves libraries not dissimilar to https://t.co/uQdepGHTzj
— Derek Moore (@derekm00r3) November 27, 2021
But there be dragons! Persisting to disk will actually not be any better if it's, for instance, the ephemeral disk of a Kubernetes pod which gets destroyed after an application crash. But even when using persistent volumes, you may end up with an inherently unreliable solution, as Mic Hussey points out:
These are two contradictory requirements 😉 Sooner or later you will run out of local storage capacity. And unless you are very careful you end up moving from a well understood shared queue to a hacked together implicit queue.
— Mic Hussey (@hussey_mic) November 29, 2021
So it shouldn't come as a surprise that people in this situation have been looking at alternatives, e.g. using DynamoDB or S3 as an intermediary buffer; the team around Natan Silnitsky, working on Greyhound at Wix, is currently exploring this option:
So instead we want to fallback only on failure to send. In addition we want to skip the disk all together, because recovery mechanism when a pod is killed in #Kubernetes is too complex (involves a daemonset...), So we're doing a POC, writing to #DynamoDB/#S3 upon failure 2/3 🧵
— Natan Silnitsky (@NSilnitsky) November 29, 2021
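In code, such a remote fallback could look roughly like this hypothetical sketch using the AWS SDK for Java v2 (bucket name, key scheme, and error handling are all made up, and this is not how Greyhound implements it):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3FallbackSender {

    private final KafkaProducer<String, String> producer;
    private final S3Client s3 = S3Client.create();

    public S3FallbackSender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String key, String value) throws InterruptedException {
        try {
            // try Kafka first; get() surfaces the failure once the producer has given up
            producer.send(new ProducerRecord<>(topic, key, value)).get();
        }
        catch (ExecutionException e) {
            // fall back to S3; a separate process would replay these objects into Kafka later on
            s3.putObject(PutObjectRequest.builder()
                            .bucket("unsent-kafka-records")        // hypothetical bucket
                            .key(topic + "/" + UUID.randomUUID())
                            .build(),
                    RequestBody.fromString(key + "\t" + value, StandardCharsets.UTF_8));
        }
    }
}
```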
At this point it's worth thinking about failure domains, though. If your application sits in its own network and cannot write to Kafka due to some network split, chances are that it cannot reach other services like S3 either. So another option could be to use a datastore close by as a buffer, for instance a replicated database running on the same Kubernetes cluster, or at least in the same availability zone.
If this reminds you of change data capture (CDC) and the outbox pattern, you’re absolutely right; multiple folks made this point as well in the conversation, including Natan Silnitsky and R.J. Lorimer:
Then a dedicated service will listen to #DynamoDB CDC events and produce to #ApacheKafka including payload, key, headers, etc...
— Natan Silnitsky (@NSilnitsky) November 29, 2021
For our event sourcing systems the event being delivered actually is critical. For "pure" cqrs services, Kafka being down is paramount to not having a db so we fail. Other systems use a transactional outbox that persists in the db. If Kafka is down it sits there until ready.
— R.J. Lorimer (He/Him) (@realjenius) November 27, 2021
As Kacper Zielinski tells us, this approach is an example of a staged event-driven architecture, or SEDA for short:
Outbox / SEDA to rescue here. Not sure if any “retry” can guarantee you more than “either you will loose some messages or fail the business logic by eating all resources” :)
— Kacper Zielinski (@xkzielinski) November 27, 2021
In this model, a database serves as the buffer for persisting messages before they are sent to Kafka, which makes for a highly reliable solution, provided the right degree of redundancy is in place, e.g. in the form of replicas. In fact, if your application needs to write to a database anyway, "sending" messages to Kafka via an outbox table and a CDC tool like Debezium is a great way to avoid any inconsistencies between the state in the database and the state in Kafka, without incurring any unsafe dual writes.
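In its simplest form, the outbox pattern boils down to a single local transaction which writes both the business data and the message to be published. Here is a hypothetical sketch using plain JDBC; table and column names are made up for illustration (Debezium's outbox event router expects a similar, configurable structure):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

import javax.sql.DataSource;

public class OrderService {

    private final DataSource dataSource;

    public OrderService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void placeOrder(UUID orderId, String customerId, String orderJson) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement insertOrder = conn.prepareStatement(
                        "INSERT INTO purchase_order (id, customer_id, payload) VALUES (?, ?, ?)");
                 PreparedStatement insertOutbox = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?, ?)")) {

                insertOrder.setObject(1, orderId);
                insertOrder.setString(2, customerId);
                insertOrder.setString(3, orderJson);
                insertOrder.executeUpdate();

                // the "message" is just another row, committed atomically with the order;
                // a CDC tool like Debezium picks it up from the database log and publishes
                // it to Kafka, so there is no dual write and nothing is lost while Kafka is down
                insertOutbox.setObject(1, UUID.randomUUID());
                insertOutbox.setString(2, "purchase_order");
                insertOutbox.setString(3, orderId.toString());
                insertOutbox.setString(4, "OrderPlaced");
                insertOutbox.setString(5, orderJson);
                insertOutbox.executeUpdate();

                conn.commit();
            }
            catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```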
But of course there is a price to pay here, too: end-to-end latency will increase when going through a database first and then on to Kafka, rather than going to Kafka directly. You should also keep in mind that the more moving pieces your solution has, the more complex it will become to operate, and the more subtle and hard-to-understand failure modes and edge cases it will have.
Adam Kotwasinski makes an excellent point: it's not a question of whether things will go wrong, but only when they will go wrong, and you need to have the right policies in place in order to be prepared for that:
For some of my usecases I have a wrapper for Kafka's producer that requires users to _explicitly_ set up policies like retry/backoff/drop. It allows my customers to think about outages (that will happen!) up front instead of being surprised. Each usecase is different.
— Adam Kotwasinski (@AKotwasinski) November 28, 2021
In the end it’s all about trade-offs, probabilities and acceptable risks. For instance, would you receive and acknowledge that purchase order request as long as you can store it in a replicated database in the local availability zone, or would you rather reject it, as long as you cannot safely persist it in a multi-AZ Kafka cluster?
These questions aren't merely technical ones any longer; they require close collaboration with product owners and subject matter experts in the business domain at hand, so as to make the most suitable decisions for your specific situation. Managed services with SLAs guaranteeing high availability can make the deciding difference here, as Vikas Sood mentions:
That’s why we decided to go with a managed offering to avoid disruptions in some critical processes.Some teams still had another decoupling layer (rabbit) between producers and Kafka. Was never a huge fan of that coz it simply meant more points of failure.
— Vikas Sood (@Sood1Vikas) November 27, 2021
Thanks a lot again to everyone chiming in and sharing their experiences, this was highly interesting and insightful! Do you have further ideas and thoughts to share? Let me and the community at large know, either by leaving a comment below or by replying to the thread on Twitter. I'm also curious about your feedback on this format of putting a Twitter discussion into some expanded context; it's the first time I've done it, and I'd be eager to know whether you find it useful or not. Thanks!