Why You Need Dead-Letter Topics in Pub/Sub
An Interview with Mike Schmitz: 66degrees Infrastructure Engineer
Technology is only as perfect as the people who create it, and we all know perfect human beings simply don’t exist. What we have been able to refine at 66degrees is the ability to recognize a problem quickly and implement a solution, which relieves our clients of having to do that work themselves. To highlight some of that cutting-edge problem-solving, we interviewed one of our infrastructure engineers. In this interview, we discuss the complexity of Pub/Sub, explain why dead-letter topics are important, and offer insight into what teamwork looks like at 66degrees.
Tell us about yourself.
I’m a Jr. Infrastructure Engineer here at 66degrees. I work on a lot of different commercial services and engagements, often on the networking side, but all over GCP when it comes to infrastructure.
What is a dead-letter?
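A dead-letter topic, or dead-letter queue (DLQ), is a holding queue for messages that cannot be delivered to their destination queues, for example because the queue does not exist or because it is full. In Pub/Sub, messages are forwarded there when a subscriber fails to acknowledge them within the configured number of delivery attempts.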
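To make the idea concrete, here is a minimal in-memory sketch of dead-lettering. It is a toy model, not the Pub/Sub client API: the `drain` function, the `parse_reading` handler, and the sample payloads are all hypothetical, and the cap of 5 attempts is an assumption chosen for illustration (Pub/Sub lets you configure a maximum between 5 and 100).

```python
from collections import deque

MAX_DELIVERY_ATTEMPTS = 5  # assumed cap for this sketch


def drain(messages, handler, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Deliver each message until the handler succeeds or attempts run out.

    The handler "acks" by returning normally and "nacks" by raising.
    Returns (processed, dead_lettered).
    """
    queue = deque((msg, 1) for msg in messages)  # (payload, attempt number)
    processed, dead_lettered = [], []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if attempt >= max_attempts:
                dead_lettered.append(msg)  # park it for later inspection
            else:
                queue.append((msg, attempt + 1))  # schedule a redelivery
    return processed, dead_lettered


def parse_reading(msg):
    """Hypothetical handler: accepts only numeric payloads."""
    float(msg)


processed, dead = drain(["1.5", "oops", "42"], parse_reading)
print(processed)  # the valid messages, each handled once
print(dead)       # the malformed message, set aside after 5 failed attempts
```

Without the cap, the malformed message would be appended back to the queue forever, which is the failure mode described in the rest of this interview.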
How did you know there was a problem with the messaging queue to begin with?
As I was working through a particular project, I noticed that my web app, which was reading the data, was becoming increasingly unresponsive: a web request could take 30 seconds or more to get a response. A number of requests would come back saying “internal server error” or “cannot reach the database.” That’s when my concern started growing. When I dug into it more, looking at the graphs and metrics available for each of the Pub/Sub topics, I saw that one of the topics I was writing to had thousands of unacknowledged messages.
Graph 1 shows the undelivered messages before the dead-letter topic was added.
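A backlog like the one in Graph 1 can also be caught with a simple automated check rather than by eye. The sketch below is an assumption-laden illustration: the sample counts are made up, the threshold of 1,000 is arbitrary, and `backlog_is_stuck` is a hypothetical helper. In practice the numbers would come from the subscription's undelivered-message metric in Cloud Monitoring (`pubsub.googleapis.com/subscription/num_undelivered_messages`).

```python
def backlog_is_stuck(samples, threshold=1000, min_points=3):
    """Return True if the last `min_points` samples are all above
    `threshold` and never decrease, i.e. the backlog is large and
    not draining."""
    recent = samples[-min_points:]
    if len(recent) < min_points:
        return False  # not enough data to judge
    above = all(count > threshold for count in recent)
    not_draining = all(b >= a for a, b in zip(recent, recent[1:]))
    return above and not_draining


# Made-up backlog readings, oldest first.
healthy = [120, 80, 15, 0, 40]
stuck = [900, 2100, 3500, 3500, 4800]

print(backlog_is_stuck(healthy))  # False
print(backlog_is_stuck(stuck))    # True
```

Requiring several consecutive high readings avoids alerting on a short burst that the subscribers clear on their own.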
Did you immediately know how to fix the issue, or did you have to research for the answer?
As someone who is not a Pub/Sub expert, I had no clue what to do, to be honest. I had to reach out to some of our experts here at 66degrees for resources. Eventually, I figured out there was bad data being processed, but I also knew that this was a problem that needed to be addressed in general.
When you reached out to your colleagues, what solutions did you brainstorm?
When I brought this up with my colleagues, just to bounce around ideas, a lot of them mentioned the dead-letter topic or DLQ and how it could be used. After a little pondering, I reached out to our infrastructure team to work out the best way to set up the DLQ.
While messages were not being delivered, how long was your web service down?
It never went completely down, which is good, but for about two weeks we were fighting these incredibly high latencies. No one ever wants to present a client demo and see internal server errors on the application we’re providing to them! It never looks great. Once we figured out the root cause of the issue, we were able to alleviate the problem.
How important would you say dead-letter topics are in Pub/Sub?
Everything needs one. If it’s being programmatically managed, there needs to be something that captures bad requests and unacknowledged messages. You need them. They’re incredibly crucial. The only reason they weren’t added in the beginning was that we were in the proof-of-concept stage. Once our testing developed, that’s when the cracks started showing and we could see why dead-letter topics are so important.
Once you added the dead-letter topics, did your web service immediately come back online, or was there some downtime?
I manually cleared the whole queue. We’re in the proof-of-concept stage and none of this is production data, so we just wanted to clear the whole thing and set up the dead-letter topic so we could start over and make sure everything was working properly. Within seconds of clearing the queue, all the problems were gone.
Graph 2 shows where the queue was manually cleared, the dead-letter topic is in place, and there are no more bad messages.
How much time and energy could fellow Pub/Subers save by implementing dead-letter topics?
It allows you to debug your applications faster in general, and lets them actually process the data they’re supposed to without being slowed down. It can be hard to notice because things will still work until you eventually overflow and overwhelm the server. It might look like everything’s fine, but a quarter or more of the workload could end up in a never-ending cycle of bad messages. Simply put, it’s so important to have dead-letter topics or queues because they save you and your processors valuable time and energy.
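The "quarter or more" figure is easy to reproduce with back-of-the-envelope arithmetic. The sketch below uses made-up numbers (300 valid messages per round, 100 permanently bad messages, 50 rounds) and a hypothetical `wasted_share` helper; it is a simplified model of redelivery, not a measurement of any real system.

```python
def wasted_share(good_per_round, bad_stuck, rounds, max_attempts=None):
    """Fraction of handler calls spent on messages that can never succeed.

    Each round delivers `good_per_round` new valid messages plus every bad
    message still in the subscription. With `max_attempts` set, a bad
    message is removed (dead-lettered) after that many failed deliveries.
    """
    good_calls = bad_calls = 0
    for round_no in range(1, rounds + 1):
        good_calls += good_per_round
        if max_attempts is None or round_no <= max_attempts:
            bad_calls += bad_stuck  # bad messages are redelivered again
    return bad_calls / (good_calls + bad_calls)


# No dead-letter topic: the bad messages are retried in every round.
print(round(wasted_share(300, 100, rounds=50), 3))                  # 0.25
# Dead-lettered after 5 attempts: the waste stops after round 5.
print(round(wasted_share(300, 100, rounds=50, max_attempts=5), 3))  # 0.032
```

In this model the uncapped share stays at 25% no matter how long the system runs, while the capped share keeps shrinking as more valid work is processed.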
What advice would you give to yourself before you implemented the fix, or to others who don’t know about dead-letter topics yet?
If you don’t know you’re getting bad data, you can’t fix the bad data. I think the key takeaway is to not ignore the things happening in your Pub/Sub. Pub/Sub is deceptively simple, and that can get you into trouble. If you’re not paying attention to all of its parts, you can run into problems you didn’t realize could show up. Use the diagnostic tools you have in Pub/Sub to make sure you’re addressing the things that might become problems.