I recently came across a bug that required me to dive deep into how SQS queues process messages before they are sent to DLQs. At first the whole idea of a fallback seemed straightforward, but after analysing our setup with my team, we came across some interesting insights about SQS that I would like to share with you today.
SQS is an AWS service that allows us to create and manage queues in a really simple way (that’s why it’s called Simple Queue Service). AWS recommends connecting our queues to other special queues designed to receive faulty messages. Such queues are called Dead Letter Queues (DLQs), and the connection is configured through a redrive policy.
SQS provides us with plenty of options for how to fall back. Let’s explore the idea behind them and how to use them efficiently with AWS Lambda functions.
When Lambda functions process SQS messages, the Lambda service polls the queue on their behalf by calling the ReceiveMessage API on SQS. Processing a message can either succeed or fail. If our function attempts to consume a message more than x times unsuccessfully, the message gets marked as faulty and is sent to the DLQ. There it remains available to us for further investigation to find out what went wrong.
The default settings when adding a redrive policy give us 2 retries. In other words, our consumer will pull the message from SQS and attempt to process it 3 times in total. If it fails all 3 times, then that message gets sent to the DLQ.
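To make the "3 attempts in total" rule concrete, here is a minimal sketch that simulates it locally. It is a toy model and not how SQS is implemented: the names Message and deliver_until_done are hypothetical, and the visibility timeout between attempts is ignored.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for an SQS message with its receive counter."""
    body: str
    receive_count: int = 0


def deliver_until_done(message, handler, max_receive_count=3):
    """Hand the message to `handler` until it succeeds or the limit is hit.

    Returns a (outcome, attempts) tuple, where outcome is either
    "processed" or "dead-lettered".
    """
    while message.receive_count < max_receive_count:
        # Every delivery to the consumer counts as one receive.
        message.receive_count += 1
        try:
            handler(message.body)
            return "processed", message.receive_count
        except Exception:
            # In this toy model a failed message is simply retried.
            continue
    # All allowed receives were used up without success.
    return "dead-lettered", message.receive_count


def always_fails(body):
    raise RuntimeError(f"cannot process {body!r}")


outcome, attempts = deliver_until_done(Message("order-42"), always_fails)
print(outcome, attempts)
```

With a handler that always raises, the message is attempted 3 times (1 initial attempt plus 2 retries) before it is reported as dead-lettered.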
Now we might be asking ourselves: which service decides how these retries are handled? Is it Lambda or SQS? Well, technically both… To explain that, let us dive a bit deeper into the idea of max receive counts, visibility timeouts and messages in flight.
Earlier we talked about 2 retries or 3 attempts in total. That is what the maxReceiveCount property is all about (or Maximum receives in the UI). We can configure it when setting up the queue.
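As a rough illustration of where maxReceiveCount lives, the sketch below builds the RedrivePolicy queue attribute, which SQS expects as a JSON string containing deadLetterTargetArn and maxReceiveCount. The helper name build_redrive_attributes, the queue name and the ARN are assumptions made up for this example.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=3):
    """Build the queue attributes that attach a DLQ to a source queue.

    `dlq_arn` is the ARN of an existing dead letter queue.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS takes the policy as a JSON-encoded string, not a nested object.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


# Hypothetical ARN, used only for illustration.
attributes = build_redrive_attributes(
    "arn:aws:sqs:eu-west-1:123456789012:orders-dlq", max_receive_count=3
)
print(attributes["RedrivePolicy"])

# With boto3 installed and AWS credentials configured, the attributes could
# be passed along when creating the source queue, for example:
#   boto3.client("sqs").create_queue(QueueName="orders", Attributes=attributes)
```

Keeping the policy construction in a small pure function makes the value easy to inspect before anything is sent to AWS.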