Trade-offs of Using Job Queues
In a previous blog post I listed some ways job queues can be used to improve the reliability of systems. As with everything, however, job queues come with trade-offs and pitfalls. In this blog post, I explore some of these side effects.
Durability Guarantees
To provide message durability, most job queues offer at-least-once delivery. This means that a job is guaranteed to be delivered at least once, but it might be delivered multiple times. This makes it possible to retry failed jobs, but it can also cause problems when a job triggers a non-idempotent action, such as sending an email or deducting money from a user’s balance.
There are multiple ways to mitigate this problem. The most obvious one is to keep a log of processed job IDs and skip any job whose ID is already in it. This might require transactional guarantees that span the application state and the processed job list. In other cases, ephemeral short-term storage for processed job IDs might be enough. It mostly depends on the application design and the guarantees that the job queue setup provides.
Another way is to architect the application to be able to handle at-least-once delivery. This is often referred to as an idempotent system - one where the same outcome is produced even if the same job is received more than once. This approach is usually highly tailored to the application and is not a generic solution.
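One simple illustration of the difference, under the assumption that a job can carry the desired end state instead of a relative change: the hypothetical handlers below update an in-memory stock count. The function names and the dictionary standing in for application state are made up for this sketch.

```python
from typing import Dict


def apply_increment(stock: Dict[str, int], item: str, delta: int) -> None:
    """Not idempotent: every duplicate delivery changes the result again."""
    stock[item] = stock.get(item, 0) + delta


def apply_set_level(stock: Dict[str, int], item: str, level: int) -> None:
    """Idempotent: applying the same job twice leaves the same state."""
    stock[item] = level
```

Whether a job can be reshaped like this depends entirely on the domain, which is why this approach does not generalize.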
The causes of this problem extend beyond job queues: they are challenges that come with distributed systems and asynchronous communication in general, so discussing in-depth solutions is beyond the scope of this blog post.
Ordering
Another challenge to consider when using a job queue is message ordering. Like duplicate delivery, this is a general challenge of distributed systems and asynchronous communication.
For numerous reasons, jobs might not reach the consumer in the order they were published. One reason might be a retry of a failed job. Another might be a network problem between the job queue and the consumer.
Message ordering issues can be solved by delaying the processing of a job that arrives out of order. Some job queues support submitting a job with a delay, after which the queue makes the job available to the consumer. Another approach is to save the out-of-order job in persistent storage until it can be processed.
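The second approach can be sketched as a small reorder buffer. This sketch assumes that the publisher attaches a gap-free sequence number to each job, which is not something every job queue provides. The class name `ReorderBuffer` is hypothetical, and the in-memory dictionary stands in for the persistent storage a real system would need.

```python
from typing import Any, Callable, Dict


class ReorderBuffer:
    """Hold back jobs that arrive early and release them in sequence order."""

    def __init__(self, process: Callable[[Any], None], first_seq: int = 0) -> None:
        self._process = process
        self._next_seq = first_seq
        # Stand-in for persistent storage of jobs that arrived too early.
        self._pending: Dict[int, Any] = {}

    def receive(self, seq: int, payload: Any) -> None:
        if seq < self._next_seq:
            # Already processed; treat as a duplicate delivery and ignore it.
            return
        self._pending[seq] = payload
        # Release every job that is now next in line.
        while self._next_seq in self._pending:
            self._process(self._pending.pop(self._next_seq))
            self._next_seq += 1
```

For example, if job 1 arrives before job 0, it is held back until job 0 arrives, and then both are processed in order. A job that never arrives would block everything behind it, so a real implementation would also need a timeout or a way to skip a missing sequence number.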
Ordering of jobs is not a problem for some systems. Some other systems can be engineered to not care about ordering. The rest can use the approaches described above.
Faulty Jobs
Sometimes a faulty job can appear in the system. This happens for a myriad of reasons: buggy serialization, misalignment between the producing and consuming parts of the system, etc. This can become a big problem when the retry mechanism has been poorly implemented. If even a small number of faulty jobs end up being retried infinitely, they can prevent all the other jobs from being processed. This problem is often mitigated by assigning each job a retry counter, incrementing it whenever a retry occurs, and limiting the number of retries per job. This approach can be combined with a “dead letter” queue, to which faulty jobs are moved once their retries have been exhausted. That makes it possible to retry those jobs later, after fixing whatever problem they had.
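A minimal sketch of a retry counter combined with a dead letter queue is shown below. It uses in-memory deques in place of a real job queue, and the names `Job`, `drain`, and `MAX_RETRIES` are hypothetical. Many job queues provide retry limits and dead letter queues as built-in features, in which case that support is preferable to hand-written logic like this.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque

MAX_RETRIES = 3  # hypothetical limit; tune per application


@dataclass
class Job:
    job_id: str
    payload: Any
    retries: int = 0  # incremented each time the job is retried


def drain(
    queue: Deque[Job],
    dead_letters: Deque[Job],
    handler: Callable[[Job], None],
) -> None:
    """Process jobs until the queue is empty, bounding retries per job."""
    while queue:
        job = queue.popleft()
        try:
            handler(job)
        except Exception:
            if job.retries >= MAX_RETRIES:
                # Retries exhausted: park the job for later inspection.
                dead_letters.append(job)
            else:
                job.retries += 1
                # Re-enqueue at the back so other jobs are not blocked.
                queue.append(job)
```

With this limit, a job that always fails is attempted four times (the first attempt plus three retries) and then moved aside, while healthy jobs behind it continue to be processed.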
Measuring Job Processing Time
When all processing happens synchronously, the time it takes to process a request is usually quite easy to measure. When a job queue gets involved, measuring job processing times becomes trickier. One way to measure processing time is to record in the job payload the time at which the request entered the acceptance layer, and to use that timestamp to calculate the processing time once the job has been processed. In big distributed systems, this could cause problems when the clocks of different servers are not in sync.
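The timestamp approach might look roughly like the sketch below. The payload field `accepted_at` and both function names are hypothetical. The end-to-end figure relies on wall-clock time from two possibly different machines, so it is only as accurate as their clock synchronization. The handler duration is measured with a monotonic clock on the consumer alone and is not affected by clock skew.

```python
import time
from typing import Any, Callable, Dict, Tuple


def make_job(payload: Any) -> Dict[str, Any]:
    """Acceptance layer: stamp the job with the wall-clock acceptance time."""
    return {"payload": payload, "accepted_at": time.time()}


def process_with_timing(
    job: Dict[str, Any], handler: Callable[[Any], None]
) -> Tuple[float, float]:
    """Run the handler and return (end_to_end_seconds, handler_seconds)."""
    started = time.monotonic()
    handler(job["payload"])
    # Local duration only: immune to clock differences between servers.
    handler_seconds = time.monotonic() - started
    # Includes queue wait and network time; subject to clock skew.
    end_to_end_seconds = time.time() - job["accepted_at"]
    return end_to_end_seconds, handler_seconds
```

Reporting the two numbers separately helps distinguish a slow handler from time spent waiting in the queue.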
Furthermore, when a job queue is involved, the time to process a job can also be affected by network problems or job queue saturation.
Of course, it is not impossible to measure the processing time for a job, but it is more complicated than doing it in a synchronous system.
Conclusion
All in all, most trade-offs that come with job queues are similar to ones faced in distributed systems and systems utilizing asynchronous communication. These challenges are quite common and thus have well-known solutions.