That's the challenge of distributed systems :) It really boils down to how you want failures to be handled.
If you ack before processing, and then you crash, those messages are lost (assuming you can't recover from the crash and you are not using something like a two-phase commit).
If you ack after processing, you may fail after the messages have been processed but before you've been able to ack them. This leads to duplicates, in which case you better hope your work units are idempotent. If they are not, you can always keep a separate table of message IDs that have been processed, and check against it.
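A minimal sketch of such a dedup check, assuming a hypothetical processed_messages table keyed by message ID:

    -- Hypothetical dedup table: remembers every message ID already handled.
    CREATE TABLE processed_messages (
        message_id   text PRIMARY KEY,
        processed_at timestamptz NOT NULL DEFAULT now()
    );

    -- As part of processing, try to record the message ID. If the row already
    -- exists the insert is a no-op, which tells you this is a redelivery.
    INSERT INTO processed_messages (message_id)
    VALUES ('msg-123')
    ON CONFLICT (message_id) DO NOTHING;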
Either way, it's hard, complex and there are thousands of intermediate failure cases you have to think about. And for each possible solution (2pc, separate table of message IDs for idempotency, etc) you bring more complexity and problems to the table.
Well, SQS has machinery that deals with this (in-flight messages, visibility timeouts) "out of the box". Similar functionality needs to be handcrafted when using a DB as a queue.
To be clear, it is not that the SKIP LOCKED solution is invalid; it is just that there are scenarios where it is not sufficient.
You'd have the same problem with SQS, wouldn't you? The act of dequeueing does not guarantee that the process that received a message will not fail to perform it.
If you want a reliable system along those lines then you need to use SKIP LOCKED to SELECT one row to lock, then process it, and then DELETE the row. If your process dies, the lock will be released. You still have a new flavor of the same problem: you might process a message twice, because the process might die in between completing the processing and deleting the row. You could add complexity: first use SKIP LOCKED to SELECT one row, UPDATE it to mark it in-progress, and lock the row; then later, if the process dies, another can go check whether the job was performed (then clean up the garbage) or not (pick up and perform the job) -- a two-phase commit, essentially.
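As a rough sketch of that "mark in-progress" variant (the jobs table and its columns are made up for illustration):

    -- Claim one queued job and mark it in-progress; SKIP LOCKED makes
    -- concurrent workers skip the row instead of waiting for the lock.
    UPDATE jobs
    SET status = 'in_progress', claimed_at = now()
    WHERE id = (
        SELECT id
        FROM jobs
        WHERE status = 'queued'
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;

    -- A separate sweep can later look at jobs stuck in 'in_progress' (dead worker),
    -- verify whether the work actually happened, and then clean up or re-queue.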
Factor out PG, and you'll see that the problem is similar no matter the implementation.
> you might process a message twice because the process might die in between completing processing and deleting the row
The very handy thing about the setup described is that your data tables are part of the same MVCC world-state as your message queue. So you do all the work for the job in the context of the same MVCC transaction that is holding the job locked; and anything that causes the job to fail will fail the entire transaction, and thus roll back any changes that the job's operation made to the data.
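A hedged sketch of what that looks like, with job and data tables invented purely for illustration:

    BEGIN;

    -- Claim one job; other workers skip it rather than block on it.
    SELECT id, payload
    FROM jobs
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

    -- The job's actual work, against ordinary tables in the same database.
    UPDATE accounts SET balance = balance + 100 WHERE id = 42;

    -- Dequeue the job in the same transaction.
    DELETE FROM jobs WHERE id = 7;

    COMMIT;  -- any failure before this point rolls back both the data change and the dequeue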
With SQS, the act of dequeueing makes the message invisible to other consumers for a predefined time period (the visibility timeout). The consumer can ack the message once the processing is completed, resulting in the message being deleted. If the consumer fails to do so, the message will eventually become eligible to be processed by another consumer.
Essentially, a Postgres SKIP LOCKED worker queue DELETEs an item from a worker queue table, does the relevant work, and, if the work completes OK, commits the deletion.
Grabbing the queue item FOR UPDATE takes an exclusive lock on it, so it cannot then be grabbed FOR UPDATE by any other query; the SKIP LOCKED bit means other workers simply skip over the locked row instead of blocking on it.
It's pretty robust and works fine for servicing multiple workers.
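In rough outline it's something like this (schema invented for the example):

    BEGIN;

    -- Grab and delete one item in a single step; the deletion only becomes
    -- visible to others at COMMIT, and SKIP LOCKED keeps other workers from
    -- waiting on the row in the meantime.
    WITH item AS (
        SELECT id
        FROM work_queue
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    DELETE FROM work_queue
    WHERE id IN (SELECT id FROM item)
    RETURNING id, payload;

    -- ... do the relevant work here ...

    COMMIT;  -- commits the deletion only if the work completed OK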
Delaying the commit is the standard approach with this. SKIP LOCKED was created specifically to avoid the throughput issues of locked rows (and has similar implementations in other RDBMS).
If you don't want to keep the transaction open, then you can just go back to updating a column containing the message status, which avoids keeping a transaction open but might need a background process to check for stalled consumers.
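That background check could be as small as a periodic query along these lines (the status column, timestamp, and interval are assumptions):

    -- Re-queue jobs that were claimed but whose consumer presumably died.
    UPDATE jobs
    SET status = 'queued', claimed_at = NULL
    WHERE status = 'in_progress'
      AND claimed_at < now() - interval '5 minutes';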
Of course, one could delay the commit until all processing is completed, but then reasoning about the queue throughput becomes tricky.