
A long time ago, as a new-ish developer, I was building a system that needed to take inputs, then run "pass/fail/wait and try again later" until timeout or completion. This wasn't mission-critical stuff, mind you, so a lost message would annoy someone but not cause any actual harm.

As I was figuring out how to set up a datastore, query it for running workflows, and all that jazz, I happened upon an interesting SQS feature: Post with Delay.

And so, the system has no database. Instead, when new work arrives, it posts the details of the work to be done to SQS. All hosts in the fleet poll SQS for messages. When a host receives one, it runs the checks, and if the process isn't complete it reposts the message with a 5-minute delay. In 5 minutes, a host in the fleet will receive the message and try again. The process continues as long as it needs to.
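
A minimal sketch of that loop, assuming boto3 and a hypothetical queue URL and is_complete check (the real pass/fail/wait logic isn't shown):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

    def submit(work):
        # New work enters the system as a plain SQS message; there is no database row.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(work))

    def is_complete(work):
        # Placeholder for the real pass/fail/wait-and-try-again check.
        return False

    def poll_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            work = json.loads(msg["Body"])
            if not is_complete(work):
                # Not done yet: repost the same payload with a 5-minute delay
                # so another host picks it up later.
                sqs.send_message(QueueUrl=QUEUE_URL,
                                 MessageBody=msg["Body"],
                                 DelaySeconds=300)
            # Either way, remove the message we just handled.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

Note the ordering: the repost happens before the delete, which is exactly the non-atomic window discussed in the replies below.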

Looking back, part of me now is horrified at this design. But: that system now has thousands of users and continues to scale really well. Data loss is very rare. Costs are low. No datastore to manage. SQS is just really darned neat because it can do things like that.




The biggest gotcha in a design like this, IMHO, is that you can't post and delete atomically. You may repost the work into the queue, then fail to delete the original message, and the work stacks up.

Depending on the workload, that could be no big deal or very expensive. Treating a queue as a database, particularly a queue that can't participate in XA transactions, can get you in trouble quickly.


With a realistic (that is, not 100% reliable) queue you can only have "at most once" or "at least once" delivery anyway; "exactly once" can't be guaranteed.

So duplicate messages should be expected and handled as normal, e.g. by deduplicating within a reasonable window and/or by making the operations idempotent.


Yes, it depends on the workload. Idempotency is almost always a good idea, but sometimes the operation itself is very expensive in terms of time, resources, and/or money. I have also seen people try to update the message when writing it back (with checkpoint information and the like) for long-running processes. A slew of issues, including at-least-once delivery, can cause workflow bifurcation. Deduplication via FIFO _can_ help mitigate this, but it has a time window that needs to be accounted for. Once you start managing your own deduplication, I'd say you've moved past trying to go databaseless.
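
For reference, SQS FIFO queues deduplicate on a MessageDeduplicationId within a 5-minute window. Roughly, as a sketch with a hypothetical queue name:

    import hashlib
    import json
    import boto3

    sqs = boto3.client("sqs")
    FIFO_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue.fifo"  # hypothetical

    def submit_deduplicated(work, workflow_id):
        body = json.dumps(work, sort_keys=True)
        sqs.send_message(
            QueueUrl=FIFO_QUEUE_URL,
            MessageBody=body,
            MessageGroupId=workflow_id,  # messages in one group are delivered in order
            # Re-sends with the same deduplication ID are silently dropped,
            # but only within the 5-minute deduplication window.
            MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
        )

One caveat for the repost-with-delay pattern specifically: FIFO queues don't support per-message DelaySeconds, only a queue-level delay, so switching to FIFO for dedup changes how the retry delay has to be implemented.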


But you could adjust the visibility timeout of the message you received so that it reappears later in the queue itself, rather than reposting it.
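
With boto3 that would look roughly like this (a sketch, assuming the same hypothetical queue URL as above and a 5-minute retry):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

    def retry_later(receipt_handle, delay_seconds=300):
        # Push the received message's visibility timeout out by delay_seconds;
        # SQS redelivers the same message once the timeout expires, so there is
        # no separate post + delete to keep in sync.
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=delay_seconds,
        )

The trade-off is that the message stays in flight the whole time, which is where the per-queue in-flight limit mentioned below comes in.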


For this specific use case, as described, this sounds like it could be a very good approach.


One thing to be aware of with this approach, though, is that you can only have 120k in-flight messages per queue.


Why are you horrified at a design that works well, scales well, is resilient enough for its use case, and is low cost? The whole point of an engineering design process is to find designs that meet these types of requirements. Honestly, this sounds like the perfect solution for what you're trying to accomplish.


Because looking at it now, something feels deeply wrong about it, haha. Honestly, if I'd used a database it probably would have opened up a few more options for future work.

I can't do any analytics about how long things typically take, who my biggest users are, etc. I mean, I could, but I'd have to add a datastore for that.

Adding new details to the system's message parameters requires very careful work to keep every change backwards and forwards compatible, so that mid-deployment we don't have new messages being pushed that old hosts can't process, or new hosts seeing old messages they don't understand. That's good practice generally, but with this design it's absolutely mission-critical to get right.

Also, a dropped message is invisible. SQS has redrive, sure, and that helps, but if there were a bug, some edge case where the system quietly stopped processing something, that work would just stop and we'd never know. If the entries were in a datastore, we'd see "Hey, this one didn't finish and I haven't worked on it lately, what gives?"



