
A long time ago, as a new-ish developer, I was building a system that needed to take inputs, then run "pass/fail/wait and try again later" until timeout or completion. This wasn't mission-critical stuff, mind you, so a lost message would annoy someone but not cause any actual harm.

As I was figuring out how to set up a datastore, query it for running workflows, and all that jazz, I happened upon an interesting SQS feature: Post with Delay.

And so, the system has no database. Instead, when new work arrives, it posts the details of the work to be done to SQS. All hosts in the fleet poll SQS for messages. When a host receives one, it runs the checks, and if the process isn't complete it reposts the message with a 5-minute delay. In 5 minutes, a host in the fleet will receive the message and try again. The process continues as long as it needs to.
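
A minimal sketch of that loop, assuming boto3 and a hypothetical queue URL and is_complete check (the real pass/fail/wait logic isn't shown):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

    def submit(work):
        # New work enters the system as a plain SQS message; there is no database row.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(work))

    def is_complete(work):
        # Placeholder for the real pass/fail/wait-and-try-again check.
        return False

    def poll_once():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            work = json.loads(msg["Body"])
            if not is_complete(work):
                # Not done yet: repost the same payload with a 5-minute delay
                # so another host picks it up later.
                sqs.send_message(QueueUrl=QUEUE_URL,
                                 MessageBody=msg["Body"],
                                 DelaySeconds=300)
            # Either way, remove the message we just handled.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

Note the ordering: the repost happens before the delete, which is exactly the non-atomic window discussed in the replies below.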

Looking back, part of me now is horrified at this design. But: that system now has thousands of users and continues to scale really well. Data loss is very rare. Costs are low. No datastore to manage. SQS is just really darned neat because it can do things like that.




The biggest gotcha in a design like this, IMHO, is that you can't post and delete atomically. You may repost the work into the queue, then fail to delete the original message, and the work stacks up.

Depending on the workload, that could be no big deal or very expensive. Treating a queue as a database, particularly a queue that can't participate in XA transactions, can get you in trouble quickly.


With a realistic (that is, not 100% reliable) queue you can only have "at most once" or "at least once" delivery anyway; "exactly once" can't be guaranteed.

So duplicate messages should be expected and handled as normal, e.g. by deduplicating within a reasonable window and/or by making the operations idempotent.


Yes, it depends on the workload. Idempotency is almost always a good idea, but sometimes the operation itself is very expensive in terms of time, resources, and/or money. I have also seen people try to update the message when writing it back (with checkpoint information and the like) for long-running processes. A slew of issues, including at-least-once delivery, can cause workflow bifurcation. Deduplication via FIFO _can_ help mitigate this, but it has a time window that needs to be accounted for. Once you start managing your own deduplication, I'd say you've moved past trying to go databaseless.
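
For reference, SQS FIFO queues deduplicate on a MessageDeduplicationId within a 5-minute window. Roughly, as a sketch with a hypothetical queue name:

    import hashlib
    import json
    import boto3

    sqs = boto3.client("sqs")
    FIFO_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue.fifo"  # hypothetical

    def submit_deduplicated(work, workflow_id):
        body = json.dumps(work, sort_keys=True)
        sqs.send_message(
            QueueUrl=FIFO_QUEUE_URL,
            MessageBody=body,
            MessageGroupId=workflow_id,  # messages in one group are delivered in order
            # Re-sends with the same deduplication ID are silently dropped,
            # but only within the 5-minute deduplication window.
            MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
        )

One caveat for the repost-with-delay pattern specifically: FIFO queues don't support per-message DelaySeconds, only a queue-level delay, so switching to FIFO for dedup changes how the retry delay has to be implemented.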


But you could adjust the visibility timeout of the message you received so that it reappears later in the queue itself, rather than reposting it.
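
With boto3 that would look roughly like this (a sketch, assuming the same hypothetical queue URL as above and a 5-minute retry):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

    def retry_later(receipt_handle, delay_seconds=300):
        # Push the received message's visibility timeout out by delay_seconds;
        # SQS redelivers the same message once the timeout expires, so there is
        # no separate post + delete to keep in sync.
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=delay_seconds,
        )

The trade-off is that the message stays in flight the whole time, which is where the per-queue in-flight limit mentioned below comes in.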


For this specific use case, as described, this sounds like it could be a very good approach.


One thing to be aware of with this approach, though, is that you can only have 120k in-flight messages per queue.


Why are you horrified at a design that works well, scales well, is resilient enough for its use case, and is low cost? The whole point of an engineering design process is to find designs that meet these types of requirements. Honestly, this sounds like the perfect solution for what you're trying to accomplish.


Because looking at it now, something feels deeply wrong about it, haha. Honestly, if I'd used a database it probably would have opened up a few more options for future work.

I can't do any analytics about how long things typically take, who my biggest users are, etc. I mean, I could, but I'd have to add a datastore for that.

Adding new details to the system's message parameters requires very careful work to keep every change backwards and forwards compatible, so that mid-deployment we don't have new messages being pushed that old hosts can't process, or new hosts seeing old messages they don't understand. That's good practice generally, but with this design it's absolutely mission-critical to get right.

Also, a dropped message is invisible. SQS has redrive, sure, and that helps, but if there were a bug, some edge case where the system quietly stopped processing something, that work would just stop and we'd never know. If the entries were in a datastore, we'd see "Hey, this one didn't finish and I haven't worked on it lately, what gives?"



