
> It’s almost impossible for a human to figure out how to handle UNKNOWN correctly. What does UNKNOWN really mean? Should the code retry? If so, how many times? How long should it wait between retries? It gets even worse when code has side-effects. Inside of a budgeting application running on a single machine, withdrawing money from an account is easy, as shown in the following example.

> Figuring out how to handle the UNKNOWN error type is one reason why, in distributed engineering, things are not always as they seem.

How are such UNKNOWN errors handled in practice? The article doesn’t talk much about it.


In my experience designing systems like these: there are no hard and fast rules, and you handle them in a way that’s been decided on a case-by-case basis for each system. Some messages may need to keep retrying with exponential backoff indefinitely. Others may need to retry only once and then send an email to someone, because two failures in five minutes is a critical failure. You also have to design things so that if one failure occurs, maybe it stops the entire process, or maybe other messages can still go through and you just log the failure.
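
To make the first case concrete, here's a minimal Python sketch of retrying with exponential backoff and alerting once the attempts are exhausted; TransientError and alert_oncall are made-up stand-ins for whatever your system actually uses:

    import random
    import time

    # Hypothetical stand-ins for your RPC layer's retryable error and
    # your alerting hook.
    class TransientError(Exception):
        pass

    def alert_oncall(msg):
        print("ALERT:", msg)  # a real system would page or email someone

    def send_with_retries(send, max_attempts=5, base_delay=0.5):
        for attempt in range(max_attempts):
            try:
                return send()  # send() is whatever remote call can fail
            except TransientError as err:
                last_err = err
                # full-jitter backoff: sleep somewhere in [0, base * 2^attempt)
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))
        alert_oncall(f"gave up after {max_attempts} attempts: {last_err}")
        raise last_err

The "retry once, then email someone" variant is the same loop with max_attempts=2 and the alert treated as the critical-failure path.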

It all comes down to the rules of your business and how critical these systems are. Maybe unknown means “failure” or maybe unknown means “someone should get an alert about this and check it out”.

I think it’s hard because there is no highly visible “crash” like in a non-distributed system, where an unexpected exception shuts the entire program down. Failures often happen silently, and it’s difficult to tell where or why something failed. So you have to design each system with that in mind and figure out how each piece needs to deal with uncertainty.


For what it’s worth, in my experience it’s very effective to add this information to the error context: whether it’s a permanent failure (e.g. validation) or a retryable error. If it’s retryable, also add to the error context when it should be retried.

This allows you to handle these errors appropriately without having to treat everything on a case-by-case basis.
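
For example (the field names here are just illustrative, not from any particular library), the error type could carry that classification explicitly:

    import time
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RemoteError(Exception):
        message: str
        retryable: bool                      # False for e.g. validation failures
        retry_after: Optional[float] = None  # seconds until a retry makes sense

    def handle(err, retry):
        if not err.retryable:
            raise err             # permanent failure: surface it immediately
        time.sleep(err.retry_after or 1.0)
        return retry()            # retryable: wait as hinted, then try again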


> it’s very effective to add this information to the error context: whether it’s a permanent failure (e.g. validation) or a retryable error

The discussion was specifically about UNKNOWN errors, i.e. you sent a message but never got a reply back. You don't know whether it was a validation failure or a temporary hiccup. For all you know, it's possible the message was received and processed correctly but the response never made it back.

How to handle these unknowns is always going to be case by case. Some combination of retrying and giving up works for most cases, but there is no silver bullet, and usually you have to think hard about the consequences of 1) retrying, and 2) giving up and assuming the request failed even though it actually (silently) succeeded.


There's not really an actual UNKNOWN. As the article illustrated, there are distinct steps in the client-side processing, so we do know with certainty which part of the client code emitted the error. We might not know anything about the remote server, but the point is that the error is known at the client side. It could be failure to create a socket, failure to send the message, failure to receive back any reply within the stipulated deadline, the remote server returning an error, etc.

In practice the easiest thing is simply to propagate the error onwards, without affecting other independent requests (failure domains), and let something intelligent handle the error. It could very well be a human sitting at a computer seeing an internal error message, who can then decide to retry or not.

Also, oftentimes it's acceptable to just log the error and carry on. I don't know of anything that promises a 100% error-free SLA. With careful engineering even a 99.999% success rate is achievable, but I don't think anyone would actually promise 100%.
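
A rough sketch of that per-request isolation plus log-and-carry-on (the shape of the requests is made up):

    import logging

    log = logging.getLogger(__name__)

    def process_batch(requests, handle_one):
        # Each request is its own failure domain: a failure is logged and
        # recorded, but the other independent requests still go through.
        results = {}
        for req in requests:
            try:
                results[req.id] = handle_one(req)
            except Exception as err:
                log.error("request %s failed: %s", req.id, err)
                results[req.id] = err  # propagate onwards; let the caller
                                       # (or a human) decide what to do next
        return results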


> failure to receive back any reply within the stipulated deadline

If you get this error back, the client doesn't know whether the server actually processed the request, so knowing where the client failed isn't actually useful for knowing the state of the request and what needs to happen next.

To handle something like this, you need a resilient design around client-server communications (e.g. assuming retries on the client side and idempotent behavior on the server side).
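
A rough sketch of the server half of that, with an in-memory dict standing in for the durable store (and the atomic check-then-store) a real system would need:

    # Hypothetical handler: the client attaches a unique idempotency key to
    # each logical request and reuses it on retries, so a retry of a request
    # that actually succeeded returns the stored result instead of running
    # the side effect twice.
    completed = {}  # idempotency_key -> result; a real system persists this

    def withdraw(idempotency_key, account, amount):
        if idempotency_key in completed:
            return completed[idempotency_key]
        result = account.debit(amount)  # the side effect runs at most once
        completed[idempotency_key] = result
        return result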

Immediately erroring out on the client usually makes for a poor user experience and might lead to inconsistent behavior.


> so knowing where the client failed isn't actually useful for knowing the state of the request and what needs to happen next.

Correct. Which is why the client should just error out, stop processing, and return the error to the user, who has more context and knows whether or not a retry is necessary or desirable.

My argument is that your "resilient design around client-server communications" isn't necessary in the majority of cases, and is often unwarranted over-engineering. A poor user experience is fine if it doesn't happen very often and goes away upon a retry. Even banks do that. It's fine. No one will be offended if your app shows an internal error message once a month (a five-minute outage in a month is still more than 99.9% availability).


> the client should just error out, stop processing, and return the error to the user, who has more context and knows whether or not a retry is necessary or desirable... Even banks do that. It's fine

If a user at a bank tries to transfer $10,000, gets an internal server error, retries because their balance hasn't updated (banks don't process these in real time), and checks the next day to find $20,000 gone, that's a big problem. The user can't be responsible for this; you need something more.

You're not wrong that this isn't necessary in the majority of cases (updating your email address?) - but the majority of cases don't need distributed systems. By the time you're talking about distributed systems (which is the focus of this article and discussion), you absolutely do need this, in the vast majority of cases.


Eh, in that case you're asking for trouble by requiring the server call to succeed. If I had to write an API like that, I would definitely go with a randomized token per logical request for dedupe on retries.

If it's a user-facing transaction, it should be a transaction resource that's created and then the server does the retries imo, with the user able to view the status.
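
Something like this sketch, say (endpoint names and fields are made up):

    def create_transfer(client, src, dst, amount):
        # Creating the transfer returns immediately with an id; the server
        # owns the retries from here on.
        resp = client.post("/transfers", json={
            "from": src, "to": dst, "amount": amount,
        })
        return resp.json()["transfer_id"]

    def transfer_status(client, transfer_id):
        # The user (or client) polls: "pending", "succeeded", or "failed".
        return client.get(f"/transfers/{transfer_id}").json()["status"]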


Exactly: given a timeout, you can't assume the server succeeded (or failed!). That was my point; we're definitely agreeing :) Usually you'd generate some kind of idempotency key (a randomized token, as you said) and retry with that.

You definitely can't just bubble that up to the user and assume things are fine.
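
The client half of that idea might look like this sketch (the endpoint and client object are hypothetical; the key is generated once per logical request, not per attempt):

    import uuid

    def withdraw_with_retries(client, account_id, amount, attempts=3):
        key = str(uuid.uuid4())  # one key per logical request, reused on retries
        for _ in range(attempts):
            try:
                return client.post("/withdrawals", json={
                    "idempotency_key": key,
                    "account": account_id,
                    "amount": amount,
                })
            except TimeoutError:
                continue  # safe to retry: the server dedupes on the key
        raise RuntimeError("withdrawal outcome still unknown after retries")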
