You call APIs, and at first the latency and responses are within your expected parameters, but sometimes, just sometimes, the endpoint starts to respond differently. Maybe the latency increases, responses come back garbled, or they don’t come back at all and cause timeouts. Either way, you can tell something is wrong when connection attempts start to fail.

Load on the infrastructure behind the endpoint you are calling could be high (he or she is busy), or perhaps there are other factors in play, like network congestion (talking to other people) or a recent code or infrastructure change (he or she changed their mind about something), but your application still tries.

So, what error handling can we do?

Well, in AWS Support we always give a stock answer: retry with exponential backoff. In layman’s terms, if you receive a response you weren’t expecting, handle that error gracefully and try again another time. If you experience multiple errors, you introduce a delay and try again a few seconds later (not in dating though, that would be weird).

If you still don’t get the response you’re expecting, increase your delay and continue to try again, until your maximum attempts are exhausted and your application has to give up. At that point, you start to consider sending your API calls elsewhere, especially if there is no response at all.
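
Here’s a minimal sketch of what that retry loop might look like, assuming Python and the requests library; the URL, attempt count, and delay values are just illustrative, not a prescription:

```python
import time
import requests

MAX_ATTEMPTS = 5   # give up after this many tries
BASE_DELAY = 1     # seconds; doubles after each failure

def call_with_backoff(url):
    delay = BASE_DELAY
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()  # treat 4xx/5xx as errors too
            return response
        except requests.RequestException as err:
            print(f"Attempt {attempt} failed: {err}")
            if attempt == MAX_ATTEMPTS:
                raise              # attempts exhausted, let the caller decide
            time.sleep(delay)
            delay *= 2             # exponential backoff before the next try
```

In practice you’d usually add some random jitter to the delay as well, so that a crowd of clients doesn’t all retry at the exact same moment, but the shape of the loop is the same.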

The wait sucks in both cases… you could be left with a lot of half-open connections, causing other issues on your sending application’s infrastructure that can cascade elsewhere. In real life it just sucks and it hurts, but only as much as you let it.

In a couple of real-life examples, I was making API calls and the responses were fine: the TCP handshake completed and we started to communicate. In others, the two-way communication abruptly stopped. In some, I received an unexpected surprise and ended the connection on my side.

Timeouts

Recently, an endpoint I was calling just became unresponsive. At that point I started to troubleshoot (introspect), but realized that expecting that endpoint to behave the same way over a long period of time wasn’t fair to me, or to the endpoint…
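
One concrete way to avoid waiting forever on an unresponsive endpoint is to set an explicit client-side timeout, so the call fails fast and the retry logic above can take over (or you can route the call elsewhere). A small sketch, again assuming Python and requests; the URL and timeout values are made up for illustration:

```python
import requests

try:
    # connect timeout of 3s, read timeout of 10s: if the endpoint goes
    # quiet mid-conversation, we bail out instead of hanging indefinitely
    response = requests.get(
        "https://api.example.com/v1/status",
        timeout=(3, 10),
    )
except requests.Timeout:
    # the endpoint stopped responding; hand off to retry/backoff,
    # or start considering sending these calls elsewhere
    pass
```

Without a timeout, that call can sit open far longer than you’d like, which is exactly how those half-open connections pile up.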

So yeah, not sure where I’m going with this, but it all feels kind of the same at this point. Perhaps it was the way I was formatting my messages and something was not handled gracefully on the other side.

Not having logs to dive into or metrics to examine kind of sucks, but the show must go on.