A product that relies on third-party services must tolerate errors and react to any adversity that arises in its context. Here, “third party” means any entity involved in the development cycle that the main team does not manage: independent contractors, other companies, other in-house teams, and so on.
In the previous article, we argued that any integration with external agents has to be controlled. This article focuses on the following premise:
We must be able to react to transient errors
To react to this type of error, we must first understand what we mean by “transient”. A transient error is one that occurs unexpectedly and is likely to disappear within a short period of time. It can be caused by:
- Connection errors due to a momentary drop in service.
- Overload errors caused by too many connections to the service.
- Errors due to data corruption during transmission.
- Service errors with unknown causes that do not persist over time.
Because these errors do not persist over time, they can be corrected by following a simple “Try again” strategy.
We must choose the right number of attempts for each situation. Sometimes retrying indefinitely is the best option, but in other cases it results in unacceptable waiting times, so a sensible limit on attempts is needed.
For example, a service we have contracted to analyze the polarity (sentiment) of text may reject our connections because we have made too many calls in a very short time. The approach here could be to retry later and give that service time to “rest”.
Another example: we are using a service managed by another team within our own company, and it requires one minute of downtime for maintenance tasks. We cannot afford to have our dependent product collapse and stop providing service. We must be able to retry these interactions until the service is available again.
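The maintenance example can be sketched in Python as a retry loop bounded by a time budget rather than an attempt count. This is only an illustration under stated assumptions: `TransientError`, `retry_until_deadline`, the 90-second budget and the 5-second pause are hypothetical names and values, and the clock and sleep functions are injectable so that the usage can simulate the one-minute downtime without really waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear up on their own."""


def retry_until_deadline(operation, budget_seconds=90.0, pause_seconds=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Retry operation until it succeeds or the time budget would be exceeded."""
    deadline = clock() + budget_seconds
    while True:
        try:
            return operation()
        except TransientError:
            # Give up if another pause would take us past the deadline.
            if clock() + pause_seconds > deadline:
                raise
            sleep(pause_seconds)


# Usage: simulate a service that is down for the first 60 seconds.
now = {"t": 0.0}


def fake_clock():
    return now["t"]


def fake_sleep(seconds):
    now["t"] += seconds  # advance simulated time instead of waiting


def under_maintenance():
    if now["t"] < 60.0:
        raise TransientError("maintenance window")
    return "available"


status = retry_until_deadline(under_maintenance, clock=fake_clock, sleep=fake_sleep)
```

In this simulation the call keeps retrying through the maintenance window and returns once the simulated minute has passed. A budget shorter than the downtime would make it raise `TransientError` instead.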
With the concept of a transient error understood, we can look at different approaches to applying the Retry pattern. Below are some variations that let us act correctly in slightly different situations.
Retry immediately
This variant is based on the premise that if a service has failed, it is likely to be available if we try again immediately.
This way of handling transient errors is the easiest to use. If we have detected that the cause of the error is unlikely to recur (e.g. data corruption), we can choose to try again immediately.
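A minimal Python sketch of an immediate retry with an attempt limit could look like the following. The names `TransientError`, `retry_immediately` and `flaky_fetch` are hypothetical and chosen for illustration. The sketch assumes the caller can tell transient failures apart from permanent ones by exception type.

```python
class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear up on their own."""


def retry_immediately(operation, max_attempts=3):
    """Call operation until it succeeds or max_attempts is reached, with no waiting."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempt limit reached: let the caller see the failure


# Usage: a stand-in operation that fails once (e.g. a corrupted response), then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("corrupted response")
    return "payload"


result = retry_immediately(flaky_fetch)
```

Any exception other than `TransientError` propagates on the first occurrence, so errors that a retry cannot fix are not repeated.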
Retry with a fixed delay
This variant is based on the premise that if a service has failed, it is likely to fail again if we retry immediately.
Here the situation is slightly different: if we detect that a third-party service has failed for reasons of its own, we should not assume that it can recover immediately. Waiting a fixed time between attempts gives the service a few moments of respite instead of overloading it with requests that are bound to fail.
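As a sketch, the fixed-delay variant only adds a constant pause between attempts. The function name `retry_with_fixed_delay`, the `TransientError` type and the default values are assumptions made for this example. The `sleep` parameter is injectable so the usage can record the waits instead of actually pausing.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear up on their own."""


def retry_with_fixed_delay(operation, max_attempts=4, delay_seconds=3.0, sleep=time.sleep):
    """Retry operation, waiting the same delay_seconds after every failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # no attempts left: surface the error
            sleep(delay_seconds)  # give the service a moment of respite


# Usage: the stand-in service fails twice, then answers.
waits = []
attempts = {"count": 0}


def busy_service():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("service unavailable")
    return "ok"


outcome = retry_with_fixed_delay(busy_service, delay_seconds=3.0, sleep=waits.append)
```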
Retry with an increasing delay
There will be times when, if a service fails, making repeated calls will only prolong the downtime.
For example, if the network is saturated, establishing more connections will only saturate it further and increase the severity of the problem.
In this case, if we set 3 seconds as the base waiting time and the communication keeps failing, we will have the following waiting times:
| Try | Waiting time (seconds) |
In this example the waiting times scale linearly.
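A linear schedule can be computed with a few lines of Python. This sketch assumes that “linear” means the wait after the n-th failed try is n times the base time. `linear_delays` is a hypothetical helper name.

```python
def linear_delays(base_seconds, tries):
    """Return the wait after each failed try, growing by base_seconds each time."""
    return [base_seconds * n for n in range(1, tries + 1)]


# Build and print a small "try / waiting time" listing for a 3-second base.
schedule = linear_delays(3, 4)
for try_number, wait in enumerate(schedule, start=1):
    print(f"try {try_number}: wait {wait} s")
```

Under that assumption, four failed tries with a 3-second base give waits of 3, 6, 9 and 12 seconds.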
Depending on our needs, we may even have to define our own function to calculate retry times. The following graph shows retry times that follow an exponential function:
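One way to make the delay function pluggable is to pass it to the retry loop as a parameter. The sketch below assumes an exponential rule in which the wait doubles after each failed attempt, starting from a 3-second base. `exponential_delay`, `retry_with_backoff` and `TransientError` are hypothetical names, and the doubling factor is an illustrative choice rather than a recommendation.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear up on their own."""


def exponential_delay(attempt, base_seconds=3.0, factor=2.0):
    """Wait after the given failed attempt: base, base*factor, base*factor^2, ..."""
    return base_seconds * factor ** (attempt - 1)


def retry_with_backoff(operation, delay_for=exponential_delay, max_attempts=5,
                       sleep=time.sleep):
    """Retry operation, asking delay_for(attempt) how long to wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempt limit reached
            sleep(delay_for(attempt))


# Usage: a service that never recovers, so the recorded waits show the whole schedule.
waits = []
gave_up = False


def always_down():
    raise TransientError("network still saturated")


try:
    retry_with_backoff(always_down, max_attempts=4, sleep=waits.append)
except TransientError:
    gave_up = True
```

Swapping in a different `delay_for` function (linear, capped, or randomized) changes the schedule without touching the retry loop.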
Conclusions
This article has presented several approaches to the problem of transient errors. The following conclusions can be drawn from them:
- Always use a “Retry” pattern when a connection to third party services has to be made.
- Identify the business needs and choose the best variant of the Retry pattern.
- Set a sensible limit on attempts, based on the specifics of your project.
- There will be times when “Fail Fast” is the right approach and a connection should not be retried too many times.