Connectivity and Retries

Connectivity for IoT devices is generally a hard problem to solve, because errors can happen at many layers: the physical network, protocols, hardware, device constraints, and more.

The following document describes the approach that the Azure IoT Device SDK for Node.js takes to deal with the failures it detects, and how you can alter the default behavior to better suit your needs.

Types of errors and how to detect them

From an SDK perspective there are only a few types of failures we can detect, mostly related to network and protocols:

  • Network errors such as a disconnected socket, name resolution errors, etc.
  • Protocol-level errors for our HTTP, AMQP, and MQTT transports (links detached, session expired...)
  • Application-level errors that result from either local mistakes (invalid credentials) or service behavior (quota exceeded, throttling...)

One class of errors we do not deal with is hardware and OS-related errors, such as memory exhaustion or faulty drivers.

Loss of network connectivity

"Loss of network connectivity" can mean a lot of things: is the network controller disconnected? are we still connected but not resolving the IoT Hub hostname? or is it a routing issue where our packets are lost? In any case, to detect those errors, we rely on:

  • The socket errors that the Node.js runtime emits (ENOTFOUND, EAI_AGAIN, etc.)
  • The closure of the socket, if the OS supports it (Windows detects a "network cable unplugged" situation, for example, and kills the socket, whereas the Linux distributions we use for testing do not)
  • A keep-alive ping, if the IoT protocol supports it

Socket closure and errors are usually detected and bubbled up to the retry logic layer of the SDK pretty quickly. Keep-alive ping errors, however, take time, because it takes 2 cycles of pings to detect a failure. For example, in our MQTT transport layer, the default keep-alive interval is 3 minutes, and the success of a ping is evaluated just before the next one is sent. In other words, in the worst-case scenario, it takes 6 minutes to detect a network failure that happens immediately after a successful keep-alive ping.

Protocol errors

IoT protocols usually make use of OSI layer 5-7 constructs to manage sessions, presentation, etc. These can fail even when the network connectivity itself is working perfectly: links can be detached, sessions can be unmapped, and so on. These errors are caught by our transport layer, translated into "protocol-agnostic errors", and then fed to our retry logic, which can decide to retry the failed operation based on the type of error that is emitted.

Because the client layer of the SDK is transport-agnostic (the client doesn't know and doesn't care which protocol is used), the transport layers track their own state using state machines ("is my session OK?", "do I have the necessary links to execute that operation?", "am I authenticated?", etc.). Whenever the client requests a retry, the transport reevaluates its current state and, if necessary, recreates/reestablishes all the objects needed to perform the operation.

Application-level errors

"Application-level" in this case relates to an error that can happen server-side because either the service is misconfigured (for example, not enough units compared to the number of messages sent by devices) and that can be sent back to devices in hope to alter their behavior (in case it's being throttled for example).

Depending on the type of error the SDK may try a less aggressive retry policy, or not retry at all and let the user decide what to do.

How does the SDK deal with these errors?

Depending on the error type and the retry policy that has been configured, the SDK may or may not retry operations that could not be completed because of an error. The following sections describe the constructs used in the SDK to make this decision, the default behavior, and how to alter it.

What is a retry policy?

In our SDK, a retry policy is a combination of 2 things:

  • an error filter
  • an algorithm to calculate when to retry

Error filters

The error filter is a simple dictionary where each key is an error type and each value is a boolean indicating whether the operation should be retried if that error is encountered.

Changing the error filter means retrying on a different set of errors. Although good defaults are pretty easy to come up with, there are cases where, depending on the application, you may or may not want to retry. For example, if you get an IotHubQuotaExceededError, would you retry or not? If it's another few hours before your quota is reset but your retry configuration only retries for another 4 minutes, there is no point in retrying; but if you retry indefinitely, then maybe retrying is OK.
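
As an illustration, here is a minimal sketch of a custom filter that keeps the transient defaults but also treats quota errors as retryable. The filter is just a dictionary of error names to booleans; how it gets plugged into a policy depends on the policy implementation (see the custom retry policy sketch further down):

// Hypothetical example: an error filter that also retries on quota errors.
// Only the retryable entries are listed here; in this sketch, anything missing
// from the dictionary is treated as non-retryable.
const quotaTolerantFilter: { [errorName: string]: boolean } = {
  NotConnectedError: true,
  InternalServerError: true,
  ServiceUnavailableError: true,
  ThrottlingError: true,
  TimeoutError: true,
  IotHubQuotaExceededError: true  // differs from the default filter shown below
};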

Retry algorithms

When an error occurs and the retry policy kicks in, it calculates a delay to wait before retrying. The idea is that if an error happens very quickly, you don't want to retry immediately and keep hammering the network, or your IoT Hub, and make the problem worse (especially if the error is a ThrottlingError for example!).

The delay between retries is usually the result of 2 variables:

  • how many retries have already been executed
  • whether this is a throttling situation

The math formula used to calculate the delay varies depending on the policy that is chosen, but it generally has a few common characteristics:

  • the time between retries can be constant or increasing
  • a measure of randomness (also called jitter) can be added to avoid the thundering herd problem

The reasons to stop retrying could be:

  • A different error that should not be retried has been received
  • The total time allowed for retries has been, or would be, exceeded

What is the default retry policy?

In the SDK, the default retry policy is called "Exponential Backoff with Jitter". It's a fairly standard formula in the industry: it retries aggressively at the start, then slows down, and eventually hits a maximum delay that is not exceeded.

The formula is the following, x being the current retry count:

F(x) = min(Cmin + (2^(x-1) - 1) * rand(C * (1 - Jd), C * (1 - Ju)), Cmax)

There are a few constants in this formula; here are their roles and default values:

  • C: Initial retry interval, 100ms
  • Cmin: Lower bound for the delay between retries, 100ms
  • Cmax: Upper bound for the delay between retries, 10,000ms
  • Jd: Lower bound for the jitter factor, 0.5
  • Ju: Upper bound for the jitter factor, 0.25
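
For illustration only, here is a rough sketch of that formula in TypeScript. This is not the SDK's actual implementation, just the math written out with the default constants:

// Rough sketch of the delay computation described above (not the SDK's code).
const C = 100;       // initial retry interval (ms)
const Cmin = 100;    // lower bound for the delay between retries (ms)
const Cmax = 10000;  // upper bound for the delay between retries (ms)
const Jd = 0.5;      // lower bound for the jitter factor
const Ju = 0.25;     // upper bound for the jitter factor

// rand(a, b): uniformly distributed random number between a and b
const rand = (a: number, b: number): number => a + Math.random() * (b - a);

// Delay, in milliseconds, before retry number x (x starts at 1).
const nextDelay = (x: number): number =>
  Math.min(Cmin + (Math.pow(2, x - 1) - 1) * rand(C * (1 - Jd), C * (1 - Ju)), Cmax);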

What is the default error filter?

export class DefaultErrorFilter implements ErrorFilter {
  ArgumentError: boolean = false;
  ArgumentOutOfRangeError: boolean = false;
  DeviceMaximumQueueDepthExceededError: boolean = false;
  DeviceNotFoundError: boolean = false;
  FormatError: boolean = false;
  UnauthorizedError: boolean = false;
  NotImplementedError: boolean = false;
  NotConnectedError: boolean = true;
  IotHubQuotaExceededError: boolean = false;
  MessageTooLargeError: boolean = false;
  InternalServerError: boolean = true;
  ServiceUnavailableError: boolean = true;
  IotHubNotFoundError: boolean = false;
  IoTHubSuspendedError: boolean = false;
  JobNotFoundError: boolean = false;
  TooManyDevicesError: boolean = false;
  ThrottlingError: boolean = true;
  DeviceAlreadyExistsError: boolean = false;
  DeviceMessageLockLostError: boolean = false;
  InvalidEtagError: boolean = false;
  InvalidOperationError: boolean = false;
  PreconditionFailedError: boolean = false;
  TimeoutError: boolean = true;
  BadDeviceResponseError: boolean = false;
  GatewayTimeoutError: boolean = false;
  DeviceTimeoutError: boolean = false;
}

How to change the retry logic

The device client has a specific method to change the retry policy, called setRetryPolicy, which accepts a RetryPolicy object. That object is used to compute whether or not to retry and, if so, after how long.

Built-in retry policy objects

The SDK comes with 2 built-in RetryPolicy classes:

  • The ExponentialBackoffWithJitter class that implements the default retry policy discussed in the previous paragraph.
  • The NoRetry class that simply disables the retry logic and doesn't take any parameters.
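
For example, disabling retries on a device client could look like this (a minimal sketch; the import locations are an assumption based on the current package layout, so check your SDK version for the exact exports):

import { Client } from 'azure-iot-device';
import { Mqtt } from 'azure-iot-device-mqtt';
// NoRetry is assumed to be exported by azure-iot-common; check your SDK version.
import { NoRetry } from 'azure-iot-common';

const client = Client.fromConnectionString('<device connection string>', Mqtt);

// Disable the retry logic entirely: every failed operation surfaces its error immediately.
client.setRetryPolicy(new NoRetry());

// To restore the default behavior, pass a new instance of the exponential
// backoff policy (ExponentialBackOffWithJitter in current versions) instead.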

Creating a custom retry policy

The RetryPolicy interface is public, and it is possible for the SDK user to implement it and inject it into the SDK:

export interface RetryPolicy {
  nextRetryTimeout: (retryCount: number, isThrottled: boolean) => number;
  shouldRetry: (error: Error) => boolean;
}

The shouldRetry method is called when an operation fails and is passed the error that caused the failure. It should return true if the retry policy should kick in, or false to disable the retry and fail the operation. Depending on what type of operation is in progress, either the operation callback will be called with the error, or an error will be emitted by the Client object.

If shouldRetry returns true, the nextRetryTimeout method is called with 2 arguments: the current retry count, and a boolean indicating whether it's a throttling situation or not. It should return a number that indicates the number of milliseconds to wait before attempting the operation again.
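
Putting it together, here is a minimal sketch of a custom policy and how it could be injected into the client. The class and delay values are hypothetical, and the RetryPolicy interface is assumed to be importable from azure-iot-common (check your SDK version for the exact export):

import { Client } from 'azure-iot-device';
import { Mqtt } from 'azure-iot-device-mqtt';
import { RetryPolicy } from 'azure-iot-common';  // assumed export location

// Hypothetical policy: retry only on a couple of transient errors, waiting a
// fixed base delay plus a little jitter (and longer when throttled).
class FixedIntervalRetryPolicy implements RetryPolicy {
  shouldRetry(error: Error): boolean {
    return error.name === 'NotConnectedError' || error.name === 'ThrottlingError';
  }

  nextRetryTimeout(retryCount: number, isThrottled: boolean): number {
    const baseDelay = isThrottled ? 30000 : 5000;  // milliseconds
    return baseDelay + Math.random() * 1000;       // add jitter to avoid the thundering herd problem
  }
}

const client = Client.fromConnectionString('<device connection string>', Mqtt);
client.setRetryPolicy(new FixedIntervalRetryPolicy());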

Pitfalls and potential issues when retrying

There are a few things that can go wrong when implementing retries:

  • Not retrying conservatively: hammering the server with retries when throttling is in progress is going to make things worse. Here is a good article to learn more about throttling and what can trigger it. For example, you can get throttled for sending too many messages at once, but also for trying to connect too many devices at once. That's why the jitter factor in the retry policy is important.

  • The SDK can receive an unknown error from the underlying socket, protocol library, or the service. In that case, the error filter will most likely not know whether to retry or not and will default to not retrying (again, a conservative approach is best).

Conclusion

Hopefully you've learned a bit more about how the SDK works and how retries are implemented. If you want to learn even more, you can look directly at the source code in the SDK repository, and if you have questions, ask them in the issues section of the repository.