Fault Tolerance

// // Copyright (c) 2016-2017 Contributors to the Eclipse Foundation // // See the NOTICE file(s) distributed with this work for additional // information regarding copyright ownership. // // Licensed under the Apache License, Version 2.0 (the "License"); // You may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. //

Fault Tolerance

Proposal: MP-0004
Authors: Emily Jiang, Jonathan Halterman, Antoine Sabot-Durand, John Ament
Status: v1.0 released

During the review process, add the following fields as needed:

Decision Notes: Discussion thread topic covering the Rationale, Discussion thread topic with additional Commentary

Introduction

It is increasingly important to build fault tolerant micro services. Fault tolerance is about leveraging different strategies to guide the execution and result of some logic. Retry policies, bulkheads, and circuit breakers are popular concepts in this area. They dictate whether and when executions should take place, and fallbacks offer an alternative result when an execution does not complete successfully.

As mentioned above, the Fault Tolerance proposal is to focus the aspects: TimeOut, RetryPolicy, Fallback, bulkhead and circuit breaker.

TimeOut: Define a duration for timeout
RetryPolicy: Define a criteria on when to retry
Fallback: provide an alternative solution for a failed execution.
Bulkhead: isolate failures in part of the system while the rest part of the system can still function.
CircuitBreaker: offer a way of fail fast by automatically failing execution to prevent the system overloading and indefinite wait or timeout by the clients.

The main design is to separate execution logic from execution. The execution can be configured with fault tolerance policies, such as RetryPolicy, fallback, Bulkhead and CircuitBreaker.

Hystrix and Failsafe are two popular libraries for handling failures. This proposal is to define a standard API and approach for applications to follow in order to achieve the fault tolerance.

The requirements are as follows:

Loose coupling: Execution logic should not know anything about the execution status or fault tolerance.
Failure handling strategy should be configured when the execution takes place.
Support for synchronous and asynchronous execution
Integration with 3rd party asynchronous APIs. This is necessary to handle executions that are completed at some time in the future, where retries will need to be explicitly scheduled from within the asynchronous execution. This is common when working with various 3rd party asynchronous tools such as Netty, RxJava, Vert.x, etc.
Require immutable failure handling policy configuration
Some Failure policy configurations, e.g. CircuitBreaker, RetryPolicy, can be used stand alone. For example, it has been very useful for circuit breakers to be standalone constructs which can be plugged into and intentionally shared across multiple executions. Likewise for retry policies. Additionally, an Execution construct can be offered that allows retry policies to be applied to some logic in a standalone, manually controlled way.

Mailinglist thread: Discussion thread topic for that proposal

Motivation

Currently there are at least two libraries to provide fault tolerance. It is best to uniform the technologies and define a standard so that micro service applications can adopt and the implementation of fault tolerance can be provided by the containers if possible.

Proposed solution

Separate the responsibility of executing logic (Runnables/Callables/etc) from guiding when execution should take place (through retry policies, bulkheads, circuit breakers). In this way, failure handling strategies become configuration that can influence executions, and the execution API itself is just responsible for receiving some configuration and performing executions.

By default, a failure handling strategy could assume, for example, that any exception is a failure. This all is what the RetryPolicy's retryOn, abortOn are all about - defining a failure.

Standardise the Fallback, Bulkhead and CircuitBreaker APIs and provide implementations.

CDI-first approach to apply RetryPolicy, Fallback, BulkHead, CircuitBreaker using annotations

Detailed design (One example of implementations)

This specification utilises CDI to simplify the programming model.

CDI-based approach

Use interceptor binding to specify the execution and policy configuration. An annotation of Asynchronous has to be specified for any asynchronous calls. Otherwise, synchronous execution is assumed.

RetryPolicy: A policy to define the retry criteria

An annotation to specify the max retries, delays, maxDuration, Duration unit, jitter, retryOn etc.

CircuitBreaker: a rule to achieve fail fast, in order to prevent from repeating timeout

An annotation to specify when to open a circuit, when to half open, close the circuit.

Fallback

Define the fallback method or fallback handler for a failed execution.

Timeout to be used specifying the maximum time for an execution

Timeout to specify the maximum time for a particular execution.

Bulkhead - threadpool or semaphore style

Use this annotation without Asynchronous annotation for semaphore style. When used with Asynchronous, it means threadpool style of bulkhead.

Usage

The annotations can be applied to a bean or methods. They can be used together. For an instance, @Retry can be used with @Fallback in order to trigger the fallback when the Retry policy fails.

@ApplicationScoped
public class FaultToleranceBean {
   int i = 0;
   @Retry(maxRetries = 2)
   public Runnable doWork() {
      Runnable mainService = () -> serviceA(); // This unreliable service sometimes succeeds but
                                         // sometimes throws a RuntimeException
	  return mainService;								 
   }
}
}

Configuration

The annotation parameters can be configured via MicroProfile Config. In order to configure the maxRetries to be 6 for the following Retry policy, define a property org.microprofile.readme.FaultToleranceBean/doWork/Retry/maxRetries=6. Alternatively, if the maxRetries of the Retry is to be configured to 6, just specify the property of Retry/maxRetries=6.

package org.microprofile.readme
@ApplicationScoped
public class FaultToleranceBean {
   int i = 0;
   @Retry(maxRetries = 2)
   public Runnable doWork() {
      Runnable mainService = () -> serviceA(); // This unreliable service sometimes succeeds but
                                         // sometimes throws a RuntimeException
	  return mainService;								 
   }
}
}

Impact on existing code

n/a

Alternatives considered

n/a

Name		Name	Last commit message	Last commit date
Latest commit History 315 Commits
api		api
perform-release		perform-release
spec		spec
tck		tck
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
pom.xml		pom.xml
site.yaml		site.yaml

License

neilgsyoung/microprofile-fault-tolerance

Folders and files

Latest commit

History

Repository files navigation

Fault Tolerance

Introduction

Motivation

Proposed solution

Detailed design (One example of implementations)

CDI-based approach

RetryPolicy: A policy to define the retry criteria

CircuitBreaker: a rule to achieve fail fast, in order to prevent from repeating timeout

Fallback

Timeout to be used specifying the maximum time for an execution

Bulkhead - threadpool or semaphore style

Usage

Configuration

Impact on existing code

Alternatives considered

About

Resources

License

Stars

Watchers

Forks

Languages