Different types of applications perform tasks that include a number of steps, some of which might invoke remote services or access remote resources. Whenever possible, the application should ensure that the task runs to completion and resolve any failures that might occur during communication with external endpoints.
After such a failure, retrying immediately usually does not help; it is better to wait a while and try again later, in the hope that the remote service or resource has returned to a functional state.
The problem becomes more complicated when we consider recovery from crashes of the application itself, for example when the node hosting the application goes down. Someone needs to pick up the unfinished job and make sure it is completed.
I would like to propose a solution for efficient supervision logic based on Akka and Akka clustering, able to react immediately to a node failure and load-balance unfinished supervised work across the remaining cluster nodes.
I have a demo with a dashboard that shows all "pending" tasks on the cluster nodes and how those tasks are recovered and redistributed among the surviving nodes when one of them fails.