High Availability and Robustness in Automation Solutions
This post is part of the "Automation-Orchestration" architecture series. Posts of this series together comprise a whitepaper on Automation and Orchestration for Innovative IT-aaS Architectures.
High Availability and Robustness are critical elements of any automation solution. Since an automation solution forms the backbone of an entire IT infrastructure, one needs to avoid any type of downtime. In the rare case of a system failure, rapid recovery to resume operations shall be ensured.
Several architectural patterns can help achieve those goals.
Multi-Server / Multi-Process architecture
Assuming, that the automation solution of choice operates a centralized management layer as core component, the intelligence of the automation implementation is located in a central backbone and distributes execution to “dumb” endpoints. Advantages of such an architecture will be discussed in subsequent chapters. The current chapter focusses on the internal architecture of a “central engine” in more detail:
Robustness and near 100% availability is normally achieved through redundancy which comes by the trade-off of a larger infrastructure footprint and resource consumption. However, it is inherently crucial to base the central automation management layer on multiple servers and multiple processes. Not all of these processes necessarily act on a redundant basis, as would be the case with round-robin load balancing setups where multiple processes can all act on the same execution. However, in order to mitigate the risk of a complete failure of one particular function, the different processes distributed over various physical or virtual nodes need to be able to takeover operation of any running process.
This architectural approach also enables scalability which has been addressed previously. Most of all, however, this type of setup best supports the goal of 100% availability. At any given time, workload can be spread across multiple physical/virtual servers as well as split into different processes within the same (or different) servers.
Another aspect of a multi-process architecture that helps achieve zero-downtime non-stop operations is that it ensures restoration of an execution point at any given time. Assuming, that all transactional data is stored in a persistent queue, the transaction is automatically recovered by the remaining processes on the same or different node(s) in case of failure of a physical server, a single execution, or a process node. This prevents any interruption, loss of data, or impact to end-users.
Purposely, the figure above is the same as in the “Scalability” chapter, thereby showing how a multi-server / multi-process automation architecture can support both scalability and high availability.
In a multi-server/multi-process architecture, recovery can’t occur if the endpoint adapters aren’t capable of instantly reconnecting to a changed server setup. To avoid downtime, all decentralized components such as control and management interfaces, UI, and adapters must support automatic failover. In the case of a server or process outage these interfaces must immediately reconnect to the remaining processes.
In a reliable failover architecture, endpoints need to be in tune with the core engine setup at all times and must receive regular updates about available processes and their current load. This ensures that endpoints connect to central components based on load data, thereby actively supporting the execution of load balancing. This data can also be used to balance the re-connection load efficiently in case of failure/restart.
Even in an optimum availability setup, the potential of an IT disaster still exists. For this reason, organizations should be prepared with a comprehensive business continuity plan. In automation architectures, this does not necessarily mean rapid reboot of central components – although this could be achieved depending on how lightweight the automation solution central component architecture is constructed. More importantly, however, is the ability of rebooted server components to allow for swift re-connection of remote automation components such as endpoint adapters and/or management UIs. These concepts were discussed in depth in the scalability chapter.
Key Performance Indicators
Lightweight, robust architectures as described above have two main advantages beyond scalability and availability:
- They allow for small scale setups at a low total cost of ownership (TCO).
- They allow for large scale setups without entirely changing existing systems.
One of the first features of an automation system as described above is a low TCO for a small scale implementation.
Additional indicators to look for include:
- Number of customers served at the same time without change of the automation architecture
- Number of tasks per day the automation platform is capable of executing
- Number of endpoints connected to the central platform
In particular, the number of endpoints provides clarity about strengths of a high-grade, enterprise-ready, robust automation solution.