High Availability

Learn about Augtera's redundancy architecture

High availability is a critical capability that Augtera provides across both services and data. The Augtera platform is a distributed big-data system with many microservices acting on data ingested in real time. As explained in the Overview section, there are six primary functional areas, and each addresses high availability according to its own constraints.

The resources provided have a direct impact on system redundancy, and the deployment model also influences the constraints on the system. The following sections explain these resources and their impact on both the Collector stack and the Platform stack, and thereby on services and data.

Services in the Collector stack are designed to be stateless: no internal state or data from Augtera services needs to be persisted. In addition, no ingestion data is saved in the Collector stack. Resource failures that affect the high availability of Collector services are therefore limited to two scenarios: node failure and connection failure.

Node failure

Host or VM failure is the most common failure scenario. Depending on the level of redundancy desired, there are two deployment models to consider: Full Redundancy and N-Factor Redundancy.

Full Redundancy

In the Full Redundancy model, twice the number of required nodes are deployed. It is recommended that nodes be deployed in different availability zones to avoid multiple simultaneous failures. Collector orchestration deploys services so that only one instance of any service is active at any given location within a cluster. On a failure, the affected services are dynamically moved to the remaining nodes in the cluster. As an example, the diagram below illustrates a case where three clusters are deployed across six locations. A single node in each cluster is sufficient to ingest the desired rate, so each cluster is a two-node setup. Each cluster is fully redundant in itself and protects against one node failure. This deployment provides a very high level of redundancy across the network.
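The failover behavior described above can be pictured with a small sketch. This is not Augtera's orchestration logic: the function name `reschedule`, the least-loaded placement rule, and the service and node names are all assumptions made for illustration.

```python
def reschedule(placement, failed_node, healthy_nodes):
    """Return a new placement with services from failed_node moved elsewhere.

    placement maps a (hypothetical) service name to the node running its
    single active instance. The least-loaded rule is an assumption, not
    the documented Augtera behavior.
    """
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes left in the cluster")

    # Count how many services each healthy node already runs.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = dict(placement)
    for service, node in sorted(placement.items()):
        if node == failed_node:
            # Move the service to the least-loaded healthy node
            # (ties broken by node name so the result is deterministic).
            target = min(load, key=lambda n: (load[n], n))
            new_placement[service] = target
            load[target] += 1
    return new_placement


# Two-node cluster: every service on node-a moves to node-b when node-a fails.
before = {"snmp-poller": "node-a", "syslog-receiver": "node-a"}
after = reschedule(before, failed_node="node-a", healthy_nodes=["node-b"])
```

In a two-node cluster the surviving node simply takes over everything, which is why each cluster in the example needs a single node to be able to carry the full ingestion rate.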

N-Factor Redundancy

In the N-Factor model, the required nodes are supplemented by a factor N to achieve the desired redundancy. Services in the Collector stack can be orchestrated to protect against a fixed number of node failures. For example, in a deployment where two nodes are required for ingesting data, one more node can be added to provide redundancy. Services affected by a node failure are dynamically moved to the remaining nodes, which keeps data ingestion redundant. It is recommended that nodes in the cluster be placed in different availability zones to keep the number of simultaneous failures within the desired limit.
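As a rough capacity check for the N-Factor model, the sketch below treats every node as interchangeable and asks whether enough nodes remain after a given number of failures. The function names are hypothetical, and the equal-capacity assumption is a simplification.

```python
def tolerated_failures(deployed_nodes, required_nodes):
    """Number of node failures the cluster can absorb and still ingest at full rate.

    Assumes all nodes have equal capacity (a simplification).
    """
    return max(deployed_nodes - required_nodes, 0)


def has_capacity_after(deployed_nodes, required_nodes, failed_nodes):
    """True if the nodes that survive still meet the ingestion requirement."""
    return deployed_nodes - failed_nodes >= required_nodes


# The example from the text: two nodes required, one extra node added.
spare = tolerated_failures(deployed_nodes=3, required_nodes=2)
ok_after_one = has_capacity_after(3, 2, failed_nodes=1)
ok_after_two = has_capacity_after(3, 2, failed_nodes=2)
```

Under the same assumption, Full Redundancy is the special case where the number of deployed nodes is twice the number required.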

Connection failure

The Collector stack connects to the Platform stack over HTTPS or using Kafka over TLS. Since both connections are stateful, a failure in the connection to the Platform stack can affect ingestion data already received by the Collector stack. Data handling depends on how the data is ingested, and the following three cases explain the behavior for each.

Kafka Ingestion: If data is being ingested from Kafka, services in the Collector stack pause reading from the Kafka broker and resume when the connection to the Platform stack is restored. However, data in transit at the moment of the failure can be lost.
SNMP Ingestion: If data is being ingested via SNMP traps or SNMP polling, data in transit that cannot be pushed to the Platform stack because of the failure is persisted on the local filesystem of the Collector stack. When the connection to the Platform stack is re-established, the persisted data is read and pushed first, and services eventually catch up to real-time ingestion.
Everything else: All other data pushed to the Collector stack is buffered in RAM to ride out transient failures, but only for very short periods. How much data is lost depends on the rate of incoming ingestion data.
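The SNMP case above follows a persist-then-replay pattern, which the sketch below illustrates. It is a minimal sketch under stated assumptions, not the Collector implementation: the class name `SpoolingForwarder`, the JSON-lines spool file, and the `send` callable (which is assumed to raise `ConnectionError` when the Platform stack is unreachable) are all hypothetical.

```python
import json
import os


class SpoolingForwarder:
    """Forward records upstream; spool them to local disk while the link is down."""

    def __init__(self, send, spool_path):
        self.send = send              # hypothetical callable; raises ConnectionError on failure
        self.spool_path = spool_path  # hypothetical JSON-lines spool file

    def forward(self, record):
        try:
            self._replay()            # persisted data is pushed first
            self.send(record)
        except ConnectionError:
            self._persist(record)     # link is down: keep the record on disk

    def _persist(self, record):
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def _replay(self):
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        try:
            for record in pending:
                self.send(record)
                sent += 1
        finally:
            # Keep only what has not been delivered yet, in original order.
            remaining = pending[sent:]
            if remaining:
                with open(self.spool_path, "w", encoding="utf-8") as f:
                    for record in remaining:
                        f.write(json.dumps(record) + "\n")
            else:
                os.remove(self.spool_path)
```

Replaying the spool before sending the newest record preserves arrival order, which matches the description that persisted data is pushed first and real-time data catches up afterwards. The in-RAM buffering used for other sources would look similar but with a bounded in-memory queue instead of a file, so anything beyond the buffer's capacity is lost.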

Platform Stack Redundancy

The Platform stack contains both stateless and stateful services, so service state must be persisted. In addition, ingestion data enriched by the pipeline is also saved in the Platform stack. High availability of Platform services and data therefore depends heavily on both resource constraints and the deployment model.

Data Availability

Irrespective of the deployment model, data availability depends primarily on the underlying storage infrastructure. In cloud environments, the Platform stack relies on the provider's storage service, such as AWS S3, for data availability. In on-premise deployments, the Platform stack requires underlying storage infrastructure, such as NFS, to provide data availability. Service availability is explained in the next section, but service failures have no impact on data availability, because affected services move dynamically to remaining nodes that also have access to the underlying data.

Service Availability

There are two deployment models to consider for service availability: Primary-DR and N-Factor. The two models provide different levels of redundancy, and the choice between them must be based on the SLA requirement.

1. Primary and Disaster Recovery

In the Primary-DR model, full (100%) redundancy is provided for both services and data. Resources are allocated for two separate stacks, which run independently of each other and share no resources. The two stacks are also deployed in different fault domains or availability zones.

Of the two stacks, one is designated as Primary and the other as DR (Disaster Recovery). The Primary stack is responsible for the Presentation functionality described in the Overview section. All outbound interaction from Augtera occurs only on the Primary stack. For example, ticket creation in ServiceNow is done only from the Primary stack. The DR stack is always ready to fail over and assume the role of primary. When a failure causes the Primary stack to lose any critical functionality, Augtera provides a one-click operation that transitions the DR stack to the primary role, thereby providing full redundancy across all data and services.
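The role rules above (outbound actions only from the Primary stack, and a single operation that promotes the DR stack) can be sketched as follows. The `Stack` class, the `fail_over` function, and the stack names are hypothetical; in particular, demoting the old Primary during failover is an assumption, since the text does not say what happens to it.

```python
from dataclasses import dataclass, field


@dataclass
class Stack:
    name: str
    role: str  # "primary" or "dr"
    tickets: list = field(default_factory=list)

    def create_ticket(self, summary):
        """Perform an outbound action only if this stack holds the primary role."""
        if self.role != "primary":
            return False
        self.tickets.append(summary)
        return True


def fail_over(primary, dr):
    """Promote the DR stack to primary.

    Demoting the old primary is an assumption made here so that two
    stacks never act as primary at the same time.
    """
    primary.role = "dr"
    dr.role = "primary"


site_a = Stack("site-a", "primary")
site_b = Stack("site-b", "dr")
site_b.create_ticket("link flap")   # ignored: site-b is the DR stack
fail_over(site_a, site_b)
site_b.create_ticket("link flap")   # accepted: site-b is now primary
```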

2. N-Factor Redundancy

In the N-Factor model, the required resources are supplemented by a factor N to achieve the desired redundancy. For example, if 50% more nodes than the minimum required are allocated, approximately 50% redundancy is achieved. Services are orchestrated across the available resources to optimize the availability of critical services and data. Precisely calculating how much redundancy exists in a deployment with less than twice the required resources is much more complex, because every individual resource and the impact of its failure across storage, compute, memory, and network must be evaluated. It is therefore critical to define SLA requirements that specify how many failures, and of what kind, must be handled while keeping critical services and data available. For example, an on-premise deployment with ten nodes can be orchestrated so that critical services remain functional for up to two node failures, thereby providing 20% redundancy for node failures. With more than two node failures, service degradation can occur.
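The two percentages quoted above appear to use different denominators: the 50% figure is measured against the minimum required nodes, while the 20% figure is measured against the total deployed nodes. The sketch below makes both calculations explicit. The function names are hypothetical, and the node counts in the first example (four required, six deployed) are chosen only to illustrate a 50% surplus.

```python
def surplus_percent(required_nodes, deployed_nodes):
    """Extra nodes as a percentage of the minimum required."""
    return 100.0 * (deployed_nodes - required_nodes) / required_nodes


def node_failure_redundancy_percent(total_nodes, tolerated_failures):
    """Tolerated node failures as a percentage of all deployed nodes."""
    return 100.0 * tolerated_failures / total_nodes


# Illustrative numbers: 4 nodes required, 6 deployed, i.e. 50% more than the minimum.
surplus = surplus_percent(required_nodes=4, deployed_nodes=6)

# The example from the text: 10 nodes, critical services survive 2 node failures.
redundancy = node_failure_redundancy_percent(total_nodes=10, tolerated_failures=2)
```

These figures cover node counts only. As the text notes, a precise assessment also has to account for storage, compute, memory, and network failures, which simple arithmetic like this does not capture.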