You rarely get advance notice that disruption is coming. Safeguarding your systems is critical for business operations; otherwise, you risk downtime and the loss of clients and reputation. That’s why every business must plan to make its systems resilient to potential threats and vulnerabilities. In this article, we share what to consider on the road to resiliency and how to prepare for chaotic situations and unexpected failures.
Infrastructure resiliency options
Almost all enterprise applications need external integrations in order to meet user expectations. Modern software architectures, like microservices or serverless, commonly break the application logic down into smaller components, whose number is growing at a rapid pace. This leads to the question – how can we ensure that all the components are resilient enough?
The first thing that we should think about is how to make our infrastructure fault-tolerant and available. Most cloud providers have built-in solutions for high availability, which solve most of the problems that we might face from an infrastructure point of view. For everything else, load balancers, autoscaling groups and/or message brokers come into play.
Load balancers distribute requests evenly across all application servers. They can track the application health and send traffic to healthy servers only. Another very useful feature is SSL termination on the load balancer. Decrypting the incoming data is a CPU-intensive operation, and letting the load balancer do that job leaves our application servers more resources to process client requests.
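To make the health-aware distribution described above more concrete, here is a minimal sketch of round-robin selection that skips unhealthy servers. The class name `RoundRobinBalancer` and the `is_healthy` callback are hypothetical and chosen only for illustration; a real load balancer runs its health checks in the background and does far more than this.

```python
from typing import Callable, Sequence


class RoundRobinBalancer:
    """Toy round-robin balancer that only returns healthy servers (illustrative)."""

    def __init__(self, servers: Sequence[str], is_healthy: Callable[[str], bool]):
        self._servers = list(servers)
        self._is_healthy = is_healthy  # assumed health-check callback
        self._next = 0                 # index of the next server to try

    def pick(self) -> str:
        # Try each server at most once, starting where the last pick stopped.
        for _ in range(len(self._servers)):
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if self._is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")
```

For example, with servers `a`, `b` and `c` where `b` fails its health check, repeated calls to `pick()` alternate between `a` and `c`.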
Autoscaling groups monitor the health and resource utilization of target services. If more capacity is needed, either more servers are created (scaling out) or the existing ones are given more resources (scaling up).
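As a rough illustration of the scaling-out decision, the sketch below sizes a group in proportion to how far the observed utilization is from a target. The function name, its parameters and the default limits are assumptions made for this example; it resembles the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, and actual cloud autoscaling policies differ in their details.

```python
import math


def desired_server_count(
    current_servers: int,
    current_utilization_pct: int,
    target_utilization_pct: int,
    min_servers: int = 1,
    max_servers: int = 10,
) -> int:
    """Return how many servers we would want, clamped to the group limits."""
    if current_servers < 1 or target_utilization_pct <= 0:
        raise ValueError("need at least one server and a positive target")
    # Scale the group proportionally to the gap between observed and target load.
    wanted = math.ceil(
        current_servers * current_utilization_pct / target_utilization_pct
    )
    return max(min_servers, min(max_servers, wanted))
```

With four servers running at 90% CPU and a 60% target, the function asks for six servers; at 30% it asks for two.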
Message brokers handle message processing. When a server picks up a message but does not process it within a reasonable time, the message becomes available again and another server can take it.
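The redelivery behavior described above can be sketched with a small in-memory queue. This is a toy model, not the API of any real broker: the class `RedeliveryQueue`, the `visibility_timeout` parameter and the injectable `clock` are all assumptions made for illustration.

```python
import time
from collections import deque
from typing import Callable, Optional, Tuple


class RedeliveryQueue:
    """Toy queue that redelivers messages not acknowledged in time (illustrative)."""

    def __init__(self, visibility_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.visibility_timeout = visibility_timeout
        self._clock = clock
        self._ready = deque()   # messages waiting for a consumer
        self._in_flight = {}    # message id -> (body, redelivery deadline)

    def publish(self, msg_id: str, body: str) -> None:
        self._ready.append((msg_id, body))

    def receive(self) -> Optional[Tuple[str, str]]:
        self._requeue_expired()
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        # Hide the message from other consumers until the deadline passes.
        self._in_flight[msg_id] = (body, self._clock() + self.visibility_timeout)
        return msg_id, body

    def ack(self, msg_id: str) -> None:
        # The consumer finished processing, so the message is gone for good.
        self._in_flight.pop(msg_id, None)

    def _requeue_expired(self) -> None:
        now = self._clock()
        for msg_id, (body, deadline) in list(self._in_flight.items()):
            if deadline <= now:
                del self._in_flight[msg_id]
                self._ready.append((msg_id, body))
```

A consumer that crashes never calls `ack`, so once the timeout passes the next `receive()` hands the same message to another consumer.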
We can also give critical components (e.g. the database or the queue) dedicated resources. This ensures that these components perform better and are less likely to fail. But the reality is that both the components under our direct control and the external integration services will fail at some point. The more components our systems consist of, the more likely it is that one of them is failing at any given moment, and the more important it becomes to plan for actual failure scenarios. For example, Netflix has gone further than anticipating and planning for failure and implemented automated failure testing.
With unexpected events lurking around every corner, proper monitoring will save us many times over. For example, good monitoring will notify us that we need to extend our database disk space before it runs out and the database stops running normally. When an actual error happens, it needs to be logged in a single repository that the support team can easily search. In a microservices architecture, it is recommended to create a distinct error tracking service.
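The disk space example can be sketched as a simple threshold check. The function name, the 15% threshold and the injectable `usage` parameter are assumptions for illustration; in practice a monitoring system would run a check like this on a schedule and route the alert to the team.

```python
import shutil
from typing import Callable, Optional


def disk_space_alert(path: str = "/", min_free_ratio: float = 0.15,
                     usage: Callable = shutil.disk_usage) -> Optional[str]:
    """Return an alert message when free space drops below the threshold."""
    total, _used, free = usage(path)  # shutil.disk_usage returns (total, used, free)
    free_ratio = free / total
    if free_ratio < min_free_ratio:
        return f"Low disk space on {path}: {free_ratio:.0%} free"
    return None  # enough space, nothing to report
```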
We already discussed the autoscaling groups which can replace unhealthy servers or add/remove them depending on the application load. But what happens if each server maintains the application state in memory? This makes our servers harder to replace than they should be, because every time a server goes down, many users lose their recent progress and data. That’s why we should strive to make our applications stateless, which allows us to replace any existing application server and eliminates the hassle of handling user sessions and state during server replacement. It’s even better to think of the servers as redundant components which can be replaced at any time without notice. If we plan for that from the beginning, we will be ready for almost any scenario.
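A minimal sketch of the stateless idea: the server objects below keep no session data of their own and read and write everything through a shared store. `SessionStore` is a hypothetical in-memory stand-in for an external database or cache, and `AppServer` and `add_to_cart` are illustrative names only.

```python
from typing import Dict, List


class SessionStore:
    """In-memory stand-in for a shared external store (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def load(self, session_id: str) -> List[str]:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, cart: List[str]) -> None:
        self._data[session_id] = list(cart)


class AppServer:
    """Stateless server: all session state lives in the shared store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self._store = store

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        cart = self._store.load(session_id)  # fetch state on every request
        cart.append(item)
        self._store.save(session_id, cart)   # persist before responding
        return cart
```

If one server is replaced mid-session, another server attached to the same store continues with the user's cart intact.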
Application code recommendations
The services we depend on are often too busy to respond in a timely manner, or don’t respond at all due to failure. Our application performance will suffer if we wait too long for a reply. For that very reason, it is always a good idea to set a timeout when sending a request to an external service. In other words – it’s better to fail after a couple of seconds than to keep waiting for a reply forever.
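As a minimal sketch of the timeout advice, the function below passes an explicit `timeout` to Python's `urllib.request.urlopen` and gives up when the call fails or takes too long. The function name, the two-second default and the `opener` parameter (there only so the call can be swapped out in tests) are assumptions for this example.

```python
import urllib.request
from typing import Callable, Optional


def fetch_with_timeout(url: str, timeout_seconds: float = 2.0,
                       opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Return the response body, or None if the service fails or is too slow."""
    try:
        # Without an explicit timeout the call could block indefinitely.
        with opener(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # URLError, connection errors and socket timeouts all derive from OSError.
        return None
```

Returning `None` is one possible choice; raising a dedicated exception works just as well, as long as the caller gets an answer quickly.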
When a service is unavailable or our requests are throttled, we can implement a retry pattern. This pattern assumes that the external resource is only temporarily unavailable or that we have reached a call quota/limit. The most common strategy is to call the same endpoint again one or more times. If we know that the service has a call quota, we can send the retries with a small delay that increases with each attempt, to reduce the chance of throttling errors.
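The retry-with-increasing-delay strategy could look like the sketch below. It doubles the delay after every failed attempt, which is one common choice (exponential backoff) rather than the only one. The function name, the default values and the injectable `sleep` parameter are assumptions for illustration, and a real implementation would usually retry only on errors known to be transient.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 0.5,
                       retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call` until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 0.5s, 1s, 2s, ... with the default base delay.
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```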
If our application serves many users, it might be a good idea to consider the circuit breaker design pattern. Due to the heavy load, our application might try to access the same external resource hundreds of times in the span of seconds. We do not want to end up in a situation where all the requests to this resource fail and continue to fail in the same fashion for a long period of time. We can save network resources, hardware resources and time with a simple yet effective solution – the circuit breaker. The job of the circuit breaker is to monitor the calls to an external resource. If the configured threshold number of failures is reached, the circuit is set to the open state so that all new requests fail immediately, without making calls to the problematic resource, for a specific period. After that period, the circuit breaker lets a trial request through to check whether the external resource is reachable again (often called the half-open state). If it is, the circuit breaker goes back to the closed state and passes requests to the resource once again. Otherwise, it simply returns to the open state.
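The state transitions described above can be sketched in a few lines. This is a simplified, single-threaded illustration: the class and exception names, the default threshold and reset timeout, and the injectable `clock` are all assumptions, and production libraries add thread safety, metrics and finer-grained failure rules.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the resource."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # waiting period is over: allow a trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get a `CircuitOpenError` immediately and the struggling resource receives no traffic at all, which gives it time to recover.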
The above patterns allow us to fail fast or return to a working state more easily. However, an external service or the most recent data is not always essential for the operation of our application. In many cases we can fall back on showing slightly stale data that was cached several minutes ago rather than showing no data at all. If we want to further enhance resiliency, we can combine several of the patterns at once. For example, the timeout, retry and circuit breaker design patterns work very well together.
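Falling back on cached data can be sketched as a thin wrapper around the call to the external service. The class name `CachedFallback`, the five-minute staleness limit and the injectable `clock` are assumptions for illustration; how stale is acceptable depends entirely on the data being shown.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedFallback:
    """Serve the last known value when the live call fails (illustrative)."""

    def __init__(self, fetch: Callable[[str], Any], max_stale_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self.max_stale_seconds = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored at)

    def get(self, key: str) -> Any:
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            # Use the cached copy only if it is not too old.
            if cached is not None and self._clock() - cached[1] <= self.max_stale_seconds:
                return cached[0]
            raise  # nothing usable in the cache: let the caller handle the failure
        self._cache[key] = (value, self._clock())
        return value
```

The `fetch` callable passed in could itself apply a timeout, retries and a circuit breaker, which is one way to combine the patterns.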
To sum up
Nowadays, users have tremendous expectations when it comes to their favorite applications. Applications need to be available and fast, or they will be quickly forgotten. Fortunately, the major cloud vendors provide features that can automatically handle many of the scalability and resiliency tasks for us, even for millions of users. So, resiliency planning really must be on your business agenda. It will allow you to easily scale applications depending on the load and create failure-tolerant infrastructure.
Need an extra hand to help you out on your journey to resiliency? Turn to our team at Accedia, which has helped diverse stakeholders evaluate plans, set strategies and implement projects that enable their teams to adapt and thrive when faced with challenges.