Incidents and Operational Resiliency - Why It Matters and What to Consider

Are you prepared for an IT incident? If a core component of your infrastructure suddenly failed or data was lost, who would you contact? About two thirds of German mid-size companies would have no ready answer to these questions [1] - even though being prepared could save enormous costs.

Facts and Data

An average IT outage in a mid-sized company (200-5,000 employees) costs about 25,000 Euro per hour, according to a recent study. German companies experience up to four of these outages per year, each lasting about 3.8 hours on average - causing annual economic damage of roughly 380,000 Euro! [1]
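
The annual figure follows directly from the three numbers quoted above. As a minimal sketch for plugging in your own values (the function name and parameters are illustrative, not taken from the study):

```python
def annual_outage_cost(cost_per_hour_eur: float,
                       outages_per_year: int,
                       avg_duration_hours: float) -> float:
    """Estimate yearly outage cost as hourly cost x number of outages x duration."""
    return cost_per_hour_eur * outages_per_year * avg_duration_hours


# Figures quoted from [1]: 25,000 Euro per hour, four outages, 3.8 hours each.
estimate = annual_outage_cost(25_000, 4, 3.8)
print(f"{estimate:,.0f} Euro per year")
```

Note that this simple product ignores follow-up costs such as reputational damage or contractual penalties, so it is best read as a lower bound.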

There are many reasons for the high level of damage: even though only one third of the registered outages have any impact on customer operations [2], internal disruptions can also lead to widespread loss of productivity.

Causes and Mitigation

A process review can offer the greatest remedy: a full 20% of outages can be traced to poor process adherence [2]. Here, the root cause must be examined carefully: the reason for the process deviation is an important clue as to what can be improved. With hybrid and remote work environments, the requirements for processes change as well. “People before process” is a helpful mantra to consider, keeping the focus on adjusting processes to the needs of your staff. This does not mean that “sensitivities” should dictate workflows, but that processes should support employees in their work as much as possible and not hinder them.

Technology can help as well: hyperscalers such as AWS in particular offer a plethora of ways to react to outages and errors - be it through smart monitoring and alerting or even automatic error resolution, for example by restarting a certain service or machine.
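
To make the idea of automatic error resolution more concrete, here is a minimal, provider-neutral sketch of a watchdog that restarts a service after several consecutive failed health checks. All names (`Watchdog`, `probe`, `restart`, `simulate`) are hypothetical; in a real setup the probe and restart callbacks would be wired to your monitoring and your cloud provider's tooling rather than to the simulated values used here.

```python
from typing import Callable, Iterable, List


class Watchdog:
    """Restart a service after a number of consecutive failed health checks."""

    def __init__(self, probe: Callable[[], bool], restart: Callable[[], None],
                 failure_threshold: int = 3) -> None:
        self.probe = probe                # returns True if the service is healthy
        self.restart = restart            # remediation action (assumed to exist)
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def tick(self) -> str:
        """Run one health check and return 'ok', 'degraded' or 'restarted'."""
        if self.probe():
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "degraded"             # a good place to raise an alert
        self.restart()
        self.consecutive_failures = 0
        return "restarted"


def simulate(probe_results: Iterable[bool], failure_threshold: int = 3) -> List[str]:
    """Feed a fixed sequence of health-check results through a watchdog."""
    outcomes = list(probe_results)
    results = iter(outcomes)
    restarts: List[str] = []
    watchdog = Watchdog(
        probe=lambda: next(results),
        restart=lambda: restarts.append("restart"),
        failure_threshold=failure_threshold,
    )
    return [watchdog.tick() for _ in outcomes]


print(simulate([True, False, False, False, True]))
# ['ok', 'degraded', 'degraded', 'restarted', 'ok']
```

Requiring several consecutive failures before restarting is a deliberate choice: it keeps a single flaky check from triggering an unnecessary restart.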

The choice of your cloud provider is therefore the first factor in a resilient infrastructure: AWS is one of the few cloud providers that has guaranteed the physical separation of its availability zones since its early years, thereby establishing geophysical redundancy. Microsoft Azure only established mandatory geophysical separation in 2018 [3], and Google Cloud Platform still does not guarantee significant physical separation of its zones, although an incident there in 2023 famously demonstrated why such separation makes sense. [4]

The services and technologies used are also a key factor in achieving resiliency: smart monitoring and logging, as well as properly configured autoscaling and established error management, already go a long way.

Finally, a missing or unknown disaster recovery concept is another reason for long-lasting outages. While a prepared company can ideally restore basic operation at the push of a button, unprepared ones often have to take inventory first to see what actually needs to be worked on to restore operations.
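
One way to avoid taking inventory in the middle of an outage is to maintain it beforehand in a machine-readable form. The following sketch assumes a small, entirely hypothetical inventory that maps each component to the components it depends on, and derives a restore order from it using Python's standard `graphlib` module (Python 3.9 or later):

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical inventory: component -> components it depends on.
INVENTORY: Dict[str, List[str]] = {
    "database": [],
    "auth-service": ["database"],
    "web-shop": ["auth-service", "database"],
}


def restore_order(inventory: Dict[str, List[str]]) -> List[str]:
    """Return components in an order where dependencies are restored first."""
    return list(TopologicalSorter(inventory).static_order())


print(restore_order(INVENTORY))
# ['database', 'auth-service', 'web-shop']
```

If the inventory contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is useful to discover during a review rather than during a recovery.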

A well-known scenario in which you directly profit from being prepared is a ransomware attack. Such an attack not only affects your applications, but also renders all company data inaccessible. A well-architected cloud infrastructure with protected (tamper-proof) backups can save significant amounts of time and money in this case: affected applications can be quickly terminated, and well-rehearsed data recovery operations can reduce any data loss to an acceptable level.

Conclusion

Incidents and outages are expensive, even more so with a growing number of employees and/or customers. Dealing with an outage ad hoc is error-prone and takes a lot of time. Being prepared by drawing on current data from industry and research and by establishing change, incident and recovery procedures will help to avoid incidents or at least keep them short. To prepare, all parts, processes, infrastructure and threat models of your production chain should be considered.

Make our Expertise Your Own!

As an AWS strategic partner, PCG can call on many years of experience to support you on your journey to resiliency: from best-practice and process reviews and optimization to expert knowledge of various AWS technologies, observability and Elasticsearch, as well as orchestrating your microservice deployments on Kubernetes - we are happy to support you to the best of our ability.

Don’t hesitate to get in touch if you wish to review your infrastructure and processes in regard to resilience - we look forward to working with and supporting you on your resiliency journey!

[1] https://digitalisationworld.com/news/27800/hp-studie-it-systemausf-auml-lle-kosten-deutsche-mittelst-auml-ndler-im-durchschnitt-fast-400000-euro-pro-jahr

[2] Uptime Institute, Annual Outage Analysis 2023

[3] https://azure.microsoft.com/en-us/blog/azure-availability-zones-now-available-for-the-most-comprehensive-resiliency-strategy/

[4] https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPfkY

