PCG logo
Case Study

How monitoring can improve the security and availability of systems

The Challenge

For customers, problems with the availability and security of their applications and infrastructures are a nuisance.

In this case study, we show how we at PCG use monitoring solutions such as Datadog to optimize the availability and data security of our customers - for greater customer satisfaction and a strong relationship between PCG and our customers.

Availability

One of our customer’s major vulnerabilities is facing potential downtimes caused by an EC2 instances or application availability problems, such as those stemming from DoS attacks.

A specific incident highlighting this vulnerability occurred in early October, precisely at 2:23 am on a Saturday. A Confluence outage took place, prompting the on-call PCG operator to take action.

Security

The customer is also highly interested in security concerns surrounding vulnerabilities within Atlassian applications and the flatcar operating system. CVEs on either the application level or the OS might lead to security breaches.

In June 2022, the spotlight was on a critical security issue — a zero-day unauthenticated remote code execution (CVE-2022-26134External Link) — in Atlassian Confluence (Server and Data Center) applications. This caught our attention, prompting swift action from our agile internal vulnerability management team, who closely monitors multiple channels for vulnerability alerts such as this one.

The Solution

To maintain a robust system, we have established an on-call schedule for our customer, providing round-the-clock support in the event of any incidents. We ensure continuous (24/7) monitoring of our infrastructure, specifically focusing on EC2 instance metrics such as CPU usage, disk usage, available RAM and other important metrics. This monitoring is supported through our dedicated monitoring tools, Datadog and New Relic.

image-d8a1b6dfe578

Furthermore, we closely monitor specific metrics related to the customer’s Atlassian Java applications, including swapping, garbage collection, and heap allocations. In the event of any issues, our Datadog monitors promptly trigger alerts, seamlessly integrated with our messaging platform Slack and operations tool OpsGenie. This integration ensures that the employee on-call promptly receives the necessary notifications, as can be seen below:

image-d6516ce62e10
image-80ea4afc0873

In case of the problem described above, a swift restart provided a temporary resolution, successfully bringing Confluence back to normal operation after 8 minutes. Ongoing monitoring revealed a continuous increase in CPU usage and load on the Confluence instance, which ultimately led to the outage. Upon detailed investigation of the logs, it was discovered that numerous requests from a user were triggering high-load tasks, notably PDF exports, through the proxy to Confluence:

image-d0a4e23b441c

Despite notifying the user as well as the customer via email and a Service Desk ticket, the requests persisted several hours after the incident.

To prevent any additional outages, the user was subsequently blocked, which effectively resolved the performance issues.

Security Solution

Our team consistently monitors the flatcar operating system, as well as specific Atlassian application CVEs and promptly addresses any identified issues.

Security monitoring is managed by a range of resources, including announcements from Atlassian concerning application-specific CVEs.

image-16c986bc2fbe

Moreover, we use secalerts.coExternal Link to receive information about CVEs affecting software we are using, such as flatcar. secalerts integrates with our Jira and Slack, generating tickets and alerts within our network.

image-8a435720cc33

These alerts are carefully assessed, evaluating the potential impact on our customer. In situations of elevated risk, we proactively communicate to customers, providing detailed insights into specific issues, potential fixes, and available workarounds to mitigate risks where possible. For critical events, we sometimes conduct system patches during non-business hours, prioritizing the security of our customers’ systems, even in the absence of prior customer approval.

In addition to the two security monitoring solutions above, we use Amazon InspectorExternal Link to scan our container images and the corresponding libraries for vulnerabilities:

image-1e275b91ad5a

When fixable by us, we patch the libraries and deploy the patched containers asap. One example can be seen in the screenshot below, in which we fixed multiple critical and high findings in our PostgreSQL container:

image-c52173ad044a

Regarding the previously mentioned incident, we quickly notified our customers’ technical contacts and implemented the recommended mitigations, including the blocking of requests that matched specific URL patterns. Furthermore, we enforced IP whitelisting to restrict access to instances where feasible. Once Atlassian released the patch version, we applied the automated deployment process, ensuring that the fix was seamlessly integrated into the customer’s system within a matter of hours.

Results and Benefits

The customer subsequently endorsed our process, clarifying that the user’s requests were unintentional and automated.

The continuous monitoring of the customer’s infrastructure and applications around the clock therefore contributed to a consistently stable system, as evident from the subsequent uptime report:

image-1043abf341b2

The above three security procedures therefore guarantee that the software running on the customer’s EC2 instances is patched and secure. This protects the customer data from potential exploits as well as possible.

About PCG

Public Cloud Group (PCG) supports companies in their digital transformation through the use of public cloud solutions.

With a product portfolio designed to accompany organisations of all sizes in their cloud journey and competence that is a synonym for highly qualified staff that clients and partners like to work with, PCG is positioned as a reliable and trustworthy partner for the hyperscalers, relevant and with repeatedly validated competence and credibility.

We have the highest partnership status with the three relevant hyperscalers: Amazon Web Services (AWS), Google, and Microsoft. As experienced providers, we advise our customers independently with cloud implementation, application development, and managed services.


Continue Reading

Article
Gemini for Google Workspace: Prompting Guide for Efficient Use of AI

Your cheat code for Gemini for Google Workspace: Get the most out of the AI features with our prompting guide.

Learn more
Case Study
Software
DevOps
Accounting Accelerates

What began as a start-up in their parents' basement has developed into a leading provider of cloud-based accounting and financial software within just a few years: sevDesk.

Learn more
Case Study
Media & Entertainment
Big data platform supports publishing house with data processing

Axel Springer adopts modern, big data technology to manage a substantial amount of operational data.

Learn more
Case Study
Energy & Utilities
Cloud Migration
Energy (over)Flow

"Fish" that generate energy: This is how Energyminer envisions the future of hydropower. PCG was responsible for setting up their data platform in the AWS cloud.

Learn more
See all

Let's work together

United Kingdom
Arrow Down