The Challenge
The success of a marketing program depends on continuous analysis of performance metrics to make informed, data-driven decisions about partnerships and campaigns. When you are one of the largest global players in Digital Marketing, you need performant and scalable systems for analyzing the vast amount of data generated every day. PCG was mandated to assess one of the analytics systems based on Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service), identify areas for cost optimization, and advise on best practices for operation and maintenance.
Our client offers a broad range of tools to monitor performance indicators of marketing campaigns, such as clicks, impressions and conversion rates. The project targeted the system responsible for tracking, analyzing, and building customer-facing reports of click and impression data. The data is streamed into the system through Apache Kafka, enriched by Spring Boot microservices, indexed in an Elasticsearch cluster running on Amazon OpenSearch Service, and regularly aggregated by Spark jobs. Using Elasticsearch as the analytics engine led to good performance and stability, especially during times with high load, such as Black Friday. Furthermore, choosing the managed Amazon OpenSearch Service greatly simplified cluster maintenance in the early stages, despite the team having limited experience with it.
The focus of this project was to build on top of this foundation and take the system to the next level, addressing the following topics:
- Review the current cluster configuration and identify possible inefficiencies and opportunities to save on costs.
- Improve maintenance automation, especially for creating staging environments for testing and QA, instead of using the AWS console for all the operations.
- Assess the indexing and query setup to reduce the impact of the aggregation jobs on query latency, which occasionally degraded the user experience.
- Advise on how to run benchmarking experiments in Elasticsearch in a timely and reproducible manner, to evaluate alternative queries and settings.
The Solution
We reviewed the Amazon OpenSearch Service settings, monitored the cluster statistics and usage over time, and validated capacity planning. The outcome was a list of cost-saving improvements worth up to 10,000 euros a month.
Cluster Configuration
Finding the Elasticsearch instance types overprovisioned in CPU and memory, we proposed cluster configurations better matched to actual needs, along with guidelines for evaluating them iteratively without affecting the production system. Changes to the subscription model and the storage setup (EBS downsizing and use of UltraWarm) could achieve further savings. Moreover, the recommended enhancements to cluster security and resilience could reduce the hidden costs of operational interruptions.
Maintenance
Infrastructure as Code is a key element of DevOps and a PCG best practice. We provided our client’s engineers with a Terraform template to provision, update, and tear down an Amazon OpenSearch cluster in a given environment. The template included a Lambda function to bootstrap data from snapshots stored in Amazon S3, automating the creation of ephemeral clusters for performance tests and QA.
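As an illustration of the bootstrap step, the sketch below builds the snapshot restore call that such a Lambda function might issue. It is not the function delivered in the project: the event keys, repository name, and index pattern are hypothetical, and request signing and sending are left out so the example stays self-contained.

```python
import json


def build_restore_request(endpoint, repository, snapshot, indices):
    """Return (method, url, body) for a snapshot restore call."""
    url = f"{endpoint.rstrip('/')}/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Restore only the named indices, not the cluster-wide state.
        "indices": indices,
        "include_global_state": False,
    }
    return "POST", url, json.dumps(body)


def handler(event, context):
    """Lambda-style entry point; the event keys are assumptions."""
    method, url, body = build_restore_request(
        event["endpoint"],
        event["repository"],
        event["snapshot"],
        event["indices"],
    )
    # A real function would sign and send the request here.
    # This sketch only returns what it would send.
    return {"method": method, "url": url, "body": body}
```

Passing an explicit index pattern rather than a wildcard keeps the restore from colliding with system indices that already exist on a freshly provisioned domain.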
We also advised on improvements to data retention management, suggesting that the home-brew script in place be retired in favor of Index State Management, a built-in Amazon OpenSearch feature that enables more sophisticated rollover and deletion policies and leaves one less script to maintain.
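To give a sense of what an Index State Management policy looks like, here is a minimal sketch that assembles one as a Python dictionary. The index pattern, the ages, and the function name are illustrative assumptions, not the client's actual retention rules.

```python
import json


def build_retention_policy(index_pattern, rollover_age, delete_age):
    """Build an ISM policy body: roll over young indices, delete old ones."""
    return {
        "policy": {
            "description": "Roll over, then delete after the retention period",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    # Roll over to a new index once this one is old enough.
                    "actions": [{"rollover": {"min_index_age": rollover_age}}],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": delete_age},
                        }
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to matching new indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


if __name__ == "__main__":
    # Hypothetical values: daily rollover, 90-day retention.
    print(json.dumps(build_retention_policy("clicks-*", "1d", "90d"), indent=2))
```

The resulting body would be sent with a PUT to the policy endpoint (`_plugins/_ism/policies/<policy_id>` on recent versions). The rollover action also expects a rollover alias to be configured on the managed indices.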
Indexing and Query
We tackled the query latency problem by reviewing the index setup first. We discussed a different sharding strategy with the team for their indexing-heavy use case, recommending bigger and more evenly distributed shards to handle the load caused by the Spark jobs. Another series of small but impactful changes to the index mapping and settings could lead to further improvements. Then, we evaluated alternative options to reduce the impact of the jobs on cluster performance, ranging from identifying better time windows for their execution to rethinking the aggregation process using Elasticsearch features such as rollups or transforms.
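The reasoning behind "bigger and more evenly distributed shards" can be sketched as a small sizing helper. It picks a primary shard count from a target shard size, then rounds up to a multiple of the data node count so that every node holds the same number of primaries. The function name and the figures in the usage note are assumptions for illustration, not the values recommended to the client.

```python
import math


def primary_shard_count(index_size_gib, target_shard_gib, data_nodes):
    """Suggest a primary shard count that spreads evenly across data nodes."""
    if index_size_gib <= 0 or target_shard_gib <= 0 or data_nodes <= 0:
        raise ValueError("all inputs must be positive")
    # Fewest shards that keep each shard at or below the target size.
    by_size = math.ceil(index_size_gib / target_shard_gib)
    # Round up to a multiple of the node count for an even distribution.
    return math.ceil(by_size / data_nodes) * data_nodes


if __name__ == "__main__":
    # Hypothetical index: 400 GiB, 40 GiB target shards, 6 data nodes.
    print(primary_shard_count(400, 40, 6))  # prints 12
```

Rounding up trades slightly smaller shards (about 33 GiB each in this example) for a layout in which no node carries more primaries than another.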
Benchmarking
We created a runbook to semi-automate benchmarking experiments in Amazon OpenSearch and increase trust in the results. Based on our experience and best practices, we tailored the runbook to the click and impression analytics system and reviewed it with the team to underline the repeatability and reproducibility of the methodology.
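One building block of such a runbook is a timing loop with a warm-up phase and percentile reporting. The sketch below is a generic illustration, not the runbook itself. `run_query` is a hypothetical callable that executes one query against the cluster, and the iteration counts are arbitrary defaults.

```python
import math
import statistics
import time


def benchmark(run_query, warmup=5, iterations=50):
    """Time repeated calls to run_query and summarize latency in milliseconds."""
    # Warm-up runs fill caches and are excluded from the measurements.
    for _ in range(warmup):
        run_query()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(latencies_ms)) - 1)
    return {
        "iterations": len(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would call the cluster instead.
    print(benchmark(lambda: sum(range(10_000))))
```

Running the same loop with identical warm-up and iteration counts for each candidate query or setting makes the resulting numbers comparable across experiments.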
Results and Benefits
Without question, our client made the right choice in putting Elasticsearch at the core of their analytics system for Digital Marketing data. Using Amazon OpenSearch Service did speed up time-to-market, but left room for optimization. As certified AWS and Elasticsearch partners, PCG worked together with the client’s engineers to increase both the operational efficiency and the cost efficiency of the system. The result was an actionable list of configuration and process improvements expected to nearly halve AWS costs, along with tools to enhance automation and confidence in maintenance operations.
About PCG
Public Cloud Group (PCG) supports companies in their digital transformation through the use of public cloud solutions.
With a product portfolio designed to accompany organisations of all sizes on their cloud journey, and highly qualified staff that clients and partners like to work with, PCG is positioned as a reliable and trustworthy partner for the hyperscalers, with repeatedly validated competence and credibility.
We have the highest partnership status with the three relevant hyperscalers: Amazon Web Services (AWS), Google, and Microsoft. As experienced providers, we advise our customers independently with cloud implementation, application development, and managed services.