Big data platform supports publishing house with data processing

About Axel Springer

Axel Springer is a transatlantic publishing house that employs over 18,000 people at 200 brands across the globe. From our historic headquarters in Berlin and our vibrant new corporate offices in New York City, we invest in cutting edge technology and businesses to shape the next chapter of the digital revolution in global media. We are optimistic about the future of media and committed to building resilient, independent journalism businesses in Europe and the United States.

The Challenge

Almost everyone has come across Axel Springer, the largest European publishing group, known for brands such as Bild, Welt, Business Insider, and Politico. With more than a petabyte of data processed daily, Axel Springer is a genuine Big Data case. National Media and Tech (NMT) at Axel Springer SE is responsible for all topics relating to Tech & Product. Within NMT, the Data Section is responsible for all topics relating to data products.

Axel Springer has been migrating to the Big Data platform Foundry (Palantir) for four years. After an initial phase in which the first teams brought their data pipelines onto the platform, the company is now migrating all existing data products and building all new ones directly on the platform.

The One Data Platform team, of which PCG was a part, supports colleagues in the migration process and ensures that a unified, cross-project structure is created and maintained.

One Big Data platform for different users

Foundry is a Big Data platform that describes itself as follows:

“Foundry is a highly available, continuously updated, fully managed SaaS platform that spans from cloud hosting and data integration to flexible analytics, visualization, model-building, operational decision-making, and decision capture.”

The platform thus takes over all infrastructure tasks and lets the data engineers, data scientists, and data analysts who work with it concentrate fully on the tasks directly related to the data. However, this does not mean that Foundry takes over everything. For example, while Foundry controls where data is stored (including backups and retention time), NMT staff must determine how data is stored (what the schema is and whether the data is partitioned).

Foundry offers different applications for aggregating data: some require no code at all, while in others you only need to write the ETL transformations. For maximum flexibility, you can also create repositories and write the whole pipeline and its tests in code. These different applications allow users with varying programming skills to work together on data projects. There are also applications for data health checks and visualization, and even some that let you build your own applications on top of your own data.

An important technology used here is Apache Spark, a framework for distributed computing. It relies on the data being stored and processed in a distributed manner, which allows a high degree of parallelization. Each data pipeline can therefore scale by adding more machines as the amount of data grows (called horizontal scaling).
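The idea behind horizontal scaling can be illustrated without Spark itself. The following is a minimal sketch in plain Python, not Foundry or Spark code: records are split into partitions, each partition is aggregated independently (as separate workers would do), and the partial results are merged. The function names, the sample page-view records, and the partition count are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across a fixed number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def count_views_in_partition(partition):
    """Aggregate one partition on its own; no other partition is needed."""
    return Counter(record["brand"] for record in partition)


def count_views(records, num_partitions=4):
    """Aggregate all partitions in parallel and merge the partial counts."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_views_in_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample data: one record per page view.
brands = ["Bild", "Welt", "Bild", "Politico", "Bild", "Welt"]
page_views = [{"brand": brand} for brand in brands]
view_counts = count_views(page_views, num_partitions=3)
print(dict(view_counts))
```

Because each partition is processed on its own, more partitions and more workers can be added without changing the aggregation logic. In this sketch, threads stand in for the machines of a cluster.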

A common platform

Giving Axel Springer’s data teams access to the platform and asking them to migrate to it and develop their new products there is a start, but on its own it quickly leads to chaos. That’s why we have the One Data Platform (ODP) team, which supports the different teams in this process, collects their knowledge, puts it into a more general context, and distributes it to everyone. The goal is for the teams to produce efficient data products that fit together across the Data Section, whether because data is reused by other teams or because shared standards let teams help each other and spread their best practices.

The Solution

Rather than defining a big migration plan up front and building a rigid set of rules (which risks never being implemented as planned), PCG and the ODP team went into the different teams and helped them migrate or build their data, ETL pipelines, ML models, reporting, and applications on Foundry. With this approach, employees were trained in Foundry, Spark, and ETL design, while the ODP team gained broad experience with the platform.

Individual coaching for a sustainable good code base

In addition to joint code reviews and knowledge-sharing sessions, PCG focused on coaching: PCG joined different teams and completed tasks together with Axel Springer employees. Since Axel Springer has interdisciplinary teams for different data products, the level of knowledge in these teams often varies widely. So it was sometimes a matter of teaching engineering best practices, such as readable and tested code and health checks for data. This was valuable both for the employees and for the ODP team, which has an interest in how projects on the data platform should be structured so that even newcomers can quickly find their way around.
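A health check for data can be as small as a function that inspects a dataset and reports what is wrong with it. The following is a minimal sketch in plain Python and does not use Foundry's own health-check applications; the function name, the column names, and the sample rows are hypothetical.

```python
def run_health_checks(rows, key_column, required_columns, min_rows=1):
    """Return a list of readable failures; an empty list means healthy."""
    failures = []

    # Check 1: the dataset must not be (almost) empty.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, found {len(rows)}")

    # Check 2: required columns must not contain missing values.
    for column in required_columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing:
            failures.append(f"column '{column}' has {missing} missing value(s)")

    # Check 3: the key column must identify each row uniquely.
    keys = [row.get(key_column) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"column '{key_column}' contains duplicate values")

    return failures


# Hypothetical sample data with one missing brand and one duplicate key.
articles = [
    {"article_id": "a1", "brand": "Bild", "published_at": "2023-01-01"},
    {"article_id": "a2", "brand": None, "published_at": "2023-01-02"},
    {"article_id": "a2", "brand": "Welt", "published_at": "2023-01-03"},
]
failures = run_health_checks(articles, "article_id", ["article_id", "brand"])
for failure in failures:
    print(failure)
```

Keeping the check a pure function makes it easy to cover with unit tests, which ties the two practices mentioned above together.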

In other cases, the task was to optimize computationally intensive pipelines. Here, PCG analyzed the pipelines, removed redundant steps, and made sure the data was stored and read as efficiently as possible. Spark provides various storage options for this, such as partitioning and bucketing. In addition, the ODP team and PCG analyzed the pipelines based on usage metrics to make sure that the available resources were used optimally.
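To make the two storage options concrete: partitioning groups rows by the value of a column, so a query that filters on that column can skip all other partitions, while bucketing hashes a key into a fixed number of buckets, so rows with the same key always land in the same bucket. The sketch below mimics both in plain Python. It is not Spark code, CRC32 is only a stand-in for the engine's own hash function, and all names and sample rows are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Map a key to a bucket; CRC32 is a stand-in for the engine's own hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_layout(rows, partition_column, bucket_column, num_buckets):
    """Group rows by (partition value, bucket number), mimicking a storage layout."""
    layout = defaultdict(list)
    for row in rows:
        bucket = bucket_for(row[bucket_column], num_buckets)
        layout[(row[partition_column], bucket)].append(row)
    return dict(layout)


def read_partition(layout, partition_value):
    """Read only the groups of one partition; all other partitions are skipped."""
    return [
        row
        for (partition, _bucket), group in layout.items()
        if partition == partition_value
        for row in group
    ]


# Hypothetical sample data: click events per user and day.
events = [
    {"date": "2023-01-01", "user_id": "u1", "clicks": 3},
    {"date": "2023-01-01", "user_id": "u2", "clicks": 1},
    {"date": "2023-01-02", "user_id": "u1", "clicks": 5},
]
layout = write_layout(events, "date", "user_id", num_buckets=4)
january_first = read_partition(layout, "2023-01-01")
print(len(january_first))
```

In PySpark, the corresponding writer options are `DataFrameWriter.partitionBy` and `DataFrameWriter.bucketBy`. Which columns to choose depends on how the data is filtered and joined downstream.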

Results and Benefits

After a period in which the ODP team, together with PCG, accompanied many teams, implemented data products, and gathered and discussed experiences, common project structures were defined for the entire Data Section, so that the structure no longer has to be worked out anew for each project. Another important pillar of the platform was building a data catalog, since many teams share common data sources and the data produced by one team often serves as input for other projects.

Summary

When PCG started working at Axel Springer, the platform already existed and was being used by some teams. Through the ODP team, to which PCG contributed, Foundry became not just a tool for calculations but a data platform for the whole Data Section of Axel Springer.

About PCG

Public Cloud Group (PCG) supports companies in their digital transformation through the use of public cloud solutions.

With a product portfolio designed to accompany organisations of all sizes on their cloud journey and highly qualified staff that clients and partners like to work with, PCG is positioned as a reliable and trustworthy partner for the hyperscalers, with repeatedly validated competence and credibility.

We have the highest partnership status with the three relevant hyperscalers: Amazon Web Services (AWS), Google, and Microsoft. As experienced providers, we advise our customers independently on cloud implementation, application development, and managed services.

Services Used

Amazon Web Services
