Analytics systems at Scentbird

Introduction

At Scentbird, our analytics currently involve dozens of data sources: internal databases, several Shopify stores, third-party analytics services, marketing services, billing systems, event collection systems, and other external sources. After several transformations, all of this data ends up in the Snowflake analytics database. Snowflake thus serves as the repository that is used on a daily basis to generate reports and make decisions. Transporting data, then cleaning and processing it, building more complex aggregated models, building data marts and delivering them to the business, marketers, data analysts, the finance team and other interested teams has become an everyday task. However, behind this apparent stability lie complex and meticulous processes aimed at selecting tools that solve existing problems without creating new ones.

How it all began

In 2019, analytics team leader Ivan Zerin compiled a report about developing an in-house analytics system from scratch. Let’s look back at the main challenges the team faced:

●      Historical data needs to be stored

●      Reports must work quickly regardless of data volume

●      Business users may add an arbitrary filter at any given time, which makes it impossible to optimize a report in advance

●      No load can be placed on the production databases (which are precisely the main source of data)

●      The data must somehow be transformed and cleaned without creating complex and hard-to-maintain structures, hence the need for an ETL tool

●      A BI tool is needed that is both developer- and user-friendly, one that can cover 80% of user queries without requiring existing reports to be changed or new ones created for every next requirement.

The final solution looked as follows:

Where:

●      AWS Glue/S3 for E-L processes (and partially for T)

●      AWS Redshift — a columnar database for data storage

●      Looker — a BI tool

●      DBT — a tool for the T process

●      Mixpanel, Google Analytics — external analytics systems

Given the specifics of the business, real-time data was not needed (with minor exceptions), so the data was collected and delivered to Redshift once a day.

We thus developed an ETL process incorporating ELT elements. At this point, we should elaborate on each acronym and explain the differences between them. ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two different approaches to handling and storing large volumes of data:

●      In ETL, data is extracted from the source systems, then transformed in an intermediate data warehouse, and only then loaded into the target database or data warehouse. The transformation takes place before the data reaches the target system. This is the classic approach, and it works well for structured data and for situations where the business logic and transformation rules are complex and/or must be applied before the data is loaded. However, this approach may be ineffective when working with very large volumes of data, because it requires a large amount of computing resources at the T stage. In our case, AWS Glue transformed the data in RAM using Spark under the hood, uploaded the intermediate and resulting data to S3, and then loaded it into Redshift (a minimal sketch of such a step follows this list)

●      ELT is a more recent approach whereby the data is first loaded into the target data warehouse and only then transformed. This approach became possible thanks to significantly cheaper disk space (storing slightly modified copies of data has become much cheaper than in the heyday of ETL) and database management systems optimized for large amounts of data (for example, Snowflake or BigQuery).
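
To make this concrete, here is a minimal sketch of the kind of ETL step our Glue jobs performed: extract a table over JDBC, transform it in memory with Spark, and stage the result in S3 for loading into Redshift. The table, bucket, and connection details are hypothetical, not our actual configuration.

```python
# Minimal ETL-style sketch (hypothetical names): extract from a source database,
# transform in RAM with Spark, then stage the result in S3 for a Redshift COPY.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl_sketch").getOrCreate()

# Extract: read the source table over JDBC (URL and credentials are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://replica-host:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Transform: the T step happens in memory, before anything reaches the warehouse.
daily_revenue = (
    orders.where(F.col("status") == "paid")
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("total_cents").alias("revenue_cents"))
)

# Load: write the result to S3; Redshift then ingests it with a COPY statement
# issued over a separate connection.
daily_revenue.write.mode("overwrite").parquet("s3://analytics-staging/daily_revenue/")
```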

ETL -> ELT

It was precisely the appearance of DBT that led to the ELT element being introduced into the classical ETL system. The need for DBT was obvious: we needed a convenient “framework” for managing numerous complex sub-reports that could be structured, inherited from each other, tested and maintained. So, after basic transformation of the data in AWS Glue, further transformation tasks were delegated to DBT.

After a while, the shortcomings of AWS Glue began to show:

  1. The number of transported tables grew significantly, and the jobs (1 job = 1 table) no longer fit into a single batch, so the number of batches had to be artificially increased
  2. We did not need the full T capabilities of Glue, and it lacked built-in tools for debugging and testing ETL scripts
  3. Spark’s long and cumbersome “cold” start meant that sometimes the cluster took longer to start than the data transportation itself.

Redshift also began to reveal its flaws, forcing us to look for a replacement:

  1. Redshift became economically unprofitable as competitors offered the same functionality for a lower price. However, we could not count on getting a discount from Amazon itself; this would have required moving to a more powerful cluster, and our data volumes had not grown enough
  2. Extremely poor capabilities for transforming data on the fly. Redshift is a heavily modified fork of the PostgreSQL DBMS and inherits many of its aspects, such as query syntax and functions. However, some functions (for example, those for working with JSON) were removed completely, so the “unpacking” of JSON structures had to be moved to AWS Glue
  3. RAM limits. Due to its architecture, Redshift is sensitive to RAM consumption, so queries must be carefully optimized. In particular, queries that are too heavy must be executed one after another, since simultaneous execution would lead to Out of Memory errors. Sometimes this had unexpected consequences, such as when an SQL query that had been running stably for over a month decided one fine day that it could no longer be executed because of its abundance of CTEs and the new volume of incoming data. This blocked work until the circumstances were clarified and negatively affected developer productivity
  4. Once in a while, the execution of queries was interrupted for no apparent reason.

After comparing similar solutions, Snowflake became the frontrunner to replace Redshift, in part due to its rich built-in functionality on top of the standard SQL implementation. In particular, the service provides ample opportunities for working with JSON (as well as Avro, Parquet and other formats), which had been a separate problem for us with Redshift. The arrival of Snowflake thus began to radically change our architecture: after abandoning Glue and Redshift, we moved the data transformation processes directly into Snowflake, which significantly reduced the number of intermediate steps and the problems that came with them. In turn, PipelineWise, whose only task was to transfer data to Snowflake without any changes, became the replacement for Glue.
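
To illustrate why this mattered to us, here is a hedged sketch of querying semi-structured data directly in Snowflake from Python; with Redshift, the same JSON had to be flattened in Glue before loading. The account, table, and field names are made up for the example.

```python
# Sketch: unpacking JSON in Snowflake itself (hypothetical account, table and fields).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="EVENTS",
)

# VARIANT columns plus LATERAL FLATTEN let the warehouse do the "unpacking"
# that we previously had to delegate to AWS Glue.
query = """
    SELECT
        e.payload:user_id::string AS user_id,
        item.value:sku::string    AS sku,
        item.value:qty::number    AS qty
    FROM raw_events e,
         LATERAL FLATTEN(input => e.payload:items) item
    WHERE e.payload:event_type::string = 'purchase'
"""

cur = conn.cursor()
try:
    cur.execute(query)
    for user_id, sku, qty in cur.fetchall():
        print(user_id, sku, qty)
finally:
    cur.close()
    conn.close()
```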

PipelineWise is a tool for automating data transportation based on the ELT paradigm. It takes care of routine tasks related to integrating with different databases, data warehouses and SaaS applications. Data sources are connected, by means of special connectors called “taps,” to the target systems into which the data needs to be loaded. Taps are inherited from the Singer project that PipelineWise is based on. PipelineWise provides a number of ready-made taps for popular data sources, and the community offers its own taps for sources not included in the official list.
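
To give a feel for what a tap is, below is a deliberately tiny Singer-style tap: a tap is just a program that writes SCHEMA, RECORD, and STATE messages as JSON lines to stdout, and the target on the other end of the pipe loads them into the warehouse. The stream and fields here are invented; real taps also handle discovery, authentication, and incremental state.

```python
# A toy Singer-style tap (hypothetical "subscriptions" stream).
import json
import sys


def emit(message: dict) -> None:
    # Singer messages are newline-delimited JSON written to stdout.
    sys.stdout.write(json.dumps(message) + "\n")


# Describe the stream once...
emit({
    "type": "SCHEMA",
    "stream": "subscriptions",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "status": {"type": "string"},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    },
})

# ...then emit the records themselves...
for row in [{"id": 1, "status": "active", "updated_at": "2023-01-01T00:00:00Z"}]:
    emit({"type": "RECORD", "stream": "subscriptions", "record": row})

# ...and finally a STATE message so the next run can pick up where this one left off.
emit({"type": "STATE", "value": {"subscriptions": "2023-01-01T00:00:00Z"}})
```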

Argo was chosen to automate and orchestrate ELT processes (PipelineWise + Snowflake + DBT). Argo is an open source project that provides a set of tools for automating and managing tasks in Kubernetes. In particular, Argo Workflows supports complex workflow topologies, including sequential, parallel, and conditional tasks, and provides a set of functions for managing the workflow lifecycle (task restart, stop, and crash recovery).

Business and data are booming!

As time passed, more data sources were added, the number of related components in the Argo pipeline increased, and colleagues from other teams needed to add their own sources for hypothesis testing. A new set of problems arose:

  1. As the pipeline swelled, the time needed for the daily data transfer kept increasing. This led to situations where downstream jobs could not start until problems in the upstream jobs had been fixed. The obvious solution was to fragment the process and reduce the coupling between jobs where possible
  2. Some PipelineWise taps proved unreliable: they did not actually support the incremental data transfer they declared and were no longer supported by the community. In addition, we did not have the resources to improve them ourselves
  3. The data engineering team’s productivity began to suffer because, alongside business tasks, they now had to deal with a growing number of issues related to technical debt and to keeping the taps and the individual links of the ELT process healthy
  4. When errors arose during the daily data transfer, it was difficult to restart or skip individual steps, as Argo did not allow this. Some jobs would lose their logs, and it was impossible to edit flow jobs from a device other than a computer. One time, I received a notification about a failed data transfer while leaving the airport early in the morning, when the rest of the team was not yet online. The fix was obvious and simple, but Argo’s UI/UX issues made it impossible to solve the problem from a cellphone.

After analyzing the current stack and our workload, we realized that, as the amount of data and the number of transportation components grew, two components in our stack had begun to negatively affect the team’s performance and to block other teams, because any new transportation component required the direct participation of data engineers:

-       From PipelineWise, because the taps we were using were not updated by the community and made transportation very difficult

-       From Argo, because problems with UI/UX, along with some technical shortcomings (periodic loss of logs and transportation history, limited ability to manage a running pipeline) did not allow us to quickly fix errors and analyze the status of data transportation.

My colleague Efim Matytsin suggested trying Airbyte as an alternative to PipelineWise for several reasons:

  1. An impressive community offering a huge number of connectors to data sources. Airbyte focuses on the quality of connectors, investing resources in keeping them up to date. Our experience confirms this: the connectors we use are regularly updated, which is automatically reported in a friendly interface
  2. Declarative configurations: Airbyte allows you to set up and manage pipelines declaratively, which saves a lot of time (see the sketch after this list)
  3. User-friendly interface
  4. The tool scales well and has well-thought-out mechanisms for processing large volumes of data.
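
As a small usage sketch (the host, endpoint, and connection id below are assumptions based on Airbyte’s HTTP API, not our actual setup), a sync for an already configured connection can be triggered programmatically, for example from an orchestrator task:

```python
# Hedged sketch: trigger a sync for an existing Airbyte connection over HTTP.
# The URL and connection id are placeholders.
import requests

AIRBYTE_SYNC_URL = "http://airbyte.internal:8000/api/v1/connections/sync"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical

response = requests.post(
    AIRBYTE_SYNC_URL,
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
response.raise_for_status()

job = response.json().get("job", {})
print("Triggered sync, job id:", job.get("id"), "status:", job.get("status"))
```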

The final architecture began to look like this:

At present, Airflow is being actively rolled out to replace Argo. The ability to describe business logic as Python code, combined with flexible scheduling, makes it possible to drop the time-consuming packaging of executable code into Docker images and the writing of bash scripts for those who do not need them in their work. For example, the data analytics team can describe data transformation scenarios directly in DAGs and get results faster, without wasting time setting up an environment or using third-party tools for launching and scheduling tasks. For data engineers, Airflow provides detailed statistics on running pipelines, making it possible to compare historical runtime trends. However, we can talk about all this in more detail later, once we have fully studied the new tool and adapted it to our specific needs.
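
As a taste of what this looks like, here is a minimal DAG sketch; the schedule, task names, and dbt project path are illustrative rather than our production setup.

```python
# Minimal Airflow DAG sketch (illustrative names): one task triggers the
# extract/load step, the next runs the DBT transformations.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_analytics_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 3 * * *",  # once a day, matching our batch cadence
    catchup=False,
) as dag:
    # Placeholder for the extract/load step (e.g. kicking off an Airbyte sync).
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="echo 'trigger the Airbyte sync here'",
    )

    # Transformations stay in DBT; analysts only describe when and after what they run.
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/analytics",
    )

    extract_load >> transform
```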

Almost a conclusion

As a final point, I would like to note that the path of data analytics is very dynamic: in just the last four years of active growth, the architecture of data transportation and transformation at our company has undergone profound changes. However, this did not bring us to any specific final stage, because one improvement usually leads to another. The most important thing is to strike a balance between system complexity and usability.

Written by Mark Korzhov, Senior Software Engineer at Scentbird.