From Transactional Bottlenecks to Lightning-Fast Analytics
Co-authored with Anudeep and Rohan
At Uptycs, our data platform architecture has evolved over the years through the natural progression of pretty much every data platform. Initially, our architecture revolved around Online Transaction Processing (OLTP) databases (in our case, mostly PostgreSQL) for managing the following categories of data:
- Customer-specific configurations
- Feature metadata
- Access control policies
- Other microservices-related data
- Microservice-generated data
This setup worked well when data access was simple, but as our platform expanded, so did the complexity of managing and utilizing this data, especially the last category, microservice-generated data. Engineers often had to query individual databases manually, write custom scripts to correlate data across multiple stores, and deal with the lack of a global view of our data.
Over time, as more teams built their own services and introduced new databases tailored to their specific workloads, our data became increasingly scattered and fragmented. Some of these decisions were intentional, driven by microservices best practices; others were a natural consequence of rapid iteration and team autonomy; and some were simply the outcome of laziness and comfort with the familiar.
In parallel, we were building our petabyte-scale data lake, which served as the foundation for storing large-scale cybersecurity telemetry and analytical data. However, as our data ecosystem grew, we faced an increasing need for cross-database analytical workflows that required joining OLTP data with our data lake. This fragmentation created a bottleneck, making it clear that we needed a way to query data seamlessly across multiple sources without requiring extensive data movement.
This led us to explore federated query engines, which allow querying across different data sources (e.g., HDFS/S3, PostgreSQL, MongoDB) without having to migrate everything into a single system.
Among the available options such as Trino, Apache Drill, and Dremio, we ultimately chose Trino (formerly PrestoSQL) as our primary federated query engine due to its superior performance, scalability, ANSI SQL compliance, and extensive connector ecosystem.
Trino’s distributed SQL execution enabled us to run queries across multiple OLTP databases and our data lake efficiently, without the need for costly ETL pipelines.
Figure: Our current data-platform architecture with the Trino federated query engine
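To make the federated pattern concrete, here is a minimal sketch of the kind of cross-source query Trino enables, using the trino Python client. The catalog, schema, table, and host names are hypothetical placeholders rather than our actual deployment values.

```python
# A hedged sketch of a federated query in Trino, via the `trino` Python client.
# Catalog, schema, table, and host names are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.internal",  # hypothetical coordinator host
    port=8080,
    user="analytics",
)
cur = conn.cursor()

# Join OLTP configuration data (PostgreSQL catalog) with telemetry in the
# data lake (Hive catalog) in a single SQL statement, with no ETL step.
cur.execute("""
    SELECT c.customer_name, count(*) AS event_count
    FROM postgresql.public.customers AS c
    JOIN hive.telemetry.process_events AS e
      ON c.customer_id = e.customer_id
    WHERE e.event_date >= DATE '2024-01-01'
    GROUP BY c.customer_name
    ORDER BY event_count DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
```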
Moving forward, federated querying became a cornerstone of our data platform, enabling seamless data access across different stores while maintaining high performance and scalability. However, as our data volumes skyrocketed both in our OLTP databases and the data lake, we hit performance bottlenecks, especially when running complex JOIN operations between OLTP databases and the data lake. The overhead of querying across multiple data stores in real time became unsustainable, prompting us to rethink our approach.
Challenges with the Existing Approach
As our data platform scaled, we faced increasing challenges in handling large-scale analytical queries while ensuring that transactional workloads remained performant. The core issues stemmed from the fundamental differences in how OLTP databases like PostgreSQL and analytical query engines like Trino operate.
1. Read and Write Contention in PostgreSQL
PostgreSQL served as the primary transactional store, handling both real-time reads and write-heavy workloads. As analytical queries on PostgreSQL increased, contention between OLTP and OLAP workloads became a severe bottleneck.
To mitigate this, we initially introduced read replicas for PostgreSQL and redirected Trino queries to them. However, this approach introduced new challenges:
- Replication Lag: Heavy write workloads on the primary database caused frequent replication delays, leading to stale or inconsistent query results from read replicas.
- Impact of Long-Running Queries: Trino’s analytical queries, which processed millions of records, made replication lag worse. Long-running queries on the read replica frequently conflicted with the replication process, often triggering the infamous “cancelling statement due to conflict with recovery” error. Increasing PostgreSQL’s standby delay settings (such as max_standby_streaming_delay) can temporarily reduce query cancellations, but only at the cost of significant replication lag, delaying access to fresh data.
- Growing Write Latency: As the volume of incoming data surged, PostgreSQL write operations became slower. Managing indexes, vacuuming, and maintaining overall database health became increasingly complex from an operational standpoint.
2. Performance Bottlenecks in Cross-System Queries
A major pain point was running federated queries that joined structured OLTP data in PostgreSQL with massive ORC/Parquet-based tables in Hive. Dashboards querying billions of records suffered significant slowdowns, primarily due to the limitations of the PostgreSQL connector in Trino.
- Single-Node Execution Bottleneck: Unlike Hive, which leverages distributed query execution and parallelized file-based splits, PostgreSQL’s Trino connector processes data sequentially over JDBC. This led to high network overhead, limited query concurrency (due to PostgreSQL’s connection constraints), and increased memory pressure on Trino.
- Lack of Pushdown Optimization: While Trino can push down filters and projections to PostgreSQL, it cannot push down JOINs between PostgreSQL and Hive. This forces large datasets to be fetched into Trino for in-memory processing, leading to increased execution time and resource consumption.
- Lack of Parallel Split Generation: Unlike Trino’s Hive connector, which efficiently splits large datasets across multiple worker nodes, enabling parallel reads from ORC/Parquet files, the Postgres connector retrieves data sequentially using JDBC connections, constrained by the database’s single-node execution model.
- Latency in Multi-Source Queries: Many dashboards query multiple PostgreSQL databases alongside data in HDFS/S3. Orchestrating queries across these systems, each with vastly different latency and throughput, further degraded performance.
3. The Challenge of Maintaining a Single Source of Truth
Considering these challenges, one possible approach was to point all microservices directly to the Data Lake, eliminating the need for federated queries.
However, this was far from a simple solution. It would have required a massive migration effort, rewriting countless services and queries, adapting to new query paradigms, and ensuring compatibility with existing workflows, all while maintaining operational stability.
Another option was to migrate analytical data from OLTP databases to the Data Lake via an ETL pipeline at regular intervals. While this approach worked for some batch workloads, it introduced data freshness issues: delays between updates in PostgreSQL and their availability in the Data Lake.
Many of our microservices relied on real-time transactional data via REST APIs, and introducing periodic syncs meant that insights could be outdated by minutes or even hours, leading to inconsistencies between operational and analytical workloads.
As our need for real-time insights grew, both solutions proved inadequate, pushing us to explore a more seamless and efficient alternative.
The Need for a Scalable Solution
Given these challenges, we needed a hybrid solution, one that would:
- Offload frequently queried data from PostgreSQL to a scalable storage layer to reduce database contention.
- Ensure data consistency between transactional and analytical systems, avoiding replication lag and stale reads.
- Leverage a Data Lake for large-scale analytical queries without disrupting real-time operational workflows.
And that’s why we moved towards a unified, scalable data pipeline, leveraging Change Data Capture (CDC) with Debezium and Apache Hudi.
Why Debezium and What is Change Data Capture (CDC)?
For starters, CDC is a technique that tracks changes in a source database (inserts, updates, deletes) by reading its transaction log (in PostgreSQL’s case, through logical replication slots) and streams them to downstream systems in real time. It enables efficient data replication without full table scans or periodic batch jobs.
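For illustration, here is a hedged sketch of what log-based CDC looks like at the PostgreSQL level. Debezium consumes changes via the pgoutput logical decoding plugin; the built-in test_decoding plugin is used below only because its output is human-readable, and the connection details and slot name are hypothetical.

```python
# A hedged illustration of log-based CDC at the PostgreSQL level.
# Debezium normally consumes changes via the pgoutput plugin; test_decoding is
# used here only for readable output. Connection details and slot name are
# hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=cdc_user host=pg.example.internal")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot (one-time setup).
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding')"
)

# ... after some INSERT/UPDATE/DELETE activity on the database ...

# Peek at the change stream without consuming it; each row describes one
# row-level change decoded from the WAL.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL)"
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)  # e.g. "table public.customers: UPDATE: id[integer]:42 ..."
```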
Why Debezium?
Debezium is an open-source CDC tool that integrates seamlessly with PostgreSQL, Kafka, and other event streaming platforms. Key advantages include:
- Log-based CDC with minimal performance overhead
- Schema evolution support
- Fault tolerance with at-least-once delivery (exactly-once delivery can also be achieved with Kafka Connect)
- A pre-built PostgreSQL connector capable of streaming change events at scale (a hedged registration sketch follows this list)
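As a concrete, hedged example of how such a connector is typically registered, the sketch below posts a PostgreSQL connector configuration to the Kafka Connect REST API. Host names, credentials, and table names are placeholders, and exact property names can differ slightly across Debezium versions (for example, topic.prefix in 2.x versus database.server.name in 1.x).

```python
# A minimal, hedged sketch of registering a Debezium PostgreSQL connector
# through the Kafka Connect REST API. Hosts, credentials, and table names are
# placeholders; property names can vary slightly between Debezium versions.
import json
import requests

connector = {
    "name": "customers-pg-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin
        "slot.name": "customers_slot",          # replication slot used by Debezium
        "database.hostname": "pg.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "appdb",
        "topic.prefix": "appdb",                # prefix for per-table Kafka topics
        "table.include.list": "public.customers,public.policies",
    },
}

resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```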
CDC Ingestion Pipeline architecture overview
The high-level architecture of our CDC-based ingestion pipeline is as follows:
Section 1:
- PostgreSQL — The source database where transactional data resides. (multiple database instances)
- Debezium — Captures changes from PostgreSQL WAL through replication slots and streams them to Kafka.
Section 2:
- Kafka — Acts as a buffer for event streaming and fault tolerance.
- DebeziumConsumer — A decorator service that reads from multiple table-specific Kafka topics, transforms each event into our generic Hudi schema, applies further enrichment to the events coming through Debezium, and publishes the decorated events to a single Kafka topic, from which the Spark job picks them up for ingestion. It also handles several other use cases, such as asynchronous cache sync, which is used by other microservices.
Section 3:
- Spark Streaming App — Handles upserts and incremental updates in the data lake through Apache Hudi (a simplified sketch follows this list).
- Data Lake (S3/HDFS) — Stores raw and transformed data in Parquet / ORC files.
- Query Layer — Trino, which enables analytics and querying over the ingested data.
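The sketch below is a simplified version of the Spark Structured Streaming stage, assuming the DebeziumConsumer has already flattened events into a generic JSON schema. The topic name, field names, and storage paths are hypothetical; the Hudi options shown are the standard DataSource configuration keys, not necessarily our production settings.

```python
# A simplified sketch of the Spark Streaming stage: read decorated CDC events
# from a single Kafka topic and upsert them into a Hudi MOR table.
# Topic, schema fields, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

event_schema = StructType([
    StructField("record_key", StringType()),
    StructField("table_name", StringType()),
    StructField("op", StringType()),            # c / u / d from Debezium
    StructField("ts_ms", LongType()),           # used as the precombine field
    StructField("payload", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
    .option("subscribe", "decorated-cdc-events")   # single topic from DebeziumConsumer
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream.format("hudi")
    .option("hoodie.table.name", "cdc_events")
    .option("hoodie.datasource.write.recordkey.field", "record_key")
    .option("hoodie.datasource.write.precombine.field", "ts_ms")
    .option("hoodie.datasource.write.partitionpath.field", "table_name")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("checkpointLocation", "s3://datalake/checkpoints/cdc_events")
    .outputMode("append")
    .start("s3://datalake/hudi/cdc_events")
    .awaitTermination()
)
```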
By implementing the CDC-to-Hudi ingestion pipeline, we have consolidated analytical data into a single, optimized storage layer, enabling all read workloads to be processed efficiently from HDFS, S3, or any other object store. Leveraging structured partitioning and Bloom filters, we achieved a 10x improvement in dashboard query performance on Trino.
By offloading data to a Hudi-based data lake with minimal (observable and actionable) CDC latency, we overcame the challenges of running analytical workloads on disparate databases. The Hive and Hudi connectors in Trino provide scalable split generation, allowing queries to efficiently process billions of records in parallel, utilizing columnar storage, predicate pushdown, and metadata pruning. This shift not only eliminated PostgreSQL’s scalability constraints but also significantly reduced query execution times, making real-time analytics feasible at scale.
In the above architecture diagram, the Section 1 components can simply be swapped for other databases and their respective CDC tools: the PostgreSQL + Debezium combination can be replaced with MySQL + Debezium, or with MongoDB and its change streams, and so on.
Building a Framework Around Debezium
Instead of setting up Debezium for each database / table manually, we built a configuration-driven framework that allows teams to onboard new databases / tables easily.
Key components of our framework include:
- Connector Management — Centralized configuration for Debezium connectors.
- Schema Evolution Handling — Automatic schema propagation to the data lake.
- Monitoring & Alerting — Tracks failed events, automatic restarts, snapshots, lost delete events, and more.
- Debezium signals — Used to trigger incremental snapshots.
- Heartbeat query — Needed for connectors pointed at a PostgreSQL instance hosting multiple databases; a database with a low frequency of events may otherwise never produce changes for Debezium to acknowledge, leaving its replication slot behind. The heartbeat query gives Debezium a periodic change to acknowledge so it can move the replication slot ahead. (Hedged sketches of the signal and heartbeat mechanisms follow this list.)
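Below are hedged sketches of the last two mechanisms. The table names, connector names, and values are hypothetical; the signal-table columns (id, type, data) and the configuration keys follow Debezium's documented conventions.

```python
# Hedged sketches of Debezium signaling and heartbeat configuration.
# Table names, slot names, and values are hypothetical.
import json
import uuid

import psycopg2

# 1) Trigger an incremental snapshot for one table by inserting a row into the
#    signal table that the connector watches (configured via
#    `signal.data.collection`).
conn = psycopg2.connect("dbname=appdb user=cdc_user host=pg.example.internal")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO public.debezium_signal (id, type, data) VALUES (%s, %s, %s)",
        (
            str(uuid.uuid4()),
            "execute-snapshot",
            json.dumps({"data-collections": ["public.customers"], "type": "incremental"}),
        ),
    )

# 2) Heartbeat settings added to the connector config so that replication slots
#    keep advancing even for low-traffic databases on a shared instance.
heartbeat_config = {
    "heartbeat.interval.ms": "60000",
    "heartbeat.action.query": "INSERT INTO public.debezium_heartbeat (ts) VALUES (now())",
}
```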
Why Apache Hudi for the Data Lake?
While Delta Lake and Apache Iceberg are widely adopted for data lakes, Apache Hudi stands out for its ability to efficiently handle upserts, incremental processing, and real-time analytics at scale.
Hudi is purpose-built for stateful data lakes, optimized for record-level indexing and efficient write performance. Unlike Delta Lake and Iceberg, which rely on full-file rewrites for updates, Hudi employs Bloom filters, global indexing, and bucketized storage layouts to minimize write amplification, drastically reducing latency and storage overhead.
Its Merge-on-Read (MOR) mode enables low-latency updates by writing only delta logs and asynchronously compacting them, ensuring high write throughput without sacrificing query performance. This makes Hudi particularly well-suited for workloads that require frequent updates to large datasets while maintaining fast analytics.
Another key differentiator is incremental querying. Hudi allows downstream consumers to fetch only changed records, avoiding expensive full-table scans, a critical advantage for real-time analytics, CDC, and streaming ingestion. By leveraging structured partitioning, metadata pruning, and scalable split generation in Trino, Hudi delivers a cost-effective, high-performance solution for managing stateful data lakes at scale.
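As a brief illustration of incremental querying, the sketch below reads only the records that changed after a given commit instant, using Hudi's standard Spark DataSource options; the path and instant time are placeholders.

```python
# A hedged sketch of Hudi's incremental query mode: read only records changed
# after a given commit instant instead of scanning the whole table.
# The path and instant time are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

changed_rows = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Commit instant to start from, taken from the Hudi timeline.
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://datalake/hudi/cdc_events")
)

changed_rows.show()
```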
Challenges and lessons learned
As we built and deployed the CDC pipeline, we faced a few challenges that we solved along the way:
- Handling migration of Kafka cluster / PG switchover — Since we are still on PostgreSQL v15, the logical replication slots used by Debezium are not propagated to the standby server during a switchover. We therefore added a mechanism to take snapshots after a PG switchover, along with a workflow that keeps the state of the data lake consistent by removing stale records.
- Splitting large payloads — In a few cases, events generated from a PostgreSQL table contained huge JSON blobs, which initially hindered our ingestion pipeline. To handle such events, we split them into smaller record chunks for efficient ingestion; this can also be achieved at the producer level (i.e., in Debezium).
- Skipping certain columns / skipping duplicate events — Although Debezium has built-in capabilities to exclude a particular set of columns from change events and to skip duplicate events (UPDATE events with no change in any column’s value), we incorporated custom logic on top of it based on our use cases.
- Reducing the overall latency in the pipeline — As we moved most of our customer-centric dashboards to tables backed by the Debezium CDC to Hudi ingestion pipeline, keeping the data lake in the same state as PostgreSQL with minimal latency became crucial; as of now, we track more than 100 tables through a single Spark application.
- Query Performance Optimizations
- We enabled partition pruning in Hudi to speed up queries.
- Tuned compaction and clustering strategies for better read performance (an illustrative configuration sketch follows below).
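For reference, the snippet below shows the kind of Hudi writer options involved in this tuning. The values are illustrative placeholders rather than our production settings; only the keys are standard Hudi configuration properties.

```python
# Illustrative (not our actual production) Hudi writer options of the kind we
# tuned for compaction and clustering. Values are placeholders.
hudi_tuning_options = {
    # Compact the MOR delta logs after every N delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Periodically cluster small files into larger, better-organized ones.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
}
```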
To wrap up:
Let’s sum up the benefits we ended up gaining from this pipeline: some we originally set out to achieve, others were intended side benefits, and yet others we merely hoped for and achieved nonetheless.
- Offload frequently queried data from PostgreSQL: Reduced read load on the PostgreSQL databases by roughly 10,000 QPS, thereby improving PG reliability.
- Leverage a Data Lake for large-scale analytical queries: Improved analytics query performance by 10x, which improved our dashboard performance dramatically.
- Ensure data consistency between transactional and analytical systems: Sub-minute data-sync latency between the primary databases and the OLAP store.
- Asynchronous cache sync: Most of our microservices leverage Redis as a read-through cache to retrieve data from our primary databases. To prevent multiple microservices from overwhelming the database during cache invalidation, we adopted asynchronous cache synchronization using Debezium-based CDC events. This not only eliminated cache staleness issues, but also ensured efficient update propagation, significantly reducing direct database hits and enhancing overall system performance. (A simplified sketch follows this list.)
- This entire exercise came with another pleasant, much-needed, and secretly intended outcome. There had been speculation about whether Hudi could support the low-latency, real-time analytical query workloads suitable for dashboards. Since data was available for analytical workloads almost instantly via CDC + Hudi, some APIs that had started to fail against the RDBMS were migrated to Hudi with no material impact on performance or latency. This gave feature teams the confidence to start moving away from the RDBMS and writing directly to Hudi.
- Road to cost optimization: Our OLTP databases (PostgreSQL with replicas, and MongoDB) were becoming expensive, both financially and operationally. They struggled under heavy load, prompting a reevaluation of our schema design, an effort that would further escalate development and QA expenses. With the CDC to Hudi ingestion pipeline, we have given the existing databases a bit of longevity by offloading work to the data lake and, as discussed in the previous point, created a pattern to scale in the future.
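To illustrate the asynchronous cache-sync idea mentioned above, here is a hedged sketch in which a consumer applies CDC events to Redis so that microservices see fresh cache entries without stampeding the primary database. The topic, event fields, key layout, and hosts are hypothetical, and this is not our production implementation.

```python
# A hedged sketch of asynchronous cache sync: apply CDC events to Redis so the
# read-through cache stays fresh without extra load on the primary database.
# Topic, event fields, key layout, and hosts are hypothetical.
import json

import redis
from kafka import KafkaConsumer

cache = redis.Redis(host="redis.example.internal", port=6379)
consumer = KafkaConsumer(
    "decorated-cdc-events",
    bootstrap_servers="kafka.example.internal:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="cache-sync",
)

for message in consumer:
    event = message.value
    cache_key = f"{event['table_name']}:{event['record_key']}"
    if event["op"] == "d":                 # delete -> evict from cache
        cache.delete(cache_key)
    else:                                  # create/update -> refresh cache entry
        cache.set(cache_key, event["payload"], ex=3600)
```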
How we have scaled Hudi usage at Uptycs is yet another blog in the making… Stay tuned!