Key Learnings
In Graphite’s journey to optimize data replication from Postgres to Redshift on AWS, we've uncovered three fundamental patterns that provide effective solutions:
Using the AWS Database Migration Service (DMS) for continuous replication.
Implementing federated queries so that Redshift can query Postgres directly.
Leveraging the snapshotting capabilities of AWS RDS Aurora to periodically export data to S3, where Redshift can then query it.
Each method has its advantages and specific use cases, offering a range of choices to engineers based on their individual requirements.
Context: The Starting Point
Our initial ETL patterns were designed around an AWS infrastructure, which included Node servers, ECS, and a single Postgres RDS Aurora DB cluster. We opted for AWS Redshift as our data warehouse due to its widespread use and the familiar SQL interface it provides. The data replication process involved the use of AWS's DMS for continuous replication, providing an always-on, up-to-date data feed to our Redshift data warehouse.
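For a sense of what this looked like operationally, a continuous-replication task of this kind can be defined with the AWS CLI roughly as follows. This is a sketch, not our actual configuration: the identifiers and ARNs are placeholders, and a real task requires pre-existing source/target endpoints and a replication instance.

```shell
# Placeholder ARNs; assumes endpoints and a replication instance already exist.
# "full-load-and-cdc" does an initial copy, then streams ongoing changes.
aws dms create-replication-task \
  --replication-task-identifier pg-to-redshift-cdc \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TARGET \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:INSTANCE \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json
```

The table-mappings file selects which schemas and tables to replicate; the task itself is the piece that, in our experience, needed babysitting.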
Our preferences for ETL systems are straightforward:
We want to maximize simplicity and minimize maintenance overhead.
We can tolerate query latency, and data staleness of up to 24 hours is acceptable.
We’d like to keep the costs of the system low.
Motivation for Change: Upgrading to Postgres 15
While our initial setup was functioning adequately, the upgrade to Postgres 15 presented unforeseen challenges. Notably, the continuous replication process with DMS was not compatible with Postgres 15. Furthermore, we had historically experienced a degree of instability with DMS, leading to concerns about its reliability in the long run. DB upgrades, certain schema changes, and failover would all break the DMS job, requiring manual intervention to get it up and running again. These factors motivated us to rethink our approach to data replication and seek out new methods to ensure a robust and dependable ETL process.
Exploration: Implementing Federated Queries
Our exploration led us to consider federated queries, a powerful AWS feature that lets Redshift query data in external sources, like our Postgres cluster, directly from the warehouse without copying it first. The setup was straightforward, and it provided a more elegant solution than DMS, sidestepping much of its operational complexity.
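As a sketch of the setup, the Redshift side amounts to a single external schema pointing at the Postgres endpoint. Every name, host, and ARN below is a placeholder; a real setup also needs an IAM role Redshift can assume and a Secrets Manager secret holding the Postgres credentials.

```sql
-- Placeholders throughout. Run in Redshift.
CREATE EXTERNAL SCHEMA postgres_fed
FROM POSTGRES
DATABASE 'app' SCHEMA 'public'
URI 'my-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';

-- Tables in the live Postgres schema are then queryable in place:
SELECT count(*) FROM postgres_fed.users;
```

Pointing the URI at a reader endpoint, as above, keeps these queries off the writer.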
However, the federated query approach presented its own challenge: Redshift folds identifiers to lowercase by default, so it could not reference the camel-cased column names in our Postgres schema. This issue necessitated the creation of alias views, adding a layer of complexity we had hoped to avoid.
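The workaround looks roughly like this: for each affected table, a view on the Postgres side exposes lowercase aliases that Redshift can resolve. The table and column names here are hypothetical, not our actual schema.

```sql
-- Run in Postgres, not Redshift. The quoted identifiers preserve the
-- original camel case; the aliases are plain lowercase.
CREATE VIEW public.users_for_redshift AS
SELECT id,
       "createdAt"   AS created_at,
       "displayName" AS display_name
FROM public.users;
```

Federated queries then target the view instead of the underlying table, which works but means every schema change has to be mirrored in the view.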
Future Exploration: The Promise of S3 Snapshotting
Our ongoing quest for the best ETL process has prompted us to consider an unexplored approach: S3 snapshotting. This method involves using AWS RDS Aurora's ability to snapshot data and store it in S3, which is a cost-effective and reliable storage solution. Once the data is in S3, AWS can run federated queries on it, potentially offering a more stable and efficient ETL process.
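Mechanically, a snapshot export is a single API call. The sketch below uses the AWS CLI with placeholder ARNs; it assumes a cluster snapshot already exists, plus an IAM role and KMS key for the export.

```shell
# Placeholder ARNs. Exports an existing Aurora cluster snapshot to S3;
# the exported data lands as Apache Parquet.
aws rds start-export-task \
  --export-task-identifier nightly-export \
  --source-arn arn:aws:rds:us-east-1:123456789012:cluster-snapshot:nightly \
  --s3-bucket-name my-analytics-exports \
  --s3-prefix aurora/ \
  --iam-role-arn arn:aws:iam::123456789012:role/SnapshotExportRole \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID
```

Scheduled daily, this fits our tolerance for data up to 24 hours stale.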
The S3 snapshotting approach is an intriguing solution. While our data set is relatively small and we run our queries asynchronously, making speed less of a concern, we are still striving for an efficient, cost-effective process. The potential to use AWS RDS Aurora's snapshot feature to save data to S3 in a columnar format (snapshot exports land as Apache Parquet) could significantly optimize our workflow. The data could sit affordably in S3, and federated queries could be run on it periodically, as needed, to perform transformations.
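Querying the exported Parquet from Redshift would look something like the following (strictly speaking this is Redshift Spectrum over an external schema rather than a federated query, but the effect is the same). All names and ARNs are placeholders, and it assumes the exported files are registered in the AWS Glue Data Catalog, e.g. via a crawler.

```sql
-- Placeholders throughout. Run in Redshift.
CREATE EXTERNAL SCHEMA aurora_export
FROM DATA CATALOG
DATABASE 'rds_exports'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Transformations then run against S3 without touching production:
SELECT count(*) FROM aurora_export.users;
```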
Another benefit is the possible separation of concerns. While it's important to note that federated queries in our current setup hit read replicas, thus reducing the load, having the ability to separate our production database from data science queries completely brings an additional layer of safety. The idea of a data science-style query potentially causing a high load or a lockup on our production database is not appealing. Moving to an S3-based snapshotting system could alleviate these concerns and help ensure our production database remains robust and efficient.
One of the most interesting aspects of this potential pattern is the possibility of it resolving the camel case issue. The process of exporting data and snapshotting it into S3 could provide a translation layer, allowing for the renaming of columns or making the driver indifferent to the capitalization of column names. If Amazon doesn't resolve this issue anytime soon, this could be a viable workaround, saving us the trouble of having to migrate our column names.
Conclusion
In conclusion, this journey of upgrading our ETL process from a DMS-based continuous replication system to exploring federated queries and the promising possibility of S3 snapshotting has been a learning experience. The process reinforced the importance of experimentation, of “fiddling and trying,” as I like to call it, alongside thorough research and planning. Each potential solution has its unique challenges and benefits, and only through hands-on exploration can you truly understand which path is right for your specific use case.
If you're considering replicating data from Postgres to Redshift, the options are there: DMS for continuous replication, federated queries for real-time access, and S3 snapshotting for cost-effectiveness and flexibility. Hopefully, my experience can serve as a guide, providing some context and insights as you navigate your own journey.
As for us, our exploration continues, and I am hopeful that we are converging on the best approach for our needs - a system that prizes simplicity, stability, and affordability. And while there may be challenges along the way, the goal remains clear: to learn, to innovate, and to continually improve our ETL process for the benefit of our team and our users.