May 2, 2025

S3 to Snowflake Ingestion Using dbt

As I worked on optimising our data pipelines, I faced a common challenge: loading data from S3 into Snowflake in a way that was scalable, reusable, and easy to maintain. To address this, I developed a custom dbt macro and a parameter file to streamline the process.

The macro automates the extraction, transformation, and loading (ETL) of data from S3 into Snowflake. One of its key advantages is how easy it is to maintain when the source file structure changes: simply update the model SQL (the parameter file), which holds details such as fields, keys, and datatypes, and the entire process adapts. This keeps the solution scalable and flexible, because only the parameter file needs to change, never the core macro logic.
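To make that concrete, a parameter file might look roughly like the sketch below. This is purely illustrative rather than our actual model or macro: the macro name (s3_snowflake_load), the stage, the file format, and the customer columns are all assumptions.

    -- models/land/land_customer.sql (hypothetical parameter file; all names are illustrative)
    {{ config(materialized = 'view') }}

    {{ s3_snowflake_load(
        target_table = 'LAND.CUSTOMER_RAW',
        stage        = '@LAND.S3_SALES_STAGE',
        file_format  = 'LAND.CSV_FF',
        pattern      = '.*customer_.*[.]csv',
        columns      = [
            {'name': 'customer_id',   'datatype': 'number',        'key': true},
            {'name': 'customer_name', 'datatype': 'varchar(200)',  'key': false},
            {'name': 'region',        'datatype': 'varchar(50)',   'key': false},
            {'name': 'updated_at',    'datatype': 'timestamp_ntz', 'key': false}
        ]
    ) }}

    select * from LAND.CUSTOMER_RAW

In this sketch the trailing select only gives dbt something to materialise; everything above it is just parameters handed to the macro.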

While it may seem like we’re manually defining the schema by passing fields, keys, and datatypes through the parameter file, this is actually what enables automated table creation and recreation. Whenever there’s a change, the macro drops and recreates the table based on these updated definitions, allowing it to stay synchronised with the data’s structure. This way, the solution remains fully automated while still giving us control over how the table should be structured in the land layer.
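A minimal sketch of that drop-and-recreate-then-load pattern is shown below. It is not the macro from our project, only the general shape: build the column definitions from the parameter file, issue a create or replace table, then run a copy into from the external stage.

    -- macros/s3_snowflake_load.sql (illustrative sketch only; the macro name and
    -- parameters are assumptions, and the 'key' flag is ignored here)
    {% macro s3_snowflake_load(target_table, stage, file_format, pattern, columns) %}

        {# Build the column definitions from the parameter file #}
        {% set col_defs = [] %}
        {% for col in columns %}
            {% do col_defs.append(col.name ~ ' ' ~ col.datatype) %}
        {% endfor %}

        {% set create_sql %}
            create or replace table {{ target_table }} (
                {{ col_defs | join(', ') }}
            )
        {% endset %}

        {% set copy_sql %}
            copy into {{ target_table }}
            from {{ stage }}
            pattern = '{{ pattern }}'
            file_format = (format_name = '{{ file_format }}')
        {% endset %}

        {# Only hit Snowflake at execution time, not while dbt parses the project #}
        {% if execute %}
            {% do run_query(create_sql) %}
            {% do run_query(copy_sql) %}
        {% endif %}

    {% endmacro %}

In this sketch the table is recreated from the column list on every run, so changing the parameter file is enough to pick up a new field or datatype.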

The macro is designed to align with the dbt-recommended approach of one model SQL file for each target table in Snowflake. Although we could process multiple files with different structures from the same source system by looping and loading them into different tables, the macro is optimised to load either one file (by passing the file name) or multiple files of the same format, e.g. files from different branches/regions (by passing a file name pattern), into a single target table.
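In copy into terms, the two modes might look like this (again only a sketch, with made-up object names):

    -- One specific file into the target table
    copy into LAND.CUSTOMER_RAW
    from @LAND.S3_SALES_STAGE
    files = ('customer_nsw_20250501.csv')
    file_format = (format_name = 'LAND.CSV_FF');

    -- All files of the same format, e.g. one per branch/region, into the same table
    copy into LAND.CUSTOMER_RAW
    from @LAND.S3_SALES_STAGE
    pattern = '.*customer_.*_20250501[.]csv'
    file_format = (format_name = 'LAND.CSV_FF');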

Additionally, there are a few prerequisites to using this macro:

  • The file format for the source data,
  • An external stage in Snowflake pointing to the S3 bucket,
  • The storage integration between Snowflake and S3, and
  • The file landing in S3 itself.

These objects and processes should already be in place. Configuring and managing the file format, external stage, storage integration, and file landing are out of scope for this macro, which focuses solely on loading the data and handling table creation.
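For context, those pre-existing objects are usually created once, outside of dbt, along the following lines. The names, role ARN, and bucket are placeholders rather than our actual configuration:

    -- Storage integration between Snowflake and the S3 bucket (admin, one-off)
    create storage integration s3_sales_int
      type = external_stage
      storage_provider = 'S3'
      enabled = true
      storage_aws_role_arn = 'arn:aws:iam::<account_id>:role/<snowflake_access_role>'
      storage_allowed_locations = ('s3://<bucket>/sales/');

    -- File format describing the incoming files
    create file format LAND.CSV_FF
      type = csv
      field_delimiter = ','
      skip_header = 1
      field_optionally_enclosed_by = '"';

    -- External stage pointing at the S3 location
    create stage LAND.S3_SALES_STAGE
      url = 's3://<bucket>/sales/'
      storage_integration = s3_sales_int
      file_format = LAND.CSV_FF;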

Looking forward, the macro can be extended to archive the file once it has been successfully loaded into the target table in Snowflake. This would add another layer of automation, ensuring that files are moved to an archive location post-processing.
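One way that could work, assuming a separate archive stage exists, is to copy the loaded files between stages and then remove the originals; Snowflake's copy into also offers a purge option, although that deletes the files rather than archiving them. The statements below are only a sketch of the idea with illustrative names:

    -- Copy successfully loaded files to an archive location
    copy files into @LAND.S3_ARCHIVE_STAGE
    from @LAND.S3_SALES_STAGE
    pattern = '.*customer_.*[.]csv';

    -- Then remove the originals from the source prefix
    remove @LAND.S3_SALES_STAGE pattern = '.*customer_.*[.]csv';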

By leveraging dbt’s templating capabilities and Snowflake’s data ingestion features, this approach eliminates repetitive, manual work and ensures that our pipelines are reliable and easily configurable for new source files. With the macro, we can also quickly adapt to changes in the source files without heavy lifting, making the solution not just scalable but also easy to maintain.

Moving ahead, I plan to enhance the macro with more complex transformations and error-handling mechanisms to make it even more robust.

Code Snippet: Please reach out.