• May 2, 2025

SnowPro

Architecture

  • Three Distinct Architecture Layers
  • Decoupling of Elastic Storage and Elastic Compute
  • Implemented using Virtual Warehouses
  • Cloud Agnostic
  • Data Cloud / Data Exchange and Partner Network
  • Cloud Partner Categories
  • Truly self-managed service
  • Fully ACID compliant
  • Zero Copy Cloning

Snowflake Editions

Each account in an organization has a specific edition

Standard

  • Full unlimited access to all of Snowflake’s standard features
  • Provides a strong balance between features, level of support and cost

Enterprise

  • Additional features designed specifically for the needs of large-scale organizations
  • Extended Time Travel
  • Periodic rekeying of encrypted data
  • Object tagging
  • Query acceleration, search optimization and materialized views
  • Column- and row-level security
  • Multi-cluster warehouses
  • Audit user access history through the ACCOUNT_USAGE.ACCESS_HISTORY view
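
A minimal sketch of auditing user access through that view (Enterprise edition and above), assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE share:

  -- Recent access events per user, with the objects each query read
  SELECT user_name,
         query_start_time,
         direct_objects_accessed
  FROM   snowflake.account_usage.access_history
  WHERE  query_start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  ORDER BY query_start_time DESC;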

Business Critical

  • Formerly known as Enterprise for Sensitive Data (ESD)
  • Supports a higher level of data protection using customer-managed encryption keys
  • Supports highly sensitive data, particularly PHI (Protected Health Information) data that must comply with HIPAA and HITRUST CSF regulations
  • Additional support for database failover/failback and disaster recovery
  • As per HIPAA and HITRUST regulations, a business associate agreement (BAA) must be signed between the organization and Snowflake Inc.
  • Support for PCI DSS, FedRAMP and IRAP

Virtual Private Snowflake

  • Higher level of security
  • Suited for financial institutions and other organizations that collect, analyze and share highly sensitive data
  • A completely separate Snowflake environment, isolated from all other Snowflake accounts
  • VPS accounts do not share any resources outside of the VPS account
  • A VPS account may choose to enable data sharing with non-VPS customers by contacting Snowflake Support. This is not enabled by default
  • Can use an account identifier that uses the organization name and account name
  • The account locator uses a different format from that of other account editions

Releases

Weekly release

  • Full release that may include new features, feature enhancements, fixes and behavior changes
  • Weekly release notes are published to the Snowflake Community
  • Snowflake documentation is updated
  • Full releases may be deployed on any day of the week, but typically not on a Friday
  • Patch releases only include fixes
  • Patch releases are deployed along with or after a full release

Behavior releases

  • A behavior change is defined as a change to an existing behavior
  • May affect customers’ existing code or workloads
  • Each month, apart from November and December, Snowflake selects one of the weekly releases to release behavior changes
  • Typically this is the 3rd or the 4th week of the month
  • Naming convention is YYYY_NN, where YYYY stands for the year and NN stands for the sequential ordinal number of the release within the year
  • Consists of two periods: the test period and the opt-out period
  • Testing period (1st month): disabled by default. Customers have to opt in if needed
  • Opt-out period (2nd month): enabled by default. Customers have to opt out if needed

Staged Release

Snowflake does not move all accounts to a release at the same time. Accounts are moved to the full release based on their edition. Staged releases allow Snowflake to monitor releases and respond to any issues that occur

There are 3 stages

  • Stage 1 (Day 1): Early access for designated Enterprise and above edition accounts
  • Stage 2 (Day 1 or 2): Regular access for Standard edition accounts
  • Stage 3 (Day 2): Final access for Enterprise and above edition accounts

The minimum amount of time between early access and final release is 24 hours

Snowflake Architecture

Snowflake’s architecture is a hybrid of the shared-disk and shared-nothing architectures

Shared Disk Architecture

Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform

Shared Nothing Architecture

Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster stores a portion of the entire data set locally.

Three Layered Architecture

Snowflake’s architecture consists of three layers:

  • Cloud Services
  • Query Processing (also called the compute layer; implemented using virtual warehouses)
  • Database Storage

Cloud Services

Services managed in this layer include:

  • Authentication
  • Infrastructure management
  • Metadata management
  • Query parsing and optimization
  • Access control

Query Processing

Queries are processed in this layer using compute nodes called virtual warehouses. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes

Database Storage

  • Data is stored in Snowflake’s proprietary format on cloud storage such as AWS S3, Azure Blob Storage or Google Cloud Storage
  • Data stored in Snowflake tables is automatically encrypted using AES-256 strong encryption
  • Files stored for data loading and unloading in Snowflake internal stages are encrypted using AES-256 strong encryption

Elastic Storage

Snowflake uses a central data repository that is accessible from all compute nodes, as in a shared-disk architecture.

Loaded data is organized into its internal optimized, compressed columnar format.

Data objects stored in this columnar format are not directly visible to users and can only be accessed via SQL queries

Snowflake manages all aspects of how this data is stored

  • organization
  • file size
  • structure
  • compression
  • metadata
  • statistics
  • other aspects of data storage

Elastic Compute

Snowflake compute is driven by virtual warehouses, often referred to simply as warehouses. A warehouse provides resources such as CPU, memory and temporary storage to perform operations such as SELECT queries, DML operations that modify rows (INSERT, UPDATE and DELETE), and loading and unloading data.

If sufficient resources are not available to run the queries submitted to a warehouse, Snowflake queues the queries until resources become available. You can then either scale up (or down) or scale out (or in) the virtual warehouse to provide the required resources when necessary.

Scaling up or down changes the size of a warehouse, while scaling out or in adds clusters to or removes clusters from a multi-cluster warehouse (see the sketch below).
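
A minimal sketch of the two operations, assuming a hypothetical warehouse named MY_WH:

  -- Scale up: change the size of the warehouse (no suspend required)
  ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'LARGE';

  -- Scale out: adjust the cluster counts of a multi-cluster warehouse
  -- (Enterprise edition and above)
  ALTER WAREHOUSE MY_WH SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;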

Virtual Warehouse

Each virtual warehouse is an independent compute cluster that does not share resources with other virtual warehouses (shared-nothing architecture). As such, a virtual warehouse has no performance impact on any other virtual warehouse

Default size is x-small when created using the CREATE WAREHOUSE statement

Default size is x-large when created using the web interface

Credits used depend on the size of the warehouse

Warehouse sizes can be changed either from the web interface or by using the ALTER WAREHOUSE command and setting WAREHOUSE_SIZE

The warehouse does not need to be suspended when changing its size, and queries already running on the warehouse will NOT be affected.

Types of Virtual Warehouse:

  • Standard
  • Snowpark-optimized

A warehouse must be suspended before you can change the type of the warehouse.

The type can be changed either from the web interface or by using the ALTER WAREHOUSE command and setting WAREHOUSE_TYPE, as shown below
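
A minimal sketch, again assuming a hypothetical warehouse named MY_WH:

  -- Changing the type requires the warehouse to be suspended first
  ALTER WAREHOUSE MY_WH SUSPEND;
  ALTER WAREHOUSE MY_WH SET WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';
  ALTER WAREHOUSE MY_WH RESUME;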

Enterprise and above editions allow for multi-cluster warehouses

Auto Suspend and Auto Resume

Both auto-suspend and auto-resume are enabled by default when creating a new virtual warehouse

Enabling auto-suspend automatically suspends the warehouse if it is inactive for a set period of time

When auto-resume is enabled

  • The warehouse is resumed when any statement requiring the warehouse is submitted and
  • The warehouse is the current warehouse of the session

Only a single virtual warehouse can be used by a session at any given time. You can, however, change the virtual warehouse for that session

Setting different minimum and maximum cluster counts for a multi-cluster warehouse enables auto-scaling (see the sketch below)
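
A minimal sketch that pulls these settings together, using hypothetical warehouse names:

  -- Multi-cluster warehouse (Enterprise edition and above) with auto-suspend/auto-resume
  CREATE WAREHOUSE MY_MC_WH
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3       -- min < max enables auto-scale mode
    AUTO_SUSPEND      = 60      -- suspend after 60 seconds of inactivity
    AUTO_RESUME       = TRUE;

  -- Change the warehouse used by the current session
  USE WAREHOUSE MY_MC_WH;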

Cache

Result cache

The result cache is also called the 24-hour cache. Results are cached for a non-configurable duration of 24 hours; each time a persisted result is reused, the 24-hour retention period is extended, up to a maximum of 31 days, after which results are purged from the cache.

The result cache is available across virtual warehouses and across users, as long as they have the required privileges

Query results are stored in the Cloud Services layer

Result cache reuse can be disabled at the account, user or session level using the USE_CACHED_RESULT parameter. By default it is enabled.
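
A minimal sketch of toggling result cache reuse for the current session (e.g., when benchmarking warehouse performance):

  -- Disable result cache reuse for this session
  ALTER SESSION SET USE_CACHED_RESULT = FALSE;

  -- Re-enable it (the default)
  ALTER SESSION SET USE_CACHED_RESULT = TRUE;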

Warehouse Cache

The warehouse cache, on the other hand, is purged when the virtual warehouse is resized, suspended or dropped.

The warehouse cache is also known as the local disk cache, data cache, raw data cache or SSD cache.
The warehouse cache can be used by subsequent queries only from the same warehouse.
Unlike the result cache, the warehouse cache holds the raw table data used to process a query (not the result of the query, as in the result cache). In other words, data is kept in raw micro-partition form rather than in an aggregated form, and the micro-partitions are cached in the local storage (SSD) of the warehouse cluster that ran the query
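
A minimal sketch of the effect, assuming a hypothetical orders table queried twice on the same running warehouse:

  -- First query: micro-partitions are read from remote storage and cached locally
  SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01';

  -- Second query on the same warehouse: overlapping micro-partitions are served from
  -- the warehouse cache (visible in the query profile as "Percentage scanned from cache")
  SELECT order_id, amount FROM orders WHERE order_date >= '2024-06-01';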

SQL Optimisation tips

Query Profile Analysis

Query profile provides a detailed, visual representation of how a query executed, including stages, scanned data, memory usage, and potential bottlenecks like memory spills or inefficient joins.

You can access it:

  1. From the History tab by clicking the “Profile” icon next to a query.
  2. Through SQL using QUERY_HISTORY or QUERY_HISTORY_DETAIL views.

The query profile shows the amount of data a query reads from storage, while a memory spill occurs when a query exceeds its allocated memory, causing intermediate results to spill to disk and slowing down the query. Here’s how to address these issues:

1. Check Query Scanning:

  • Use Query History in Snowflake (SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY) to track the scanned bytes (see the sketch after this list).
  • Optimize by:
    • Pruning data: Use clustering keys or filters to limit the amount of data scanned.
    • Minimizing SELECT *: Avoid selecting unnecessary columns.
    • Partitioning: Ensure external tables or files (e.g., in S3) are partitioned effectively.
      • Micro-partitioning is not applicable for external tables in Snowflake.
        • Micro-partitioning is a Snowflake feature for managing data stored internally in Snowflake-managed tables. It automatically partitions data in the cloud storage layer for performance and efficiency.
        • External tables, however, reference data stored outside Snowflake (e.g., in S3, Azure Blob, or GCS). Snowflake does not manage or physically reorganize this data, so micro-partitioning does not apply.
      • To optimise the External Tables:
        • Use file partitioning at the source (e.g., S3 folder structure) to improve query performance.
        • Leverage partition pruning with filters in queries to limit the amount of data scanned.
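
A minimal sketch of tracking scanned data, assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE share:

  -- Queries that scanned the most data in the last 7 days,
  -- with pruning effectiveness (partitions scanned vs. total)
  SELECT query_id,
         bytes_scanned,
         partitions_scanned,
         partitions_total
  FROM   snowflake.account_usage.query_history
  WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  ORDER BY bytes_scanned DESC
  LIMIT  20;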

2. Avoid Memory Spills:

  • Monitor memory usage: View bytes spilled to local storage in Query History (see the sketch after this list).
  • Optimize queries:
    • Use proper joins: Ensure small tables are on the right side of a join.
    • Simplify aggregations and window functions.
    • Break large queries into smaller steps (CTEs or intermediate tables).
  • Use warehouses with more resources: Scaling up to a larger virtual warehouse (e.g., X-LARGE) can help.
  • Enable result caching and automatic clustering.
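
A minimal sketch of finding spilling queries, using the same view as above:

  -- Queries that spilled intermediate results to local or remote storage
  SELECT query_id,
         warehouse_name,
         bytes_spilled_to_local_storage,
         bytes_spilled_to_remote_storage
  FROM   snowflake.account_usage.query_history
  WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    AND  (bytes_spilled_to_local_storage > 0
          OR bytes_spilled_to_remote_storage > 0)
  ORDER BY bytes_spilled_to_local_storage DESC;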

By analyzing query profiles and implementing these optimizations, you can reduce scanning and prevent memory spills effectively.

3. Avoid some common mistakes:

To improve query performance and reduce cost, follow these steps:

  • Pruning: Effective partition pruning helps a query scan the right volume of records. Filter on high-cardinality columns to leverage Snowflake micro-partitions.
    Queries that use the LIKE operator prevent partition pruning and scan the whole table.
    Queries that use an exact-match filter condition or a date range improve query performance.
    Apply pruning (filters) on each table before joining to reduce the size of the data scanned (see the sketch after this list).
  • Select: Keeping only the required fields in the SELECT statement reduces the data scanned and the associated costs, especially when extracting data from large, wide tables.
  • Clustering: Automatic micro-partitioning works best initially when data is written in bulk.
    Over time, as the data grows and updates and inserts occur:
    • New data might not align with the earlier sets.
    • Micro-partitions become less optimized for common query filters (e.g., date ranges, customer IDs).
    • This leads to more irrelevant micro-partitions being scanned during queries.
      • Snowflake does not allow direct manual reorganization or fixing of micro-partitions because micro-partitioning is fully automated and managed by Snowflake. However, clustering keys serve as a workaround to guide Snowflake on how to organize and optimize the micro-partitions.
      • Snowflake’s automatic micro-partitioning does not re-organize partitions after incremental data updates or inserts.
        • This fragmentation leads to more irrelevant partitions being scanned, as Snowflake can’t prune as effectively.
    • When you define a clustering key (e.g., order_date):
      • Snowflake reorganizes the data based on your clustering key to align micro-partitions with order_date (see the sketch after this list).
  • Key Takeaways:
    • Without clustering: Over time, micro-partitions become disorganized, and query pruning scans irrelevant data.
    • With clustering: Snowflake reorganizes data to ensure query filters (e.g., date ranges) scan only the necessary partitions.
  • In some cases, use temp tables over CTEs
    • CTEs are recalculated every time they are referenced, which can be inefficient in complex queries.
    • Use temporary tables for intermediate data referenced multiple times.
  • Materialised views: Re-running aggregations for dashboards or regular reports leads to redundant costs.
    • Use materialised views to store frequently aggregated data, automatically updated as source data changes.
  • Right warehouse size: Start with the smallest warehouse and check the query profile; if intermediate results spill to disk, increase the size of the warehouse. Also enable auto-suspend so the warehouse suspends when there is no activity for 60 seconds. Auto-resume ensures that queries can automatically start warehouses as needed.
  • Data Storage formats and compression: Ignoring data storage formats and compression isn’t ideal because these directly impact query performance, storage costs, and efficiency in Snowflake.
    1. Storage Format Affects Query Efficiency
    • The data storage format (e.g., Parquet, CSV, ORC) impacts how Snowflake processes the data, especially for external tables (like S3). Here’s how:
    • Columnar Formats (e.g., Parquet, ORC):
      • Optimized for analytical workloads (e.g., filtering and aggregating specific columns).
      • Snowflake can read only the required columns, reducing I/O and query time.
    • Row-Based Formats (e.g., CSV, JSON):
      • Snowflake must scan entire rows, even if only one column is needed.
      • Results in more data read and slower query performance.
    • Example:
      • Query: SELECT customer_id FROM external_table WHERE order_date = '2024-01-01';
        • With Parquet or Variant type: Only the customer_id column is read.
        • With CSV: The entire row is scanned for each match.
    • Ignoring the format forces Snowflake to do unnecessary work, affecting performance.
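
A minimal sketch of a few of these practices, assuming hypothetical orders and customers tables with an order_date column:

  -- Pruning-friendly: exact-match / date-range filters applied before the join
  SELECT o.order_id, c.customer_name
  FROM   (SELECT order_id, customer_id
          FROM   orders
          WHERE  order_date BETWEEN '2024-01-01' AND '2024-01-31') o
  JOIN   customers c
    ON   c.customer_id = o.customer_id;
  -- A leading-wildcard filter such as WHERE customer_name LIKE '%smith%'
  -- cannot prune micro-partitions and forces a full table scan.

  -- Guide micro-partition organization with a clustering key
  ALTER TABLE orders CLUSTER BY (order_date);

  -- Check how well the table is clustered on that key
  SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');

  -- Materialize an intermediate result that is referenced multiple times
  CREATE TEMPORARY TABLE monthly_totals AS
  SELECT customer_id,
         DATE_TRUNC('month', order_date) AS month,
         SUM(amount) AS total
  FROM   orders
  GROUP BY customer_id, DATE_TRUNC('month', order_date);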

Data Cloud

Each Snowflake account is hosted in a single cloud region

Snowflake can be hosted on any of the following

  • Amazon Web Services (AWS)
  • Google Cloud Platform (GCP)
  • Microsoft Azure (Azure)

Snowflake cannot run on private cloud infrastructure, either on-premises or hosted

Cloud Partner Categories

  • Data integration
  • Business Intelligence
  • Machine Learning and Data Science
  • Security and Governance
  • SQL Development and Management
  • Native Programming interfaces

The following are some examples of each category; for an exhaustive list, refer to the Snowflake Partner Network.

Data Integration

dbt Labs, Coalesce, Azure Data Factory, Fivetran, Google Cloud, Informatica, Denodo, Heap, Hevo, HVR (now Fivetran), Matillion, Qlik, Stitch, Tableau, Talend

Business Intelligence

Adobe, Amazon QuickSight, Domo, Google Data Studio, Power BI, Qlik, Sigma, Tableau

Machine Learning and Data Science

Alteryx, Amazon SageMaker, Databricks, DataRobot, R, Apache Spark

Security and Governance

Alation, Atlan, Collibra, Datadog, Informatica, Skyflow

SQL Development and Management

DBeaver, DataOps, Agile Data Engine, SQL Workbench/J

Native Programming Interface

Go, JDBC, Kafka, Node.js, ODBC, PHP, Python, Microsoft .NET, SQLAlchemy