Data & Dimensional Modeling, Data Warehouse & Data Lake
Data Modelling
- Data Modeling is a technique widely used in the world of Business and Information technology to show how information is or should be stored and used within a Business System.
- The Success of any organisation relies on the efficient flow and processing of information.
Dimensional Modeling
- Make the data accessible and usable by the business community.
- Fundamental goals
- Ease of use
- Query Performance
Success Factors (most significant)
- Business Partnership & Executive Sponsorship
- Business Involvement throughout
- Real Business Requirement
- Real Business Problems to solve
- Logical Model
Success Factors (Least significant)
- Physical Model i.e. Logical model can be deployed into multiple databases after adding database-specific items.
- One logical entity can be more than 1 physical table. e.g. Customer entity (no change, slowly changing & rapid changing attributes will be kept in a separate table)
Basic Principles for Dimensional Modeling
- Multiple hierarchies are allowed, even encouraged, within a dimension
- Each data element must be uniquely named. i.e. It may require existing elements with multiple meanings to be renamed for unique identification.
- Identify the true grain for each set of facts, support rolling data up multiple ways
- Sound data architecture is technology independent
- Implementation needs to determine what data must co-exist in the same Physical database
- Asses the capabilities of the latest technologies for data storage and access
- Understand business needs and frequency to use data together
- Maintain Key dimensions and its structure
- The integrity of fact tables should remain intact.
- Can easily change attributes, rollups (etc.,) without impact on fact tables
- Dimensions should be easily rebuilt without affecting the fact tables.
Enterprise Data Models & Goals
- Enterprise data modeling projects should be justified and developed on their merits.
- Get data into the hands of Users.
- Ensure Data Quality
- Timely delivery of data marts
- ability to adapt as business changes
What is Datawarehouse (DWH)
- Centralise place to hold large volume of data from multiple source system in a standard format.
- Provides historical time series records
- Read optimised & Columnar storage
- Act as a single source of truth
- Transactional system follows OLTP structure vs Analytical system follows OLAP structure (Facts and Dimensions)
- ODS layer in the middle is not a data warehouse. It is a collection of data from various sources system and follows the same source data formats.
- DWH – Data has to be standardised, organised, reshaped to support analytical Capabilities
- In DWH, the data often represented as Facts & Dimensions. i.e. Facts are the Answers and Dimensions are the questions (example: Who, what, When, Where – Often relates to Person, Geography, Time/Date, more specific items related to business like Product / Channel, Accounts, etc.)
- In Dimensional model the central Fact table is joined (inner join) to Dimensions to answer the questions.
- Some of the Benefits of OLAP structure:
- Drill down capabilites
- Dynamic presentation through BI Dashboards
- Adhoc Reporting and Analysis
- Able to reproduce the same results for the same point in time
- Some of the Benefits of OLAP structure:
Data Warehouse vs Data Lake
- Data Warehouse
- Structured data – Mostly stored in Database.
- Can be in cloud or onprem
- Re-Organise Transaction data for the purpose of Analysis i.e. OLTP -> OLAP
- Uses star schema / snowflake schema (Built around Facts and Dimensions)
- or Use Data Vault (Hubs, Satellites and Links)
- Easy access to the data point that answers complex questions
- End user (Business Users, Analytics community)
- Prior to building the Datawarehouse, know what questions that are going to be answered
- Cleanliness of data is critical when building the data warehouse to ensure the quality and Trust
- Hence there is a need to have Master data Management Process along with the data warehouse project.
- Data Lake
- Can hold Structure, semi structures and unstructured data.
- Can be in cloud or onprem
- Generally the data is stored in the form of files.
- Low cost
- Provide the ability to combine all file formats and use in the Analysis
- End User – Data Engineers, Data Scientist to re-oraganise the data in a consumable format that can be used by Business users or by the Analytics community)
- When the requirement is unknown then Data lake is the best. But bring in only the data that may be of use in Reporting or in any ML model
- Majority of the work happen post loading. i.e. from the landing / Storage Zone the data get re-shaped in a consumable format.
- Platform
- Choose a platform that support both Data Warehouse and Data Lake Capabilities
- Example:
- Snowflake cloud data warehouse that support structure and semi structured data (can also store other form of files in the storage layer)
- Data Lakehouse Platform by Databricks: hybrid architecture that combines the best of data lake and data warehouse capabilities in one place
- Data Warehouse – General Definition
- A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
- Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
- Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
- Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
- Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
- A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.