• May 2, 2025

Data & Dimensional Modeling, Data Warehouse & Data Lake

Data Modelling

  • Data Modeling is a technique widely used in the world of Business and Information technology to show how information is or should be stored and used within a Business System.
  • The Success of any organisation relies on the efficient flow and processing of information.

Dimensional Modeling

  • Make the data accessible and usable by the business community.
  • Fundamental goals
    • Ease of use
    • Query Performance

Success Factors (most significant)

  • Business Partnership & Executive Sponsorship
    • Business Involvement throughout
    • Real Business Requirement
    • Real Business Problems to solve
  • Logical Model

Success Factors (Least significant)

  • Physical Model i.e. Logical model can be deployed into multiple databases after adding database-specific items.
  • One logical entity can be more than 1 physical table. e.g. Customer entity (no change, slowly changing & rapid changing attributes will be kept in a separate table)

Basic Principles for Dimensional Modeling

  • Multiple hierarchies are allowed, even encouraged, within a dimension
  • Each data element must be uniquely named. i.e. It may require existing elements with multiple meanings to be renamed for unique identification.
  • Identify the true grain for each set of facts, support rolling data up multiple ways
  • Sound data architecture is technology independent
    • Implementation needs to determine what data must co-exist in the same Physical database
    • Asses the capabilities of the latest technologies for data storage and access
    • Understand business needs and frequency to use data together
  • Maintain Key dimensions and its structure
    • The integrity of fact tables should remain intact.
    • Can easily change attributes, rollups (etc.,) without impact on fact tables
    • Dimensions should be easily rebuilt without affecting the fact tables.

Enterprise Data Models & Goals

  • Enterprise data modeling projects should be justified and developed on their merits.
  • Get data into the hands of Users.
  • Ensure Data Quality
  • Timely delivery of data marts
  • ability to adapt as business changes

What is Datawarehouse (DWH)

  • Centralise place to hold large volume of data from multiple source system in a standard format.
  • Provides historical time series records
  • Read optimised & Columnar storage
  • Act as a single source of truth
  • Transactional system follows OLTP structure vs Analytical system follows OLAP structure (Facts and Dimensions)
  • ODS layer in the middle is not a data warehouse. It is a collection of data from various sources system and follows the same source data formats.
  • DWH – Data has to be standardised, organised, reshaped to support analytical Capabilities
  • In DWH, the data often represented as Facts & Dimensions. i.e. Facts are the Answers and Dimensions are the questions (example: Who, what, When, Where – Often relates to Person, Geography, Time/Date, more specific items related to business like Product / Channel, Accounts, etc.)
  • In Dimensional model the central Fact table is joined (inner join) to Dimensions to answer the questions.
    • Some of the Benefits of OLAP structure:
      • Drill down capabilites
      • Dynamic presentation through BI Dashboards
      • Adhoc Reporting and Analysis
      • Able to reproduce the same results for the same point in time

Data Warehouse vs Data Lake

  • Data Warehouse
    • Structured data – Mostly stored in Database.
    • Can be in cloud or onprem
    • Re-Organise Transaction data for the purpose of Analysis i.e. OLTP -> OLAP
    • Uses star schema / snowflake schema (Built around Facts and Dimensions)
    • or Use Data Vault (Hubs, Satellites and Links)
    • Easy access to the data point that answers complex questions
    • End user (Business Users, Analytics community)
    • Prior to building the Datawarehouse, know what questions that are going to be answered
    • Cleanliness of data is critical when building the data warehouse to ensure the quality and Trust
    • Hence there is a need to have Master data Management Process along with the data warehouse project.
  • Data Lake
    • Can hold Structure, semi structures and unstructured data.
    • Can be in cloud or onprem
    • Generally the data is stored in the form of files.
    • Low cost
    • Provide the ability to combine all file formats and use in the Analysis
    • End User – Data Engineers, Data Scientist to re-oraganise the data in a consumable format that can be used by Business users or by the Analytics community)
    • When the requirement is unknown then Data lake is the best. But bring in only the data that may be of use in Reporting or in any ML model
    • Majority of the work happen post loading. i.e. from the landing / Storage Zone the data get re-shaped in a consumable format.
  • Platform
    • Choose a platform that support both Data Warehouse and Data Lake Capabilities
    • Example:
      • Snowflake cloud data warehouse that support structure and semi structured data (can also store other form of files in the storage layer)
      • Data Lakehouse Platform by Databricks: hybrid architecture that combines the best of data lake and data warehouse capabilities in one place
  • Data Warehouse – General Definition
    • A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
      • Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
      • Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
      • Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
      • Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.