Thursday, 27 November 2025

Data Warehouse Concepts

 

🏗️ Core Data Warehouse Concepts

  • 🏢 Data Warehouse (DWH): A central repository of integrated, historical data designed for query and analysis. Its primary purpose is to support business intelligence.

  • 🏪 Data Mart: A subset of a data warehouse tailored to serve a specific business line (e.g., Sales, Finance). It contains a focused collection of data.

  • 🔄 Operational Data Store (ODS): A database that integrates data from multiple sources for operational reporting. Its data is more current and more volatile than the warehouse's, and it often acts as a staging area for the warehouse.

  • 🌊 Data Lake: A vast repository holding massive amounts of raw data in its native format (structured, semi-structured, and unstructured) until needed. No predefined schema is required; structure is applied when the data is read (schema-on-read).

  • 🏡 Data Lakehouse: A modern architecture combining the flexibility of data lakes with the data management and ACID transactions of data warehouses.

  • 🚧 Staging Area: A temporary storage area for data extraction, cleansing, and transformation before loading into the warehouse. Not typically queried by end-users.

  • 📊 Business Intelligence (BI): Technologies and practices for collecting, analyzing, and presenting business information to support decision-making.


📐 Data Modeling & Architecture

  • 🗺️ Schema: The logical description of the entire database, including table structures and relationships.

  • ⭐ Star Schema: The simplest dimensional schema: a central fact table joined directly to multiple dimension tables, forming a star shape.

  • ❄️ Snowflake Schema: A variation of the star schema where dimension tables are normalized into multiple related tables. This reduces redundancy but adds joins and complexity.

  • 🌌 Galaxy Schema (Fact Constellation): A complex schema with multiple fact tables sharing dimension tables.

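  • 🔐 Data Vault Modeling: A hybrid modeling method (between third normal form and star schema) for long-term historical storage, composed of Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes and their history). Resilient to change and highly scalable.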

  • 🏷️ Dimension: A category of information (the "who, what, where, when"). Provides context to facts (e.g., Customer, Product).

  • 🔢 Fact: A measurement or metric, typically numerical (e.g., Sales Amount, Quantity).

  • 🟢 Fact Table: The central table containing measurements (facts) and foreign keys to the related dimension tables.

  • 🔵 Dimension Table: A table storing descriptive attributes related to a business dimension (e.g., Customer Name, City).

  • 🌾 Grain (Granularity): The level of detail represented by a single row in a fact table (e.g., "one row per order line item").

  • 🔑 Surrogate Key: A system-generated unique identifier (typically an integer) used as a primary key in the warehouse, independent of the source system.

  • 🆔 Natural Key (Business Key): An identifier from the operational source system (e.g., CustomerID).

  • 🕰️ Slowly Changing Dimension (SCD): Techniques for handling changes to dimension attributes over time.

    • Type 1: Overwrite the old value (no history kept).

    • Type 2: Add a new row for each change (full history preserved).

    • Type 3: Add a new column holding the previous value (limited history).

  • 🤝 Conformed Dimension: A dimension that has the same meaning and content wherever it is used, so it can be shared across different fact tables or data marts (e.g., Date).
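To make the modeling terms above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It builds a tiny star schema with one dimension and one fact table, and applies a Type 2 change to the dimension. All table names, column names, and sample values (dim_customer, fact_sales, C001, and so on) are made up for illustration; a real warehouse would usually also carry a date dimension and more attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, generated by the warehouse
    customer_id TEXT NOT NULL,        -- natural (business) key from the source
    city        TEXT NOT NULL,        -- descriptive attribute
    valid_from  TEXT NOT NULL,
    valid_to    TEXT,                 -- NULL while the row is current
    is_current  INTEGER NOT NULL
);
CREATE TABLE fact_sales (             -- grain: one row per order line item
    customer_sk  INTEGER NOT NULL REFERENCES dim_customer(customer_sk),
    order_date   TEXT NOT NULL,
    quantity     INTEGER NOT NULL,
    sales_amount REAL NOT NULL
);
""")


def apply_scd2(customer_id, city, change_date):
    """SCD Type 2: close the current row and add a new one when the city changes."""
    row = conn.execute(
        "SELECT customer_sk, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row and row[1] == city:
        return row[0]  # nothing changed, keep the current surrogate key
    if row:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_sk = ?",
            (change_date, row[0]),
        )
    cur = conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )
    return cur.lastrowid


# The customer first lives in Berlin, then moves to Munich.
sk1 = apply_scd2("C001", "Berlin", "2025-01-01")
conn.execute("INSERT INTO fact_sales VALUES (?, '2025-02-10', 2, 40.0)", (sk1,))
sk2 = apply_scd2("C001", "Munich", "2025-06-01")
conn.execute("INSERT INTO fact_sales VALUES (?, '2025-07-05', 1, 25.0)", (sk2,))

# Star join: each fact row points at the dimension row that was valid at the time.
sales_by_city = conn.execute("""
    SELECT d.city, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_sk = f.customer_sk
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(sales_by_city)  # [('Berlin', 40.0), ('Munich', 25.0)]
```

Because the fact rows store the surrogate key rather than the natural key, the earlier sale stays attributed to Berlin even after the customer moves. With a Type 1 overwrite, both sales would be reported under Munich.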


🚀 ETL & Data Integration

  • 🚚 ETL (Extract, Transform, Load): The process of moving data from source systems into the warehouse, transforming it before it is loaded.

    • Extract: Reading data from the source systems.

    • Transform: Cleaning and structuring the data to fit the target model.

    • Load: Writing the data to the target.

  • ☁️ ELT (Extract, Load, Transform): Loading raw data into the target system first and transforming it there. Common in modern cloud platforms (Snowflake, BigQuery).

  • 🚰 Data Pipeline: A system moving data from one place to another; may or may not involve heavy transformation.

  • 📸 Change Data Capture (CDC): Identifying and capturing changes (inserts, updates, deletes) in a source database so they can be applied to the warehouse, often in near real time.

  • 🧹 Data Cleansing: Detecting and correcting corrupt or inaccurate records.

  • 🔍 Data Profiling: Examining source data to collect statistics and assess quality.

  • 🛡️ Data Governance: Managing data availability, usability, integrity, and security across an enterprise.
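The Extract, Transform, and Load steps described above can be sketched in a few lines of standard-library Python. This is only an illustration under assumed inputs: the CSV content, the stg_sales table, and the cleansing rules (trim whitespace, normalize city casing, reject rows with a missing name or a non-numeric amount, drop exact duplicates) are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file, API, or database.
SOURCE_CSV = """customer_id,name,city,amount
C001, Alice ,berlin,40.00
C002,Bob,MUNICH,not_a_number
C003,,Hamburg,15.50
C001, Alice ,berlin,40.00
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse and standardize rows, setting aside those that fail."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        name = row["name"].strip()
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # amount is not numeric
            continue
        if not name:
            rejected.append(row)  # required field is missing
            continue
        record = (row["customer_id"].strip(), name, row["city"].strip().title(), amount)
        if record in seen:
            continue  # drop exact duplicates
        seen.add(record)
        clean.append(record)
    return clean, rejected


def load(conn, records):
    """Load: write the cleaned records to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales "
        "(customer_id TEXT, name TEXT, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(conn, clean)
loaded = conn.execute("SELECT * FROM stg_sales").fetchall()
print(loaded)         # [('C001', 'Alice', 'Berlin', 40.0)]
print(len(rejected))  # 2
```

In an ELT setup the same three functions would run in a different order: the raw rows would be loaded first, and the cleansing logic would run as queries inside the target platform.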


🎯 Key Performance Indicators & Metrics

  • 📈 KPI (Key Performance Indicator): A measurable value demonstrating how effectively a company is achieving its key objectives.

  • 📏 Measure (Metric): A numerical value that can be aggregated.

  • ➕ Additive Measure: Can be summed across all dimensions (e.g., Sales Amount).

  • 🌗 Semi-Additive Measure: Can be summed across some dimensions but not all (e.g., Account Balance, which can be summed across accounts but not across time).

  • 🚫 Non-Additive Measure: Cannot be meaningfully summed across any dimension (e.g., ratios, percentages); it has to be recomputed from its additive components.
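The difference between the three kinds of measures is easiest to see with numbers. The following sketch uses made-up account balances and sales figures; the variable names are illustrative only.

```python
from collections import defaultdict

# Semi-additive: daily balance snapshots as (date, account, balance).
balances = [
    ("2025-01-01", "A", 100.0), ("2025-01-01", "B", 50.0),
    ("2025-01-02", "A", 120.0), ("2025-01-02", "B", 30.0),
]

# Summing across accounts within one day is meaningful.
total_by_day = defaultdict(float)
for day, _account, balance in balances:
    total_by_day[day] += balance

# Summing across days is not: it would double count the same money.
wrong_total = sum(balance for _day, _account, balance in balances)
# Across time, use the closing (latest) value instead.
closing_total = total_by_day[max(total_by_day)]

# Non-additive: margin per region as (region, revenue, profit).
sales = [
    ("North", 200.0, 50.0),  # margin 25 %
    ("South", 800.0, 80.0),  # margin 10 %
]

# Revenue and profit are additive, so sum them first ...
total_revenue = sum(revenue for _region, revenue, _profit in sales)
total_profit = sum(profit for _region, _revenue, profit in sales)
# ... and recompute the ratio from the sums.
overall_margin = total_profit / total_revenue
# Averaging the per-region ratios gives a different, misleading number.
naive_margin = sum(profit / revenue for _region, revenue, profit in sales) / len(sales)

print(dict(total_by_day), closing_total, wrong_total)
print(round(overall_margin, 3), round(naive_margin, 3))
```

Here the closing balance is 150.0, while naively summing every snapshot gives 300.0. The overall margin is 0.13, whereas the average of the two regional margins is 0.175, because it ignores that South has four times the revenue of North.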


🧊 OLAP & Querying

  • 🧠 OLAP (Online Analytical Processing): Technology for interactive, multidimensional data analysis.

  • 🧾 OLTP (Online Transaction Processing): Systems managing transaction-oriented applications (e.g., ERP, CRM).

  • 📦 Cube: A multi-dimensional array of pre-aggregated data for fast querying.

  • ↕️ Drill Down / Roll Up: Navigating data hierarchy from summary to detail (Drill Down) or detail to summary (Roll Up).

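  • 🍰 Slice and Dice: Viewing data from different perspectives by selecting subsets. A slice fixes a single value of one dimension; a dice selects a sub-cube by restricting several dimensions.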

  • 🔄 Pivot: Rotating the dimensional orientation of a report (e.g., swapping which dimensions appear on rows and columns).

  • ⌨️ MDX / DAX: MDX is a query language for multidimensional OLAP cubes; DAX is the formula and query language used by Power BI and Analysis Services tabular models.
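The navigation operations above (roll up, drill down, slice, dice, pivot) can be imitated on a handful of rows in plain Python. This is a toy sketch, not how an OLAP engine is implemented; the fact rows and the helper names aggregate and pivot are invented for the example.

```python
from collections import defaultdict

COLUMNS = ("year", "quarter", "region", "product", "sales")
facts = [
    (2024, "Q1", "North", "Bike", 100),
    (2024, "Q1", "South", "Bike", 80),
    (2024, "Q2", "North", "Helmet", 30),
    (2025, "Q1", "North", "Bike", 120),
    (2025, "Q1", "South", "Helmet", 20),
]
rows = [dict(zip(COLUMNS, fact)) for fact in facts]


def aggregate(rows, *dims):
    """Sum sales grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Arrange summed sales with one dimension on rows and another on columns."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_dim]][row[col_dim]] += row["sales"]
    return {key: dict(cols) for key, cols in table.items()}


by_quarter = aggregate(rows, "year", "quarter")  # drill down: year -> quarter
by_year = aggregate(rows, "year")                # roll up: quarter -> year

slice_2024 = [r for r in rows if r["year"] == 2024]  # slice: fix one dimension
dice = [r for r in rows                              # dice: restrict several
        if r["region"] == "North" and r["product"] == "Bike"]

region_by_year = pivot(rows, "region", "year")   # regions on rows, years on columns
year_by_region = pivot(rows, "year", "region")   # pivoted: axes swapped

print(by_year)         # {(2024,): 210, (2025,): 140}
print(region_by_year)  # {'North': {2024: 130, 2025: 120}, 'South': {2024: 80, 2025: 20}}
```

A cube engine would typically pre-compute aggregates such as by_year and by_quarter so that these navigations do not have to rescan the detail rows.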


☁️ Modern Cloud & Big Data Terminology

  • 🕸️ Data Mesh: Decentralized architecture organizing data by business domains, treating data as a product.

  • 🧵 Data Fabric: Architecture providing a unified layer for data management across disparate environments.

  • 🖥️ Data Warehouse Appliances: Pre-configured hardware/software bundles (e.g., Teradata, Netezza).

  • ⚡ MPP (Massively Parallel Processing): Many processors or nodes, each working on its own portion of the data, process a single query simultaneously and combine their partial results; used by cloud DWHs like Snowflake.

  • 🐊 Data Swamp: A deteriorated, unmanaged data lake with little value.

  • 👻 Serverless: Cloud execution model where the provider provisions and scales machine resources on demand, so users do not manage servers (e.g., BigQuery).
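The idea behind MPP, splitting the data, aggregating each part independently, and merging the partial results, can be sketched with a thread pool. This is only an analogy: real MPP systems spread the work across separate machines, and threads in CPython do not give true CPU parallelism. The sample rows, the round-robin distribution, and the function names are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

rows = [
    ("North", 100), ("South", 80), ("North", 30),
    ("East", 50), ("South", 20), ("East", 5),
]


def partition(rows, n_nodes):
    """Distribute rows round-robin, standing in for a distribution key."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def local_aggregate(chunk):
    """Work done by one 'node': aggregate only its own slice of the data."""
    totals = Counter()
    for region, amount in chunk:
        totals[region] += amount
    return totals


def mpp_sum(rows, n_nodes=3):
    """Scatter the rows, aggregate each partition concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_aggregate, partition(rows, n_nodes)))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)


totals = mpp_sum(rows)
print(totals)  # {'North': 130, 'South': 100, 'East': 55}
```

The merge step works here because a sum is additive: partial sums can be combined in any order. Measures such as averages need their components (sum and count) carried through the merge instead.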


📚 General & Administrative Terms

  • 🏷️ Metadata: "Data about data." Describes structure, source, and characteristics.

  • 👣 Data Lineage: A record, often shown visually, of data's origin, transformations, and movement through systems.

  • 📖 Data Catalog: Centralized inventory of data assets helping users find and understand data.

  • 👑 Master Data Management (MDM): Managing critical shared data (customer, product) so the organization has a single, consistent point of reference.

  • ✅ Data Quality: The accuracy, completeness, consistency, and timeliness of data.

  • 📉 BI Tool: Software for creating reports/dashboards (Tableau, Power BI).

  • ❓ Ad-hoc Query: A non-standard, one-time query created by a user.

  • 💾 Materialized View: A pre-computed view stored physically to improve performance for complex queries. Because the result is stored, it must be refreshed when the underlying data changes.
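To illustrate the materialized view entry: SQLite has no native materialized views, so the sketch below emulates one with a summary table that is rebuilt on demand. Warehouses that support materialized views handle storage and refresh for you; the table names and the refresh_mv helper here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (region TEXT, amount REAL);
INSERT INTO fact_sales VALUES ('North', 100), ('South', 80), ('North', 30);
""")


def refresh_mv(conn):
    """Full refresh: rebuild the stored, pre-computed aggregate."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_sales_by_region;
    CREATE TABLE mv_sales_by_region AS
        SELECT region, SUM(amount) AS total
        FROM fact_sales
        GROUP BY region;
    """)


def read_mv(conn):
    """Queries read the small stored result instead of scanning the fact table."""
    return conn.execute(
        "SELECT region, total FROM mv_sales_by_region ORDER BY region"
    ).fetchall()


refresh_mv(conn)
before = read_mv(conn)

# New data arrives; the stored result is now stale until the next refresh.
conn.execute("INSERT INTO fact_sales VALUES ('South', 20)")
stale = read_mv(conn)

refresh_mv(conn)
after = read_mv(conn)

print(before)  # [('North', 130.0), ('South', 80.0)]
print(stale)   # [('North', 130.0), ('South', 80.0)]
print(after)   # [('North', 130.0), ('South', 100.0)]
```

The stale read shows the trade-off: queries get faster because the aggregation is already done, but results are only as fresh as the last refresh.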
