Translate

Thursday, 27 November 2025

Data analysis or analyst Terminology

 

๐Ÿง  I. Core Concepts & Processes

  • ๐Ÿ” Data Analysis: The process of inspecting, cleaning, transforming, and modeling data to discover useful information, inform conclusions, and support decision-making.

  • ๐ŸŒ Data Analytics: A broader term that encompasses the entire management of data, including analysis, tools, methods, and processes.

  • ๐Ÿงช Data Science: An interdisciplinary field using scientific methods, algorithms, and systems to extract insights from data. It often involves advanced programming and statistical modeling.

  • ๐Ÿ’ผ Business Intelligence (BI): The infrastructure for collecting, storing, and analyzing business data to optimize decision-making.

  • ๐Ÿ“Š Data-Driven Decision Making (DDDM): Basing decisions on data analysis rather than purely on intuition or observation.


๐Ÿ“‚ II. Types of Data

  • ๐Ÿงฑ Structured Data: Organized in a predefined format (e.g., SQL databases, Excel rows/cols).

  • ๐Ÿ“„ Unstructured Data: No predefined format (e.g., emails, videos, social media posts).

  • ๐Ÿท️ Semi-structured Data: Contains tags or markers but no formal structure (e.g., JSON, XML).

  • ๐Ÿ“ Quantitative Data: Numerical data that can be measured (e.g., height, sales figures).

  • ๐ŸŽจ Qualitative Data: Descriptive, non-numerical data (e.g., interview transcripts, colors).

  • ๐Ÿ‹ Big Data: Extremely large datasets characterized by the 3 V's (or 5 V's):

    • ๐Ÿ“ฆ Volume: The sheer amount of data.

    • ๐Ÿš€ Velocity: The speed of data generation/processing.

    • ๐Ÿงฉ Variety: The different types of data.

    • Veracity: The quality and accuracy.

    • ๐Ÿ’Ž Value: The usefulness of the data.


๐Ÿ—„️ III. Data Management & Storage

  • ๐Ÿ›ข️ Database: An organized collection of data stored electronically.

  • ๐Ÿ—️ SQL (Structured Query Language): The language used to communicate with databases.

  • ๐Ÿƒ NoSQL: Databases designed for unstructured data (e.g., MongoDB).

  • ๐Ÿญ Data Warehouse: A central repository of integrated data used for reporting (e.g., Snowflake, BigQuery).

  • ๐Ÿ’ง Data Lake: A vast pool of raw data stored in its native format.

  • ๐Ÿ”„ ETL (Extract, Transform, Load): Extracting data, transforming it, and loading it into a warehouse.

  • ๐Ÿ“ฅ ELT (Extract, Load, Transform): Loading data first, then transforming it within the target system.

  • ๐Ÿช Data Mart: A subset of a data warehouse dedicated to a specific team.


๐Ÿงน IV. Data Cleaning & Preparation

  • ๐Ÿงผ Data Cleansing/Wrangling: Detecting and correcting corrupt or inaccurate records. Often the most time-consuming step.

  • ❓ Missing Data: Data points that are not recorded.

  • ๐Ÿงฉ Imputation: Replacing missing data with substituted values.

  • ๐Ÿ“‰ Outlier: A data point that differs significantly from others (can be an error or a finding).

  • ⚖️ Normalization: Scaling numerical data to a standard range (e.g., 0 to 1).

  • ๐Ÿ“ Standardization: Rescaling data to have a mean of 0 and a standard deviation of 1.


๐Ÿงฎ V. Statistics & Mathematics

  • ๐Ÿ“ Descriptive Statistics: Summarizes the main features of a dataset.

    • ๐ŸŽฏ Mean: The average.

    • ↔️ Median: The middle value.

    • ๐Ÿ” Mode: The most frequent value.

    • ๐Ÿ“ถ Standard Deviation: Measure of variation/dispersion.

  • ๐Ÿ•ต️ Inferential Statistics: Using a sample to make inferences about a population.

    • ๐ŸŒ Population: The entire set.

    • ๐Ÿงช Sample: A subset of the population.

  • ✅ Hypothesis Testing: Testing a hypothesis about a population using sample data.

  • ๐ŸŽฒ P-value: Probability of results occurring by chance (≤ 0.05 usually means significant).

  • ๐Ÿ”— Correlation: A measure of the relationship between two variables (Correlation ≠ Causation).

  • ๐Ÿ“‰ Regression Analysis: Estimating relationships between variables (e.g., Linear, Logistic).


๐Ÿ”ฌ VI. Data Analysis & Modeling

  • ๐Ÿ”Ž Exploratory Data Analysis (EDA): Analyzing datasets to summarize characteristics, often visually.

  • ๐Ÿฉบ Diagnostic Analysis: Understanding why events happened.

  • ๐Ÿ”ฎ Predictive Analysis: Using algorithms to identify the likelihood of future outcomes.

  • ๐Ÿ’Š Prescriptive Analysis: Recommending actions to affect outcomes.

  • ๐Ÿค– Machine Learning (ML): Computers learning without explicit programming.

    • ๐Ÿ‘จ‍๐Ÿซ Supervised Learning: Trained on labeled data.

    • ๐Ÿ•ถ️ Unsupervised Learning: Finding patterns in unlabeled data.


๐Ÿ“Š VII. Data Visualization

  • ๐Ÿ–ฅ️ Dashboard: A visual display of key information on a single screen.

  • ๐Ÿ“ Metric: A standard of measurement.

  • ๐ŸŽฏ KPI (Key Performance Indicator): Measurable value demonstrating business objective success.

  • ๐Ÿ“ˆ Charts & Graphs:

    • Bar Chart: Compares categories.

    • Line Chart: Trends over time.

    • Histogram: Distribution of a variable.

    • Scatter Plot: Relationship between two variables.

    • Box Plot: Distribution based on quartiles.


๐Ÿ‘ฅ VIII. Roles & Responsibilities

  • ๐Ÿง Data Analyst: Interprets data to help make decisions (Reporting, Viz, Descriptive analysis).

  • ๐Ÿค Business Analyst: Bridges the gap between IT and business needs.

  • ๐Ÿ‘จ‍๐Ÿ”ฌ Data Scientist: Uses advanced ML/Stats to build predictive models.

  • ๐Ÿ‘ท Data Engineer: Builds the infrastructure (pipelines, warehouses) for analysts.

  • ๐Ÿ› ️ BI Developer: Specializes in designing dashboards and BI tools.


๐Ÿ› ️ IX. Tools & Technologies

  • ๐Ÿ Programming: Python (Pandas, NumPy), R.

  • ๐Ÿ“— Spreadsheets: Excel, Google Sheets.

  • ๐Ÿ“Š BI Tools: Tableau, Power BI, Looker.

  • ๐Ÿ›ข️ Databases: SQL, PostgreSQL, MySQL.

  • ๐Ÿ˜ Big Data: Hadoop, Spark, Kafka.

Data Warehouse Concepts

 

๐Ÿ—️ Core Data Warehouse Concepts

  • ๐Ÿข Data Warehouse (DWH): A central repository of integrated, historical data designed for query and analysis. Its primary purpose is to support business intelligence.

  • ๐Ÿช Data Mart: A subset of a data warehouse tailored to serve a specific business line (e.g., Sales, Finance). It contains a focused collection of data.

  • ๐Ÿ”„ Operational Data Store (ODS): A database for integrating data from multiple sources for operational reporting. It is more current and volatile, often acting as a staging area.

  • ๐ŸŒŠ Data Lake: A vast repository holding massive amounts of raw data in its native format (structured and unstructured) until needed. No predefined schema required.

  • ๐Ÿก Data Lakehouse: A modern architecture combining the flexibility of data lakes with the data management and ACID transactions of data warehouses.

  • ๐Ÿšง Staging Area: A temporary storage area for data extraction, cleansing, and transformation before loading into the warehouse. Not typically queried by end-users.

  • ๐Ÿ“Š Business Intelligence (BI): Technologies and practices for collecting, analyzing, and presenting business information to support decision-making.


๐Ÿ“ Data Modeling & Architecture

  • ๐Ÿ—บ️ Schema: The logical description of the entire database, including table structures and relationships.

  • ⭐ Star Schema: The simplest schema consisting of a central fact table connected to multiple dimension tables in a star shape.

  • ❄️ Snowflake Schema: A variation of the star schema where dimension tables are normalized into multiple related tables. Reduces redundancy but increases complexity.

  • ๐ŸŒŒ Galaxy Schema (Fact Constellation): A complex schema with multiple fact tables sharing dimension tables.

  • ๐Ÿ” Data Vault Modeling: A hybrid method for long-term historical storage, composed of Hubs, Links, and Satellites. Resilient to change and highly scalable.

  • ๐Ÿท️ Dimension: A category of information (the "who, what, where, when"). Provides context to facts (e.g., Customer, Product).

  • ๐Ÿ”ข Fact: A measurement or metric, typically numerical (e.g., Sales Amount, Quantity).

  • ๐ŸŸข Fact Table: The central table containing measurements (facts) and foreign keys.

  • ๐Ÿ”ต Dimension Table: A table storing descriptive attributes related to a business dimension (e.g., Customer Name, City).

  • ๐ŸŒพ Grain (Granularity): The level of detail in a fact table (e.g., "one row per line item").

  • ๐Ÿ”‘ Surrogate Key: A system-generated unique identifier (integer) used as a primary key, independent of the source system.

  • ๐Ÿ†” Natural Key (Business Key): An identifier from the operational source system (e.g., CustomerID).

  • ๐Ÿ•ฐ️ Slowly Changing Dimension (SCD): Techniques to manage data changes over time.

    • Type 1: Overwrite old value (No history).

    • Type 2: Add new row (Preserve history).

    • Type 3: Add new column (Limited history).

  • ๐Ÿค Conformed Dimension: A dimension that represents the same thing across different fact tables (e.g., Date).


๐Ÿš€ ETL & Data Integration

  • ๐Ÿšš ETL (Extract, Transform, Load): The process of moving data from source to warehouse.

    • Extract: Reading data.

    • Transform: Cleaning and structuring data.

    • Load: Writing data to the target.

  • ☁️ ELT (Extract, Load, Transform): Loading data into the target system before transformation. Common in modern cloud platforms (Snowflake, BigQuery).

  • ๐Ÿšฐ Data Pipeline: A system moving data from one place to another; may or may not involve heavy transformation.

  • ๐Ÿ“ธ Change Data Capture (CDC): Identifying and capturing changes (inserts, updates, deletes) in a source database to apply them to the warehouse in near real-time.

  • ๐Ÿงน Data Cleansing: Detecting and correcting corrupt or inaccurate records.

  • ๐Ÿ” Data Profiling: Examining source data to collect statistics and assess quality.

  • ๐Ÿ›ก️ Data Governance: Managing data availability, usability, integrity, and security across an enterprise.


๐ŸŽฏ Key Performance Indicators & Metrics

  • ๐Ÿ“ˆ KPI (Key Performance Indicator): A measurable value demonstrating how effectively a company achieves objectives.

  • ๐Ÿ“ Measure (Metric): A numerical value that can be aggregated.

  • ➕ Additive Measure: Can be summed across all dimensions (e.g., Sales Amount).

  • ๐ŸŒ— Semi-Additive Measure: Can be summed across some dimensions but not all (e.g., Account Balance).

  • ๐Ÿšซ Non-Additive Measure: Cannot be summed (e.g., Ratios, Percentages).


๐ŸงŠ OLAP & Querying

  • ๐Ÿง  OLAP (Online Analytical Processing): Technology for interactive, multidimensional data analysis.

  • ๐Ÿงพ OLTP (Online Transactional Processing): Systems managing transaction-oriented applications (e.g., ERP, CRM).

  • ๐Ÿ“ฆ Cube: A multi-dimensional array of pre-aggregated data for fast querying.

  • ↕️ Drill Down / Roll Up: Navigating data hierarchy from summary to detail (Drill Down) or detail to summary (Roll Up).

  • ๐Ÿฐ Slice and Dice: Viewing data from different perspectives by selecting subsets.

  • ๐Ÿ”„ Pivot: Changing the dimensional orientation of a report.

  • ⌨️ MDX / DAX: Query languages for OLAP cubes (MDX) and Power BI/Analysis Services (DAX).


☁️ Modern Cloud & Big Data Terminology

  • ๐Ÿ•ธ️ Data Mesh: Decentralized architecture organizing data by business domains, treating data as a product.

  • ๐Ÿงต Data Fabric: Architecture providing a unified layer for data management across disparate environments.

  • ๐Ÿ–ฅ️ Data Warehouse Appliances: Pre-configured hardware/software bundles (e.g., Teradata, Netezza).

  • ⚡ MPP (Massively Parallel Processing): Multiple processors working simultaneously on a task; used by cloud DWHs like Snowflake.

  • ๐ŸŠ Data Swamp: A deteriorated, unmanaged data lake with little value.

  • ๐Ÿ‘ป Serverless: Cloud execution model where the provider manages machine resources dynamically (e.g., BigQuery).


๐Ÿ“š General & Administrative Terms

  • ๐Ÿท️ Metadata: "Data about data." Describes structure, source, and characteristics.

  • ๐Ÿ‘ฃ Data Lineage: Visual representation of data's origin and movement through systems.

  • ๐Ÿ“– Data Catalog: Centralized inventory of data assets helping users find and understand data.

  • ๐Ÿ‘‘ Master Data Management (MDM): Managing critical data (customer, product) for a single point of reference.

  • ✅ Data Quality: The accuracy, completeness, consistency, and timeliness of data.

  • ๐Ÿ“‰ BI Tool: Software for creating reports/dashboards (Tableau, Power BI).

  • ❓ Ad-hoc Query: A non-standard, one-time query created by a user.

  • ๐Ÿ’พ Materialized View: A pre-computed view stored physically to improve performance for complex queries.