Data Modelling Fundamentals
How do you structure and represent data efficiently to unlock valuable insights? By understanding the principles of data modelling and the terminology associated with it.
This chapter covers the foundational concepts you need to design and work with data models, empowering you to make informed decisions and extract meaningful insights from complex datasets.
What is Data Modelling?
Data modelling is the process of creating a visual representation of the structure, relationships, and organization of data within a system.
The goal is a database design that accurately represents the data requirements of an organization or system, defining how data should be stored, processed, and accessed to meet the needs of an application or business process.
Data Modelling Techniques
There are several data modelling techniques, each with its own strengths and weaknesses:
Entity-Relationship (ER) Modelling
ER modelling represents entities and the relationships between them:
- Entities represent real-world objects or concepts
- Attributes describe the characteristics of those entities
- Relationships are represented as lines connecting entities
Best for: Transactional systems (OLTP) requiring a normalized schema
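The entities, attributes, and relationships above map directly onto relational tables. Here is a minimal sketch using Python's built-in sqlite3 module; the Customer/Order entities and their column names are illustrative assumptions, not taken from any specific system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in

# Entity: Customer, with attributes customer_id and name.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")

# Entity: Order. The foreign key implements a one-to-many
# relationship: one customer places many orders.
conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 25.0)")
conn.execute("INSERT INTO customer_order VALUES (11, 1, 40.0)")

# Traversing the relationship: orders per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customer c
    JOIN customer_order o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Ada', 2)]
```

The relationship line in an ER diagram becomes a foreign key column in the table on the "many" side.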
Third Normal Form (3NF) Modelling
3NF modelling produces a database schema that eliminates redundancy and improves data consistency:
- Break down data into smaller, atomic components
- Organize components into normalized tables
- Minimize data duplication
Best for: OLTP systems requiring a highly normalized schema
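The payoff of normalization is that each fact lives in exactly one place. The sketch below (illustrative product/order-line tables, not from a real system) shows why: before 3NF, a product's price would be repeated on every order line, so a price correction would have to touch many rows; after 3NF it touches exactly one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: product attributes stored once, order lines
    -- reference them by key instead of duplicating them.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        price      REAL NOT NULL
    );
    CREATE TABLE order_line (
        line_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'Widget', 9.99);
    INSERT INTO order_line VALUES (100, 1, 3);
    INSERT INTO order_line VALUES (101, 1, 5);
""")

# A price correction updates a single row; every order line
# automatically sees the new value through the join.
conn.execute("UPDATE product SET price = 8.99 WHERE product_id = 1")

total = conn.execute("""
    SELECT SUM(ol.quantity * p.price)
    FROM order_line ol
    JOIN product p USING (product_id)
""").fetchone()[0]
print(round(total, 2))  # 71.92
```

In an unnormalized design the same correction would be a multi-row UPDATE, with the risk of missing rows and leaving the data inconsistent.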
Dimensional Modelling
A technique for designing schemas optimized for efficient and effective analysis:
- Organize data into facts (numerical measurements) and dimensions (analysis perspectives)
- Use denormalized schemas optimized for complex queries
Best for: OLAP systems requiring a denormalized schema for analytics
Key characteristics:
| Concept | Description |
|---|---|
| Star Schema | Central fact table surrounded by dimension tables |
| Snowflake Schema | Dimension tables further normalized into multiple levels |
| Denormalized Structures | Fewer joins, faster queries |
| Business Focus | Aligned with key business entities and KPIs |
Fact Tables vs Dimension Tables
| Aspect | Fact Tables | Dimension Tables |
|---|---|---|
| Content | Numeric measures (sales, quantity, profit) | Descriptive attributes (product name, customer, date) |
| Keys | Foreign keys to dimensions | Primary keys referenced by facts |
| Granularity | Fine-grained transaction/event data | Categorical/qualitative data |
| Size | Large (many rows) | Smaller (fewer rows) |
| Examples | Sales Amount, Transactions | Product, Customer, Time, Geography |
Data Vault Modelling
A technique that provides a flexible and scalable architecture for capturing and integrating data from many sources, used mainly in enterprise data warehousing scenarios.
Types of Data Models
Data models progress from abstract to concrete:
| Model Type | Purpose | Detail Level |
|---|---|---|
| Conceptual | Communicate overall structure to stakeholders | High-level, no technical details |
| Logical | Define structure without physical implementation | Adds data types, constraints, relationships |
| Physical | Technical implementation in specific DBMS | Table structures, indexes, storage |
Remember: Conceptual → Logical → Physical (highest to lowest abstraction)
Key Data Modelling Terms
Data Formats
| Format | Description | Examples |
|---|---|---|
| Structured | Follows a schema, every row conforms | Relational DBs, spreadsheets |
| Semi-Structured | Has organizing markers (tags, key-value pairs), but records may differ in fields | XML, JSON, logs |
| Unstructured | No predefined structure | Images, videos, documents |
Entity-Relationship Concepts
- Entity: An object or concept relevant to the system (e.g., Customer, Product)
- Attribute: A property of an entity (e.g., name, age, price)
- Relationship: How entities are connected (one-to-one, one-to-many, many-to-many)
- Primary Key: Unique identifier for a record
- Foreign Key: Attribute referencing another table’s primary key
Cardinality
In data modelling, cardinality describes how many instances of one entity can relate to instances of another:
- 1:1 — One-to-one
- 1:N — One-to-many
- N:1 — Many-to-one
- N:M — Many-to-many
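Of these, N:M is the one that cannot be expressed with a single foreign key; it is implemented with a junction table holding a foreign key to each side. A minimal sketch, using an assumed student/course example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: each row is one (student, course) pairing,
    -- so a student can take many courses and a course can have
    -- many students.
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course  VALUES (10, 'SQL'), (11, 'Modelling');
    INSERT INTO enrollment VALUES (1, 10), (1, 11), (2, 10);
""")

rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM enrollment e
    JOIN student s USING (student_id)
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)  # [('Ada', 2), ('Grace', 1)]
```

The composite primary key on the junction table also prevents duplicate enrollments for the same pair.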
Normalization vs Denormalization
| Aspect | Normalization | Denormalization |
|---|---|---|
| Goal | Eliminate redundancy, improve integrity | Improve read performance |
| Tables | Many smaller tables | Fewer wider tables |
| Joins | More joins required | Fewer joins |
| Best for | WRITE operations (OLTP) | READ operations (OLAP) |
| Data redundancy | Minimal | Intentional duplication |
When to Normalize
- Transactional systems where data integrity is critical
- Frequent INSERT, UPDATE, DELETE operations
- Storage optimization is important
When to Denormalize
- Analytical systems where query speed is critical
- Read-heavy workloads
- Reporting and dashboards
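As a concrete illustration of the trade-off, a denormalized reporting table duplicates descriptive attributes on every row so that reads need no joins at all. The table name and columns below are assumptions for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: customer_name is intentionally repeated on each
    -- order row, trading redundancy for join-free reads.
    CREATE TABLE sales_report (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        amount        REAL
    );
    INSERT INTO sales_report VALUES (1, 'Ada', 50.0), (2, 'Ada', 30.0);
""")

# Dashboard-style query: a single-table scan, no joins.
top = conn.execute("""
    SELECT customer_name, SUM(amount)
    FROM sales_report
    GROUP BY customer_name
""").fetchall()
print(top)  # [('Ada', 80.0)]
```

The cost is the write-side anomaly from the normalization section: renaming a customer now means updating every one of their rows.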
Star Schema vs Snowflake Schema
Star Schema ⭐
A central fact table directly connected to denormalized dimension tables:
[Date Dim]
|
[Product Dim]—[Sales Fact]—[Customer Dim]
|
[Store Dim]
Pros: Fewer joins, faster queries, simple to understand
Cons: Data redundancy in dimensions
Snowflake Schema ❄️
Dimension tables are normalized into sub-dimensions:
[Year]—[Month]—[Date Dim]
|
[Brand]—[Category]—[Product Dim]—[Sales Fact]—[Customer Dim]—[City]—[Country]
Pros: Less redundancy, storage efficient
Cons: More joins, complex queries, slower performance
For analytics, Star Schema is preferred due to simpler queries and faster performance.
Granularity
The level of detail in your data model:
| Granularity | Description | Example |
|---|---|---|
| High (Fine) | Detailed, individual records | Each transaction, each click |
| Low (Coarse) | Aggregated, summarized | Daily totals, category summaries |
Choose granularity based on:
- Analysis requirements
- Storage constraints
- Query performance needs
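Coarse-grained views can always be derived from fine-grained data, but not the reverse, which is why storing the finest granularity you can afford keeps your options open. A minimal sketch (illustrative transaction table) rolling fine-grained rows up to daily totals:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fine granularity: one row per individual transaction.
    CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, day TEXT, amount REAL);
    INSERT INTO txn VALUES
        (1, '2024-01-01', 10.0),
        (2, '2024-01-01', 15.0),
        (3, '2024-01-02', 20.0);
""")

# Coarse granularity: daily totals derived from the detailed rows.
daily = conn.execute(
    "SELECT day, SUM(amount) FROM txn GROUP BY day ORDER BY day"
).fetchall()
print(daily)  # [('2024-01-01', 25.0), ('2024-01-02', 20.0)]
```

Had only the daily totals been stored, per-transaction questions (largest single purchase, time-of-day patterns) would be unanswerable.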
Key Takeaways
- ER/3NF modeling → Transactional systems (OLTP), normalized schemas
- Dimensional modeling → Analytical systems (OLAP), denormalized schemas
- Conceptual → Logical → Physical: abstract to concrete
- Star schema → Preferred for analytics (simple, fast)
- Snowflake schema → When storage efficiency matters more than query speed
- Normalize for writes, denormalize for reads