JAlcocerTech E-books

Data Modelling Fundamentals

How do you structure and represent data efficiently to unlock valuable insights? By understanding the principles of data modelling and the terminology associated with it.

This chapter covers the foundational concepts you need to design and work with data models, empowering you to make informed decisions and extract meaningful insights from complex datasets.


What is Data Modelling?

Data modeling is the process of creating a visual representation of the relationships, structure, and organization of data within a system.

The goal is to design a database structure that accurately represents the data requirements of an organization or system, defining how data should be stored, processed, and accessed to meet the needs of an application or business process.


Data Modelling Techniques

There are several different data modeling techniques, each with its own strengths and weaknesses:

Entity-Relationship (ER) Modeling

ER modeling represents entities and their relationships to each other:

  • Entities represent real-world objects or concepts
  • Attributes describe the characteristics of those entities
  • Relationships are represented as lines connecting entities

Best for: Transactional systems (OLTP) requiring a normalized schema

Third Normal Form (3NF) Modeling

3NF modeling produces a database schema that eliminates redundancy and improves data consistency:

  • Break down data into smaller, atomic components
  • Organize components into normalized tables
  • Minimize data duplication

Best for: OLTP systems requiring a highly normalized schema

Dimensional Modeling

A technique for designing schemas that emphasize efficient and effective analysis:

  • Organize data into facts (numerical measurements) and dimensions (analysis perspectives)
  • Use denormalized schemas optimized for complex queries

Best for: OLAP systems requiring a denormalized schema for analytics

Key characteristics:

| Concept | Description |
| --- | --- |
| Star Schema | Central fact table surrounded by dimension tables |
| Snowflake Schema | Dimension tables further normalized into multiple levels |
| Denormalized Structures | Fewer joins, faster queries |
| Business Focus | Aligned with key business entities and KPIs |

Fact Tables vs Dimension Tables

| Aspect | Fact Tables | Dimension Tables |
| --- | --- | --- |
| Content | Numeric measures (sales, quantity, profit) | Descriptive attributes (product name, customer, date) |
| Keys | Foreign keys to dimensions | Primary keys referenced by facts |
| Granularity | Fine-grained transaction/event data | Categorical/qualitative data |
| Size | Large (many rows) | Smaller (fewer rows) |
| Examples | Sales Amount, Transactions | Product, Customer, Time, Geography |
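The fact/dimension split can be illustrated with plain Python data. In this hypothetical sketch, fact rows carry numeric measures plus foreign keys, while the dimension rows carry the descriptive attributes used to slice them; all names and values are made up:

```python
# Dimension table: descriptive attributes, keyed by a surrogate key
product_dim = {101: {"name": "Laptop", "category": "Electronics"},
               102: {"name": "Desk",   "category": "Furniture"}}

# Fact table: fine-grained, one row per transaction, measures + keys
sales_fact = [
    {"product_key": 101, "quantity": 2, "amount": 1998.0},
    {"product_key": 101, "quantity": 1, "amount": 999.0},
    {"product_key": 102, "quantity": 3, "amount": 450.0},
]

# Roll the measure up by a dimension attribute
revenue_by_category = {}
for row in sales_fact:
    cat = product_dim[row["product_key"]]["category"]
    revenue_by_category[cat] = revenue_by_category.get(cat, 0.0) + row["amount"]

print(revenue_by_category["Electronics"])  # 2997.0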

Data Vault Modeling

A technique focused on providing a flexible and scalable architecture for capturing and integrating data from various sources. Used in enterprise data warehousing scenarios.


Types of Data Models

Data models progress from abstract to concrete:

| Model Type | Purpose | Detail Level |
| --- | --- | --- |
| Conceptual | Communicate overall structure to stakeholders | High-level, no technical details |
| Logical | Define structure without physical implementation | Adds data types, constraints, relationships |
| Physical | Technical implementation in a specific DBMS | Table structures, indexes, storage |

Remember: Conceptual → Logical → Physical (highest to lowest abstraction)


Key Data Modelling Terms

Data Formats

| Format | Description | Examples |
| --- | --- | --- |
| Structured | Follows a fixed schema; every row conforms | Relational DBs, spreadsheets |
| Semi-Structured | Has organizational markers, but records may differ in their properties | XML, JSON, logs |
| Unstructured | No predefined structure | Images, videos, documents |

Entity-Relationship Concepts

  • Entity: An object or concept relevant to the system (e.g., Customer, Product)
  • Attribute: A property of an entity (e.g., name, age, price)
  • Relationship: How entities are connected (one-to-one, one-to-many, many-to-many)
  • Primary Key: Unique identifier for a record
  • Foreign Key: Attribute referencing another table’s primary key

Cardinality

In relationship terms, cardinality describes how many instances of one entity can be associated with instances of another (in a column context, the term instead refers to the number of unique values relative to total rows):

  • 1:1 — One-to-one
  • 1:N — One-to-many
  • N:1 — Many-to-one
  • N:M — Many-to-many

Normalization vs Denormalization

| Aspect | Normalization | Denormalization |
| --- | --- | --- |
| Goal | Eliminate redundancy, improve integrity | Improve read performance |
| Tables | Many smaller tables | Fewer, wider tables |
| Joins | More joins required | Fewer joins |
| Best for | WRITE operations (OLTP) | READ operations (OLAP) |
| Data redundancy | Minimal | Intentional duplication |

When to Normalize

  • Transactional systems where data integrity is critical
  • Frequent INSERT, UPDATE, DELETE operations
  • Storage optimization is important

When to Denormalize

  • Analytical systems where query speed is critical
  • Read-heavy workloads
  • Reporting and dashboards

Star Schema vs Snowflake Schema

Star Schema ⭐

A central fact table directly connected to denormalized dimension tables:

        [Date Dim]
             |
[Product Dim]—[Sales Fact]—[Customer Dim]
             |
        [Store Dim]

Pros: Fewer joins, faster queries, simple to understand
Cons: Data redundancy in dimensions

Snowflake Schema ❄️

Dimension tables are normalized into sub-dimensions:

[Year]—[Month]—[Date Dim]
                    |
[Brand]—[Category]—[Product Dim]—[Sales Fact]—[Customer Dim]—[City]—[Country]

Pros: Less redundancy, storage efficient
Cons: More joins, complex queries, slower performance

For analytics, Star Schema is preferred due to simpler queries and faster performance.


Granularity

The level of detail in your data model:

| Granularity | Description | Example |
| --- | --- | --- |
| High (Fine) | Detailed, individual records | Each transaction, each click |
| Low (Coarse) | Aggregated, summarized | Daily totals, category summaries |

Choose granularity based on:

  • Analysis requirements
  • Storage constraints
  • Query performance needs

Key Takeaways

  1. ER/3NF modeling → Transactional systems (OLTP), normalized schemas
  2. Dimensional modeling → Analytical systems (OLAP), denormalized schemas
  3. Conceptual → Logical → Physical: abstract to concrete
  4. Star schema → Preferred for analytics (simple, fast)
  5. Snowflake schema → When storage efficiency matters more than query speed
  6. Normalize for writes, denormalize for reads