Data Modelling Fundamentals
How do you structure and represent data efficiently to unlock valuable insights? By understanding the principles of data modelling and the terminology associated with it.
This chapter covers the foundational concepts you need to design and work with data models, empowering you to make informed decisions and extract meaningful insights from complex datasets.
What is Data Modelling?
Data modelling is the process of creating a visual representation of the structure, relationships, and organization of data within a system.
The goal is a database design that accurately represents the data requirements of an organization or system, defining how data should be stored, processed, and accessed to meet the needs of an application or business process.
Data Modelling Techniques
There are several data modelling techniques, each with its own strengths and weaknesses:
Entity-Relationship (ER) Modelling
ER modelling represents entities and the relationships between them:
- Entities represent real-world objects or concepts
- Attributes describe the characteristics of those entities
- Relationships are represented as lines connecting entities
Best for: Transactional systems (OLTP) requiring a normalized schema
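The entities, attributes, and relationships above map directly onto relational tables. Here is a minimal sketch using Python's built-in sqlite3 module; the Customer/Order entities and their column names are illustrative assumptions, not taken from any specific system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in

# Entity: Customer, with attributes customer_id and name.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")

# Entity: Order. The foreign key implements a one-to-many
# relationship: one customer places many orders.
conn.execute("""
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 25.0)")
conn.execute("INSERT INTO customer_order VALUES (11, 1, 40.0)")

# Traversing the relationship: orders per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customer c
    JOIN customer_order o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Ada', 2)]
```

The relationship line in an ER diagram becomes a foreign key column in the table on the "many" side.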
Third Normal Form (3NF) Modelling
3NF modelling produces a database schema that eliminates redundancy and improves data consistency:
- Break down data into smaller, atomic components
- Organize components into normalized tables
- Minimize data duplication
Best for: OLTP systems requiring a highly normalized schema
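The payoff of normalization is that each fact lives in exactly one place. The sketch below (illustrative product/order-line tables, not from a real system) shows why: before 3NF, a product's price would be repeated on every order line, so a price correction would have to touch many rows; after 3NF it touches exactly one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: product attributes stored once, order lines
    -- reference them by key instead of duplicating them.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        price      REAL NOT NULL
    );
    CREATE TABLE order_line (
        line_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'Widget', 9.99);
    INSERT INTO order_line VALUES (100, 1, 3);
    INSERT INTO order_line VALUES (101, 1, 5);
""")

# A price correction updates a single row; every order line
# automatically sees the new value through the join.
conn.execute("UPDATE product SET price = 8.99 WHERE product_id = 1")

total = conn.execute("""
    SELECT SUM(ol.quantity * p.price)
    FROM order_line ol
    JOIN product p USING (product_id)
""").fetchone()[0]
print(round(total, 2))  # 71.92
```

In an unnormalized design the same correction would be a multi-row UPDATE, with the risk of missing rows and leaving the data inconsistent.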
Dimensional Modelling
A technique for designing schemas optimized for efficient and effective analysis:
- Organize data into facts (numerical measurements) and dimensions (analysis perspectives)
- Use denormalized schemas optimized for complex queries
Best for: OLAP systems requiring a denormalized schema for analytics
Key characteristics:
| Concept | Description |
|---|---|
| Star Schema | Central fact table surrounded by dimension tables |
| Snowflake Schema | Dimension tables further normalized into multiple levels |
| Denormalized Structures | Fewer joins, faster queries |
| Business Focus | Aligned with key business entities and KPIs |
Fact Tables vs Dimension Tables
| Aspect | Fact Tables | Dimension Tables |
|---|---|---|
| Content | Numeric measures (sales, quantity, profit) | Descriptive attributes (product name, customer, date) |
| Keys | Foreign keys to dimensions | Primary keys referenced by facts |
| Granularity | Fine-grained transaction/event data | Categorical/qualitative data |
| Size | Large (many rows) | Smaller (fewer rows) |
| Examples | Sales Amount, Transactions | Product, Customer, Time, Geography |
Data Vault Modelling
A technique that provides a flexible and scalable architecture for capturing and integrating data from many sources, used mainly in enterprise data warehousing scenarios.
Types of Data Models
Data models progress from abstract to concrete:
| Model Type | Purpose | Detail Level |
|---|---|---|
| Conceptual | Communicate overall structure to stakeholders | High-level, no technical details |
| Logical | Define structure without physical implementation | Adds data types, constraints, relationships |
| Physical | Technical implementation in specific DBMS | Table structures, indexes, storage |
Remember: Conceptual → Logical → Physical (highest to lowest abstraction)
Key Data Modelling Terms
Data Formats
| Format | Description | Examples |
|---|---|---|
| Structured | Follows a schema, every row conforms | Relational DBs, spreadsheets |
| Semi-Structured | Has organizing markers (tags, key-value pairs), but records may differ in fields | XML, JSON, logs |
| Unstructured | No predefined structure | Images, videos, documents |
Entity-Relationship Concepts
- Entity: An object or concept relevant to the system (e.g., Customer, Product)
- Attribute: A property of an entity (e.g., name, age, price)
- Relationship: How entities are connected (one-to-one, one-to-many, many-to-many)
- Primary Key: Unique identifier for a record
- Foreign Key: Attribute referencing another table’s primary key
Cardinality
In data modelling, cardinality describes how many instances of one entity can relate to instances of another:
- 1:1 — One-to-one
- 1:N — One-to-many
- N:1 — Many-to-one
- N:M — Many-to-many
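Of these, N:M is the one that cannot be expressed with a single foreign key; it is implemented with a junction table holding a foreign key to each side. A minimal sketch, using an assumed student/course example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: each row is one (student, course) pairing,
    -- so a student can take many courses and a course can have
    -- many students.
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course  VALUES (10, 'SQL'), (11, 'Modelling');
    INSERT INTO enrollment VALUES (1, 10), (1, 11), (2, 10);
""")

rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM enrollment e
    JOIN student s USING (student_id)
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)  # [('Ada', 2), ('Grace', 1)]
```

The composite primary key on the junction table also prevents duplicate enrollments for the same pair.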
Normalization vs Denormalization
| Aspect | Normalization | Denormalization |
|---|---|---|
| Goal | Eliminate redundancy, improve integrity | Improve read performance |
| Tables | Many smaller tables | Fewer wider tables |
| Joins | More joins required | Fewer joins |
| Best for | WRITE operations (OLTP) | READ operations (OLAP) |
| Data redundancy | Minimal | Intentional duplication |
When to Normalize
- Transactional systems where data integrity is critical
- Frequent INSERT, UPDATE, DELETE operations
- Storage optimization is important
When to Denormalize
- Analytical systems where query speed is critical
- Read-heavy workloads
- Reporting and dashboards
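As a concrete illustration of the trade-off, a denormalized reporting table duplicates descriptive attributes on every row so that reads need no joins at all. The table name and columns below are assumptions for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: customer_name is intentionally repeated on each
    -- order row, trading redundancy for join-free reads.
    CREATE TABLE sales_report (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        amount        REAL
    );
    INSERT INTO sales_report VALUES (1, 'Ada', 50.0), (2, 'Ada', 30.0);
""")

# Dashboard-style query: a single-table scan, no joins.
top = conn.execute("""
    SELECT customer_name, SUM(amount)
    FROM sales_report
    GROUP BY customer_name
""").fetchall()
print(top)  # [('Ada', 80.0)]
```

The cost is the write-side anomaly from the normalization section: renaming a customer now means updating every one of their rows.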
Star Schema vs Snowflake Schema
Star Schema ⭐
A central fact table directly connected to denormalized dimension tables:
[Date Dim]
|
[Product Dim]—[Sales Fact]—[Customer Dim]
|
[Store Dim]
Pros: Fewer joins, faster queries, simple to understand
Cons: Data redundancy in dimensions
Snowflake Schema ❄️
Dimension tables are normalized into sub-dimensions:
[Year]—[Month]—[Date Dim]
|
[Brand]—[Category]—[Product Dim]—[Sales Fact]—[Customer Dim]—[City]—[Country]
Pros: Less redundancy, storage efficient
Cons: More joins, complex queries, slower performance
For analytics, Star Schema is preferred due to simpler queries and faster performance.
Granularity
The level of detail in your data model:
| Granularity | Description | Example |
|---|---|---|
| High (Fine) | Detailed, individual records | Each transaction, each click |
| Low (Coarse) | Aggregated, summarized | Daily totals, category summaries |
Choose granularity based on:
- Analysis requirements
- Storage constraints
- Query performance needs
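Coarse-grained views can always be derived from fine-grained data, but not the reverse, which is why storing the finest granularity you can afford keeps your options open. A minimal sketch (illustrative transaction table) rolling fine-grained rows up to daily totals:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fine granularity: one row per individual transaction.
    CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, day TEXT, amount REAL);
    INSERT INTO txn VALUES
        (1, '2024-01-01', 10.0),
        (2, '2024-01-01', 15.0),
        (3, '2024-01-02', 20.0);
""")

# Coarse granularity: daily totals derived from the detailed rows.
daily = conn.execute(
    "SELECT day, SUM(amount) FROM txn GROUP BY day ORDER BY day"
).fetchall()
print(daily)  # [('2024-01-01', 25.0), ('2024-01-02', 20.0)]
```

Had only the daily totals been stored, per-transaction questions (largest single purchase, time-of-day patterns) would be unanswerable.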
Key Takeaways
- ER/3NF modeling → Transactional systems (OLTP), normalized schemas
- Dimensional modeling → Analytical systems (OLAP), denormalized schemas
- Conceptual → Logical → Physical: abstract to concrete
- Star schema → Preferred for analytics (simple, fast)
- Snowflake schema → When storage efficiency matters more than query speed
- Normalize for writes, denormalize for reads